Running Plato¶
Running Plato on Google Colaboratory¶
Go to Google Colaboratory. Under the directory plato/examples/colab/, the notebook colab_use_terminal.ipynb provides step-by-step instructions on running Plato on Google Colaboratory, and it also sets up a secure shell for logging in and for opening Visual Studio Code. To run Plato, simply use the integrated terminal in the browser.
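Once the integrated terminal is open, starting a training session follows the same pattern as on any other machine. The commands below are a minimal sketch, not part of the notebook itself; the configuration file is just one example bundled with the repository, and the paths assume a fresh clone in the Colab working directory.
# Clone Plato and install it as a pip package inside the Colab runtime
git clone https://github.com/TL-System/plato
cd plato
pip install .
# Start a federated learning session with one of the example configuration files
./run -c configs/MNIST/fedavg_lenet5.yml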
Running Plato on Digital Research Alliance of Canada¶
Installation¶
SSH into a cluster on Digital Research Alliance of Canada. Here we use Graham as an example; Cedar and Narval are also available. Then clone the Plato repository to your own directory:
ssh <CCDB username>@graham.computecanada.ca
cd projects/def-baochun/<CCDB username>
git clone https://github.com/TL-System/plato
Your CCDB username can be found after signing into the CCDB portal. Contact Baochun Li (bli@ece.toronto.edu) for a new account on Digital Research Alliance of Canada.
Preparing the Python Runtime Environment¶
First, load version 3.9 of the Python programming language:
module load gcc/9.3.0 arrow cuda/11 python/3.9 scipy-stack
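To confirm that the expected interpreter is active after loading the modules (a quick sanity check, not part of the official instructions):
python --version    # should report Python 3.9.x
which python        # should point to the module-provided interpreter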
Then create the directory that contains your own Python virtual environment using virtualenv:
virtualenv --no-download ~/.federated
where ~/.federated is assumed to be the directory containing the new virtual environment just created.
You can now activate your environment:
source ~/.federated/bin/activate
Due to versioning differences in some Python packages on Digital Research Alliance of Canada, some source files may need to be patched. Use the following command to patch these files:
bash ./docs/patches/patch_cc.sh
The next step is to install the required Python packages. PyTorch should be installed following the advice of its getting started website. As of January 2022, Digital Research Alliance of Canada provides GPUs with CUDA version 11.2, so the command would be:
pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
If this command fails with a MemoryError, use the alternative command:
pip3 install --no-cache-dir torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
Finally, install Plato as a pip package:
pip install .
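To verify that the installation succeeded (a quick check; the exact distribution name registered by pip install . depends on the project's setup metadata, so a case-insensitive grep is used rather than an exact name):
pip list | grep -i plato    # the installed Plato package should be listed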
Tip: use an alias to save typing the next time you run Plato.
vim ~/.bashrc
Then add the following line:
alias plato='cd ~/projects/def-baochun/<CCDB username>/plato/; module load gcc/9.3.0 arrow cuda/11 python/3.9 scipy-stack; source ~/.federated/bin/activate'
After saving this change and exiting vim:
source ~/.bashrc
Next time, after you SSH into this cluster, just type plato. :)
Running Plato¶
To start a federated learning training workload with Plato, create a job script:
vi <job script file name>.sh
For example:
cd ~/projects/def-baochun/<CCDB username>/plato
vi cifar_wideresnet.sh
Then add your configuration parameters in the job script. The following is an example:
#!/bin/bash
#SBATCH --time=3:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=128G
#SBATCH --account=def-baochun
#SBATCH --output=cifar_wideresnet.out
module load gcc/9.3.0 arrow cuda/11 python/3.9 scipy-stack
source ~/.federated/bin/activate
./run -c configs/CIFAR10/fedavg_wideresnet.yml
Note: On cedar and graham, one should remove the --nodes=1 option.
Submit the job:
sbatch <job script file name>.sh
For example:
sbatch cifar_wideresnet.sh
To check the status of a submitted job, use the sq command. Refer to the official Digital Research Alliance of Canada documentation for more details.
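For example (sq is the Alliance's wrapper around squeue; scancel is the standard Slurm command for cancelling a job, shown here with a placeholder job ID):
sq                    # list your own pending and running jobs
scancel <job id>      # cancel a submitted job if needed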
To monitor the output as it is generated live, use the command:
watch -n 1 tail -n 50 ./cifar_wideresnet.out
where ./cifar_wideresnet.out is the output file to be monitored, the -n parameter for watch specifies the monitoring frequency in seconds (the default is 2 seconds), and the -n parameter for tail specifies the number of lines at the end of the file to be shown. Type Control + C to exit the watch session.
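If you prefer to follow the output continuously instead of refreshing it periodically, tail -f achieves the same goal:
tail -f ./cifar_wideresnet.out    # follow new lines as they are appended; exit with Control + C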
Tip: Make sure you use different port numbers under server in different jobs' configuration files before submitting your jobs if you plan to run them at the same time. This is because they may be allocated to the same node, which is especially common on the Narval cluster. In that case, if the port and address under server are the same in the configuration files of the jobs, you will get OSError: [Errno 48] error while attempting to bind on address: address already in use.
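Before submitting two such jobs, a quick way to confirm that their configuration files listen on different ports is to grep their server sections (a minimal sketch; the file names are examples, and -A 3 simply prints the few lines that typically contain address and port):
grep -A 3 "^server:" configs/CIFAR10/fedavg_wideresnet.yml
grep -A 3 "^server:" <your other configuration file>.yml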
If there is a need to start an interactive session (for debugging purposes, for example), it is also supported by Digital Research Alliance of Canada using the salloc command:
salloc --time=2:00:00 --gres=gpu:1 --mem=64G --account=def-baochun
The job will then be queued and waiting for resources:
salloc: Pending job allocation 53923456
salloc: job 53923456 queued and waiting for resources
As soon as your job is allocated resources, you will see the following output:
salloc: job 53923456 has been allocated resources
salloc: Granted job allocation 53923456
Then you can run Plato:
./run -c configs/CIFAR10/fedavg_wideresnet.yml
After the job is done, use exit at the command line to relinquish the job allocation.
Note: On Digital Research Alliance of Canada, if runtime exceptions occur that prevent a federated learning session from running to completion, the most likely causes are:
1. Out of CUDA memory.
   Potential solution: decrease the max_concurrency value in the trainer section of your configuration file.
2. Running processes have not been terminated from previous runs.
   Potential solution: use the command pkill python to terminate them so that there will not be CUDA errors in the upcoming run (see the sketch after this list).
3. The time that a client waits for the server to respond before disconnecting is too short. This could happen when training large neural network models. If you get an AssertionError saying that there are not enough launched clients for the server to select, this could be the reason, but make sure you first check whether it is due to an out-of-CUDA-memory error.
   Potential solution: add ping_timeout in the server section of your configuration file. The default value for ping_timeout is 360 (seconds).
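For the second cause above, a minimal cleanup sketch (run it on the node where the previous job executed; nvidia-smi is only used here to confirm that GPU memory has been released):
pkill python      # terminate leftover Python processes from previous runs
nvidia-smi        # verify that no stale processes are still holding GPU memory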
Running HuggingFace jobs¶
Running a HuggingFace job requires connecting to the Internet to download the dataset and the model. However, Digital Research Alliance of Canada does not allow Internet connections inside sbatch/salloc, so the dataset and the model need to be pre-downloaded via the following steps:
1. Run the command outside sbatch/salloc first, for example ./run -c <your configuration file>, and use Control + C to terminate the program right after the first client starts training. After this step, the dataset and the model should have been downloaded automatically (a quick check to confirm this is shown after the sample job script below).
2. Switch to running it inside sbatch/salloc, and add TRANSFORMERS_OFFLINE=1 before the command. The following is a sample job script:
#!/bin/bash
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:7
#SBATCH --mem=498G
#SBATCH --account=def-baochun
#SBATCH --output=<output file>
module load gcc/9.3.0 arrow cuda/11 python/3.9 scipy-stack
source ~/.federated/bin/activate
TRANSFORMERS_OFFLINE=1 ./run -c <your configuration file>
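To confirm that the pre-download in the first step succeeded before submitting the job, you can inspect the HuggingFace cache. By default it lives under ~/.cache/huggingface, unless an environment variable such as HF_HOME points elsewhere:
ls ~/.cache/huggingface    # the downloaded datasets and models should appear here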
Removing the Python virtual environment¶
To remove the environment after experiments are completed, just delete the directory:
rm -rf ~/.federated