You could also use `/scratch/<CCDB username>` to store temporary files, using the following command:

```shell
cd /scratch/<CCDB username>
```
Then clone the Plato repository to your own directory:
```shell
git clone https://github.com/TL-System/plato
```
Your CCDB username can be found after signing into the CCDB portal. Contact Baochun Li (bli@ece.toronto.edu) for a new account on Digital Research Alliance of Canada.
Preparing the Python Runtime Environment
First, discover the versions of Python available:

```shell
module avail python
```

Then load version 3.12 of the Python programming language:

```shell
module load python/3.12
```
You can then create your own Python virtual environment (for example, one called .federated):
```shell
virtualenv --no-download ~/.federated  # creating your own virtual environment
source ~/.federated/bin/activate
```
To monitor the output as it is generated live, use the command:
```shell
watch -n 1 tail -n 50 ./cifar_wideresnet.out
```
where `./cifar_wideresnet.out` is the output file to be monitored. The `-n` parameter for `watch` specifies the monitoring frequency in seconds (the default is 2 seconds), and the `-n` parameter for `tail` specifies the number of lines at the end of the file to be shown. Type Control + C to exit the `watch` session.
Tip
Make sure you use different port numbers under `server` in different jobs' configuration files if you plan to run the jobs at the same time. The jobs may be allocated to the same node, which is especially common on the Narval cluster; if two jobs on the same node use the same `address` and `port` under `server`, one of them will fail with `OSError: [Errno 48] error while attempting to bind on address: address already in use`.
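For example, the configuration files of two jobs that may land on the same node could differ in their `server` sections as follows (this is a sketch assuming a YAML configuration format; the addresses and ports are illustrative):

```yaml
# first job's configuration file
server:
    address: 127.0.0.1
    port: 8000

# second job's configuration file
server:
    address: 127.0.0.1
    port: 8001
```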
If you need to start an interactive session (for debugging purposes, for example), Digital Research Alliance of Canada also supports it via the salloc command:
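For example, to request a single GPU interactively for one hour (the resource values and account name are placeholders to adjust for your allocation):

```shell
salloc --time=1:00:00 --gres=gpu:1 --mem=32G --account=<your account>
```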
After the job is done, use exit at the command line to relinquish the job allocation.
Note
On Digital Research Alliance of Canada clusters, if issues in the code prevent it from running to completion, the potential causes include:
Out of CUDA memory.
Potential solution: Decrease the max_concurrency value in the trainer section of your configuration file.
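As a sketch, assuming a YAML configuration file, the setting lives under the `trainer` section (the value shown is illustrative):

```yaml
trainer:
    # Maximum number of clients training concurrently;
    # lower this value if training runs out of CUDA memory
    max_concurrency: 2
```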
Running processes have not been terminated from previous runs.
Potential solution: Use the command pkill python to terminate them so that there will be no CUDA errors in the upcoming run.
The time that a client waits for the server to respond before disconnecting is too short.
This can happen when training large neural network models. If you get an AssertionError saying that there are not enough launched clients for the server to select, this may be the reason, but first make sure it is not due to running out of CUDA memory.
Potential solution: Add ping_timeout with a larger value in the server section of your configuration file. The default value of ping_timeout is 360 (seconds).
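A sketch of the corresponding configuration, with an illustrative value twice the default:

```yaml
server:
    address: 127.0.0.1
    port: 8000
    # Seconds a client waits for the server to respond before disconnecting
    ping_timeout: 720
```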
Running HuggingFace jobs
Running a HuggingFace job requires an Internet connection to download the dataset and the model. However, Digital Research Alliance of Canada does not allow Internet connections inside sbatch/salloc jobs, so the dataset and the model need to be pre-downloaded via the following steps:
First, run the command outside sbatch/salloc, for example uv run --active plato.py -c <your configuration file>, and use Control + C to terminate the program right after the first client starts training. After this step, the dataset and the model should have been downloaded automatically.
Then switch to running it inside sbatch/salloc, and add TRANSFORMERS_OFFLINE=1 before the command. Below is a sample job script:
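A minimal sketch of such a job script, reusing the module and virtual environment from earlier in this guide (the resource requests, account name, and output file name are placeholders to adjust):

```shell
#!/bin/bash
#SBATCH --time=8:00:00            # maximum runtime
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --mem=64G                 # memory per node
#SBATCH --account=<your account>
#SBATCH --output=huggingface_job.out

module load python/3.12
source ~/.federated/bin/activate

# Use the pre-downloaded dataset and model instead of the Internet
TRANSFORMERS_OFFLINE=1 uv run --active plato.py -c <your configuration file>
```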