added documentation for HPC training

HPCProcedures.md (new file)

@@ -0,0 +1,89 @@
The general information on how to request an account and how to connect can be found at [registration](https://selfservice.mpcdf.mpg.de/index.php?r=registration) and in the [documentation](https://docs.mpcdf.mpg.de/).

## Use the RAVEN facility for machine learning with PyTorch

Once an account has been acquired, one can start to set up one's own framework.

### Module loading

The HPC comes with preinstalled modules that have to be loaded by the user.
A list of useful modules to load is the following:

- intel/2025.2
- mkl/2025.2
- openmpi/5.0
- python-waterboa/2025.06
- cuda/12.8

All of these can be loaded with the command `module load <module_name>`; several modules can also be loaded in a single call by listing their names one after the other. N.B. order matters when loading the modules!

### Environment creation

Once the basic resources are available, one can focus on creating one's own Python environment. This can be done, for example, with venv.
An environment is created by calling:

`python -m venv <path/to/environment/folder>`

I personally like to keep the environments in the home folder inside the `.venv` directory; the name of the environment itself is then the name of the subdirectory in this folder.
For this project I created an environment called `ptl` in `$HOME/.venv/ptl`; to activate this environment one must source it:

`source .venv/ptl/bin/activate`

This way we have a simple way to keep all of the needed Python libraries in a known place, without having to worry about dependencies.

The environment I am currently using to launch the training jobs has the following libraries installed; they can simply be installed using `python -m pip install <library_name>`:

- lightning
- torch
- tensorboard
- numpy
- pandas
- matplotlib
- seaborn

The first three libraries are all that is needed for the training procedures; the others are useful for result visualization, which could also be done offline.

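Putting the steps above together, a minimal session to create the environment and install these libraries could look like the following (the `$HOME/.venv/ptl` path simply follows the convention described above):

```shell
# create the environment under the conventional location
python -m venv "$HOME/.venv/ptl"

# activate it: python and pip now point inside the environment
source "$HOME/.venv/ptl/bin/activate"

# install the training stack and the visualization extras
python -m pip install lightning torch tensorboard
python -m pip install numpy pandas matplotlib seaborn
```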
### Slurm

Once the environment is set up, one can finally launch a first job. MPCDF's HPC facility uses the Slurm scheduler to assign computing time to the different jobs. The documentation on how to use Slurm on RAVEN can be found at the following link, [SLURM documentation](https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html#slurm-batch-system), with many [examples](https://docs.mpcdf.mpg.de/doc/computing/raven-user-guide.html#slurm-example-batch-scripts).

This is an example script to use for the neural network training:

```bash
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J train_VAEQXT
#
#SBATCH --ntasks=1
#SBATCH --constraint="gpu"
#
# --- default case: use a single GPU on a shared node ---
#SBATCH --gres=gpu:a100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16000
#SBATCH --time=03:00:00
#SBATCH --mail-type=none
#SBATCH --mail-user=your.mail@address

module purge
module load intel/2025.2 mkl/2025.2 openmpi/5.0 python-waterboa/2025.06 cuda/12.8

source /u/<user_name>/<venv>/<folder>/bin/activate

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun python /u/lucor/src/W7XNN/train.py > train_output.txt
```

What this script does is give all of the essential information to Slurm: where to write the error and output logs of the job, setting the initial working directory to the one from which the script is launched, and naming the job so that it can be easily identified.
After this first set of instructions the computing-resource request starts. In this example just one task is requested, with the use of GPU resources; the more precise instructions follow: an A100 graphics card is requested, along with 8 CPUs and 16 GB of memory. The time for which the resources are allocated must also be specified; in this instance I requested 3 hours of computing time.
One can also set a mail alert to follow the status of the job.

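Inside the training script itself, the CPU allocation exported above can be read back, for instance to size the dataloader workers (a sketch; the `n_workers` name and the fallback value are assumptions, not taken from `train.py`):

```python
import os

# SLURM exports the per-task CPU count; fall back to 1 when run outside a job
n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
print(f"using {n_workers} dataloader workers")
```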
Once one is satisfied with the script, it can be submitted by calling:

`sbatch <script_name>`

A job ID will then be given and the job will be put in the queue.
The status of the job can be followed via the `squeue` command.
To follow all of the jobs submitted by a user one can use:

`squeue -u $USER`

@@ -74,6 +74,11 @@ If this is the case, before running the training procedure, one should run the f

in order to exclude said GPUs from being used.

#### Training in HPC

It is also possible to train the model on the Raven HPC, even if, for a single training without the need for optimization, it is not necessary to use such a powerful machine.

The steps to deploy and train the model on the Raven HPC, which has NVIDIA GPUs available, are thoroughly described in the file HPCProcedures.md.

### Jupyter Notebooks

Notebooks are a useful tool for exploring the code and seeing hands-on examples of how the various steps work together. However, they can be messy inside a git repository: to avoid embedding a great amount of useless data and plots in the version control, the use of the nbstripout package is strongly recommended. A possible setup for this package can be found [here](https://pypi.org/project/nbstripout/) under the 'Using as a Git filter' section.
