General information on how to request an account and how to connect can be found under registration and documentation .
Use the RAVEN facility for machine learning with PyTorch
Once an account has been acquired, one can start setting up one's own framework.
Module loading
The HPC system comes with preinstalled modules that have to be loaded by the user. A list of useful modules to load is the following:
- intel/2025.2
- mkl/2025.2
- openmpi/5.0
- python-waterboa/2025.06
- cuda/12.8
All of these can be loaded with the command module load <module_name>; several modules can also be loaded at once by listing their names one after the other, as shown below. N.B. Order matters when loading the modules!
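For example, the modules listed above can be loaded in a single call, keeping the order in which they are listed:
module load intel/2025.2 mkl/2025.2 openmpi/5.0 python-waterboa/2025.06 cuda/12.8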
Environment creation
Once the basic resources are available, one can focus on creating one's own Python environment. This can be done, for example, with venv. An environment is created by calling:
python -m venv <path/to/environment/folder>
I personally like to keep the environments in the home folder, inside the .venv directory; the name of the environment itself is then the name of the subdirectory in this folder.
For this project I created an environment called ptl in $HOME/.venv/ptl; to activate this environment one must source it:
source $HOME/.venv/ptl/bin/activate
This gives a simple way to keep all of the needed Python libraries in a known place, without having to worry about dependencies.
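As a quick sanity check, assuming the ptl environment from above, one can verify which interpreter is active after sourcing:
# with the environment active, python should resolve to the interpreter inside the venv
which python
# expected to point to $HOME/.venv/ptl/bin/python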
The environment I am currently using to launch the training jobs has the following libraries installed; they can simply be installed with python -m pip install <library_name> (see the example after this list):
- lightning
- torch
- tensorboard
- numpy
- pandas
- matplotlib
- seaborn
The first three libraries are all that is needed for the training procedure; the others are useful for result visualization, which could also be done offline.
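All of the libraries above can be installed in one go once the environment is activated (a minimal sketch; versions are not pinned here):
python -m pip install lightning torch tensorboard numpy pandas matplotlib seaborn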
Slurm
Once the environment is set up, one can finally launch a first job. MPCDF's HPC facility uses the Slurm scheduler to assign computing time to the different jobs. The documentation on how to use Slurm on RAVEN, with many examples, can be found at the following link: SLURM documentation
This is an example script to use for the neural network training:
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J train_VAEQXT
#
#SBATCH --ntasks=1
#SBATCH --constraint="gpu"
#
# --- default case: use a single GPU on a shared node ---
#SBATCH --gres=gpu:a100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16000
#SBATCH --time=03:00:00
#SBATCH --mail-type=none
#SBATCH --mail-user=your.mail@address
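#
# Load the required modules and activate the Python environment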
module purge
module load intel/2025.2 mkl/2025.2 openmpi/5.0 python-waterboa/2025.06 cuda/12.8
source /u/<user_name>/<venv>/<folder>/bin/activate
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
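# Launch the training; the standard output of train.py is redirected to train_output.txt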
srun python /u/lucor/src/W7XNN/train.py > train_output.txt
What this script does is give all of the essential information to Slurm: where to write the error and output logs of the job, the initial working directory (set to the one from which the script is launched), and the job name, so that it can be easily identified. After this first set of instructions comes the resource request: in this example a single task is requested on a GPU node, and the following lines give the details, namely one A100 graphics card, 8 CPUs and 16 GB of memory. The time for which the resources are allocated must also be specified; in this instance I requested 3 hours of computing time. One can also set a mail alert to follow the status of the job.
Once one is satisfied with the script, it can be submitted by calling:
sbatch <script_name>
A job ID will then be returned and the job will be put in the queue.
The status of the job can be followed via the squeue command.
To list all of the jobs one has submitted, one can use:
squeue -u $USER
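While the job is running, the training output redirected by the script above can also be followed directly (assuming one is in the directory from which the job was submitted):
tail -f train_output.txt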