General information on how to request an account and how to connect can be found under registration and documentation .
Use the RAVEN facility for machine learning with PyTorch
Once an account has been acquired, one can start setting up one's own framework.
Module loading
The HPC system comes with preinstalled modules that have to be loaded by the user. A list of useful modules to load is the following:
- intel/2025.2
- mkl/2025.2
- openmpi/5.0
- python-waterboa/2025.06
- cuda/12.8
All of these can be loaded with the command module load <module_name>; several modules can also be loaded at once by listing their names one after the other, as shown below. N.B. Order matters when loading the modules!
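For example, the modules listed above can be loaded in a single call, keeping the order in which they are listed:
module load intel/2025.2 mkl/2025.2 openmpi/5.0 python-waterboa/2025.06 cuda/12.8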
Environment creation
Once the basic resources are available, one can focus on creating one's own Python environment. This can be done, for example, with venv. An environment is created by calling:
python -m venv <path/to/environment/folder>
I personally like to keep the environments in the home folder, inside the .venv directory; the name of the environment itself is then the name of the subdirectory in this folder.
For this project I created an environment called ptl in $HOME/.venv/ptl; to activate this environment one must source it:
source $HOME/.venv/ptl/bin/activate
This gives a simple way to keep all of the needed Python libraries in a known place, without having to worry about dependencies.
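As a quick sanity check, assuming the ptl environment from above, one can verify which interpreter is active after sourcing:
# with the environment active, python should resolve to the interpreter inside the venv
which python
# expected to point to $HOME/.venv/ptl/bin/python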
The environment I am currently using to launch the training jobs has the following libraries installed; they can simply be installed with python -m pip install <library_name> (see the example after this list):
- lightning
- torch
- tensorboard
- numpy
- pandas
- matplotlib
- seaborn
The first three libraries are all that is needed for the training procedure; the others are useful for result visualization, which could also be done offline.
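All of the libraries above can be installed in one go once the environment is activated (a minimal sketch; versions are not pinned here):
python -m pip install lightning torch tensorboard numpy pandas matplotlib seaborn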
Slurm
Once the environment is set up, one can finally launch a first job. MPCDF's HPC facility uses the Slurm scheduler to assign computing time to the different jobs. The documentation on how to use Slurm on RAVEN, with many examples, can be found at the following link: SLURM documentation
This is an example script to use for the neural network training:
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job name
#SBATCH -J train_VAEQXT
#
#SBATCH --ntasks=1
#SBATCH --constraint="gpu"
#
# --- default case: use a single GPU on a shared node ---
#SBATCH --gres=gpu:a100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16000
#SBATCH --time=03:00:00
#SBATCH --mail-type=none
#SBATCH --mail-user=your.mail@address
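#
# Load the required modules and activate the Python environment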
module purge
module load intel/2025.2 mkl/2025.2 openmpi/5.0 python-waterboa/2025.06 cuda/12.8
source /u/<user_name>/<venv>/<folder>/bin/activate
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
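# Launch the training; the standard output of train.py is redirected to train_output.txt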
srun python /u/lucor/src/W7XNN/train.py > train_output.txt
What this script does is give all of the essential information to Slurm: where to write the error and output logs of the job, the initial working directory (set to the one from which the script is launched), and the job name, so that it can be easily identified. After this first set of instructions comes the resource request: in this example a single task is requested on a GPU node, and the following lines give the details, namely one A100 graphics card, 8 CPUs and 16 GB of memory. The time for which the resources are allocated must also be specified; in this instance I requested 3 hours of computing time. One can also set a mail alert to follow the status of the job.
Once one is satisfied with the script, it can be submitted by calling:
sbatch <script_name>
A job ID will then be returned and the job will be put in the queue.
The status of the job can be followed via the squeue command.
To list all of the jobs one has submitted, one can use:
squeue -u $USER
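While the job is running, the training output redirected by the script above can also be followed directly (assuming one is in the directory from which the job was submitted):
tail -f train_output.txt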