Slurm Environment

To maximize resource utilization, most testbed nodes have been pooled into a Slurm cluster. With this setup we aim to provide an environment similar to the Slurm setup on IBEX. However, since we have a smaller pool of users, we can afford a more custom configuration and more relaxed resource usage constraints.

IBEX portability

Most workloads can be moved from our cluster directly to IBEX and vice versa. You can also bring to our cluster most sbatch scripts you learn about from the IBEX learning resources (see below). Note, however, that IBEX’s modules are not supported on our setup.

Another important consideration to note is that our cluster does not share the same storage as IBEX. We have separate filesystems.

Learning to use Slurm

In case you’re unfamiliar with Slurm, we recommend that you watch the IBEX 101 training session. You can find the introduction to IBEX as well as other useful Slurm tutorials on the learning resources page from IBEX.

Interacting with Slurm

To use our Slurm setup, you need to be connected to the testbed. Make sure you have access to the testbed before proceeding.

Login to head node

Interaction with the Slurm cluster is done through the head node, mcmgt01. You will only be able to deploy jobs from this node.

SSH into the head node before running any of the commands below.

ssh mcmgt01

Deploying jobs

There are two types of jobs you can launch on Slurm: batch jobs and interactive jobs. Select the job type that is most suitable for your use case.

Batch jobs

Batch jobs are the most common job type you will use. With batch jobs, you specify the resources you require, point to the sbatch script you want to run, and push the job to the Slurm queue. Once enough resources are available, your job is executed. When the job terminates, its resources are automatically released for other jobs.

Batch jobs are configured as sbatch scripts and launched with the sbatch command. An sbatch script is just a bash script with #SBATCH directives in its header comments.

sbatch sbatch_script.sh
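
After submitting, sbatch prints the assigned job ID and you can follow the job from the queue. The job ID 1234 and the output filename below (which assumes the epic_job template further down) are illustrative:

# sbatch prints: Submitted batch job 1234
squeue --me                  # check whether the job is pending or running
tail -f epic_job-1234.out    # follow the output once the job starts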

Here is an example sbatch script you can use when running experiments. You can find all of the possible resource configuration options in the official documentation for sbatch.

Sbatch template - General

#!/bin/bash --login
#
#SBATCH --job-name=epic_job
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#
#SBATCH --time=10:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=50GB
#SBATCH --gres=gpu:p100:1

set -e # stop bash script on first error

mamba activate <your_env>

# Optionally add "$@" to pass additional arguments to the script
python experiment.py --option_1 value_1 -flag "$@"

A few notes on the above script:

  • Notice the --login flag passed to bash. You need it in order to use your conda/mamba environment.

  • Slurm allows filename patterns to contain replacement symbols like %x (the job name) and %j (the job ID).

  • If you don’t want separate output and error files, delete the --error line; both stdout and stderr will then go to the output file.

  • We’re specifying a time limit of 10 hours. Always try to specify a time limit for your jobs.

  • Because we didn’t specify, Slurm assumes you are requesting a single node.

  • The --gres flag specifies that we want a node with a free p100 GPU.

    • Other GPU options are v100 and a100.
  • The script will execute under the working directory where you call the sbatch command.

  • You can optionally pass extra arguments to the script as if you were running a bash script directly.

    sbatch sbatch_template.sh --extra_option extra_value ...
    # The "$@" bash variable in the sbatch_template.sh will be replaced 
    # with the extra parameters passed
    

Interactive jobs

Use interactive jobs when you want to interact with the resources you allocated. This is particularly useful when you are developing or debugging your code. There are multiple ways you can achieve this interactive setup.

Bash session

The simplest way to interact with a node is to use srun to deploy a job and ask for an interactive bash shell.

srun --gres=gpu:p100:1 --time=10:00:00 --pty bash -i
# Example output
user@mcnode01:~$

The deployed job will remain allocated until you close the connection started by srun. To prevent the connection from terminating, you might find it useful to launch the above command in a tmux session. The tmux session will keep the command running across accidental terminal window closings or SSH timeouts to mcmgt01 due to inactivity.
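
A minimal tmux workflow for this, run on mcmgt01 (the session name is arbitrary):

tmux new -s interactive
srun --gres=gpu:p100:1 --time=10:00:00 --pty bash -i
# Detach with Ctrl-b d; reattach later from mcmgt01 with:
tmux attach -t interactive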

Remote VSCode connection

You may find it useful to interact with a node through VSCode. This is possible with the Remote-SSH extension for VSCode.

Here are the steps you need to follow:

  1. Create a job to obtain a node (potentially inside a tmux session)
srun --gres=gpu:p100:1 --time=10:00:00 --pty bash -i
# Example output
user@mcnode01:~$
  2. Connect your VSCode directly to the allocated node (mcnode01 in the example)
    1. Click the bottom-left icon to open Remote-SSH
    2. Select Connect to Host
    3. Specify the hostname of the machine you want to connect to (e.g. mcnode01)

Note: You will only be able to connect to nodes where you have a running job.
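
Depending on how your local machine reaches the cluster, an entry like the following in your local ~/.ssh/config can make the nodes selectable in Remote-SSH. The user name is a placeholder and the ProxyJump through mcmgt01 is an assumption about your network path; adjust it to match your setup.

Host mcnode*
    # Placeholder user name and an assumed jump through the head node;
    # adjust to how your machine reaches the cluster.
    User <your_username>
    ProxyJump mcmgt01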

Sbatch template - Jupyter Lab

Another common use case is to launch a Jupyter Lab server.

#!/bin/bash --login
#SBATCH --job-name=jupyter
#SBATCH --output=%x-%j.out
#
#SBATCH --time=10:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:p100:1
#SBATCH --mem=50G

set -e # stop bash script on first error

# You can install jupyter in a mamba env with 'mamba install jupyter'
mamba activate <env_with_jupyter_installed>

# Ask the OS for a free port on the machine
JUPYTER_LAB_PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
jupyter lab --ip=0.0.0.0 --port=$JUPYTER_LAB_PORT --no-browser

Once your job has been deployed, you will be able to access the Jupyter Lab session through your local browser. You can find the URL for the Jupyter Lab at the bottom of the output log. It will look something like this:

To access the server, open this file in a browser:
  ...
  Or copy and paste one of these URLs:
    http://mcnode01:50365/lab?token=d80b8c90aff74407e96b57c7c2d516a0177e08ba5832a7a0
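
If the node hostname does not resolve from your local machine, one option (assuming your local SSH can reach mcmgt01) is to forward the port through the head node and replace the hostname with localhost:

ssh -L 50365:mcnode01:50365 mcmgt01
# Then open http://localhost:50365/lab?token=... in your local browser.
# The node name and port are taken from the example output above.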

Efficient Slurm usage

The goal of this Slurm setup is to maximize resource utilization, minimizing the time our resources sit idle. Here are some guidelines we ask our users to follow to help us achieve this.

Guidelines

  • Whenever possible, specify a time limit on your jobs. Especially for interactive jobs, where it is easy to forget to cancel the job.
  • You should never have more than two simultaneous interactive bash sessions.
  • If you just need a GPU to run tests, you probably don’t need an a100. Aim to select the weakest GPU that serves your use case: prefer p100, then v100, then a100.
  • Don’t ask for exclusive access to nodes if you don’t need to.

Automatic Job Termination

Guidelines are not enough to enforce good resource utilization. For this reason, we implemented mechanisms to terminate jobs whose resources are not used efficiently. These automatic terminations will mainly affect two types of allocations: interactive allocations where the user is no longer working, and incorrectly configured jobs (e.g. requesting more GPUs than the job actually uses).

Termination Policies

  • Poor GPU utilization (policy copied from IBEX):
    Jobs that allocate GPUs will be canceled if they show less than ~15% GPU utilization over a one-hour period.
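
If you want to check your own job’s utilization before this policy kicks in, one option (assuming nvidia-smi is available on the node; the node name is illustrative) is to inspect the GPUs directly on the node running your job:

ssh mcnode01 nvidia-smi    # replace mcnode01 with the node listed by squeue --me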

Useful utilities

ginfo

Inspired by IBEX, we ported their ginfo command to our Slurm cluster. With ginfo, you can see how many GPUs are in use and how many are idle. Here is a preview:

ginfo
GPU Model        Used    Idle   Drain    Down   Maint   Total
a100                9       7       0       0       0      16
p100                0       3       0       0       0       3
v100                1      11       4       8       0      24
       Totals:     10      21       4       8       0      43

GPUs associated with the Idle state are available for new jobs.

Useful Slurm commands

  • Only show my jobs
squeue --me
  • Cancel all of my jobs
scancel --me
  • Show information about my jobs started today
sacct -X -o "JobID,JobName%-20,AllocTRES%-60,NodeList%-10,Elapsed%-14,State%-20,ExitCode"

By default, sacct only shows jobs from the current day; adding --starttime now-2day will show jobs from the last 2 days.
sacct is very flexible, so check out its man page for more options.
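
For instance, a combined query over the last two days could look like this (the field list is just one possibility):

sacct -X --starttime now-2day -o "JobID,JobName%-20,State%-20,ExitCode"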

  • Get all job details
sacct -j <job_id> --json | less
  • Show current state of nodes
sinfo
  • Show characteristics of each node
sinfo -N -o "%14N  %10c  %10m  %45f  %10G"

Troubleshooting jobs

Sometimes you might be surprised to find that your job is not getting the right resources or that it unexpectedly terminated. Here are some actions you can take to troubleshoot these issues.

Allocation issues

Start by checking which resources were allocated to your job with the command:

sacct -X -j <job_id> -o "JobName%-20,AllocTRES%-60,NodeList%-15"

Notice that some resources are automatically set if you don’t specify them. This is the case for the number of nodes and the number of CPUs. A common issue is finding that these defaults are not suitable for your case and need to be set explicitly.

Another common issue is forgetting to specify the memory unit in the --mem option. The default unit is megabytes.
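
For example, a sketch of sbatch header lines that make both settings explicit (the values are illustrative):

#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=50G    # without a unit, --mem=50 would request 50 MB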

Termination issues

If you haven’t done so, start by checking the output and error logs of your job. You will often find the reason for termination in there. If you can’t find it in the logs, run the command below to find out the end state of your job:

sacct -X -j <job_id> -o "JobName%-20,State%-20,ExitCode"

Here are examples of the most common reasons for unexpected job termination:

JobID        JobName              State                ExitCode  
------------ -------------------- -------------------- --------  
180          adv_exp              OUT_OF_MEMORY           0:125  
182          launch-jupyter       TIMEOUT                   0:0  
185          interactive          CANCELLED BY 192866       0:0 

All job info

If the previous suggestions were not enough, consider going through the job details to understand what went wrong. The full list of the job details can be obtained with:

sacct -j <job_id> --json | less

Assistance

If the previous suggestions didn’t work, please reach out to us. We’ll be happy to help. When asking for assistance, please describe the issue and include the job ID.

Direct ssh into the nodes

Once a job is allocated to you on a node, you will be able to access that node via SSH. If you try to access a node that is not part of one of your Slurm jobs, SSH access will be denied.

When you connect to a node, you will only be allowed to use the resources associated with the job allocation that granted you SSH access. Resources used under this connection count toward the resource usage of that job. If you have multiple jobs on the same machine, resources are attributed to the most recently deployed job.
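
For example (the node name is illustrative):

squeue --me     # see which node(s) your jobs are running on
ssh mcnode01    # allowed only while you hold a running job on mcnode01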

Node pool

The nodes associated with the Slurm pool are registered on the Trello board under Slurm pool.
