Slurm Job Options

For any job on ARC, you must define which resources you want to request for that job. This is done through a set of configuration options passed to Slurm.

Commonly used Slurm Job Options

Slurm manuals provide exhaustive information, but here are the most commonly used options with brief explanations:

| Short | Long | Function | Optional or Required | Notes |
|---|---|---|---|---|
| -A <name> | --account=<name> | Name of Slurm billing account | Required | This is not your PID. Your account name can be found in Coldfront. |
| -N <#> | --nodes=<#> | Number of nodes | Optional (but recommended) | Before extending to multiple nodes, make sure your code can run across multiple nodes. |
| -p <name> | --partition=<name> | Select the partition to use | Optional | If one is not chosen, the default will be used. See the available partitions for each cluster. |
| -n <#> | --ntasks=<#> | Total number of tasks | Optional | Tasks will be spread across the allocated nodes. Default = 1. |
| n/a | --ntasks-per-node=<#> | Number of tasks per node | Optional | Provides finer control than -n. |
| -c <#> | --cpus-per-task=<#> | Number of cores to allocate to each task | Optional | Make sure your code supports multi-threading before using. |
| n/a | --mem=<#G> | Memory needed on each node allocated to the job | Optional (but recommended) | If not provided, the default will be used. Other units can be given, e.g. M (megabytes) or G (gigabytes): --mem=10G. |
| n/a | --mem-per-cpu=<#G> | Memory needed for each CPU core allocated to the job | Optional | Provides finer control over the memory each CPU core gets. |
| n/a | --qos=<name> | Defines which QoS is used | Optional | Run showqos to see available options. Can be used, for example, to increase priority (but double billing applies) or to increase the allowed walltime (at lower priority). |
| n/a | --mail-user=<email@vt.edu> & --mail-type=BEGIN,END,FAIL | Defines an email address to which Slurm will send notifications of the listed types (BEGIN, END, FAIL) | Optional | You can tailor which emails you receive by including or excluding types (ALL, TIME_LIMIT, and REQUEUE are additional options). |
| -o <name> | --output=<name> | Defines an output file with the name <name> | Optional | If not defined, the default output file is slurm-<jobid>.out. |
| -e <name> | --error=<name> | Defines an error file with the name <name> | Optional | If not defined, error messages will usually appear in the slurm-<jobid>.out file. |
| -t DD-HH:MM:SS | --time=DD-HH:MM:SS | Wall time for the job | Optional (but recommended) | Make sure you give your job enough time to finish. Use job inspection tools for finished jobs to help determine an appropriate wall time. If not defined, the default will be used. |
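Taken together, these options are usually placed at the top of a submission script as #SBATCH directives. The following is a minimal sketch of such a header; the account name (myaccount), partition, resource amounts, email address, and program name are placeholders and should be replaced with values appropriate to your allocation and workload.

#!/bin/bash
#SBATCH --account=myaccount          # placeholder: your Slurm billing account from Coldfront
#SBATCH --partition=normal_q         # partition to run in
#SBATCH --nodes=1                    # number of nodes
#SBATCH --ntasks-per-node=1          # tasks per node
#SBATCH --cpus-per-task=4            # cores per task (only if your code is multi-threaded)
#SBATCH --mem=16G                    # memory per node
#SBATCH --time=0-02:00:00            # wall time of 2 hours
#SBATCH --mail-user=pid@vt.edu       # placeholder email address
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --output=myjob.out           # optional: name of the output file

# commands to run go below the directives, e.g.
./my_program                         # placeholder for your executable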

GPU-specific Jobs

For GPU-specific partitions (a100_normal_q, h200_normal_q, l40s_normal_q, a30_normal_q, v100_normal_q, t4_normal_q), one of the following options MUST be defined:

| Long | Function | Notes |
|---|---|---|
| --gres=gpu:<N> | Request N GPUs on each allocated node | Total number of GPUs = N × number of nodes |
| --gpus=<N> | Request N GPUs in total | |
| --gpus-per-node=<N> | Request N GPUs on each allocated node | Same as --gres=gpu:<N> |
| --gpus-per-task=<N> | Request N GPUs for each task | Must also define either --ntasks or --gpus |
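For example, a single-node job on a GPU partition might request its GPUs like this (a minimal sketch; the partition name and GPU count are illustrative):

#SBATCH --partition=a100_normal_q    # one of the GPU partitions listed above
#SBATCH --nodes=1
#SBATCH --gres=gpu:2                 # 2 GPUs on the allocated node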

Default settings differ based on which cluster/partition you are running on. You can check the default settings of some of these variables for each partition with the command scontrol show partition <partition name>.
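For example, to see the defaults for the normal_q partition:

scontrol show partition normal_q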

Slurm constraints

Constraints allow users to make very specific requests to the scheduler, such as requesting a specific CPU vendor or architecture feature (e.g., AVX512).

To request a constraint, you must add the following line to your submit script:

#SBATCH --constraint=<feature_name>

Constraints are not needed by default and are intended only for advanced users who want to restrict which nodes are used when multiple architecture types belong to the same partition.

| Cluster | Partitions | Feature | Description |
|---|---|---|---|
| TinkerCliffs | normal_q, preemptable_q | amd | Select only nodes with AMD CPUs |
| TinkerCliffs | normal_q, preemptable_q | intel | Select only nodes with Intel CPUs |
| TinkerCliffs | normal_q, preemptable_q | avx512 | Select only nodes with AVX512 (i.e., the Intel nodes) |
| TinkerCliffs | a100_normal_q, a100_preemptable_q | hpe-A100 | Select only HPE nodes with A100 GPUs |
| TinkerCliffs | a100_normal_q, a100_preemptable_q | dgx-A100 | Select only DGX nodes with A100 GPUs |
| Owl | normal_q, preemptable_q | avx512 | Select only nodes with AVX512 (i.e., the Genoa nodes) |

An example constraining a job to just the DGX A100 nodes on our TinkerCliffs cluster would look like the following:

#SBATCH --constraint=dgx-A100
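In a full job script, the constraint is combined with the matching partition and a GPU request; a minimal sketch (the account name myaccount is a placeholder):

#SBATCH --account=myaccount          # placeholder: your Slurm billing account
#SBATCH --partition=a100_normal_q
#SBATCH --gres=gpu:1
#SBATCH --constraint=dgx-A100        # restrict to the DGX A100 nodes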