Slurm Job Options

For any job on ARC, you must define which resources you want to request for that job. This is done through a set of configuration options passed to Slurm.

Commonly used Slurm Job Options

Slurm manuals provide exhaustive information, but here are the most commonly used options with brief explanations:

| Short | Long | Function | Optional or Required | Notes |
|---|---|---|---|---|
| -A <name> | --account=<name> | Name of Slurm billing account | Required | This is not your PID. Your account name can be found in Coldfront. |
| -N <#> | --nodes=<#> | Number of nodes | Optional (but recommended) | Before extending to multiple nodes, make sure your code can run across multiple nodes. |
| -p <name> | --partition=<name> | Select the partition to use | Optional | If one is not chosen, the default will be used. See the available partitions for each cluster. |
| -n <#> | --ntasks=<#> | Total number of tasks | Optional | Tasks will be spread across the allocated nodes. Default = 1. |
| n/a | --ntasks-per-node=<#> | Number of tasks per node | Optional | Provides finer control than -n. |
| -c <#> | --cpus-per-task=<#> | Number of cores to allocate to each task | Optional | Make sure your code supports multi-threading before using. |
| n/a | --mem=<#G> | Memory needed on each node allocated to the job | Optional (but recommended) | If not provided, the default will be used. Other units can be given, e.g. M (megabytes) or G (gigabytes): --mem=10G. |
| n/a | --mem-per-cpu=<#G> | Memory needed for each CPU core allocated to the job | Optional | Provides finer control over the memory each CPU core gets. |
| n/a | --qos=<name> | Defines which QoS is used | Optional | Run showqos to see available options. Can be used, for example, to increase priority (but double billing applies) or to increase the allowed walltime (at lower priority). |
| n/a | --mail-user=<email@vt.edu> & --mail-type=BEGIN,END,FAIL | Defines an email address to which Slurm will send notifications of the listed types (BEGIN, END, FAIL) | Optional | You can tailor which emails you receive by including or excluding types (ALL, TIME_LIMIT, and REQUEUE are additional options). |
| -o <name> | --output=<name> | Defines an output file with the name <name> | Optional | If not defined, the default output file is slurm-<jobid>.out. |
| -e <name> | --error=<name> | Defines an error file with the name <name> | Optional | If not defined, error messages will usually appear in the slurm-<jobid>.out file. |
| -t DD-HH:MM:SS | --time=DD-HH:MM:SS | Wall time for the job | Optional (but recommended) | Make sure you give your job enough time to finish. Use job inspection tools for finished jobs to help determine an appropriate wall time. If not defined, the default will be used. |
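Taken together, these options are usually placed at the top of a submission script as #SBATCH directives. The following is a minimal sketch of such a header; the account name (myaccount), partition, resource amounts, email address, and program name are placeholders and should be replaced with values appropriate to your allocation and workload.

#!/bin/bash
#SBATCH --account=myaccount          # placeholder: your Slurm billing account from Coldfront
#SBATCH --partition=normal_q         # partition to run in
#SBATCH --nodes=1                    # number of nodes
#SBATCH --ntasks-per-node=1          # tasks per node
#SBATCH --cpus-per-task=4            # cores per task (only if your code is multi-threaded)
#SBATCH --mem=16G                    # memory per node
#SBATCH --time=0-02:00:00            # wall time of 2 hours
#SBATCH --mail-user=pid@vt.edu       # placeholder email address
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --output=myjob.out           # optional: name of the output file

# commands to run go below the directives, e.g.
./my_program                         # placeholder for your executable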

GPU-specific Jobs

For GPU-specific partitions (a100_normal_q, h200_normal_q, l40s_normal_q, a30_normal_q, v100_normal_q, t4_normal_q), one of the following options MUST be defined:

| Long | Function | Notes |
|---|---|---|
| --gres=gpu:<N> | Request N GPUs on each allocated node | Total number of GPUs = N × number of nodes |
| --gpus=<N> | Request N GPUs in total | |
| --gpus-per-node=<N> | Request N GPUs on each allocated node | Same as --gres=gpu:<N> |
| --gpus-per-task=<N> | Request N GPUs for each task | Must also define either --ntasks or --gpus |
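For example, a single-node job on a GPU partition might request its GPUs like this (a minimal sketch; the partition name and GPU count are illustrative):

#SBATCH --partition=a100_normal_q    # one of the GPU partitions listed above
#SBATCH --nodes=1
#SBATCH --gres=gpu:2                 # 2 GPUs on the allocated node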

Default settings differ based on which cluster/partition you are running on. You can check the default settings of some of these variables for each partition with the command scontrol show partition <partition name>.
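For example, to see the defaults for the normal_q partition:

scontrol show partition normal_q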

Slurm constraints

Constraints allow users to make very specific requests to the scheduler, such as requesting a specific CPU vendor or architecture feature (e.g., AVX512).

To request a constraint, you must add the following line to your submit script:

#SBATCH --constraint=<feature_name>

Constraints are not needed by default and are intended only for advanced users who want to restrict which nodes are used when multiple architecture types belong to the same partition.

| Cluster | Partitions | Feature | Description |
|---|---|---|---|
| TinkerCliffs | normal_q, preemptable_q | amd | Select only nodes with AMD CPUs |
| TinkerCliffs | normal_q, preemptable_q | intel | Select only nodes with Intel CPUs |
| TinkerCliffs | normal_q, preemptable_q | avx512 | Select only nodes with AVX512 (i.e., the Intel nodes) |
| TinkerCliffs | a100_normal_q, a100_preemptable_q | hpe-A100 | Select only HPE nodes with A100 GPUs |
| TinkerCliffs | a100_normal_q, a100_preemptable_q | dgx-A100 | Select only DGX nodes with A100 GPUs |
| Owl | normal_q, preemptable_q | avx512 | Select only nodes with AVX512 (i.e., the Genoa nodes) |

An example constraining a job to just the DGX A100 nodes on our TinkerCliffs cluster would look like the following:

#SBATCH --constraint=dgx-A100
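In a full job script, the constraint is combined with the matching partition and a GPU request; a minimal sketch (the account name myaccount is a placeholder):

#SBATCH --account=myaccount          # placeholder: your Slurm billing account
#SBATCH --partition=a100_normal_q
#SBATCH --gres=gpu:1
#SBATCH --constraint=dgx-A100        # restrict to the DGX A100 nodes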