# Slurm Job Options
For any job on ARC, you must define the resources you want to request for that job. This is done through a set of configuration options.
## Commonly used Slurm Job Options
Slurm manuals provide exhaustive information, but here are the most commonly used options with brief explanations:
| Short | Long | Function | Optional or Required | Notes |
|---|---|---|---|---|
| `-A` | `--account` | Name of Slurm billing account | Required | This is not your PID. Account names can be found in ColdFront |
| `-N` | `--nodes` | Number of nodes | Optional (but recommended) | Before extending to multiple nodes, make sure your code can run on multiple nodes |
| `-p` | `--partition` | Select the partition to use | Optional | If one is not chosen, the default will be used. See the available partitions for each cluster |
| `-n` | `--ntasks` | Total number of tasks | Optional | This will spread across multiple nodes. Default = 1 |
| n/a | `--ntasks-per-node` | Number of tasks per node | Optional | Provides better control than `--ntasks` when running on multiple nodes |
| `-c` | `--cpus-per-task` | Number of cores to allocate to each task | Optional | Make sure your code supports multi-threading before using |
| n/a | `--mem` | Memory needed on each node allocated to the job | Optional (but recommended) | If not provided, the default will be used. Units such as M (for megabyte) or G (for gigabyte) can be specified, e.g. `--mem=4G` |
| n/a | `--mem-per-cpu` | Memory needed for each CPU core allocated to the job | Optional | Provides greater control over the memory each CPU core gets |
| n/a | `--qos` | Defines which QoS is used | Optional | Run `sacctmgr show qos` to list the available QoS options |
| n/a | `--mail-user` | Defines an email address to which Slurm will send notifications of `--mail-type=BEGIN,END,FAIL` | Optional | You can tailor which types of emails you receive by including or excluding values in `--mail-type` (ALL, TIME_LIMIT, and REQUEUE are additional options) |
| `-o` | `--output` | Defines an output file with the given name | Optional | If not defined, the default output file is `slurm-<jobid>.out` |
| `-e` | `--error` | Defines an error file with the given name | Optional | If not defined, error messages will usually show up in the output file |
| `-t` | `--time` | Wall time for the job | Optional (but recommended) | Make sure you give your job enough time to finish. Use job inspection tools on finished jobs to help determine an appropriate wall time. The default will be used if not defined |
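Putting several of these options together, a minimal submission script might look like the sketch below. The account name, partition, email address, and resource values are all placeholders; substitute the values that apply to your allocation and cluster.

```bash
#!/bin/bash
#SBATCH --account=myaccount          # placeholder: your billing account from ColdFront
#SBATCH --partition=normal_q         # placeholder: a partition available on your cluster
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4            # only useful if your code is multi-threaded
#SBATCH --mem=8G                     # 8 gigabytes of memory on the node
#SBATCH --time=01:00:00              # one hour of wall time
#SBATCH --mail-user=you@example.edu  # placeholder email address
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --output=myjob-%j.out        # %j expands to the job ID

# Your actual work goes here
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK} cores"
```

Save this as, e.g., `myjob.sh` and submit it with `sbatch myjob.sh`.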
## GPU-specific Jobs
For GPU-specific partitions (a100_normal_q, h200_normal_q, l40s_normal_q, a30_normal_q, v100_normal_q, t4_normal_q), one of the following MUST BE DEFINED.
| Long | Function | Notes |
|---|---|---|
| `--gres=gpu:<n>` | Request `<n>` GPUs per node | The total number of GPUs = `<n>` × the number of nodes |
| `--gpus=<n>` | Request `<n>` GPUs total for the job | |
| `--gpus-per-node=<n>` | Request `<n>` GPUs per node | Same as `--gres=gpu:<n>` |
| `--gpus-per-task=<n>` | Request `<n>` GPUs per task | Must also define either `--ntasks` or `--gpus` |
Default settings differ based on which cluster/partition you are running on. You may check the default settings for some of these variables for each partition with the command `scontrol show partition <partition name>`.
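For illustration, a GPU job requesting two GPUs on a single node might use directives like the sketch below. The account name and resource values are placeholders, and any one of the `--gpus*` forms above could be used in place of `--gres`.

```bash
#!/bin/bash
#SBATCH --account=myaccount        # placeholder billing account
#SBATCH --partition=a100_normal_q
#SBATCH --nodes=1
#SBATCH --gres=gpu:2               # two GPUs on the node
#SBATCH --time=02:00:00

# nvidia-smi shows the GPUs Slurm has made visible to the job
nvidia-smi
```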
## Slurm constraints
Constraints allow users to make very specific requests to the scheduler, such as selecting a specific CPU vendor or an architecture feature (e.g., AVX512).
To request a constraint, you must add the following line to your submit script:
```bash
#SBATCH --constraint=<feature_name>
```
Constraints are not needed by default and are intended only for advanced users who want to restrict which nodes a job may use when multiple architecture types belong to the same partition.
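If you want to check which features are actually defined on a cluster's nodes before adding a constraint, `sinfo` can print each node's feature list (`%f` is the features field):

```bash
sinfo -N -o "%N %P %f"
```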
| Cluster | Partitions | Feature | Description |
|---|---|---|---|
| TinkerCliffs | normal_q, preemptable_q | amd | Select only nodes with AMD CPUs |
| TinkerCliffs | normal_q, preemptable_q | intel | Select only nodes with Intel CPUs |
| TinkerCliffs | normal_q, preemptable_q | avx512 | Select only nodes with AVX512 (i.e., the Intel nodes) |
| TinkerCliffs | a100_normal_q, a100_preemptable_q | hpe-A100 | Select only HPE nodes with A100 GPUs |
| TinkerCliffs | a100_normal_q, a100_preemptable_q | dgx-A100 | Select only DGX nodes with A100 GPUs |
| Owl | normal_q, preemptable_q | avx512 | Select only nodes with AVX512 (i.e., the Genoa nodes) |
An example constraining a job to just the DGX A100 nodes on our TinkerCliffs cluster would look like the following:

```bash
#SBATCH --constraint=dgx-A100
```
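In the context of a full submission script, the constraint sits alongside the usual directives; a sketch with a placeholder account might look like:

```bash
#!/bin/bash
#SBATCH --account=myaccount       # placeholder billing account
#SBATCH --partition=a100_normal_q
#SBATCH --constraint=dgx-A100     # restrict the job to DGX A100 nodes
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

nvidia-smi
```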