TinkerCliffs - ARC’s Flagship Cluster
Overview
TinkerCliffs has 353 nodes, 44,224 CPU cores, 133 TB RAM, 112 NVIDIA A100 and 56 NVIDIA H200 GPUs. TinkerCliffs hardware is summarized in the table below.
Node Type | Base Compute Nodes | Intel Nodes | High Memory Nodes | DGX A100 GPU Nodes | A100 GPU Nodes | H200 GPU Nodes | Total |
---|---|---|---|---|---|---|---|
Chip | AMD EPYC 7702 | Intel Xeon Platinum 9242 | - | - | - | - | - |
Architecture | Zen 2 | Cascade Lake | Zen 2 | Zen 2 | Zen 2 | Emerald Rapids | - |
Slurm features | amd | intel, avx512 | amd | dgx-A100 | hpe-A100 | - | - |
Nodes | 308 | 16 | 8 | 10 | 4 | 7 | 353 |
GPUs | - | - | - | 8x NVIDIA A100-80G | 8x NVIDIA A100-80G | 8x NVIDIA H200-141G | 168 |
Cores/Node | 128 | 96 | 128 | 128 | 128 | 64 | - |
Memory (GB)/Node | 256 | 384 | 1,024 | 2,048 | 2,048 | 2,048 | - |
Maximum Memory for Slurm (GB)/Node | 243 | 368 | 999 | 2,007 | 2,007 | 2,007 | - |
Total Cores | 39,424 | 1,536 | 1,024 | 1,280 | 512 | 448 | 44,224 |
Total Memory (GB) | 78,848 | 6,144 | 8,192 | 20,480 | 8,192 | 14,336 | 136,192 |
Local Disk | 480GB SSD | 3.2TB NVMe | 480GB SSD | 30TB Gen4 NVMe | 11.7TB NVMe | 28TB NVMe | - |
Interconnect | HDR-100 IB | HDR-100 IB | HDR-100 IB | 8x HDR-200 IB | 4x HDR-200 IB | 8x HDR-200 IB | - |
The AMD EPYC 7702 base compute nodes are also called Tinkercliffs-Rome nodes.
Tinkercliffs is hosted in the Steger Hall HPC datacenter on the Virginia Tech campus, so it is physically separated from the other ARC HPC systems, which are hosted in the AISB Datacenter at the Corporate Research Center (CRC) in Blacksburg.
An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).
Get Started
Tinkercliffs can be accessed via one of the two login nodes using your VT credentials:
- tinkercliffs1.arc.vt.edu
- tinkercliffs2.arc.vt.edu
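For example, from a terminal (replace `yourpid` with your VT username; either of the login nodes listed above works):

```bash
ssh yourpid@tinkercliffs1.arc.vt.edu
```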
For testing purposes, all users will be allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level are able to request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.
To create an allocation:
1. Log in to the ARC allocation portal at https://coldfront.arc.vt.edu
2. Select or create a project
3. Click the “+ Request Resource Allocation” button
4. Choose the “Compute (Free) (Cluster)” allocation type
Usage needs in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.
Partitions
Users submit jobs to partitions of the cluster depending on the type of resources needed (for example, CPUs or GPUs). Features are optional restrictions users can add to their job submission to restrict execution to nodes meeting specific requirements. If a job does not specify the amount of memory requested, the parameter DefMemPerCPU automatically determines the memory for the job based on the number of CPU cores requested. If a GPU job does not specify the number of CPU cores, the parameter DefCpuPerGPU automatically determines the number of CPU cores based on the number of GPUs requested. Jobs are billed against the user’s allocation according to the number of CPU cores, the amount of memory, and the GPU time used. Consult the Slurm configuration below to understand how to specify these parameters for your job; an example that relies on the defaults follows the table.
Partition | normal_q | preemptable_q | a100_normal_q | a100_preemptable_q | h200_normal_q | h200_preemptable_q |
---|---|---|---|---|---|---|
Node Type | Base Compute, Intel, High Memory | Base Compute, Intel, High Memory | DGX A100 GPU, A100 GPU | DGX A100 GPU, A100 GPU | H200 GPU | H200 GPU |
Features | amd, intel, avx512 | amd, intel, avx512 | hpe-A100, dgx-A100 | hpe-A100, dgx-A100 | - | - |
Number of Nodes | 332 | 332 | 14 | 14 | 7 | 7 |
DefMemPerCPU (MB) | 1944 | 1944 | 16056 | 16056 | 32112 | 32112 |
DefCpuPerGPU | - | - | 8 | 8 | 4 | 4 |
TRESBillingWeights | CPU=1.0,Mem=0.0625G | - | CPU=1.0,Mem=0.0625G,GRES/gpu=100.0 | - | CPU=1.0,Mem=0.0625G,GRES/gpu=150 | - |
PreemptMode | OFF | ON | OFF | ON | OFF | ON |
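For example, a minimal sketch of a GPU job that relies on these defaults (the account name is a placeholder; the defaults applied come from the table above):

```bash
#!/bin/bash
#SBATCH --account=yourallocation     # placeholder; use your own allocation account
#SBATCH --partition=a100_normal_q
#SBATCH --nodes=1
## Request 2 GPUs and nothing else: per the table above, Slurm should apply
## DefCpuPerGPU=8 (so 16 cores) and DefMemPerCPU=16056 MB per core
## (so roughly 251 GB of memory) to this job.
#SBATCH --gres=gpu:2
#SBATCH --time=1:00:00

module reset
nvidia-smi    # confirm the GPUs visible to the job
```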
Quality of Service (QoS)
ARC must balance the needs of individuals with the needs of all users to ensure fairness. This is done by providing options that determine the Quality of Service (QoS).
The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their jobs’ needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated to the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day. ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users. The current limits are listed in the table below and can also be queried directly from Slurm (see the example after the table).
Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount | UsageFactor |
---|---|---|---|---|---|---|
normal_q | tc_normal_base | 1000 | 7-00:00:00 | cpu=8397,mem=18276G | cpu=16794,mem=36552G | 1 |
normal_q | tc_normal_long | 500 | 14-00:00:00 | cpu=2100,mem=4569G | cpu=4199,mem=9138G | 1 |
normal_q | tc_normal_short | 2000 | 1-00:00:00 | cpu=12596,mem=27414G | cpu=25191,mem=54828G | 2 |
preemptable_q | tc_preemptable_base | 0 | 30-00:00:00 | cpu=1050,mem=2285G | cpu=2100,mem=4569G | 0 |
a100_normal_q | tc_a100_normal_base | 1000 | 7-00:00:00 | cpu=359,mem=5642G,gres/gpu=23 | cpu=717,mem=11284G,gres/gpu=45 | 1 |
a100_normal_q | tc_a100_normal_long | 500 | 14-00:00:00 | cpu=90,mem=1411G,gres/gpu=6 | cpu=180,mem=2821G,gres/gpu=12 | 1 |
a100_normal_q | tc_a100_normal_short | 2000 | 1-00:00:00 | cpu=538,mem=8463G,gres/gpu=34 | cpu=1076,mem=16926G,gres/gpu=68 | 2 |
a100_preemptable_q | tc_a100_preemptable_base | 0 | 30-00:00:00 | cpu=45,mem=706G,gres/gpu=3 | cpu=90,mem=1411G,gres/gpu=6 | 0 |
h200_normal_q | tc_h200_normal_base | 1000 | 7-00:00:00 | cpu=90,mem=2868G,gres/gpu=12 | cpu=180,mem=5735G,gres/gpu=23 | 1 |
h200_normal_q | tc_h200_normal_long | 500 | 14-00:00:00 | cpu=23,mem=717G,gres/gpu=3 | cpu=45,mem=1434G,gres/gpu=6 | 1 |
h200_normal_q | tc_h200_normal_short | 2000 | 1-00:00:00 | cpu=135,mem=4301G,gres/gpu=17 | cpu=269,mem=8602G,gres/gpu=34 | 2 |
h200_preemptable_q | tc_h200_preemptable_base | 0 | 30-00:00:00 | cpu=12,mem=359G,gres/gpu=2 | cpu=23,mem=717G,gres/gpu=3 | 0 |
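The live QoS limits can also be listed directly from Slurm. A sketch using standard Slurm commands (field names may vary slightly by Slurm version):

```bash
# List QoS definitions with their wall-time and per-user/per-account TRES limits.
sacctmgr show qos format=name%30,priority,maxwall,maxtrespu%40,maxtrespa%40,usagefactor

# Show the QoS values and defaults accepted by a given partition.
scontrol show partition normal_q
```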
Features
Features or constraints for Tinkercliffs cluster node types and partitions.
Cluster | Node Types | Partitions | User-selectable features |
---|---|---|---|
Tinkercliffs | AMD Zen2 “Rome” nodes | normal_q, preemptable_q | amd |
Tinkercliffs | AMD Zen2 “Rome” large-memory nodes | normal_q, preemptable_q | amd |
Tinkercliffs | Intel “CascadeLake-AP” nodes | normal_q, preemptable_q | intel, avx512 |
Tinkercliffs | HPE 8x A100-80G GPU nodes | a100_normal_q, a100_preemptable_q | hpe-A100 |
Tinkercliffs | Nvidia DGX 8x A100-80G GPU nodes | a100_normal_q, a100_preemptable_q | dgx-A100 |
Tinkercliffs | Dell 8x H200 GPU nodes | h200_normal_q, h200_preemptable_q | n/a - homogeneous partitions |
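To verify which features the nodes in each partition advertise (and therefore which `--constraint` values are valid), a quick check with a standard Slurm command, for example:

```bash
# Show each partition, its node count, the features on those nodes, and the node list.
sinfo -o "%20P %8D %30f %N"
```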
Examples
Specify use of the AMD EPYC 7702 base compute nodes of the normal_q partition
The sbatch slurm script is given below. The key line is `#SBATCH --constraint=amd`, which restricts the job to the AMD base compute nodes of the normal_q.
```bash
#!/bin/bash
## For tc (tinkercliffs) cluster.
## Job name.
#SBATCH --job-name=i_hope_this_runs
## You will need your own account.
#SBATCH --account=arcadm
# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q
## Specifying that you must have this job run on the
## Tinkercliffs Rome nodes (i.e., the AMD EPYC 7702
## base compute nodes).
#SBATCH --constraint=amd
## This next line is optional because it is the default.
#SBATCH --qos=tc_normal_base
## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err
## Maximum wall clock time.
#SBATCH --time=48:00:00
## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16
# Reset modules.
module reset
# Load particular modules.
module load foss/2023b
# Source any virtual environments.
## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0
echo " SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo " OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo " MV2_ENABLE_AFFINITY: " $MV2_ENABLE_AFFINITY
echo " SLURM_NTASKS: " $SLURM_NTASKS
## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix $THE_EXEC $THE_INPUT
```
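Assuming the script above is saved as amd_job.sh (a hypothetical filename), it can be submitted and monitored with the standard Slurm commands:

```bash
sbatch amd_job.sh            # submit the job; Slurm prints the assigned job ID
squeue -u $USER              # list your pending and running jobs
scontrol show job <jobid>    # detailed information for one job
scancel <jobid>              # cancel the job if needed
```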
Specify use of the Intel Xeon Platinum 9242 compute nodes of the normal_q partition
The key lines below are `#SBATCH --constraint=intel` and `#SBATCH --constraint=avx512`, which specify the requisite Intel nodes of the normal_q.
```bash
#!/bin/bash
## For tc (tinkercliffs) cluster.
## Job name.
#SBATCH --job-name=i_hope_this_runs
## You will need your own account.
#SBATCH --account=arcadm
# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q
## Specifying that you must have this job run on the
## Tinkercliffs Intel Cascade Lake nodes.
#SBATCH --constraint=intel
#SBATCH --constraint=avx512
## This next line is optional because it is the default.
#SBATCH --qos=tc_normal_base
## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err
## Maximum wall clock time.
#SBATCH --time=48:00:00
## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16
# Reset modules.
module reset
# Load particular modules.
module load foss/2023b
# Source any virtual environments.
## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0
echo " SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo " OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo " MV2_ENABLE_AFFINITY: " $MV2_ENABLE_AFFINITY
echo " SLURM_NTASKS: " $SLURM_NTASKS
## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix $THE_EXEC $THE_INPUT
```
Specification of memory that forces use of the Intel compute nodes of the normal_q partition
Let us suppose that you want to run on the normal_q, but you do not care which type of node you run on, so the `--constraint` options in the first two examples above are no longer applicable.
For the normal_q, the AMD EPYC 7702 compute nodes have 256 GB/node of memory and the Intel Cascade Lake compute nodes have 384 GB/node. So if we request more than about 255 GB of memory per node, the job will automatically run on the Intel Cascade Lake nodes because the AMD nodes do not have sufficient memory (in practice, anything above the 243 GB of Slurm-schedulable memory on the AMD nodes already forces the Intel nodes).
An example is below; see the `#SBATCH --mem` line.
```bash
#!/bin/bash
## For tc (tinkercliffs) cluster.
## Job name.
#SBATCH --job-name=i_hope_this_runs
## You will need your own account.
#SBATCH --account=arcadm
# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q
## This next line is optional because it is the default.
#SBATCH --qos=tc_normal_base
## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err
## Maximum wall clock time.
#SBATCH --time=48:00:00
## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16
## Memory requirement.
## This is a per-compute-node memory specification.
## This number is greater than 255 (or 256 GB), so
## slurm will run this job on Intel Cascade Lake nodes
## automatically.
#SBATCH --mem=286G
# Reset modules.
module reset
# Load particular modules.
module load foss/2023b
# Source any virtual environments.
## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0
echo " SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo " OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo " MV2_ENABLE_AFFINITY: " $MV2_ENABLE_AFFINITY
echo " SLURM_NTASKS: " $SLURM_NTASKS
## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix $THE_EXEC $THE_INPUT
```
Specification of a number of compute cores (cpus) that forces use of the AMD compute nodes of the normal_q partition
Let us suppose that you want to run on the normal_q, but you do not care which type of node you run on, so the `--constraint` options in the first two examples above are no longer applicable.
For the normal_q, the AMD EPYC 7702 compute nodes have 128 cpus (or, equivalently, cores) each and the Intel Cascade Lake compute nodes have 96 cores each. So if we request 100 compute cores on one compute node, the job will automatically run on an AMD EPYC 7702 compute node because the Intel nodes do not have 100 cores.
An example is below. The key lines are the combination of `#SBATCH --ntasks-per-node` and `#SBATCH --cpus-per-task`: multiplied together, these give the number of cpus per compute node, which in our example is 20 x 5 = 100.
```bash
#!/bin/bash
## For tc (tinkercliffs) cluster.
## Job name.
#SBATCH --job-name=i_hope_this_runs
## You will need your own account.
#SBATCH --account=arcadm
# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q
## This next line is optional because it is the default.
#SBATCH --qos=tc_normal_base
## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err
## Maximum wall clock time.
#SBATCH --time=48:00:00
## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=20
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=5
# Reset modules.
module reset
# Load particular modules.
module load foss/2023b
# Source any virtual environments.
## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0
echo " SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo " OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo " MV2_ENABLE_AFFINITY: " $MV2_ENABLE_AFFINITY
echo " SLURM_NTASKS: " $SLURM_NTASKS
## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix $THE_EXEC $THE_INPUT
```
Specification to run at higher priority for a shorter duration
In the table above, one runs a “normal” job on the normal_q by specifying the qos as tc_normal_base (or not specifying a qos, as this is the default); such a job can run for a maximum of seven days. However, one can run at a higher priority, but for a shorter duration of up to one day, by specifying the qos as tc_normal_short. The table shows that the limits on the number of cores (cpus) and the amount of memory increase over the base case (tc_normal_base), but the billing rate is twice that of a normal job.
This is our example: we want to run for 23 hours, so we can use the tc_normal_short qos value. Note that we will be billed 2x for this usage.
The key lines below are:
- `#SBATCH --constraint=amd`: because we want to use the AMD EPYC compute nodes of the normal_q.
- `#SBATCH --qos=tc_normal_short`: because we want higher priority and our wall clock time is no more than 24 hours.
- `#SBATCH --time=23:00:00`: the 23 hours of wall clock time.
- `#SBATCH --nodes=70`: this and the next two values, multiplied together, must be <= 12596.
- `#SBATCH --ntasks-per-node=1`: because the TOTAL number of cores in the job must be <= 12596.
- `#SBATCH --cpus-per-task=128`: 70 nodes x 1 task/node x 128 cpus/task = 8960 cores, which is too big for the default qos (tc_normal_base), which allows 8397 cores.
- `#SBATCH --mem=240G`: the total amount of memory over all compute nodes, 70 x 240G = 16800G, must be <= 27414G (and the per-node request must fit within the 243 GB of Slurm-schedulable memory on the AMD nodes).
Note that there are per-user limits and per-account limits, so many people on one account cannot load up and dominate the cluster.
Also note that the memory required, 240G/node x 70 nodes = 16800G, could be accommodated by qos=tc_normal_base, but the number of cores, 128 x 1 x 70 = 8960, is too large for qos=tc_normal_base (which has a limit of 8397 cores). Thus, qos=tc_normal_short must be used.
```bash
#!/bin/bash
## For tc (tinkercliffs) cluster.
## Job name.
#SBATCH --job-name=i_hope_this_runs
## You will need your own account.
#SBATCH --account=arcadm
# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q
## Specifying that you must have this job run on the
## Tinkercliffs Rome nodes (i.e., the AMD EPYC 7702
## base compute nodes).
#SBATCH --constraint=amd
## This next line is NOT optional.
#SBATCH --qos=tc_normal_short
## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err
## Maximum wall clock time.
#SBATCH --time=23:00:00
## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=70
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=1
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=128
## Memory requirement.
## This is a per-compute-node memory specification.
## 240 GB fits within the roughly 243 GB of memory per node
## that Slurm can schedule on the AMD Rome base compute nodes.
#SBATCH --mem=240G
# Reset modules.
module reset
# Load particular modules.
module load foss/2023b
# Source any virtual environments.
## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0
echo " SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo " OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo " MV2_ENABLE_AFFINITY: " $MV2_ENABLE_AFFINITY
echo " SLURM_NTASKS: " $SLURM_NTASKS
## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix $THE_EXEC $THE_INPUT
```
Specification to run at lower priority for a longer duration
In the table above, one runs a “normal” job on the normal_q by specifying the qos as tc_normal_base (or not specifying a qos, as this is the default); such a job can run for a maximum of seven days. However, one can run at a lower priority, but for a longer duration of up to fourteen days, by specifying the qos as tc_normal_long. The table above shows that the resource limits (cpus/cores and memory) are reduced from the base case.
The same variables as in the previous example are important here; we list them without repeating the details. The key lines are:
- `#SBATCH --constraint=amd`: because we want to use the AMD EPYC compute nodes of the normal_q.
- `#SBATCH --qos=tc_normal_long`: because we require a running time longer than seven days.
- `#SBATCH --time=13-23:00:00`: a little less than 14 days.
- `#SBATCH --nodes=21`: this and the next two values, multiplied together, must be <= 2100.
- `#SBATCH --ntasks-per-node=4`: because the TOTAL number of cores in the job must be <= 2100.
- `#SBATCH --cpus-per-task=25`: 21 nodes x 4 tasks/node x 25 cpus/task = 2100 cores.
- `#SBATCH --mem=210G`: the total amount of memory over all compute nodes, 21 nodes x 210 GB/node = 4410G, must be <= 4569G.
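Unlike the previous examples, no full script is reproduced here; a minimal sbatch sketch assembling the directives above (the account name and executable paths are the same placeholders used in the earlier examples) might look like this:

```bash
#!/bin/bash
## For tc (tinkercliffs) cluster: long-duration job at reduced priority.
#SBATCH --job-name=i_hope_this_runs
#SBATCH --account=arcadm
#SBATCH --partition=normal_q
#SBATCH --constraint=amd
#SBATCH --qos=tc_normal_long
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err
## A little less than the 14-day maximum for this QoS.
#SBATCH --time=13-23:00:00
## 21 nodes x 4 tasks x 25 cpus = 2100 cores, within the cpu=2100 per-user limit.
#SBATCH --nodes=21
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=25
## 21 nodes x 210G = 4410G, within the mem=4569G per-user limit.
#SBATCH --mem=210G

module reset
module load foss/2023b

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmix ../bin-tc-openmpi/is-hybrid-tc-openmpi ./nim-11-1-1.inp
```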
Specification to run on a100 GPU nodes
The sbatch slurm script is given below. The MATLAB file, code02.m, follows the sbatch slurm script.
```bash
#!/bin/bash
# Run on tinkercliffs.
#SBATCH -J matgpu
## Wall time.
#SBATCH --time=0-01:00:00
## Account to "charge" to/run against.
#SBATCH --account=arcadm
## Partition/queue.
#SBATCH --partition=a100_normal_q
### This requests 1 node, 1 core. 1 gpu.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
## Slurm output and error files.
## Always use %j in slurm scripts to capture the SLURM_JOB_ID in your names.
#SBATCH -o slurm.matlab.02.gpu.%j.out
#SBATCH -e slurm.matlab.02.gpu.%j.err
## Load modules, if any.
module reset
module load MATLAB/R2024b
## Load virtual environments, if any.
# None in this case.
# Set up
## Get the core number for job and other job details.
echo " ------------"
echo "Set of cores job running on: "
echo " "
scontrol show job -d $SLURM_JOB_ID
echo " "
echo " "
## Monitor the GPU.
## The 3 in the command below means log data every three seconds.
## You may wish to change the time depending on your anticipated
## execution duration.
echo " "
echo " "
echo "Start file and monitoring of GPU."
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 3 > $SLURM_JOBID.gpu.log &
echo " "
echo " "
echo " "
echo " ------------"
echo "Running IOSTAT"
iostat 2 >iostat-stdout.txt 2>iostat-stderr.txt &
echo " ------------"
echo "Running MPSTAT"
mpstat -P ALL 2 >mpstat-stdout.txt 2>mpstat-stderr.txt &
echo " ------------"
echo "Running VMSTAT"
vmstat 2 >vmstat-stdout.txt 2>vmstat-stderr.txt &
echo " ------------"
echo "Running executable"
# Code to execute.
arrayLength=2000000
numIterations=1000000
## Code name.
mycode="code02"
## Invocation. Matlab syntax in double-quotes.
matlab -nodisplay -nosplash -r "bogus = ${mycode}(${arrayLength}, ${numIterations})"
echo " ------------"
echo "Executable done"
echo " ------------"
echo "Killing IOSTAT"
kill %1
echo " ------------"
echo "Killing MPSTAT"
kill %2
echo " ------------"
echo "Killing VMSTAT"
kill %3
The Matlab code, code02.m, called out in the above sbatch slurm script, is below:
```matlab
function aa = code02(arrayLength, numIterations)
fprintf('arrayLength: \n');
disp (arrayLength);
fprintf('numIterations: \n');
disp (numIterations);
N = arrayLength;
r = gpuArray.linspace(0,4,N);
x = rand(1,N,"gpuArray");
% numIterations = 1000;
for n=1:numIterations
x = r.*x.*(1-x);
end
% plot(r,x,'.',MarkerSize=1)
% xlabel("Growth Rate")
% ylabel("Population")
% Return argument.
aa="done";
end
```
Specification to run on h200 GPU nodes
In the sbatch slurm script for the a100 example, please make the following changes:
- change `mycode="code02"` to `mycode="code04"`.
- change `#SBATCH --partition=a100_normal_q` to `#SBATCH --partition=h200_normal_q`.
The code code04.m is below.
```matlab
function aa = code04(arrayLength, numIterations)
% Output file.
outfile = "mat.out"
fprintf('arrayLength: \n');
disp (arrayLength);
fprintf('numIterations: \n');
disp (numIterations);
fprintf('outfile: \n');
disp (outfile);
% This is code for gpu computations.
N = arrayLength;
r = gpuArray.linspace(0,4,N);
x = rand(1,N,"gpuArray");
for n=1:numIterations
x = r.*x.*(1-x);
end
% plot(r,x,'.',MarkerSize=1)
% xlabel("Growth Rate")
% ylabel("Population")
% Return argument.
aa="done";
end
```
Preemptable partitions
Note from the QoS table above that one can use `#SBATCH --qos=tc_preemptable_base` to run jobs for up to 30 days of wall clock time on the preemptable_q partition. However, as the name implies, your job could be preempted (i.e., killed) if a non-preemptable job is waiting in the Slurm job queue.
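A minimal header sketch for such a job is below (the account name is the placeholder used throughout; whether a preempted job is requeued or simply cancelled depends on the cluster's preemption settings, so `--requeue` is shown only as a common, optional choice for restartable work):

```bash
#!/bin/bash
#SBATCH --job-name=long_preemptable
#SBATCH --account=arcadm             # placeholder account
#SBATCH --partition=preemptable_q
#SBATCH --qos=tc_preemptable_base
## Up to 30 days of wall clock time is permitted by this QoS.
#SBATCH --time=21-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
## Optional: ask Slurm to requeue the job if it is preempted
## (only useful if the application checkpoints or can restart).
#SBATCH --requeue
```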
Example extensions
Note that the examples above are for the normal_q partition, per the QoS table. Analogous examples apply to the a100_normal_q and h200_normal_q partitions, and a couple of GPU examples are provided above. These GPU-based partitions have the same types of QoS options as the normal_q, e.g., `#SBATCH --qos=tc_a100_normal_long` and `#SBATCH --qos=tc_h200_normal_short`.
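For instance, a sketch of the header lines one would change for a long-running A100 job (resource requests must stay within the tc_a100_normal_long limits in the QoS table above):

```bash
#SBATCH --partition=a100_normal_q
#SBATCH --qos=tc_a100_normal_long
## Up to 14 days, but per-user limits are cpu=90, mem=1411G, gres/gpu=6.
#SBATCH --time=10-00:00:00
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
```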
Optimization
The performance of jobs can be greatly enhanced by applying appropriate optimizations. Not only does this reduce the execution time of jobs, it also makes more efficient use of the resources for the benefit of all.
See the tuning guides available at https://developer.amd.com and https://www.intel.com/content/www/us/en/developer/
General principles of optimization (a brief sketch follows this list):
- Cache locality really matters: process pinning can make a big difference in performance.
- Hybrid programming often pays off: one MPI process per L3 cache with 4 threads is often optimal.
- Use the appropriate `-march` flag to optimize compiled code, and the appropriate `-gencode` flag when using the NVCC compiler.
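As a sketch of the hybrid-programming advice above on a Zen 2 node (where 4 cores share an L3 cache, so a 128-core node has 32 L3 domains), one might launch one MPI rank per L3 cache with 4 OpenMP threads each. The binding settings below are standard Slurm/OpenMP options, and the application name is a placeholder:

```bash
## 32 MPI ranks per node, 4 cores (one L3 cache) per rank.
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores        # pin threads to physical cores
export OMP_PROC_BIND=close     # keep each rank's threads within its L3 domain

## Bind each MPI rank to its own block of 4 cores.
srun --cpu-bind=cores --mpi=pmix ./my_hybrid_app
```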
Suggested optimization parameters:
Node Type | Base Compute Nodes | Intel Nodes | High Memory Nodes | DGX A100 GPU Nodes | A100 GPU Nodes | H200 GPU Nodes |
---|---|---|---|---|---|---|
CPU arch | Zen 2 | Cascade Lake | Zen 2 | Zen 2 | Zen 2 | Emerald Rapids |
Compiler flags | -march=znver2 | -march=cascadelake | -march=znver2 | -march=znver2 | -march=znver2 | -march=sapphirerapids |
GPU arch | - | - | - | NVIDIA A100 | NVIDIA A100 | NVIDIA H200 |
Compute Capability | - | - | - | 8.0 | 8.0 | 9.0 |
NVCC flags | - | - | - | -gencode arch=compute_80,code=sm_80 | -gencode arch=compute_80,code=sm_80 | -gencode arch=compute_90,code=sm_90 |
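As an illustration, hedged compile lines using these flags (GCC and NVCC shown; source and output names are placeholders, and your preferred compiler toolchain may differ):

```bash
# CPU code targeting the AMD Zen 2 (Rome) nodes.
gcc -O3 -march=znver2 -fopenmp -o my_app my_app.c

# CPU code targeting the Intel Cascade Lake nodes.
gcc -O3 -march=cascadelake -fopenmp -o my_app my_app.c

# GPU code for the A100 nodes (compute capability 8.0).
nvcc -O3 -gencode arch=compute_80,code=sm_80 -o my_gpu_app my_gpu_app.cu

# GPU code for the H200 nodes (compute capability 9.0).
nvcc -O3 -gencode arch=compute_90,code=sm_90 -o my_gpu_app my_gpu_app.cu
```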