TinkerCliffs - ARC’s Flagship Cluster

Overview

TinkerCliffs has 353 nodes, 44,224 CPU cores, 133 TB of RAM, 112 NVIDIA A100 GPUs, and 56 NVIDIA H200 GPUs. TinkerCliffs hardware is summarized in the table below.

| Node Type | Base Compute Nodes | Intel Nodes | High Memory Nodes | DGX A100 GPU Nodes | A100 GPU Nodes | H200 GPU Nodes | Total |
|---|---|---|---|---|---|---|---|
| Chip | AMD EPYC 7702 | Intel Xeon Platinum 9242 | AMD EPYC 7702 | AMD EPYC 7742 | AMD EPYC 7742 | Intel Xeon Platinum 8562Y+ | - |
| Architecture | Zen 2 | Cascade Lake | Zen 2 | Zen 2 | Zen 2 | Emerald Rapids | - |
| Slurm features | amd | intel, avx512 | amd | dgx-A100 | hpe-A100 | - | - |
| Nodes | 308 | 16 | 8 | 10 | 4 | 7 | 353 |
| GPUs | - | - | - | 8x NVIDIA A100-80G | 8x NVIDIA A100-80G | 8x NVIDIA H200-141G | 168 |
| Cores/Node | 128 | 96 | 128 | 128 | 128 | 64 | - |
| Memory (GB)/Node | 256 | 384 | 1,024 | 2,048 | 2,048 | 2,048 | - |
| Maximum Memory for Slurm (GB)/Node | 243 | 368 | 999 | 2,007 | 2,007 | 2,007 | - |
| Total Cores | 39,424 | 1,536 | 1,024 | 1,280 | 512 | 448 | 44,224 |
| Total Memory (GB) | 78,848 | 6,144 | 8,192 | 20,480 | 8,192 | 14,336 | 136,192 |
| Local Disk | 480GB SSD | 3.2TB NVMe | 480GB SSD | 30TB Gen4 NVMe | 11.7TB NVMe | 28TB NVMe | - |
| Interconnect | HDR-100 IB | HDR-100 IB | HDR-100 IB | 8x HDR-200 IB | 4x HDR-200 IB | 8x HDR-200 IB | - |

The AMD EPYC 7702 base compute nodes are also called Tinkercliffs-Rome nodes.

Tinkercliffs is hosted in the Steger Hall HPC datacenter on the Virginia Tech campus, so it is physically separated from the other ARC HPC systems, which are hosted in the AISB Datacenter at the Corporate Research Center (CRC) in Blacksburg.

An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).

Get Started

Tinkercliffs can be accessed via one of the two login nodes using your VT credentials:

  • tinkercliffs1.arc.vt.edu

  • tinkercliffs2.arc.vt.edu
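
For example, assuming your VT username is jdoe (a placeholder; substitute your own), you can connect with ssh:

ssh jdoe@tinkercliffs1.arc.vt.edu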

For testing purposes, all users are allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.

To create an allocation, log in to the ARC allocation portal at https://coldfront.arc.vt.edu and then:

  • Select or create a project

  • Click the “+ Request Resource Allocation” button

  • Choose the “Compute (Free) (Cluster)” allocation type

Usage needs in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.

Partitions

Users submit jobs to partitions of the cluster depending on the type of resources needed (for example, CPUs or GPUs). Features are optional restrictions users can add to their job submission to restrict execution to nodes meeting specific requirements. If a job does not specify the amount of memory requested, the parameter DefMemPerCPU automatically determines the memory for the job based on the number of CPU cores requested. If a GPU job does not specify the number of CPU cores, the parameter DefCpuPerGPU automatically determines the number of CPU cores based on the number of GPUs requested. Jobs are billed against the user's allocation based on the CPU cores, memory, and GPU time they use. Consult the Slurm configuration to understand how to specify the parameters for your job.

| Partition | normal_q | preemptable_q | a100_normal_q | a100_preemptable_q | h200_normal_q | h200_preemptable_q |
|---|---|---|---|---|---|---|
| Node Type | Base Compute, Intel, High Memory | Base Compute, Intel, High Memory | DGX A100 GPU, A100 GPU | DGX A100 GPU, A100 GPU | H200 GPU | H200 GPU |
| Features | amd, intel, avx512 | amd, intel, avx512 | hpe-A100, dgx-A100 | hpe-A100, dgx-A100 | - | - |
| Number of Nodes | 332 | 332 | 14 | 14 | 7 | 7 |
| DefMemPerCPU (MB) | 1944 | 1944 | 16056 | 16056 | 32112 | 32112 |
| DefCpuPerGPU | - | - | 8 | 8 | 4 | 4 |
| TRESBillingWeights | CPU=1.0,Mem=0.0625G | - | CPU=1.0,Mem=0.0625G,GRES/gpu=100.0 | - | CPU=1.0,Mem=0.0625G,GRES/gpu=150 | - |
| PreemptMode | OFF | ON | OFF | ON | OFF | ON |
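
As an illustration of how these defaults apply (a minimal sketch, not an additional configuration requirement): an a100_normal_q job that requests two GPUs but specifies neither a CPU count nor memory is given 2 x DefCpuPerGPU = 16 cores and 16 x 16,056 MB (about 251 GB) of memory.

#SBATCH --partition=a100_normal_q
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
## No CPU count and no --mem specified, so Slurm applies
## DefCpuPerGPU=8 (16 cores total) and DefMemPerCPU=16056 MB
## per core (about 251 GB total) for this job.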

Quality of Service (QoS)

ARC must balance the needs of individuals with the needs of all to ensure fairness. This is done by providing options which determine the Quality of Service (QoS).

The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their job's needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated for the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day. ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users.

| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount | UsageFactor |
|---|---|---|---|---|---|---|
| normal_q | tc_normal_base | 1000 | 7-00:00:00 | cpu=8397,mem=18276G | cpu=16794,mem=36552G | 1 |
| normal_q | tc_normal_long | 500 | 14-00:00:00 | cpu=2100,mem=4569G | cpu=4199,mem=9138G | 1 |
| normal_q | tc_normal_short | 2000 | 1-00:00:00 | cpu=12596,mem=27414G | cpu=25191,mem=54828G | 2 |
| preemptable_q | tc_preemptable_base | 0 | 30-00:00:00 | cpu=1050,mem=2285G | cpu=2100,mem=4569G | 0 |
| a100_normal_q | tc_a100_normal_base | 1000 | 7-00:00:00 | cpu=359,mem=5642G,gres/gpu=23 | cpu=717,mem=11284G,gres/gpu=45 | 1 |
| a100_normal_q | tc_a100_normal_long | 500 | 14-00:00:00 | cpu=90,mem=1411G,gres/gpu=6 | cpu=180,mem=2821G,gres/gpu=12 | 1 |
| a100_normal_q | tc_a100_normal_short | 2000 | 1-00:00:00 | cpu=538,mem=8463G,gres/gpu=34 | cpu=1076,mem=16926G,gres/gpu=68 | 2 |
| a100_preemptable_q | tc_a100_preemptable_base | 0 | 30-00:00:00 | cpu=45,mem=706G,gres/gpu=3 | cpu=90,mem=1411G,gres/gpu=6 | 0 |
| h200_normal_q | tc_h200_normal_base | 1000 | 7-00:00:00 | cpu=90,mem=2868G,gres/gpu=12 | cpu=180,mem=5735G,gres/gpu=23 | 1 |
| h200_normal_q | tc_h200_normal_long | 500 | 14-00:00:00 | cpu=23,mem=717G,gres/gpu=3 | cpu=45,mem=1434G,gres/gpu=6 | 1 |
| h200_normal_q | tc_h200_normal_short | 2000 | 1-00:00:00 | cpu=135,mem=4301G,gres/gpu=17 | cpu=269,mem=8602G,gres/gpu=34 | 2 |
| h200_preemptable_q | tc_h200_preemptable_base | 0 | 30-00:00:00 | cpu=12,mem=359G,gres/gpu=2 | cpu=23,mem=717G,gres/gpu=3 | 0 |
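
These limits can also be inspected directly on the cluster with sacctmgr (a hedged example; the available format fields may vary slightly with the installed Slurm version):

sacctmgr show qos tc_normal_base format=Name,Priority,MaxWall,MaxTRESPU%40,UsageFactor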

Features

Features or constraints for Tinkercliffs cluster node types and partitions.

| Cluster | Node Types | Partitions | User-selectable features |
|---|---|---|---|
| Tinkercliffs | AMD Zen2 “Rome” nodes | normal_q, preemptable_q | --constraint=amd |
| Tinkercliffs | AMD Zen2 “Rome” large-memory nodes | normal_q, preemptable_q | --constraint=amd and --mem=<size> larger than 256G |
| Tinkercliffs | Intel “CascadeLake-AP” nodes | normal_q, preemptable_q | --constraint=intel and --constraint=avx512 |
| Tinkercliffs | HPE 8x A100-80G GPU nodes | a100_normal_q, a100_preemptable_q | --constraint=hpe-A100 |
| Tinkercliffs | Nvidia DGX 8x A100-80G GPU nodes | a100_normal_q, a100_preemptable_q | --constraint=dgx-A100 |
| Tinkercliffs | Dell 8x H200 GPU nodes | h200_normal_q, h200_preemptable_q | n/a (homogeneous partitions) |
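
Constraints are passed to Slurm with --constraint, either in a batch script (as in the examples below) or on the command line. For instance, a hedged sketch of requesting an interactive session on a DGX A100 node (the account name is a placeholder):

salloc --partition=a100_normal_q --constraint=dgx-A100 --gres=gpu:1 --account=<your_account> --time=1:00:00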

Examples

Specify use of the AMD EPYC 7702 base compute nodes in the normal_q partition

The sbatch Slurm script is given below.

The key line is #SBATCH --constraint=amd, which restricts the job to the AMD base compute nodes of the normal_q.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## Specifying that you must have this job run on the 
## Tinkercliffs Rome nodes (i.e., the AMD EPYC 7702
## base compute nodes).
#SBATCH --constraint=amd

## This next line is optional because it is the default. 
#SBATCH --qos=tc_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT
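
Assuming the script above is saved as amd_job.sh (a hypothetical file name), it is submitted and monitored with the standard Slurm commands:

sbatch amd_job.sh
squeue -u $USER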

Specify use of the Intel Xeon Platinum 9242 compute nodes in the normal_q partition

The key lines below are:

  1. #SBATCH --constraint=intel

  2. #SBATCH --constraint=avx512

… which together restrict the job to the Intel Cascade Lake nodes of the normal_q.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## Specifying that you must have this job run on the 
## Tinkercliffs Intel Cascade Lake nodes.
#SBATCH --constraint=intel
#SBATCH --constraint=avx512

## This next line is optional because it is the default. 
#SBATCH --qos=tc_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specification of memory that forces use of the Intel compute nodes of the normal_q partition

Let us suppose that you want to run on the normal_q, but you do not care which type of node you run on, so the --constraint options in the first two examples above are no longer applicable.

For the normal_q, the AMD EPYC 7702 compute nodes have 256 GB/node of memory and the Intel Cascade Lake compute nodes have 384 GB/node.

So if we request more than, say, 255 GB of memory per node, the job cannot run on the AMD base compute nodes because they do not have sufficient memory, and it will run on the Intel Cascade Lake nodes (or on one of the eight high-memory nodes) instead.

An example is below. See the #SBATCH --mem command.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## This next line is optional because it is the default. 
#SBATCH --qos=tc_normal_base

## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

## Memory requirement.
## This is a per-compute-node memory specification.
## This number is greater than the 256 GB available on the
## AMD base compute nodes, so Slurm will schedule this job
## on the Intel Cascade Lake (or high-memory) nodes.
#SBATCH --mem=286G


# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specification of a number of compute cores (CPUs) that forces use of the AMD compute nodes of the normal_q partition

Let us suppose that you want to run on the normal_q, but you do not care which type of node you run on, so the --constraint options in the first two examples above are no longer applicable.

For the normal_q, the AMD EPYC 7702 compute nodes have 128 cpus (or, equivalently, cores) and the Intel Cascade Lake compute nodes have 96 cores each.

So if we specify that we want 100 compute cores on one compute node, then the job will automatically be run on AMD EPYC 7702 compute nodes because the Intel nodes do not have 100 cores.

An example is below.

The key lines below are the combination of the two lines:

  1. #SBATCH --ntasks-per-node

  2. #SBATCH --cpus-per-task

Multiplying the two values gives the number of CPUs per compute node; in our example, 20 x 5 = 100.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## This next line is optional because it is the default. 
#SBATCH --qos=tc_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=20
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=5

# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specification to run at higher priority for a shorter duration

In the table above, one runs a "normal" job on the normal_q by specifying the qos as tc_normal_base (or not specifying a qos, since this is the default). Such a job runs for a maximum of seven days, per the table above. However, one can run at a higher priority, but for a shorter duration of up to one day, by using the qos tc_normal_short. You can see that the number of cores (CPUs) and the amount of memory increase over the base case (tc_normal_base), but the billing rate is twice that of a normal job.

This is our example. We want to run for 23 hours, so we can use the tc_normal_short qos value. Note we will be billed 2x for this usage.

The key lines below are:

  1. #SBATCH --constraint=amd: because we want to use an AMD EPYC compute node of the normal_q.

  2. #SBATCH --qos=tc_normal_short: because we want higher priority and our wall clock time must be less than or equal to 24 hours.

  3. #SBATCH --time=0-23:00:00: This is the 23 hours of wall clock time.

  4. #SBATCH --nodes=70: This and the next two values, multiplied together, must be <= 12596.

  5. #SBATCH --ntasks-per-node=1: Because the TOTAL number of cores in the job must be <= 12596.

  6. #SBATCH --cpus-per-task=128: 70 nodes x 128 cpus/node = 8960, which is too big for the default qos, whose limit is 8397 cores.

  7. #SBATCH --mem=255G: Because the total amount of memory over all compute nodes, 70 x 255G = 17850G, must be <= 27414G.

Note that there are per-user limits and per-account limits, so many people on one account cannot load up and dominate a cluster.

Also note that the memory required, 255G/node x 70 nodes = 17850G, could be accommodated under qos=tc_normal_base, but the number of cores, 128 x 1 x 70 = 8960, is too large for qos=tc_normal_base (which has a limit of 8397 cores). Thus, qos=tc_normal_short must be used.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## Specifying that you must have this job run on the 
## Tinkercliffs Rome nodes (i.e., the AMD EPYC 7702
## base compute nodes).
#SBATCH --constraint=amd

## This next line is NOT optional.
#SBATCH --qos=tc_normal_short

## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=23:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=70 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=1
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=128

## Memory requirement.
## This is a per-compute-node memory specification.
## 70 nodes x 255G = 17850G, which stays within the
## tc_normal_short per-user memory limit of 27414G.
#SBATCH --mem=255G


# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specification to run at lower priority for a longer duration

In the table above, one runs a "normal" job on the normal_q by specifying the qos as tc_normal_base (or not specifying a qos, since this is the default). Such a job runs for a maximum of seven days, per the table above. However, one can run at a lower priority, but for a longer duration of up to fourteen days, by using the qos tc_normal_long. The table above shows that the amount of resources (CPUs/cores and memory) is reduced from the base case.

The same settings as in the previous example matter here; we list them without further detail. The key lines, also shown in the sketch after this list, are:

  1. #SBATCH --constraint=amd: because we want to use an AMD EPYC compute node of the normal_q.

  2. #SBATCH --qos=tc_normal_long: because we require a longer running time than seven days.

  3. #SBATCH --time=13-23:00:00: This is a little less than 14 days.

  4. #SBATCH --nodes=21: This and the next two values, multiplied together, must be <= 2100.

  5. #SBATCH --ntasks-per-node=4: Because the TOTAL number of cores in the job must be <= 2100.

  6. #SBATCH --cpus-per-task=25.

  7. #SBATCH --mem=210G: Because the total amount of memory over all compute nodes, 21 nodes x 210 GB/node = 4410G, must be <= 4569G.
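
A minimal sketch of the corresponding #SBATCH header follows (the rest of the script is the same as in the earlier examples):

#!/bin/bash
## For tc (tinkercliffs) cluster.
#SBATCH --job-name=i_hope_this_runs
#SBATCH --account=arcadm
#SBATCH --partition=normal_q
#SBATCH --constraint=amd
## Lower priority, but up to 14 days of wall clock time.
#SBATCH --qos=tc_normal_long
#SBATCH --time=13-23:00:00
## 21 x 4 x 25 = 2100 cores, which is <= the tc_normal_long per-user limit of 2100.
#SBATCH --nodes=21
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=25
## 21 nodes x 210G = 4410G, which is <= the 4569G per-user memory limit.
#SBATCH --mem=210G
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err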

Specification to run on a100 GPU nodes

The sbatch Slurm script is given below. The Matlab file, code02.m, follows the sbatch script.

#!/bin/bash

# Run on tinkercliffs.

#SBATCH -J matgpu


## Wall time.
#SBATCH --time=0-01:00:00 

## Account to "charge" to/run against.
#SBATCH --account=arcadm

## Partition/queue.
#SBATCH --partition=a100_normal_q

### This requests 1 node, 1 core, and 1 GPU.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1 
#SBATCH --gres=gpu:1


## Slurm output and error files.
## Always use %j in slurm scripts to capture the SLURM_JOB_ID in your names.
#SBATCH -o slurm.matlab.02.gpu.%j.out
#SBATCH -e slurm.matlab.02.gpu.%j.err


## Load modules, if any.
module reset
module load MATLAB/R2024b

## Load virtual environments, if any.
# None in this case.

# Set up 

## Get the core number for job and other job details.
echo " ------------"
echo "Set of cores job running on: "
echo " "
scontrol show job -d  $SLURM_JOB_ID
echo " "
echo " "

## Monitor the GPU.
## The 3 in the command below means log data every three seconds.
## You may wish to change the time depending on your anticipated
## execution duration.
echo " "
echo " "
echo "Start file and monitoring of GPU."
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 3 > $SLURM_JOBID.gpu.log &
echo " " 
echo " " 

echo " " 
echo " ------------"
echo "Running IOSTAT"

iostat 2 >iostat-stdout.txt 2>iostat-stderr.txt &

echo " ------------"
echo "Running MPSTAT"

mpstat -P ALL 2 >mpstat-stdout.txt 2>mpstat-stderr.txt &

echo " ------------"
echo "Running VMSTAT"

vmstat 2 >vmstat-stdout.txt 2>vmstat-stderr.txt &

echo " ------------"
echo "Running executable"

# Code to execute.
arrayLength=2000000
numIterations=1000000

## Code name.
mycode="code02"

## Invocation.  Matlab syntax in double-quotes.
matlab -nodisplay -nosplash -r "bogus = ${mycode}(${arrayLength}, ${numIterations})"

echo " ------------"
echo "Executable done"

echo " ------------"
echo "Killing IOSTAT"
kill %1

echo " ------------"
echo "Killing MPSTAT"
kill %2

echo " ------------"
echo "Killing VMSTAT"
kill %3

The Matlab code, code02.m, called out in the above sbatch slurm script, is below:

function aa = code02(arrayLength, numIterations)

    fprintf('arrayLength: \n');
    disp (arrayLength);
    fprintf('numIterations: \n');
    disp (numIterations);

    N = arrayLength;
    r = gpuArray.linspace(0,4,N);
    x = rand(1,N,"gpuArray");

    % numIterations = 1000;
    for n=1:numIterations
        x = r.*x.*(1-x);
    end

    % plot(r,x,'.',MarkerSize=1)
    % xlabel("Growth Rate")
    % ylabel("Population")

    % Return argument.
    aa="done";
end

Specification to run on h200 GPU nodes

In the sbatch slurm script for the a100 example, please make the following changes:

  1. change mycode="code02" to mycode="code04".

  2. change #SBATCH --partition=a100_normal_q to #SBATCH --partition=h200_normal_q.

The code code04.m is below.

function aa = code04(arrayLength, numIterations)

    % Output file.
    outfile = "mat.out"

    fprintf('arrayLength: \n');
    disp (arrayLength);
    fprintf('numIterations: \n');
    disp (numIterations);
    fprintf('outfile: \n');
    disp (outfile);

    % This is code for gpu computations.
    N = arrayLength;
    r = gpuArray.linspace(0,4,N);
    x = rand(1,N,"gpuArray");

    for n=1:numIterations
        x = r.*x.*(1-x);
    end

    % plot(r,x,'.',MarkerSize=1)
    % xlabel("Growth Rate")
    % ylabel("Population")

    % Return argument.
    aa="done";
end

Preemptable partitions

Note from the QoS table above that one could use #SBATCH --qos=tc_preemptable_base to run jobs with up to 30 days of wall clock time on the preemptable_q partition. However, as the name implies, your job could be preempted, i.e., killed, if a non-preemptable job is waiting in the Slurm job queue.
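
A minimal sketch of the relevant #SBATCH lines (the rest of a script follows the earlier patterns):

#SBATCH --partition=preemptable_q
#SBATCH --qos=tc_preemptable_base
## Up to 30 days of wall clock time; priority 0 and UsageFactor 0 per the QoS table.
#SBATCH --time=30-00:00:00
## Optionally ask Slurm to requeue the job if it is preempted rather than leaving it cancelled.
#SBATCH --requeue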

Example extensions

Note that the examples above are for the normal_q partition, per the QoS table. Analogous examples apply to the a100_normal_q and h200_normal_q partitions; a couple were provided above. These GPU-based partitions offer the same types of QoS options as the normal_q examples, e.g., #SBATCH --qos=tc_a100_normal_long and #SBATCH --qos=tc_h200_normal_short. A sketch is given below.
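
For instance, a minimal sketch of the key lines for a longer-running A100 job, with limits taken from the QoS table above:

#SBATCH --partition=a100_normal_q
#SBATCH --qos=tc_a100_normal_long
## Up to 14 days; per-user limits are cpu=90, mem=1411G, gres/gpu=6.
#SBATCH --time=13-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1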

Optimization

The performance of jobs can be greatly enhanced by applying appropriate optimizations. Not only does this reduce the execution time of jobs, but it also makes more efficient use of the resources for the benefit of all users.

See the tuning guides available at https://developer.amd.com and https://www.intel.com/content/www/us/en/developer/

General principles of optimization:

  • Cache locality really matters - process pinning can make a big difference in performance.

  • Hybrid programming often pays off - one MPI process per L3 cache with 4 threads is often optimal (see the sketch after this list).

  • Use the appropriate -march flag to optimize the compiled code and the -gencode flag when using the NVCC compiler (see the example after the table below).
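
As a sketch of the hybrid layout mentioned above, assuming the Zen 2 nodes (where four cores share each L3 cache, so a 128-core node holds 32 MPI ranks with 4 OpenMP threads each); my_hybrid_app is a placeholder executable:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
## Bind each rank to its own group of 4 cores so its threads stay within one L3 cache.
srun --cpu-bind=cores --mpi=pmix ./my_hybrid_app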

Suggested optimization parameters:

| Node Type | Base Compute Nodes | Intel Nodes | High Memory Nodes | DGX A100 GPU Nodes | A100 GPU Nodes | H200 GPU Nodes |
|---|---|---|---|---|---|---|
| CPU arch | Zen 2 | Cascade Lake | Zen 2 | Zen 2 | Zen 2 | Emerald Rapids |
| Compiler flags | -march=znver2 | -march=cascadelake | -march=znver2 | -march=znver2 | -march=znver2 | -march=native |
| GPU arch | - | - | - | NVIDIA A100 | NVIDIA A100 | NVIDIA H200 |
| Compute Capability | - | - | - | 8.0 | 8.0 | 9.0 |
| NVCC flags | - | - | - | -gencode=arch=compute_80,code=sm_80 | -gencode=arch=compute_80,code=sm_80 | -gencode=arch=compute_90,code=sm_90 |
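
For example, a hedged sketch of applying these flags on the command line (source and output file names are placeholders):

# CPU code targeting the Zen 2 (Rome) nodes:
gcc -O3 -march=znver2 -o my_app my_app.c

# GPU code targeting the A100 nodes (compute capability 8.0):
nvcc -O3 -gencode=arch=compute_80,code=sm_80 -o my_gpu_app my_gpu_app.cu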