TinkerCliffs - ARC’s Flagship Cluster

Overview

TinkerCliffs has 353 nodes, 44,224 CPU cores, 133 TB of RAM, 112 NVIDIA A100 GPUs, and 56 NVIDIA H200 GPUs. TinkerCliffs hardware is summarized in the table below.

| Node Type | Base Compute Nodes | Intel Nodes | High Memory Nodes | DGX A100 GPU Nodes | A100 GPU Nodes | H200 GPU Nodes | Total |
|---|---|---|---|---|---|---|---|
| Chip | AMD EPYC 7702 | Intel Xeon Platinum 9242 | AMD EPYC 7702 | AMD EPYC 7742 | AMD EPYC 7742 | Intel Xeon Platinum 8562Y+ | - |
| Architecture | Zen 2 | Cascade Lake | Zen 2 | Zen 2 | Zen 2 | Emerald Rapids | - |
| Slurm features | amd | intel, avx512 | amd | dgx-A100 | hpe-A100 | - | - |
| Nodes | 308 | 16 | 8 | 10 | 4 | 7 | 353 |
| GPUs | - | - | - | 8x NVIDIA A100-80G | 8x NVIDIA A100-80G | 8x NVIDIA H200-141G | 168 |
| Cores/Node | 128 | 96 | 128 | 128 | 128 | 64 | - |
| Memory (GB)/Node | 256 | 384 | 1,024 | 2,048 | 2,048 | 2,048 | - |
| Maximum Memory for Slurm (GB)/Node | 243 | 368 | 999 | 2,007 | 2,007 | 2,007 | - |
| Total Cores | 39,424 | 1,536 | 1,024 | 1,280 | 512 | 448 | 44,224 |
| Total Memory (GB) | 78,848 | 6,144 | 8,192 | 20,480 | 8,192 | 14,336 | 136,192 |
| Local Disk | 480GB SSD | 3.2TB NVMe | 480GB SSD | 30TB Gen4 NVMe | 11.7TB NVMe | 28TB NVMe | - |
| Interconnect | HDR-100 IB | HDR-100 IB | HDR-100 IB | 8x HDR-200 IB | 4x HDR-200 IB | 8x HDR-200 IB | - |

The AMD EPYC 7702 base compute nodes are also called Tinkercliffs-Rome nodes.

Tinkercliffs is hosted in the Steger Hall HPC datacenter on the Virginia Tech campus, so it is physically separated from the other ARC HPC systems, which are hosted in the AISB Datacenter at the Corporate Research Center (CRC) in Blacksburg.

An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).

Get Started

Tinkercliffs can be accessed via one of the two login nodes using your VT credentials:

  • tinkercliffs1.arc.vt.edu

  • tinkercliffs2.arc.vt.edu
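
For example, assuming your VT username is jdoe (a placeholder; substitute your own), you can connect with ssh:

ssh jdoe@tinkercliffs1.arc.vt.edu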

For testing purposes, all users are allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.

To create an allocation, log in to the ARC allocation portal at https://coldfront.arc.vt.edu and then:

  • Select or create a project

  • Click the “+ Request Resource Allocation” button

  • Choose the “Compute (Free) (Cluster)” allocation type

Usage needs in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.

Partitions

Users submit jobs to partitions of the cluster depending on the type of resources needed (for example, CPUs or GPUs). Features are optional restrictions users can add to their job submission to restrict execution to nodes meeting specific requirements. If a job does not specify the amount of memory requested, the parameter DefMemPerCPU automatically determines the memory for the job based on the number of CPU cores requested. If a GPU job does not specify the number of CPU cores, the parameter DefCpuPerGPU automatically determines the number of CPU cores based on the number of GPUs requested. Jobs are billed against the user's allocation based on the CPU cores, memory, and GPU time they use. Consult the Slurm configuration to understand how to specify the parameters for your job.

| Partition | normal_q | preemptable_q | a100_normal_q | a100_preemptable_q | h200_normal_q | h200_preemptable_q |
|---|---|---|---|---|---|---|
| Node Type | Base Compute, Intel, High Memory | Base Compute, Intel, High Memory | DGX A100 GPU, A100 GPU | DGX A100 GPU, A100 GPU | H200 GPU | H200 GPU |
| Features | amd, intel, avx512 | amd, intel, avx512 | hpe-A100, dgx-A100 | hpe-A100, dgx-A100 | - | - |
| Number of Nodes | 332 | 332 | 14 | 14 | 7 | 7 |
| DefMemPerCPU (MB) | 1944 | 1944 | 16056 | 16056 | 32112 | 32112 |
| DefCpuPerGPU | - | - | 8 | 8 | 4 | 4 |
| TRESBillingWeights | CPU=1.0,Mem=0.0625G | - | CPU=1.0,Mem=0.0625G,GRES/gpu=100.0 | - | CPU=1.0,Mem=0.0625G,GRES/gpu=150 | - |
| PreemptMode | OFF | ON | OFF | ON | OFF | ON |
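
As an illustration of how these defaults apply (a minimal sketch, not an additional configuration requirement): an a100_normal_q job that requests two GPUs but specifies neither a CPU count nor memory is given 2 x DefCpuPerGPU = 16 cores and 16 x 16,056 MB (about 251 GB) of memory.

#SBATCH --partition=a100_normal_q
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
## No CPU count and no --mem specified, so Slurm applies
## DefCpuPerGPU=8 (16 cores total) and DefMemPerCPU=16056 MB
## per core (about 251 GB total) for this job.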

Quality of Service (QoS)

ARC must balance the needs of individuals with the needs of all to ensure fairness. This is done by providing options which determine the Quality of Service (QoS).

The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their job's needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated for the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day. ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users.

| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount | UsageFactor |
|---|---|---|---|---|---|---|
| normal_q | tc_normal_base | 1000 | 7-00:00:00 | cpu=8397,mem=18276G | cpu=16794,mem=36552G | 1 |
| normal_q | tc_normal_long | 500 | 14-00:00:00 | cpu=2100,mem=4569G | cpu=4199,mem=9138G | 1 |
| normal_q | tc_normal_short | 2000 | 1-00:00:00 | cpu=12596,mem=27414G | cpu=25191,mem=54828G | 2 |
| preemptable_q | tc_preemptable_base | 0 | 30-00:00:00 | cpu=1050,mem=2285G | cpu=2100,mem=4569G | 0 |
| a100_normal_q | tc_a100_normal_base | 1000 | 7-00:00:00 | cpu=359,mem=5642G,gres/gpu=23 | cpu=717,mem=11284G,gres/gpu=45 | 1 |
| a100_normal_q | tc_a100_normal_long | 500 | 14-00:00:00 | cpu=90,mem=1411G,gres/gpu=6 | cpu=180,mem=2821G,gres/gpu=12 | 1 |
| a100_normal_q | tc_a100_normal_short | 2000 | 1-00:00:00 | cpu=538,mem=8463G,gres/gpu=34 | cpu=1076,mem=16926G,gres/gpu=68 | 2 |
| a100_preemptable_q | tc_a100_preemptable_base | 0 | 30-00:00:00 | cpu=45,mem=706G,gres/gpu=3 | cpu=90,mem=1411G,gres/gpu=6 | 0 |
| h200_normal_q | tc_h200_normal_base | 1000 | 7-00:00:00 | cpu=90,mem=2868G,gres/gpu=12 | cpu=180,mem=5735G,gres/gpu=23 | 1 |
| h200_normal_q | tc_h200_normal_long | 500 | 14-00:00:00 | cpu=23,mem=717G,gres/gpu=3 | cpu=45,mem=1434G,gres/gpu=6 | 1 |
| h200_normal_q | tc_h200_normal_short | 2000 | 1-00:00:00 | cpu=135,mem=4301G,gres/gpu=17 | cpu=269,mem=8602G,gres/gpu=34 | 2 |
| h200_preemptable_q | tc_h200_preemptable_base | 0 | 30-00:00:00 | cpu=12,mem=359G,gres/gpu=2 | cpu=23,mem=717G,gres/gpu=3 | 0 |
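
These limits can also be inspected directly on the cluster with sacctmgr (a hedged example; the available format fields may vary slightly with the installed Slurm version):

sacctmgr show qos tc_normal_base format=Name,Priority,MaxWall,MaxTRESPU%40,UsageFactor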

Features

Features or constraints for Tinkercliffs cluster node types and partitions.

| Cluster | Node Types | Partitions | User-selectable features |
|---|---|---|---|
| Tinkercliffs | AMD Zen2 “Rome” nodes | normal_q, preemptable_q | --constraint=amd |
| Tinkercliffs | AMD Zen2 “Rome” large-memory nodes | normal_q, preemptable_q | --constraint=amd and --mem=<size> larger than 256G |
| Tinkercliffs | Intel “CascadeLake-AP” nodes | normal_q, preemptable_q | --constraint=intel and --constraint=avx512 |
| Tinkercliffs | HPE 8x A100-80G GPU nodes | a100_normal_q, a100_preemptable_q | --constraint=hpe-A100 |
| Tinkercliffs | Nvidia DGX 8x A100-80G GPU nodes | a100_normal_q, a100_preemptable_q | --constraint=dgx-A100 |
| Tinkercliffs | Dell 8x H200 GPU nodes | h200_normal_q, h200_preemptable_q | n/a (homogeneous partitions) |
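
Constraints are passed to Slurm with --constraint, either in a batch script (as in the examples below) or on the command line. For instance, a hedged sketch of requesting an interactive session on a DGX A100 node (the account name is a placeholder):

salloc --partition=a100_normal_q --constraint=dgx-A100 --gres=gpu:1 --account=<your_account> --time=1:00:00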

Examples

Specify use of the AMD EPYC 7702 base compute nodes in the normal_q partition

The sbatch Slurm script is given below.

The key line is #SBATCH --constraint=amd, which restricts the job to the AMD base compute nodes of the normal_q.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## Specifying that you must have this job run on the 
## Tinkercliffs Rome nodes (i.e., the AMD EPYC 7702
## base compute nodes).
#SBATCH --constraint=amd

## This next line is optional because it is the default. 
#SBATCH --qos=tc_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT
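
Assuming the script above is saved as amd_job.sh (a hypothetical file name), it is submitted and monitored with the standard Slurm commands:

sbatch amd_job.sh
squeue -u $USER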

Specify use of the Intel Xeon Platinum 9242 compute nodes in the normal_q partition

The key lines below are:

  1. #SBATCH --constraint=intel

  2. #SBATCH --constraint=avx512

… which together restrict the job to the Intel Cascade Lake nodes of the normal_q.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## Specifying that you must have this job run on the 
## Tinkercliffs Intel Cascade Lake nodes.
#SBATCH --constraint=intel
#SBATCH --constraint=avx512

## This next line is optional because it is the default. 
#SBATCH --qos=tc_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specification of memory that forces use of the Intel compute nodes of the normal_q partition

Let us suppose that you want to run on the normal_q, but you do not care which type of node you run on, so the --constraint options in the first two examples above are no longer applicable.

For the normal_q, the AMD EPYC 7702 compute nodes have 256 GB/node of memory and the Intel Cascade Lake compute nodes have 384 GB/node.

So if we request more than, say, 255 GB of memory per node, the job cannot run on the AMD base compute nodes because they do not have sufficient memory, and it will run on the Intel Cascade Lake nodes (or on one of the eight high-memory nodes) instead.

An example is below. See the #SBATCH --mem command.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## This next line is optional because it is the default. 
#SBATCH --qos=tc_normal_base

## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

## Memory requirement.
## This is a per-compute-node memory specification.
## This number is greater than the 256 GB available on the
## AMD base compute nodes, so Slurm will schedule this job
## on the Intel Cascade Lake (or high-memory) nodes.
#SBATCH --mem=286G


# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specification of a number of compute cores (CPUs) that forces use of the AMD compute nodes of the normal_q partition

Let us suppose that you want to run on the normal_q, but you do not care which type of node you run on, so the --constraint options in the first two examples above are no longer applicable.

For the normal_q, the AMD EPYC 7702 compute nodes have 128 cpus (or, equivalently, cores) and the Intel Cascade Lake compute nodes have 96 cores each.

So if we specify that we want 100 compute cores on one compute node, then the job will automatically be run on AMD EPYC 7702 compute nodes because the Intel nodes do not have 100 cores.

An example is below.

The key lines below are the combination of the two lines:

  1. #SBATCH --ntasks-per-node

  2. #SBATCH --cpus-per-task

Multiplying the two values gives the number of CPUs per compute node; in our example, 20 x 5 = 100.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## This next line is optional because it is the default. 
#SBATCH --qos=tc_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=20
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=5

# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specification to run at higher priority for a shorter duration

In the table above, one runs a "normal" job on the normal_q by specifying the qos as tc_normal_base (or not specifying a qos, since this is the default). Such a job runs for a maximum of seven days, per the table above. However, one can run at a higher priority, but for a shorter duration of up to one day, by using the qos tc_normal_short. You can see that the number of cores (CPUs) and the amount of memory increase over the base case (tc_normal_base), but the billing rate is twice that of a normal job.

This is our example. We want to run for 23 hours, so we can use the tc_normal_short qos value. Note we will be billed 2x for this usage.

The key lines below are:

  1. #SBATCH --constraint=amd: because we want to use an AMD EPYC compute node of the normal_q.

  2. #SBATCH --qos=tc_normal_short: because we want higher priority and our wall clock time must be less than or equal to 24 hours.

  3. #SBATCH --time=0-23:00:00: This is the 23 hours of wall clock time.

  4. #SBATCH --nodes=70: This and the next two values, multiplied together, must be <= 12596.

  5. #SBATCH --ntasks-per-node=1: Because the TOTAL number of cores in the job must be <= 12596.

  6. #SBATCH --cpus-per-task=128: 70 nodes x 128 cpus/node = 8960, which is too big for the default qos, whose limit is 8397 cores.

  7. #SBATCH --mem=255G: Because the total amount of memory over all compute nodes, 70 x 255G = 17850G, must be <= 27414G.

Note that there are per-user limits and per-account limits, so many people on one account cannot load up and dominate a cluster.

Also note that the memory required, 255G/node x 70 nodes = 17850G, could be accommodated under qos=tc_normal_base, but the number of cores, 128 x 1 x 70 = 8960, is too large for qos=tc_normal_base (which has a limit of 8397 cores). Thus, qos=tc_normal_short must be used.

#!/bin/bash

## For tc (tinkercliffs) cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q 

## Specifying that you must have this job run on the 
## Tinkercliffs Rome nodes (i.e., the AMD EPYC 7702
## base compute nodes).
#SBATCH --constraint=amd

## This next line is NOT optional.
#SBATCH --qos=tc_normal_short

## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=23:00:00 

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=70 
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=1
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=128

## Memory requirement.
## This is a per-compute-node memory specification.
## 70 nodes x 255G = 17850G, which stays within the
## tc_normal_short per-user memory limit of 27414G.
#SBATCH --mem=255G


# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS 
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-tc-openmpi/is-hybrid-tc-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specification to run at lower priority for a longer duration

In the table above, one runs a "normal" job on the normal_q by specifying the qos as tc_normal_base (or not specifying a qos, since this is the default). Such a job runs for a maximum of seven days, per the table above. However, one can run at a lower priority, but for a longer duration of up to fourteen days, by using the qos tc_normal_long. The table above shows that the amount of resources (CPUs/cores and memory) is reduced from the base case.

The same settings as in the previous example matter here; we list them without further detail. The key lines, also shown in the sketch after this list, are:

  1. #SBATCH --constraint=amd: because we want to use an AMD EPYC compute node of the normal_q.

  2. #SBATCH --qos=tc_normal_long: because we require a longer running time than seven days.

  3. #SBATCH --time=13-23:00:00: This is a little less than 14 days.

  4. #SBATCH --nodes=21: This and the next two values, multiplied together, must be <= 2100.

  5. #SBATCH --ntasks-per-node=4: Because the TOTAL number of cores in the job must be <= 2100.

  6. #SBATCH --cpus-per-task=25.

  7. #SBATCH --mem=210G: Because the total amount of memory over all compute nodes, 21 nodes x 210 GB/node = 4410G, must be <= 4569G.
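
A minimal sketch of the corresponding #SBATCH header follows (the rest of the script is the same as in the earlier examples):

#!/bin/bash
## For tc (tinkercliffs) cluster.
#SBATCH --job-name=i_hope_this_runs
#SBATCH --account=arcadm
#SBATCH --partition=normal_q
#SBATCH --constraint=amd
## Lower priority, but up to 14 days of wall clock time.
#SBATCH --qos=tc_normal_long
#SBATCH --time=13-23:00:00
## 21 x 4 x 25 = 2100 cores, which is <= the tc_normal_long per-user limit of 2100.
#SBATCH --nodes=21
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=25
## 21 nodes x 210G = 4410G, which is <= the 4569G per-user memory limit.
#SBATCH --mem=210G
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err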

Specification to run on a100 GPU nodes

The sbatch Slurm script is given below. The Matlab file, code02.m, follows the sbatch script.

#!/bin/bash

# Run on tinkercliffs.

#SBATCH -J matgpu


## Wall time.
#SBATCH --time=0-01:00:00 

## Account to "charge" to/run against.
#SBATCH --account=arcadm

## Partition/queue.
#SBATCH --partition=a100_normal_q

### This requests 1 node, 1 core, and 1 GPU.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1 
#SBATCH --gres=gpu:1


## Slurm output and error files.
## Always use %j in slurm scripts to capture the SLURM_JOB_ID in your names.
#SBATCH -o slurm.matlab.02.gpu.%j.out
#SBATCH -e slurm.matlab.02.gpu.%j.err


## Load modules, if any.
module reset
module load MATLAB/R2024b

## Load virtual environments, if any.
# None in this case.

# Set up 

## Get the core number for job and other job details.
echo " ------------"
echo "Set of cores job running on: "
echo " "
scontrol show job -d  $SLURM_JOB_ID
echo " "
echo " "

## Monitor the GPU.
## The 3 in the command below means log data every three seconds.
## You may wish to change the time depending on your anticipated
## execution duration.
echo " "
echo " "
echo "Start file and monitoring of GPU."
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 3 > $SLURM_JOBID.gpu.log &
echo " " 
echo " " 

echo " " 
echo " ------------"
echo "Running IOSTAT"

iostat 2 >iostat-stdout.txt 2>iostat-stderr.txt &

echo " ------------"
echo "Running MPSTAT"

mpstat -P ALL 2 >mpstat-stdout.txt 2>mpstat-stderr.txt &

echo " ------------"
echo "Running VMSTAT"

vmstat 2 >vmstat-stdout.txt 2>vmstat-stderr.txt &

echo " ------------"
echo "Running executable"

# Code to execute.
arrayLength=2000000
numIterations=1000000

## Code name.
mycode="code02"

## Invocation.  Matlab syntax in double-quotes.
matlab -nodisplay -nosplash -r "bogus = ${mycode}(${arrayLength}, ${numIterations})"

echo " ------------"
echo "Executable done"

echo " ------------"
echo "Killing IOSTAT"
kill %1

echo " ------------"
echo "Killing MPSTAT"
kill %2

echo " ------------"
echo "Killing VMSTAT"
kill %3

The Matlab code, code02.m, called out in the above sbatch slurm script, is below:

function aa = code02(arrayLength, numIterations)

    fprintf('arrayLength: \n');
    disp (arrayLength);
    fprintf('numIterations: \n');
    disp (numIterations);

    N = arrayLength;
    r = gpuArray.linspace(0,4,N);
    x = rand(1,N,"gpuArray");

    % numIterations = 1000;
    for n=1:numIterations
        x = r.*x.*(1-x);
    end

    % plot(r,x,'.',MarkerSize=1)
    % xlabel("Growth Rate")
    % ylabel("Population")

    % Return argument.
    aa="done";
end

Specification to run on h200 GPU nodes

In the sbatch slurm script for the a100 example, please make the following changes:

  1. change mycode="code02" to mycode="code04".

  2. change #SBATCH --partition=a100_normal_q to #SBATCH --partition=h200_normal_q.

The code code04.m is below.

function aa = code04(arrayLength, numIterations)

    % Output file.
    outfile = "mat.out"

    fprintf('arrayLength: \n');
    disp (arrayLength);
    fprintf('numIterations: \n');
    disp (numIterations);
    fprintf('outfile: \n');
    disp (outfile);

    % This is code for gpu computations.
    N = arrayLength;
    r = gpuArray.linspace(0,4,N);
    x = rand(1,N,"gpuArray");

    for n=1:numIterations
        x = r.*x.*(1-x);
    end

    % plot(r,x,'.',MarkerSize=1)
    % xlabel("Growth Rate")
    % ylabel("Population")

    % Return argument.
    aa="done";
end

Preemptable partitions

Note from the QoS table above that one could use #SBATCH --qos=tc_preemptable_base to run jobs with up to 30 days of wall clock time on the preemptable_q partition. However, as the name implies, your job could be preempted, i.e., killed, if a non-preemptable job is waiting in the Slurm job queue.
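
A minimal sketch of the relevant #SBATCH lines (the rest of a script follows the earlier patterns):

#SBATCH --partition=preemptable_q
#SBATCH --qos=tc_preemptable_base
## Up to 30 days of wall clock time; priority 0 and UsageFactor 0 per the QoS table.
#SBATCH --time=30-00:00:00
## Optionally ask Slurm to requeue the job if it is preempted rather than leaving it cancelled.
#SBATCH --requeue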

Example extensions

Note that the examples above are for the normal_q partition, per the QoS table. Analogous examples apply to the a100_normal_q and h200_normal_q partitions; a couple were provided above. These GPU-based partitions offer the same types of QoS options as the normal_q examples, e.g., #SBATCH --qos=tc_a100_normal_long and #SBATCH --qos=tc_h200_normal_short. A sketch is given below.
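
For instance, a minimal sketch of the key lines for a longer-running A100 job, with limits taken from the QoS table above:

#SBATCH --partition=a100_normal_q
#SBATCH --qos=tc_a100_normal_long
## Up to 14 days; per-user limits are cpu=90, mem=1411G, gres/gpu=6.
#SBATCH --time=13-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1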

Optimization

The performance of jobs can be greatly enhanced by applying appropriate optimizations. Not only does this reduce the execution time of jobs, but it also makes more efficient use of the resources for the benefit of all users.

See the tuning guides available at https://developer.amd.com and https://www.intel.com/content/www/us/en/developer/

General principles of optimization:

  • Cache locality really matters - process pinning can make a big difference in performance.

  • Hybrid programming often pays off - one MPI process per L3 cache with 4 threads is often optimal (see the sketch after this list).

  • Use the appropriate -march flag to optimize the compiled code and the -gencode flag when using the NVCC compiler (see the example after the table below).
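
As a sketch of the hybrid layout mentioned above, assuming the Zen 2 nodes (where four cores share each L3 cache, so a 128-core node holds 32 MPI ranks with 4 OpenMP threads each); my_hybrid_app is a placeholder executable:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
## Bind each rank to its own group of 4 cores so its threads stay within one L3 cache.
srun --cpu-bind=cores --mpi=pmix ./my_hybrid_app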

Suggested optimization parameters:

| Node Type | Base Compute Nodes | Intel Nodes | High Memory Nodes | DGX A100 GPU Nodes | A100 GPU Nodes | H200 GPU Nodes |
|---|---|---|---|---|---|---|
| CPU arch | Zen 2 | Cascade Lake | Zen 2 | Zen 2 | Zen 2 | Emerald Rapids |
| Compiler flags | -march=znver2 | -march=cascadelake | -march=znver2 | -march=znver2 | -march=znver2 | -march=native |
| GPU arch | - | - | - | NVIDIA A100 | NVIDIA A100 | NVIDIA H200 |
| Compute Capability | - | - | - | 8.0 | 8.0 | 9.0 |
| NVCC flags | - | - | - | -gencode=arch=compute_80,code=sm_80 | -gencode=arch=compute_80,code=sm_80 | -gencode=arch=compute_90,code=sm_90 |
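
For example, a hedged sketch of applying these flags on the command line (source and output file names are placeholders):

# CPU code targeting the Zen 2 (Rome) nodes:
gcc -O3 -march=znver2 -o my_app my_app.c

# GPU code targeting the A100 nodes (compute capability 8.0):
nvcc -O3 -gencode=arch=compute_80,code=sm_80 -o my_gpu_app my_gpu_app.cu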