OWL - Water-cooled AMD CPU

OWL has 91 nodes, 8,704 CPU cores, and 80 TB RAM.

  • The compute nodes on Owl are exclusively CPU-based.

  • Direct water cooling of the base compute nodes allows them to run at their boost speed (3.8GHz) indefinitely, roughly 40% higher than the base clock rate. For comparison, the Tinkercliffs AMD base compute nodes run at 2.0GHz.

  • Genoa is AMD's first architecture to feature AVX-512 instructions, which provide 512-bit-wide vectorization (i.e., eight-way FP64 SIMD per clock cycle). The Tinkercliffs AMD base compute nodes support the previous-generation AVX2 instructions, which are 256 bits wide.

  • 12 memory channels per socket (24 per node) provide much higher aggregate memory bandwidth and increased granularity, which should provide substantial speedups for memory-bandwidth-constrained workloads such as finite-element analysis.

  • DDR5-4800 memory provides a nominal 50% speed increase over the DDR4-3200 memory on Tinkercliffs.

  • 768GB of memory per node provides ~8GB of memory per core, compared to ~2GB/core on Tinkercliffs.

  • Three nodes are equipped with very large memory (4TB or 8TB), enabling computational workloads for which we have never before had sufficient memory resources.
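
As a quick check, you can confirm AVX-512 support from a shell on a Genoa node. The snippet below is a minimal sketch (the account name is a placeholder, and the exact set of flags printed by lscpu may vary):

## Open a short interactive shell on an avx512 (Genoa) node.
srun --partition=normal_q --constraint=avx512 --account=<your_account> \
     --ntasks=1 --cpus-per-task=1 --time=00:10:00 --pty bash

## Then, inside that shell, list the AVX-512 CPU flags:
lscpu | grep -o 'avx512[a-z_]*' | sort -u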

Overview

| Node Type | Base Compute Nodes | Milan | Large Memory | Huge Memory | Total |
| --- | --- | --- | --- | --- | --- |
| Chip | AMD EPYC 9454 (Genoa) | AMD EPYC 7543 (Milan) | AMD EPYC 7763 (Milan) | AMD EPYC 7763 (Milan) | - |
| Architecture | Zen 4 | Zen 3 | Zen 3 | Zen 3 | - |
| Slurm features | amd, avx512 | amd | amd | amd | - |
| Nodes | 84 | 4 | 2 | 1 | 91 |
| Cores/Node | 96 | 64 | 128 | 128 | - |
| Memory (GB)/Node | 768 | 512 | 4,019 | 8,038 | - |
| Maximum Memory for Slurm (GB)/Node | 747 | 495 | 4,011 | 8,043 | - |
| Total Cores | 8,064 | 256 | 256 | 128 | 8,704 |
| Total Memory (GB) | 64,512 | 2,048 | 8,038 | 8,038 | 82,636 |
| Local Disk | 2.9TB NVMe | 818GB SSD | 2.9TB NVMe | 2.9TB NVMe | - |
| Interconnect | HDR-100 IB | HDR-100 IB | HDR-100 IB | HDR-100 IB | - |

Owl is hosted in the Steger Hall HPC datacenter on the Virginia Tech campus, so it is physically separated from the other ARC HPC systems, which are hosted in the AISB datacenter at the Corporate Research Center (CRC) in Blacksburg.

An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).

Get Started

Owl can be accessed via one of the three login nodes using your VT credentials:

  • owl1.arc.vt.edu

  • owl2.arc.vt.edu

  • owl3.arc.vt.edu
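
For example, to connect to the first login node (replace <pid> with your VT username; when off campus you may need the VT VPN):

ssh <pid>@owl1.arc.vt.edu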

For testing purposes, all users are allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.

To create an allocation, log in to the ARC allocation portal https://coldfront.arc.vt.edu

  • Select or create a project

  • Click the “+ Request Resource Allocation” button

  • Choose the “Compute (Free) (Cluster)” allocation type

Usage needs in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.

Partitions

Users submit jobs to partitions of the cluster depending on the type of resources needed. Features are optional restrictions users can add to a job submission to restrict its execution to nodes meeting specific requirements. If a job does not specify the amount of memory it needs, the DefMemPerCPU parameter automatically assigns memory based on the number of CPU cores requested. Jobs are billed against the user's allocation based on the number of CPU cores, the amount of memory, and the time used (see the worked example after the table below). Consult the Slurm configuration to understand how to specify these parameters for your job.

| Partition | normal_q | preemptable_q |
| --- | --- | --- |
| Node Type | All | All |
| Number of Nodes | 91 | 91 |
| DefMemPerCPU (MB) | 7920 | 7920 |
| TRESBillingWeights | CPU=1.5,Mem=0.0625G | - |
| PreemptMode | OFF | ON |
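
As a worked example of how the TRESBillingWeights translate into Service Units (a sketch assuming Slurm's default behavior of summing the weighted resources; ARC's accounting configuration may differ):

## Hypothetical normal_q job: 32 cores and 256GB of memory for 10 hours.
## Billing per hour = CPU x 1.5 + Mem(GB) x 0.0625
##                  = 32 x 1.5 + 256 x 0.0625 = 48 + 16 = 64
## Total charge     = 64 x 10 hours = 640 Service Units.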

Quality of Service (QoS)

ARC must balance the needs of individuals with the needs of all to ensure fairness. This is done by providing options which determine the Quality of Service (QoS).

The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their jobs' needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated to the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day. ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users.

| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount | UsageFactor |
| --- | --- | --- | --- | --- | --- | --- |
| normal_q | owl_normal_base | 1000 | 7-00:00:00 | cpu=1741,mem=16305G | cpu=3482,mem=32609G | 1 |
| normal_q | owl_normal_long | 500 | 14-00:00:00 | cpu=436,mem=4077G | cpu=871,mem=8153G | 1 |
| normal_q | owl_normal_short | 2000 | 1-00:00:00 | cpu=2612,mem=24457G | cpu=5223,mem=48913G | 2 |
| preemptable_q | owl_preemptable_base | 0 | 30-00:00:00 | cpu=218,mem=2039G | cpu=436,mem=4077G | 0 |
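
You can inspect the current QoS limits from a login node with Slurm's sacctmgr command. This is a minimal sketch; adjust the format fields as you see fit:

sacctmgr show qos owl_normal_base,owl_normal_long,owl_normal_short,owl_preemptable_base \
    format=Name%24,Priority,MaxWall,MaxTRESPU%32,UsageFactor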

Examples

Specify use of base compute nodes AMD EPYC 9454 Zen4 (“Genoa”) of the normal_q partition

There are two types of “basic” compute nodes, i.e., nodes that are NOT “high memory” nodes. These are, from the first table above:

  1. AMD EPYC 9454 Zen4 “Genoa” nodes.

  2. AMD EPYC 7543 Zen3 “Milan” nodes.

In this example, we want Option 1.

The key line below is #SBATCH --constraint=avx512, which specifies that your code will be run on the AMD Zen4 “Genoa” nodes.

#!/bin/bash

## For Owl cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q

## Specifying that you must have this job run on the
## Owl AMD EPYC 9454 Zen4 "Genoa" nodes.
#SBATCH --constraint=avx512

## This next line is optional because it is the default.
#SBATCH --qos=owl_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.


## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0 

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS 
 

## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-owl-openmpi/is-hybrid-owl-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specify use of base compute nodes AMD EPYC 7543 Zen3 (“Milan”) of the normal_q partition

Coming soon.

Specifying a memory requirement that ensures that the AMD EPYC 7543 Zen3 (“Milan”) nodes of the normal_q partition are not used

There are two types of “basic” compute nodes, i.e., nodes that are NOT “high memory” nodes. These are, from the first table above:

  1. AMD EPYC 9454 Zen4 “Genoa” nodes.

  2. AMD EPYC 7543 “Milan” nodes.

Our job requires 600GB of memory per compute node. From the table above, the AMD EPYC 7543 “Milan” base compute nodes do not have sufficient memory, but the AMD EPYC 9454 nodes do (768 GB/node). So do both of the larger-memory node types (the large memory and huge memory nodes).

Hence, because of the memory requirement, the “Milan” base compute nodes will not be used.

Which node type is actually used depends on the load on the system and Slurm’s scheduler. We know the AMD EPYC 9454 Zen4 “Genoa” nodes could be used because they have sufficient memory per node. However, if the Genoa nodes are already fully utilized, the job may instead run on the AMD EPYC 7763 Milan:

  1. large memory nodes (4011 GB/node)

  2. huge memory node (8043 GB/node)

Note that the job below specifies 48 cores. All three of the candidate compute node types, based on the memory requirement, can meet this core (i.e., cpu) requirement.

This is an example of how Slurm automatically selects a node that meets the specified job requirements in order to reduce user wait times. It also shows why, if you want to run on a particular type of compute node, the node type must be specified explicitly.

The key line below is #SBATCH --mem=600G.

#!/bin/bash

## For Owl cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q


## This next line is optional because it is the default.
#SBATCH --qos=owl_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

## Memory specification per compute node.
#SBATCH --mem=600G


# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.


## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0 

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS 
 

## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-owl-openmpi/is-hybrid-owl-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Now consider this aside. Suppose instead that we required 120 cores per node to run this job. In that case, the job would run on the AMD EPYC 7763 Milan (Zen 3) large or huge memory nodes, because those are the only node types that meet both the memory requirement and the core requirement.
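
The corresponding resource lines might look like the sketch below (the task/CPU split is purely illustrative; any combination totaling 120 cores per node behaves the same way):

## Hypothetical 120-core, 600GB-per-node variant of the script above.
## Only the EPYC 7763 large/huge memory nodes provide both >= 120 cores
## and >= 600GB per node, so Slurm will place the job there.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --cpus-per-task=1
#SBATCH --mem=600G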

Specification to run at higher priority for a shorter duration

In the QoS table above, one runs a “normal” job on the normal_q partition by specifying the QoS owl_normal_base (or by not specifying a QoS at all, since this is the default). Such a job can run for a maximum of seven days, per the table above. However, one can run at higher priority, for a shorter duration of up to one day, by using the QoS owl_normal_short. Note that the number of cores (CPUs) and the amount of memory allowed increase over the base case (owl_normal_base), but the billing rate is twice that of a normal job.

This is our example: we want to run for 23 hours, so we can use the owl_normal_short QoS. Note that we will be billed at 2x for this usage.

The key lines below are:

  1. #SBATCH --constraint=avx512: because we want to use the AMD EPYC 9454 (Genoa) compute nodes.

  2. #SBATCH --qos=owl_normal_short: because we want higher priority, and our wall clock time must be at most one day.

  3. #SBATCH --time=23:00:00: the 23 hours of wall clock time.

  4. #SBATCH --nodes=27: this and the next two values, multiplied together, must be <= 2612 (the owl_normal_short per-user CPU limit).

  5. #SBATCH --ntasks-per-node=1: because the TOTAL number of cores in the job must be <= 2612.

  6. #SBATCH --cpus-per-task=96: 27 nodes x 1 task/node x 96 cpus/task = 2592 <= 2612.

  7. #SBATCH --mem=740G: the per-node request must fit within the 747 GB available to Slurm on the Genoa nodes, and the total memory over all nodes, 27 x 740G = 19,980G, must be <= 24,457G.

Note that there are per-user limits and per-account limits, so many people sharing one account cannot collectively dominate the cluster.

Also note that the memory and the number of cores required each, by themselves, necessitate the use of owl_normal_short, because both of the values above (2,592 cores and 19,980G of memory) exceed the per-user limits of the default owl_normal_base.

Thus, #SBATCH --qos=owl_normal_short must be used.

While not strictly required for the job to run, including #SBATCH --constraint=avx512 makes it explicit that the job targets the Genoa node type.

#!/bin/bash

## For Owl cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q

## Specifying that you must have this job run on
## the AMD EPYC 9454 Genoa compute nodes.
#SBATCH --constraint=avx512

## This next line is NOT optional.
#SBATCH --qos=owl_normal_short

## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=23:00:00

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=27
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=1
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=96

## Memory requirement.
## This is a per-compute-node memory specification. It must fit
## within the 747 GB available to Slurm on the Genoa nodes.
#SBATCH --mem=740G


# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-owl-openmpi/is-hybrid-owl-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Preemptable partitions

Note from the QoS table above that one can use #SBATCH --qos=owl_preemptable_base to run jobs for up to 30 days of wall clock time on the preemptable_q partition. However, as the name implies, your job can be preempted, i.e., killed, if a non-preemptable job is waiting in the Slurm job queue.
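
The relevant lines for a preemptable job are sketched below. Adding --requeue is our own suggestion rather than an ARC requirement; it asks Slurm to requeue the job after preemption instead of cancelling it, and is only useful if your application can restart from its own checkpoints:

## Key lines for a preemptable job (sketch).
#SBATCH --partition=preemptable_q
#SBATCH --qos=owl_preemptable_base
#SBATCH --time=30-00:00:00
## Optional: requeue after preemption rather than cancel.
#SBATCH --requeue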

Optimization

The performance of jobs can be greatly enhanced by applying appropriate optimizations. Not only does this reduce the execution time of jobs, it also makes more efficient use of the resources for the benefit of all users.

See the tuning guides available at https://developer.amd.com and https://www.intel.com/content/www/us/en/developer/

General principles of optimization:

  • Cache locality really matters: process pinning can make a big difference in performance.

  • Hybrid programming often pays off: one MPI process per L3 cache with 4 threads is often optimal.

  • Use the appropriate -march flag to optimize the compiled code, and the -gencode flag when using the NVCC compiler.

Suggested optimization parameters:

| Node Type | Base Compute Nodes | Milan | Large Memory | Huge Memory |
| --- | --- | --- | --- | --- |
| CPU arch | Zen 4 | Zen 3 | Zen 3 | Zen 3 |
| Compiler flags | -march=znver4 | -march=znver3 | -march=znver3 | -march=znver3 |
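
As a concrete illustration (a sketch assuming a GCC-based toolchain such as the foss/2023b module used in the examples above, and a hypothetical source file my_app.c), the snippet below compiles for Zen 4 and launches a hybrid MPI+OpenMP run with ranks pinned to cores:

module reset
module load foss/2023b

# Build for the Zen 4 (Genoa) base compute nodes.
mpicc -O3 -march=znver4 -fopenmp -o my_app my_app.c

# Hybrid MPI+OpenMP launch on one 96-core Genoa node.
# The 24 ranks x 4 threads split is only an illustration; check the
# node's L3 topology (e.g., with lscpu) and aim for one rank per L3 cache.
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=close
export OMP_PLACES=cores
srun --nodes=1 --ntasks-per-node=24 --cpus-per-task=4 --cpu-bind=cores ./my_app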