OWL - Water-cooled AMD CPU

OWL has 91 nodes, 8,704 CPU cores, and 80 TB RAM.

  • The compute nodes on Owl are exclusively CPU-based.

  • Direct water cooling of the base compute nodes allows them to run at their boost speed (3.8GHz) indefinitely, roughly 40% higher than the base clock rate. For comparison, the Tinkercliffs AMD base compute nodes run at 2.0GHz.

  • Genoa is AMD's first architecture to feature AVX-512 instructions, which provide 512-bit-wide vectorization (i.e., eight-way FP64 SIMD per clock cycle). The Tinkercliffs AMD base compute nodes support the previous-generation AVX2 instructions, which are 256 bits wide.

  • 12 memory channels per socket (24 per node) provide much higher aggregate memory bandwidth and increased granularity, which should provide substantial speedups for memory-bandwidth-constrained workloads such as finite-element analysis.

  • DDR5-4800 memory provides a nominal 50% speed increase over the DDR4-3200 memory on Tinkercliffs.

  • 768GB of memory per node provides ~8GB of memory per core, compared to ~2GB/core on Tinkercliffs.

  • Three nodes are equipped with very large memory (4TB or 8TB), enabling computational workloads for which we have never before had sufficient memory resources.
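
As a quick check, you can confirm AVX-512 support from a shell on a Genoa node. The snippet below is a minimal sketch (the account name is a placeholder, and the exact set of flags printed by lscpu may vary):

## Open a short interactive shell on an avx512 (Genoa) node.
srun --partition=normal_q --constraint=avx512 --account=<your_account> \
     --ntasks=1 --cpus-per-task=1 --time=00:10:00 --pty bash

## Then, inside that shell, list the AVX-512 CPU flags:
lscpu | grep -o 'avx512[a-z_]*' | sort -u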

Overview

| Node Type | Base Compute Nodes | Milan | Large Memory | Huge Memory | Total |
| --- | --- | --- | --- | --- | --- |
| Chip | AMD EPYC 9454 (Genoa) | AMD EPYC 7543 (Milan) | AMD EPYC 7763 (Milan) | AMD EPYC 7763 (Milan) | - |
| Architecture | Zen 4 | Zen 3 | Zen 3 | Zen 3 | - |
| Slurm features | amd, avx512 | amd | amd | amd | - |
| Nodes | 84 | 4 | 2 | 1 | 91 |
| Cores/Node | 96 | 64 | 128 | 128 | - |
| Memory (GB)/Node | 768 | 512 | 4,019 | 8,038 | - |
| Maximum Memory for Slurm (GB)/Node | 747 | 495 | 4,011 | 8,043 | - |
| Total Cores | 8,064 | 256 | 256 | 128 | 8,704 |
| Total Memory (GB) | 64,512 | 2,048 | 8,038 | 8,038 | 82,636 |
| Local Disk | 2.9TB NVMe | 818GB SSD | 2.9TB NVMe | 2.9TB NVMe | - |
| Interconnect | HDR-100 IB | HDR-100 IB | HDR-100 IB | HDR-100 IB | - |

Owl is hosted in the Steger Hall HPC datacenter on the Virginia Tech campus, so it is physically separated from the other ARC HPC systems, which are hosted in the AISB datacenter at the Corporate Research Center (CRC) in Blacksburg.

An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).

Get Started

Owl can be accessed via one of the three login nodes using your VT credentials:

  • owl1.arc.vt.edu

  • owl2.arc.vt.edu

  • owl3.arc.vt.edu
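
For example, to connect to the first login node (replace <pid> with your VT username; when off campus you may need the VT VPN):

ssh <pid>@owl1.arc.vt.edu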

For testing purposes, all users are allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.

To create an allocation, log in to the ARC allocation portal https://coldfront.arc.vt.edu

  • Select or create a project

  • Click the “+ Request Resource Allocation” button

  • Choose the “Compute (Free) (Cluster)” allocation type

Usage needs in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.

Partitions

Users submit jobs to partitions of the cluster depending on the type of resources needed. Features are optional restrictions users can add to a job submission to restrict its execution to nodes meeting specific requirements. If a job does not specify the amount of memory it needs, the DefMemPerCPU parameter automatically assigns memory based on the number of CPU cores requested. Jobs are billed against the user's allocation based on the number of CPU cores, the amount of memory, and the time used (see the worked example after the table below). Consult the Slurm configuration to understand how to specify these parameters for your job.

| Partition | normal_q | preemptable_q |
| --- | --- | --- |
| Node Type | All | All |
| Number of Nodes | 91 | 91 |
| DefMemPerCPU (MB) | 7920 | 7920 |
| TRESBillingWeights | CPU=1.5,Mem=0.0625G | - |
| PreemptMode | OFF | ON |
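
As a worked example of how the TRESBillingWeights translate into Service Units (a sketch assuming Slurm's default behavior of summing the weighted resources; ARC's accounting configuration may differ):

## Hypothetical normal_q job: 32 cores and 256GB of memory for 10 hours.
## Billing per hour = CPU x 1.5 + Mem(GB) x 0.0625
##                  = 32 x 1.5 + 256 x 0.0625 = 48 + 16 = 64
## Total charge     = 64 x 10 hours = 640 Service Units.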

Quality of Service (QoS)

ARC must balance the needs of individuals with the needs of all to ensure fairness. This is done by providing options which determine the Quality of Service (QoS).

The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their jobs' needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated to the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day. ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users.

| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount | UsageFactor |
| --- | --- | --- | --- | --- | --- | --- |
| normal_q | owl_normal_base | 1000 | 7-00:00:00 | cpu=1741,mem=16305G | cpu=3482,mem=32609G | 1 |
| normal_q | owl_normal_long | 500 | 14-00:00:00 | cpu=436,mem=4077G | cpu=871,mem=8153G | 1 |
| normal_q | owl_normal_short | 2000 | 1-00:00:00 | cpu=2612,mem=24457G | cpu=5223,mem=48913G | 2 |
| preemptable_q | owl_preemptable_base | 0 | 30-00:00:00 | cpu=218,mem=2039G | cpu=436,mem=4077G | 0 |
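
You can inspect the current QoS limits from a login node with Slurm's sacctmgr command. This is a minimal sketch; adjust the format fields as you see fit:

sacctmgr show qos owl_normal_base,owl_normal_long,owl_normal_short,owl_preemptable_base \
    format=Name%24,Priority,MaxWall,MaxTRESPU%32,UsageFactor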

Examples

Specify use of base compute nodes AMD EPYC 9454 Zen4 (“Genoa”) of the normal_q partition

There are two types of “basic” compute nodes, i.e., nodes that are NOT “high memory” nodes. These are, from the first table above:

  1. AMD EPYC 9454 Zen4 “Genoa” nodes.

  2. AMD EPYC 7543 Zen3 “Milan” nodes.

In this example, we want Option 1.

The key line below is #SBATCH --constraint=avx512, which specifies that your code will be run on the AMD Zen4 “Genoa” nodes.

#!/bin/bash

## For Owl cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q

## Specifying that you must have this job run on the
## Owl AMD EPYC 9454 Zen4 "Genoa" nodes.
#SBATCH --constraint=avx512

## This next line is optional because it is the default.
#SBATCH --qos=owl_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.


## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0 

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS 
 

## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-owl-openmpi/is-hybrid-owl-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Specify use of base compute nodes AMD EPYC 7543 Zen3 (“Milan”) of the normal_q partition

Coming soon.

Specifying a memory requirement that ensures that the AMD EPYC 7543 Zen3 (“Milan”) nodes of the normal_q partition are not used

There are two types of “basic” compute nodes, i.e., nodes that are NOT “high memory” nodes. These are, from the first table above:

  1. AMD EPYC 9454 Zen4 “Genoa” nodes.

  2. AMD EPYC 7543 “Milan” nodes.

Our job requires 600GB of memory per compute node. From the table above, the AMD EPYC 7543 “Milan” base compute nodes do not have sufficient memory, but the AMD EPYC 9454 nodes do (768 GB/node). So do both of the larger-memory node types (the large memory and huge memory nodes).

Hence, because of the memory requirement, the “Milan” base compute nodes will not be used.

Which node type is actually used depends on the load on the system and Slurm’s scheduler. We know the AMD EPYC 9454 Zen4 “Genoa” nodes could be used because they have sufficient memory per node. However, if the Genoa nodes are already fully utilized, the job may instead run on the AMD EPYC 7763 Milan:

  1. large memory nodes (4011 GB/node)

  2. huge memory node (8043 GB/node)

Note that the job below specifies 48 cores. All three of the candidate compute node types, based on the memory requirement, can meet this core (i.e., cpu) requirement.

This is an example of how Slurm automatically selects a node that meets the specified job requirements in order to reduce user wait times. It also shows why, if you want to run on a particular type of compute node, the node type must be specified explicitly.

The key line below is #SBATCH --mem=600G.

#!/bin/bash

## For Owl cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q


## This next line is optional because it is the default.
#SBATCH --qos=owl_normal_base


## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=48:00:00

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=1
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=3
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=16

## Memory specification per compute node.
#SBATCH --mem=600G


# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.


## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0 

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS 
 

## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-owl-openmpi/is-hybrid-owl-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Now consider this aside. Suppose instead that we required 120 cores per node to run this job. In that case, the job would run on the AMD EPYC 7763 Milan (Zen 3) large or huge memory nodes, because those are the only node types that meet both the memory requirement and the core requirement.
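
The corresponding resource lines might look like the sketch below (the task/CPU split is purely illustrative; any combination totaling 120 cores per node behaves the same way):

## Hypothetical 120-core, 600GB-per-node variant of the script above.
## Only the EPYC 7763 large/huge memory nodes provide both >= 120 cores
## and >= 600GB per node, so Slurm will place the job there.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --cpus-per-task=1
#SBATCH --mem=600G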

Specification to run at higher priority for a shorter duration

In the QoS table above, one runs a “normal” job on the normal_q partition by specifying the QoS owl_normal_base (or by not specifying a QoS at all, since this is the default). Such a job can run for a maximum of seven days, per the table above. However, one can run at higher priority, for a shorter duration of up to one day, by using the QoS owl_normal_short. Note that the number of cores (CPUs) and the amount of memory allowed increase over the base case (owl_normal_base), but the billing rate is twice that of a normal job.

This is our example: we want to run for 23 hours, so we can use the owl_normal_short QoS. Note that we will be billed at 2x for this usage.

The key lines below are:

  1. #SBATCH --constraint=avx512: because we want to use the AMD EPYC 9454 (Genoa) compute nodes.

  2. #SBATCH --qos=owl_normal_short: because we want higher priority, and our wall clock time must be at most one day.

  3. #SBATCH --time=23:00:00: the 23 hours of wall clock time.

  4. #SBATCH --nodes=27: this and the next two values, multiplied together, must be <= 2612 (the owl_normal_short per-user CPU limit).

  5. #SBATCH --ntasks-per-node=1: because the TOTAL number of cores in the job must be <= 2612.

  6. #SBATCH --cpus-per-task=96: 27 nodes x 1 task/node x 96 cpus/task = 2592 <= 2612.

  7. #SBATCH --mem=740G: the per-node request must fit within the 747 GB available to Slurm on the Genoa nodes, and the total memory over all nodes, 27 x 740G = 19,980G, must be <= 24,457G.

Note that there are per-user limits and per-account limits, so many people sharing one account cannot collectively dominate the cluster.

Also note that the memory and the number of cores required each, by themselves, necessitate the use of owl_normal_short, because both of the values above (2,592 cores and 19,980G of memory) exceed the per-user limits of the default owl_normal_base.

Thus, #SBATCH --qos=owl_normal_short must be used.

While not strictly required for the job to run, including #SBATCH --constraint=avx512 makes it explicit that the job targets the Genoa node type.

#!/bin/bash

## For Owl cluster.

## Job name.
#SBATCH --job-name=i_hope_this_runs

## You will need your own account.
#SBATCH --account=arcadm

# Partition name. Options change based on cluster.
#SBATCH --partition=normal_q

## Specifying that you must have this job run on
## the AMD EPYC 9454 Genoa compute nodes.
#SBATCH --constraint=avx512

## This next line is NOT optional.
#SBATCH --qos=owl_normal_short

## Always put "%j" into the output and error file names
## in order for the names to contain the SLURM_JOB_ID.
#SBATCH --output=my_hopeful_job.%j.out
#SBATCH --error=my_hopeful_job.%j.err

## Maximum wall clock time.
#SBATCH --time=23:00:00

## Compute resources.
### Number of compute nodes.
#SBATCH --nodes=27
### Number of tasks can be thought of as number of processes
### as in the case of MPI.
#SBATCH --ntasks-per-node=1
### Number of cpus/cores is the number of cores needed
### to run each task (e.g., for parallelism).
#SBATCH --cpus-per-task=96

## Memory requirement.
## This is a per-compute-node memory specification. It must fit
## within the 747 GB available to Slurm on the Genoa nodes.
#SBATCH --mem=740G


# Reset modules.
module reset

# Load particular modules.
module load foss/2023b

# Source any virtual environments.



## -----------------------
## EXPORTS
## Exports and variable assignments.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MV2_ENABLE_AFFINITY=0

echo "   SLURM_CPUS_PER_TASK: " $SLURM_CPUS_PER_TASK
echo "   OMP_NUM_THREADS: " $OMP_NUM_THREADS
echo "   MV2_ENABLE_AFFINITY: "  $MV2_ENABLE_AFFINITY
echo "   SLURM_NTASKS: "  $SLURM_NTASKS


## -----------------------
## MPI JOB.
THE_INPUT="./nim-11-1-1.inp"
THE_EXEC="../bin-owl-openmpi/is-hybrid-owl-openmpi"
srun --mpi=pmix  $THE_EXEC  $THE_INPUT

Preemptable partitions

Note from the QoS table above that one can use #SBATCH --qos=owl_preemptable_base to run jobs for up to 30 days of wall clock time on the preemptable_q partition. However, as the name implies, your job can be preempted, i.e., killed, if a non-preemptable job is waiting in the Slurm job queue.
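
The relevant lines for a preemptable job are sketched below. Adding --requeue is our own suggestion rather than an ARC requirement; it asks Slurm to requeue the job after preemption instead of cancelling it, and is only useful if your application can restart from its own checkpoints:

## Key lines for a preemptable job (sketch).
#SBATCH --partition=preemptable_q
#SBATCH --qos=owl_preemptable_base
#SBATCH --time=30-00:00:00
## Optional: requeue after preemption rather than cancel.
#SBATCH --requeue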

Optimization

The performance of jobs can be greatly enhanced by applying appropriate optimizations. Not only does this reduce the execution time of jobs, it also makes more efficient use of the resources for the benefit of all users.

See the tuning guides available at https://developer.amd.com and https://www.intel.com/content/www/us/en/developer/

General principles of optimization:

  • Cache locality really matters: process pinning can make a big difference in performance.

  • Hybrid programming often pays off: one MPI process per L3 cache with 4 threads is often optimal.

  • Use the appropriate -march flag to optimize the compiled code, and the -gencode flag when using the NVCC compiler.

Suggested optimization parameters:

| Node Type | Base Compute Nodes | Milan | Large Memory | Huge Memory |
| --- | --- | --- | --- | --- |
| CPU arch | Zen 4 | Zen 3 | Zen 3 | Zen 3 |
| Compiler flags | -march=znver4 | -march=znver3 | -march=znver3 | -march=znver3 |
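
As a concrete illustration (a sketch assuming a GCC-based toolchain such as the foss/2023b module used in the examples above, and a hypothetical source file my_app.c), the snippet below compiles for Zen 4 and launches a hybrid MPI+OpenMP run with ranks pinned to cores:

module reset
module load foss/2023b

# Build for the Zen 4 (Genoa) base compute nodes.
mpicc -O3 -march=znver4 -fopenmp -o my_app my_app.c

# Hybrid MPI+OpenMP launch on one 96-core Genoa node.
# The 24 ranks x 4 threads split is only an illustration; check the
# node's L3 topology (e.g., with lscpu) and aim for one rank per L3 cache.
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=close
export OMP_PLACES=cores
srun --nodes=1 --ntasks-per-node=24 --cpus-per-task=4 --cpu-bind=cores ./my_app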