OWL - Water-cooled AMD CPU

OWL has 91 nodes, 8,704 CPU cores, and 80 TB RAM.

  • The compute nodes on OWL are exclusively CPU-based

  • Direct water-cooling of the base compute nodes allows them to run at their boost speed (3.8 GHz) indefinitely, roughly 40% higher than the base clock rate. For comparison, Tinkercliffs AMD base compute nodes run at 2.0 GHz.

  • Genoa is AMD’s first architecture to feature AVX-512 instructions, which provide 512-bit-wide vectorization (i.e., eight-way FP64 SIMD per clock cycle). Tinkercliffs AMD base compute nodes support the previous-generation AVX2 instructions, which are 256 bits wide (a quick check is shown after this list).

  • 12 memory channels per socket (24 per node) provide much higher aggregate memory bandwidth and finer granularity, which should yield a substantial speedup for memory-bandwidth-constrained workloads such as finite-element analysis.

  • DDR5-4800 memory provides a nominal 50% speed increase over DDR4-3200 on Tinkercliffs.

  • 768GB of memory per node provides ~8GB per core, compared to ~2GB/core on Tinkercliffs.

  • Three nodes are equipped with very large memory (4TB or 8TB), enabling computational workloads for which ARC has never before had sufficient memory resources.
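
As referenced in the AVX-512 bullet above, a quick and generic way to verify vector-instruction support is to inspect the CPU flags on a node. This uses standard Linux tools, nothing ARC-specific:

```bash
# On an Owl base compute (Genoa) node, this lists avx512f, avx512dq, etc.
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u

# On an AVX2-only node the command above prints nothing; this one still shows avx2
lscpu | grep -o 'avx2' | sort -u
```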

Overview

| Node Type | Base Compute Nodes | Milan | Large Memory | Huge Memory | Total |
|---|---|---|---|---|---|
| Chip | AMD EPYC 9454 (Genoa) | AMD EPYC 7543 (Milan) | AMD EPYC 7763 (Milan) | AMD EPYC 7763 (Milan) | - |
| Architecture | Zen 4 | Zen 3 | Zen 3 | Zen 3 | - |
| Slurm features | amd, avx512 | amd | amd | amd | - |
| Nodes | 84 | 4 | 2 | 1 | 91 |
| Cores/Node | 96 | 64 | 128 | 128 | - |
| Memory (GB)/Node | 768 | 512 | 4,019 | 8,038 | - |
| Total Cores | 8,064 | 256 | 256 | 128 | 8,704 |
| Total Memory (GB) | 64,512 | 2,048 | 8,038 | 8,038 | 82,636 |
| Local Disk | 2.9TB NVMe | 818GB SSD | 2.9TB NVMe | 2.9TB NVMe | - |
| Interconnect | HDR-100 IB | HDR-100 IB | HDR-100 IB | HDR-100 IB | - |
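
The node inventory above can also be queried directly from a login node with standard Slurm commands; the format string below is just one reasonable choice:

```bash
# One line per node: name, CPUs, memory (MB), feature tags (e.g. amd,avx512), partition
sinfo -N -o "%N %c %m %f %P"
```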

Owl is hosted in the Steger Hall HPC datacenter on the Virginia Tech campus, so it is physically separated from the other ARC HPC systems, which are hosted in the AISB Datacenter at the Corporate Research Center (CRC) in Blacksburg.

An IBM ESS GPFS file system supports /projects for group collaboration and a VAST /scratch serves high-performance input/output (I/O).

Get Started

Owl can be accessed via one of the three login nodes using your VT credentials:

  • owl1.arc.vt.edu

  • owl2.arc.vt.edu

  • owl3.arc.vt.edu
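
For example, from a terminal (replace the placeholder username with your own VT username):

```bash
ssh yourpid@owl1.arc.vt.edu
```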

For testing purposes, all users are allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.

To create an allocation, log in to the ARC allocation portal at https://coldfront.arc.vt.edu, then:

  • Select or create a project

  • Click the “+ Request Resource Allocation” button

  • Choose the “Compute (Free) (Cluster)” allocation type

Usage needs in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.

Partitions

Users submit jobs to partitions of the cluster depending on the type of resources needed. Features are optional restrictions users can add to their job submission to restrict execution to nodes meeting specific requirements. If a job does not specify how much memory it needs, the DefMemPerCPU parameter automatically sets the job’s memory based on the number of CPU cores requested. Jobs are billed against the user’s allocation based on the number of CPU cores, the amount of memory, and the time used. Consult the Slurm configuration to understand how to specify these parameters for your job.
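
The partition settings summarized in the table below can also be inspected directly from a login node; for example:

```bash
# Shows DefMemPerCPU, TRESBillingWeights, time limits, etc. for the partition
scontrol show partition normal_q
```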

| Partition | normal_q | preemptable_q |
|---|---|---|
| Node Type | All | All |
| Number of Nodes | 91 | 91 |
| DefMemPerCPU (MB) | 7920 | 7920 |
| TRESBillingWeights | CPU=1.5,Mem=0.0625G | - |
| PreemptMode | OFF | ON |
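
As an illustrative sketch (the allocation account name and program are placeholders), a job script for normal_q might look like the following. Assuming Slurm’s default behavior of summing the weighted TRES, a 4-core, 32 GB request bills at roughly 4 x 1.5 + 32 x 0.0625 = 8 units per hour:

```bash
#!/bin/bash
#SBATCH --account=myproject     # placeholder: use one of your own allocations
#SBATCH --partition=normal_q
#SBATCH --nodes=1
#SBATCH --ntasks=4              # 4 CPU cores
#SBATCH --mem=32G               # omit to fall back to DefMemPerCPU (7920 MB per core)
#SBATCH --time=04:00:00
#SBATCH --constraint=avx512     # optional feature: restrict to the Genoa base nodes

srun ./my_program               # placeholder executable
```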

Quality of Service (QoS)

The QoS associated with a job affects it in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits; the goal is to offer flexible options that adjust to each job’s needs. The long QoS allows a job to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated to it. The short QoS allows a job to use more resources but reduces the maximum time to 1 day. ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users.

| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount |
|---|---|---|---|---|---|
| normal_q | owl_normal_base | 1000 | 7-00:00:00 | cpu=1741,mem=16305G | cpu=3482,mem=32609G |
| normal_q | owl_normal_long | 500 | 14-00:00:00 | cpu=436,mem=4077G | cpu=871,mem=8153G |
| normal_q | owl_normal_short | 2000 | 1-00:00:00 | cpu=2612,mem=24457G | cpu=5223,mem=48913G |
| preemptable_q | owl_preemptable_base | 0 | 30-00:00:00 | cpu=218,mem=2039G | cpu=436,mem=4077G |
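
To run under one of the non-default QoS levels, add a --qos directive to the job script. For example, a long but narrow job that stays within the owl_normal_long limits above might include:

```bash
#SBATCH --partition=normal_q
#SBATCH --qos=owl_normal_long   # up to 14 days, but tighter resource caps
#SBATCH --time=10-00:00:00
#SBATCH --ntasks=96             # stays well under the cpu=436 per-user limit
```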

Optimization

| Node Type | Base Compute Nodes | Milan | Large Memory | Huge Memory |
|---|---|---|---|---|
| CPU arch | Zen 4 | Zen 3 | Zen 3 | Zen 3 |
| Compiler flags | -march=znver4 | -march=znver3 | -march=znver3 | -march=znver3 |
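
For example, with GCC the flags above translate into invocations like the following (the source file name is a placeholder; note that -march=znver4 requires a fairly recent compiler release, and AOCC or other compilers accept analogous architecture flags):

```bash
# Base compute nodes (Genoa / Zen 4)
gcc -O3 -march=znver4 -o app app.c

# Milan, Large Memory, and Huge Memory nodes (Zen 3)
gcc -O3 -march=znver3 -o app app.c
```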

See the tuning guides available at https://developer.amd.com

  • Cache locality really matters: process pinning can make a big difference in performance.

  • Hybrid programming often pays off: one MPI process per L3 cache with 4 threads is often optimal (see the sketch after this list).

  • Use the appropriate -march flag to optimize the compiled code.
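
A minimal sketch of the hybrid, pinned layout suggested above, assuming an MPI+OpenMP application; the executable name is a placeholder, and the rank/thread split should be tuned per application and node type:

```bash
#!/bin/bash
#SBATCH --partition=normal_q
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24    # 24 MPI ranks x 4 threads = 96 cores on a base node
#SBATCH --cpus-per-task=4       # 4 OpenMP threads per rank
#SBATCH --time=02:00:00

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores         # pin each OpenMP thread to a physical core
export OMP_PROC_BIND=close      # keep a rank's threads on adjacent cores

# --cpu-bind=cores keeps each rank's cores together so its threads share caches where possible
srun --cpu-bind=cores --cpus-per-task=${SLURM_CPUS_PER_TASK} ./hybrid_app
```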