OWL - Water-cooled AMD CPU
The OWL cluster equipment was acquired in FY23, but full commissioning was delayed by the prerequisite datacenter renovation needed to integrate the direct water-cooling system with the building and datacenter where it is housed. As of February 2024, it was in the late stages of deployment and testing. It was released for general use in August 2024.
The compute nodes on OWL are exclusively CPU-based; there are no GPUs on OWL.
Direct water-cooling of the base compute nodes allows them to run at boost speed (3.8GHz) indefinitely, which is 40% higher than the base clock rate. For comparison, Tinkercliffs base compute nodes run at 2.0GHz.
AMD’s “Genoa” architecture is the first from AMD to feature AVX-512 instructions, which provide 512-bit-wide vectorization (i.e., eight-way FP64 SIMD in each clock cycle). Tinkercliffs base compute nodes support the previous-generation AVX2 instructions, which are 256 bits wide.
12 memory channels per socket (24 per node) provide much higher aggregate memory bandwidth and increased granularity, which should provide substantial speedup for memory-bandwidth-constrained workloads such as finite-element analysis.
DDR5-4800 memory provides a nominal 50% speed increase over the DDR4-3200 memory on Tinkercliffs.
768GB of memory per node provides 8GB of memory per core, compared to 2GB/core on Tinkercliffs.
Three nodes are equipped with very large memory (4TB or 8TB), enabling computational workloads for which we have never before had sufficient memory resources.
The large-memory nodes were not available in the AMD "Genoa" package at the time of acquisition; they are equipped with different processors (details below) and are not water-cooled.
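The ratios quoted in the list above can be sanity-checked with a few lines of shell arithmetic. This is only an illustrative sketch: the variable names are made up for the example, and the 96 cores per node figure is taken from the Overview table rather than from the list itself.

```shell
# Figures quoted on this page (variable names are illustrative).
simd_bits=512          # AVX-512 register width
fp64_bits=64           # width of one double-precision value
ddr5_rate=4800         # DDR5-4800 on OWL
ddr4_rate=3200         # DDR4-3200 on Tinkercliffs
mem_per_node_gb=768    # base compute node memory
cores_per_node=96      # base compute node cores (from the Overview table)

fp64_lanes=$(( simd_bits / fp64_bits ))
mem_speedup_pct=$(( (ddr5_rate - ddr4_rate) * 100 / ddr4_rate ))
mem_per_core_gb=$(( mem_per_node_gb / cores_per_node ))

echo "FP64 lanes per AVX-512 register: ${fp64_lanes}"
echo "DDR5-4800 over DDR4-3200: +${mem_speedup_pct}%"
echo "Memory per core: ${mem_per_core_gb} GB"
```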
Overview
| | Base Compute Nodes | Large Memory | Huge Memory | Totals |
|---|---|---|---|---|
| Vendor | Lenovo | Lenovo | Lenovo | |
| Chip | | | | |
| Nodes | 84 | 2 | 1 | 87 |
| Cores/Node | 96 | 128 | 128 | |
| Memory (GiB)/Node | 768 DDR5-4800 | 4019 DDR4-3200 | 8038 DDR4-3200 | |
| Local Disk | 2.9TB NVMe | 2.9TB NVMe | 2.9TB NVMe | |
| Interconnect | shared 200Gbps HDR InfiniBand | shared 200Gbps HDR InfiniBand | shared 200Gbps HDR InfiniBand | |
| Total Memory (GiB) | 64512 | 8038 | 8038 | |
| Total Cores | 8064 | 256 | 128 | 8448 |
| Theoretical Peak | 245.1456 TFLOPS | | | |
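The theoretical peak figure is consistent with multiplying the 8064 base-node cores by the 3.8GHz boost clock and by 8 FP64 operations per cycle. The factor of 8 is an inference from the eight-way FP64 SIMD width mentioned earlier, not a number stated on this page, so treat the sketch below as a plausibility check rather than the official derivation. The function name `peak_tflops` is hypothetical.

```shell
# peak_tflops CORES GHZ FLOPS_PER_CYCLE
# Prints cores * clock (GHz) * FLOPs per cycle, converted from GFLOPS to TFLOPS.
peak_tflops() {
    awk -v cores="$1" -v ghz="$2" -v fpc="$3" \
        'BEGIN { printf "%.4f\n", cores * ghz * fpc / 1000 }'
}

# Base compute nodes: 8064 cores at 3.8 GHz, assuming 8 FP64 FLOPs per cycle.
peak_tflops 8064 3.8 8
```

Under that assumption the result matches the table entry of 245.1456 TFLOPS.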
Policies
Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications. These are the limits applied to free tier usage (note that the terms “cpu” and “core” are used interchangeably here, following Slurm terminology):
Policies for Main Usage Queues/Partitions
The `normal_q`, `largemem_q`, and `hugemem_q` partitions (queues) handle the bulk of utilization on the OWL cluster.
| | normal_q | largemem_q | hugemem_q |
|---|---|---|---|
| Node Type | Base Compute | Large Memory | Huge Memory |
| Number of Nodes | 84 | 2 | 1 |
| MaxRunningJobs (User) | 32 | 2 | 2 |
| MaxSubmitJobs (User) | 32 | 8 | 4 |
| MaxRunningJobs (Allocation) | 64 | 8 | 4 |
| MaxSubmitJobs (Allocation) | 200 | 16 | 8 |
| MaxNodes (User) | 32 | 1 | 1 |
| MaxNodes (Allocation) | 48 | 2 | 1 |
| MaxCPUs (User) | 3072 | 128 | 512 |
| MaxCPUs (Allocation) | 4608 | 256 | 768 |
| MaxWallTime | 6 days | 3 days | 6 days |
| Priority (QoS) | 1000 | 1000 | 1000 |
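For `normal_q`, the CPU limits line up with the node limits: 32 nodes × 96 cores = 3072 CPUs per user, and 48 × 96 = 4608 per allocation. As a rough pre-submission check, the sketch below compares a single job request against the per-user `normal_q` limits. The function name `fits_normal_q` is hypothetical, and the check ignores any other jobs the user already has running, which presumably also count toward the limits.

```shell
# Per-user limits for normal_q, copied from the table above.
max_nodes_user=32
max_cpus_user=3072

# fits_normal_q NODES NTASKS CPUS_PER_TASK
# Prints "ok" if the request is within the per-user limits, otherwise the reason.
fits_normal_q() {
    nodes=$1
    cpus=$(( $2 * $3 ))
    if [ "$nodes" -gt "$max_nodes_user" ]; then
        echo "too many nodes: $nodes > $max_nodes_user"
    elif [ "$cpus" -gt "$max_cpus_user" ]; then
        echo "too many CPUs: $cpus > $max_cpus_user"
    else
        echo "ok"
    fi
}

fits_normal_q 1 2 6        # 1 node, 2 MPI ranks, 6 cores per rank
fits_normal_q 40 3840 1    # 40 full nodes of 96 cores each
```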
Policies for Development and Alternative Usage Queues/Partitions
The “dev” partitions (queues) overlap the main usage queues above, but jobs in these queues get higher priority to allow more rapid access to resources for testing and development workloads. The tradeoff is that individuals may only run a small number of short jobs in these partitions.
| | dev_q | preemptable_q | interactive_q |
|---|---|---|---|
| Node Type | Base Compute | Base Compute | Base Compute |
| Number of Nodes | 84 | 84 | 4 |
| MaxRunningJobs (User) | 2 | 32 | 2 |
| MaxSubmitJobs (User) | 4 | 100 | 4 |
| MaxRunningJobs (Allocation) | 8 | 64 | 3 |
| MaxSubmitJobs (Allocation) | 16 | 200 | 6 |
| MaxNodes (User) | 32 | 32 | 1 |
| MaxNodes (Allocation) | 48 | 48 | 1 |
| MaxCPUs (User) | 3072 | 128 | 512 |
| MaxCPUs (Allocation) | 4608 | 256 | 768 |
| MaxWallTime | 4 hours | 6 days | |
| Priority (QoS) | 2000 | 0 | 1000 |
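Because `dev_q` jobs are capped at 4 hours, it can help to check a `--time` value before submitting. The following is a minimal sketch with hypothetical helper names; it handles only the `D-HH:MM:SS` and `HH:MM:SS` forms, although Slurm accepts other time formats as well. Any other form is reported as exceeding the limit.

```shell
dev_q_max_seconds=$(( 4 * 3600 ))   # MaxWallTime for dev_q from the table above

# slurm_time_to_seconds TIME
# Converts D-HH:MM:SS or HH:MM:SS to seconds; prints nothing for other forms.
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

# fits_dev_q TIME
# Succeeds only when TIME parses and is no longer than the dev_q limit.
fits_dev_q() {
    [ "$(slurm_time_to_seconds "$1")" -le "$dev_q_max_seconds" ] 2>/dev/null
}

for t in 0-02:00:00 2-04:00:00; do
    if fits_dev_q "$t"; then
        echo "$t fits within the dev_q limit"
    else
        echo "$t exceeds the dev_q limit"
    fi
done
```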
AMD Resources
Compiler Options Quick Reference Guide
If you’re using EasyBuild to install software, loading the EasyBuild module we provide sets environment variables that EasyBuild uses to configure the architecture optimization flags for the Intel and GCC compilers.
| | Genoa (base nodes) | Milan (largemem nodes) |
|---|---|---|
| Intel | | |
| GCC | | |
Known Issues
Apptainer may experience issues on login nodes; use compute nodes instead.
`user.max_user_namespaces=0` is set on login nodes as mitigation for a CVE. Compute nodes are not affected and do not have this constraint.
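To see which kind of node you are on, you can read the kernel setting directly. The sketch below assumes a Linux system that exposes the value at `/proc/sys/user/max_user_namespaces`; the function name `userns_status` is hypothetical. The link to Apptainer is presumably that its unprivileged mode relies on user namespaces, so a value of 0 would explain the failures on login nodes.

```shell
# userns_status [VALUE]
# Classifies a max_user_namespaces value. With no argument, the value is read
# from /proc (Linux only); an unreadable value is reported as "unknown".
userns_status() {
    value=${1:-$(cat /proc/sys/user/max_user_namespaces 2>/dev/null)}
    case $value in
        '') echo "unknown" ;;
        0)  echo "disabled" ;;
        *)  echo "enabled" ;;
    esac
}

echo "user namespaces on $(uname -n): $(userns_status)"
```

If this reports `disabled`, run the container workload in a job on a compute node instead.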
Benchmarks
STREAM
HPL
HPCG
High performance conjugate gradient (HPCG) test results.
On Owl using gcc version 13.2.0 and OpenMPI version 4.1.6.
(This is the foss toolchain 2023b, i.e., `module load foss/2023b`.)
Inputs: xdim=208, ydim=208, zdim=312, time=1800.
| MPI processes | Total memory used (GB) | Execution time (s) | Execution rate (GFLOP/s) |
|---|---|---|---|
| 2 | 19.30 | 1832.25 | 5.93 |
| 4 | 38.60 | 1840.99 | 6.50 |
| 8 | 77.20 | 1835.21 | 8.73 |
| 16 | 154.41 | 1974.77 | 16.23 |
| 32 | 308.83 | 1956.86 | 32.75 |
| 64 | 617.65 | 2001.58 | 64.165 |
On Owl using gcc version 11.3.1 and MVAPICH2 MPI version 2.3.7.
(Using module mvapich2/gcc/64/2.3.7, i.e., `module load mvapich2/gcc/64/2.3.7`.)
Inputs: xdim=208, ydim=208, zdim=312, time=1800.
These data are under revision.
| MPI processes | Total memory used (GB) | Execution time (s) | Execution rate (GFLOP/s) |
|---|---|---|---|
| 2 | 9.65 | 1874.33 | 2.51 |
| 4 | 9.65 | 1935.02 | 1.54 |
| 8 | 9.65 | 1929.36 | 0.77 |
| 16 | 9.65 | 1907.02 | 0.39 |
| 32 | 9.65 | 1891.03 | 0.39 |
| 64 | 9.65 | 1909.17 | 0.39 |
MPI
OpenMPI
A Slurm batch script for running an MPI job using OpenMPI.
#!/bin/bash
#SBATCH -J hpcg
## Wall time.
#SBATCH --time=2-04:00:00 # 2 days and 4 hours.
### Account. Your account number
#SBATCH --account=your_account_number
### Queue/partition.
#SBATCH --partition=normal_q
### This requests 1 node.
#SBATCH --nodes=1
### Number of MPI ranks; total over all nodes.
#SBATCH --ntasks=2
### This is the number of MPI processes per node, for MPI jobs.
#SBATCH --ntasks-per-node=2
### Number of cores per task. Includes OpenMP,
### i.e., number of OpenMP threads per MPI process.
#SBATCH --cpus-per-task=6
## Might want to run exclusive for timing studies.
## Unless you have a good reason, comment this out;
## can waste resources.
#SBATCH --exclusive
## Slurm output and error files.
#SBATCH -o slurm.openmpi.hpcg.%j.out
#SBATCH -e slurm.openmpi.hpcg.%j.err
## Notify me when done.
#SBATCH --mail-type=ALL # Send email notification on all job events (begin, end, fail, ...)
#SBATCH --mail-user=your_vt_email # Send email notification to this address
# Load modules.
module load foss/2023b
## Exports.
## Match the OpenMP thread count to --cpus-per-task above.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
## Time the job with time.
## For OpenMPI, which we are using here:
## The following are variables, for user to specify: mycode, xdim, ydim, zdim, timedim.
time mpirun ${mycode} ${xdim} ${ydim} ${zdim} ${timedim}
MVAPICH2
A Slurm batch script for running an MPI job using MVAPICH2.
#!/bin/bash
#SBATCH -J hpcg_mvap2
## Wall time.
#SBATCH --time=0-02:00:00 # 2 hours
### Account. Your account number
#SBATCH --account=your_account_number
### Queue.
#SBATCH --partition=normal_q
### This requests 1 node.
#SBATCH --nodes=1
### Number of MPI ranks (i.e., processes); total over all nodes.
#SBATCH --ntasks=2
### This is the number of MPI processes per node, for MPI jobs.
#SBATCH --ntasks-per-node=2
### Number of cores per task. Includes OpenMP,
### i.e., number of OpenMP threads per MPI process.
#SBATCH --cpus-per-task=6
## Might want to run exclusive for timing studies.
## Unless you have a good reason, comment this out;
## can waste resources.
#SBATCH --exclusive
## Slurm output and error files.
#SBATCH -o slurm.hpcg.mvapich2.%j.out
#SBATCH -e slurm.hpcg.mvapich2.%j.err
# Load modules.
module load mvapich2/gcc/64/2.3.7
## Exports.
## Match the OpenMP thread count to --cpus-per-task above.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
## Time the job with time.
## For MVAPICH2, which we are using here:
## The following are variables, for user to specify: mycode, xdim, ydim, zdim, timedim.
time srun ${mycode} ${xdim} ${ydim} ${zdim} ${timedim}
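Both scripts leave `mycode`, `xdim`, `ydim`, `zdim`, and `timedim` for the user to define before the launch line. A minimal sketch of how they might be set is shown below, using the input sizes quoted in the HPCG results above. The path `./xhpcg` is a hypothetical location for the benchmark binary, so substitute your own. `SLURM_NTASKS` and `SLURM_CPUS_PER_TASK` are set by Slurm inside a job when the matching options are given; the defaults of 1 only let the snippet run outside a job.

```shell
# Illustrative values; ./xhpcg is a hypothetical path to the benchmark binary.
mycode=${mycode:-./xhpcg}
xdim=${xdim:-208}
ydim=${ydim:-208}
zdim=${zdim:-312}
timedim=${timedim:-1800}

# Slurm sets these inside a job; default to 1 so the snippet also runs elsewhere.
ntasks=${SLURM_NTASKS:-1}
cpus_per_task=${SLURM_CPUS_PER_TASK:-1}

# Show what would be launched and how many cores the job occupies in total.
echo "command: ${mycode} ${xdim} ${ydim} ${zdim} ${timedim}"
echo "cores in use: $(( ntasks * cpus_per_task ))"
```

Placing these assignments before the `time mpirun` or `time srun` line lets the launch command expand to a complete invocation.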