Falcon - mid-range NVIDIA GPU cluster
The Falcon cluster was acquired in FY24 as a mid-tier GPU cluster to replace the Infer cluster and was released for general use in late 2024. All of its compute nodes are GPU-equipped.
Overview
| | L40S Nodes | A30 Nodes | Totals |
|---|---|---|---|
| Vendor | Dell PowerEdge R760XA | Dell PowerEdge R760XA | |
| GPU Device | 4x NVIDIA L40S | 4x NVIDIA A30 | 208 GPUs |
| GPU interconnect | not available | available | |
| CPU | Intel(R) “Sapphire Rapids” Xeon(R) Platinum 8462Y+ 2.80GHz | Intel(R) “Sapphire Rapids” Xeon(R) Platinum 8462Y+ 2.80GHz | |
| Nodes | 20 | 32 | 52 |
| Cores/Node | 64 | 64 | 3328 |
| Memory (GiB)/Node | 512 | 512 | 26.6 TB |
| GPU Memory/Node | 192 GB (48 GB/GPU) | 96 GB (24 GB/GPU) | 6912 GB |
| Local Disk | 1.7 TB NVMe drive for /localscratch | 1.7 TB NVMe drive for /localscratch | |
| Interconnect | 200Gbps NDR InfiniBand | 200Gbps NDR InfiniBand | |
| Total Memory | 10240 GB | 16384 GB | |
| Total Cores | 1280 | 2048 | |
| Theoretical Peak | 7328 TFLOPS FP32 (no FP64 support) | 665.6 TFLOPS FP64, 1331.2 TFLOPS FP32 | |
Policies
Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications. The limits applied to free-tier usage are listed in the tables below; they can also be queried directly on the cluster.
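For example, Slurm can report the configured partition and QoS settings; a quick sketch (partition name taken from the tables below):

```bash
# show the configuration of a main usage partition
scontrol show partition l40s_normal_q

# show QoS limits: priority, wall time, and per-user resource caps
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPU
```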
Policies for Main Usage Queues/Partitions
The `l40s_normal_q` and `a30_normal_q` partitions (queues) handle the bulk of utilization on the Falcon cluster.
| | l40s_normal_q | a30_normal_q |
|---|---|---|
| Node Type | L40S | A30 |
| Number of Nodes | 18 | 30 |
| MaxRunningJobs (User) | 12 | 12 |
| MaxSubmitJobs (User) | 24 | 24 |
| MaxRunningJobs (Allocation) | 24 | 24 |
| MaxSubmitJobs (Allocation) | 48 | 48 |
| MaxGPUs (User) | 40 | 40 |
| MaxGPUs (Allocation) | 40 | 40 |
| MaxWallTime | 6 days | 6 days |
| Priority (QoS) | 1000 | 1000 |
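As an illustration, a batch job targeting `l40s_normal_q` within these limits might look like the following sketch; the account name and application are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=l40s-job
#SBATCH --partition=l40s_normal_q
#SBATCH --account=myallocation        # placeholder: substitute your Slurm account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding  # keep cores aligned with the GPU (see binding notes below)
#SBATCH --time=2-00:00:00             # 2 days, well under the 6-day MaxWallTime

# confirm which GPU was assigned
nvidia-smi

# placeholder application
srun ./my_gpu_application
```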
Policies for Development and Alternative Usage Queues/Partitions
The “dev” partitions (queues) overlap the main usage queues above, but jobs in these queues get higher priority to allow more rapid access to resources for testing and development workloads. The tradeoff is that individuals may only run a small number of short jobs in these partitions.
| | l40s_dev_q | a30_dev_q |
|---|---|---|
| Node Type | L40S | A30 |
| Number of Nodes | 20 | 32 |
| MaxRunningJobs (User) | 2 | 2 |
| MaxSubmitJobs (User) | 4 | 4 |
| MaxRunningJobs (Allocation) | 4 | 4 |
| MaxSubmitJobs (Allocation) | 8 | 8 |
| MaxGPUs (User) | 40 | 40 |
| MaxGPUs (Allocation) | 40 | 40 |
| MaxWallTime | 2 hours | 2 hours |
| Priority (QoS) | 2000 | 2000 |
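For short test runs, an interactive session in a dev partition is often the most convenient approach; a sketch, with the account name as a placeholder:

```bash
# request one A30 GPU and 8 cores for an interactive shell,
# staying within the 2-hour MaxWallTime of the dev queues
srun --partition=a30_dev_q --account=myallocation \
     --gres=gpu:1 --cpus-per-task=8 --time=2:00:00 \
     --pty bash
```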
Changes compared to previous clusters
`/scratch` high-performance scratch storage
The scratch file system on Falcon is mounted at `/scratch`. Earlier clusters used `/globalscratch` to distinguish this storage target from the `/localscratch` devices on the individual compute nodes and to designate that this file system is available anywhere on the cluster. But `/globalscratch` is not “global” in the multi-cluster sense, so the prefix has been dropped: on Falcon, `/globalscratch` is simply `/scratch`.
Slurm GPU to CPU bindings
Optimal application performance for GPU-accelerated workloads requires that processes launched on the nodes run on the CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:
| GPU device bus ID | GPU device | NUMA node | CPU cores |
|---|---|---|---|
| 4a:00.0 | 0 - /dev/nvidia0 | 1 | 16-31 |
| 61:00.0 | 1 - /dev/nvidia1 | 0 | 0-15 |
| ca:00.0 | 2 - /dev/nvidia2 | 3 | 48-63 |
| e1:00.0 | 3 - /dev/nvidia3 | 2 | 32-47 |
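This mapping can be verified directly on a compute node: `nvidia-smi topo -m` prints the topology matrix, including the CPU core and NUMA affinity of each GPU:

```bash
# on a Falcon compute node: GPU topology with CPU core and NUMA affinity per GPU
nvidia-smi topo -m

# the same PCI devices as the kernel sees them (bus IDs match the table above)
lspci | grep -i nvidia
```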
If Slurm were not informed of this affinity, nearly all jobs would suffer reduced performance from misalignment of allocated cores and GPUs. By default, Slurm prefers to schedule each GPU device together with its affiliated cores, but other arrangements are possible.
Use the option `--gres-flags=enforce-binding` to require Slurm to allocate the affiliated CPU core(s) with the corresponding GPU device(s).
The option `--gres-flags=disable-binding` is required to allocate more CPU cores than are bound to a device, but this is discouraged because those cores will then be unavailable to their correctly affiliated GPU.
To summarize, these nodes and the Slurm scheduling algorithms will operate most efficiently when jobs consistently request between 1 and 16 cores per GPU device. For example:
Do this: `--gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding`
which allocates 1 GPU, the associated 16 CPU cores, and 128 GB of system memory.
Do not do this: `--gres=gpu:1 --exclusive`
which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable to your job and also unavailable to other jobs.
Do not do this: `--gres=gpu:1 --ntasks-per-node=32`
which allocates 256 GB of system memory, one GPU device plus its 16 affiliated CPU cores, AND 16 additional cores that have affinity to a different GPU. That other GPU is still available to other jobs, but can only run with diminished performance.
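To confirm from inside a running job that the binding took effect, you can print the CPU cores and GPU the job actually received; a sketch:

```bash
# inside a job submitted with --gres=gpu:1 --gres-flags=enforce-binding:

# CPU cores this job step is allowed to use (should match the GPU's affiliated set)
taskset -cp $$

# GPU device Slurm assigned to the job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi --query-gpu=index,pci.bus_id --format=csv
```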