Falcon - mid-range NVIDIA GPU cluster
The Falcon cluster was acquired in FY24 as a mid-tier GPU cluster to replace the Infer cluster and was released for general use in late 2024. All of its compute nodes are GPU-equipped.
Overview
| | L40S Nodes | A30 Nodes | Totals |
|---|---|---|---|
| Vendor | Dell PowerEdge R760XA | Dell PowerEdge R760XA | |
| GPU Device | 4x NVIDIA L40S | 4x NVIDIA A30 | 208 GPUs |
| GPU interconnect | not available | available | |
| CPU | Intel(R) “Sapphire Rapids” Xeon(R) Platinum 8462Y+ 2.80GHz | Intel(R) “Sapphire Rapids” Xeon(R) Platinum 8462Y+ 2.80GHz | |
| Nodes | 20 | 32 | 52 |
| Cores/Node | 64 | 64 | 3328 |
| Memory (GiB)/Node | 512 | 512 | 26.6 TB |
| GPU Memory/Node | 192 GB (48 GB/GPU) | 96 GB (24 GB/GPU) | 6912 GB |
| Local Disk | 1.7 TB NVMe drive for /localscratch | 1.7 TB NVMe drive for /localscratch | |
| Interconnect | 200Gbps NDR InfiniBand | 200Gbps NDR InfiniBand | |
| Total Memory | 10240 GB | 16384 GB | |
| Total Cores | 1280 | 2048 | |
| Theoretical Peak | 7328 TFLOPS FP32 (no FP64 support) | 665.6 TFLOPS FP64, 1331.2 TFLOPS FP32 | |
Policies
Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications. The limits applied to free-tier usage are listed in the tables below; they can also be queried directly on the cluster.
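For example, Slurm can report the configured partition and QoS settings; a quick sketch (partition name taken from the tables below):

```bash
# show the configuration of a main usage partition
scontrol show partition l40s_normal_q

# show QoS limits: priority, wall time, and per-user resource caps
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPU
```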
Policies for Main Usage Queues/Partitions
The `l40s_normal_q` and `a30_normal_q` partitions (queues) handle the bulk of utilization on the Falcon cluster.
| | l40s_normal_q | a30_normal_q |
|---|---|---|
| Node Type | L40S | A30 |
| Number of Nodes | 18 | 30 |
| MaxRunningJobs (User) | 12 | 12 |
| MaxSubmitJobs (User) | 24 | 24 |
| MaxRunningJobs (Allocation) | 24 | 24 |
| MaxSubmitJobs (Allocation) | 48 | 48 |
| MaxGPUs (User) | 40 | 40 |
| MaxGPUs (Allocation) | 40 | 40 |
| MaxWallTime | 6 days | 6 days |
| Priority (QoS) | 1000 | 1000 |
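As an illustration, a batch job targeting `l40s_normal_q` within these limits might look like the following sketch; the account name and application are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=l40s-job
#SBATCH --partition=l40s_normal_q
#SBATCH --account=myallocation        # placeholder: substitute your Slurm account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding  # keep cores aligned with the GPU (see binding notes below)
#SBATCH --time=2-00:00:00             # 2 days, well under the 6-day MaxWallTime

# confirm which GPU was assigned
nvidia-smi

# placeholder application
srun ./my_gpu_application
```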
Policies for Development and Alternative Usage Queues/Partitions
The “dev” partitions (queues) overlap the main usage queues above, but jobs in these queues get higher priority to allow more rapid access to resources for testing and development workloads. The tradeoff is that individuals may only run a small number of short jobs in these partitions.
| | l40s_dev_q | a30_dev_q |
|---|---|---|
| Node Type | L40S | A30 |
| Number of Nodes | 20 | 32 |
| MaxRunningJobs (User) | 2 | 2 |
| MaxSubmitJobs (User) | 4 | 4 |
| MaxRunningJobs (Allocation) | 4 | 4 |
| MaxSubmitJobs (Allocation) | 8 | 8 |
| MaxGPUs (User) | 40 | 40 |
| MaxGPUs (Allocation) | 40 | 40 |
| MaxWallTime | 2 hours | 2 hours |
| Priority (QoS) | 2000 | 2000 |
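For short test runs, an interactive session in a dev partition is often the most convenient approach; a sketch, with the account name as a placeholder:

```bash
# request one A30 GPU and 8 cores for an interactive shell,
# staying within the 2-hour MaxWallTime of the dev queues
srun --partition=a30_dev_q --account=myallocation \
     --gres=gpu:1 --cpus-per-task=8 --time=2:00:00 \
     --pty bash
```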
Changes compared to previous clusters
`/scratch` high-performance scratch storage
The scratch file system on Falcon is mounted at `/scratch`. Earlier clusters used `/globalscratch` to distinguish this storage target from the `/localscratch` devices on the individual compute nodes and to designate that this file system is available anywhere on the cluster. But `/globalscratch` is not “global” in the multi-cluster sense, so the prefix has been dropped: on Falcon, `/globalscratch` is simply `/scratch`.
Slurm GPU to CPU bindings
Optimal application performance for GPU-accelerated workloads requires that processes launched on the nodes run on the CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:
| GPU device bus ID | GPU device | NUMA node | CPU cores |
|---|---|---|---|
| 4a:00.0 | 0 - /dev/nvidia0 | 1 | 16-31 |
| 61:00.0 | 1 - /dev/nvidia1 | 0 | 0-15 |
| ca:00.0 | 2 - /dev/nvidia2 | 3 | 48-63 |
| e1:00.0 | 3 - /dev/nvidia3 | 2 | 32-47 |
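This mapping can be verified directly on a compute node: `nvidia-smi topo -m` prints the topology matrix, including the CPU core and NUMA affinity of each GPU:

```bash
# on a Falcon compute node: GPU topology with CPU core and NUMA affinity per GPU
nvidia-smi topo -m

# the same PCI devices as the kernel sees them (bus IDs match the table above)
lspci | grep -i nvidia
```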
If Slurm were not informed of this affinity, nearly all jobs would suffer reduced performance from misalignment of allocated cores and GPUs. By default, Slurm prefers to schedule each GPU device together with its affiliated cores, but other arrangements are possible.
Use the option `--gres-flags=enforce-binding` to require Slurm to allocate the affiliated CPU core(s) with the corresponding GPU device(s).
The option `--gres-flags=disable-binding` is required to allocate more CPU cores than are bound to a device, but this is discouraged because those cores will then be unavailable to their correctly affiliated GPU.
To summarize, these nodes and the Slurm scheduling algorithms will operate most efficiently when jobs consistently request between 1 and 16 cores per GPU device. For example:
Do this: `--gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding`
which allocates 1 GPU, the associated 16 CPU cores, and 128 GB of system memory.
Do not do this: `--gres=gpu:1 --exclusive`
which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable to your job and also unavailable to other jobs.
Do not do this: `--gres=gpu:1 --ntasks-per-node=32`
which allocates 256 GB of system memory, one GPU device plus its 16 affiliated CPU cores, AND 16 additional cores that have affinity to a different GPU. That other GPU is still available to other jobs, but can only run with diminished performance.
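To confirm from inside a running job that the binding took effect, you can print the CPU cores and GPU the job actually received; a sketch:

```bash
# inside a job submitted with --gres=gpu:1 --gres-flags=enforce-binding:

# CPU cores this job step is allowed to use (should match the GPU's affiliated set)
taskset -cp $$

# GPU device Slurm assigned to the job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi --query-gpu=index,pci.bus_id --format=csv
```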