Falcon - mid-range NVIDIA GPU cluster

The Falcon cluster equipment was acquired in FY24 as a mid-tier GPU cluster to replace the Infer cluster. It was released for general use in late 2024.

  • The compute nodes on Falcon are GPU-based.

Overview

|                       | L40S Nodes | A30 Nodes | Totals |
|-----------------------|------------|-----------|--------|
| Vendor                | Dell PowerEdge R760XA | Dell PowerEdge R760XA | |
| GPU Device            | 4x NVIDIA L40S | 4x NVIDIA A30 | 208 GPUs |
| GPU interconnect      | not available | available | |
| CPU                   | Intel(R) “Sapphire Rapids” Xeon(R) Platinum 8462Y+ 2.80GHz | Intel(R) “Sapphire Rapids” Xeon(R) Platinum 8462Y+ 2.80GHz | |
| Nodes                 | 20 | 32 | 52 |
| Cores/Node            | 64 | 64 | 3328 cores |
| Memory/Node           | 512 GiB | 512 GiB | 26.6 TB |
| GPU Memory/Node       | 192 GB (48 GB/GPU) | 96 GB (24 GB/GPU) | 6912 GB |
| Local Disk            | 1.7 TB NVMe drive for /localscratch | 1.7 TB NVMe drive for /localscratch | |
| Interconnect          | 200 Gbps NDR InfiniBand | 200 Gbps NDR InfiniBand | |
| Total Memory          | 10240 GB | 16384 GB | |
| Total Cores           | 1280 | 2048 | |
| Theoretical Peak      | 7328 TFLOPS FP32 (no FP64 support) | 665.6 TFLOPS FP64, 1331.2 TFLOPS FP32 | |

Policies

Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications. These are the limits applied to free tier usage:

Policies for Main Usage Queues/Partitions

The l40s_normal_q and a30_normal_q partitions (queues) handle the bulk of utilization on the Falcon cluster.

|                             | l40s_normal_q | a30_normal_q |
|-----------------------------|---------------|--------------|
| Node Type                   | L40S          | A30          |
| Number of Nodes             | 18            | 30           |
| MaxRunningJobs (User)       | 12            | 12           |
| MaxSubmitJobs (User)        | 24            | 24           |
| MaxRunningJobs (Allocation) | 24            | 24           |
| MaxSubmitJobs (Allocation)  | 48            | 48           |
| MaxGPUs (User)              | 40            | 40           |
| MaxGPUs (Allocation)        | 40            | 40           |
| MaxWallTime                 | 6 days        | 6 days       |
| Priority (QoS)              | 1000          | 1000         |

Policies for Development and Alternative Usage Queues/Partitions

The “dev” partitions (queues) overlap the main usage queues above, but jobs in these queues get higher priority to allow more rapid access to resources for testing and development workloads. The tradeoff is that individuals may only run a small number of short jobs in these partitions.

|                             | l40s_dev_q | a30_dev_q |
|-----------------------------|------------|-----------|
| Node Type                   | L40S       | A30       |
| Number of Nodes             | 20         | 32        |
| MaxRunningJobs (User)       | 2          | 2         |
| MaxSubmitJobs (User)        | 4          | 4         |
| MaxRunningJobs (Allocation) | 4          | 4         |
| MaxSubmitJobs (Allocation)  | 8          | 8         |
| MaxGPUs (User)              | 40         | 40        |
| MaxGPUs (Allocation)        | 40         | 40        |
| MaxWallTime                 | 2 hours    | 2 hours   |
| Priority (QoS)              | 2000       | 2000      |
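As a sketch, a short test job on the L40S dev partition could be requested with a batch script like the one below. The account name myaccount and the workload are placeholders; the time request must stay within the 2-hour dev-queue MaxWallTime.

```shell
#!/bin/bash
# Hypothetical dev-queue batch script; replace "myaccount" with your
# Slurm allocation (account) name.
#SBATCH --partition=l40s_dev_q
#SBATCH --account=myaccount
#SBATCH --gres=gpu:1
#SBATCH --time=2:00:00

nvidia-smi   # confirm the allocated GPU is visible to the job
```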

Changes compared to previous clusters

/scratch high-performance scratch storage

The scratch file system on Falcon is mounted at /scratch. Earlier clusters used /globalscratch to distinguish this storage target from the /localscratch devices on individual compute nodes and to indicate that the filesystem is available anywhere on the cluster. But /globalscratch is not “global” in the multi-cluster sense, so the prefix has been dropped.

"/globalscratch" is "/scratch"

Slurm GPU to CPU bindings

Optimal application performance for GPU-accelerated workloads requires that processes launched on the nodes run on the CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:

| GPU device bus ID | GPU device        | NUMA node | CPU cores |
|-------------------|-------------------|-----------|-----------|
| 4a:00.0           | 0 - /dev/nvidia0  | 1         | 16-31     |
| 61:00.0           | 1 - /dev/nvidia1  | 0         | 0-15      |
| ca:00.0           | 2 - /dev/nvidia2  | 3         | 48-63     |
| e1:00.0           | 3 - /dev/nvidia3  | 2         | 32-47     |

If Slurm were not informed of this affinity, nearly all jobs would suffer reduced performance due to misalignment between allocated cores and GPUs. By default, Slurm prefers these cores when scheduling alongside the affiliated GPU device, but other arrangements are possible.

  • Use the option --gres-flags=enforce-binding to require Slurm to allocate affiliated CPU core(s) with the corresponding GPU device(s)

  • The option --gres-flags=disable-binding is required to allocate more CPU cores than are bound to a device, but this is discouraged because those cores will then be unavailable to their correctly affiliated GPU.

To summarize, these nodes and the Slurm scheduling algorithms operate most efficiently when jobs consistently request between 1 and 16 cores per GPU device. For example:

Do this: --gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding, which allocates 1 GPU, its 16 associated CPU cores, and 128GB of system memory.

Do not do this: --gres=gpu:1 --exclusive, which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable both to your job and to other jobs.

Do not do this: --gres=gpu:1 --ntasks-per-node=32, which allocates 256GB of system memory, one GPU device plus its 16 affiliated CPU cores, AND 16 additional CPUs that have affinity to a different GPU. That other GPU is still available to other jobs, but can only run with diminished performance.
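Putting the recommended request into a complete batch script might look like the following sketch. The account name myaccount and the train.py workload are placeholders; the Slurm options are the ones recommended above.

```shell
#!/bin/bash
# Hypothetical single-GPU job following the "Do this" pattern:
# 1 GPU, its 16 affiliated cores, and binding enforced so Slurm keeps
# the allocated cores and the GPU on the same NUMA node.
#SBATCH --partition=l40s_normal_q
#SBATCH --account=myaccount
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres-flags=enforce-binding
#SBATCH --time=1-00:00:00

python train.py   # placeholder workload
```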