Falcon - Mid-Range NVIDIA GPU
Falcon has 111 nodes, 4,896 CPU cores, 44 TB RAM, and 307 GPUs (80 NVIDIA L40S GPUs, 128 NVIDIA A30 GPUs, 80 NVIDIA V100 GPUs, and 19 NVIDIA T4 GPUs).
Overview
| Node Type | L40s Nodes | A30 Nodes | V100 Nodes | T4 Nodes | Total |
|---|---|---|---|---|---|
| Chip | - | - | - | - | - |
| Architecture | Sapphire Rapids | Sapphire Rapids | Cascade Lake | Cascade Lake | - |
| Slurm features | - | - | - | - | - |
| Nodes | 20 | 32 | 40 | 19 | 111 |
| GPUs | 4x NVIDIA L40s-48G | 4x NVIDIA A30-24G | 2x NVIDIA V100-16G | 1x NVIDIA T4-16G | 307 |
| Cores/Node | 64 | 64 | 24 | 32 | - |
| Memory (GB)/Node | 512 | 512 | 384 | 196 | - |
| Maximum Memory for Slurm (GB)/Node | 495 | 495 | 368 | 179 | - |
| Total Cores | 1,280 | 2,048 | 960 | 608 | 4,896 |
| Total Memory (GB) | 10,240 | 16,384 | 15,360 | 3,724 | 45,708 |
| Local Disk | 1.7TB NVMe | 1.7TB NVMe | 669GB SSD | 371GB SSD | - |
| Interconnect | NDR-200 IB | NDR-200 IB | 10 Gbps Ethernet | 10 Gbps Ethernet | - |
Falcon is hosted in the AISB Datacenter at the Corporate Research Center (CRC) in Blacksburg.
An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).
Get Started
Falcon can be accessed via one of the two login nodes using your VT credentials:
- falcon1.arc.vt.edu
- falcon2.arc.vt.edu
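For example, you can connect from a terminal as shown below (a minimal sketch; replace yourpid with your VT username and authenticate with your VT credentials):

```bash
# Connect to one of the Falcon login nodes ("yourpid" is a placeholder for your VT username).
ssh yourpid@falcon1.arc.vt.edu
```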
For testing purposes, all users will be allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.
To create an allocation:
1. Log in to the ARC allocation portal at https://coldfront.arc.vt.edu
2. Select or create a project
3. Click the “+ Request Resource Allocation” button
4. Choose the “Compute (Free) (Cluster)” allocation type
Usage needs in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.
Partitions
Users submit jobs to partitions of the cluster depending on the type of GPU resources needed. If a job does not specify an amount of memory, the DefMemPerCPU parameter sets the job's memory based on the number of CPU cores requested. If a GPU job does not specify the number of CPU cores, the DefCpuPerGPU parameter sets the number of CPU cores based on the number of GPUs requested. Jobs are billed against the user's allocation based on the number of CPU cores, the amount of memory, and the GPU time they use. Consult the Slurm configuration below to understand how to specify these parameters for your job; a minimal request is sketched after the table.
| Partition | l40s_normal_q | l40s_preemptable_q | a30_normal_q | a30_preemptable_q | v100_normal_q | v100_preemptable_q | t4_normal_q | t4_preemptable_q |
|---|---|---|---|---|---|---|---|---|
| Node Type | L40s | L40s | A30 | A30 | V100 | V100 | T4 | T4 |
| Features | - | - | - | - | - | - | - | - |
| Number of Nodes | 20 | 20 | 32 | 32 | 40 | 40 | 19 | 19 |
| DefMemPerCPU (MB) | 7920 | 7920 | 7920 | 7920 | 15720 | 15720 | 5744 | 5744 |
| DefCpuPerGPU | 8 | 8 | 8 | 8 | 6 | 6 | 6 | 6 |
| TRESBillingWeights | CPU=1,Mem=0.0625G,GRES/gpu=75 | - | CPU=1,Mem=0.0625G,GRES/gpu=75 | - | CPU=1,Mem=0.0625G,GRES/gpu=50 | - | CPU=1,Mem=0.0625G,GRES/gpu=25 | - |
| PreemptMode | OFF | ON | OFF | ON | OFF | ON | OFF | ON |
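As an illustration of how these defaults interact with an explicit request, the directives below sketch a one-GPU job on the A30 partition. The account name and executable are placeholders; adjust them to your own allocation and workload.

```bash
#!/bin/bash
# Hypothetical one-GPU request on the a30_normal_q partition.
#SBATCH --partition=a30_normal_q
#SBATCH --account=<your_account>   # placeholder: your ARC allocation account
#SBATCH --time=0-04:00:00
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8          # if omitted, DefCpuPerGPU=8 supplies this value
#SBATCH --mem=63G                  # if omitted, memory defaults to DefMemPerCPU x allocated CPU cores

srun ./my_gpu_application          # placeholder executable
```

If you omit the CPU and memory lines, Slurm applies the partition defaults shown in the table above.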
Recommended Uses
The nodes selected for this cluster are intended to provide broad utility for a wide range of GPU-enabled applications.
The L40S GPUs deliver excellent AI/ML inference and training capabilities for models that fit within a single GPU’s 48 GB of device memory or can be sharded across multiple GPUs. However, they do not support double‑precision (FP64) arithmetic, making them unsuitable for most traditional HPC workloads that rely on high‑precision computations.
The A30 nodes do support FP64 and are ideal for GPU‑accelerated applications such as computational fluid dynamics, computational chemistry, and multiphysics simulations. They also handle AI/ML inference and training for smaller models. With 24 GB of device memory and NVIDIA’s Ampere‑generation architectural enhancements, they deliver performance for these tasks that is comparable to—or in some cases slightly better than—existing V100 installations.
The V100 GPUs, based on NVIDIA’s Volta architecture, feature 16 GB of HBM2 memory and full FP64 support at up to 7.8 TFLOPS peak per GPU. They strike a strong balance between HPC and deep‑learning workloads, delivering up to 125 TFLOPS of mixed‑precision (FP16) performance and robust double‑precision throughput. V100s remain a reliable workhorse for traditional simulation codes and large‑scale training jobs.
The T4 GPUs, built on the Turing architecture, offer 16 GB of GDDR6 memory and excel at efficient inference and low‑precision training. With specialized Tensor Cores for INT8 and INT4 operations, T4s provide up to 130 TOPS of INT8 throughput while consuming only 70 W of power.
Quality of Service (QoS)
ARC must balance the needs of individuals with the needs of all to ensure fairness. This is done by providing options which determine the Quality of Service (QoS).
The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their jobs' needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated to the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day. ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users. A minimal example of selecting a non-default QoS follows the table below.
| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount | UsageFactor |
|---|---|---|---|---|---|---|
| l40s_normal_q | fal_l40s_normal_base | 1000 | 7-00:00:00 | cpu=410,mem=3220G,gres/gpu=16 | cpu=820,mem=6439G,gres/gpu=32 | 1 |
| l40s_normal_q | fal_l40s_normal_long | 500 | 14-00:00:00 | cpu=103,mem=805G,gres/gpu=4 | cpu=205,mem=1610G,gres/gpu=8 | 1 |
| l40s_normal_q | fal_l40s_normal_short | 2000 | 1-00:00:00 | cpu=615,mem=4829G,gres/gpu=24 | cpu=1229,mem=9658G,gres/gpu=48 | 2 |
| l40s_preemptable_q | fal_l40s_preemptable_base | 0 | 30-00:00:00 | cpu=52,mem=403G,gres/gpu=2 | cpu=103,mem=805G,gres/gpu=4 | 0 |
| a30_normal_q | fal_a30_normal_base | 1000 | 7-00:00:00 | cpu=256,mem=2012G,gres/gpu=26 | cpu=512,mem=4024G,gres/gpu=52 | 1 |
| a30_normal_q | fal_a30_normal_long | 500 | 14-00:00:00 | cpu=64,mem=503G,gres/gpu=7 | cpu=128,mem=1006G,gres/gpu=13 | 1 |
| a30_normal_q | fal_a30_normal_short | 2000 | 1-00:00:00 | cpu=384,mem=3018G,gres/gpu=39 | cpu=768,mem=6036G,gres/gpu=77 | 2 |
| a30_preemptable_q | fal_a30_preemptable_base | 0 | 30-00:00:00 | cpu=32,mem=252G,gres/gpu=4 | cpu=64,mem=503G,gres/gpu=7 | 0 |
| v100_normal_q | fal_v100_normal_base | 1000 | 7-00:00:00 | cpu=192,mem=3008G,gres/gpu=16 | cpu=384,mem=6016G,gres/gpu=32 | 1 |
| v100_normal_q | fal_v100_normal_long | 500 | 14-00:00:00 | cpu=48,mem=752G,gres/gpu=4 | cpu=96,mem=1504G,gres/gpu=8 | 1 |
| v100_normal_q | fal_v100_normal_short | 2000 | 1-00:00:00 | cpu=288,mem=4512G,gres/gpu=24 | cpu=576,mem=9024G,gres/gpu=48 | 2 |
| v100_preemptable_q | fal_v100_preemptable_base | 0 | 30-00:00:00 | cpu=24,mem=376G,gres/gpu=2 | cpu=48,mem=752G,gres/gpu=4 | 0 |
| t4_normal_q | fal_t4_normal_base | 1000 | 7-00:00:00 | cpu=122,mem=711G,gres/gpu=4 | cpu=244,mem=1422G,gres/gpu=8 | 1 |
| t4_normal_q | fal_t4_normal_long | 500 | 14-00:00:00 | cpu=31,mem=178G,gres/gpu=1 | cpu=61,mem=356G,gres/gpu=2 | 1 |
| t4_normal_q | fal_t4_normal_short | 2000 | 1-00:00:00 | cpu=183,mem=1066G,gres/gpu=6 | cpu=365,mem=2132G,gres/gpu=12 | 2 |
| t4_preemptable_q | fal_t4_preemptable_base | 0 | 30-00:00:00 | cpu=16,mem=89G,gres/gpu=1 | cpu=31,mem=178G,gres/gpu=1 | 0 |
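For instance, the directives below sketch how to select the long-running QoS on the L40S partition; any QoS name from the table above can be substituted.

```bash
# Select the partition and its long QoS (up to 14 days, reduced resource limits).
#SBATCH --partition=l40s_normal_q
#SBATCH --qos=fal_l40s_normal_long
#SBATCH --time=14-00:00:00
```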
Optimization
Applying appropriate optimizations can greatly improve job performance. This not only reduces execution time but also makes more efficient use of the resources for the benefit of all users.
See the tuning guides available at https://developer.amd.com and https://www.intel.com/content/www/us/en/developer/
General principles of optimization:
- Cache locality really matters: process pinning can make a big difference in performance.
- Hybrid programming often pays off: one MPI process per L3 cache with 4 threads is often optimal.
- Use the appropriate `-march` flag to optimize compiled code, and the `-gencode` flag when compiling with NVCC (see the example after the table below).
Suggested optimization parameters:
| Node Type | L40s Nodes | A30 Nodes | V100 Nodes | T4 Nodes |
|---|---|---|---|---|
| CPU arch | Sapphire Rapids | Sapphire Rapids | Cascade Lake | Cascade Lake |
| Compiler flags | | | | |
| GPU arch | NVIDIA L40s | NVIDIA A30 | NVIDIA V100 | NVIDIA T4 |
| Compute Capability | 8.9 | 8.0 | 7.0 | 7.5 |
| NVCC flags | | | | |
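As a concrete starting point, the commands below sketch plausible flag choices for these architectures, assuming GCC for the host code and NVCC for the device code; verify the flags against the compiler versions you actually load.

```bash
# Host-side optimization for the Intel CPUs (GCC option naming; adjust for other compilers).
gcc -O3 -march=sapphirerapids my_code.c -o my_code   # L40s and A30 nodes
gcc -O3 -march=cascadelake    my_code.c -o my_code   # V100 and T4 nodes

# Device-side code generation matched to each GPU's compute capability.
nvcc -O3 -gencode arch=compute_89,code=sm_89 kernel.cu -o kernel   # L40s (CC 8.9)
nvcc -O3 -gencode arch=compute_80,code=sm_80 kernel.cu -o kernel   # A30  (CC 8.0)
nvcc -O3 -gencode arch=compute_70,code=sm_70 kernel.cu -o kernel   # V100 (CC 7.0)
nvcc -O3 -gencode arch=compute_75,code=sm_75 kernel.cu -o kernel   # T4   (CC 7.5)
```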
Slurm GPU to CPU bindings
Optimal application performance for GPU accelerated workloads requires that processes execute on CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:
| GPU device bus ID | GPU device | NUMA node | CPU cores |
|---|---|---|---|
| 4a:00.0 | 0 - /dev/nvidia0 | 1 | 16-31 |
| 61:00.0 | 1 - /dev/nvidia1 | 0 | 0-15 |
| ca:00.0 | 2 - /dev/nvidia2 | 3 | 48-63 |
| e1:00.0 | 3 - /dev/nvidia3 | 2 | 32-47 |
If we do not inform Slurm of this affinity, then nearly all jobs will have reduced performance due to misalignment of allocated cores and GPUs. By default, these cores will be preferred by Slurm for scheduling with the affiliated GPU device, but other arrangements are possible.
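You can inspect this topology yourself from a shell on a compute node using standard utilities:

```bash
# Show the GPU <-> CPU core / NUMA node affinity matrix reported by the driver.
nvidia-smi topo -m
# Show the NUMA node layout and the CPU cores belonging to each node.
numactl --hardware
```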
- Use the option `--gres-flags=enforce-binding` to require Slurm to allocate the affiliated CPU core(s) with the corresponding GPU device(s).
- The option `--gres-flags=disable-binding` is required to allocate more CPU cores than are bound to a device, but this is discouraged because those cores will then be unavailable to their correctly affiliated GPU.
To summarize, these nodes and the Slurm scheduling algorithms will operate most efficiently when jobs consistently request between 1-16 cores per GPU device. For example:
Do this: --gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding
which allocates 1 GPU, the associated 16 CPU cores, and 128GB of system memory.
Do not do this: --gres=gpu:1 --exclusive
which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable to your job and also unavailable to other jobs, and the job will still be charged Service Units for them even though it did not use them.
Do not do this: --gres=gpu:1 --ntasks-per-node=32
which allocates 256GB of system memory, one GPU device plus its 16 affiliated CPU cores, AND 16 additional CPU cores that have affinity to a different GPU. That other GPU is still available to other jobs, but it can only run with diminished performance.
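Putting this together, an interactive test of the recommended request shape might look like the following sketch; the account name is a placeholder, and the core count follows the 16-cores-per-GPU layout above.

```bash
# Request 1 GPU with its 16 affiliated cores, enforcing the CPU-GPU binding.
srun --partition=l40s_normal_q --account=<your_account> \
     --gres=gpu:1 --gres-flags=enforce-binding \
     --ntasks-per-node=1 --cpus-per-task=16 \
     --time=1:00:00 --pty bash
```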
Examples
Run Matlab on L40S GPU accelerator nodes
The key line below is #SBATCH --partition=l40s_normal_q, which directs the job to the L40S nodes of the cluster.
#!/bin/bash
## Run on L40S GPUs of Falcon cluster.
#SBATCH -J matgpu
## Wall time.
#SBATCH --time=0-01:00:00
## Account to "charge" to/run against.
#SBATCH --account=arcadm
## Partition/queue.
#SBATCH --partition=l40s_normal_q
### This requests 1 node, 1 task, 1 core. 1 gpu.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
## Slurm output and error files. Always include %j in names.
#SBATCH -o slurm.matlab.02.gpu.%j.out
#SBATCH -e slurm.matlab.02.gpu.%j.err
## QoS
## This is not required because it is the default.
#SBATCH --qos=fal_l40s_normal_base
## Load modules, if any. But first "reset."
module reset
module load MATLAB/R2024b
## Load virtual environments, if any.
## None in this example.
# Set up
## Get the core number for job and other job details.
## -d flag gets you the particular cores running on.
echo " ------------"
echo "Set of cores job running on: "
echo " "
scontrol show job -d $SLURM_JOB_ID
echo " "
echo " "
## Monitor the GPU.
## The 3 means output data every 3 seconds; you will have to tweak this
## based on your execution duration.
echo " "
echo " "
echo "Start file and monitoring of GPU."
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 3 > gpu.perform.$SLURM_JOBID.log &
echo " "
echo " "
## Monitor the cpus (cores) and memory and I/O.
## This is cumbersome, but useful: we start three tools before the job starts and we
## stop the three tools after code ends.
echo " "
echo " ------------"
echo "Running IOSTAT"
iostat 2 >iostat-stdout.txt 2>iostat-stderr.txt &
echo " ------------"
echo "Running MPSTAT"
mpstat -P ALL 2 >mpstat-stdout.txt 2>mpstat-stderr.txt &
echo " ------------"
echo "Running VMSTAT"
vmstat 2 >vmstat-stdout.txt 2>vmstat-stderr.txt &
echo " ------------"
echo "Running executable"
# ------------------------
# Code to execute: Matlab.
arrayLength=10
numIterations=1
## Code name.
mycode="code02b"
## Invocation. Matlab syntax in double-quotes.
matlab -nodisplay -nosplash -r "bogus = ${mycode}(${arrayLength}, ${numIterations})"
echo " ------------"
echo "Executable done"
echo " ------------"
echo "Killing IOSTAT"
kill %1
echo " ------------"
echo "Killing MPSTAT"
kill %2
echo " ------------"
echo "Killing VMSTAT"
kill %3
The Matlab code, code02b.m, called by the Slurm sbatch script above, is below:
function aa = code02b(arrayLength, numIterations)
outfile="mat.02b.out";
fprintf('arrayLength: \n');
disp (arrayLength);
fprintf('numIterations: \n');
disp (numIterations);
fprintf('outfile: \n');
disp (outfile);
N = arrayLength;
r = gpuArray.linspace(1,100,N);
% x = rand(1,N,"gpuArray");
x = gpuArray.linspace(1,100,N);
x = transpose(x);
% numIterations = 1000;
for n=1:numIterations
x = r.*x.*(1-x);
end
% Write x to the output file.
fid = fopen(outfile, 'w');
fprintf(fid, '%f\n', gather(x));   % gather() copies the gpuArray back to host memory
fclose(fid);
% plot(r,x,'.',MarkerSize=1)
% xlabel("Growth Rate")
% ylabel("Population")
% Return argument.
aa="done";
end
Run Matlab on A30, V100, or T4 GPU accelerator nodes
To run the same code above on any of the other three types of GPU accelerators on Falcon, make the following substitutions in the Slurm sbatch script above. Look for the text shown in the headings of columns 2 through 4, and replace it with the text in the table row for the type of GPU you wish to execute on (first column).
| GPU to Execute On | Comment: ## Run on L40S GPUs of Falcon cluster. | #SBATCH --partition=l40s_normal_q | #SBATCH --qos=fal_l40s_normal_base |
|---|---|---|---|
| A30 | Comment: Run on A30 GPUs of Falcon cluster. | #SBATCH --partition=a30_normal_q | #SBATCH --qos=fal_a30_normal_base |
| V100 | Comment: Run on V100 GPUs of Falcon cluster. | #SBATCH --partition=v100_normal_q | #SBATCH --qos=fal_v100_normal_base |
| T4 | Comment: Run on T4 GPUs of Falcon cluster. | #SBATCH --partition=t4_normal_q | #SBATCH --qos=fal_t4_normal_base |
You may want to alter the following variables in the Slurm sbatch script above to increase your execution time (e.g., to see more data in the performance logging files):
- arrayLength
- numIterations
Execute a longer running job on an L40S GPU node
We want to run a job where the number of iterations is very large. We increase the number of iterations, say, from 1 to 10^9.
Therefore, we want to add a QoS parameter to specify a long-running job.
We make the following changes to the Slurm sbatch script in the first example:
- Change `#SBATCH --qos=fal_l40s_normal_base` to `#SBATCH --qos=fal_l40s_normal_long`.
- Change `#SBATCH --time=0-01:00:00` to `#SBATCH --time=14-00:00:00`.
- Change `numIterations=1` to `numIterations=1000000000`.
Execute a longer running job on the other types of GPU nodes
Start with the sbatch slurm script of the previous example.
Then replace the text shown in the headings of columns 2 through 4 with the values corresponding to the GPU type in the first column of the table.
| GPU to Execute On | Comment: ## Run on L40S GPUs of Falcon cluster. | #SBATCH --partition=l40s_normal_q | #SBATCH --qos=fal_l40s_normal_long |
|---|---|---|---|
| A30 | Comment: Run on A30 GPUs of Falcon cluster. | #SBATCH --partition=a30_normal_q | #SBATCH --qos=fal_a30_normal_long |
| V100 | Comment: Run on V100 GPUs of Falcon cluster. | #SBATCH --partition=v100_normal_q | #SBATCH --qos=fal_v100_normal_long |
| T4 | Comment: Run on T4 GPUs of Falcon cluster. | #SBATCH --partition=t4_normal_q | #SBATCH --qos=fal_t4_normal_long |
You may want to alter the following variables in the Slurm sbatch script above, which will change the execution time:
- arrayLength
- numIterations