Monitoring and Logging GPU Utilization in Your Job
Anyone using NVIDIA GPUs with command-line tools meets the command nvidia-smi pretty quickly. It's a great way to get a quick view of the status of the GPUs on a node. In the context of a job, the command's output is limited to the GPUs which have been allocated to that job.
[brownm12@tc-gpu001 ~]$ nvidia-smi
Wed Feb 23 17:22:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   64C    P0   162W / 400W |   2668MiB / 81251MiB |     45%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   62C    P0   161W / 400W |   2644MiB / 81251MiB |     39%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   55C    P0   137W / 400W |   2668MiB / 81251MiB |     42%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4C:00.0 Off |                    0 |
| N/A   61C    P0   138W / 400W |   2644MiB / 81251MiB |     40%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   38C    P0    60W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   41C    P0    64W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   39C    P0    69W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   51C    P0   122W / 400W |   3802MiB / 81251MiB |     55%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    118062      C   python                           2665MiB |
|    1   N/A  N/A    118062      C   python                           2641MiB |
|    2   N/A  N/A     18210      C   python                           2665MiB |
|    3   N/A  N/A     18210      C   python                           2641MiB |
|    7   N/A  N/A     98012      C   ...onda/envs/test/bin/python     3799MiB |
+-----------------------------------------------------------------------------+
We can see:
- the visible GPUs (there are 8 here, numbered 0-7)
- each GPU's model, ID, temperature, power consumption, PCIe bus ID, % GPU utilization, and % GPU memory utilization
- a list of the processes currently running on each GPU
This is nice, pretty output, but it's no good for logging or continuous monitoring; for that we need more concise output and repeated refreshes. Here's how to get started:
nvidia-smi --query-gpu=…
The output can be formatted as comma-separated values (CSV), the query parameters can be customized, and the query can be set to loop or repeat on a regular interval:
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory --format=csv -l 5
- -l controls the looping interval (5 seconds here); -lms can be used instead to define the looping interval in milliseconds
- --format=csv specifies CSV formatting for the output
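If you plan to feed this output to other tools, the csv format also accepts noheader and nounits modifiers, which drop the header row and the " %" / " MiB" units from each value. For example:

# machine-friendly output: no header row, values printed without units
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv,noheader,nounits -l 5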
Use man nvidia-smi and scroll to the "GPU ATTRIBUTES" and "UNIT ATTRIBUTES" sections for a list of the attributes which can be queried.
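You can also ask nvidia-smi itself for the list:

# print every property accepted by --query-gpu, with a short description of each
nvidia-smi --help-query-gpu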
Make a bash function
Instead of typing all this out or copy-pasting it repeatedly, you can create a bash function to act as a wrapper for the query.
[brownm12@tc-gpu001 ~]$ gpumon() { nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory --format=csv -lms 1000 ; }
[brownm12@tc-gpu001 ~]$ gpumon
timestamp, name, pci.bus_id, driver_version, temperature.gpu, utilization.gpu [%], utilization.memory [%]
2022/02/23 17:26:54.522, NVIDIA A100-SXM-80GB, 00000000:07:00.0, 470.57.02, 64, 43 %, 4 %
2022/02/23 17:26:54.524, NVIDIA A100-SXM-80GB, 00000000:0B:00.0, 470.57.02, 61, 40 %, 4 %
2022/02/23 17:26:54.526, NVIDIA A100-SXM-80GB, 00000000:48:00.0, 470.57.02, 55, 21 %, 2 %
2022/02/23 17:26:54.527, NVIDIA A100-SXM-80GB, 00000000:4C:00.0, 470.57.02, 61, 34 %, 2 %
2022/02/23 17:26:54.529, NVIDIA A100-SXM-80GB, 00000000:88:00.0, 470.57.02, 38, 0 %, 0 %
2022/02/23 17:26:54.531, NVIDIA A100-SXM-80GB, 00000000:8B:00.0, 470.57.02, 41, 0 %, 0 %
2022/02/23 17:26:54.532, NVIDIA A100-SXM-80GB, 00000000:C8:00.0, 470.57.02, 39, 0 %, 0 %
2022/02/23 17:26:54.534, NVIDIA A100-SXM-80GB, 00000000:CB:00.0, 470.57.02, 52, 56 %, 9 %
2022/02/23 17:26:55.535, NVIDIA A100-SXM-80GB, 00000000:07:00.0, 470.57.02, 64, 43 %, 4 %
2022/02/23 17:26:55.536, NVIDIA A100-SXM-80GB, 00000000:0B:00.0, 470.57.02, 61, 40 %, 4 %
2022/02/23 17:26:55.537, NVIDIA A100-SXM-80GB, 00000000:48:00.0, 470.57.02, 54, 24 %, 2 %
2022/02/23 17:26:55.538, NVIDIA A100-SXM-80GB, 00000000:4C:00.0, 470.57.02, 61, 36 %, 2 %
2022/02/23 17:26:55.538, NVIDIA A100-SXM-80GB, 00000000:88:00.0, 470.57.02, 38, 0 %, 0 %
2022/02/23 17:26:55.539, NVIDIA A100-SXM-80GB, 00000000:8B:00.0, 470.57.02, 41, 0 %, 0 %
2022/02/23 17:26:55.540, NVIDIA A100-SXM-80GB, 00000000:C8:00.0, 470.57.02, 39, 0 %, 0 %
2022/02/23 17:26:55.541, NVIDIA A100-SXM-80GB, 00000000:CB:00.0, 470.57.02, 52, 56 %, 9 %
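To keep the function across sessions, add its definition to your ~/.bashrc. As a small sketch of a variation (the argument handling is our own convention, not anything built into nvidia-smi), the function can also take the sampling interval in milliseconds as an optional argument:

# variant: pass the interval in ms as the first argument (defaults to 1000)
gpumon() { nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory --format=csv -lms "${1:-1000}" ; }
gpumon 500   # sample every half second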
Show only non-zero utilization and log to a CSV file
# run gpumon, but only show nonzero output and send it to a file instead of the terminal display
gpumon | grep -v " 0 %, 0 %" > gpustats.csv
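One caveat: when grep writes to a file rather than a terminal, its output is block-buffered, so lines may not show up in gpustats.csv right away. With GNU grep you can force line buffering, and tee lets you watch the stream in the terminal while logging it at the same time:

# flush each matching line to the file as it arrives (GNU grep)
gpumon | grep --line-buffered -v " 0 %, 0 %" > gpustats.csv

# or watch in the terminal while also appending to the log file
gpumon | grep --line-buffered -v " 0 %, 0 %" | tee -a gpustats.csv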