GPU Monitoring: How to Track GPU Performance on Linux Servers
GPU workloads are everywhere now. AI model training, video transcoding, crypto mining, HPC jobs. Your GPUs are probably the most expensive components in your infrastructure, and most server monitoring setups completely ignore them. Standard Linux monitoring gives you CPU, memory, disk, and network metrics out of the box. But GPU utilization, VRAM usage, temperature, power draw? You're on your own unless you know what tools to use and how to wire them up.
This guide covers GPU monitoring on Linux servers, from quick command-line checks to automated monitoring that alerts you before a thermal throttle or OOM kills your training run at 3 AM.
If you're already comfortable with basic Linux server monitoring, this is the natural next step for GPU-heavy infrastructure.
Why GPU Monitoring Matters
CPUs are forgiving. They throttle gracefully, they queue work, and you usually have time to react before things go sideways. GPUs are different. When a GPU runs out of VRAM, your process crashes instantly. When it overheats, it throttles hard and your training time doubles. When a GPU fails silently in a multi-GPU node, you might not notice until your model produces garbage outputs three days later.
Proper GPU monitoring catches a few categories of problems that are really hard to debug otherwise.
First, thermal issues before they cause throttling. GPUs run hot, especially in dense server configurations where airflow is limited. An A100 running at 83°C is fine. The same card at 90°C is about to throttle, and your ML training will slow to a crawl without any error message telling you why.
Then there's VRAM leaks in long-running processes. A training job that starts using 20GB of VRAM might slowly creep to 38GB over days due to memory fragmentation or a subtle bug. Without VRAM monitoring, this goes unnoticed until the OOM killer strikes.
There's also the cost angle. GPU cloud instances are expensive. An H100 node can cost over $30/hour. If your GPU utilization is sitting at 15% because of a data pipeline bottleneck, you're burning money for nothing. Monitoring reveals these inefficiencies immediately.
And finally, hardware degradation. GPUs develop ECC memory errors, fan failures, and power delivery issues over time. Monitoring health indicators lets you schedule replacements before an unplanned outage.
nvidia-smi: The Starting Point for NVIDIA GPUs
If you're running NVIDIA GPUs (and in server environments, you almost certainly are), nvidia-smi is already installed with your NVIDIA drivers. It's the most fundamental GPU monitoring tool available.
Run it without arguments to get a snapshot of all GPUs in the system:
nvidia-smi
This shows you GPU utilization, memory usage, temperature, power draw, and running processes. Useful for a quick check, but the real power comes from its programmatic options.
Continuous Monitoring with nvidia-smi
For ongoing monitoring, use device monitoring mode (dmon) to log GPU metrics at regular intervals:
nvidia-smi dmon -s pucvmet -d 5
This samples power usage and temperature (p), utilization (u), clocks (c), power and thermal violations (v), memory usage (m), ECC errors (e), and PCIe throughput (t) every 5 seconds. The output is space-separated columns, easy to parse or pipe into a log file.
For CSV output that's easier to work with in scripts:
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free,power.draw,clocks.current.sm --format=csv -l 10
This gives you a clean CSV with timestamps, GPU names, temperatures, utilization percentages, memory usage, power draw, and clock speeds, refreshed every 10 seconds.
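Consuming this CSV from a script is straightforward with Python's csv module. Here's a minimal sketch; the sample row below is made up for illustration, and real headers carry unit suffixes like [MiB] on fields that have units:

```python
import csv
import io

# Illustrative sample of the CSV output from the query above (values invented)
sample = """timestamp, name, pci.bus_id, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB], memory.free [MiB], power.draw [W], clocks.current.sm [MHz]
2024/01/15 10:00:00.000, NVIDIA A100-SXM4-40GB, 00000000:00:04.0, 62, 97, 54, 40960, 31240, 9720, 312.45, 1410
"""

# skipinitialspace strips the space nvidia-smi puts after each comma
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
header = next(reader)
rows = [dict(zip(header, row)) for row in reader]

for r in rows:
    print(f"{r['name']}: {r['temperature.gpu']}°C, "
          f"{r['utilization.gpu [%]']}% util, "
          f"{r['memory.used [MiB]']} MiB used")
```

In practice you'd replace the sample string with the live output of the nvidia-smi command above.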
Monitoring per-Process GPU Usage
One of nvidia-smi's most useful features is showing which processes are using GPU resources:
nvidia-smi pmon -s um -d 5
This shows per-process GPU and memory utilization, which is critical for multi-tenant environments or when multiple training jobs share a GPU server.
Querying Specific Metrics
Need just the temperature of GPU 0? nvidia-smi can give you exactly that:
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i 0
This returns a single number, which makes it perfect for shell scripts and monitoring integrations. Here are some commonly queried fields:
| Field | Description |
|---|---|
| temperature.gpu | Core GPU temperature in Celsius |
| utilization.gpu | GPU compute utilization percentage |
| utilization.memory | Memory controller utilization percentage |
| memory.used | Used VRAM in MiB |
| memory.total | Total VRAM in MiB |
| power.draw | Current power consumption in watts |
| clocks.current.sm | Current SM clock frequency in MHz |
| ecc.errors.corrected.volatile.total | Corrected ECC errors since last reboot |
| fan.speed | Fan speed as percentage of maximum |
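If you query these fields from scripts regularly, a thin wrapper keeps the parsing in one place. A sketch (the helper names here are mine, not part of nvidia-smi):

```python
import subprocess

def parse_query_output(out):
    """Split nvidia-smi's noheader CSV output: one comma-separated line per GPU."""
    return [[v.strip() for v in line.split(",")]
            for line in out.strip().splitlines() if line]

def query_gpu(fields, gpu_index=None):
    """Run `nvidia-smi --query-gpu` for the given fields.

    Assumes nvidia-smi is on PATH; returns one list of string values per GPU.
    """
    cmd = ["nvidia-smi",
           f"--query-gpu={','.join(fields)}",
           "--format=csv,noheader,nounits"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]
    return parse_query_output(subprocess.check_output(cmd, text=True))

# Example (requires a GPU host):
#   temps = query_gpu(["temperature.gpu", "memory.used"])

# parse_query_output on its own works anywhere, e.g. on captured output:
print(parse_query_output("62, 31240\n71, 40000\n"))
```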
Setting Up nvidia-smi Alerts
You can build simple threshold alerts using nvidia-smi in a bash script:
#!/bin/bash
# gpu-alert.sh - Basic GPU monitoring alerts

TEMP_THRESHOLD=85
VRAM_THRESHOLD=90

while true; do
  gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1)
  for ((i = 0; i < gpu_count; i++)); do
    temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i "$i")
    vram_used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$i")
    vram_total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i "$i")
    vram_pct=$((vram_used * 100 / vram_total))

    if [ "$temp" -gt "$TEMP_THRESHOLD" ]; then
      echo "[ALERT] GPU $i temperature: ${temp}°C (threshold: ${TEMP_THRESHOLD}°C)"
      # Send notification via webhook, email, etc.
    fi

    if [ "$vram_pct" -gt "$VRAM_THRESHOLD" ]; then
      echo "[ALERT] GPU $i VRAM usage: ${vram_pct}% (${vram_used}/${vram_total} MiB)"
    fi
  done

  sleep 30
done
This works, but it's brittle. You lose data on restart, there's no historical trending, and scaling it across multiple servers gets messy fast. We'll cover better approaches later.
nvtop: Real-Time GPU Monitoring in Your Terminal
nvtop is basically htop for GPUs. It gives you a real-time, interactive terminal interface showing GPU utilization, memory usage, temperature, and per-process breakdowns across all your GPUs (NVIDIA, AMD, and Intel).
Install it on Ubuntu/Debian:
sudo apt install nvtop
On RHEL/Rocky/Alma:
sudo dnf install nvtop
Then just run nvtop. You'll get a color-coded dashboard showing utilization graphs, memory bars, temperature readings, and a process list sorted by GPU usage. It supports multiple GPUs and updates in real time.
nvtop is great for interactive debugging. SSH into a server and you immediately see what's happening across all GPUs. But it's a visual tool, not a monitoring solution. It doesn't store history, send alerts, or integrate with anything else.
gpustat: Quick GPU Status at a Glance
gpustat gives you a compact, colorful summary of NVIDIA GPU status. Think of it as nvidia-smi with better formatting:
pip install gpustat
gpustat --color --watch
The output is clean: one line per GPU showing utilization, temperature, memory usage, and which users/processes are running. The --watch flag refreshes automatically.
For JSON output you can pipe into other tools:
gpustat --json
It's minimal and focused. Perfect for a quick ssh + gpustat check across your fleet.
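The JSON output makes quick fleet-wide checks scriptable. A sketch using an illustrative payload; gpustat's real schema uses dotted keys like this, but verify against your installed version before relying on exact field names:

```python
import json

# Illustrative gpustat --json payload, trimmed to the fields used below
payload = json.loads("""
{
  "hostname": "gpu-server-1",
  "gpus": [
    {"index": 0, "utilization.gpu": 96, "memory.used": 31240, "memory.total": 40960, "temperature.gpu": 62},
    {"index": 1, "utilization.gpu": 4,  "memory.used": 1020,  "memory.total": 40960, "temperature.gpu": 41}
  ]
}
""")

# Flag GPUs that look idle, i.e. candidates for a data-pipeline bottleneck
idle = [g["index"] for g in payload["gpus"] if g["utilization.gpu"] < 10]
print(f"{payload['hostname']}: idle GPUs: {idle}")
```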
NVML: Programmatic GPU Monitoring
The NVIDIA Management Library (NVML) is the C library that nvidia-smi is built on. If you need GPU monitoring in your own applications or monitoring agents, NVML gives you direct access to all GPU metrics programmatically.
Python bindings make this accessible:
pip install pynvml
import pynvml

pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()

for i in range(device_count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts to watts

    print(f"GPU {i}: {name}")
    print(f"  Temperature: {temp}°C")
    print(f"  GPU Utilization: {util.gpu}%")
    print(f"  Memory: {mem_info.used / 1024**2:.0f} / {mem_info.total / 1024**2:.0f} MiB")
    print(f"  Power: {power:.1f} W")

pynvml.nvmlShutdown()
NVML is what you'd use to build custom GPU monitoring agents or integrate GPU metrics into existing monitoring systems. It has one important advantage over parsing nvidia-smi output: it doesn't spawn a new process for each query, making it much more efficient for high-frequency monitoring.
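That efficiency argument also suggests structuring a custom agent as one long-lived polling loop rather than repeated one-shot queries. Here's a sketch with the metrics reader injected as a callable, so the loop itself is testable without a GPU; the function names are illustrative, and a real reader would make the pynvml calls shown above:

```python
import time

def sample_gpus(read_metrics, interval=10, samples=None, sink=print):
    """Poll a metrics-reading callable at a fixed interval.

    `read_metrics` returns one dict per call (e.g. built from pynvml queries,
    which keeps the NVML context open instead of spawning nvidia-smi each time).
    `sink` receives each sample; `samples=None` means run forever.
    """
    n = 0
    while samples is None or n < samples:
        sink(read_metrics())
        n += 1
        if samples is None or n < samples:
            time.sleep(interval)

# Fake reader for illustration; swap in a pynvml-backed function in production
readings = iter([{"gpu0_temp": 62}, {"gpu0_temp": 63}])
collected = []
sample_gpus(lambda: next(readings), interval=0, samples=2, sink=collected.append)
print(collected)
```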
AMD GPU Monitoring with ROCm
If you're running AMD GPUs (Instinct MI series for servers), the ROCm stack provides rocm-smi for monitoring:
rocm-smi --showtemp --showuse --showmemuse --showpower
For a continuously updating view:
watch -n 2 rocm-smi
AMD also provides amd-smi as the newer alternative, and some monitoring tools like nvtop support AMD GPUs natively through the amdgpu kernel driver.
The AMD server GPU ecosystem is growing, especially with the MI300X gaining traction in AI workloads. But the monitoring tooling is still less mature than NVIDIA's. If you're running AMD GPUs, expect to do more custom integration work.
Prometheus and Grafana for GPU Monitoring
For production GPU monitoring across multiple servers, Prometheus with Grafana dashboards is the most common setup. NVIDIA provides the dcgm-exporter (Data Center GPU Manager Exporter) that exposes GPU metrics in Prometheus format.
Setting Up DCGM Exporter
Run it as a Docker container:
docker run -d --gpus all --rm -p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
Verify it's working:
curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP
You'll see Prometheus-format metrics for every GPU in the system. Add it as a scrape target in your prometheus.yml:
scrape_configs:
  - job_name: 'gpu-metrics'
    static_configs:
      - targets: ['gpu-server-1:9400', 'gpu-server-2:9400']
Key DCGM Metrics to Track
| Metric | What It Tells You |
|---|---|
| DCGM_FI_DEV_GPU_TEMP | GPU temperature. Alert above 85°C |
| DCGM_FI_DEV_GPU_UTIL | Compute utilization. Spot underuse or saturation |
| DCGM_FI_DEV_FB_USED | Framebuffer (VRAM) used. Detect memory leaks |
| DCGM_FI_DEV_FB_FREE | VRAM available. Predict OOM events |
| DCGM_FI_DEV_POWER_USAGE | Power draw. Capacity planning and cost tracking |
| DCGM_FI_DEV_SM_CLOCK | SM clock speed. Detect throttling |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | Single-bit ECC errors. Early hardware failure warning |
| DCGM_FI_DEV_XID_ERRORS | XID errors. GPU faults and crashes |
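These metrics feed naturally into Prometheus alerting rules. A sketch of two rules, one for temperature and one for VRAM pressure; treat it as a starting point, since label names such as gpu and instance can vary with your dcgm-exporter version and relabeling config:

```yaml
groups:
  - name: gpu-alerts
    rules:
      - alert: GpuHotspot
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} above 85°C"
      - alert: GpuVramNearFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} VRAM above 90%"
```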
The Reality of Prometheus GPU Monitoring
Setting up Prometheus and Grafana for GPU monitoring works, but the effort is real. You need to deploy and maintain Prometheus, configure storage retention and resource limits, set up Grafana with dashboards (or find community dashboards that mostly work), configure alerting rules in PromQL, and manage this infrastructure across your fleet.
For teams already running a Prometheus stack, adding GPU metrics is straightforward. For everyone else, it's a weekend project that turns into ongoing maintenance. If you're managing a handful of GPU servers alongside your regular infrastructure, a dedicated server monitoring tool that handles GPU metrics alongside everything else is often a more practical choice.
Monitoring GPU Health: Beyond Utilization
GPU utilization and temperature are the obvious metrics, but production GPU monitoring needs to go deeper.
ECC Memory Errors
ECC (Error Correcting Code) memory errors indicate hardware degradation. Single-bit errors are corrected automatically but indicate wear. Double-bit errors are uncorrectable and cause application crashes.
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
Track ECC errors over time. A steady increase in single-bit errors means you should schedule a GPU replacement. Any uncorrected errors demand immediate investigation.
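Because the volatile ECC counters are cumulative since the last reboot, trending them means taking deltas between readings and treating a drop as a counter reset. A small sketch of that logic (the function is illustrative, not part of any NVIDIA tool):

```python
def ecc_trend(samples):
    """Given chronological corrected-ECC counter readings, return per-interval deltas.

    Counters are cumulative since reboot, so a reading lower than the previous
    one means the machine rebooted; count from zero rather than going negative.
    """
    deltas = []
    for prev, cur in zip(samples, samples[1:]):
        deltas.append(cur - prev if cur >= prev else cur)  # reset on reboot
    return deltas

# A steadily growing delta like this is the replacement signal to watch for
print(ecc_trend([0, 0, 2, 5, 9]))
```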
XID Errors
XID errors are error codes the NVIDIA driver writes to the kernel log. They cover everything from driver bugs to hardware failures:
dmesg | grep -i "NVRM: Xid"
Common XID codes to watch for:
| XID | Meaning |
|---|---|
| 13 | Graphics engine exception, often a memory or power issue |
| 31 | GPU memory page fault, usually an illegal memory access by the application, occasionally bad VRAM |
| 48 | Double-bit ECC error, needs immediate attention |
| 63 | ECC page retirement or row remapping event recorded, track for recurrence |
| 79 | GPU fallen off the bus, PCIe or hardware failure |
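Scanning dmesg for these codes is easy to automate. A sketch that classifies Xid lines against the table above; the sample lines approximate the real NVRM log format, so check the regex against your own dmesg output before trusting it:

```python
import re

# Illustrative NVRM Xid lines as they appear in dmesg (format approximated)
sample_dmesg = """\
[123456.789] NVRM: Xid (PCI:0000:3b:00): 79, pid=12345, GPU has fallen off the bus.
[123999.111] NVRM: Xid (PCI:0000:af:00): 48, pid=2222, DBE (Double Bit Error) ECC Error
"""

CRITICAL_XIDS = {48, 63, 79}  # from the table above

xid_re = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:]+)\): (\d+)")
events = [(bus, int(xid)) for bus, xid in xid_re.findall(sample_dmesg)]

for bus, xid in events:
    level = "CRITICAL" if xid in CRITICAL_XIDS else "warn"
    print(f"[{level}] Xid {xid} on {bus}")
```

In production you'd feed it `dmesg` output (or journald) instead of the sample string.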
PCIe Link Speed and Width
A GPU running at PCIe Gen3 x8 instead of Gen4 x16 will have half the bandwidth, causing mysterious slowdowns that don't show up in GPU utilization metrics:
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
Compare against expected values for your hardware. PCIe link degradation can indicate slot issues, cable problems, or BIOS misconfiguration.
Power Limit and Thermal Throttling
Check if your GPUs are power-limited or thermal-limited:
nvidia-smi --query-gpu=power.draw,power.limit,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown --format=csv
If clocks_throttle_reasons shows active throttling, your GPUs aren't performing at their rated speed. This often points to cooling issues, inadequate power delivery, or overly aggressive power limits.
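A small parser makes that check scriptable. This sketch interprets one GPU's row from the query above; the column order matches the query, and the Active/Not Active strings follow nvidia-smi's CSV output convention:

```python
def throttling_active(csv_line):
    """Return the throttle reasons active in one GPU's CSV row.

    Column order matches the query above: power.draw, power.limit,
    hw_thermal_slowdown, hw_power_brake_slowdown.
    """
    draw, limit, thermal, power_brake = [v.strip() for v in csv_line.split(",")]
    reasons = []
    if thermal == "Active":
        reasons.append("thermal slowdown")
    if power_brake == "Active":
        reasons.append("power brake")
    return reasons

# Invented sample row: near the power limit and thermally throttled
print(throttling_active("298.10 W, 300.00 W, Active, Not Active"))
```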
GPU Monitoring in Docker and Kubernetes
If your GPU workloads run in containers, monitoring requires some additional setup.
Docker GPU Monitoring
With the NVIDIA Container Toolkit installed, containers can access GPU metrics. Run nvidia-smi inside a container:
docker run --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
For monitoring the host-level GPU allocation across containers, combine nvidia-smi pmon with Docker's process information to map GPU processes back to specific containers:
# Get GPU processes with their PIDs
nvidia-smi pmon -c 1 -s um
# Map a PID back to its container via cgroups (the 64-char container ID
# appears in the process's cgroup path on typical Docker setups)
CONTAINER_ID=$(grep -oE '[0-9a-f]{64}' /proc/$PID/cgroup | head -1)
docker inspect --format '{{.Name}}' "$CONTAINER_ID"
If you're already using Docker container monitoring for CPU and memory, GPU metrics are the missing piece for GPU-accelerated workloads.
Kubernetes GPU Monitoring
In Kubernetes, the NVIDIA device plugin handles GPU allocation, and dcgm-exporter runs as a DaemonSet to expose metrics:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
          ports:
            - containerPort: 9400
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-dir
              mountPath: /dev
      volumes:
        - name: device-dir
          hostPath:
            path: /dev
This exposes per-GPU metrics with Kubernetes labels, so you can correlate GPU usage with specific pods and namespaces.
Best Practices for GPU Monitoring
Now that we've covered the tools, here's what actually matters when you go to implement this.
Set your temperature alerts tighter than you think. 80°C for warning, 85°C for critical. Don't wait for the thermal limit. By the time you hit it, the GPU has already been throttling for a while and you've been losing performance without knowing it.
Track VRAM as a percentage, not absolute values. An alert at 90% VRAM usage gives you a buffer before OOM regardless of whether it's a 16GB T4 or an 80GB A100.
Look at utilization trends, not point-in-time snapshots. A GPU at 0% utilization for 5 seconds is completely normal (it's between batches). A GPU sitting at 10% average over an hour is a real problem. Use aggregation periods of at least 5 minutes for utilization alerts.
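The trend-over-snapshot idea is just a rolling window. A sketch of the aggregation (class name and window size are illustrative): with one sample every 10 seconds, a window of 30 samples gives the 5-minute average to alert on.

```python
from collections import deque

class UtilizationWindow:
    """Rolling average over the last N utilization samples."""

    def __init__(self, maxlen=30):  # 30 samples x 10 s = 5-minute window
        self.samples = deque(maxlen=maxlen)

    def add(self, pct):
        self.samples.append(pct)

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

w = UtilizationWindow(maxlen=5)
for s in [0, 95, 98, 0, 97]:  # brief 0% dips between batches are normal
    w.add(s)
print(f"avg: {w.average():.0f}%")  # the window smooths the dips out
```

Alert when the windowed average stays low, not when any single sample reads 0%.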
Watch power draw if you're paying for GPU cloud instances. Power draw correlates directly with compute usage. Low power draw usually means your GPUs aren't doing meaningful work and you're paying for idle capacity.
Start collecting ECC errors from day one. Don't wait for crashes. Having a baseline makes it trivial to spot when a GPU starts degrading.
And keep monitoring lightweight. GPU monitoring tools that query nvidia-smi every second can actually interfere with GPU performance. A 10-30 second collection interval is enough for most workloads. Only go faster if you're debugging a specific issue.
GPU Monitoring Without the Infrastructure Overhead
Setting up comprehensive GPU monitoring with Prometheus, Grafana, DCGM, and alert routing is a serious infrastructure project. For teams running a few GPU servers alongside their regular fleet, it's often overkill.
Fivenines monitors GPU utilization, temperature, memory usage, and power draw alongside all your other server metrics: CPU, memory, disk, network, Docker containers, and more. The agent auto-detects your GPUs and starts collecting metrics immediately. No exporters, no dashboards to configure. You get GPU monitoring and alerts through the same interface you already use for everything else.
If you're running GPU workloads on Linux servers and want visibility without spending a weekend on Prometheus configuration, give Fivenines a try. The free tier includes 5 servers with GPU monitoring included.