GPU Monitoring: How to Track GPU Performance on Linux Servers

GPU workloads are everywhere now. AI model training, video transcoding, crypto mining, HPC jobs. Your GPUs are probably the most expensive components in your infrastructure, and most server monitoring setups completely ignore them. Standard Linux monitoring gives you CPU, memory, disk, and network metrics out of the box. But GPU utilization, VRAM usage, temperature, power draw? You're on your own unless you know what tools to use and how to wire them up.

This guide covers GPU monitoring on Linux servers, from quick command-line checks to automated monitoring that alerts you before a thermal throttle or OOM kills your training run at 3 AM.

If you're already comfortable with basic Linux server monitoring, this is the natural next step for GPU-heavy infrastructure.

Why GPU Monitoring Matters

CPUs are forgiving. They throttle gracefully, they queue work, and you usually have time to react before things go sideways. GPUs are different. When a GPU runs out of VRAM, your process crashes instantly. When it overheats, it throttles hard and your training time doubles. When a GPU fails silently in a multi-GPU node, you might not notice until your model produces garbage outputs three days later.

Proper GPU monitoring catches a few categories of problems that are really hard to debug otherwise.

First, thermal issues before they cause throttling. GPUs run hot, especially in dense server configurations where airflow is limited. An A100 running at 83°C is fine. The same card at 90°C is about to throttle, and your ML training will slow to a crawl without any error message telling you why.

Then there's VRAM leaks in long-running processes. A training job that starts using 20GB of VRAM might slowly creep to 38GB over days due to memory fragmentation or a subtle bug. Without VRAM monitoring, this goes unnoticed until the OOM killer strikes.

There's also the cost angle. GPU cloud instances are expensive. An H100 node can cost over $30/hour. If your GPU utilization is sitting at 15% because of a data pipeline bottleneck, you're burning money for nothing. Monitoring reveals these inefficiencies immediately.

And finally, hardware degradation. GPUs develop ECC memory errors, fan failures, and power delivery issues over time. Monitoring health indicators lets you schedule replacements before an unplanned outage.

nvidia-smi: The Starting Point for NVIDIA GPUs

If you're running NVIDIA GPUs (and in server environments, you almost certainly are), nvidia-smi is already installed with your NVIDIA drivers. It's the most fundamental GPU monitoring tool available.

Run it without arguments to get a snapshot of all GPUs in the system:

nvidia-smi

This shows you GPU utilization, memory usage, temperature, power draw, and running processes. Useful for a quick check, but the real power comes from its programmatic options.

Continuous Monitoring with nvidia-smi

For ongoing monitoring, use the daemon mode to log GPU metrics at regular intervals:

nvidia-smi dmon -s pucvmet -d 5

This samples power and temperature (p), utilization (u), processor and memory clocks (c), power and thermal violations (v), frame buffer memory usage (m), ECC errors (e), and PCIe throughput (t) every 5 seconds. The output is space-aligned columns that are easy to parse or redirect to a log file.

For CSV output that's easier to work with in scripts:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free,power.draw,clocks.current.sm --format=csv -l 10

This gives you a clean CSV with timestamps, GPU names, temperatures, utilization percentages, memory usage, power draw, and clock speeds, refreshed every 10 seconds.
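If you log that CSV to a file, getting it back into a usable structure is a few lines of Python. A minimal sketch; the sample row below is illustrative, not captured from a real card:

```python
import csv
import io

def strip_units(value):
    """Remove the unit suffixes nvidia-smi appends to CSV values."""
    for unit in (" %", " MiB", " W", " MHz"):
        if value.endswith(unit):
            return value[: -len(unit)]
    return value

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv` output into a list of dicts,
    dropping unit annotations from both headers and values."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [
        {key.split(" [")[0]: strip_units(value) for key, value in row.items()}
        for row in reader
    ]

# Illustrative sample of the query output above
sample = (
    "timestamp, name, temperature.gpu, utilization.gpu [%], memory.used [MiB]\n"
    "2024/05/01 12:00:00.000, NVIDIA A100-SXM4-80GB, 62, 87 %, 41230 MiB\n"
)
rows = parse_gpu_csv(sample)
print(rows[0]["utilization.gpu"])  # 87
```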

Monitoring per-Process GPU Usage

One of nvidia-smi's most useful features is showing which processes are using GPU resources:

nvidia-smi pmon -s um -d 5

This shows per-process GPU and memory utilization, which is critical for multi-tenant environments or when multiple training jobs share a GPU server.

Querying Specific Metrics

Need just the temperature of GPU 0? nvidia-smi can give you exactly that:

nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i 0

This returns a single number, which makes it perfect for shell scripts and monitoring integrations. Here are some commonly queried fields:

Field Description
temperature.gpu Core GPU temperature in Celsius
utilization.gpu GPU compute utilization percentage
utilization.memory Memory controller utilization percentage
memory.used Used VRAM in MiB
memory.total Total VRAM in MiB
power.draw Current power consumption in Watts
clocks.current.sm Current SM clock frequency in MHz
ecc.errors.corrected.volatile.total Corrected ECC errors since last reboot
fan.speed Fan speed as percentage of maximum

Setting Up nvidia-smi Alerts

You can build simple threshold alerts using nvidia-smi in a bash script:

#!/bin/bash
# gpu-alert.sh - Basic GPU monitoring alerts

TEMP_THRESHOLD=85
VRAM_THRESHOLD=90

while true; do
    gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1)
    
    for ((i=0; i<gpu_count; i++)); do
        temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i $i)
        vram_used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i $i)
        vram_total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i $i)
        vram_pct=$((vram_used * 100 / vram_total))
        
        if [ "$temp" -gt "$TEMP_THRESHOLD" ]; then
            echo "[ALERT] GPU $i temperature: ${temp}°C (threshold: ${TEMP_THRESHOLD}°C)"
            # Send notification via webhook, email, etc.
        fi
        
        if [ "$vram_pct" -gt "$VRAM_THRESHOLD" ]; then
            echo "[ALERT] GPU $i VRAM usage: ${vram_pct}% (${vram_used}/${vram_total} MiB)"
        fi
    done
    
    sleep 30
done

This works, but it's brittle. You lose data on restart, there's no historical trending, and scaling it across multiple servers gets messy fast. We'll cover better approaches later.

nvtop: Real-Time GPU Monitoring in Your Terminal

nvtop is basically htop for GPUs. It gives you a real-time, interactive terminal interface showing GPU utilization, memory usage, temperature, and per-process breakdowns across all your GPUs (NVIDIA, AMD, and Intel).

Install it on Ubuntu/Debian:

sudo apt install nvtop

On RHEL/Rocky/Alma:

sudo dnf install nvtop

Then just run nvtop. You'll get a color-coded dashboard showing utilization graphs, memory bars, temperature readings, and a process list sorted by GPU usage. It supports multiple GPUs and updates in real time.

nvtop is great for interactive debugging. SSH into a server and you immediately see what's happening across all GPUs. But it's a visual tool, not a monitoring solution. It doesn't store history, send alerts, or integrate with anything else.

gpustat: Quick GPU Status at a Glance

gpustat gives you a compact, colorful summary of NVIDIA GPU status. Think of it as nvidia-smi with better formatting:

pip install gpustat
gpustat --color --watch

The output is clean: one line per GPU showing utilization, temperature, memory usage, and which users/processes are running. The --watch flag refreshes automatically.

For JSON output you can pipe into other tools:

gpustat --json

It's minimal and focused. Perfect for a quick ssh + gpustat check across your fleet.

NVML: Programmatic GPU Monitoring

The NVIDIA Management Library (NVML) is the C library that nvidia-smi is built on. If you need GPU monitoring in your own applications or monitoring agents, NVML gives you direct access to all GPU metrics programmatically.

Python bindings make this accessible:

pip install pynvml

import pynvml

pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()

for i in range(device_count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)  # may return bytes on older bindings
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts to watts
    
    print(f"GPU {i}: {name}")
    print(f"  Temperature: {temp}°C")
    print(f"  GPU Utilization: {util.gpu}%")
    print(f"  Memory: {mem_info.used / 1024**2:.0f} / {mem_info.total / 1024**2:.0f} MiB")
    print(f"  Power: {power:.1f} W")

pynvml.nvmlShutdown()

NVML is what you'd use to build custom GPU monitoring agents or integrate GPU metrics into existing monitoring systems. It has one important advantage over parsing nvidia-smi output: it doesn't spawn a new process for each query, making it much more efficient for high-frequency monitoring.
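As a sketch of what such an agent looks like: poll through a single NVML session and append one CSV line per GPU per interval. The field list, interval, and file path here are illustrative choices, not a standard:

```python
import time

def format_sample(ts, index, temp_c, util_pct, mem_used_mib, power_w):
    """Render one polling sample as a CSV line."""
    return f"{ts},{index},{temp_c},{util_pct},{mem_used_mib},{power_w:.1f}"

def sample_loop(interval_s=10, path="gpu-metrics.csv"):
    """Append GPU samples to a CSV file forever, reusing one NVML session."""
    # Requires an NVIDIA driver; imported here so the module loads without one.
    import pynvml
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        with open(path, "a") as f:
            while True:
                now = int(time.time())
                for i in range(count):
                    h = pynvml.nvmlDeviceGetHandleByIndex(i)
                    temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
                    util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
                    mem = pynvml.nvmlDeviceGetMemoryInfo(h).used // (1024 ** 2)
                    power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # mW to W
                    f.write(format_sample(now, i, temp, util, mem, power) + "\n")
                f.flush()
                time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
```

Because the NVML handle is opened once, each iteration is a cheap library call rather than a process fork.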

AMD GPU Monitoring with ROCm

If you're running AMD GPUs (Instinct MI series for servers), the ROCm stack provides rocm-smi for monitoring:

rocm-smi --showtemp --showuse --showmemuse --showpower

For a continuously updating view:

watch -n 2 rocm-smi

AMD also provides amdsmi as the newer alternative, and some monitoring tools like nvtop support AMD GPUs natively through the amdgpu kernel driver.

The AMD server GPU ecosystem is growing, especially with the MI300X gaining traction in AI workloads. But the monitoring tooling is still less mature than NVIDIA's. If you're running AMD GPUs, expect to do more custom integration work.

Prometheus and Grafana for GPU Monitoring

For production GPU monitoring across multiple servers, Prometheus with Grafana dashboards is the most common setup. NVIDIA provides the dcgm-exporter (Data Center GPU Manager Exporter) that exposes GPU metrics in Prometheus format.

Setting Up DCGM Exporter

Run it as a Docker container:

docker run -d --gpus all --rm -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04

Verify it's working:

curl localhost:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP

You'll see Prometheus-format metrics for every GPU in the system. Add it as a scrape target in your prometheus.yml:

scrape_configs:
  - job_name: 'gpu-metrics'
    static_configs:
      - targets: ['gpu-server-1:9400', 'gpu-server-2:9400']

Key DCGM Metrics to Track

Metric What It Tells You
DCGM_FI_DEV_GPU_TEMP GPU temperature. Alert above 85°C
DCGM_FI_DEV_GPU_UTIL Compute utilization. Spot underuse or saturation
DCGM_FI_DEV_FB_USED Framebuffer (VRAM) used. Detect memory leaks
DCGM_FI_DEV_FB_FREE VRAM available. Predict OOM events
DCGM_FI_DEV_POWER_USAGE Power draw. Capacity planning and cost tracking
DCGM_FI_DEV_SM_CLOCK SM clock speed. Detect throttling
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL Single-bit ECC errors. Early hardware failure warning
DCGM_FI_DEV_XID_ERRORS XID errors. GPU faults and crashes

The Reality of Prometheus GPU Monitoring

Setting up Prometheus and Grafana for GPU monitoring works, but the effort is real. You need to deploy and maintain Prometheus, configure storage retention and resource limits, set up Grafana with dashboards (or find community dashboards that mostly work), configure alerting rules in PromQL, and manage this infrastructure across your fleet.

For teams already running a Prometheus stack, adding GPU metrics is straightforward. For everyone else, it's a weekend project that turns into ongoing maintenance. If you're managing a handful of GPU servers alongside your regular infrastructure, a dedicated server monitoring tool that handles GPU metrics alongside everything else is often a more practical choice.

Monitoring GPU Health: Beyond Utilization

GPU utilization and temperature are the obvious metrics, but production GPU monitoring needs to go deeper.

ECC Memory Errors

ECC (Error Correcting Code) memory errors indicate hardware degradation. Single-bit errors are corrected automatically but indicate wear. Double-bit errors are uncorrectable and cause application crashes.

nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv

Track ECC errors over time. A steady increase in single-bit errors means you should schedule a GPU replacement. Any uncorrected errors demand immediate investigation.
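Since the counter is cumulative since boot, what matters is the delta between snapshots. A sketch of the comparison logic, with made-up counter readings for illustration:

```python
def ecc_growth(snapshots):
    """Given [(timestamp, corrected_total), ...] in time order, return
    (timestamp, new_errors) for each interval where the counter grew."""
    growth = []
    for (_, prev), (ts, curr) in zip(snapshots, snapshots[1:]):
        if curr > prev:
            growth.append((ts, curr - prev))
    return growth

# Illustrative hourly readings of ecc.errors.corrected.volatile.total
readings = [(0, 0), (3600, 0), (7200, 2), (10800, 5)]
print(ecc_growth(readings))  # [(7200, 2), (10800, 3)]
```

A steadily non-empty result is the signal to start planning a replacement, even while every individual error is still being corrected.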

XID Errors

XID errors are NVIDIA's error codes logged to the kernel. They cover everything from driver bugs to hardware failures:

dmesg | grep -i "NVRM: Xid"

Common XID codes to watch for:

XID Meaning
13 Graphics engine exception, often an application fault or memory issue
31 GPU memory page fault, usually an illegal memory access by an application; occasionally hardware
48 Double-bit ECC error, needs immediate attention
63 ECC page retirement or row remapping event, a sign of degrading memory
79 GPU fallen off the bus, PCIe or hardware failure

PCIe Link Degradation

A GPU running at PCIe Gen3 x8 instead of Gen4 x16 has half the bandwidth, causing mysterious slowdowns that don't show up in GPU utilization metrics:

nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv

Compare against expected values for your hardware. PCIe link degradation can indicate slot issues, cable problems, or BIOS misconfiguration.
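The comparison is simple enough to automate. A minimal sketch, where the expected generation and width are per-model values you'd configure yourself (the Gen4 x16 defaults are just an example):

```python
def check_pcie_link(current_gen, current_width, expected_gen=4, expected_width=16):
    """Return a list of human-readable problems; empty if the link is healthy."""
    problems = []
    if current_gen < expected_gen:
        problems.append(f"link gen {current_gen} < expected {expected_gen}")
    if current_width < expected_width:
        problems.append(f"link width x{current_width} < expected x{expected_width}")
    return problems

print(check_pcie_link(3, 8))   # ['link gen 3 < expected 4', 'link width x8 < expected x16']
print(check_pcie_link(4, 16))  # []
```

One caveat: GPUs can train the link down to a lower generation while idle to save power, so run this check while the card is under load before concluding anything is wrong.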

Power Limit and Thermal Throttling

Check if your GPUs are power-limited or thermal-limited:

nvidia-smi --query-gpu=power.draw,power.limit,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown --format=csv

If clocks_throttle_reasons shows active throttling, your GPUs aren't performing at their rated speed. This often points to cooling issues, inadequate power delivery, or overly aggressive power limits.

GPU Monitoring in Docker and Kubernetes

If your GPU workloads run in containers, monitoring requires some additional setup.

Docker GPU Monitoring

With the NVIDIA Container Toolkit installed, containers can access GPU metrics. Run nvidia-smi inside a container:

docker run --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

For monitoring the host-level GPU allocation across containers, combine nvidia-smi pmon with Docker's process information to map GPU processes back to specific containers:

# Get GPU process PIDs
nvidia-smi pmon -c 1 -s um

# Map a PID back to a container. docker ps has no pid filter,
# so list each container's main PID and match against it
docker inspect --format '{{.State.Pid}} {{.Name}}' $(docker ps -q) | grep "^$PID "

If you're already using Docker container monitoring for CPU and memory, GPU metrics are the missing piece for GPU-accelerated workloads.

Kubernetes GPU Monitoring

In Kubernetes, the NVIDIA device plugin handles GPU allocation, and dcgm-exporter runs as a DaemonSet to expose metrics:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
        ports:
        - containerPort: 9400
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-dir
          mountPath: /dev
      volumes:
      - name: device-dir
        hostPath:
          path: /dev

This exposes per-GPU metrics with Kubernetes labels, so you can correlate GPU usage with specific pods and namespaces.

Best Practices for GPU Monitoring

Now that we've covered the tools, here's what actually matters when you go to implement this.

Set your temperature alerts tighter than you think. 80°C for warning, 85°C for critical. Don't wait for the thermal limit. By the time you hit it, the GPU has already been throttling for a while and you've been losing performance without knowing it.

Track VRAM as a percentage, not absolute values. An alert at 90% VRAM usage gives you a buffer before OOM regardless of whether it's a 16GB T4 or an 80GB A100.

Look at utilization trends, not point-in-time snapshots. A GPU at 0% utilization for 5 seconds is completely normal (it's between batches). A GPU sitting at 10% average over an hour is a real problem. Use aggregation periods of at least 5 minutes for utilization alerts.
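That averaging logic is easy to sketch: keep a rolling window of samples and only alert when the mean over the full window is low. The window size and threshold below are example values, not recommendations:

```python
from collections import deque

class UtilizationTrend:
    """Rolling average over the last `window` samples; flags sustained low use."""

    def __init__(self, window=30, low_threshold=10.0):
        self.samples = deque(maxlen=window)
        self.low_threshold = low_threshold

    def add(self, util_pct):
        self.samples.append(util_pct)

    def is_sustained_low(self):
        # Only judge once the window is full, so a brief idle gap
        # between batches can't trigger an alert on its own.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) < self.low_threshold

trend = UtilizationTrend(window=5, low_threshold=10.0)
for util in [0, 95, 0, 90, 0]:   # bursty but healthy training pattern
    trend.add(util)
print(trend.is_sustained_low())  # False: the average is 37%
```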

Watch power draw if you're paying for GPU cloud instances. Power draw correlates directly with compute usage. Low power draw usually means your GPUs aren't doing meaningful work and you're paying for idle capacity.

Start collecting ECC errors from day one. Don't wait for crashes. Having a baseline makes it trivial to spot when a GPU starts degrading.

And keep monitoring lightweight. GPU monitoring tools that query nvidia-smi every second can actually interfere with GPU performance. A 10-30 second collection interval is enough for most workloads. Only go faster if you're debugging a specific issue.

GPU Monitoring Without the Infrastructure Overhead

Setting up comprehensive GPU monitoring with Prometheus, Grafana, DCGM, and alert routing is a serious infrastructure project. For teams running a few GPU servers alongside their regular fleet, it's often overkill.

Fivenines monitors GPU utilization, temperature, memory usage, and power draw alongside all your other server metrics: CPU, memory, disk, network, Docker containers, and more. The agent auto-detects your GPUs and starts collecting metrics immediately. No exporters, no dashboards to configure. You get GPU monitoring and alerts through the same interface you already use for everything else.

If you're running GPU workloads on Linux servers and want visibility without spending a weekend on Prometheus configuration, give Fivenines a try. The free tier includes 5 servers with GPU monitoring included.
