How to monitor Proxmox virtual machines

Proxmox Virtual Environment creates unique monitoring challenges that standard Linux server monitoring completely misses. While your typical server monitoring focuses on CPU, memory, and disk usage, Proxmox introduces layers of complexity: hypervisor health, storage pool integrity, VM resource optimization, backup compliance, and cluster coordination. A Proxmox host can show perfect system metrics while storage pools degrade, backups fail silently, or VMs suffer from resource contention.

This guide walks you through implementing production-ready Proxmox monitoring that goes beyond basic system metrics to track the operational health of your virtualization infrastructure. You'll learn to monitor ZFS pool health, track backup job compliance, optimize VM resource allocation, and detect cluster issues before they impact your virtual machines.

Proxmox Monitoring Versus Standard Server Monitoring

Standard server monitoring assumes you're watching a single operating system running applications directly on hardware. Proxmox introduces multiple abstraction layers that each have their own failure modes and performance characteristics.

The hypervisor layer manages VM scheduling, memory ballooning, and resource allocation. Problems here manifest as CPU steal time, memory pressure, or VMs that appear healthy internally but perform poorly. Your VM might show 20% CPU usage while actually being starved for hypervisor resources.

Storage in Proxmox typically involves ZFS pools, replication, and shared storage that can fail independently of the underlying disks. A ZFS pool can enter a degraded state, backups can fail while the filesystem remains accessible, and storage replication can lag without triggering traditional disk space alerts.

Cluster coordination adds another failure domain. Quorum loss, fence device failures, or network partitions can cause VMs to migrate unexpectedly or become unavailable even when individual nodes are healthy.

These Proxmox-specific issues require monitoring the Proxmox API, ZFS status, backup job results, and cluster state, none of which appears in standard Linux metrics.

Setting Up Proxmox API Access

Proxmox monitoring relies heavily on API calls to gather hypervisor-specific data. Start by creating a dedicated monitoring user with minimal required privileges.

Log into your Proxmox web interface and navigate to Datacenter → Permissions → Users, or create a new user called monitoring@pve from the shell:

pveum user add monitoring@pve --comment "Monitoring system user"

Create an API token for this user rather than using password authentication:

pveum user token add monitoring@pve monitoring-token --privsep 0

This command returns a token ID and secret. Store these securely; the secret won't be displayed again. The --privsep 0 flag means the token inherits the user's permissions rather than requiring separate privilege assignment.

Grant the monitoring user read-only access to the resources you need to monitor:

pveum acl modify / --users monitoring@pve --roles PVEAuditor

The PVEAuditor role provides read-only access to most Proxmox resources. For backup monitoring, you'll also need access to storage information:

pveum acl modify /storage --users monitoring@pve --roles PVEDatastoreAudit

Test your API access with a simple curl command:

curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring-token=YOUR-SECRET-HERE" \
  https://your-proxmox-host:8006/api2/json/version

This should return version information if authentication is working correctly. The -k flag ignores SSL certificate issues, which is common in lab environments but should be avoided in production.

Storage Health Monitoring

ZFS storage health monitoring requires tracking multiple metrics that don't appear in standard disk monitoring. ZFS can experience silent data corruption, pool degradation, and performance issues that won't trigger traditional disk space or I/O alerts.

Start by monitoring ZFS pool status. The zpool status command provides detailed health information, but you need to parse it programmatically for monitoring systems:

#!/bin/bash
# zfs-health-check.sh

for pool in $(zpool list -H -o name); do
    status=$(zpool status -x "$pool")
    if [[ "$status" != *"is healthy"* ]]; then
        echo "CRITICAL: ZFS pool $pool is not healthy"
        echo "$status"
        exit 2
    fi

    # A clean pool reports "errors: No known data errors"
    errors_line=$(zpool status "$pool" | grep "errors:")
    if [[ "$errors_line" != *"No known data errors"* ]]; then
        echo "WARNING: ZFS pool $pool reports data errors"
        echo "$errors_line"
        exit 1
    fi
done

echo "OK: All ZFS pools healthy"
exit 0

Monitor scrub scheduling and completion. ZFS scrubs should run regularly to detect and correct data corruption. Check when the last scrub completed:

zpool status | grep -A 1 "scan:"

For automated monitoring, extract scrub timestamps and compare them to your scrub schedule. If scrubs haven't run in your expected interval (typically monthly), generate an alert.
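One way to automate that comparison, assuming GNU date and the completion timestamp format zpool status prints for finished scans (the script name and the 35-day threshold are illustrative):

```shell
#!/bin/bash
# zfs-scrub-age-check.sh - warn when a pool's last completed scrub is too old.
# scrub_age_days is a helper introduced here; it relies on GNU date and the
# "... on Sun Oct 13 00:29:01 2024" suffix zpool status prints for a
# completed scrub.
MAX_DAYS=35

scrub_age_days() {  # arg: date string -> prints whole days since that date
    echo $(( ($(date +%s) - $(date -d "$1" +%s)) / 86400 ))
}

if command -v zpool >/dev/null; then
    for pool in $(zpool list -H -o name); do
        # Extract the completion date from the scan line
        last_date=$(zpool status "$pool" | sed -n 's/.*scrub .* on \(.*\)$/\1/p')
        if [[ -z "$last_date" ]]; then
            echo "WARNING: pool $pool has no completed scrub on record"
        elif (( $(scrub_age_days "$last_date") > MAX_DAYS )); then
            echo "WARNING: pool $pool last scrubbed $(scrub_age_days "$last_date") days ago"
        fi
    done
fi
```

A pool with a scrub still in progress won't match the "on <date>" pattern and will show the no-scrub warning, which is usually fine for a monthly compliance check.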

Track ZFS ARC (Adaptive Replacement Cache) efficiency, which directly impacts VM performance:

arc_hit_percent=$(awk '/^hits/ {hits=$3} /^misses/ {misses=$3} END {printf "%.2f", hits*100/(hits+misses)}' /proc/spl/kstat/zfs/arcstats)

ARC hit rates below 80% often indicate memory pressure or workload changes that require attention.
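To turn the one-liner into an alert, wrap the arithmetic in a small check. check_arc_rate is a helper name introduced here, using the 80% threshold above:

```shell
#!/bin/bash
# arc-hit-check.sh - compute the ARC hit rate and warn below a threshold.
# check_arc_rate is a helper introduced here, not a ZFS tool.
THRESHOLD=80

check_arc_rate() {  # args: hits misses threshold -> prints a status line
    awk -v h="$1" -v m="$2" -v t="$3" 'BEGIN {
        rate = (h + m > 0) ? h * 100 / (h + m) : 0
        printf "%s: ARC hit rate %.2f%%\n", (rate < t ? "WARNING" : "OK"), rate
    }'
}

if [[ -r /proc/spl/kstat/zfs/arcstats ]]; then
    # The aggregate "hits" and "misses" counters carry their value in field 3
    read -r hits misses < <(awk '/^hits /{h=$3} /^misses /{m=$3} END{print h, m}' \
        /proc/spl/kstat/zfs/arcstats)
    check_arc_rate "$hits" "$misses" "$THRESHOLD"
fi
```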

Monitor storage replication lag if you're using Proxmox's built-in replication. Query the replication status via API:

curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring-token=SECRET" \
  https://proxmox-host:8006/api2/json/nodes/NODE-NAME/replication

Parse the JSON response to extract last sync times and error states. Replication lag exceeding your RPO (Recovery Point Objective) should trigger immediate alerts.
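A sketch of that parsing, assuming jq is installed and that each entry in the response carries id, last_sync (epoch seconds), and error fields; verify those field names against your PVE version. The function reads the API JSON on stdin, so you can pipe the curl output above into it:

```shell
#!/bin/bash
# check_replication is a helper introduced here; it flags failed jobs and
# jobs whose last sync is older than RPO_SECONDS (default 900).
check_replication() {  # reads the replication API JSON on stdin
    jq -r --argjson now "$(date +%s)" --argjson rpo "${RPO_SECONDS:-900}" '
      .data[]
      | if (.error // "") != "" then
          "CRITICAL: job \(.id) failed: \(.error)"
        elif ($now - (.last_sync // 0)) > $rpo then
          "WARNING: job \(.id) last synced \((($now - .last_sync) / 60) | floor) minutes ago"
        else
          "OK: job \(.id)"
        end'
}
```

Usage: pipe the replication endpoint into it, e.g. `curl -sk -H "Authorization: ..." https://proxmox-host:8006/api2/json/nodes/NODE-NAME/replication | check_replication`.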

Backup Job Monitoring

Backup monitoring in Proxmox requires tracking both local backup jobs and Proxmox Backup Server integration. Many organizations discover backup failures only when they need to restore, making proactive monitoring critical.

Query backup job status through the Proxmox API:

curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring-token=SECRET" \
  https://proxmox-host:8006/api2/json/nodes/NODE-NAME/tasks

Filter the task list for backup-related entries and check their status. Look for tasks with type "vzdump" and examine their exit status and duration.

Create a backup compliance checker that verifies each VM has recent successful backups:

#!/bin/bash
# backup-compliance-check.sh

COMPLIANCE_HOURS=48  # Alert if no backup within 48 hours
CURRENT_TIME=$(date +%s)

# Get list of all VMs
vmids=$(pvesh get /cluster/resources --type vm --output-format json | jq -r '.[].vmid')

for vmid in $vmids; do
    # Find the most recent successful backup for this VM; the vzdump task
    # "id" field is the vmid for single-VM backup jobs (a substring match on
    # the upid would let vmid 100 match 1000)
    last_backup=$(pvesh get /nodes/$(hostname)/tasks --limit 1000 --output-format json | \
        jq -r --arg vmid "$vmid" '.[] | select(.type=="vzdump" and .status=="OK" and .id==$vmid) | .starttime' | \
        sort -n | tail -1)
    
    if [[ -z "$last_backup" ]]; then
        echo "CRITICAL: VM $vmid has no successful backups"
        continue
    fi
    
    backup_age_hours=$(( (CURRENT_TIME - last_backup) / 3600 ))
    
    if [[ $backup_age_hours -gt $COMPLIANCE_HOURS ]]; then
        echo "WARNING: VM $vmid last backup was $backup_age_hours hours ago"
    fi
done

Monitor backup storage utilization and retention policy compliance. If you're using Proxmox Backup Server, query its API for storage usage and verify that old backups are being pruned according to your retention policy.

Track backup duration trends to detect performance degradation. Backups that suddenly take much longer often indicate storage issues, increased data change rates, or resource contention.

VM Performance Metrics

Proxmox provides detailed VM performance metrics through its API that help identify resource optimization opportunities and performance bottlenecks.

Monitor VM resource utilization patterns to identify over-provisioned or under-utilized VMs:

curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring-token=SECRET" \
  https://proxmox-host:8006/api2/json/nodes/NODE-NAME/qemu/VMID/status/current

This returns real-time resource usage including CPU utilization, memory consumption, and network I/O. Collect this data over time to identify VMs that consistently use less than 20% of allocated resources or regularly max out their allocations.
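Once you have those utilization averages, a small classifier can flag allocation problems. classify_vm and its thresholds are illustrative, not a Proxmox tool:

```shell
#!/bin/bash
# classify_vm is a helper introduced here: it labels a VM from its average
# CPU and memory usage percentages (collected over time from status/current).
classify_vm() {  # args: avg_cpu_pct avg_mem_pct
    awk -v c="$1" -v m="$2" 'BEGIN {
        if (c < 20 && m < 20)      print "under-utilized: consider shrinking allocation"
        else if (c > 90 || m > 90) print "saturated: consider more resources"
        else                       print "ok"
    }'
}
```

For example, `classify_vm 12 15` flags a VM averaging 12% CPU and 15% memory as under-utilized.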

Track memory ballooning, which indicates memory pressure on the hypervisor. The balloon driver doesn't appear in the guest's /proc/meminfo; check it from the host instead, either via the balloon field in the status/current API response above or through the QEMU monitor:

# On the Proxmox host, open the QEMU monitor for the VM
qm monitor VMID
# at the qm> prompt, query the balloon device:
info balloon

VMs showing significant ballooned memory may need more RAM allocation, or the hypervisor may be over-committed on memory.

Monitor CPU steal time, which indicates that VMs are waiting for hypervisor CPU scheduling:

# Inside each VM, steal time shows as the "st" column
vmstat 1 5

High steal time (>5%) suggests CPU over-commitment on the hypervisor or resource contention between VMs.
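Steal time can also be computed directly from /proc/stat, where it is the eighth counter after the "cpu" label (user nice system idle iowait irq softirq steal). A minimal sketch, with steal_pct as a helper introduced here:

```shell
#!/bin/bash
# steal-check.sh - compute the CPU steal percentage from two /proc/stat samples.
steal_pct() {  # args: two "cpu ..." lines -> prints steal % over the interval
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x); split(b, y)
        tot = 0
        for (i = 2; i <= 9; i++) tot += y[i] - x[i]   # user .. steal deltas
        if (tot <= 0) { print "0.0"; exit }
        printf "%.1f\n", (y[9] - x[9]) * 100 / tot    # field 9 is steal
    }'
}

if [[ -r /proc/stat ]]; then
    s1=$(grep '^cpu ' /proc/stat); sleep 1; s2=$(grep '^cpu ' /proc/stat)
    echo "CPU steal: $(steal_pct "$s1" "$s2")%"
fi
```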

Track VM disk I/O patterns to identify storage bottlenecks:

curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring-token=SECRET" \
  https://proxmox-host:8006/api2/json/nodes/NODE-NAME/qemu/VMID/rrd?timeframe=hour

This returns RRD (Round Robin Database) data including disk read/write rates and network traffic. Look for VMs with consistently high I/O wait times or unusual traffic patterns.

High Availability and Cluster Health Monitoring

Proxmox clusters introduce additional monitoring requirements around quorum, fencing, and resource distribution. Cluster failures can cause widespread VM outages even when individual nodes are healthy.

Monitor cluster quorum status continuously:

pvecm status

Parse the output to verify that the cluster has quorum and all expected nodes are online. Loss of quorum prevents cluster operations including VM migrations and storage access.
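A scripted version of that check might look like this; parse_quorate is a helper introduced here, keyed on the "Quorate:" line of pvecm status output:

```shell
#!/bin/bash
# quorum-check.sh - alert when the cluster has lost quorum.
parse_quorate() {  # reads pvecm status text on stdin, prints Yes or No
    awk -F': *' '/^Quorate/ {gsub(/ /, "", $2); print $2}'
}

if command -v pvecm >/dev/null; then
    if [[ "$(pvecm status | parse_quorate)" != "Yes" ]]; then
        echo "CRITICAL: cluster has lost quorum"
        exit 2
    fi
    echo "OK: cluster is quorate"
fi
```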

Proxmox VE uses watchdog-based self-fencing for HA rather than external fence devices, so there is no fence-device status to poll. Instead, verify that the HA stack is healthy on each node:

ha-manager status

Nodes stuck in a fence or unknown state can prevent proper cluster failover and should trigger immediate alerts.

Monitor VM migration patterns to detect cluster imbalances or hardware issues:

curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring-token=SECRET" \
  https://proxmox-host:8006/api2/json/cluster/ha/status

Frequent unexpected migrations often indicate hardware problems, resource pressure, or network issues between cluster nodes.

Track cluster resource utilization to ensure proper load balancing:

curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring-token=SECRET" \
  https://proxmox-host:8006/api2/json/cluster/resources

This returns resource usage across all cluster nodes. Significant imbalances may require VM redistribution or hardware upgrades.

Security and Access Monitoring

Proxmox security monitoring focuses on API access patterns, authentication failures, and unusual administrative activities that could indicate compromise or insider threats.

Monitor Proxmox authentication logs for failed login attempts:

grep "authentication failure" /var/log/daemon.log

Multiple failed authentications from the same IP address may indicate brute force attacks against your Proxmox management interface.

Track API access patterns to detect unusual activity:

tail -f /var/log/pveproxy/access.log

Look for API calls from unexpected IP addresses, unusual request patterns, or access outside normal business hours.

Monitor VM creation and deletion activities:

curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring-token=SECRET" \
  https://proxmox-host:8006/api2/json/nodes/NODE-NAME/tasks | \
  jq '.data[] | select(.type=="qmcreate" or .type=="qmdestroy")'

Unexpected VM lifecycle operations could indicate unauthorized access or compromised credentials.

Production Monitoring Stack

Implementing comprehensive Proxmox monitoring requires integrating with a proper monitoring stack. Prometheus and Grafana provide the foundation for collecting, storing, and visualizing Proxmox metrics.

Install the Proxmox VE exporter for Prometheus. It's distributed as a Python package rather than a prebuilt binary:

pip install prometheus-pve-exporter

On the Proxmox host itself, prefer installing it with pipx or inside a virtualenv so you don't modify the system Python packages.

Create a configuration file for the exporter:

# pve.yml
default:
  user: monitoring@pve
  token_name: monitoring-token
  token_value: YOUR-SECRET-HERE
  verify_ssl: false

Start the exporter:

pve_exporter --config.file pve.yml

Configure Prometheus to scrape the exporter by adding this job to your prometheus.yml. The exporter serves metrics on its /pve endpoint and needs the Proxmox host passed as a target parameter:

- job_name: 'proxmox'
  metrics_path: /pve
  params:
    target: ['your-proxmox-host']
  static_configs:
    - targets: ['localhost:9221']
  scrape_interval: 30s

Create Grafana dashboards for Proxmox-specific metrics. Focus on panels that show ZFS pool health, VM resource utilization, backup job status, and cluster state rather than generic system metrics.

Define Prometheus alerting rules (routed through Alertmanager) for Proxmox-specific scenarios. Note that up{job="proxmox"} == 0 only tells you the exporter is unreachable; the exporter's pve_up metric reports the status of each node and guest as seen by the Proxmox API:

groups:
- name: proxmox
  rules:
  - alert: ProxmoxNodeDown
    expr: pve_up{id=~"node/.+"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Proxmox node {{ $labels.id }} is down"

  - alert: ProxmoxStorageNearlyFull
    expr: pve_disk_usage_bytes{id=~"storage/.+"} / pve_disk_size_bytes > 0.85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Proxmox storage {{ $labels.id }} is over 85% full"

The exporter doesn't expose ZFS pool health directly; to alert on pool state, feed the output of the shell checks above into Prometheus, for example via the node exporter's textfile collector.

Diagnosing Proxmox Performance Issues

When Proxmox performance issues occur, systematic diagnosis helps identify root causes quickly. Start by determining whether the problem affects the hypervisor, specific VMs, or storage subsystems.

For slow VM performance, first check hypervisor resource utilization:

top
iostat -x 1
iotop

High hypervisor CPU usage, memory pressure, or disk I/O wait times indicate resource contention affecting all VMs on the host.

If hypervisor resources look normal, examine VM-specific metrics:

# Check CPU steal time inside the slow VM ("st" column in the output)
vmstat 1 5

# Check for storage latency
iostat -x 1

High steal time indicates CPU scheduling delays, and high storage latency may point to ZFS pool issues or failing disks. Ballooning isn't visible from inside the guest; check it from the host with qm monitor VMID followed by info balloon. Significant ballooned memory means the VM needs a larger RAM allocation, or the hypervisor is over-committed on memory.

For storage performance issues, examine ZFS ARC efficiency and pool status:

arcstat 1 10  # Monitor ARC hit rates (shipped as arcstat.py on older ZFS releases)
zpool iostat -v 1  # Monitor pool I/O patterns
zpool status  # Check pool health

Poor ARC hit rates combined with high I/O latency often indicate memory pressure forcing ZFS to read from disk more frequently.

Network performance issues between VMs require examining both virtual and physical network layers:

# Test network performance between VMs
iperf3 -s  # On destination VM
iperf3 -c destination-ip  # On source VM

# Check bridge statistics on hypervisor
cat /proc/net/dev | grep vmbr

Low throughput between VMs on the same host suggests virtual network configuration issues, while problems between hosts indicate physical network bottlenecks.

Proxmox monitoring requires a comprehensive approach that goes beyond standard server metrics to track hypervisor health, storage integrity, backup compliance, and cluster coordination. By implementing the monitoring strategies in this guide, you'll catch Proxmox-specific issues before they impact your virtual machines and maintain the operational visibility needed for production virtualization environments.

If you're looking for a monitoring solution that can track your Proxmox infrastructure alongside your other servers, FiveNines provides comprehensive server monitoring with 5-second metric collection and process-level visibility. Our platform can monitor your Proxmox hosts, track critical processes, and alert you to performance issues before they affect your virtual machines.
