How to Troubleshoot Container Monitoring When Your Observability Stack Fails
Your container monitoring stack worked perfectly for months. Prometheus scraped metrics reliably, Grafana dashboards loaded instantly, alerts fired when they should. Then one morning you wake up to a flood of "monitoring down" notifications, and suddenly you're flying blind through a production incident with no visibility into what's actually happening.
When your observability stack fails, you face a dangerous catch-22: you need monitoring to troubleshoot problems, but your monitoring system is the problem. This guide walks you through diagnosing and fixing the most common container monitoring failures, from Prometheus crashes to missing metrics, so you can restore visibility when you need it most.
Common Failure Patterns
Container monitoring failures follow predictable patterns. Understanding these patterns helps you diagnose issues faster when everything is on fire.
Prometheus OOMKilled scenarios are the most dramatic failures. Your Prometheus pod restarts constantly, memory usage spikes to the limit, and you lose all recent metrics. The symptoms are unmistakable: kubectl get pods shows your Prometheus pod with a restart count climbing every few minutes, and kubectl describe pod reveals the dreaded "OOMKilled" status.
Check your current Prometheus memory usage:
kubectl top pod -n monitoring prometheus-server-0
kubectl describe pod -n monitoring prometheus-server-0 | grep -A 5 "Last State"
If you see memory usage consistently hitting your resource limits, you're dealing with a cardinality explosion or retention misconfiguration.
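To confirm whether retention settings are the culprit, check the flags the running server was actually started with. A quick sketch, assuming a reachable prometheus:9090 address (adjust or port-forward as needed):
# Inspect the active retention flags on the running Prometheus
curl -s http://prometheus:9090/api/v1/status/flags | jq '.data["storage.tsdb.retention.time"], .data["storage.tsdb.retention.size"]'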
Grafana dashboard timeouts present differently. Dashboards load partially, queries time out with "context deadline exceeded" errors, and complex visualizations never render. Users see spinning wheels instead of metrics, and the Grafana logs fill with query timeout messages.
Missing metrics from specific containers create blind spots in your monitoring. Some services report metrics normally while others disappear entirely. This usually indicates service discovery issues, network problems, or authentication failures between Prometheus and your container runtime.
Alert manager silence is perhaps the most dangerous failure because it's invisible. Alerts stop firing, but you don't realize it until something breaks and no one gets notified. Your monitoring appears healthy, but the notification pipeline has quietly failed.
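One way to catch this early is to exercise the notification path end to end: confirm Prometheus knows about an Alertmanager, then push a synthetic alert straight into Alertmanager and watch whether a notification arrives. A rough sketch, assuming the usual prometheus:9090 and alertmanager:9093 service addresses:
# Does Prometheus currently see any active Alertmanagers?
curl -s http://prometheus:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'
# Push a test alert directly into Alertmanager to exercise the notification pipeline
curl -XPOST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"NotificationPipelineTest","severity":"warning"}}]'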
Prometheus Performance Issues
When Prometheus struggles, it usually tells you exactly what's wrong through its own metrics. The key is knowing which metrics to check and how to interpret them.
Start with memory analysis using Prometheus's internal metrics. Connect to your Prometheus instance and run these queries to understand memory consumption:
# Check the Prometheus process's own memory usage (adjust the job label to your self-scrape job name)
process_resident_memory_bytes{job="prometheus"}
# Check the active series count and symbol table size
prometheus_tsdb_head_series
prometheus_tsdb_symbol_table_size_bytes
# Identify high cardinality metrics
topk(10, count by (__name__)({__name__=~".+"}))
# Check WAL size and on-disk block size
prometheus_tsdb_wal_storage_size_bytes
prometheus_tsdb_storage_blocks_bytes
If prometheus_tsdb_head_series shows millions of series, you have a cardinality problem. Each series consumes approximately 1-3KB of memory, so 5 million series requires 5-15GB of RAM just for the time series database.
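Prometheus also exposes TSDB statistics that break series counts down by metric name and label name, which makes the offending metrics easier to spot than ad-hoc queries. A sketch, again assuming a reachable prometheus:9090 address:
# Top series counts by metric name and label name
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName, .data.labelValueCountByLabelName'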
Query performance issues show up in the prometheus_engine_* metrics:
# Check query duration
# Check query latency (this metric is a summary, broken down by query stage)
prometheus_engine_query_duration_seconds{quantile="0.99"}
# Identify the slowest query stages
topk(5, prometheus_engine_query_duration_seconds{quantile="0.99"})
# Check concurrent query load
prometheus_engine_queries_concurrent_max
If the 99th percentile query duration exceeds 30 seconds, your queries are too complex for your data volume, or your Prometheus instance is undersized.
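It is also worth confirming which query limits the server is running with, since the defaults can be too generous for an undersized instance. A sketch using the same flags endpoint (hostname illustrative):
# Inspect the configured query limits
curl -s http://prometheus:9090/api/v1/status/flags | jq '.data["query.timeout"], .data["query.max-concurrency"], .data["query.max-samples"]'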
Storage problems manifest in WAL corruption, disk space exhaustion, or retention issues. Check these indicators:
# Monitor disk space usage inside the Prometheus pod
kubectl exec -n monitoring prometheus-server-0 -- df -h /prometheus/data
# Check WAL corruption in Prometheus logs
kubectl logs -n monitoring prometheus-server-0 | grep -i "wal\|corrupt\|repair"
# Verify retention settings
curl http://prometheus:9090/api/v1/status/runtimeinfo | jq '.data.storageRetention'
When you see "WAL corruption" or "repair" messages in the logs, your Prometheus database needs immediate attention. This often happens after unclean shutdowns or disk space exhaustion.
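Before wiping anything, you can sanity-check what is actually on disk with promtool, which ships in the official prom/prometheus image. A sketch, assuming the same pod and data path used elsewhere in this guide:
# List TSDB blocks with their time ranges, series, and sample counts
kubectl exec -n monitoring prometheus-server-0 -- promtool tsdb list /prometheus/data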
Missing Container Metrics
Missing metrics usually stem from service discovery failures, network connectivity issues, or authentication problems. The fix depends on identifying which layer is broken.
Start by checking Prometheus service discovery. Navigate to your Prometheus web interface and go to Status → Service Discovery. This page shows all discovered targets and their current status. Look for targets in "down" state or with error messages.
For Kubernetes environments, verify that Prometheus can reach the Kubernetes API:
# Check if Prometheus can list pods
kubectl exec -n monitoring prometheus-server-0 -- sh -c 'wget -qO- --header="Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://kubernetes.default.svc.cluster.local/api/v1/pods'
If this fails, your Prometheus pod lacks proper RBAC permissions. Verify the ServiceAccount, ClusterRole, and ClusterRoleBinding are configured correctly:
kubectl get clusterrolebinding prometheus-server -o yaml
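# Optionally, confirm via impersonation that the ServiceAccount can actually list pods
# (assumes the ServiceAccount is named prometheus-server in the monitoring namespace)
kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:monitoring:prometheus-server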
kubectl describe serviceaccount -n monitoring prometheus-server
Network connectivity issues between Prometheus and targets often cause metrics to disappear. Test connectivity from your Prometheus pod to a missing target:
# Test connectivity to a specific pod
kubectl exec -n monitoring prometheus-server-0 -- wget -qO- http://target-pod-ip:8080/metrics --timeout=10
# Check if network policies block access
kubectl get networkpolicies -A
kubectl describe networkpolicy -n target-namespace policy-name
cAdvisor problems cause container-level metrics to disappear. cAdvisor runs as part of the kubelet and exposes container metrics on port 10250 at /metrics/cadvisor. Verify it's accessible:
# Check cAdvisor endpoint from Prometheus
kubectl exec -n monitoring prometheus-server-0 -- wget -qO- --no-check-certificate https://node-ip:10250/metrics/cadvisor --timeout=10
If this fails with authentication errors, check your Prometheus configuration for proper TLS and bearer token settings:
kubectl get configmap -n monitoring prometheus-server -o yaml | grep -A 10 -B 10 "cadvisor"
Node-exporter issues affect host-level metrics. Verify node-exporter pods are running on all nodes:
kubectl get pods -n monitoring -l app=node-exporter -o wide
kubectl get nodes --no-headers | wc -l
The number of node-exporter pods should match your node count. If pods are missing, check the DaemonSet status and node selectors, as shown below.
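A quick way to do that is to compare the DaemonSet's desired and ready counts and inspect its node selector. The DaemonSet name below is an assumption, so match it to your deployment:
kubectl get daemonset -n monitoring node-exporter
kubectl describe daemonset -n monitoring node-exporter | grep -i "node-selector"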
Grafana Visualization Problems
Grafana problems usually manifest as slow dashboards, "no data" errors, or visualization failures. Most issues trace back to inefficient queries or data source configuration problems.
When dashboards load slowly, start by examining the queries. Open the problematic dashboard, click on a slow panel, and select "Edit." Look for queries without proper time range limits or those requesting too much data:
# Problematic query - no rate() function
container_cpu_usage_seconds_total
# Better query - uses rate() and limits time range
rate(container_cpu_usage_seconds_total[5m])
Use Grafana's query inspector to see actual query execution times. Click the "Query Inspector" button in any panel to see how long each query takes and how much data it returns. Queries returning millions of data points will always be slow.
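When a panel genuinely needs a lot of history, the range-query API shows how the step parameter bounds the number of points returned; Grafana's Min interval and Max data points settings do the equivalent. A sketch with an illustrative hostname and time window:
# Aggregated query over one hour at 60s resolution (~60 points per series)
curl -sG 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))' \
  --data-urlencode 'start=2024-01-15T09:00:00Z' \
  --data-urlencode 'end=2024-01-15T10:00:00Z' \
  --data-urlencode 'step=60s'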
"No data" scenarios often result from incorrect PromQL queries or time range mismatches. Test your queries directly in Prometheus before using them in Grafana:
# Test query in Prometheus first
curl -G 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=up{job="kubernetes-pods"}' \
--data-urlencode 'time=2024-01-15T10:00:00Z'If Prometheus returns data but Grafana shows "no data," check the data source configuration. Go to Configuration → Data Sources in Grafana and test the connection to Prometheus. Verify the URL is correct and accessible from the Grafana pod:
kubectl exec -n monitoring grafana-pod -- wget -qO- http://prometheus:9090/api/v1/query?query=up --timeout=10
Grafana memory and CPU issues cause dashboard loading failures and UI responsiveness problems. Check resource usage and logs:
kubectl top pod -n monitoring grafana-pod
kubectl logs -n monitoring grafana-pod | tail -100
Look for "out of memory" errors or high CPU usage patterns in the logs. Large dashboards with many panels can overwhelm Grafana, especially if queries return large datasets.
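Grafana also exposes an unauthenticated health endpoint that reports whether its backing database is reachable, which is a quick first check when the UI misbehaves. A sketch, assuming the default port 3000 and a wget binary in the image:
kubectl exec -n monitoring grafana-pod -- wget -qO- http://localhost:3000/api/health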
Monitoring Recovery Procedures
When your monitoring stack fails completely, you need rapid recovery procedures to restore visibility. These emergency steps prioritize getting basic monitoring back online quickly.
For Prometheus recovery, start by preserving any existing data before attempting repairs:
# Create a snapshot before making changes
# Requires Prometheus to be started with --web.enable-admin-api
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Check snapshot location
kubectl exec -n monitoring prometheus-server-0 -- ls -la /prometheus/data/snapshots/
If Prometheus won't start due to data corruption, you may need to rebuild the TSDB. This is destructive but sometimes necessary:
# Stop Prometheus
kubectl scale statefulset -n monitoring prometheus-server --replicas=0
# Access the data volume
kubectl run -it --rm debug --image=busybox --overrides='{"spec":{"containers":[{"name":"debug","image":"busybox","volumeMounts":[{"mountPath":"/data","name":"prometheus-data"}]}],"volumes":[{"name":"prometheus-data","persistentVolumeClaim":{"claimName":"prometheus-server"}}]}}'
# Inside the debug pod, backup and clean corrupted data
cd /data
tar czf backup-$(date +%Y%m%d).tar.gz .
rm -rf wal/
rm -rf chunks_head/
# Restart Prometheus
kubectl scale statefulset -n monitoring prometheus-server --replicas=1
For temporary monitoring during outages, deploy a minimal Prometheus instance with basic scraping:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-emergency
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-emergency
  template:
    metadata:
      labels:
        app: prometheus-emergency
    spec:
      serviceAccountName: prometheus-server  # reuse the existing Prometheus RBAC for service discovery
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.retention.time=1h'
            - '--storage.tsdb.retention.size=1GB'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-emergency-config
Create a minimal configuration that only scrapes essential services:
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
This emergency setup provides basic visibility while you repair the main monitoring stack.
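To deploy it, load the scrape config into the ConfigMap the Deployment references and apply the manifest. The file names below are placeholders for wherever you saved the two snippets above:
kubectl create configmap prometheus-emergency-config -n monitoring \
  --from-file=prometheus.yml=prometheus-emergency.yml
kubectl apply -f prometheus-emergency-deployment.yaml
# Quick local access while the main stack is down
kubectl port-forward -n monitoring deployment/prometheus-emergency 9090:9090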
Preventing Future Monitoring Failures
Prevention beats emergency recovery every time. Implement these safeguards to avoid monitoring failures before they happen.
Set appropriate resource limits and requests for all monitoring components. Prometheus needs significant memory for high-cardinality environments:
resources:
  requests:
    memory: "4Gi"
    cpu: "1000m"
  limits:
    memory: "8Gi"
    cpu: "2000m"
Configure health checks that detect monitoring problems before they become critical:
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 30
  timeoutSeconds: 10
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  initialDelaySeconds: 5
  timeoutSeconds: 5
Implement monitoring for your monitoring stack. Create alerts that fire when Prometheus or Grafana show signs of distress:
- alert: PrometheusHighMemoryUsage
  expr: process_resident_memory_bytes{job="prometheus"} > 6e9  # adjust the job label to your self-scrape job name
  for: 5m
  annotations:
    summary: "Prometheus memory usage is high"
    description: "Prometheus is using {{ $value }} bytes of resident memory"
- alert: PrometheusQueryDurationHigh
  expr: prometheus_engine_query_duration_seconds{quantile="0.99"} > 30
  for: 5m
  annotations:
    summary: "Prometheus queries are slow"
Set up automated backups of your monitoring configuration and data:
#!/bin/bash
# Backup script for monitoring configs
kubectl get configmap -n monitoring prometheus-server -o yaml > prometheus-config-$(date +%Y%m%d).yaml
kubectl get configmap -n monitoring grafana-dashboards -o yaml > grafana-dashboards-$(date +%Y%m%d).yaml
# Create Prometheus snapshot
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
Finally, document your recovery procedures and test them regularly. The middle of an outage is not the time to figure out how to restore from backups or rebuild corrupted data. Run quarterly disaster recovery drills where you intentionally break your monitoring stack and practice the recovery procedures.
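A drill can be as simple as taking Prometheus down on purpose in a staging cluster and timing how long your runbook takes to restore scraping. A rough sketch, reusing the object names and addresses from the examples above:
kubectl scale statefulset -n monitoring prometheus-server --replicas=0
start=$(date +%s)
# ...follow your documented recovery runbook here...
kubectl scale statefulset -n monitoring prometheus-server --replicas=1
curl -s http://prometheus:9090/api/v1/targets | jq '[.data.activeTargets[] | select(.health=="up")] | length'
echo "Recovery took $(( $(date +%s) - start )) seconds"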
Container monitoring failures are inevitable, but they don't have to be catastrophic. With proper troubleshooting knowledge, emergency procedures, and preventive measures, you can maintain visibility into your systems even when your observability stack has problems of its own.
If you're tired of debugging your monitoring tools instead of your actual infrastructure, fivenines.io offers a simpler approach. Our lightweight agent collects container metrics every 5 seconds without depending on complex orchestration layers or external dependencies that can fail alongside your stack. It monitors Docker containers out of the box, giving you process-level visibility and resource metrics even when your primary observability platform is having a bad day. Sometimes the best monitoring is the one that just works.