How to Troubleshoot Container Monitoring When Your Observability Stack Fails
Your container monitoring stack worked perfectly for months. Prometheus scraped metrics reliably, Grafana dashboards loaded instantly, alerts fired when they should. Then one morning you wake up to a flood of "monitoring down" notifications, and suddenly you're flying blind through a production incident with no visibility into what's actually happening.
When your observability stack fails, you face a dangerous catch-22: you need monitoring to troubleshoot problems, but your monitoring system is the problem. This guide walks you through diagnosing and fixing the most common container monitoring failures, from Prometheus crashes to missing metrics, so you can restore visibility when you need it most.
Common Failure Patterns
Container monitoring failures follow predictable patterns. Understanding these patterns helps you diagnose issues faster when everything is on fire.
Prometheus OOMKilled scenarios are the most dramatic failures. Your Prometheus pod restarts constantly, memory usage spikes to the limit, and you lose all recent metrics. The symptoms are unmistakable: kubectl get pods shows your Prometheus pod with a restart count climbing every few minutes, and kubectl describe pod reveals the dreaded "OOMKilled" status.
Check your current Prometheus memory usage:
kubectl top pod -n monitoring prometheus-server-0
kubectl describe pod -n monitoring prometheus-server-0 | grep -A 5 "Last State"
If you see memory usage consistently hitting your resource limits, you're dealing with a cardinality explosion or retention misconfiguration.
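To confirm whether retention settings are the culprit, check the flags the running server was actually started with. A quick sketch, assuming a reachable prometheus:9090 address (adjust or port-forward as needed):
# Inspect the active retention flags on the running Prometheus
curl -s http://prometheus:9090/api/v1/status/flags | jq '.data["storage.tsdb.retention.time"], .data["storage.tsdb.retention.size"]'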
Grafana dashboard timeouts present differently. Dashboards load partially, queries time out with "context deadline exceeded" errors, and complex visualizations never render. Users see spinning wheels instead of metrics, and the Grafana logs fill with query timeout messages.
Missing metrics from specific containers create blind spots in your monitoring. Some services report metrics normally while others disappear entirely. This usually indicates service discovery issues, network problems, or authentication failures between Prometheus and your container runtime.
Alert manager silence is perhaps the most dangerous failure because it's invisible. Alerts stop firing, but you don't realize it until something breaks and no one gets notified. Your monitoring appears healthy, but the notification pipeline has quietly failed.
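One way to catch this early is to exercise the notification path end to end: confirm Prometheus knows about an Alertmanager, then push a synthetic alert straight into Alertmanager and watch whether a notification arrives. A rough sketch, assuming the usual prometheus:9090 and alertmanager:9093 service addresses:
# Does Prometheus currently see any active Alertmanagers?
curl -s http://prometheus:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'
# Push a test alert directly into Alertmanager to exercise the notification pipeline
curl -XPOST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"NotificationPipelineTest","severity":"warning"}}]'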
Prometheus Performance Issues
When Prometheus struggles, it usually tells you exactly what's wrong through its own metrics. The key is knowing which metrics to check and how to interpret them.
Start with memory analysis using Prometheus's internal metrics. Connect to your Prometheus instance and run these queries to understand memory consumption:
# Check the Prometheus process's own memory usage (adjust the job label to your self-scrape job name)
process_resident_memory_bytes{job="prometheus"}
# Check the active series count and symbol table size
prometheus_tsdb_head_series
prometheus_tsdb_symbol_table_size_bytes
# Identify high cardinality metrics
topk(10, count by (__name__)({__name__=~".+"}))
# Check WAL size and on-disk block size
prometheus_tsdb_wal_storage_size_bytes
prometheus_tsdb_storage_blocks_bytes
If prometheus_tsdb_head_series shows millions of series, you have a cardinality problem. Each series consumes approximately 1-3KB of memory, so 5 million series requires 5-15GB of RAM just for the time series database.
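Prometheus also exposes TSDB statistics that break series counts down by metric name and label name, which makes the offending metrics easier to spot than ad-hoc queries. A sketch, again assuming a reachable prometheus:9090 address:
# Top series counts by metric name and label name
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName, .data.labelValueCountByLabelName'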
Query performance issues show up in the prometheus_engine_* metrics:
# Check query duration
# Check query latency (this metric is a summary, broken down by query stage)
prometheus_engine_query_duration_seconds{quantile="0.99"}
# Identify the slowest query stages
topk(5, prometheus_engine_query_duration_seconds{quantile="0.99"})
# Check concurrent query load
prometheus_engine_queries_concurrent_max
If the 99th percentile query duration exceeds 30 seconds, your queries are too complex for your data volume, or your Prometheus instance is undersized.
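It is also worth confirming which query limits the server is running with, since the defaults can be too generous for an undersized instance. A sketch using the same flags endpoint (hostname illustrative):
# Inspect the configured query limits
curl -s http://prometheus:9090/api/v1/status/flags | jq '.data["query.timeout"], .data["query.max-concurrency"], .data["query.max-samples"]'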
Storage problems manifest in WAL corruption, disk space exhaustion, or retention issues. Check these indicators:
# Monitor disk space usage inside the Prometheus pod
kubectl exec -n monitoring prometheus-server-0 -- df -h /prometheus/data
# Check WAL corruption in Prometheus logs
kubectl logs -n monitoring prometheus-server-0 | grep -i "wal\|corrupt\|repair"
# Verify retention settings
curl http://prometheus:9090/api/v1/status/runtimeinfo | jq '.data.storageRetention'
When you see "WAL corruption" or "repair" messages in the logs, your Prometheus database needs immediate attention. This often happens after unclean shutdowns or disk space exhaustion.
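Before wiping anything, you can sanity-check what is actually on disk with promtool, which ships in the official prom/prometheus image. A sketch, assuming the same pod and data path used elsewhere in this guide:
# List TSDB blocks with their time ranges, series, and sample counts
kubectl exec -n monitoring prometheus-server-0 -- promtool tsdb list /prometheus/data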
Missing Container Metrics
Missing metrics usually stem from service discovery failures, network connectivity issues, or authentication problems. The fix depends on identifying which layer is broken.
Start by checking Prometheus service discovery. Navigate to your Prometheus web interface and go to Status → Service Discovery. This page shows all discovered targets and their current status. Look for targets in "down" state or with error messages.
For Kubernetes environments, verify that Prometheus can reach the Kubernetes API:
# Check if Prometheus can list pods
kubectl exec -n monitoring prometheus-server-0 -- sh -c 'wget -qO- --header="Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://kubernetes.default.svc.cluster.local/api/v1/pods'
If this fails, your Prometheus pod lacks proper RBAC permissions. Verify the ServiceAccount, ClusterRole, and ClusterRoleBinding are configured correctly:
kubectl get clusterrolebinding prometheus-server -o yaml
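# Optionally, confirm via impersonation that the ServiceAccount can actually list pods
# (assumes the ServiceAccount is named prometheus-server in the monitoring namespace)
kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:monitoring:prometheus-server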
kubectl describe serviceaccount -n monitoring prometheus-server
Network connectivity issues between Prometheus and targets often cause metrics to disappear. Test connectivity from your Prometheus pod to a missing target:
# Test connectivity to a specific pod
kubectl exec -n monitoring prometheus-server-0 -- wget -qO- http://target-pod-ip:8080/metrics --timeout=10
# Check if network policies block access
kubectl get networkpolicies -A
kubectl describe networkpolicy -n target-namespace policy-name
cAdvisor problems cause container-level metrics to disappear. cAdvisor runs as part of the kubelet and exposes container metrics on port 10250 at /metrics/cadvisor. Verify it's accessible:
# Check cAdvisor endpoint from Prometheus
kubectl exec -n monitoring prometheus-server-0 -- wget -qO- --no-check-certificate https://node-ip:10250/metrics/cadvisor --timeout=10
If this fails with authentication errors, check your Prometheus configuration for proper TLS and bearer token settings:
kubectl get configmap -n monitoring prometheus-server -o yaml | grep -A 10 -B 10 "cadvisor"
Node-exporter issues affect host-level metrics. Verify node-exporter pods are running on all nodes:
kubectl get pods -n monitoring -l app=node-exporter -o wide
kubectl get nodes --no-headers | wc -l
The number of node-exporter pods should match your node count. If pods are missing, check the DaemonSet status and node selectors, as shown below.
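A quick way to do that is to compare the DaemonSet's desired and ready counts and inspect its node selector. The DaemonSet name below is an assumption, so match it to your deployment:
kubectl get daemonset -n monitoring node-exporter
kubectl describe daemonset -n monitoring node-exporter | grep -i "node-selector"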
Grafana Visualization Problems
Grafana problems usually manifest as slow dashboards, "no data" errors, or visualization failures. Most issues trace back to inefficient queries or data source configuration problems.
When dashboards load slowly, start by examining the queries. Open the problematic dashboard, click on a slow panel, and select "Edit." Look for queries without proper time range limits or those requesting too much data:
# Problematic query - no rate() function
container_cpu_usage_seconds_total
# Better query - uses rate() and limits time range
rate(container_cpu_usage_seconds_total[5m])
Use Grafana's query inspector to see actual query execution times. Click the "Query Inspector" button in any panel to see how long each query takes and how much data it returns. Queries returning millions of data points will always be slow.
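When a panel genuinely needs a lot of history, the range-query API shows how the step parameter bounds the number of points returned; Grafana's Min interval and Max data points settings do the equivalent. A sketch with an illustrative hostname and time window:
# Aggregated query over one hour at 60s resolution (~60 points per series)
curl -sG 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))' \
  --data-urlencode 'start=2024-01-15T09:00:00Z' \
  --data-urlencode 'end=2024-01-15T10:00:00Z' \
  --data-urlencode 'step=60s'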
"No data" scenarios often result from incorrect PromQL queries or time range mismatches. Test your queries directly in Prometheus before using them in Grafana:
# Test query in Prometheus first
curl -G 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=up{job="kubernetes-pods"}' \
--data-urlencode 'time=2024-01-15T10:00:00Z'If Prometheus returns data but Grafana shows "no data," check the data source configuration. Go to Configuration → Data Sources in Grafana and test the connection to Prometheus. Verify the URL is correct and accessible from the Grafana pod:
kubectl exec -n monitoring grafana-pod -- wget -qO- http://prometheus:9090/api/v1/query?query=up --timeout=10
Grafana memory and CPU issues cause dashboard loading failures and UI responsiveness problems. Check resource usage and logs:
kubectl top pod -n monitoring grafana-pod
kubectl logs -n monitoring grafana-pod | tail -100
Look for "out of memory" errors or high CPU usage patterns in the logs. Large dashboards with many panels can overwhelm Grafana, especially if queries return large datasets.
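Grafana also exposes an unauthenticated health endpoint that reports whether its backing database is reachable, which is a quick first check when the UI misbehaves. A sketch, assuming the default port 3000 and a wget binary in the image:
kubectl exec -n monitoring grafana-pod -- wget -qO- http://localhost:3000/api/health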
Monitoring Recovery Procedures
When your monitoring stack fails completely, you need rapid recovery procedures to restore visibility. These emergency steps prioritize getting basic monitoring back online quickly.
For Prometheus recovery, start by preserving any existing data before attempting repairs:
# Create a snapshot before making changes
# Requires Prometheus to be started with --web.enable-admin-api
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Check snapshot location
kubectl exec -n monitoring prometheus-server-0 -- ls -la /prometheus/data/snapshots/
If Prometheus won't start due to data corruption, you may need to rebuild the TSDB. This is destructive but sometimes necessary:
# Stop Prometheus
kubectl scale statefulset -n monitoring prometheus-server --replicas=0
# Access the data volume
kubectl run -it --rm debug --image=busybox --overrides='{"spec":{"containers":[{"name":"debug","image":"busybox","volumeMounts":[{"mountPath":"/data","name":"prometheus-data"}]}],"volumes":[{"name":"prometheus-data","persistentVolumeClaim":{"claimName":"prometheus-server"}}]}}'
# Inside the debug pod, backup and clean corrupted data
cd /data
tar czf backup-$(date +%Y%m%d).tar.gz .
rm -rf wal/
rm -rf chunks_head/
# Restart Prometheus
kubectl scale statefulset -n monitoring prometheus-server --replicas=1
For temporary monitoring during outages, deploy a minimal Prometheus instance with basic scraping:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-emergency
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-emergency
  template:
    metadata:
      labels:
        app: prometheus-emergency
    spec:
      serviceAccountName: prometheus-server  # reuse the existing Prometheus RBAC for service discovery
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.retention.time=1h'
            - '--storage.tsdb.retention.size=1GB'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-emergency-config
Create a minimal configuration that only scrapes essential services:
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
This emergency setup provides basic visibility while you repair the main monitoring stack.
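To deploy it, load the scrape config into the ConfigMap the Deployment references and apply the manifest. The file names below are placeholders for wherever you saved the two snippets above:
kubectl create configmap prometheus-emergency-config -n monitoring \
  --from-file=prometheus.yml=prometheus-emergency.yml
kubectl apply -f prometheus-emergency-deployment.yaml
# Quick local access while the main stack is down
kubectl port-forward -n monitoring deployment/prometheus-emergency 9090:9090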
Preventing Future Monitoring Failures
Prevention beats emergency recovery every time. Implement these safeguards to avoid monitoring failures before they happen.
Set appropriate resource limits and requests for all monitoring components. Prometheus needs significant memory for high-cardinality environments:
resources:
  requests:
    memory: "4Gi"
    cpu: "1000m"
  limits:
    memory: "8Gi"
    cpu: "2000m"
Configure health checks that detect monitoring problems before they become critical:
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 30
  timeoutSeconds: 10
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  initialDelaySeconds: 5
  timeoutSeconds: 5
Implement monitoring for your monitoring stack. Create alerts that fire when Prometheus or Grafana show signs of distress:
- alert: PrometheusHighMemoryUsage
  expr: process_resident_memory_bytes{job="prometheus"} > 6e9  # adjust the job label to your self-scrape job name
  for: 5m
  annotations:
    summary: "Prometheus memory usage is high"
    description: "Prometheus is using {{ $value }} bytes of resident memory"
- alert: PrometheusQueryDurationHigh
  expr: prometheus_engine_query_duration_seconds{quantile="0.99"} > 30
  for: 5m
  annotations:
    summary: "Prometheus queries are slow"
Set up automated backups of your monitoring configuration and data:
#!/bin/bash
# Backup script for monitoring configs
kubectl get configmap -n monitoring prometheus-server -o yaml > prometheus-config-$(date +%Y%m%d).yaml
kubectl get configmap -n monitoring grafana-dashboards -o yaml > grafana-dashboards-$(date +%Y%m%d).yaml
# Create Prometheus snapshot
curl -XPOST http://prometheus:9090/api/v1/admin/tsdb/snapshot
Finally, document your recovery procedures and test them regularly. The middle of an outage is not the time to figure out how to restore from backups or rebuild corrupted data. Run quarterly disaster recovery drills where you intentionally break your monitoring stack and practice the recovery procedures.
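A drill can be as simple as taking Prometheus down on purpose in a staging cluster and timing how long your runbook takes to restore scraping. A rough sketch, reusing the object names and addresses from the examples above:
kubectl scale statefulset -n monitoring prometheus-server --replicas=0
start=$(date +%s)
# ...follow your documented recovery runbook here...
kubectl scale statefulset -n monitoring prometheus-server --replicas=1
curl -s http://prometheus:9090/api/v1/targets | jq '[.data.activeTargets[] | select(.health=="up")] | length'
echo "Recovery took $(( $(date +%s) - start )) seconds"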
Container monitoring failures are inevitable, but they don't have to be catastrophic. With proper troubleshooting knowledge, emergency procedures, and preventive measures, you can maintain visibility into your systems even when your observability stack has problems of its own.
If you're tired of debugging your monitoring tools instead of your actual infrastructure, fivenines.io offers a simpler approach. Our lightweight agent collects container metrics every 5 seconds without depending on complex orchestration layers or external dependencies that can fail alongside your stack. It monitors Docker containers out of the box, giving you process-level visibility and resource metrics even when your primary observability platform is having a bad day. Sometimes the best monitoring is the one that just works.