High Cardinality Metrics Are Breaking Your Monitoring Budget (And Your Database)

When your Prometheus server needs 64GB RAM just to track which user clicked which button, you've got a cardinality problem.

Someone adds a user_id label to track user behavior, and suddenly metrics storage balloons from gigabytes to hundreds of gigabytes, queries start timing out, the monitoring bill triples overnight, and your time series database starts throwing out-of-memory errors like confetti.

High cardinality metrics aren't just expensive; they're the fastest way to kill a monitoring system that was working perfectly fine yesterday.

What High Cardinality Actually Costs

Every unique combination of label values creates a new time series. That sounds innocent until you do the math.

Let's say you're tracking HTTP requests with these labels:

  • method (5 values: GET, POST, PUT, DELETE, PATCH)
  • endpoint (100 unique API endpoints)
  • status_code (20 common codes)
  • user_id (10,000 active users)

That's 5 × 100 × 20 × 10,000 = 100 million time series. Each series needs memory for its labels, recent samples, and indexes. In Prometheus, you're looking at roughly 1-3KB per series in memory, plus storage overhead.

So your innocent little HTTP counter just consumed 100GB+ of RAM and several terabytes of disk space. And that's before you consider retention, compression ratios, or the fact that Prometheus needs headroom to operate.
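
If you want to sanity-check a label set before it ships, the arithmetic is easy to script. Here's a minimal Python sketch, assuming the label counts above and a 2KB-per-series midpoint of that rough 1-3KB figure:

# Rough cardinality estimate: every combination of label values is its own series.
label_values = {"method": 5, "endpoint": 100, "status_code": 20, "user_id": 10_000}

series = 1
for count in label_values.values():
    series *= count

bytes_per_series = 2_000  # assumed midpoint of the ~1-3KB-per-series figure
print(f"{series:,} series, ~{series * bytes_per_series / 1e9:.0f} GB of RAM")
# -> 100,000,000 series, ~200 GB of RAM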

The problem gets worse at query time. Time series databases are optimized for append-only writes across many series, not for searching them, so once you have millions of series even basic operations like label queries become expensive, and your {user_id="12345"} query now has to scan through millions of label sets.

The Label Explosion That Kills Databases

Here's where most teams go wrong: they treat metrics like logs. They want to slice and dice by every possible dimension, so they add labels for everything.

I've seen metrics with labels like:

  • request_id (unique per request)
  • session_id (unique per session)
  • file_path (thousands of unique files)
  • sql_query (with parameter values baked in)
  • customer_name (growing with each new customer)

Each of these can explode your cardinality. A request_id label makes every single request a unique time series. You're not monitoring patterns anymore. You're just creating an expensive, slow database of individual events.
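
For illustration, here's what that anti-pattern tends to look like with the Python prometheus_client (the handler and metric names are made up): every call with a fresh request_id mints a brand-new series instead of incrementing an existing one.

from prometheus_client import Counter

# Anti-pattern: request_id is unique per request, so this counter never
# increments an existing series; it creates a new one on every call.
requests = Counter("http_requests_total", "HTTP requests", ["method", "request_id"])

def handle_request(method: str, request_id: str) -> None:
    requests.labels(method=method, request_id=request_id).inc()  # cardinality grows without bound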

Prometheus will eventually start dropping metrics when it hits memory limits. But by then, the damage is done. Your monitoring system is unreliable, your queries are slow, and you're paying cloud bills that would make a CFO cry.

The worst part? You probably don't need most of these dimensions for monitoring. You need them for debugging specific requests, but that's what logs are for.

Strategic Metric Design

Good metrics answer questions about system behavior, not individual events. Instead of tracking every user's clicks, track click rates by user type or region. Instead of monitoring every SQL query with its parameters, monitor query patterns by table or operation type.

Here's how to approach metric design:

Bound label values upfront. If a label can have unlimited values, it will. Use techniques like these (a quick sketch follows the list):

  • Bucketing numeric values (latency_bucket instead of raw latency)
  • Grouping by category (endpoint_group instead of individual endpoints)
  • Using "other" for long-tail values

Question every label. Ask yourself: "Will I alert on this dimension? Will I dashboard it?" If you just want to filter by it occasionally, you don't need it as a metric label. Put it in logs instead.

Normalize dynamic values. Replace user IDs, request IDs, and timestamps in labels with stable categories. A user_type label (premium, free, admin) is infinitely more useful than individual user IDs for monitoring.

Use recording rules for expensive aggregations. If you're constantly summing across high-cardinality metrics, pre-compute those sums with recording rules. It's cheaper to store the result than to compute it every time someone loads a dashboard.

For example, instead of this cardinality bomb:

http_requests_total{method="GET", endpoint="/api/users/12345", user_id="12345"}

Consider this bounded alternative:

http_requests_total{method="GET", endpoint_group="user_api", user_tier="premium"}

You lose some granularity, but you gain a monitoring system that actually works.
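
Wired up with the Python prometheus_client, the bounded version might look like the sketch below; deriving user_tier from your auth layer rather than the raw user ID is an assumption about how your app is structured.

from prometheus_client import Counter

http_requests = Counter(
    "http_requests_total",
    "HTTP requests by bounded dimensions",
    ["method", "endpoint_group", "user_tier"],  # every label has a small, known value set
)

def record_request(method: str, path: str, user_tier: str) -> None:
    # endpoint_group() is the grouping helper sketched earlier; it caps the label's value set.
    http_requests.labels(
        method=method,
        endpoint_group=endpoint_group(path),
        user_tier=user_tier,
    ).inc()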

When to Sample Instead of Collect Everything

Sometimes you need high-cardinality data for debugging, but you don't need 100% of it for monitoring. That's where sampling comes in.

For trace data or detailed request metrics, consider collecting a statistical sample rather than every single event. Tools like OpenTelemetry support probabilistic sampling that gives you enough data to spot patterns without drowning your storage.
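
With the OpenTelemetry Python SDK, head sampling is a small configuration choice. This sketch keeps roughly 1% of traces; the ratio and the span name are arbitrary examples.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~1% of new traces; child spans follow their parent's sampling decision.
trace.set_tracer_provider(TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01))))
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("handle_request"):
    pass  # roughly 99% of requests never produce stored span data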

You can also use different retention policies for different cardinality levels. Keep high-cardinality metrics for hours or days, but keep low-cardinality aggregates for months or years.

For user behavior tracking, consider approximation algorithms like HyperLogLog for unique counts or reservoir sampling for representative samples. They're not perfect, but they're good enough for monitoring and orders of magnitude cheaper.
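
Reservoir sampling is simple enough to sketch in a few lines of Python; this is the classic "Algorithm R", which keeps a uniform random sample of k items from a stream of unknown length:

import random

def reservoir_sample(stream, k: int) -> list:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = random.randint(0, i)  # item survives with probability k / (i + 1)
            if j < k:
                sample[j] = item
    return sample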

The Real Cost of Getting This Wrong

I've worked with teams spending $50,000+ per month on monitoring infrastructure because they treated every metric like a log entry. Their queries took minutes to run. Their dashboards timed out. Their alerting was unreliable because the database couldn't keep up.

The fix wasn't upgrading to bigger servers. It was redesigning their metrics with cardinality limits from day one. They went from hundreds of millions of time series down to tens of thousands, and their monitoring actually became more useful, not less.

High cardinality metrics feel like they give you more visibility, but they often do the opposite. When everything is a unique snowflake, nothing stands out. When your database is constantly under memory pressure, your monitoring becomes unreliable exactly when you need it most.

Start with the question you want to answer, then design the minimum set of labels needed to answer it. Your infrastructure budget (and your on-call engineer) will thank you.

If you're looking for monitoring that's designed to avoid these cardinality traps, tools like fivenines focus on the metrics that actually matter for keeping servers healthy, without the label explosion that breaks traditional monitoring stacks.