VictoriaMetrics: When Prometheus Runs Out of Room
You've got Prometheus running, it's scraping your endpoints every 15 seconds, Grafana dashboards look great, and alerting is working. Life is good. Then someone asks "can we look at metrics from six months ago?" and you realize your retention is set to 15 days because anything longer made your disk usage uncomfortable.
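For reference, that retention ceiling lives on the Prometheus server itself rather than in prometheus.yml; it's set with a startup flag along these lines:

# Retention is a Prometheus server flag, not a scrape config option
prometheus --config.file=prometheus.yml \
  --storage.tsdb.retention.time=15d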
This is the moment most people discover that Prometheus wasn't really designed for long-term storage. It's fantastic at collecting and querying recent metrics, but keeping months or years of data requires either throwing hardware at the problem or rethinking your architecture. VictoriaMetrics exists largely because of this gap.
What VictoriaMetrics actually is
VictoriaMetrics is a time-series database that can act as long-term storage for Prometheus, a complete replacement for it, or a standalone metrics backend. It speaks PromQL, accepts data via Prometheus remote write, and generally tries to be compatible enough that you can swap it in without rewriting all your queries and dashboards.
The project is open source, with an enterprise tier that layers on extra features. There's a single-node version that handles surprisingly large workloads, and a clustered version that splits ingestion, storage, and querying into separate components for horizontal scaling.
The core pitch is better resource efficiency. VictoriaMetrics uses compression algorithms designed specifically for time-series data, which typically means you can store more metrics for longer using less disk space than Prometheus would need. Whether the difference matters depends on your scale, but at high cardinality or long retention periods, it starts to add up.
The Prometheus relationship
Most people encounter VictoriaMetrics as a solution to Prometheus limitations rather than a replacement for Prometheus itself. The typical setup keeps Prometheus doing what it's good at (service discovery, scraping, short-term queries, alerting via Alertmanager) while offloading storage to VictoriaMetrics via remote write.
The configuration is straightforward. In your prometheus.yml:
remote_write:
- url: http://victoriametrics:8428/api/v1/write
Prometheus continues operating normally, but now every metric it collects also gets sent to VictoriaMetrics. You can query either system, though for historical data you'd point Grafana at VictoriaMetrics instead.
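A quick way to confirm data is landing is to hit the same Prometheus-style query API on the VictoriaMetrics side. The hostname below is the one from this example setup, and the query assumes you actually scrape something named http_requests_total:

curl http://victoriametrics:8428/api/v1/query \
  --data-urlencode 'query=rate(http_requests_total[5m])'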
This architecture lets you keep Prometheus retention short (reducing its resource needs) while maintaining months or years of metrics in VictoriaMetrics. Your dashboards and alerts keep working because PromQL compatibility means the same queries run against both backends.
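If you provision Grafana datasources from files, pointing dashboards at VictoriaMetrics is a one-entry change. A rough sketch, with the file path and names as placeholders:

# e.g. grafana/provisioning/datasources/victoriametrics.yml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus        # VictoriaMetrics speaks the Prometheus HTTP API
    access: proxy
    url: http://victoriametrics:8428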
Running it yourself
The single-node version is almost suspiciously easy to deploy. A Docker one-liner gets you started:
docker run -d -p 8428:8428 \
-v victoria-metrics-data:/victoria-metrics-data \
--name victoriametrics \
victoriametrics/victoria-metrics
That gives you a working instance with a built-in web UI (vmui) at http://localhost:8428/vmui where you can run queries and check ingestion stats. For production you'd want to think about resource limits, backup strategies, and probably running it on dedicated storage, but for evaluation this is enough to start sending metrics and see how it behaves.
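If you want to poke at it before wiring up Prometheus, you can push a sample by hand and read it back. The import and export endpoints below are the ones I'd reach for; the metric itself is made up:

# Push one sample in Prometheus exposition format
curl -d 'demo_metric{env="test"} 42' \
  http://localhost:8428/api/v1/import/prometheus
# Export it back (recently ingested data may take a moment to appear)
curl http://localhost:8428/api/v1/export -d 'match[]=demo_metric'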
The binary installation is similarly minimal: download, run, point it at a data directory. There's no external database dependency, no complex configuration file to write, no cluster coordination to set up. Single-node VictoriaMetrics is genuinely one process with one configuration.
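A minimal sketch of that, with a placeholder path and the flags I'd expect from the docs (worth double-checking against your release):

# Single binary, local storage; retention is in months when no unit suffix is given
./victoria-metrics-prod \
  -storageDataPath=/var/lib/victoria-metrics \
  -retentionPeriod=12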
When single-node isn't enough
The clustered version is a different beast. It splits into three component types: vminsert handles ingestion, vmstorage holds the data, and vmselect runs queries. You deploy multiple instances of each, typically behind load balancers, and they coordinate to distribute the workload.
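Roughly, a two-storage-node cluster looks like the sketch below. Hostnames, ports, and binary names are placeholders, and the exact flags are worth confirming against the cluster docs rather than trusting this outline:

# vmstorage nodes hold the data
./vmstorage-prod -storageDataPath=/var/lib/vmstorage

# vminsert shards incoming writes across the storage nodes
./vminsert-prod -storageNode=storage-1:8400 -storageNode=storage-2:8400

# vmselect fans queries out over the same storage nodes and merges results
./vmselect-prod -storageNode=storage-1:8401 -storageNode=storage-2:8401

In that setup, Prometheus remote_write targets vminsert rather than port 8428; the exact path, which includes a tenant ID, is covered in the cluster docs.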
This architecture scales horizontally, meaning you add capacity by running more instances rather than bigger machines. It's the approach you'd take when single-node performance hits its ceiling or when you need redundancy that a single process can't provide.
The trade-off is operational complexity. Instead of one thing to monitor and maintain, you have a distributed system with multiple components that need to find each other, stay healthy, and handle failures gracefully. The enterprise version adds features around this (like replication and downsampling), but even the open source cluster requires more care and feeding than the single-node setup.
For most small to medium deployments, single-node is probably sufficient for longer than you'd expect. The question of when to go clustered depends heavily on your ingestion rate, query patterns, and reliability requirements.
PromQL compatibility and its limits
VictoriaMetrics implements PromQL and adds some extensions (MetricsQL) that provide additional functions. For standard queries, you can generally copy them from Prometheus to VictoriaMetrics without changes. Grafana dashboards, alerting rules, and recording rules typically work as-is.
That said, "compatible" doesn't mean "identical." There are edge cases where behavior differs, especially around some of the more obscure PromQL features or specific error handling. If you're doing straightforward metric queries, you probably won't notice. If you've built complex recording rules that depend on subtle Prometheus behaviors, it's worth testing before assuming everything transfers cleanly.
The MetricsQL extensions are genuinely useful: functions for label manipulation, additional aggregations, and some quality-of-life improvements that PromQL lacks. Whether you use them depends on how much you want to tie yourself to VictoriaMetrics specifically versus keeping queries portable.
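As one small taste of what MetricsQL adds, WITH templates let you name a subexpression once and reuse it. The metric name here is the standard node_exporter one, and plain Prometheus will reject this syntax:

WITH (
    cpu_busy = 1 - rate(node_cpu_seconds_total{mode="idle"}[5m])
)
avg(cpu_busy) by (instance)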
The broader landscape
VictoriaMetrics isn't the only option in this space. Thanos and Cortex also solve the "Prometheus long-term storage" problem, with different architectural approaches. Thanos uses object storage (S3, GCS) and sidecars. Cortex is a horizontally scalable Prometheus-as-a-service design. InfluxDB is a more general time-series database that predates this whole ecosystem.
Each has trade-offs in complexity, cost, operational burden, and feature set. VictoriaMetrics tends to win on simplicity for the single-node case and on resource efficiency claims, though actual results vary by workload. The right choice depends on what you already have, what you're comfortable operating, and what specific problems you're trying to solve.
Who this makes sense for
If you're running Prometheus and hitting storage limitations, VictoriaMetrics is worth evaluating. The remote write integration means you can try it alongside your existing setup without committing to a full migration. Point a Grafana instance at it, let it collect a few weeks of data, and see if the query performance and storage characteristics work for your needs.
It's also a reasonable choice if you're starting fresh and want something Prometheus-compatible but with better long-term retention out of the box. The single-node deployment is simple enough that the operational overhead is minimal.
For very large-scale deployments, the clustered version competes with other distributed time-series solutions, but at that point you're doing serious evaluation work regardless of which system you choose.
If you're already using a monitoring platform that handles metrics storage for you (whether that's a managed Prometheus service, Datadog, or something simpler like FiveNines for basic server metrics), the value proposition is less clear. VictoriaMetrics solves a specific problem around self-hosted metrics at scale, and if you don't have that problem, you might not need the solution.