Nobody Actually Measures MTTR the Same Way

Every incident retrospective includes someone asking "what was our MTTR?" And every time, there's a brief awkward pause while people figure out what exactly they're measuring. Did we start the clock when the alert fired, or when the server actually went down? Does it end when the service is back, or when we've confirmed users can actually use it?

MTTR (Mean Time to Recovery, or Repair, or Restore, or Respond, depending on who you ask) sounds like a simple metric. Average time from broken to fixed. But the more you dig into it, the more you realize it's less a precise measurement and more a rough indicator that only means something if you're consistent about how you calculate it.

The acronym means at least four different things

This is where MTTR gets annoying. Different teams, different vendors, and different blog posts use it to mean:

Mean Time to Recover/Restore is probably the most common interpretation: clock starts when service degrades, stops when service is back to normal. This is what most people picture when they hear MTTR.

Mean Time to Repair focuses specifically on the fix itself, not detection or verification. Useful if you're trying to isolate "how long does the actual fixing take" from "how long did it take us to notice."

Mean Time to Respond measures how quickly someone starts working on the problem after it's detected. If your alerts fire at 2am and nobody looks until 8am, that six hours shows up here.

Mean Time to Resolve sometimes includes the full incident lifecycle through root cause analysis and permanent fix, not just getting service back up.

None of these is wrong; they're just measuring different things. The important part is picking one definition and sticking with it so you can actually compare numbers over time. An MTTR of 30 minutes means nothing if last month you measured time-to-respond and this month you're measuring time-to-resolve.
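To make that concrete, here's a rough Python sketch of how one incident yields four different numbers depending on which timestamps you subtract. The field names and times are invented for illustration, not taken from any particular tool:

from datetime import datetime, timedelta

# Hypothetical incident record; field names are illustrative only.
incident = {
    "degraded_at":     datetime(2024, 5, 1, 2, 0),   # service starts failing
    "detected_at":     datetime(2024, 5, 1, 2, 10),  # alert fires
    "acknowledged_at": datetime(2024, 5, 1, 2, 40),  # someone starts working
    "restored_at":     datetime(2024, 5, 1, 3, 30),  # service is back
    "resolved_at":     datetime(2024, 5, 2, 15, 0),  # root cause fixed
}

def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60

time_to_respond = minutes(incident["acknowledged_at"] - incident["detected_at"])  # 30 min
time_to_repair  = minutes(incident["restored_at"] - incident["acknowledged_at"])  # 50 min
time_to_restore = minutes(incident["restored_at"] - incident["degraded_at"])      # 90 min
time_to_resolve = minutes(incident["resolved_at"] - incident["degraded_at"])      # 2,220 min

Same incident, four wildly different inputs to "MTTR," which is exactly why mixing definitions makes the metric meaningless.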

The math is simple, the inputs are messy

The formula is just division:

MTTR = Total Downtime / Number of Incidents

Three outages lasting 20, 40, and 30 minutes give you 90 minutes of downtime total; divided by 3, that's a 30-minute average. Easy.
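In code it's just a sum and a division. A minimal sketch, assuming you already have each incident's downtime in minutes:

incident_durations = [20, 40, 30]  # minutes of downtime per incident
mttr = sum(incident_durations) / len(incident_durations)
print(f"MTTR: {mttr:.0f} minutes")  # MTTR: 30 minutes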

The hard part is agreeing on what counts as downtime. If your web server returns 500 errors for 10 minutes, then works intermittently for another 20 minutes while you fix the underlying issue, how much downtime was that? If a cron job fails silently and you don't notice for six hours, does that whole period count even though technically nothing was "down"?

I've seen teams game their MTTR numbers by narrowly defining what counts as an incident. If you exclude "minor" issues or only count complete outages, your MTTR looks great on paper while users still experience plenty of problems. The metric only helps if you're honest about what you're measuring.

What actually moves the number

MTTR breaks down into phases, and improving each one requires different things.

Detection time is often the biggest chunk, especially for problems that don't cause obvious failures. A memory leak that slowly degrades performance over hours isn't going to trigger your uptime check until something actually crashes. Better monitoring with meaningful thresholds catches these earlier. The difference between alerting when memory hits 90% and waiting for an OOM kill can be substantial.
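As a rough illustration, here's a hedged Python sketch of that kind of threshold check on Linux. The 90% threshold and the send_alert stub are assumptions; wire it into whatever notification path you already have:

import time

MEMORY_ALERT_THRESHOLD = 90.0  # percent; alert well before the OOM killer steps in

def read_memory_percent() -> float:
    # Parse /proc/meminfo (Linux only); values are reported in kB.
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])
    used = fields["MemTotal"] - fields["MemAvailable"]
    return 100.0 * used / fields["MemTotal"]

def send_alert(message: str) -> None:
    # Placeholder: send to email, Slack, PagerDuty, or whatever actually wakes someone.
    print(f"ALERT: {message}")

while True:
    percent = read_memory_percent()
    if percent >= MEMORY_ALERT_THRESHOLD:
        send_alert(f"Memory at {percent:.1f}%, investigate before something gets OOM-killed")
    time.sleep(60)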

Diagnosis time depends on whether you can quickly figure out what's actually wrong. This is where having correlated metrics helps: if you can see that CPU spiked, then memory climbed, then response times degraded, then the service crashed, you've got a story. If all you have is "the health check failed," you're starting from scratch. Good logging and metric retention aren't exciting, but they directly affect how long you spend staring at dashboards trying to understand what happened.
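One low-tech version of that "story" is just merging per-metric events into a single timeline. A sketch with invented data, assuming you can export timestamped events from wherever your metrics live:

from datetime import datetime

# Hypothetical exported events: (timestamp, source, what happened).
events = [
    (datetime(2024, 5, 1, 2, 18), "health",  "health check failing"),
    (datetime(2024, 5, 1, 2, 0),  "cpu",     "sustained 95% utilization"),
    (datetime(2024, 5, 1, 2, 12), "latency", "p95 response time above 2s"),
    (datetime(2024, 5, 1, 2, 5),  "memory",  "RSS climbing past 80%"),
]

# Sorting by timestamp turns four separate dashboards into one narrative.
for ts, source, description in sorted(events):
    print(f"{ts:%H:%M}  [{source}] {description}")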

Repair time varies enormously based on what broke and how your infrastructure handles it. A crashed process that systemd restarts automatically has near-zero repair time. A corrupted database that needs manual recovery could take hours. The obvious advice here is "automate recovery where possible," but the less obvious part is knowing what's worth automating. If something fails once a year, a documented runbook might be more practical than building self-healing infrastructure.

Verification often gets forgotten. Service is back, metrics look normal, let's call it done and go back to sleep. Then users report it's still broken because you fixed the symptom but not the cause, or because the fix only worked for some percentage of requests. Building in actual verification (not just "the process is running" but "users can complete transactions") extends your MTTR measurement but gives you a more honest number.
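For comparison, here's what verification might look like as code rather than a glance at a process list. The endpoints are hypothetical; the point is that "recovered" means every user-facing check passes, not just the health check:

import urllib.request

# Hypothetical user-facing checks; substitute whatever "users can actually
# use it" means for your service.
CHECKS = [
    ("health endpoint responds", "https://example.com/healthz"),
    ("login page renders",       "https://example.com/login"),
    ("checkout API answers",     "https://example.com/api/checkout/ping"),
]

def verify_recovery() -> bool:
    all_ok = True
    for name, url in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                ok = response.status == 200
        except OSError:  # connection refused, timeout, DNS failure, non-2xx response
            ok = False
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
        all_ok = all_ok and ok
    return all_ok

print("recovered" if verify_recovery() else "still broken for users")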

The meta-problem with MTTR

Here's the thing about MTTR as a metric: optimizing for it can lead you somewhere weird. If your goal is lowest possible MTTR, the rational move is to restart services aggressively, apply quick patches, and defer root cause analysis until later. Get the number down, deal with the underlying problem tomorrow.

This works fine for transient issues but creates a pattern where you're constantly recovering from the same problems without actually fixing them. Your MTTR looks good, your incident count keeps climbing, and the team is stuck in reactive mode.

Some teams track MTTR alongside incident frequency for this reason. A 15-minute MTTR is great if you have two incidents a month. It's a red flag if you have two incidents a day. The combination tells you more than either number alone.
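Extending the earlier calculation to carry both numbers is trivial. A sketch with invented monthly data:

# Invented numbers: downtime in minutes for each incident in a month.
months = {
    "April": [20, 40, 30],
    "May":   [12, 18, 10, 15, 14, 9, 16, 11, 13, 12, 17, 10],
}

for month, durations in months.items():
    mttr = sum(durations) / len(durations)
    print(f"{month}: MTTR {mttr:.0f} min across {len(durations)} incidents")

May's 13-minute MTTR looks better than April's 30 until you notice it took four times as many incidents to get there.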

What this looks like in practice

If you're running a handful of servers or services, you probably don't need a formal MTTR tracking system. What you need is monitoring that tells you when something breaks, enough context to figure out why, and alerts that reach someone who can act on them. Tools like fivenines handle the basics (server health, uptime checks, cron jobs) without requiring you to set up a whole observability stack.

The goal isn't achieving some target MTTR number. It's reducing the time your users spend experiencing problems, which sometimes means faster recovery but more often means preventing incidents in the first place.
