Your Disk Will Warn You Before It Dies (If You're Listening)

Most disk failures aren't sudden. The drive doesn't work perfectly on Tuesday and explode on Wednesday. There's usually a warning period: SMART errors accumulating, reallocated sectors ticking upward, I/O latency creeping higher. The problem is that nobody's looking at these signals until something breaks, and by then the warning period is over.

Disk monitoring on Linux isn't complicated, but it requires actually doing it. The tools are built in or a package install away. The challenge is knowing what to look at, understanding what the numbers mean, and ideally automating the whole thing so you don't have to remember to check manually.

I/O metrics tell you if your disk is the bottleneck

When a server feels slow, the disk is often the culprit, but people check CPU and memory first because those numbers are easier to interpret. Disk I/O requires a bit more context to understand.

The tool you want is iostat, which comes from the sysstat package. Running iostat -xz 1 gives you a continuously updating view of what your disks are actually doing:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.06    0.00    1.78    2.18    0.00   82.98

Device            r/s     w/s   %util   r_await   w_await
nvme0n1          0.00  197.00   55.92      0.00      1.74
nvme1n1          0.00  197.00   57.84      0.00      1.92

The %iowait in the CPU section shows how much time the processor spends waiting for disk operations to complete. This is often misunderstood: high iowait doesn't mean the CPU is busy, it means the CPU is idle because it's waiting on storage. If you see 20% iowait alongside 60% idle, your disk is the bottleneck, not your processor.
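
If you just want to watch iowait over time without the per-device columns, vmstat (part of procps, installed almost everywhere) prints it in the wa column:

# One-second samples; the "wa" column is the iowait percentage
vmstat 1 5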

On the iostat device lines, %util shows the percentage of time the disk had at least one request in flight. For a spinning disk, a %util near 100% means the device is saturated and new requests have to queue; for NVMe and SSDs, which service many requests in parallel, 100% only means the device was never idle, so read it alongside latency rather than on its own. The r_await and w_await columns show average latency for read and write operations in milliseconds. For an NVMe drive, you'd expect these to be under 1ms under normal load. For spinning disks, 10-20ms might be normal. When await times climb well above that baseline, the disk is struggling to keep up.

If you need to figure out which process is hammering the disk, iotop shows per-process I/O in real time. It's useful for catching that backup job or log rotation that's saturating your storage, but it requires root and can itself add some overhead, so it's more of a diagnostic tool than something you'd run continuously.
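
A typical invocation, assuming the iotop (or iotop-c) package is installed:

# Only processes actually doing I/O, grouped per process, with totals
# accumulated since iotop started (needs root)
sudo iotop -oPa

# Batch mode is handy for capturing a few snapshots to a log
sudo iotop -boPa -n 3 > /tmp/io-snapshot.txt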

Space monitoring is simple but still catches people

Running out of disk space is embarrassing because it's so predictable. Disks don't suddenly fill up (usually). They fill gradually, and df -h has been telling you about it for weeks if you'd bothered to look.

Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3  455G  204G  229G  48% /
/dev/nvme0n1p1  974M  180M  795M  19% /boot

The Use% column is what matters. Set an alert at 80% or 85%, and you'll have time to clean up or expand storage before hitting 100%. The failure mode when a filesystem fills completely depends on what's using it: databases crash, applications fail to write logs, and sometimes the system itself becomes unresponsive because it can't write to temp files.
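
A quick way to see whether anything is already over the line (GNU df and awk assumed; the 85 is whatever threshold you've picked):

# Print any real filesystem at or above 85% used
df -h -x tmpfs -x devtmpfs | awk 'NR > 1 && int($5) >= 85'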

When you do hit space issues, du -sh /path/* helps you find what's consuming it. Log files are the usual suspect, particularly if something is logging errors in a loop. I've seen /var/log consume 50GB overnight because an application couldn't connect to a database and logged the failure every second.
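
To get a ranked list instead of eyeballing it (GNU du and sort assumed):

# Ten largest directories directly under /var, without crossing filesystems
du -xh --max-depth=1 /var 2>/dev/null | sort -rh | head -n 10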

One subtlety worth knowing: Linux reserves some space (typically 5%) for root on ext4 filesystems. A disk showing 100% used in df might actually have a few gigabytes available if you're root. This is intentional, so the system can still function enough to let you fix things, but it also means the "full" disk isn't quite as full as it appears.
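
You can see that reservation with tune2fs, and shrink it on data-only filesystems where a root reserve buys you little (/dev/sdb1 below is a placeholder; the first device matches the df output above):

# Show the reserved block count on the root filesystem
sudo tune2fs -l /dev/nvme0n1p3 | grep -i 'reserved block count'

# On a data-only filesystem, a 1% reserve is usually plenty; leave / alone
sudo tune2fs -m 1 /dev/sdb1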

RAID status is fine until it isn't

RAID gives you redundancy, which means a disk can fail without causing an outage. The catch is that you're now running degraded with no redundancy until you replace the failed disk and rebuild. If a second disk fails during that window, you lose data. This makes knowing about RAID status important: not because a degraded array is an emergency, but because you need to fix it before it becomes one.

For Linux software RAID (mdadm), the quick check is /proc/mdstat:

md2 : active raid1 nvme1n1p3[1] nvme0n1p3[0]
      994827584 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/2] [UU]

The [2/2] [UU] part is what you care about. The first bracket shows how many devices are present out of how many are expected. The second shows their status: U means up, underscore means down. A healthy two-disk mirror shows [2/2] [UU]. A degraded one shows [2/1] [U_] or similar.
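
That makes a degraded check easy to script: any underscore inside the status brackets means a missing member.

# Warn if any md array has a missing member ([U_], [_U], and so on)
if grep -Eq '\[[U_]*_[U_]*\]' /proc/mdstat; then
    echo "WARNING: degraded RAID array on $(hostname)"
fi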

If you're rebuilding after replacing a failed disk, you'll see a progress line showing percentage complete and estimated time remaining. Rebuilds can take hours or days depending on array size and I/O load, and during that entire time you're vulnerable to a second failure. Some people reduce I/O load during rebuilds to speed them up; others just accept the risk and let the rebuild run at whatever pace it manages.
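
In /proc/mdstat that progress line looks roughly like this (numbers invented for illustration):

      [=====>...............]  recovery = 27.4% (272583168/994827584) finish=81.7min speed=147344K/sec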

For more detail on a specific array, mdadm --detail /dev/md0 shows the full picture: which physical devices are members, their state, when the array was last updated, and whether any devices have failed. Hardware RAID controllers have their own tools (megacli for LSI, arcconf for Adaptec, etc.) with different syntax but similar concepts.
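
mdadm can also watch arrays for you. Most distros ship an mdmonitor service that runs mdadm --monitor at boot; point it at a mail address and verify it fires (the address is a placeholder, and the config lives at /etc/mdadm/mdadm.conf on Debian-family systems, /etc/mdadm.conf elsewhere):

# In mdadm.conf: where degraded/failed-device alerts should go
MAILADDR admin@example.com

# One-off run: check all arrays once and send a test alert per array
sudo mdadm --monitor --scan --oneshot --test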

SMART data is your early warning system

SMART (Self-Monitoring, Analysis, and Reporting Technology) is the disk's own assessment of its health. Drives track dozens of internal metrics and will tell you about them if you ask. The tool is smartctl from the smartmontools package.

smartctl -a /dev/nvme0

The output is verbose, but a few fields matter more than others. The overall health assessment at the top is the disk's pass/fail self-evaluation. If this says anything other than PASSED, take it seriously.
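
If the verdict is all you want, -H skips the rest of the report:

# Health self-assessment only
sudo smartctl -H /dev/nvme0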

For SSDs and NVMe drives, look at:

Percentage Used shows how much of the drive's rated write endurance has been consumed. SSDs have a finite number of writes before cells wear out. A drive at 48% used has roughly half its lifespan remaining, assuming your write patterns stay consistent. When this approaches 100%, the drive is nearing end of life even if it still appears to work.

Available Spare indicates how much reserve capacity remains for replacing worn-out cells. When this gets low, the drive is running out of room to compensate for wear.

Media and Data Integrity Errors should be zero. Any non-zero value means the drive has experienced unrecoverable errors, which is bad.

Temperature affects both performance and longevity. NVMe drives throttle when they get too hot, and sustained high temperatures accelerate wear. If your drive is consistently running above 70°C, improve cooling or airflow.
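
A quick way to pull just those fields out of the full report (exact label text can vary a little between smartctl versions):

# The NVMe health fields discussed above
sudo smartctl -a /dev/nvme0 | grep -E 'Percentage Used|Available Spare|Media and Data Integrity Errors|Temperature'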

For traditional spinning disks, the concerning SMART attributes are Reallocated_Sector_Ct (sectors that have gone bad and been remapped to spares), Current_Pending_Sector (sectors that couldn't be read and are waiting to be remapped), and Offline_Uncorrectable (sectors that failed during background scans). Any non-zero values in these fields indicate physical media problems. A few reallocated sectors might be fine and the drive could run for years, but a rising count over weeks or months suggests progressive failure.
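
For a SATA disk the same check looks like this (/dev/sda is a placeholder; watch the RAW_VALUE column, and track it over time rather than reacting to a single reading):

# The three attributes most worth watching on spinning disks
sudo smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'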

The challenge with SMART is that it's predictive, not deterministic. A drive with perfect SMART data can still fail suddenly from controller issues or other problems that SMART doesn't monitor. And a drive with some warnings might keep working for years. SMART shifts the odds in your favor by catching gradual degradation, but it's not a guarantee.
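
If you'd rather not remember to run smartctl, the same package ships smartd, a daemon that polls SMART on a schedule and can mail you when something trips. A minimal config sketch (the file is /etc/smartd.conf or /etc/smartmontools/smartd.conf depending on distro, and the address is a placeholder):

# Monitor every device smartd can find with the default set of checks,
# and mail this address when one of them fails
DEVICESCAN -a -m admin@example.com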

Manual monitoring doesn't scale

All of these commands work fine for checking on a server that's misbehaving. They don't work for making sure nothing is wrong across a fleet of servers when you're not actively looking. The failure mode is predictable: you set up monitoring, check it diligently for a few weeks, then gradually stop checking as other priorities take over. Six months later a disk fails and you realize you haven't looked at SMART data since January.

Automated monitoring solves this by checking continuously and only bothering you when something needs attention. The basic requirements are: collect the metrics on a schedule, store them somewhere so you can see trends, and alert when thresholds are crossed.

You can build this yourself with scripts, cron jobs, and whatever alerting system you already use. Plenty of people do. The downside is that you're now maintaining monitoring infrastructure in addition to the servers you're actually trying to monitor.
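
The shape of a DIY setup is usually a handful of small scripts that print only when something is wrong, plus cron to run them and mail the output. A sketch (the script paths are placeholders for wrappers around the one-liners shown earlier; MAILTO and the schedule are yours to pick):

# /etc/cron.d/disk-checks: cron mails anything these jobs print to MAILTO
MAILTO=admin@example.com

# Hourly space check; daily RAID and SMART checks
0 * * * *   root  /usr/local/sbin/check-disk-space
30 6 * * *  root  /usr/local/sbin/check-raid-status
35 6 * * *  root  /usr/local/sbin/check-smart-health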

FiveNines (the company whose blog you're reading) handles disk metrics as part of its server monitoring. The agent collects I/O stats, space usage, and can optionally monitor RAID and SMART data. You get dashboards showing trends over time and alerts when things cross thresholds you set. Whether that's worth paying for versus building your own depends on how much you value your time and how many servers you're managing.

The important thing isn't which tool you use; it's that something is watching continuously so you find out about problems before users do. A disk at 92% capacity is a task to handle this week. A disk at 100% capacity at 3am is an incident.

The metrics that actually predict failures

If you're setting up alerting and need to prioritize, these are the signals most likely to catch real problems:

Disk space above 85% means you're getting close. Above 95% means you're in danger. Some applications (especially databases) behave badly well before hitting 100%, so don't wait until the last minute.

I/O utilization sustained above 90% indicates saturation. Brief spikes are normal; sustained saturation means you need faster storage, less I/O load, or both.

SMART warnings of any kind on the overall health assessment deserve immediate attention. Individual attributes like reallocated sectors are worth watching over time, with the trend mattering more than the absolute number.

RAID degraded status should page someone. You're running without redundancy until it's fixed, and the rebuild itself stresses the remaining disks.

Everything else is useful context for troubleshooting but doesn't necessarily need to wake anyone up. High I/O latency matters if users are experiencing slowness; it's just data if performance is fine. Temperature spikes matter if they're causing throttling; they're informational if they're within spec.

The goal is catching the problems that would otherwise become outages, without generating so many alerts that you start ignoring them. Start with fewer alerts at conservative thresholds, then add more based on what incidents you miss.
