Mastering Disk Monitoring on Linux: From Manual Checks to Automated Alerts

Effective disk monitoring is at the heart of reliable server operations. Disks are the foundation of your data and application performance; overlook them, and you risk unexpected downtime, data loss, or sluggish systems. Whether you’re managing a single server or a fleet, understanding how to monitor disk IO, capacity, RAID status, and disk health is essential. In this article, we’ll guide you through the key Linux commands for manual monitoring and show you how to automate this process with fivenines.io for peace of mind and operational excellence.

Why Disk Monitoring Matters

Disks are often a silent point of failure. A full partition, failing RAID array, or unnoticed disk error can escalate quickly from a minor warning to a major outage. Proactive monitoring lets you:

  1. Detect and address problems before they impact your users.
  2. Optimize performance by identifying IO bottlenecks.
  3. Prevent data loss by catching hardware failures early.

Let’s break down the four critical areas of disk monitoring and see how to tackle each—first manually, then with automation.

1. IO Metrics Monitoring

What & Why

IO (Input/Output) metrics reveal how busy your disks are. High IO wait times or overloaded disks can slow down your entire server.

Manual Monitoring with Linux Commands

  • iostat (from the sysstat package):
    Shows device utilization and IO statistics. Some key fields to watch:
    • %iowait: How much time the CPU is waiting for IO. A high %iowait indicates your CPUs are sitting idle because tasks are blocked on slow or saturated disk I/O, pointing to a storage bottleneck.
    • %util: Percentage of time the device had at least one request in service. Near 100% means you are at the device’s IOPS or throughput ceiling; new requests can only pile up.
    • *_await: Average wait time for IO requests. A high r_await, w_await, or f_await means read, write, or flush requests are taking a long time from queue entry to completion, flagging end-to-end storage latency and a likely bottleneck in the underlying disk path.
    • r/s, w/s, f/s: Read, write, and flush operations per second. High values signal a heavy I/O workload that can tax the storage subsystem’s throughput and IOPS capacity.
$ iostat -xz 1
Linux 6.1.0-17-amd64 (Debian-bookworm-latest-amd64-base) 	05/19/2025 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.06    0.00    1.78    2.18    0.00   82.98

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
md2              0.00      0.00     0.00   0.00    0.00     0.00  116.80   1343.20     0.00   0.00    0.11    11.50    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.01  39.92
nvme0n1          0.00      0.00     0.00   0.00    0.00     0.00  197.00   1343.70    16.60   7.77    1.74     6.82    0.00      0.00     0.00   0.00    0.00     0.00   96.80    3.41    0.67  55.92
nvme1n1          0.00      0.00     0.00   0.00    0.00     0.00  197.00   1343.70    16.60   7.77    1.92     6.82    0.00      0.00     0.00   0.00    0.00     0.00   96.80    3.80    0.75  57.84
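
If you prefer a scripted spot check over reading the table by eye, a rough one-liner like the sketch below works: it takes the second iostat report (the first one reflects averages since boot) and prints any device whose %util exceeds an illustrative 90% threshold. Adjust the threshold to whatever makes sense for your hardware.

$ iostat -dxz 1 2 | awk '/^Device/ {r++} r == 2 && $NF+0 > 90 {print $1 " is " $NF "% busy"}'
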
  • iotop (per-process/-thread I/O monitor):
    Real-time view of which processes are causing disk IO. Key fields to watch:
    • DISK READ / DISK WRITE: These columns show the current read or write throughput for each task; when the numbers are large, that task is the main source of bandwidth load.
    • IO%: This value is the proportion of the last sampling interval that the task spent blocked on disk I/O; when IO% is above roughly 90%, the task is mostly waiting for slow or saturated storage rather than doing useful work.
    • SWAPIN%: This field reports the fraction of time the task spent reading pages in from swap; any non-zero number indicates the slowdown is coming from memory pressure and paging, not ordinary file I/O.
    • Total READ / Total WRITE: These cumulative counters show how many bytes the task has moved since it appeared in the list; large totals combined with only modest momentary rates flag long-running jobs that quietly move a great deal of data, such as backups or rsync.
    • PRIO: This column displays the task’s I/O priority class as set by ionice: real-time (rt), best-effort (be), or idle (id). A high-throughput task running in the rt class can starve other processes, while an idle-class task will yield under contention. Monitor these few counters to see who is hitting your disks, how hard, and why, complementing the device-level view that iostat gives.
$ iotop
Total DISK READ:         0.00 B/s | Total DISK WRITE:      1043.41 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:    1130.98 K/s
    TID  PRIO  USER     DISK READ DISK WRITE>    COMMAND                                                                                               
3079891 be/4 postgres    0.00 B/s  317.08 K/s postgres: 17/main: postgres five_nines_production [local] COMMIT
4145683 be/4 postgres    0.00 B/s   66.37 K/s postgres: 17/main: checkpointer
2746665 be/4 postgres    0.00 B/s   51.62 K/s postgres: 17/main: postgres five_nines_production [local] idle
2746674 be/4 postgres    0.00 B/s   51.62 K/s postgres: 17/main: postgres five_nines_production [local] idle
2746508 be/4 postgres    0.00 B/s   44.24 K/s postgres: 17/main: postgres five_nines_production [local] idle
4145686 be/4 postgres    0.00 B/s   29.50 K/s postgres: 17/main: walwriter
1738103 be/4 caddy       0.00 B/s    7.37 K/s caddy run --environ --config /etc/caddy/Caddyfile
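
iotop also has a non-interactive batch mode, which is handy for logging I/O offenders over time or capturing a sample during an incident. A minimal example (the flag combination and the log file name are just one reasonable choice): -b runs in batch mode, -o shows only tasks actually doing I/O, -qqq suppresses the repeated headers, and -d 5 -n 12 collects twelve samples five seconds apart.

$ sudo iotop -boqqq -d 5 -n 12 >> iotop-samples.log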

2. Disk Space Monitoring

What & Why

Running out of disk space can crash applications, databases, and even your OS.

Manual Monitoring with Linux Commands

  • df -h:
    Displays disk space usage in a human-readable format.
    • Filesystem: This is the device, logical volume, or network share that is mounted; it tells you which physical or virtual storage resource you are looking at.
    • Size: This column shows the total capacity of that filesystem, expressed with human-friendly units such as GiB or TiB.
    • Used: Here you see how much of that capacity is already occupied by data, again in human-readable units.
    • Avail: This value shows the space still free and available for non-root users; it excludes any blocks reserved for the super-user.
    • Use%: This is simply the ratio of Used to Size expressed as a percentage; a value climbing past 80–90% warns that the filesystem is running short on space.
    • Mounted on: This is the directory (mount point) where the filesystem is attached in the hierarchy, so you know which path in the OS corresponds to that storage.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
dev              15G     0   15G   0% /dev
run              15G  4.5M   15G   1% /run
efivarfs        128K   13K  111K  11% /sys/firmware/efi/efivars
/dev/nvme0n1p3  455G  204G  229G  48% /
tmpfs            15G  1.1M   15G   1% /dev/shm
tmpfs           1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs           1.0M     0  1.0M   0% /run/credentials/systemd-resolved.service
tmpfs            15G   44M   15G   1% /tmp
/dev/nvme0n1p1  974M  180M  795M  19% /boot
tmpfs           2.9G  8.0K  2.9G   1% /run/user/1000
tmpfs           1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
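
For a quick scripted version of the same check, you can ask df for just the usage percentage and mount point and flag anything above a threshold. A minimal sketch, assuming GNU coreutils df (for --output) and an illustrative 90% cutoff:

$ df --output=pcent,target | awk 'NR > 1 && $1+0 > 90 {print $2 " is " $1 " full"}'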
  • du -sh /path/to/directory:
    Shows space used by a specific directory.
$ du -sh /var/log/*
194M	/var/log/atop
4.0K	/var/log/audit
2.0G	/var/log/btmp
140M	/var/log/caddy
444K	/var/log/grafana
4.1G	/var/log/journal
8.0K	/var/log/lastlog
4.0K	/var/log/old
220K	/var/log/pacman.log
4.0K	/var/log/private
0	/var/log/README
4.0K	/var/log/sa
24M	/var/log/wtmp
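
When a filesystem is filling up, a common next step is to walk down from the mount point one level at a time and sort by size. A typical pattern (the path and depth are just examples; -x keeps du on a single filesystem, and sort -h understands the human-readable units):

$ sudo du -xh --max-depth=1 /var | sort -h | tail -n 10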

3. RAID Monitoring

What & Why

RAID (Redundant Array of Independent Disks) adds resilience, but only if you know when a disk fails or the array is degraded.

Manual Monitoring with Linux Commands

  • cat /proc/mdstat (Linux software-RAID status)
    Displays the live condition of every md (multiple-device) array. Key items to understand:
    • Personalities: This header lists the RAID levels the running kernel supports (for example [raid1] [raid10]), so you know which configurations are available on the host.
    • mdX: Each block starting with md0, md1, etc. is an individual array; the line tells you whether it is active or inactive, which RAID level it uses, and which component devices (e.g. sdb1[1]) currently form the set.
    • blocks … [N/M]: The “blocks” figure gives the array’s usable size, while the bracket shows how many devices are present out of the expected total—for instance [2/2] means both members of a two-disk mirror are online.
    • [UU_] health bitmap: Immediately after the device count you may see letters such as UU or U_; each “U” means that member is up-to-date, while an underscore marks a failed or rebuilding drive.
    • resync / recovery / reshape lines: If a background operation is in progress, an indented line reports its type, percentage complete, blocks processed, estimated finish time, and current speed; monitoring this tells you how long redundancy will be degraded.
    • bitmap, journal, and cluster notes: Additional tags (for example bitmap:internal) indicate optimisations or special modes that affect rebuild speed and consistency guarantees.
$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 nvme1n1p3[1] nvme0n1p3[0]
      994827584 blocks super 1.2 [2/2] [UU]
      bitmap: 8/8 pages [32KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      1046528 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 nvme0n1p1[0] nvme1n1p1[1]
      4189184 blocks super 1.2 [2/2] [UU]
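
Because /proc/mdstat is plain text, a degraded array is easy to catch from a script or cron job: a failed or missing member shows up as an underscore inside the health bitmap. A rough sketch (it assumes underscores only ever appear in that bitmap, which holds for typical device names, and stays silent when everything is healthy; the echo stands in for whatever notification you use):

$ grep -q '\[.*_.*\]' /proc/mdstat && echo "WARNING: a RAID array looks degraded"
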
  • mdadm --detail /dev/md0:
    Detailed info for a specific RAID device. Shows full metadata and health information for one md device. Key fields to watch:
    • State: Indicates the array’s current health—values such as clean, active, degraded, recovering, or resyncing tell you immediately whether redundancy is intact or maintenance is under way.
    • Raid Level: Confirms the layout (RAID1, RAID10, RAID5, etc.), which defines both fault-tolerance and performance characteristics.
    • Array Size / Used Dev Size: “Array Size” is the total usable capacity, while “Used Dev Size” shows how much of each member disk participates; a mismatch can expose drives of differing size or partially reused drives.
    • Total / Active / Working / Failed / Spare Devices: These counters give a numeric snapshot of array membership; anything other than zero under Failed Devices or a gap between Total and Active means you are running degraded.
    • Device table (Major Minor RaidDevice State): Lists every member drive with its role (active sync, spare, faulty); missing entries or roles marked spare rebuilding reveal failures or ongoing rebuilds.
    • Events: An ever-increasing counter of metadata-changing operations; sudden jumps can correlate with re-syncs, device drops, or reshapes noted in syslog.
    • Resync/Recover line: Appears only during rebuilds and shows percentage complete, current speed, and estimated finish time, so you can gauge how long redundancy will remain degraded.
    • UUID: A globally unique identifier that lets you unambiguously match this array across reboots and in configuration files.
$ mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Mon Jan 29 13:34:09 2024
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon May 19 01:09:22 2025
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:0
              UUID : 24e0a417:0f1c014f:29027857:19b73a8e
            Events : 54

    Number   Major   Minor   RaidDevice State
       0     259        1        0      active sync   /dev/nvme0n1p1
       1     259        5        1      active sync   /dev/nvme1n1p1
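
The same health information can be pulled from a script by grepping the fields that matter most; the output lines below match the example above.

$ mdadm --detail /dev/md0 | grep -E 'State :|Failed Devices'
             State : clean
    Failed Devices : 0

mdadm also ships its own monitor mode (for example mdadm --monitor --scan --oneshot, which needs a mail address or alert program configured via --mail/--program or mdadm.conf) that reports degraded arrays without any parsing on your side.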

4. Disk Health Monitoring

What & Why

Disks can silently develop bad sectors or other hardware issues. SMART monitoring helps catch failures before they happen.

Manual Monitoring with Linux Commands

  • smartctl -a /dev/sda (from the smartmontools package):
    Shows firmware-reported reliability counters for a specific disk or SSD. Some key fields to watch:
    • SMART overall-health self-assessment: The drive’s pass/fail verdict; anything other than PASSED means the firmware itself sees a problem.
    • Reallocated_Sector_Ct (ID 5): Counts sectors already remapped to spare area; any non-zero value signals surface defects and is a leading indicator of failure.
    • Current_Pending_Sector (ID 197): Sectors that could not be read and are awaiting rewrite; a rising count points to unstable media and imminent reallocation.
    • Offline_Uncorrectable (ID 198): Unrecoverable read errors detected during background scans; persistent non-zero entries mean data loss has already occurred.
    • Temperature_Celsius (ID 194) or Composite Temperature (NVMe): Current drive temperature; sustained operation above the vendor’s spec shortens lifespan and increases error rates.
    • Power_On_Hours (ID 9): Total hours powered up; high values combined with growing error counts suggest age-related wear.
    • Percentage_Used, Wear_Leveling_Count, or Media_Wearout (IDs 177/233/202, SSD only): Estimates NAND endurance consumed; values approaching 100 % (or low remaining life) warn that the SSD is nearing its rated write limit.
    • Total_LBAs_Written / Data Units Written (IDs 241/246 or NVMe log): Cumulative data written; a sharp daily increase highlights unexpectedly write-heavy workloads that accelerate wear.
$ sudo smartctl -a /dev/nvme0
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.14.5-arch1-1] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT500P3PSSD8
Serial Number:                      2301E699808D
Firmware Version:                   P9CR40A
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x00a075
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 72700000a0
Local Time is:                      Mon Jun  2 22:07:45 2025 CEST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x06):         Cmd_Eff_Lg Ext_Get_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W  0.0000W       -    0  0  0  0        0       0
 1 +     3.00W  0.0000W       -    0  0  0  0        0       0
 2 +     1.50W  0.0000W       -    0  0  0  0        0       0
 3 -   0.0250W  0.0000W       -    3  3  3  3     5000    1900
 4 -   0.0030W       -        -    4  4  4  4    13000  100000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        43 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    48%
Data Units Read:                    1,027,132 [525 GB]
Data Units Written:                 32,777,697 [16.7 TB]
Host Read Commands:                 8,925,512
Host Write Commands:                1,135,875,129
Controller Busy Time:               1,714
Power Cycles:                       15
Power On Hours:                     6,628
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      77
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               43 Celsius
Temperature Sensor 2:               49 Celsius
Temperature Sensor 8:               43 Celsius

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0         77     0  0x101a  0x4004  0x004            0     1     -  Invalid Field in Command
  1         76     0  0x8005  0x4004  0x004            0     1     -  Invalid Field in Command
  2         75     0  0x8016  0x4004  0x004            0     1     -  Invalid Field in Command
  3         74     0  0x7016  0x4004  0x004            0     1     -  Invalid Field in Command
  4         73     0  0x6017  0x4005  0x004            0     1     -  Invalid Field in Command
  5         72     0  0x2000  0x4005  0x004            0     1     -  Invalid Field in Command
  6         71     0  0x1019  0x4005  0x004            0     1     -  Invalid Field in Command
  7         70     0  0x1018  0x4005  0x004            0     1     -  Invalid Field in Command
  8         69     0  0x6015  0x4005  0x004            0     1     -  Invalid Field in Command
  9         68     0  0x1003  0x4005  0x004            0     1     -  Invalid Field in Command
 10         67     0  0x1002  0x4005  0x004            0     1     -  Invalid Field in Command
 11         66     0  0x0013  0x4005  0x004            0     1     -  Invalid Field in Command
 12         65     0  0x5016  0x4005  0x004            0     1     -  Invalid Field in Command
 13         64     0  0x5015  0x4005  0x004            0     1     -  Invalid Field in Command
 14         63     0  0x4017  0x4005  0x004            0     1     -  Invalid Field in Command
 15         62     0  0x0012  0x4005  0x004            0     1     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
No Self-tests Logged
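
Checking each disk by hand does not scale past a couple of machines, so a small loop is a natural first step before full automation. Below is a minimal sketch, assuming smartmontools is installed: it asks smartctl to enumerate devices and prints each drive’s overall verdict. Treat the PASSED/FAILED line as a coarse filter only, since the individual attributes above often degrade long before the overall verdict flips.

#!/bin/bash
# Quick SMART sweep: list the devices smartctl can see, then ask each for its health verdict.
smartctl --scan | awk '{print $1}' | while read -r dev; do
    verdict=$(sudo smartctl -H "$dev" | grep -E 'overall-health|SMART Health Status')
    echo "$dev: ${verdict:-no SMART verdict reported}"
done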

The Challenge of Manual Monitoring

Manual checks are powerful for spot-checks and diagnostics, but they’re not scalable. You risk missing subtle changes or urgent failures if you don’t check often enough. That’s where automation shines.

Automating Disk Monitoring and Alerts with fivenines.io

fivenines.io brings modern automation to Linux disk monitoring, so you can focus on your business, not just your servers. Here’s how it transforms your workflow:

1. Easy Setup

  • Install the fivenines.io agent on your server (copy and paste the setup command provided by the setup wizard).
  • The agent automatically starts collecting disk IO and disk space metrics; RAID and SMART monitoring can be enabled by turning on the corresponding options.

2. Real-Time Dashboards

  • View all critical disk metrics from a unified web dashboard.
  • Track trends over time to spot developing issues before they become emergencies.

3. Automated Alerting

  • Set thresholds for each metric (e.g., disk usage > 90%, RAID degraded, SMART error detected).
  • Receive instant notifications via email, Slack, Telegram or other channels when something goes wrong.
  • No more missed warnings or late-night surprises.

4. Proactive Incident Prevention

  • Historical data helps with capacity planning and predicting failures.
  • Automated checks run 24/7, catching issues even when you’re off the clock.

Real-World Workflow: From Manual to Automated

Scenario:
You notice your database server is lagging.

  1. You check iostat and see high IO wait times, then run smartctl and discover a warning.
  2. With fivenines.io, you’d already have received an alert about the IO spike and SMART warning—giving you a head start to replace the disk before users are affected.

Best Practices for Disk Monitoring

  1. Set reasonable alert thresholds to avoid alert fatigue.
  2. Test your alerts to ensure notifications reach the right people.
  3. Schedule regular reviews of disk health and usage trends.
  4. Keep monitoring agents and RAID tools updated for accurate results.

Conclusion: Take Your Disk Monitoring to the Next Level

Manual Linux commands are essential skills for every sysadmin—but true resilience comes from automation. By combining hands-on knowledge with the proactive capabilities of fivenines.io, you’ll safeguard your infrastructure, prevent outages, and free up time for higher-value work.

Ready to stop worrying about disks and start focusing on growth?
Try fivenines.io for automated, reliable disk monitoring—and never be caught off guard again.