How to monitor a Linux server

Linux server monitoring is the practice of continuously tracking system performance, resource usage, and health metrics to ensure optimal operation and prevent downtime.

This comprehensive guide takes you from basic command-line monitoring to production-ready monitoring systems, covering everything from essential Linux commands to enterprise solutions like Prometheus, Grafana, and Nagios.

What to monitor?

The core metrics taxonomy covers four fundamental areas:

  • CPU: Utilization percentage, load averages, process counts
  • Memory: Available RAM, swap usage, buffer/cache utilization
  • Disk: Space usage, I/O operations per second, read/write latency
  • Network: Bandwidth utilization, packet loss, connection counts

Monitoring frequency depends on your system's criticality. Financial trading systems might need second-by-second monitoring, while development servers can be checked every five minutes. The key is establishing baselines: understanding what "normal" looks like for your specific workload. A minimal baseline-collection sketch follows the checklist below.

Use this quick assessment to determine your current monitoring maturity:

  • Do you know when systems fail? (Reactive)
  • Do you receive alerts before users notice problems? (Proactive)
  • Can you predict capacity needs months in advance? (Predictive)
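
One simple way to start building a baseline is to record a timestamped load and memory snapshot at a regular interval (a minimal sketch; the script name and log path are arbitrary, and it assumes it runs from root's crontab):

#!/bin/bash
# baseline_snapshot.sh - append a load/memory snapshot for baseline analysis
# Schedule from cron, e.g.: */15 * * * * /usr/local/bin/baseline_snapshot.sh
echo "$(date '+%F %T') load=$(cut -d' ' -f1-3 /proc/loadavg) mem_used_mib=$(free -m | awk '/^Mem:/ {print $3}')" >> /var/log/baseline.log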

Commands Every Admin Should Master

Linux provides powerful built-in monitoring tools that require no additional software. These commands form the foundation of server monitoring and troubleshooting.

System Overview with top and htop

The top command provides real-time system statistics. Run it to see CPU usage, memory consumption, and active processes:

top -d 5

This updates every 5 seconds instead of the default 3. The output shows load averages (1, 5, and 15-minute averages), total processes, and per-process resource usage. A load average equal to your CPU core count indicates full utilization.
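
To put the load average next to the core count in one line, something like this works (a quick sketch):

# Show the 1-minute load average alongside the number of CPU cores
awk -v cores="$(nproc)" '{printf "1-min load: %s on %d cores\n", $1, cores}' /proc/loadavg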

htop offers a more user-friendly interface with color coding and mouse support:

htop

Memory Analysis with free and vmstat

Check memory usage with human-readable output:

free -h

This shows total, used, free, and available memory. The "available" column is the most important: it estimates the memory that can be given to new processes without swapping.
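
To pull just that number into a script, awk can extract the column (a small sketch; column positions match the default free output on recent procps versions):

# Print available memory in MiB (column 7 of the Mem: row)
free -m | awk '/^Mem:/ {print $7 " MiB available"}'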

For deeper memory analysis, use vmstat:

vmstat 2 5

This displays memory, swap, I/O, and CPU statistics every 2 seconds for 5 iterations. Watch the "si" (swap-in) and "so" (swap-out) columns: consistent swap activity indicates memory pressure.
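
To flag swap activity automatically, filter those columns (a sketch; column positions assume the default vmstat layout, where si is column 7 and so is column 8):

# Print any sample where swap-in or swap-out occurred
vmstat 2 5 | awk 'NR>2 && ($7+$8) > 0 {print "swap activity: si=" $7 " so=" $8}'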

Disk Performance with iostat and df

Monitor disk I/O performance:

iostat -x 2

The "-x" flag provides extended statistics including %util (utilization percentage). Values consistently above 80% indicate I/O bottlenecks.
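
A quick way to surface busy devices from the command line (a sketch; %util is the last column of iostat -x output, and the first report reflects averages since boot):

# Flag devices reporting more than 80% utilization
iostat -dx 2 2 | awk '/^(sd|nvme|vd|xvd)/ && $NF+0 > 80 {print $1 ": " $NF "% utilized"}'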

Check disk space usage:

df -h

Network Monitoring with ss and netstat

Modern Linux systems use ss instead of the older netstat:

ss -tuln

This shows TCP (-t) and UDP (-u) listening (-l) ports with numeric (-n) addresses. Use this to verify services are running and accessible.
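
ss also accepts filter expressions, which is handy for checking a single service port (a sketch):

# Is anything listening on port 22?
ss -tln 'sport = :22'

# Include the owning process (may require root to see other users' processes)
sudo ss -tlnp 'sport = :443'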

Command Chaining for Quick Diagnostics

Combine commands for rapid system assessment:

# Quick system health check
uptime && free -h && df -h / && ss -tuln | grep :80

This one-liner shows system load, memory usage, root disk space, and whether a web server is listening.

Example Monitoring Scripts

Manual monitoring doesn't scale. Automated scripts provide consistent monitoring and can trigger alerts when thresholds are exceeded.

Disk Space Monitoring Script

Create a script that monitors disk usage and sends alerts:

#!/bin/bash
# disk_monitor.sh - Monitor disk space usage

THRESHOLD=85
EMAIL="admin@company.com"
LOGFILE="/var/log/disk_monitor.log"

# Function to log messages
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOGFILE"
}

# Check each mounted filesystem (-P keeps each entry on one line so the awk columns stay aligned)
df -hP | awk 'NR>1 {print $5 " " $6}' | while read usage mountpoint; do
    # Remove % sign and convert to integer
    usage_num=$(echo $usage | sed 's/%//')
    
    if [ "$usage_num" -gt "$THRESHOLD" ]; then
        message="ALERT: Disk usage on $mountpoint is ${usage} (threshold: ${THRESHOLD}%)"
        log_message "$message"
        
        # Send email alert
        echo "$message" | mail -s "Disk Space Alert - $(hostname)" "$EMAIL"
        
        # Optional: Send Slack webhook
        curl -X POST -H 'Content-type: application/json' \
             --data "{\"text\":\"$message\"}" \
             YOUR_SLACK_WEBHOOK_URL
    fi
done

Make the script executable and test it:

chmod +x disk_monitor.sh
./disk_monitor.sh

CPU Load Monitoring

Add CPU monitoring to the same script:

# CPU load monitoring function
check_cpu_load() {
    LOAD_THRESHOLD=2.0
    CURRENT_LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
    
    if (( $(echo "$CURRENT_LOAD > $LOAD_THRESHOLD" | bc -l) )); then
        message="ALERT: High CPU load: $CURRENT_LOAD (threshold: $LOAD_THRESHOLD)"
        log_message "$message"
        echo "$message" | mail -s "CPU Load Alert - $(hostname)" "$EMAIL"
    fi
}

# Call the function so it runs with the rest of the script
check_cpu_load

Automating with Cron

Schedule the script to run every 10 minutes:

crontab -e

Add this line:

*/10 * * * * /path/to/disk_monitor.sh

For critical systems, run checks more frequently. Database servers might need minute-by-minute monitoring, while development systems can be checked every 30 minutes.
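
Changing the frequency is just a different cron expression (illustrative entries):

# Every minute, for critical database hosts
* * * * * /path/to/disk_monitor.sh

# Every 30 minutes, for development systems
*/30 * * * * /path/to/disk_monitor.sh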

Prometheus and Grafana for enterprise-grade monitoring

Prometheus and Grafana provide a powerful, scalable monitoring solution that's become the industry standard for modern infrastructure monitoring.

Installing Prometheus

Download and install Prometheus:

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus

# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvf prometheus-2.40.0.linux-amd64.tar.gz

# Install binaries
sudo cp prometheus-2.40.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.40.0.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

# Create directories
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

Configuring Prometheus

Create the main configuration file:

sudo nano /etc/prometheus/prometheus.yml

Add this configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
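
Before starting the service, you can validate the file with promtool, which was installed alongside Prometheus:

promtool check config /etc/prometheus/prometheus.yml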

Installing Node Exporter

Node Exporter collects system metrics:

# Download and install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
sudo cp node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter

Create systemd services for both components:

# Prometheus service
sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090 \
    --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
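
Node Exporter needs its own unit as well; a minimal sketch following the same pattern:

# Node Exporter service
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target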

Start the services:

sudo systemctl daemon-reload
sudo systemctl enable prometheus node_exporter
sudo systemctl start prometheus node_exporter
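
To confirm both components are up, Prometheus exposes a health endpoint and Node Exporter serves metrics on port 9100:

curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9100/metrics | head -n 5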

Installing Grafana

Install Grafana for visualization:

# Add Grafana repository
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

# Install Grafana
sudo apt-get update
sudo apt-get install grafana

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Access Grafana at http://your-server:3000 (default login: admin/admin). Add Prometheus as a data source using URL http://localhost:9090.
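
If you prefer configuration as code, the data source can also be declared in a Grafana provisioning file instead of the web UI (a minimal sketch):

# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true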

Performance Impact Considerations

This monitoring stack typically uses 200-500MB RAM and 1-2% CPU on a moderately busy server. Node Exporter adds minimal overhead (usually under 50MB RAM), making it suitable for production environments.

Log Analysis and Security Monitoring

System logs contain crucial information about security events, application errors, and system behavior. Log monitoring can prevent security breaches and identify problems before they impact users.

Mastering journalctl

Modern Linux systems use systemd's journal for logging. Key commands for log analysis:

# View recent logs
journalctl -f

# Filter by service
journalctl -u nginx -f

# Show logs from last hour
journalctl --since "1 hour ago"

# Filter by priority (emergency, alert, critical, error, warning, notice, info, debug)
journalctl -p err

# Show logs for specific time range
journalctl --since "2023-01-01 00:00:00" --until "2023-01-01 23:59:59"

Security Event Monitoring Script

Create a script to detect common attack patterns:

#!/bin/bash
# security_monitor.sh - Monitor for security events

LOGFILE="/var/log/security_monitor.log"
ALERT_EMAIL="security@company.com"

# Function to send security alert
send_alert() {
    local message="$1"
    echo "$(date '+%Y-%m-%d %H:%M:%S') - SECURITY ALERT: $message" >> "$LOGFILE"
    echo "$message" | mail -s "Security Alert - $(hostname)" "$ALERT_EMAIL"
}

# Monitor failed SSH login attempts
failed_ssh=$(journalctl --since "5 minutes ago" -u ssh | grep "Failed password" | wc -l)
if [ "$failed_ssh" -gt 10 ]; then
    send_alert "High number of failed SSH attempts: $failed_ssh in last 5 minutes"
fi

# Monitor sudo usage
sudo_attempts=$(journalctl --since "5 minutes ago" | grep "sudo.*COMMAND" | wc -l)
if [ "$sudo_attempts" -gt 20 ]; then
    send_alert "Unusual sudo activity: $sudo_attempts commands in last 5 minutes"
fi

# Check for new user accounts
new_users=$(journalctl --since "1 hour ago" | grep "new user" | wc -l)
if [ "$new_users" -gt 0 ]; then
    send_alert "New user account(s) created: $new_users in last hour"
fi

# Monitor disk space for rapid changes (potential log flooding attack)
current_disk=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ -f /tmp/last_disk_usage ]; then
    last_disk=$(cat /tmp/last_disk_usage)
    disk_change=$((current_disk - last_disk))
    if [ "$disk_change" -gt 5 ]; then
        send_alert "Rapid disk usage increase: ${disk_change}% in 5 minutes"
    fi
fi
echo "$current_disk" > /tmp/last_disk_usage

Log Correlation Techniques

Correlate system metrics with application logs to identify root causes:

# Find high CPU periods and correlate with application errors
journalctl --since "1 hour ago" -p err | while read line; do
    timestamp=$(echo "$line" | awk '{print $1 " " $2 " " $3}')
    # Check system load during error time
    echo "Error at $timestamp - checking system metrics..."
done

Container and Modern Infrastructure Monitoring

Container monitoring requires different approaches than traditional server monitoring. Containers are ephemeral, and traditional monitoring tools may not capture container-specific metrics effectively.

Docker Container Monitoring with cAdvisor

Google's cAdvisor provides detailed container metrics:

# Run cAdvisor as a Docker container
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  --privileged \
  --device=/dev/kmsg \
  gcr.io/cadvisor/cadvisor:latest

Access the cAdvisor web interface at http://your-server:8080 to view container metrics.
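
cAdvisor also exposes Prometheus-format metrics, so it can be scraped by adding a job under scrape_configs in /etc/prometheus/prometheus.yml (a sketch):

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['localhost:8080']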

Docker Stats and Logs

Use Docker's built-in monitoring commands:

# Real-time container resource usage
docker stats

# Container logs with timestamps
docker logs -f --timestamps container_name

# Inspect container resource limits
docker inspect container_name | grep -A 10 "Memory\|Cpu"

Container-Specific Monitoring Script

#!/bin/bash
# container_monitor.sh - Monitor Docker containers

# Check for stopped containers
stopped_containers=$(docker ps -a --filter "status=exited" --format "table {{.Names}}" | tail -n +2)
if [ ! -z "$stopped_containers" ]; then
    echo "Stopped containers detected: $stopped_containers"
fi

# Monitor container resource usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" | \
while read line; do
    if [[ $line == *"%"* ]]; then
        container=$(echo $line | awk '{print $1}')
        cpu=$(echo $line | awk '{print $2}' | sed 's/%//')
        
        if (( $(echo "$cpu > 80" | bc -l) )); then
            echo "High CPU usage in container $container: ${cpu}%"
        fi
    fi
done

Kubernetes Monitoring Basics

For Kubernetes environments, monitor cluster health with:

# Check node status
kubectl get nodes

# Monitor pod resource usage
kubectl top pods --all-namespaces

# Check for failed pods
kubectl get pods --all-namespaces --field-selector=status.phase=Failed

Nagios and Zabbix Implementation

Enterprise monitoring solutions provide centralized monitoring, alerting, and reporting capabilities suitable for large-scale infrastructures.

When to Choose Enterprise Solutions

Consider enterprise monitoring when you have:

  • More than 50 servers to monitor
  • Compliance requirements for monitoring and alerting
  • Need for complex escalation procedures
  • Multiple teams requiring different dashboard views
  • Budget for commercial support

Installing Nagios Core

Install Nagios Core for comprehensive infrastructure monitoring:

# Install dependencies
sudo apt-get update
sudo apt-get install -y autoconf gcc libc6 make wget unzip apache2 php libapache2-mod-php7.4 libgd-dev

# Create nagios user
sudo useradd nagios
sudo groupadd nagcmd
sudo usermod -a -G nagcmd nagios

# Download and compile Nagios
cd /tmp
wget -O nagioscore.tar.gz https://github.com/NagiosEnterprises/nagioscore/archive/nagios-4.4.9.tar.gz
tar xzf nagioscore.tar.gz
cd nagioscore-nagios-4.4.9/

# Configure and compile
sudo ./configure --with-httpd-conf=/etc/apache2/sites-enabled
sudo make all
sudo make install
sudo make install-init
sudo make install-commandmode
sudo make install-config
sudo make install-webconf

Basic Nagios Configuration

Configure Nagios to monitor your servers:

# Edit main configuration
sudo nano /usr/local/nagios/etc/nagios.cfg

Create a host definition:

# /usr/local/nagios/etc/objects/servers.cfg
define host {
    use                     linux-server
    host_name               webserver01
    alias                   Web Server 01
    address                 192.168.1.100
    max_check_attempts      5
    check_period            24x7
    notification_interval   30
    notification_period     24x7
}

define service {
    use                     generic-service
    host_name               webserver01
    service_description     HTTP
    check_command           check_http
    max_check_attempts      5
    normal_check_interval   3
    retry_check_interval    2
}
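
After editing object files, reference the new file from nagios.cfg, then verify the configuration before restarting (a sketch; the service name depends on how the init files were installed):

# In /usr/local/nagios/etc/nagios.cfg:
# cfg_file=/usr/local/nagios/etc/objects/servers.cfg

sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
sudo systemctl restart nagios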

Zabbix Installation

Zabbix offers a more modern interface and better scalability:

# Install Zabbix repository
wget https://repo.zabbix.com/zabbix/6.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.0-1+ubuntu20.04_all.deb
sudo dpkg -i zabbix-release_6.0-1+ubuntu20.04_all.deb
sudo apt update

# Install Zabbix server and frontend
sudo apt install zabbix-server-mysql zabbix-frontend-php zabbix-apache-conf zabbix-sql-scripts zabbix-agent

# Configure MySQL database
mysql -uroot -p
mysql> create database zabbix character set utf8 collate utf8_bin;
mysql> create user zabbix@localhost identified by 'password';
mysql> grant all privileges on zabbix.* to zabbix@localhost;
mysql> quit;

# Import initial schema
zcat /usr/share/doc/zabbix-sql-scripts/mysql/create.sql.gz | mysql -uzabbix -p zabbix
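
The Zabbix server then needs the database password before its services start (a sketch; match the password created above):

# Set DBPassword=password in /etc/zabbix/zabbix_server.conf, then:
sudo systemctl restart zabbix-server zabbix-agent apache2
sudo systemctl enable zabbix-server zabbix-agent apache2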

ROI Considerations

Enterprise monitoring solutions typically cost $5-50 per monitored device per month, while industry estimates (such as Gartner's widely cited figure) put the average cost of IT downtime at around $5,600 per minute. The investment can pay for itself by preventing a single major outage.

Monitoring Best Practices and Alert Management

Effective alerting prevents alert fatigue while ensuring critical issues receive immediate attention.

Alert Threshold Tuning

Set thresholds based on historical data and business impact:

  • Warning thresholds: 70-80% of capacity
  • Critical thresholds: 85-95% of capacity
  • Emergency thresholds: 95%+ or service unavailable

Use hysteresis to prevent flapping alerts: set recovery thresholds 5-10% below alert thresholds.
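
In Prometheus, the "for:" clause provides a similar damping effect by firing only after a condition has held for a sustained period; a minimal alerting-rule sketch using node_exporter filesystem metrics:

groups:
  - name: capacity
    rules:
      - alert: RootFilesystemNearlyFull
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem over 85% used on {{ $labels.instance }}"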

SLA and SLO Definition

Define clear service level objectives:

# Example SLO definitions
Web Application Availability: 99.9% (43 minutes downtime/month)
API Response Time: 95th percentile under 200ms
Database Query Performance: 99th percentile under 1 second
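
In Prometheus terms, the latency objective maps to a quantile over request-duration histograms; a sketch (the metric name is an assumption about how the application is instrumented):

# 95th-percentile API response time over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))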

Escalation Procedures

Implement tiered alerting:

  1. Level 1 (0-15 minutes): On-call engineer via PagerDuty/SMS
  2. Level 2 (15-30 minutes): Team lead and backup engineer
  3. Level 3 (30+ minutes): Management and additional team members

Alert Fatigue Prevention

Reduce noise with intelligent alerting:

  • Group related alerts (don't send 50 alerts for one network outage); see the Alertmanager sketch after this list
  • Use dependencies (don't alert on web services when the database is down)
  • Implement maintenance windows
  • Regular alert review and tuning
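
For Prometheus users, the first two points are handled in Alertmanager with grouping and inhibition rules; a minimal sketch (alert names and receiver are illustrative):

route:
  receiver: 'ops-team'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'ops-team'
    # email or webhook configuration goes here

inhibit_rules:
  - source_matchers:
      - alertname="DatabaseDown"
    target_matchers:
      - alertname="WebServiceDown"
    equal: ['cluster']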

Troubleshooting Common Monitoring Issues

Monitoring systems themselves can fail or create performance problems. Here's how to identify and resolve common issues.

High Resource Usage from Monitoring

If monitoring tools consume excessive resources:

# Check monitoring process resource usage
ps aux | grep -E "(prometheus|grafana|nagios|zabbix)" | sort -k3 -nr

# Reduce Prometheus retention period by adding this flag to the
# ExecStart line in /etc/systemd/system/prometheus.service
# (it is a command-line flag, not a prometheus.yml setting):
--storage.tsdb.retention.time=15d
# Then: sudo systemctl daemon-reload && sudo systemctl restart prometheus

# Optimize Grafana queries
# Use recording rules for complex queries
# Limit dashboard refresh rates
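
A recording rule precomputes an expensive expression so dashboards query the stored result instead; a minimal sketch (the file must be listed under rule_files in prometheus.yml):

# /etc/prometheus/rules/node_recording.yml
groups:
  - name: node_recording
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))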

False Positive Reduction

Common causes and solutions:

  • Network hiccups: Require 2-3 consecutive failures before alerting
  • Scheduled maintenance: Implement maintenance windows
  • Seasonal patterns: Use dynamic thresholds based on time/day

Monitoring System Failures

Implement meta-monitoring:

#!/bin/bash
# monitor_the_monitors.sh
# Check if monitoring services are running

services=("prometheus" "grafana-server" "nagios" "zabbix-server")

for service in "${services[@]}"; do
    if ! systemctl is-active --quiet "$service"; then
        echo "CRITICAL: $service is not running" | mail -s "Monitoring System Alert" admin@company.com
        systemctl restart "$service"
    fi
done

Scaling Your Monitoring Strategy

As your infrastructure grows, monitoring strategies must evolve to handle increased complexity and scale.

Capacity Planning for Monitoring

Plan monitoring resources based on infrastructure size:

  • Small (1-10 servers): Single monitoring server, basic alerting
  • Medium (10-100 servers): Dedicated monitoring cluster, advanced dashboards
  • Large (100+ servers): Distributed monitoring, data retention policies, automated scaling

Multi-Server Monitoring Architecture

For large environments, implement hierarchical monitoring:

# Central Prometheus configuration for federation
- job_name: 'federate'
  scrape_interval: 15s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"prometheus|node_exporter"}'
  static_configs:
    - targets:
      - 'prometheus-site1:9090'
      - 'prometheus-site2:9090'

Cloud Integration Strategies

For hybrid cloud environments:

  • Use cloud-native monitoring for cloud resources (CloudWatch, Azure Monitor)
  • Implement VPN connections for secure metric collection
  • Consider cloud-hosted monitoring solutions for global visibility
  • Implement data sovereignty compliance for regulated industries

Monitoring Maturity Roadmap

Evolution path for monitoring systems:

  1. Foundation (Months 1-3): Basic monitoring, essential alerts
  2. Enhancement (Months 4-6): Custom dashboards, log aggregation
  3. Optimization (Months 7-12): Predictive monitoring, automation
  4. Innovation (Year 2+): AI-powered anomaly detection, self-healing systems

Next Steps and Continuous Improvement

Effective Linux server monitoring is an ongoing process that evolves with your infrastructure and business needs. Start with the fundamentals: master the built-in Linux commands and establish baseline monitoring with simple scripts. As your confidence and requirements grow, implement more sophisticated solutions like Prometheus and Grafana.

Remember that the best monitoring system is one that's actively maintained and regularly tuned. Schedule monthly reviews of your alerts, thresholds, and dashboards. Involve your entire team in monitoring discussions: the person who gets paged at 3 AM should have input on alert sensitivity.

The monitoring landscape continues evolving with new tools and techniques. Stay current with industry trends, but don't chase every new technology. Focus on solutions that solve real problems for your specific environment and team.

Your monitoring journey doesn't end here; it's a continuous process of improvement, learning, and adaptation. Start implementing these techniques today, and build the monitoring foundation that will keep your systems running smoothly for years to come.
