How to monitor a Linux server
Linux server monitoring is the practice of continuously tracking system performance, resource usage, and health metrics to ensure optimal operation and prevent downtime.
This comprehensive guide takes you from basic command-line monitoring to production-ready monitoring systems, covering everything from essential Linux commands to enterprise solutions like Prometheus, Grafana, and Nagios.
What to monitor?
The core metrics taxonomy covers four fundamental areas:
- CPU: Utilization percentage, load averages, process counts
- Memory: Available RAM, swap usage, buffer/cache utilization
- Disk: Space usage, I/O operations per second, read/write latency
- Network: Bandwidth utilization, packet loss, connection counts
Monitoring frequency depends on your system's criticality. Financial trading systems might need second-by-second monitoring, while development servers can be checked every five minutes. The key is establishing baselines so you understand what "normal" looks like for your specific workload.
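One low-effort way to build that baseline is to log a short snapshot on a schedule and review the history later. The sketch below is a hypothetical example (the log path and fields are placeholders): it appends the 1-minute load, available memory, and root-disk usage to a CSV file and could be run from cron every few minutes.
#!/bin/bash
# baseline_snapshot.sh - append one metrics snapshot per run (hypothetical example)
LOG="/var/log/baseline.csv"
load=$(awk '{print $1}' /proc/loadavg)
mem_avail=$(free -m | awk '/^Mem:/ {print $7}')
disk_used=$(df / --output=pcent | tail -1 | tr -dc '0-9')
echo "$(date '+%Y-%m-%d %H:%M:%S'),$load,$mem_avail,$disk_used" >> "$LOG"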
Use this quick assessment to determine your current monitoring maturity:
- Do you know when systems fail? (Reactive)
- Do you receive alerts before users notice problems? (Proactive)
- Can you predict capacity needs months in advance? (Predictive)
Commands Every Admin Should Master
Linux provides powerful built-in monitoring tools that require no additional software. These commands form the foundation of server monitoring and troubleshooting.
System Overview with top and htop
The top command provides real-time system statistics. Run it to see CPU usage, memory consumption, and active processes:
top -d 5
This updates every 5 seconds instead of the default 3. The output shows load averages (1-, 5-, and 15-minute averages), total processes, and per-process resource usage. A load average equal to your CPU core count indicates full utilization.
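As a quick sanity check, you can compare the 1-minute load against the core count directly. A minimal sketch of that check:
# Compare the 1-minute load average to the number of CPU cores
awk -v cores="$(nproc)" '{printf "load %.2f on %d cores (%.0f%% of capacity)\n", $1, cores, ($1 / cores) * 100}' /proc/loadavg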
htop offers a more user-friendly interface with color coding and mouse support:
htop
Memory Analysis with free and vmstat
Check memory usage with human-readable output:
free -h
This shows total, used, free, and available memory. The "available" column is the most important: it represents memory that can be freed for new processes.
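If you need that single value in a script, a minimal sketch (assuming the modern procps layout of free, where "available" is the seventh field of the Mem: row) looks like this:
# Print available memory in MiB
free -m | awk '/^Mem:/ {print $7 " MiB available"}'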
For deeper memory analysis, use vmstat:
vmstat 2 5
This displays memory, swap, I/O, and CPU statistics every 2 seconds for 5 iterations. Watch the "si" and "so" columns: consistent swap activity indicates memory pressure.
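A small sketch that flags any swap activity over a one-minute window (column positions assume the standard vmstat layout, where si and so are the seventh and eighth fields):
# Print a line whenever a 5-second interval shows swap-in or swap-out activity
vmstat 5 12 | awk 'NR > 2 && ($7 > 0 || $8 > 0) {print "swap activity: si=" $7 " so=" $8}'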
Disk Performance with iostat and df
Monitor disk I/O performance:
iostat -x 2
The "-x" flag provides extended statistics including %util (utilization percentage). Values consistently above 80% indicate I/O bottlenecks.
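To surface only the devices that cross that line, here is a hedged sketch (field positions can vary between sysstat versions, so verify that %util is the last column on your system):
# Report block devices whose %util exceeds 80% (the second report reflects current activity)
iostat -dx 5 2 | awk '/^(sd|nvme|vd|dm-)/ && $NF + 0 > 80 {print $1 ": " $NF "% utilized"}'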
Check disk space usage:
df -h
Network Monitoring with ss and netstat
Modern Linux systems use ss instead of the older netstat:
ss -tuln
This shows TCP (-t) and UDP (-u) listening (-l) ports with numeric (-n) addresses. Use this to verify services are running and accessible.
Command Chaining for Quick Diagnostics
Combine commands for rapid system assessment:
# Quick system health check
uptime && free -h && df -h / && ss -tuln | grep :80
This one-liner shows system load, memory usage, root disk space, and whether a web server is listening.
Monitoring Script Examples
Manual monitoring doesn't scale. Automated scripts provide consistent monitoring and can trigger alerts when thresholds are exceeded.
Disk Space Monitoring Script
Create a script that monitors disk usage and sends alerts:
#!/bin/bash
# disk_monitor.sh - Monitor disk space usage
THRESHOLD=85
EMAIL="admin@company.com"
LOGFILE="/var/log/disk_monitor.log"
# Function to log messages
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOGFILE"
}
# Check each mounted filesystem
df -h | awk 'NR>1 {print $5 " " $6}' | while read usage mountpoint; do
    # Remove % sign and convert to integer
    usage_num=$(echo $usage | sed 's/%//')
    if [ "$usage_num" -gt "$THRESHOLD" ]; then
        message="ALERT: Disk usage on $mountpoint is ${usage} (threshold: ${THRESHOLD}%)"
        log_message "$message"
        # Send email alert
        echo "$message" | mail -s "Disk Space Alert - $(hostname)" "$EMAIL"
        # Optional: Send Slack webhook
        curl -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\"$message\"}" \
            YOUR_SLACK_WEBHOOK_URL
    fi
done
Make the script executable and test it:
chmod +x disk_monitor.sh
./disk_monitor.sh
CPU Load Monitoring
Add CPU monitoring to the same script:
# CPU load monitoring function
check_cpu_load() {
    LOAD_THRESHOLD=2.0
    CURRENT_LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
    if (( $(echo "$CURRENT_LOAD > $LOAD_THRESHOLD" | bc -l) )); then
        message="ALERT: High CPU load: $CURRENT_LOAD (threshold: $LOAD_THRESHOLD)"
        log_message "$message"
        echo "$message" | mail -s "CPU Load Alert - $(hostname)" "$EMAIL"
    fi
}
Automating with Cron
Schedule the script to run every 10 minutes:
crontab -e
Add this line:
*/10 * * * * /path/to/disk_monitor.sh
For critical systems, run checks more frequently. Database servers might need minute-by-minute monitoring, while development systems can be checked every 30 minutes.
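For reference, here is a sketch of how those different frequencies might look in a crontab (paths are placeholders); the flock variant prevents a slow check from overlapping the next run:
* * * * * /path/to/disk_monitor.sh              # database server: every minute
*/30 * * * * /path/to/disk_monitor.sh           # development box: every 30 minutes
* * * * * flock -n /tmp/disk_monitor.lock /path/to/disk_monitor.sh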
Prometheus and Grafana for enterprise-grade monitoring
Prometheus and Grafana provide a powerful, scalable monitoring solution that's become the industry standard for modern infrastructure monitoring.
Installing Prometheus
Download and install Prometheus:
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvf prometheus-2.40.0.linux-amd64.tar.gz
# Install binaries
sudo cp prometheus-2.40.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.40.0.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
# Create directories
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
Configuring Prometheus
Create the main configuration file:
sudo nano /etc/prometheus/prometheus.yml
Add this configuration:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  # - "first_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
Installing Node Exporter
Node Exporter collects system metrics:
# Download and install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
sudo cp node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter
Create systemd services for both components:
# Prometheus service
sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090 \
--web.enable-lifecycle
[Install]
WantedBy=multi-user.target
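The unit above covers Prometheus only; node_exporter needs its own unit file as well. A minimal sketch, reusing the prometheus user and node_exporter's default port 9100:
# sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target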
Start the services:
sudo systemctl daemon-reload
sudo systemctl enable prometheus node_exporter
sudo systemctl start prometheus node_exporter
Installing Grafana
Install Grafana for visualization:
# Add Grafana repository
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
# Install Grafana
sudo apt-get update
sudo apt-get install grafana
# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
Access Grafana at http://your-server:3000 (default login: admin/admin). Add Prometheus as a data source using the URL http://localhost:9090.
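If you prefer to script that step instead of clicking through the UI, here is a minimal sketch using Grafana's HTTP API (assuming the default admin/admin credentials, which you should change afterwards):
# Register Prometheus as a Grafana data source via the HTTP API
curl -s -u admin:admin -X POST http://localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy","isDefault":true}'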
Performance Impact Considerations
This monitoring stack typically uses 200-500MB RAM and 1-2% CPU on a moderately busy server. Node Exporter adds minimal overhead (usually under 50MB RAM), making it suitable for production environments.
Log Analysis and Security Monitoring
System logs contain crucial information about security events, application errors, and system behavior. Log monitoring can prevent security breaches and identify problems before they impact users.
Mastering journalctl
Modern Linux systems use systemd's journal for logging. Key commands for log analysis:
# View recent logs
journalctl -f
# Filter by service
journalctl -u nginx -f
# Show logs from last hour
journalctl --since "1 hour ago"
# Filter by priority (emergency, alert, critical, error, warning, notice, info, debug)
journalctl -p err
# Show logs for specific time range
journalctl --since "2023-01-01 00:00:00" --until "2023-01-01 23:59:59"Security Event Monitoring Script
Create a script to detect common attack patterns:
#!/bin/bash
# security_monitor.sh - Monitor for security events
LOGFILE="/var/log/security_monitor.log"
ALERT_EMAIL="security@company.com"
# Function to send security alert
send_alert() {
    local message="$1"
    echo "$(date '+%Y-%m-%d %H:%M:%S') - SECURITY ALERT: $message" >> "$LOGFILE"
    echo "$message" | mail -s "Security Alert - $(hostname)" "$ALERT_EMAIL"
}
# Monitor failed SSH login attempts
failed_ssh=$(journalctl --since "5 minutes ago" -u ssh | grep "Failed password" | wc -l)
if [ "$failed_ssh" -gt 10 ]; then
    send_alert "High number of failed SSH attempts: $failed_ssh in last 5 minutes"
fi
# Monitor sudo usage
sudo_attempts=$(journalctl --since "5 minutes ago" | grep "sudo.*COMMAND" | wc -l)
if [ "$sudo_attempts" -gt 20 ]; then
    send_alert "Unusual sudo activity: $sudo_attempts commands in last 5 minutes"
fi
# Check for new user accounts
new_users=$(journalctl --since "1 hour ago" | grep "new user" | wc -l)
if [ "$new_users" -gt 0 ]; then
    send_alert "New user account(s) created: $new_users in last hour"
fi
# Monitor disk space for rapid changes (potential log flooding attack)
current_disk=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ -f /tmp/last_disk_usage ]; then
    last_disk=$(cat /tmp/last_disk_usage)
    disk_change=$((current_disk - last_disk))
    if [ "$disk_change" -gt 5 ]; then
        send_alert "Rapid disk usage increase: ${disk_change}% in 5 minutes"
    fi
fi
echo "$current_disk" > /tmp/last_disk_usage
Log Correlation Techniques
Correlate system metrics with application logs to identify root causes:
# Find high CPU periods and correlate with application errors
journalctl --since "1 hour ago" -p err | while read line; do
timestamp=$(echo "$line" | awk '{print $1 " " $2 " " $3}')
# Check system load during error time
echo "Error at $timestamp - checking system metrics..."
doneContainer and Modern Infrastructure Monitoring
Container monitoring requires different approaches than traditional server monitoring. Containers are ephemeral, and traditional monitoring tools may not capture container-specific metrics effectively.
Docker Container Monitoring with cAdvisor
Google's cAdvisor provides detailed container metrics:
# Run cAdvisor as a Docker container
docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
--privileged \
--device=/dev/kmsg \
gcr.io/cadvisor/cadvisor:latest
Access the cAdvisor web interface at http://your-server:8080 to view container metrics.
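cAdvisor also exposes its metrics in Prometheus format on the same port, so if you are running the Prometheus stack from earlier you can scrape it by adding one more job under scrape_configs in prometheus.yml (a sketch, assuming cAdvisor stays on port 8080):
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['localhost:8080']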
Docker Stats and Logs
Use Docker's built-in monitoring commands:
# Real-time container resource usage
docker stats
# Container logs with timestamps
docker logs -f --timestamps container_name
# Inspect container resource limits
docker inspect container_name | grep -A 10 "Memory\|Cpu"
Container-Specific Monitoring Script
#!/bin/bash
# container_monitor.sh - Monitor Docker containers
# Check for stopped containers
stopped_containers=$(docker ps -a --filter "status=exited" --format "table {{.Names}}" | tail -n +2)
if [ ! -z "$stopped_containers" ]; then
    echo "Stopped containers detected: $stopped_containers"
fi
# Monitor container resource usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" | \
while read line; do
    if [[ $line == *"%"* ]]; then
        container=$(echo $line | awk '{print $1}')
        cpu=$(echo $line | awk '{print $2}' | sed 's/%//')
        if (( $(echo "$cpu > 80" | bc -l) )); then
            echo "High CPU usage in container $container: ${cpu}%"
        fi
    fi
done
Kubernetes Monitoring Basics
For Kubernetes environments, monitor cluster health with:
# Check node status
kubectl get nodes
# Monitor pod resource usage
kubectl top pods --all-namespaces
# Check for failed pods
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
Nagios and Zabbix Implementation
Enterprise monitoring solutions provide centralized monitoring, alerting, and reporting capabilities suitable for large-scale infrastructures.
When to Choose Enterprise Solutions
Consider enterprise monitoring when you have:
- More than 50 servers to monitor
- Compliance requirements for monitoring and alerting
- Need for complex escalation procedures
- Multiple teams requiring different dashboard views
- Budget for commercial support
Installing Nagios Core
Install Nagios Core for comprehensive infrastructure monitoring:
# Install dependencies
sudo apt-get update
sudo apt-get install -y autoconf gcc libc6 make wget unzip apache2 php libapache2-mod-php7.4 libgd-dev
# Create nagios user
sudo useradd nagios
sudo groupadd nagcmd
sudo usermod -a -G nagcmd nagios
# Download and compile Nagios
cd /tmp
wget -O nagioscore.tar.gz https://github.com/NagiosEnterprises/nagioscore/archive/nagios-4.4.9.tar.gz
tar xzf nagioscore.tar.gz
cd nagioscore-nagios-4.4.9/
# Configure and compile
sudo ./configure --with-httpd-conf=/etc/apache2/sites-enabled
sudo make all
sudo make install
sudo make install-init
sudo make install-commandmode
sudo make install-config
sudo make install-webconf
Basic Nagios Configuration
Configure Nagios to monitor your servers:
# Edit main configuration
sudo nano /usr/local/nagios/etc/nagios.cfg
Create a host definition:
# /usr/local/nagios/etc/objects/servers.cfg
define host {
    use                     linux-server
    host_name               webserver01
    alias                   Web Server 01
    address                 192.168.1.100
    max_check_attempts      5
    check_period            24x7
    notification_interval   30
    notification_period     24x7
}
define service {
    use                     generic-service
    host_name               webserver01
    service_description     HTTP
    check_command           check_http
    max_check_attempts      5
    normal_check_interval   3
    retry_check_interval    2
}
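Nagios only reads object files that are referenced from the main configuration, so a typical follow-up (sketched here, assuming the paths used above and that the Nagios service was installed earlier) is to register servers.cfg, validate the configuration, and restart the service:
# Add this line to /usr/local/nagios/etc/nagios.cfg
cfg_file=/usr/local/nagios/etc/objects/servers.cfg
# Validate the configuration before reloading
sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
sudo systemctl restart nagios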
Zabbix Installation
Zabbix offers a more modern interface and better scalability:
# Install Zabbix repository
wget https://repo.zabbix.com/zabbix/6.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.0-1+ubuntu20.04_all.deb
sudo dpkg -i zabbix-release_6.0-1+ubuntu20.04_all.deb
sudo apt update
# Install Zabbix server and frontend
sudo apt install zabbix-server-mysql zabbix-frontend-php zabbix-apache-conf zabbix-sql-scripts zabbix-agent
# Configure MySQL database
mysql -uroot -p
mysql> create database zabbix character set utf8 collate utf8_bin;
mysql> create user zabbix@localhost identified by 'password';
mysql> grant all privileges on zabbix.* to zabbix@localhost;
mysql> quit;
# Import initial schema
zcat /usr/share/doc/zabbix-sql-scripts/mysql/create.sql.gz | mysql -uzabbix -p zabbix
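After the schema import, the usual remaining steps (sketched here; use the same password you set for the zabbix database user) are to point the server at the database, then restart and enable the services:
# Set DBPassword in /etc/zabbix/zabbix_server.conf, then restart and enable the services
sudo sed -i 's/^# DBPassword=.*/DBPassword=password/' /etc/zabbix/zabbix_server.conf
sudo systemctl restart zabbix-server zabbix-agent apache2
sudo systemctl enable zabbix-server zabbix-agent apache2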
ROI Considerations
Enterprise monitoring solutions typically cost $5-50 per monitored device per month, but they help prevent downtime that is commonly estimated at around $5,600 per minute for large e-commerce operations. The investment can pay for itself by preventing a single major outage.
Monitoring Best Practices and Alert Management
Effective alerting prevents alert fatigue while ensuring critical issues receive immediate attention.
Alert Threshold Tuning
Set thresholds based on historical data and business impact:
- Warning thresholds: 70-80% of capacity
- Critical thresholds: 85-95% of capacity
- Emergency thresholds: 95%+ or service unavailable
Use hysteresis to prevent flapping alerts: set recovery thresholds 5-10% below alert thresholds.
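A minimal shell sketch of that idea (thresholds, paths, and the notification step are illustrative): the alert fires once usage reaches 85%, but it is not cleared until usage drops back below 78%, so a value hovering near the threshold cannot flap.
#!/bin/bash
# hysteresis_example.sh - alert at 85%, clear only below 78%
ALERT_AT=85
CLEAR_AT=78
STATE_FILE=/tmp/disk_alert_state
usage=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
state=$(cat "$STATE_FILE" 2>/dev/null || echo "ok")
if [ "$state" = "ok" ] && [ "$usage" -ge "$ALERT_AT" ]; then
    echo "ALERT: disk usage ${usage}%"
    echo "alerting" > "$STATE_FILE"
elif [ "$state" = "alerting" ] && [ "$usage" -lt "$CLEAR_AT" ]; then
    echo "RECOVERED: disk usage ${usage}%"
    echo "ok" > "$STATE_FILE"
fi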
SLA and SLO Definition
Define clear service level objectives:
# Example SLO definitions
Web Application Availability: 99.9% (43 minutes downtime/month)
API Response Time: 95th percentile under 200ms
Database Query Performance: 99th percentile under 1 second
Escalation Procedures
Implement tiered alerting:
- Level 1 (0-15 minutes): On-call engineer via PagerDuty/SMS
- Level 2 (15-30 minutes): Team lead and backup engineer
- Level 3 (30+ minutes): Management and additional team members
Alert Fatigue Prevention
Reduce noise with intelligent alerting:
- Group related alerts (don't send 50 alerts for one network outage)
- Use dependencies (don't alert on web services when the database is down)
- Implement maintenance windows
- Regular alert review and tuning
Troubleshooting Common Monitoring Issues
Monitoring systems themselves can fail or create performance problems. Here's how to identify and resolve common issues.
High Resource Usage from Monitoring
If monitoring tools consume excessive resources:
# Check monitoring process resource usage
ps aux | grep -E "(prometheus|grafana|nagios|zabbix)" | sort -k3 -nr
# Reduce Prometheus retention period
# Add this flag to ExecStart in /etc/systemd/system/prometheus.service
# (retention is a command-line flag, not a prometheus.yml setting)
--storage.tsdb.retention.time=15d
# Optimize Grafana queries
# Use recording rules for complex queries
# Limit dashboard refresh rates
False Positive Reduction
Common causes and solutions:
- Network hiccups: Require 2-3 consecutive failures before alerting
- Scheduled maintenance: Implement maintenance windows
- Seasonal patterns: Use dynamic thresholds based on time/day
Monitoring System Failures
Implement meta-monitoring:
#!/bin/bash
# monitor_the_monitors.sh
# Check if monitoring services are running
services=("prometheus" "grafana-server" "nagios" "zabbix-server")
for service in "${services[@]}"; do
    if ! systemctl is-active --quiet "$service"; then
        echo "CRITICAL: $service is not running" | mail -s "Monitoring System Alert" admin@company.com
        systemctl restart "$service"
    fi
done
Scaling Your Monitoring Strategy
As your infrastructure grows, monitoring strategies must evolve to handle increased complexity and scale.
Capacity Planning for Monitoring
Plan monitoring resources based on infrastructure size:
- Small (1-10 servers): Single monitoring server, basic alerting
- Medium (10-100 servers): Dedicated monitoring cluster, advanced dashboards
- Large (100+ servers): Distributed monitoring, data retention policies, automated scaling
Multi-Server Monitoring Architecture
For large environments, implement hierarchical monitoring:
# Central Prometheus configuration for federation
- job_name: 'federate'
  scrape_interval: 15s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"prometheus|node_exporter"}'
  static_configs:
    - targets:
        - 'prometheus-site1:9090'
        - 'prometheus-site2:9090'
Cloud Integration Strategies
For hybrid cloud environments:
- Use cloud-native monitoring for cloud resources (CloudWatch, Azure Monitor)
- Implement VPN connections for secure metric collection
- Consider cloud-hosted monitoring solutions for global visibility
- Implement data sovereignty compliance for regulated industries
Monitoring Maturity Roadmap
Evolution path for monitoring systems:
- Foundation (Months 1-3): Basic monitoring, essential alerts
- Enhancement (Months 4-6): Custom dashboards, log aggregation
- Optimization (Months 7-12): Predictive monitoring, automation
- Innovation (Year 2+): AI-powered anomaly detection, self-healing systems
Next Steps and Continuous Improvement
Effective Linux server monitoring is an ongoing process that evolves with your infrastructure and business needs. Start with the fundamentals: master the built-in Linux commands and establish baseline monitoring with simple scripts. As your confidence and requirements grow, implement more sophisticated solutions like Prometheus and Grafana.
Remember that the best monitoring system is one that's actively maintained and regularly tuned. Schedule monthly reviews of your alerts, thresholds, and dashboards. Involve your entire team in monitoring discussions; the person who gets paged at 3 AM should have input on alert sensitivity.
The monitoring landscape continues evolving with new tools and techniques. Stay current with industry trends, but don't chase every new technology. Focus on solutions that solve real problems for your specific environment and team.
Your monitoring journey doesn't end here; it's a continuous process of improvement, learning, and adaptation. Start implementing these techniques today, and build the monitoring foundation that will keep your systems running smoothly for years to come.