Linux Administration Lesson 22 – Monitoring System Resources | Dataplexa
Section II — User, Process & Package Management

Monitoring System Resources

In this lesson

CPU and load · Memory with free and vmstat · Disk I/O with iostat · Network with ss and iftop · Historical data with sar

System resource monitoring is the practice of observing CPU usage, memory consumption, disk throughput, and network activity — both in real time during an incident and historically to understand trends and capacity. A Linux administrator who cannot read these metrics is flying blind: performance problems, runaway processes, and impending capacity crises all announce themselves in the numbers long before they become outages.

CPU Usage and Load Average

CPU monitoring has two distinct dimensions: utilisation (what percentage of CPU time is being consumed, and by what kind of work) and load average (how many processes are running or waiting to run). Both matter — a system can have low CPU utilisation but high load if processes are blocked in uninterruptible I/O wait (Linux counts those towards load), and a system can run at high CPU utilisation with a perfectly manageable load if a handful of processes legitimately use every core.

CPU Time Breakdown (top / vmstat %cpu columns): us 21% · sy 10% · ni 6% · wa · id 58%
us = user space processes · sy = kernel / system calls · ni = niced (low-priority) processes · wa = waiting for I/O (bad if >10%) · id = idle
Load Average (uptime / /proc/loadavg): 0.42 = 1-min avg · 1.87 = 5-min avg · 3.21 = 15-min avg
Load < number of CPU cores = healthy · Load > cores = processes queuing · This server has 2 CPUs — load 3.21 = overloaded

Fig 1 — CPU time breakdown and how to interpret the three load average values
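The load-versus-cores rule in Fig 1 can be scripted. A minimal sketch — the helper name load_per_core is illustrative, not a standard tool; it assumes the first field of /proc/loadavg is the 1-minute load average, which it is on Linux:

```shell
# load_per_core LOAD CORES — print the per-core load and a verdict.
# (Invented helper, not a standard command.)
load_per_core() {
    awk -v l="$1" -v c="$2" 'BEGIN {
        printf "%.2f %s\n", l / c, (l > c) ? "queuing" : "healthy"
    }'
}

# Live usage: compare the 1-minute load against the core count
load_per_core "$(awk '{print $1}' /proc/loadavg)" "$(nproc)"
```

A per-core value below 1.00 means cores are keeping up; sustained values above 1.00 mean processes are queuing for CPU time.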

# Quick system overview — uptime, users, load averages
uptime

# Number of CPU cores (divide load average by this to assess pressure)
nproc
lscpu | grep "^CPU(s):"

# Real-time CPU stats — refresh every 2 seconds, 5 iterations
vmstat 2 5

# Per-CPU breakdown — shows if load is balanced across cores
mpstat -P ALL 1 3

# top sorted by CPU — press P inside top to sort by CPU
top

# Show only CPU-intensive processes (threshold > 10% CPU)
ps aux --sort=-%cpu | awk 'NR==1 || $3 > 10'
# uptime
 14:31:22 up 12 days,  3:42,  2 users,  load average: 0.42, 1.87, 3.21

# vmstat 2 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 412344 123456 2048000    0    0     4    28  321  612 21 10 58  3  0
 3  0      0 411200 123456 2048000    0    0     0    64  418  731 34 12 52  2  0
 1  0      0 410800 123456 2048000    0    0     0    12  289  541 18  8 72  2  0

What just happened? vmstat revealed key patterns across three samples. The r column (processes waiting to run) fluctuated between 1 and 3 — on a 2-CPU system, that is consistent with the elevated load averages. The wa column stayed at 2–3% — not alarming. The si/so columns (swap in/out) are both zero, meaning this system is not thrashing swap — the load is genuinely CPU-bound, not memory-starved.

Memory Monitoring with free and /proc/meminfo

Memory monitoring on Linux requires understanding the difference between memory that is merely in use and memory that is truly unavailable. Linux aggressively uses otherwise-idle RAM as disk cache — this is intentional and beneficial. The number to watch is available memory, not free memory. When available memory approaches zero, the system starts using swap and performance degrades sharply.

Analogy: "Free" memory is like cash sitting idle in a drawer — technically available but unproductive. Linux takes that idle cash and puts it to work as disk cache, speeding up file reads. "Available" memory is the amount that can be reclaimed almost instantly if a new process needs it — free RAM plus reclaimable cache. A high "used" but also high "available" reading is healthy, not alarming.

# Human-readable memory overview — the most important command
free -h

# Refresh every 2 seconds
free -h -s 2

# Detailed memory breakdown from the kernel
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Cached|Buffers|SwapTotal|SwapFree"

# Show per-process memory consumption — RSS = actual physical RAM used
ps aux --sort=-%mem | head -15

# Show top 10 memory-consuming processes cleanly
ps -eo pid,user,%mem,rss,comm --sort=-%mem | head -11

# Check if the OOM killer has activated (kills processes when RAM is exhausted)
dmesg | grep -i "oom\|killed process"
journalctl -k | grep -i "oom\|killed process"
# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       3.2Gi       412Mi       156Mi       4.2Gi       4.1Gi
Swap:          2.0Gi          0B       2.0Gi

# cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Cached|SwapFree"
MemTotal:        8142508 kB
MemFree:          422344 kB
MemAvailable:    4198400 kB
Buffers:          126464 kB
Cached:          4071680 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB

# dmesg | grep -i "oom"
[142831.204702] Out of memory: Killed process 14821 (java) total-vm:8421376kB, \
  anon-rss:7340032kB, file-rss:0kB, shmem-rss:0kB

What just happened? free -h showed 412Mi free but 4.1Gi available — the difference is 4.2Gi of reclaimable buff/cache. Swap is untouched at 2Gi free, confirming healthy memory pressure. However, dmesg revealed that an OOM kill event occurred earlier — a Java process consumed 7GB of anonymous memory and the kernel forcibly terminated it to protect system stability. This is a critical finding that explains a past service crash.

Disk I/O Monitoring with iostat and iotop

A system can have plenty of free CPU and memory yet feel completely unresponsive because a disk is saturated. iostat provides per-device I/O statistics — throughput, IOPS, and crucially the utilisation percentage and average wait time that reveal whether a disk is becoming a bottleneck.

iostat -x Output — Key Columns

r/s, w/s — Read and write operations per second (IOPS). High values on HDDs (>200) may indicate saturation; SSDs handle thousands.
rkB/s, wkB/s — Read and write throughput in KB/s. Compare against the device's rated sequential speed to judge how close to saturation it is.
await — Average time (ms) for I/O requests to complete. For HDDs: <10ms is good, >50ms is bad. For SSDs: <1ms is expected. A rising await is the earliest warning of disk trouble.
%util — Percentage of time the device was busy. >80% on an HDD means it is saturated. SSDs can sustain near 100% without being a bottleneck thanks to internal parallelism.
aqu-sz — Average queue size: the number of I/O requests waiting. A sustained queue >1 on a single HDD means more I/O is being requested than the disk can service.
# Install sysstat (provides iostat, mpstat, sar)
sudo apt install sysstat -y       # Debian/Ubuntu
sudo dnf install sysstat -y       # RHEL/Rocky

# Extended disk stats — refresh every 2 seconds
iostat -xh 2

# Show only specific device
iostat -xh 2 /dev/sda

# Find which process is causing high disk I/O (requires root)
sudo iotop -o         # -o = show only active processes

# Non-interactive iotop snapshot
sudo iotop -b -n 3 -o | head -20

# Watch disk I/O in real time via /proc
watch -n 1 cat /proc/diskstats
# iostat -xh 2
Device  r/s   w/s  rkB/s  wkB/s  await  %util
sda    2.4   48.2   96.0  4821.1   3.2    18%
sdb    0.2  312.4    8.0 52428.8  124.8   97%

# sudo iotop -b -n 1 -o
Total DISK READ:   0.00 B/s | Total DISK WRITE: 51.62 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  COMMAND
 9821 be/4  postgres  0.00 B/s  48.34 M/s   postgres: autovacuum worker
 9044 be/4  www-data  0.00 B/s   3.28 M/s   php-fpm: pool www

What just happened? iostat immediately flagged sdb — 97% utilised with a 124.8ms await time. That disk is saturated. iotop then identified the culprit: PostgreSQL's autovacuum worker writing 48MB/s. The combination of iostat (which device?) and iotop (which process?) is the standard two-step I/O investigation workflow.
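The "which device?" half of that workflow can be partly automated. A hedged sketch — flag_saturated is an invented name, and the %util column is located by header text because its position varies across sysstat versions:

```shell
# flag_saturated THRESHOLD — reads `iostat -x` output on stdin and
# prints the devices whose %util exceeds THRESHOLD.
# (Invented helper; assumes the header line starts with "Device".)
flag_saturated() {
    awk -v thr="$1" '
        /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        col && NF >= col && $(col) + 0 > thr { print $1 }
    '
}

# Live usage (requires sysstat):  iostat -x 1 2 | flag_saturated 80
```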

Network Monitoring with ss, ip, and iftop

Network monitoring answers three distinct questions: what ports are open and which processes own them (ss), what are the current interface throughput figures (ip / iftop), and which remote hosts are consuming the most bandwidth (iftop / nethogs).

# ss — the modern replacement for netstat

# Show all listening TCP ports with process names
sudo ss -tlnp

# Show all established TCP connections
sudo ss -tnp state established

# Show connections on local port 443 (sport = source/local port)
sudo ss -tnp sport = :443

# Count connections per state (useful for detecting SYN floods)
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Show socket statistics summary
ss -s

# ── Interface throughput ──────────────────────────────────────────

# Show interface statistics (RX/TX bytes, errors, drops)
ip -s link show eth0

# Watch live interface counters
watch -n 1 'ip -s link show eth0 | grep -A 4 "RX:"'

# iftop — real-time bandwidth by connection (requires install)
sudo apt install iftop -y
sudo iftop -i eth0

# nethogs — bandwidth per process
sudo apt install nethogs -y
sudo nethogs eth0
# sudo ss -tlnp
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*          users:(("sshd",pid=892))
LISTEN  0       511     0.0.0.0:80          0.0.0.0:*          users:(("nginx",pid=1235))
LISTEN  0       511     0.0.0.0:443         0.0.0.0:*          users:(("nginx",pid=1235))
LISTEN  0       128     127.0.0.1:5432      0.0.0.0:*          users:(("postgres",pid=3012))

# ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
    412 ESTABLISHED
     18 TIME-WAIT
      4 LISTEN
      1 State

What just happened? ss -tlnp confirmed that PostgreSQL is bound only to 127.0.0.1:5432 — not exposed to the network, which is correct. The connection state count shows 412 established connections and 18 TIME-WAIT — both within normal ranges. A large number of TIME-WAIT connections (thousands) or rapidly growing CLOSE-WAIT connections would signal a connection handling problem worth investigating.
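The CLOSE-WAIT warning above is easy to turn into a check. A sketch — close_wait_alert is an invented name and the threshold of 100 is arbitrary, not an established limit:

```shell
# close_wait_alert THRESHOLD — reads `ss -tan` output on stdin,
# counts CLOSE-WAIT sockets, and flags counts above THRESHOLD.
# (Invented helper, not a standard command.)
close_wait_alert() {
    awk -v thr="$1" '
        $1 == "CLOSE-WAIT" { n++ }
        END {
            if (n + 0 > thr) print "CLOSE-WAIT:", n, "(investigate)"
            else             print "CLOSE-WAIT:", n + 0
        }'
}

# Live usage:  ss -tan | close_wait_alert 100
```

A steadily growing CLOSE-WAIT count usually means an application is failing to close() sockets the remote end has already shut down.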

Historical Monitoring with sar

sar (System Activity Reporter) is part of the sysstat package and records system performance data every 10 minutes by default. Unlike real-time tools, sar lets you look back in time — answering "what was CPU usage at 3 AM last Thursday when the backup ran?" or "when exactly did memory start climbing yesterday?"

sar -u

Historical CPU utilisation. Shows user, system, iowait, and idle percentages across the day.

sar -r

Historical memory usage — free, used, cached, and available memory across time.

sar -d

Historical disk activity — await times and utilisation for each block device over time.

sar -n DEV

Historical network interface throughput — RX/TX packets and bytes per second over time.

# Enable sysstat data collection (disabled by default on some distros)
sudo systemctl enable --now sysstat

# On Debian/Ubuntu — enable in the config file
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl restart sysstat

# View today's CPU history — all samples
sar -u

# View CPU history for a specific hour window today
sar -u -s 02:00:00 -e 04:00:00

# View CPU history from a specific past day (data files in /var/log/sysstat/)
sar -u -f /var/log/sysstat/sa10       # day 10 (Debian/Ubuntu path; RHEL-family uses /var/log/sa/)

# Memory history for today
sar -r

# Disk activity history today
sar -d -p                              # -p = pretty device names

# Network throughput history today
sar -n DEV | grep eth0

# Run sar live — 5 samples, 2 seconds apart (like vmstat)
sar -u 2 5
# sar -u -s 02:00:00 -e 04:00:00
Linux 6.5.0 (server1)   03/12/2025   _x86_64_   (2 CPU)

02:00:01  CPU  %user  %nice %system %iowait  %steal  %idle
02:10:01  all   2.14   0.00    0.82    0.12    0.00  96.92
02:20:01  all   1.98   0.00    0.71    0.08    0.00  97.23
02:30:01  all  48.21   0.00    8.42   12.31    0.00  31.06   ← backup spike
02:40:01  all  51.33   0.00    9.18   14.22    0.00  25.27
02:50:01  all   2.31   0.00    0.89    0.11    0.00  96.69
03:00:01  all   2.05   0.00    0.74    0.09    0.00  97.12

What just happened? The sar output immediately revealed a pattern invisible to real-time monitoring: a scheduled backup at 02:30 caused CPU usage to spike from 2% to 51% and I/O wait to jump from 0.1% to 14%. This explains why users reported slow response at that time. Without sar's historical data, you would need the incident to recur to diagnose it.
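Spotting such windows can be scripted rather than eyeballed. A hedged sketch — high_iowait is an invented helper, and it assumes the sar -u column layout shown above (%iowait is the sixth field on "all" rows):

```shell
# high_iowait THRESHOLD — reads `sar -u` output on stdin and prints
# the timestamps of samples whose %iowait exceeded THRESHOLD.
# (Invented helper, not a standard command.)
high_iowait() {
    awk -v thr="$1" '$2 == "all" && $6 + 0 > thr { print $1, "%iowait=" $6 }'
}

# Live usage:  sar -u -s 02:00:00 -e 04:00:00 | high_iowait 10
```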

A Systematic Monitoring Workflow for Performance Incidents

When a server is slow or unresponsive, the temptation is to jump to conclusions — to restart nginx, reboot the server, or blame the database. A systematic four-resource check takes under two minutes and almost always identifies the actual bottleneck before any changes are made.

Step 1 — Check the load and CPU (uptime, vmstat 1 5)

Is load above the number of CPU cores? Is iowait elevated? Is there swap activity (si/so > 0)? A high r column with low iowait = CPU bottleneck. High iowait = I/O bottleneck. Swap activity = memory problem.

Step 2 — Check memory (free -h, dmesg | grep oom)

Is available memory below 10% of total? Is swap being used? Has the OOM killer fired? If yes to any, identify the memory hog with ps aux --sort=-%mem | head.

Step 3 — Check disk I/O (iostat -xh 2)

Is any disk at >80% utilisation? Is await time elevated? If yes, run sudo iotop -o to find which process is causing the I/O.

Step 4 — Check network (ss -s, ip -s link)

Are there unusual numbers of connections? Are interface error or drop counters climbing? Is there unexpected bandwidth consumption? Run sudo iftop to see real-time bandwidth by connection.

# Full 4-resource check — run this sequence when a server is slow

# 1. CPU and load
uptime
vmstat 1 5

# 2. Memory
free -h
dmesg | grep -i oom | tail -5

# 3. Disk I/O
iostat -xh 2 3

# 4. Network
ss -s
ip -s link show eth0 | grep -A 2 "RX:\|TX:"

# Bonus: identify the top 5 processes consuming most resources
ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -6

High iowait Does Not Mean the Disk Is Broken — It Means Processes Are Waiting for It

A common misreading: iowait in vmstat and the CPU stats does not mean the CPU is doing I/O. It means the CPU is idle because processes are blocked waiting for I/O. High iowait on its own does not prove a local disk is the problem — check it against %util and await in iostat. High iowait together with high %util and rising await means the disk is genuinely saturated; high iowait while every local disk is mostly idle points elsewhere, such as bursty I/O or waits on a slow network filesystem. Always read both numbers together before drawing conclusions.
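One second of /proc/stat arithmetic makes the iowait figure concrete. A sketch — iowait_pct is an invented helper; the field positions follow the documented layout of the aggregate "cpu" line (user nice system idle iowait irq softirq steal):

```shell
# iowait_pct DELTA_IOWAIT DELTA_TOTAL — iowait as a percentage of all
# CPU ticks elapsed between two /proc/stat samples. (Invented helper.)
iowait_pct() {
    awk -v dw="$1" -v dt="$2" 'BEGIN { printf "%.1f\n", dw * 100 / dt }'
}

# Sample the "cpu" aggregate line twice, one second apart.
# $6 is iowait; the sum covers user nice system idle iowait irq softirq steal.
read_cpu() { awk '/^cpu / { print $2+$3+$4+$5+$6+$7+$8+$9, $6 }' /proc/stat; }
set -- $(read_cpu); t1=$1; w1=$2
sleep 1
set -- $(read_cpu); t2=$1; w2=$2
echo "iowait over 1s: $(iowait_pct $((w2 - w1)) $((t2 - t1)))%"
```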

Lesson Checklist

I can interpret load average relative to CPU core count, and distinguish CPU-bound, I/O-bound, and memory-bound problems from vmstat output
I read free -h using the available column (not free), and I check dmesg | grep oom to detect past OOM kill events
I use iostat -xh to identify saturated disks, and iotop -o to find which process is causing the I/O
I use ss -tlnp to audit open ports and owning processes, replacing the deprecated netstat
I have enabled sysstat data collection and can use sar to look back at historical CPU, memory, and disk metrics to reconstruct past performance incidents

Teacher's Note

Enable sysstat on every server you provision — it is a tiny overhead with enormous diagnostic value. The single most useful feature is being able to run sar -u -f /var/log/sysstat/saYY the day after an incident and see exactly what the system was doing when the alert fired. Without it, past performance data is gone and post-mortems rely on guesswork.

Practice Questions

1. A 4-core server shows a load average of 0.8 / 2.4 / 6.1. Interpret each value and describe what the trend suggests about the server's state over the last 15 minutes. Is intervention urgent? What would you check next?

2. iostat -xh shows sdb at 94% utilisation with an await of 187ms. Describe the complete two-command sequence you would run to (a) confirm the disk is saturated and (b) identify which specific process is responsible for the I/O load.

3. free -h shows: total 16Gi, used 14Gi, free 128Mi, available 1.2Gi, swap used 3.1Gi. Is this system in a memory crisis? Explain your reasoning using the correct column to assess memory pressure.

Lesson Quiz

1. On a 2-core server, the load average is reported as 4.2 3.8 2.1. What does this indicate?

2. In free -h output, why should you look at the available column rather than the free column to assess memory pressure?

3. Which tool would you use to answer the question "what was the CPU usage on this server between 3 AM and 4 AM last Tuesday?"

Up Next

Lesson 23 — Managing Software Repositories

Adding, configuring, and securing package repositories on Debian and RHEL-family systems