Linux Administration Lesson 25 – Linux Administration Best Practices | Dataplexa
Section II — User, Process & Package Management

Linux Administration Best Practices

In this lesson

Change management
Principle of least privilege
Documentation habits
Patch management
Server provisioning standards

Linux administration best practices are the operational habits, disciplines, and standards that separate administrators who build reliable, secure, maintainable systems from those who build systems that work until they suddenly don't. Technical skill matters, but it is the practised habits around change, documentation, access control, and maintenance that determine whether a fleet of servers stays healthy over months and years — not just on the day it was provisioned.

The Administration Mindset

Before the specific practices, there is an underlying mindset that makes them coherent. Professional Linux administration is grounded in three principles that inform every decision: reversibility (every change should have a known rollback path), auditability (every change should leave a record), and minimal footprint (the system should contain only what is needed, configured to the minimum required privilege).

Reversibility — every change has a known rollback path. Back up before edit. Test before reload. One change at a time. Know how to undo it.

Auditability — every change leaves a traceable record. sudo logs every command. Document what you changed. Use config management. Commit infra to git.

Minimal footprint — only what is needed, at minimum privilege. No root login. Least-privilege accounts. Remove unused packages. Close unused ports.

Every specific practice in this lesson flows from one of these three principles.

Fig 1 — The three principles underlying professional Linux administration

Change Management

Most production outages are caused not by hardware failure but by uncontrolled change — a configuration edit made under pressure, a package upgrade without a maintenance window, a firewall rule added without testing. Disciplined change management is the single most impactful practice for reducing incident rate on production infrastructure.

Understand the full scope before touching anything

What is the intended outcome? What are the dependencies? What will break if this goes wrong? Who needs to be informed? A change made without answers to these questions is a gamble, not an operation.

Back up, validate, make one change at a time

Every config file gets a timestamped backup before editing. Every service gets a syntax check before reloading. Changes are made individually — never four edits simultaneously — so a problem is immediately attributable to the last action.

Test in staging before production

Any change that will touch a production service should be tested on a staging or development environment first. If no staging environment exists, creating a minimal one for testing configuration changes is worth the investment.

Document what changed, when, and why

A change log entry takes 30 seconds to write and has saved countless hours of post-incident investigation. The format does not matter — a git commit message, a ticket comment, or a dated line in a plain text file all work.

# Change log — append a timestamped entry after every significant action
echo "$(date -u '+%Y-%m-%d %H:%M UTC') | $(whoami) | Updated nginx worker_processes to auto — ticket #4421" \
  >> /var/log/admin-changes.log

# Backup pattern — timestamped, before every config edit
sudo cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.$(date +%Y%m%d-%H%M%S)

# Diff a config against its backup to confirm only intended changes
diff /etc/nginx/nginx.conf /etc/nginx/nginx.conf.20250312-141200

# Validate before applying
sudo nginx -t && sudo systemctl reload nginx

# Check what changed on the system today (package installs, file changes)
grep "$(date +%Y-%m-%d)" /var/log/dpkg.log    # Debian/Ubuntu
grep "$(date +%Y-%m-%d)" /var/log/dnf.log      # RHEL/Rocky
# cat /var/log/admin-changes.log
2025-03-10 09:14 UTC | alice | Increased nginx worker_connections to 2048 — perf test results
2025-03-11 14:22 UTC | alice | Added /etc/nginx/conf.d/api.conf — new API reverse proxy
2025-03-12 14:12 UTC | bob   | Updated nginx worker_processes to auto — ticket #4421

# diff /etc/nginx/nginx.conf /etc/nginx/nginx.conf.20250312-141200
4c4
< worker_processes auto;
---
> worker_processes 4;

What just happened? The change log shows a clear, attributable history of who changed what and why. The diff confirmed the change was exactly what was intended — only worker_processes changed from a hardcoded value to auto. If an incident occurs tonight, this diff instantly proves the change was deliberate and scoped to a single setting.

Access Control and the Principle of Least Privilege

The principle of least privilege states that every user, service, and process should operate with only the permissions it needs to do its job — nothing more. Applied consistently, this limits the blast radius of compromised accounts, misconfigured services, and human error. It is not a single setting but a continuous practice across users, files, services, and network access.

Users

No direct root logins. Every administrator has a named personal account. sudo access is granted per-command where possible, not blanket ALL. Departed users are locked immediately.

Services

Every daemon runs as a dedicated non-root service account. Unit files use NoNewPrivileges=true, PrivateTmp=true, and capability dropping.
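Those unit-file directives can be applied without modifying the packaged unit, via a drop-in. A minimal sketch — the service name myapp and account appsvc are illustrative placeholders, and which capabilities a real service needs varies:

```ini
# /etc/systemd/system/myapp.service.d/hardening.conf
# Drop-in applying least-privilege settings to a hypothetical "myapp" service
[Service]
User=appsvc
Group=appsvc
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
# Empty set drops all capabilities; add back only what the service requires
CapabilityBoundingSet=
```

After sudo systemctl daemon-reload && sudo systemctl restart myapp, running systemd-analyze security myapp reports how exposed the unit remains.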

Files

Secrets and config files are chmod 600, owned by the service account. World-readable is the default only where explicitly required. Never chmod 777.
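The pattern in miniature — the filename is illustrative, and in practice the file would also be chowned to the service account:

```shell
# Create a secrets file, then restrict it to owner read/write only (0600)
touch secrets.env
chmod 600 secrets.env

# Verify the mode — prints the octal permissions
stat -c '%a' secrets.env    # prints: 600
```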

Network

Firewall default-deny with explicit allow rules. Services bind to 127.0.0.1 unless external access is required. SSH keys only — no password authentication.

# Audit all UID-0 accounts — should only ever be root
awk -F: '($3 == 0) {print $1}' /etc/passwd

# Audit sudo permissions for all users
sudo grep -r "ALL" /etc/sudoers /etc/sudoers.d/ 2>/dev/null

# Find world-writable files (security risk)
sudo find / -xdev -perm -0002 -type f -ls 2>/dev/null | grep -v "/proc\|/sys"

# Find SUID/SGID binaries — legitimate ones are expected but review unknowns
sudo find / -xdev \( -perm -4000 -o -perm -2000 \) -type f -ls 2>/dev/null

# Audit which services are listening on network interfaces
sudo ss -tlnp

# Check SSH configuration for password authentication
grep -E "PasswordAuthentication|PermitRootLogin|PermitEmptyPasswords" /etc/ssh/sshd_config
# awk -F: '($3 == 0) {print $1}' /etc/passwd
root
toor    ← !! second UID-0 account — security incident

# grep -E "PasswordAuthentication|PermitRootLogin" /etc/ssh/sshd_config
PermitRootLogin no
PasswordAuthentication no

# sudo ss -tlnp
LISTEN  0  128  0.0.0.0:22    users:(("sshd",pid=892))
LISTEN  0  128  127.0.0.1:5432  users:(("postgres",pid=3012))
LISTEN  0  511  0.0.0.0:443   users:(("nginx",pid=1235))

What just happened? The UID-0 audit surfaced a second root-level account called toor — a classic backdoor indicator. This single command, run as part of a routine security audit, found what could have been an undetected compromise. The SSH config confirms correct hardening: no root login, no passwords. PostgreSQL correctly binds only to localhost.

Patch Management

Unpatched systems are the most common entry point for attackers. A consistent patch management process — checking for updates, evaluating security relevance, applying in a tested order, and verifying outcomes — keeps the attack surface minimal without sacrificing stability. The goal is not to apply every update the moment it appears, but to have a process that ensures critical security patches are applied within a defined window.

Patch Management — Practical Approach

Activity               | Cadence                               | Notes
Security-only patches  | Within 72 hours of CVE disclosure     | Critical CVEs (CVSS ≥ 9.0) within 24 hours. Use apt-get --only-upgrade or dnf update --security.
Full system updates    | Monthly maintenance window            | Test in staging first. Hold pinned packages. Schedule reboot if kernel was updated.
Kernel updates         | With full updates + scheduled reboot  | Keep previous kernel in GRUB for rollback. Verify service health after reboot.
Vulnerability scanning | Weekly                                | Tool: lynis. (unattended-upgrades on Debian and dnf-automatic on RHEL automate patch application rather than scanning.)
Patch verification     | After every patch cycle               | Confirm affected services still run. Check logs for errors. Update change log.
# Check for available security updates only (Debian/Ubuntu)
sudo apt list --upgradable 2>/dev/null | grep -i security

# Apply security updates only
sudo apt-get --only-upgrade install $(apt list --upgradable 2>/dev/null | \
  grep -i security | cut -d/ -f1 | tr '\n' ' ')

# Apply security updates only (RHEL/Rocky)
sudo dnf update --security -y

# Check if a reboot is required after patching (Debian/Ubuntu)
[ -f /var/run/reboot-required ] && cat /var/run/reboot-required.pkgs

# Run a basic security audit with lynis
sudo apt install lynis -y
sudo lynis audit system --quick

# Enable automatic security updates (Debian/Ubuntu)
sudo apt install unattended-upgrades -y
sudo dpkg-reconfigure -plow unattended-upgrades

# View unattended-upgrades log
tail -20 /var/log/unattended-upgrades/unattended-upgrades.log
# cat /var/run/reboot-required.pkgs
linux-image-6.5.0-1021-aws

# sudo lynis audit system --quick (summary section)
-[ Lynis 3.0.8 Results ]-
  Hardening index : 68 [##############      ]
  Tests performed : 247
  Plugins enabled : 2

  Components:
  - Firewall               [V]
  - Malware scanner        [X]
  - File integrity         [X]

  Suggestions (12):
  * Consider using a file integrity checker [FINT-4350]
  * Install a malware scanner [MALW-3280]

What just happened? The reboot-required check confirmed a kernel update is pending — a reboot must be scheduled to activate it. Lynis scored the system at 68 out of 100 and surfaced two actionable gaps: no file integrity checker (like AIDE) and no malware scanner installed. A lynis score below 70 typically indicates a server that has not been hardened beyond defaults and warrants a focused hardening pass.

Documentation and Reproducibility

An undocumented system is a liability. When the person who built it leaves, or when disaster recovery requires rebuilding it under pressure, documentation is the difference between a two-hour rebuild and a two-day investigation. The goal is not comprehensive prose documentation — it is runbooks (operational procedures), infrastructure as code (the configuration expressed in version-controlled files), and change logs (what changed and when).

Analogy: A documented system is like a recipe with a photo of the finished dish. Anyone can follow the recipe to reproduce the result — even years later, even under pressure, even someone who has never seen it before. An undocumented system is like handing someone a plate of food and asking them to recreate it from scratch. The plate looks the same but the method is lost.

# Document installed packages — capture the state of a known-good system
dpkg --get-selections > /root/installed-packages-$(date +%Y%m%d).txt     # Debian/Ubuntu
rpm -qa > /root/installed-packages-$(date +%Y%m%d).txt                   # RHEL/Rocky

# Document all listening services and their ports
sudo ss -tlnp > /root/listening-services-$(date +%Y%m%d).txt

# Document current crontab entries
crontab -l > /root/crontab-backup-$(date +%Y%m%d).txt
sudo crontab -l >> /root/crontab-backup-$(date +%Y%m%d).txt

# Capture firewall rules
sudo iptables-save > /root/firewall-rules-$(date +%Y%m%d).txt             # iptables
sudo ufw status verbose > /root/ufw-status-$(date +%Y%m%d).txt           # ufw

# Export current systemd enabled services
systemctl list-unit-files --state=enabled > /root/enabled-services-$(date +%Y%m%d).txt

# Version-control /etc with etckeeper (tracks all changes as git commits)
sudo apt install etckeeper -y
sudo etckeeper init
sudo etckeeper commit "Initial commit — post-provisioning baseline"
# sudo etckeeper commit "Updated nginx worker_processes — ticket #4421"
[master 3f8a2d1] Updated nginx worker_processes — ticket #4421
 1 file changed, 1 insertion(+), 1 deletion(-)

# git -C /etc log --oneline | head -5
3f8a2d1 Updated nginx worker_processes — ticket #4421
b2c3d4e Added /etc/nginx/conf.d/api.conf — new API proxy
a1b2c3d Enabled ufw with default-deny policy
9dc8582 Post-provisioning baseline — 2025-03-01

What just happened? etckeeper tracks every change to /etc as a git commit automatically on package install and manually when you commit. The log shows a complete, timestamped history of every configuration change made to this server since provisioning — with ticket references. This is audit-ready documentation produced with zero extra effort beyond normal operation.

New Server Provisioning Checklist

The first 30 minutes after provisioning a new server set the security and operational baseline for everything that follows. Applying the same checklist consistently means every server in a fleet starts from the same known-good state — reducing configuration drift, simplifying auditing, and making automation possible.

1. Update
Apply all updates immediately after provisioning

Cloud images are often weeks behind on patches. Run apt update && apt upgrade -y before doing anything else. Schedule a reboot if a kernel was updated.

2. Admin user
Create a named non-root admin account with sudo access

Never use the root account for ongoing work. Create a named personal account, add it to the sudo or wheel group, and verify sudo access before disabling root login.

3. SSH hardening
Disable root login and password authentication over SSH

Set PermitRootLogin no and PasswordAuthentication no in sshd_config. Always validate with sshd -t before reloading.

4. Firewall
Enable firewall with default-deny inbound policy

Allow only the ports the server actually needs. On cloud instances, combine host firewall (ufw/firewalld) with security group rules for defence in depth.

5. Time sync
Verify NTP is running and the clock is accurate

A server with a drifting clock produces unreliable logs, causes TLS certificate failures, and breaks distributed systems. Run timedatectl status to verify.

6. Monitoring
Enable sysstat, configure journal limits, set up log forwarding

A server that cannot be monitored or whose logs are not retained is invisible. Enable sysstat data collection, configure journal size limits, and forward logs to a central destination.
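The journal size limits mentioned in step 6 can be set with a journald drop-in. A sketch — the specific values are illustrative and should be sized to the server's disk and retention policy:

```ini
# /etc/systemd/journald.conf.d/size.conf
# Cap persistent journal disk usage; values are illustrative
[Journal]
SystemMaxUse=500M
SystemKeepFree=1G
MaxRetentionSec=1month
```

Apply with sudo systemctl restart systemd-journald, then confirm current usage with journalctl --disk-usage.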

# New server provisioning — complete first-steps script

# 1. Update everything
sudo apt update && sudo apt upgrade -y

# 2. Create admin user
sudo useradd -m -s /bin/bash -G sudo adminuser
sudo passwd adminuser
# Copy SSH public key: ssh-copy-id adminuser@server

# 3. Harden SSH (after confirming key login works!)
sudo sed -i 's/#PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i 's/#PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo sshd -t && sudo systemctl reload sshd

# 4. Enable firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp comment "SSH"
sudo ufw --force enable

# 5. Verify time sync
timedatectl status
systemctl is-active systemd-timesyncd

# 6. Enable sysstat and etckeeper
sudo apt install sysstat etckeeper -y
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl restart sysstat
sudo etckeeper init && sudo etckeeper commit "Post-provisioning baseline"

Always Verify SSH Key Access Before Disabling Password Authentication

A common and serious mistake during SSH hardening: disabling PasswordAuthentication before confirming that SSH key authentication works for your admin account. If your key is not in ~/.ssh/authorized_keys for the admin user, you will lock yourself out of the server — and recovery then requires physical or out-of-band console access. The correct sequence is: copy key → open a new SSH session and verify key login works → then disable password authentication.
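That sequence can be backed by a simple guard check before touching sshd_config. A sketch — the default path assumes the admin account's key was installed with ssh-copy-id; set KEYFILE explicitly when checking another user's home:

```shell
# Refuse to proceed with SSH hardening unless at least one public key
# is installed for the admin account. Default path is illustrative.
KEYFILE="${KEYFILE:-$HOME/.ssh/authorized_keys}"
if [ -s "$KEYFILE" ] && grep -qE '^(ssh|ecdsa)-' "$KEYFILE"; then
    echo "key present: safe to set PasswordAuthentication no"
else
    echo "no key found: keep password authentication enabled"
fi
```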

Lesson Checklist

I follow the reversibility principle — every config edit has a timestamped backup, every service change has a validation step, and I know how to roll back before I start
I routinely audit UID-0 accounts, world-writable files, SUID binaries, and listening ports as part of normal security hygiene
I apply security patches within a defined window (critical CVEs within 24–72 hours) and verify service health after every patch cycle
I use etckeeper or equivalent to version-control configuration changes, and I write a change log entry for every significant action
I apply a consistent six-step provisioning checklist to every new server: update, admin user, SSH hardening, firewall, time sync, monitoring

Teacher's Note

The single habit with the highest return on investment from this lesson is etckeeper. Install it on every server you provision, make the initial commit, and then every configuration change you make is automatically tracked with a timestamp. Six months later when an incident report asks "when was this setting changed and by whom?", the answer is git -C /etc log --oneline. It requires zero discipline to maintain once installed — it works automatically in the background.

Practice Questions

1. You are provisioning a new Ubuntu 24.04 server that will run a web application. Write a concise provisioning script (or ordered list of commands) covering all six steps of the provisioning checklist — update, admin user, SSH hardening, firewall, time sync, and monitoring setup.

2. A security audit finds an unknown account called svc_helper in the sudo group with UID 1003 on a production server. No one on the team knows who created it or why. Describe the exact steps you would take to investigate this account and explain the reasoning behind each step.

3. Your team currently applies patches manually whenever someone remembers, resulting in servers that are sometimes months out of date. Propose a practical patch management policy — specifying cadence, tooling, and verification steps — that the team could actually follow without it becoming a full-time job.

Lesson Quiz

1. Which of the following best describes the principle of least privilege applied to a database service?

2. What does etckeeper do, and why is it valuable compared to manually keeping config file backups?

3. During provisioning you disable password authentication in sshd_config and reload sshd — but you had not yet copied your SSH public key to the server. What happens, and how do you recover?

Up Next

Lesson 26 — Linux Networking Basics

Network interfaces, IP addressing, routing, and DNS configuration on Linux systems