Linux Administration
Network Configuration & Troubleshooting
In this lesson
Network configuration and troubleshooting goes beyond setting an IP address. Real-world server networking involves multiple redundant links, logical segmentation with VLANs, precise socket inspection, and packet-level diagnosis when standard tools cannot pinpoint the problem. This lesson builds the advanced toolkit that closes the gap between "the network is broken" and "the exact packet on the exact port is being dropped by this specific rule."
Advanced ip Command Usage
The ip command suite covers far more than addresses and routes. Its sub-commands for monitoring, policy routing, network namespaces, and statistics give administrators precise, real-time visibility into every aspect of the network stack.
| Sub-command | Purpose |
|---|---|
| ip addr | View and manage IP addresses on interfaces. |
| ip link | Manage network interfaces — rename, set MTU, bring up/down, create virtual devices. |
| ip route | View and manipulate the kernel routing table. Supports multiple routing tables via the table parameter. |
| ip neigh | ARP/NDP neighbour table — IP-to-MAC mappings. Add, delete, or flush entries. |
| ip rule | Policy routing rules — route traffic based on source IP, TOS, or fwmark rather than just destination. |
| ip netns | Network namespaces — isolated network stacks. The foundation of container networking. |
| ip -s link | Show per-interface statistics — RX/TX packets, bytes, errors, dropped, overruns. |
# Monitor network events in real time — interface state changes, address changes
ip monitor all
# Watch interface statistics refresh every second
watch -n 1 'ip -s link show eth0'
# Manage the ARP cache — useful when a remote host has changed its MAC
ip neigh show
ip neigh flush dev eth0 # flush all ARP entries for eth0
sudo ip neigh add 192.168.1.50 lladdr 52:54:00:ab:cd:ef dev eth0 # add static entry
# Change interface MTU (e.g. for jumbo frames on a 10GbE network)
sudo ip link set eth0 mtu 9000
# Rename an interface
sudo ip link set eth0 name wan0
# Create a dummy interface for testing
sudo ip link add dummy0 type dummy
sudo ip addr add 10.99.0.1/24 dev dummy0
sudo ip link set dummy0 up
# ip -s link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP
RX: bytes packets errors dropped overrun mcast
14285312 98421 0 0 0 0
TX: bytes packets errors dropped carrier collsn
3932160 27841 0 0 0 0
# ip neigh show
192.168.1.1 dev eth0 lladdr 52:54:00:11:22:33 REACHABLE
192.168.1.20 dev eth0 lladdr 52:54:00:44:55:66 STALE
What just happened? ip -s link showed zero errors and zero dropped packets on eth0 — a healthy interface. The ARP table shows the gateway (192.168.1.1) in REACHABLE state, meaning it has recently been confirmed active. The STALE entry for 192.168.1.20 means it has not been confirmed recently but the MAC is still cached — it will be probed on next use. Persistent errors or dropped packets in the statistics are the first indicators of a hardware or driver problem.
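That counter check can be automated across every interface. A minimal sketch that sweeps sysfs and flags any non-zero error or drop counters (pure POSIX shell, no extra tools assumed):

```shell
#!/bin/sh
# Walk /sys/class/net and report interfaces whose error or drop
# counters are non-zero; silence means every interface is clean
for dev in /sys/class/net/*; do
    ifname=${dev##*/}
    rx_err=$(cat "$dev/statistics/rx_errors")
    rx_drop=$(cat "$dev/statistics/rx_dropped")
    tx_err=$(cat "$dev/statistics/tx_errors")
    tx_drop=$(cat "$dev/statistics/tx_dropped")
    if [ $((rx_err + rx_drop + tx_err + tx_drop)) -gt 0 ]; then
        echo "$ifname: rx_err=$rx_err rx_drop=$rx_drop tx_err=$tx_err tx_drop=$tx_drop"
    fi
done
```

Run it from cron or a monitoring agent to catch the "first indicators of a hardware or driver problem" mentioned above before users notice them.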
Bonding and VLANs
Production servers often need more than a single network interface. Bonding (also called teaming or link aggregation) combines multiple physical interfaces into one logical interface for redundancy or increased throughput. VLANs (Virtual LANs) segment a single physical link into multiple logical networks — the switch tags traffic with an 802.1Q VLAN ID, and Linux creates virtual sub-interfaces to handle each tag.
Network Bonding — Link Aggregation
Multiple physical NICs appear as one logical interface. Common modes: active-backup (failover), balance-rr (round-robin), 802.3ad (LACP — requires switch support).
# Netplan bond example — member interfaces must also be declared
network:
  version: 2
  ethernets:
    eth0: {}
    eth1: {}
  bonds:
    bond0:
      interfaces: [eth0, eth1]
      parameters:
        mode: active-backup
        primary: eth0
VLANs — Logical Segmentation
One physical interface carries traffic for multiple networks. Each VLAN creates a sub-interface named eth0.10 (interface.vlan-id). Requires a trunk port on the connected switch.
# Netplan VLAN example — the parent interface must also be declared
network:
  version: 2
  ethernets:
    eth0: {}
  vlans:
    eth0.10:
      id: 10
      link: eth0
      addresses: [10.10.0.5/24]
    eth0.20:
      id: 20
      link: eth0
      addresses: [10.20.0.5/24]
# ── Create a bond manually (temporary) ───────────────────────────
# Load the bonding kernel module
sudo modprobe bonding
# Create the bond interface
sudo ip link add bond0 type bond mode active-backup
# Add physical interfaces as bond members (take them down first)
sudo ip link set eth0 down
sudo ip link set eth1 down
sudo ip link set eth0 master bond0
sudo ip link set eth1 master bond0
# Bring the bond up and assign an IP
sudo ip link set bond0 up
sudo ip addr add 192.168.1.10/24 dev bond0
# Check bond status
cat /proc/net/bonding/bond0
# ── Create a VLAN sub-interface manually (temporary) ─────────────
sudo ip link add link eth0 name eth0.10 type vlan id 10
sudo ip addr add 10.10.0.5/24 dev eth0.10
sudo ip link set eth0.10 up
# View VLAN interfaces
ip link show type vlan
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.5.0
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect failure)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
What just happened? /proc/net/bonding/bond0 confirmed the bond is healthy — both slaves are up at 1000Mbps full-duplex, and eth0 is the active slave. In active-backup mode, all traffic goes through eth0; if eth0 fails, the kernel automatically switches to eth1 within the MII polling interval of 100ms — transparent to any application using the bond0 interface.
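When checking many hosts, the interesting fields of the bonding status file can be pulled out with a short awk filter. A sketch, assuming the standard layout shown above and a host with an active bond0:

```shell
# Extract the active slave and per-slave link status from
# /proc/net/bonding/bond0 (fields split on ": ")
awk -F': ' '
    /Currently Active Slave/ { print "active:", $2 }
    /^Slave Interface/       { slave = $2 }
    /^MII Status/ && slave   { print slave, "link:", $2; slave = "" }
' /proc/net/bonding/bond0
```

The bond-level "MII Status" line is skipped automatically because it appears before any "Slave Interface" line sets the slave variable.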
Deep Socket Inspection with ss
ss (socket statistics) is the modern replacement for netstat. It reads directly from the kernel's socket structures rather than parsing /proc files, making it faster and more accurate — particularly important when a system has thousands of connections. Its filter syntax allows very precise queries that netstat cannot match.
# All listening TCP sockets with process info
sudo ss -tlnp
# All established TCP connections
sudo ss -tnp state established
# All sockets — TCP + UDP + Unix
sudo ss -anp
# Filter by peer port — our outgoing connections to remote port 443
sudo ss -tnp dst :443
# Filter by local port — which clients are connected to our service on 5432?
sudo ss -tnp src :5432
# Filter by state — all sockets in TIME-WAIT
ss -tan state time-wait
# Count sockets per state — detect connection storms
ss -tan | awk 'NR>1 {state[$1]++} END {for(s in state) print s, state[s]}' | sort -k2 -rn
# Show socket memory usage — helpful for diagnosing buffer exhaustion
ss -tm
# Show detailed TCP internals — retransmits, RTT, congestion window
sudo ss -ti dst 8.8.8.8
# ss -tan | awk 'NR>1 {state[$1]++} END {for(s in state) print s, state[s]}' | sort -k2 -rn
ESTABLISHED 412
TIME-WAIT 18
LISTEN 4
CLOSE-WAIT 3
# sudo ss -ti dst 8.8.8.8
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 192.168.1.10:52341 8.8.8.8:443
cubic wscale:7,7 rto:204 rtt:3.218/1.432 ato:40
mss:1460 pmtu:1500 rcvmss:1460 advmss:1460
cwnd:10 bytes_sent:2184 bytes_acked:2184 bytes_received:4096
retrans:0/0 dsack_dups:0 reord_seen:0
What just happened? The TCP internals view from ss -ti revealed the connection's round-trip time (3.2ms), congestion window (10 segments), and — critically — zero retransmits. Non-zero retransmits indicate packet loss on the path to that destination. This level of per-connection detail was previously only available through packet capture tools like tcpdump.
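The per-state counting trick above turns into a crude early-warning check for local port exhaustion. A sketch; the threshold of 20000 is an arbitrary example value to tune per host:

```shell
#!/bin/sh
# Warn when TIME-WAIT sockets approach ephemeral port range size
THRESHOLD=20000                              # example value, tune per host
tw=$(ss -H -tan state time-wait | wc -l)     # -H suppresses the header line
if [ "$tw" -ge "$THRESHOLD" ]; then
    echo "WARNING: $tw TIME-WAIT sockets (threshold $THRESHOLD)"
fi
```

Dropped into cron, this catches a connection storm while there is still time to enable tcp_tw_reuse or add keep-alive at the application layer.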
Packet Capture with tcpdump
When higher-level tools cannot identify a problem, tcpdump captures actual network packets — the ground truth of what is happening on the wire. It is the definitive tool for confirming whether traffic is arriving, what its content looks like, and whether firewall rules or NAT are transforming it unexpectedly.
Capture all traffic to or from a specific host.
host 192.168.1.50
src host 10.0.0.5
dst host 8.8.8.8
Capture traffic on a specific port regardless of direction.
port 443
port 80 or port 443
portrange 8000-9000
Capture only specific protocol traffic.
tcp
udp
icmp
arp
Use and, or, not for compound filters.
host 10.0.0.5 and port 443
tcp and not port 22
# Capture on eth0 — show packet summaries (no DNS resolution, more verbose)
sudo tcpdump -i eth0 -n
# Capture only HTTPS traffic to/from a specific host
sudo tcpdump -i eth0 -n host 192.168.1.50 and port 443
# Capture all ICMP (ping) traffic to debug reachability
sudo tcpdump -i eth0 -n icmp
# Capture and save to a file for analysis in Wireshark
sudo tcpdump -i eth0 -n -w /tmp/capture.pcap
# Replay or read a saved capture file
sudo tcpdump -r /tmp/capture.pcap -n
# Print packet payload in ASCII — see HTTP requests/responses
sudo tcpdump -i eth0 -n -A port 80
# Print first 100 bytes of payload in hex + ASCII
sudo tcpdump -i eth0 -n -X -s 100 port 8080
# Capture only SYN packets — detect connection attempts (useful for DDoS detection)
sudo tcpdump -i eth0 -n 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'
# sudo tcpdump -i eth0 -n host 192.168.1.50 and port 443
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:32:01.123456 IP 192.168.1.10.52341 > 192.168.1.50.443: Flags [S], seq 123456789, win 64240
14:32:01.124100 IP 192.168.1.50.443 > 192.168.1.10.52341: Flags [S.], seq 987654321, ack 123456790
14:32:01.124210 IP 192.168.1.10.52341 > 192.168.1.50.443: Flags [.], ack 987654322, win 502
14:32:01.128000 IP 192.168.1.10.52341 > 192.168.1.50.443: Flags [P.], length 517
14:32:01.140000 IP 192.168.1.50.443 > 192.168.1.10.52341: Flags [P.], length 1448
What just happened? The capture showed a complete TCP three-way handshake: [S] SYN from client, [S.] SYN-ACK from server, [.] ACK completing the handshake. Then [P.] PSH-ACK data transfers in both directions — confirming a successfully established TLS connection. If the server had not replied to the SYN, we would know the problem is at the server (firewall dropping, service not listening), not the client.
Systematic Network Troubleshooting Methodology
Network problems in production almost always fall into one of five root cause categories. A disciplined methodology that maps symptoms to categories — and then applies the right tool for each — resolves the majority of incidents without guesswork.
Fig 1 — Network troubleshooting decision tree: each question isolates one layer
# Step 1 — gateway reachable?
ping -c 3 $(ip route | grep default | awk '{print $3}')
# Step 2 — internet IP reachable?
ping -c 3 8.8.8.8
# Step 3 — DNS resolution working?
dig +short google.com
# or bypass system resolver to isolate:
dig @8.8.8.8 +short google.com
# Step 4 — target port open?
nc -zv target.example.com 443
# or:
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/target.example.com/443' \
&& echo "port open" || echo "port closed/filtered"
# Step 5 — if port appears closed, confirm with tcpdump whether packets arrive
sudo tcpdump -i eth0 -n host target.example.com and port 443
# Check iptables rules if packets arrive but port appears closed
sudo iptables -L -n -v | grep -E "443|REJECT|DROP"
Common Network Issues and Their Signatures
Experienced administrators recognise network problems by their characteristic patterns in tool output. The following signatures appear repeatedly in real-world troubleshooting and are worth memorising as diagnostic patterns.
Thousands of TIME-WAIT sockets
Normal for busy HTTP servers — each closed connection enters TIME-WAIT for 60 seconds. It becomes a problem when local ephemeral ports run out. Mitigate with net.ipv4.tcp_tw_reuse=1 in sysctl and keep-alive connections.
Accumulating CLOSE-WAIT sockets
CLOSE-WAIT means the remote side closed the connection but the local application has not called close() on its socket. A growing count means the application is leaking sockets. Restart the service and file a bug.
Non-zero retransmit counters
TCP retransmits confirm packets are being dropped between source and destination. Run mtr target.example.com to identify which hop is losing packets.
SYN arrives but no SYN-ACK leaves
If tcpdump on the server shows SYN arriving but no SYN-ACK going out, either the port is not listening or the local firewall (iptables/nftables) is dropping the packet before it reaches the application.
Climbing RX dropped/overrun counters
The kernel receive buffer fills before the application can read the data — either the application is too slow, the ring buffer is undersized, or interrupt affinity needs tuning. Check with ethtool -S eth0.
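When CLOSE-WAIT sockets accumulate, the next question is which process owns them. A sketch that groups them by the process field of ss -p (output format assumed as in current iproute2, where that field starts with users:):

```shell
# Group CLOSE-WAIT sockets by owning process; highest counts first
sudo ss -tnp state close-wait \
    | awk 'NR>1 { for (i = 1; i <= NF; i++) if ($i ~ /^users:/) print $i }' \
    | sort | uniq -c | sort -rn
```

The process with the largest count is the one to restart, and the one whose bug report should mention a socket leak.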
tcpdump on a Busy Interface Can Itself Cause Packet Drops
Running tcpdump without filters on a high-traffic interface (10Gbps+) can consume significant CPU and create a feedback loop — the capture tool competes for the same CPU as the network stack, potentially increasing the very packet drops you are trying to diagnose. Always apply the most specific filter expression possible (host X and port Y), use -s to limit snapshot length, and prefer writing to a file (-w) over printing to the terminal on busy production servers.
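One way to follow that advice on long-running captures is tcpdump's built-in file rotation, which caps both disk usage and per-packet cost:

```shell
# Rotating capture: -C rotates after ~100 MB, -W keeps at most 5 files
# (ring.pcap0 .. ring.pcap4), -s 96 truncates each packet to headers only
sudo tcpdump -i eth0 -n -s 96 -C 100 -W 5 -w /tmp/ring.pcap \
    'host 192.168.1.50 and port 443'
```

The capture can then run unattended for hours while waiting for an intermittent fault to reproduce, without any risk of filling the disk.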
Lesson Checklist
I can use ip -s link and ip neigh to inspect interface statistics and ARP cache entries, and I know what error/dropped counters indicate
I can create bonds and VLAN sub-interfaces with ip link and configure them persistently with Netplan
I can use ss -ti to inspect TCP internals including retransmits and RTT, and I can count connections per state to detect anomalies
I can write tcpdump filter expressions and interpret TCP flag sequences to confirm whether a connection is completing its handshake
Teacher's Note
The most powerful single troubleshooting technique in this lesson is running tcpdump simultaneously on both ends of a failing connection. If packets appear on the sender's tcpdump but not on the receiver's, the packet is being dropped between them — by a firewall, a router, or the kernel's own netfilter rules. If packets appear on both ends but the connection still fails, the problem is in the application or TLS layer. This two-sided capture technique resolves an entire class of "it's the network" vs "it's the app" disputes in minutes.
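A sketch of the two-sided technique (hostnames, usernames, and interface names are examples; run each capture in its own terminal):

```shell
# Terminal 1: capture on the client side
sudo tcpdump -i eth0 -n -w /tmp/client.pcap host db.example.com and port 5432
# Terminal 2: the same capture on the server side, started over SSH
ssh admin@db.example.com \
    'sudo tcpdump -i eth0 -n -w /tmp/server.pcap port 5432'
# Afterwards, compare the two files in Wireshark or tcpdump -r:
# packets present in client.pcap but missing from server.pcap were
# dropped in transit (firewall, router, or netfilter)
```

Matching packets between the two files is easiest by TCP sequence number, which survives routing unchanged (unlike timestamps, which depend on each host's clock).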
Practice Questions
1. An application server reports it cannot connect to the database at db.internal:5432. The application logs show "Connection refused". Walk through the exact diagnostic commands you would run on both the application server and the database server, explaining what each command's output would tell you.
On the application server: dig db.internal — confirms DNS resolves correctly. nc -zv db.internal 5432 — tests TCP connectivity to the port directly; "Connection refused" means the host is reachable but nothing is accepting on that port. On the DB server: ss -tlnp | grep 5432 — checks whether PostgreSQL is actually listening (and on which address: 127.0.0.1 vs 0.0.0.0). sudo systemctl status postgresql — confirms the service is running. sudo ufw status or sudo iptables -L — checks whether a firewall is blocking port 5432 from the app server's IP.
2. ss -tan | awk shows 4,200 CLOSE-WAIT connections on a web server. Explain what CLOSE-WAIT means in the TCP state machine, why accumulating CLOSE-WAIT connections is a problem, and what is the likely root cause.
3. You need to capture all TCP SYN packets arriving on eth0 destined for port 443 and save them to a file for offline analysis. Write the exact tcpdump command, including appropriate flags for a busy production server, and explain the purpose of each flag.
sudo tcpdump -i eth0 -nn -s 96 -w /tmp/syn443.pcap 'tcp dst port 443 and tcp[tcpflags] & tcp-syn != 0' — -i eth0: listen on eth0. -nn: no DNS/port name resolution (critical on busy servers to avoid slowdown). -s 96: capture only the first 96 bytes (headers only, reduces file size). -w /tmp/syn443.pcap: write raw packets to file for Wireshark analysis. The filter matches TCP packets destined for port 443 with the SYN flag set.
Lesson Quiz
1. A server has two physical NICs configured as a bond in active-backup mode. What happens to the server's network connectivity when the switch port connected to eth0 goes down?
2. A tcpdump capture shows incoming SYN packets on port 8080 but no SYN-ACK responses. The application is confirmed running. What is the most likely cause?
3. Which ss command shows TCP internals — including retransmit count and round-trip time — for the connection to 10.0.0.50?
Up Next
Lesson 28 — SSH and Secure Access
SSH key management, tunnelling, port forwarding, and hardening SSH for production use