Kubernetes Lesson 58 – Kubernetes Troubleshooting | Dataplexa
Advanced Workloads & Operations · Lesson 58

Kubernetes Troubleshooting

Production incidents rarely announce themselves clearly. A Deployment is degraded, a service is unreachable, a node is NotReady — and you have to work backwards from the symptom to the root cause. This lesson gives you the systematic diagnostic framework and the exact commands used by experienced on-call engineers to resolve the most common Kubernetes failures.

The Troubleshooting Mental Model

Every Kubernetes problem can be traced through a hierarchy of objects. Start at the top and work down — most problems reveal themselves quickly when you follow this order:

1. Is the Service reachable? (Endpoints, DNS, port mapping)
2. Are the Pods Running? (phase, Ready condition, restart count)
3. What do the logs say? (container stdout/stderr, previous crash logs)
4. What do the events say? (kubectl describe, OOMKills, image pull errors)
5. Are the nodes healthy? (node conditions, kubelet status, disk/memory pressure)

Scenario 1: Pod Stuck in Pending

A Pending Pod has been accepted by the API server but no node has been assigned. The scheduler rejected every node for a reason — kubectl describe shows you exactly why.

kubectl describe pod payment-api-7d9f4-abc12 -n payments
# Look at the Events section at the bottom:

# "Insufficient cpu" or "Insufficient memory"
# --> No node has enough allocatable resources
# Fix: check node capacity, add nodes, reduce requests, or check reserved system resources
kubectl describe nodes | grep -A5 "Allocated resources"

# "didn't match Pod's node affinity/selector"
# --> nodeSelector or affinity rules exclude all available nodes
# Fix: check labels on nodes vs what the Pod requires
kubectl get nodes --show-labels | grep accelerator

# "had taint ... that the pod didn't tolerate"
# --> A node taint is blocking scheduling
# Fix: add a toleration to the Pod or remove the taint
kubectl describe node gpu-node-1 | grep Taints

# "persistentvolumeclaim not found" or "pod has unbound PersistentVolumeClaims"
# --> PVC is Pending (no PV available) or the PVC name is wrong
kubectl get pvc -n payments
kubectl describe pvc postgres-data-postgres-0 -n payments
# Check events for "no persistent volumes available" or StorageClass issues
$ kubectl describe pod payment-api-7d9f4-abc12 -n payments
Name:         payment-api-7d9f4-abc12
Namespace:    payments
Status:       Pending
[...]
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  45s   default-scheduler  0/3 nodes are available:
    3 node(s) had untolerated taint {dedicated: gpu}: pod has no toleration.
# Root cause: wrong node group -- this is a regular app Pod landing on GPU nodes
# Fix: check nodeSelector / tolerations on the Pod spec

$ kubectl get nodes --show-labels | grep dedicated
ip-10-0-1-12   dedicated=gpu   ← all 3 nodes are GPU-tainted
# Solution: add regular worker nodes or remove the incorrect taint from one node
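
For comparison, if a Pod genuinely did belong on those GPU nodes, the matching toleration in its spec would look something like this (a sketch; the effect is assumed to be NoSchedule, which the event message does not show):

```yaml
# Hypothetical toleration matching the {dedicated: gpu} taint seen above.
tolerations:
- key: dedicated
  operator: Equal
  value: gpu
  effect: NoSchedule   # assumed effect; confirm with `kubectl describe node`
```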

Scenario 2: Pod in CrashLoopBackOff

CrashLoopBackOff means the container starts, exits with a non-zero code, gets restarted by Kubernetes, and crashes again in a loop. The backoff timer doubles between restarts (10s → 20s → 40s → 80s → 160s, capped at 5m). The crash logs are your primary evidence.
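
The doubling schedule can be sketched as a quick calculation (a simplification of kubelet's actual backoff logic, which also resets after a period of stability):

```shell
# Approximate CrashLoopBackOff delay before restart N: doubles from 10s, capped at 300s.
backoff_seconds() {
  delay=10
  n=1
  while [ "$n" -lt "$1" ]; do
    delay=$((delay * 2))
    [ "$delay" -gt 300 ] && delay=300
    n=$((n + 1))
  done
  echo "$delay"
}

backoff_seconds 5   # 160 -- delay before the 5th restart
backoff_seconds 7   # 300 -- cap (5m) reached
```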

# Get logs from the PREVIOUS (crashed) container -- not the current one
kubectl logs payment-api-7d9f4-abc12 -n payments --previous
# This shows what was logged before the last crash

# If the container crashes too fast to log anything -- check the exit code
kubectl describe pod payment-api-7d9f4-abc12 -n payments | grep -A3 "Last State"
# Last State:  Terminated
#   Reason:    OOMKilled      --> container exceeded memory limit
#   Exit Code: 137            --> 128 + 9 (SIGKILL)
#   Exit Code: 1              --> application error (check logs)
#   Exit Code: 2              --> misuse of shell builtin or missing config
#   Exit Code: 125            --> Docker/container runtime error
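
The signal encoding is simple arithmetic: exit codes above 128 are 128 plus the number of the fatal signal.

```shell
# Recover the signal number from an exit code above 128: code = 128 + signal.
signal_from_exit_code() { echo $(( $1 - 128 )); }

signal_from_exit_code 137   # 9  -> SIGKILL (what the OOM killer sends)
signal_from_exit_code 143   # 15 -> SIGTERM (normal graceful shutdown)
```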

# Common causes and fixes:
# OOMKilled (137): increase memory limit or fix memory leak
kubectl set resources deployment/payment-api -n payments \
  --limits=memory=512Mi --requests=memory=256Mi

# Missing environment variable or secret:
# App logs: "Error: DATABASE_URL is required"
# Fix: check that the Secret and envFrom/valueFrom references are correct
kubectl get secret postgres-credentials -n payments
kubectl describe pod ... | grep -A10 "Environment"

# Failing liveness probe causing a restart loop:
kubectl describe pod payment-api-7d9f4-abc12 -n payments | grep -A5 "Liveness\|Readiness"
# "Liveness probe failed" -- the probe is killing otherwise-healthy containers
# Fix: increase initialDelaySeconds or raise failureThreshold
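
A probe tuned for a slow-starting app might look like this (a sketch; the path, port, and timings are illustrative, not taken from this Deployment):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # let the app finish startup before the first check
  periodSeconds: 10
  failureThreshold: 3       # restart only after 3 consecutive failures
```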
$ kubectl logs payment-api-7d9f4-abc12 -n payments --previous
2025-03-10T14:32:11Z ERROR Failed to connect to database
2025-03-10T14:32:11Z ERROR POSTGRES_PASSWORD environment variable not set
2025-03-10T14:32:11Z FATAL Exiting: missing required configuration
# Root cause: Secret not mounted correctly

$ kubectl describe pod payment-api-7d9f4-abc12 -n payments | grep -A5 "Environment"
    Environment Variables from:
      Secret payments/postgres-credentials  Optional: false  Error: secret not found
# The Secret "postgres-credentials" does not exist in this namespace!

$ kubectl get secrets -n payments
NAME                  TYPE     DATA
api-tls-cert          ...
# postgres-credentials is missing -- External Secrets Operator likely failed to sync

$ kubectl get externalsecret -n payments
NAME                      READY   STATUS
postgres-credentials-sync False   SecretSyncedError: AccessDenied from AWS SM
# Found it: IRSA permissions issue preventing ESO from reading the secret

Scenario 3: Service Not Reachable

A service is unreachable — curl returns connection refused, timeout, or DNS NXDOMAIN. The diagnosis ladder: DNS → ClusterIP → Endpoints → Pod.

# Step 1: Test from inside the cluster using a debug Pod
kubectl run debug --image=nicolaka/netshoot --restart=Never -n payments -- sleep 3600
kubectl exec -it debug -n payments -- bash

# Step 2: Check DNS resolution
nslookup payment-api.payments.svc.cluster.local
# NXDOMAIN --> DNS is broken (CoreDNS issue or wrong service name)
# Returns IP but connection fails --> DNS ok, issue is at Service or Pod level

# Step 3: Check the Service has Endpoints
kubectl get endpoints payment-api -n payments
# NAME          ENDPOINTS           AGE
# payment-api   <none>              5m   --> No endpoints! Pods aren't matching the selector
# payment-api   192.168.2.15:8080   5m   --> Endpoints exist, issue is elsewhere

# Why no endpoints? The Service selector doesn't match any Pod labels
kubectl get service payment-api -n payments -o yaml | grep -A3 selector
# selector: app: payment-api
kubectl get pods -n payments --show-labels | grep payment-api
# Labels: app=payment-api-v2  ← label mismatch! Service selector doesn't match

# Step 4: If Endpoints exist, test direct Pod connectivity
kubectl exec -it debug -n payments -- curl -v http://192.168.2.15:8080/health
# Connection refused --> app isn't listening on that port
# 200 OK --> Pod is healthy, issue is with Service port mapping

# Step 5: Check Service port mapping
kubectl get service payment-api -n payments -o yaml
# spec.ports:
#   port: 80          ← ClusterIP listens on 80
#   targetPort: 8080  ← forwards to container port 8080
# If container is listening on 8081, this breaks the chain

# Step 6: Check Network Policy isn't blocking traffic
kubectl get networkpolicy -n payments
# If default-deny-all exists, ensure an ingress rule allows the traffic
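
If a default-deny policy is in place, an allow rule along these lines restores ingress to the payment Pods (a sketch; the policy name and label value are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payment-api-ingress   # illustrative name
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-api              # must match the actual Pod labels
  ingress:
  - from:
    - podSelector: {}               # allow from any Pod in this namespace
    ports:
    - protocol: TCP
      port: 8080
```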
$ kubectl get endpoints payment-api -n payments
NAME          ENDPOINTS   AGE
payment-api   <none>      5m   ← no Pods matching the selector

$ kubectl get service payment-api -n payments -o yaml | grep -A3 selector
  selector:
    app: payment-api   ← service expects this label

$ kubectl get pods -n payments --show-labels | grep payment
payment-api-7d9f4   1/1   Running   app=payment-api-v2   ← label mismatch!

# Fix: update the Service selector to match
kubectl patch service payment-api -n payments -p '{"spec":{"selector":{"app":"payment-api-v2"}}}'

$ kubectl get endpoints payment-api -n payments
NAME          ENDPOINTS             AGE
payment-api   192.168.2.15:8080     5m   ← endpoints populated ✓

# Verify from debug Pod
$ kubectl exec -it debug -n payments -- curl -s http://payment-api/health
{"status":"ok","version":"3.1.0"}

What just happened?

The Endpoints object is the truth — When a Service has no Endpoints, it means no Pods match its selector. This is one of the most common Kubernetes misconfiguration issues — a typo in the selector (app: payment-api vs app: payment-api-v2) silently drops all traffic. Always check Endpoints before digging deeper into DNS or network.
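
The selector-vs-label comparison can be scripted. A minimal sketch, assuming the two values were fetched with `kubectl get service payment-api -o jsonpath='{.spec.selector.app}'` and `kubectl get pods -o jsonpath='{.items[*].metadata.labels.app}'`:

```shell
# Compare a Service's selector value against a Pod's label value.
check_selector() {
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "mismatch: service wants app=$1, pod has app=$2"
  fi
}

check_selector "payment-api" "payment-api-v2"   # prints the mismatch message
```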

nicolaka/netshoot is your Swiss Army knife — This image contains curl, nslookup, dig, netstat, tcpdump, ping, traceroute, and dozens of other network tools. Running it as a debug Pod in the target namespace lets you test connectivity from the exact network position a real Pod would use — same network policy, same DNS configuration, same service account.

Scenario 4: Node NotReady

A node transitions to NotReady — Pods on it will be evicted after the node lifecycle controller timeout (default 5 minutes). You have a narrow window to diagnose before workloads are disrupted.
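
The 5-minute window comes from default tolerations that the DefaultTolerationSeconds admission plugin adds to every Pod that doesn't specify its own. A sketch of what appears in the Pod spec, assuming cluster defaults:

```yaml
# Added automatically; the Pod tolerates a NotReady/unreachable node
# for 300s before the taint-based eviction kicks in.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```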

# Identify what's wrong with the node
kubectl describe node ip-10-0-2-44.us-east-1.compute.internal

# Look at Conditions section:
# Type              Status  Message
# MemoryPressure    True    kubelet has memory pressure  --> node is low on memory
# DiskPressure      True    kubelet has disk pressure    --> node is low on disk
# PIDPressure       True    kubelet has PID pressure     --> too many processes
# Ready             Unknown Kubelet stopped posting node status  --> kubelet down or unreachable

# For kubelet issues -- SSH to the node and check:
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 100 --no-pager
# Common: "failed to start ContainerManager" -> cgroup driver mismatch after runtime upgrade
# Common: certificate expired -> kubelet certs need rotation

# For disk pressure -- find what's consuming disk:
df -h /var/lib/kubelet    # kubelet data
df -h /var/log             # logs
crictl images | sort -k4 -h | tail -20    # Large container images
# Fix: remove unused images
crictl rmi --prune

# For memory pressure -- check what's consuming memory:
kubectl top pods -A --sort-by=memory | head -20
# Fix: evict the largest non-critical Pod or add a node

# Cordon the node to prevent new scheduling while you investigate
kubectl cordon ip-10-0-2-44.us-east-1.compute.internal
# If the node is unrecoverable:
kubectl drain ip-10-0-2-44.us-east-1.compute.internal \
  --ignore-daemonsets --delete-emptydir-data
kubectl delete node ip-10-0-2-44.us-east-1.compute.internal
# Cluster Autoscaler will provision a replacement
$ kubectl describe node ip-10-0-2-44.us-east-1.compute.internal | grep -A10 Conditions
Conditions:
  Type              Status  Reason                Message
  ----              ------  ------                -------
  MemoryPressure    False   KubeletHasSufficientMemory
  DiskPressure      True    KubeletHasDiskPressure  kubelet has disk pressure
  PIDPressure       False   KubeletHasSufficientPID
  Ready             False   KubeletNotReady         container runtime not ready

# Root cause: disk full -- check largest consumers
$ df -h /var/lib/kubelet
Filesystem      Size  Used  Avail  Use%
/dev/xvda1       100G   98G    2G   98%   ← 98% full

$ crictl images | sort -k4 -rh | head -5
IMAGE                          TAG      SIZE
...old-app                     1.0.0    2.1GB  ← stale image from 6 months ago

$ crictl rmi --prune
Deleted: sha256:abc123...  (freed 8.4 GB)

$ df -h /var/lib/kubelet
Filesystem      Size  Used  Avail  Use%
/dev/xvda1       100G   60G   40G   60%   ← pressure relieved

$ kubectl uncordon ip-10-0-2-44.us-east-1.compute.internal
node/ip-10-0-2-44 uncordoned   ✓

The Essential Troubleshooting Toolkit

Command                                                 What it reveals
kubectl describe pod <name>                             Events (scheduling failures, probe failures, image pull errors), exit codes, resource usage
kubectl logs <pod> --previous                           Last output from a crashed container before restart
kubectl get events -n <ns> --sort-by='.lastTimestamp'   All recent events in a namespace, in chronological order
kubectl get endpoints <service>                         Whether any Pods match the Service selector
kubectl top pods / kubectl top nodes                    Real-time CPU and memory usage (requires Metrics Server)
kubectl describe node <name>                            Node conditions, allocatable resources, taints, running Pods
kubectl exec -it debug -- bash                          Interactive shell in the cluster for network testing
kubectl get all -n <ns>                                 Quick overview: Deployments, ReplicaSets, Pods, Services in a namespace
kubectl rollout status / history                        Deployment rollout progress and previous revisions for rollback

Teacher's Note: Building intuition through kubectl describe

kubectl describe is your first stop for every Kubernetes problem without exception. The Events section at the bottom is written by every Kubernetes controller and the kubelet — it tells you exactly what happened, when, and why. Most problems are diagnosed in under 60 seconds by reading this section carefully.

The second most important habit: always check the Endpoints object when a Service is unreachable before assuming a networking problem. A large share of "service unreachable" incidents are label selector mismatches that show up immediately in kubectl get endpoints.

When you are genuinely stuck — the logs are empty, events are unhelpful, the Pod crashes too fast to debug — reach for kubectl debug: kubectl debug -it payment-api-7d9f4-abc12 -n payments --image=busybox --target=payment-api. This injects an ephemeral debug container into the running Pod's process namespace, letting you inspect the filesystem, environment variables, and running processes of the original container without modifying it.

Practice Questions

1. A Service returns no response. Before investigating DNS or network policies, which command shows whether any Pods are matching the Service's selector?



2. A Pod is in CrashLoopBackOff and has restarted 8 times. How do you see the logs from the container run just before the last crash?



3. kubectl describe pod shows Last State: Terminated, Exit Code: 137. What does this indicate?



Quiz

1. A Pod is stuck in Pending. What is the first command to run and where do you look in the output?


2. kubectl get endpoints payment-api -n payments shows ENDPOINTS: <none>. What does this mean and how do you fix it?


3. You need to test DNS resolution and HTTP connectivity to a Service from inside a specific namespace. Which debug image provides the best toolkit for this?


Up Next · Lesson 59

Kubernetes on AWS (EKS)

Amazon Elastic Kubernetes Service — how to create and configure an EKS cluster, integrate with AWS IAM, VPC networking, load balancers, and storage. The AWS-specific patterns every EKS operator needs to know.