Kubernetes Course
Kubernetes Troubleshooting
Production incidents rarely announce themselves clearly. A Deployment is degraded, a service is unreachable, a node is NotReady — and you have to work backwards from the symptom to the root cause. This lesson gives you the systematic diagnostic framework and the exact commands used by experienced on-call engineers to resolve the most common Kubernetes failures.
The Troubleshooting Mental Model
Every Kubernetes problem can be traced through a hierarchy of objects — for workloads, Deployment → ReplicaSet → Pod → container; for networking, Service → Endpoints → Pod. Start at the top and work down — most problems reveal themselves quickly when you follow this order.
Scenario 1: Pod Stuck in Pending
A Pending Pod has been accepted by the API server but no node has been assigned. The scheduler rejected every node for a reason — kubectl describe shows you exactly why.
kubectl describe pod payment-api-7d9f4-abc12 -n payments
# Look at the Events section at the bottom:
# "Insufficient cpu" or "Insufficient memory"
# --> No node has enough allocatable resources
# Fix: check node capacity, add nodes, reduce requests, or check reserved system resources
kubectl describe nodes | grep -A5 "Allocated resources"
# "didn't match Pod's node affinity/selector"
# --> nodeSelector or affinity rules exclude all available nodes
# Fix: check labels on nodes vs what the Pod requires
kubectl get nodes --show-labels | grep accelerator
# "had taint ... that the pod didn't tolerate"
# --> A node taint is blocking scheduling
# Fix: add a toleration to the Pod or remove the taint
kubectl describe node gpu-node-1 | grep Taints
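If the taint is legitimate (for example a dedicated GPU node group), the fix is a toleration plus a matching nodeSelector on the Pod. A minimal sketch, assuming a hypothetical taint `dedicated=gpu:NoSchedule` and node label `dedicated=gpu`:

```yaml
# Pod spec fragment (sketch) -- tolerate the taint AND target the node via label
spec:
  nodeSelector:
    dedicated: gpu            # assumed node label
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```

Note that a toleration only permits scheduling onto the tainted node — it is the nodeSelector (or node affinity) that actually steers the Pod there.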
# "persistentvolumeclaim not found" or "pod has unbound PersistentVolumeClaims"
# --> PVC is Pending (no PV available) or the PVC name is wrong
kubectl get pvc -n payments
kubectl describe pvc postgres-data-postgres-0 -n payments
# Check events for "no persistent volumes available" or StorageClass issues
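For reference, a minimal PVC sketch (names hypothetical) highlighting the fields to double-check when a claim stays Pending — storageClassName, accessModes, and requested size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-postgres-0   # must match the claim name the Pod references
  namespace: payments
spec:
  storageClassName: gp3            # assumed class -- verify with: kubectl get storageclass
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
```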
$ kubectl describe pod payment-api-7d9f4-abc12 -n payments
Name: payment-api-7d9f4-abc12
Namespace: payments
Status: Pending
[...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 45s default-scheduler 0/3 nodes are available:
3 node(s) had untolerated taint {dedicated: gpu}: pod has no toleration,
2 node(s) had insufficient cpu.
# Root cause: wrong node group -- this is a regular app Pod landing on GPU nodes
# Fix: check nodeSelector / tolerations on the Pod spec
$ kubectl get nodes --show-labels | grep dedicated
ip-10-0-1-12 dedicated=gpu ← all 3 nodes are GPU-tainted
# Solution: add regular worker nodes or remove the incorrect taint from one node
Scenario 2: Pod in CrashLoopBackOff
CrashLoopBackOff means the container starts, exits with a non-zero code, Kubernetes restarts it, and it crashes again. The backoff delay between restarts doubles each time (10s → 20s → 40s → 80s → 160s → capped at 5m). The crash logs are your primary evidence.
# Get logs from the PREVIOUS (crashed) container -- not the current one
kubectl logs payment-api-7d9f4-abc12 -n payments --previous
# This shows what was logged before the last crash
# If the container crashes too fast to log anything -- check the exit code
kubectl describe pod payment-api-7d9f4-abc12 -n payments | grep -A3 "Last State"
# Last State: Terminated
# Reason: OOMKilled --> container exceeded memory limit
# Exit Code: 137 --> 128 + 9 (SIGKILL)
# Exit Code: 1 --> application error (check logs)
# Exit Code: 2 --> misuse of shell builtin or missing config
# Exit Code: 125 --> Docker/container runtime error
# Common causes and fixes:
# OOMKilled (137): increase memory limit or fix memory leak
kubectl set resources deployment/payment-api -n payments \
--limits=memory=512Mi --requests=memory=256Mi
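The same fix expressed declaratively (preferred, so it survives the next `kubectl apply`) — a sketch of the container resources block with the values from the command above:

```yaml
# Deployment container fragment (sketch)
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves on the node
  limits:
    memory: "512Mi"   # exceeding this triggers OOMKill (exit code 137)
```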
# Missing environment variable or secret:
# App logs: "Error: DATABASE_URL is required"
# Fix: check that the Secret and envFrom/valueFrom references are correct
kubectl get secret postgres-credentials -n payments
kubectl describe pod ... | grep -A10 "Environment"
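For reference, the two ways a Pod consumes a Secret as environment variables — a sketch using the `postgres-credentials` Secret from this scenario (the key name is an assumption):

```yaml
# Container fragment (sketch)
env:
- name: POSTGRES_PASSWORD
  valueFrom:
    secretKeyRef:
      name: postgres-credentials   # must exist in the same namespace
      key: password                # hypothetical key name
envFrom:
- secretRef:
    name: postgres-credentials     # imports every key in the Secret as an env var
```

Either way, if the referenced Secret is missing (and not marked `optional: true`), the container cannot start.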
# Failed liveness probe causing restart loop (readiness failures only remove the
# Pod from Service endpoints; liveness failures trigger restarts):
kubectl describe pod payment-api-7d9f4-abc12 -n payments | grep -A5 "Liveness\|Readiness"
# "Liveness probe failed" -- probe is killing healthy containers
# Fix: increase initialDelaySeconds or loosen the probe threshold
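A probe tuning sketch showing the knobs that usually fix a probe-induced restart loop (path, port, and values are illustrative):

```yaml
# Container fragment (sketch)
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # give slow-starting apps time before the first check
  periodSeconds: 10
  failureThreshold: 3       # 3 consecutive failures before a restart
# Better for slow startup: a startupProbe, which suspends the liveness probe
# until it succeeds
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30      # up to 30 x 10s = 5 minutes to start
  periodSeconds: 10
```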
$ kubectl logs payment-api-7d9f4-abc12 -n payments --previous
2025-03-10T14:32:11Z ERROR Failed to connect to database
2025-03-10T14:32:11Z ERROR POSTGRES_PASSWORD environment variable not set
2025-03-10T14:32:11Z FATAL Exiting: missing required configuration
# Root cause: Secret not mounted correctly
$ kubectl describe pod payment-api-7d9f4-abc12 -n payments | grep -A5 "Environment"
Environment Variables from:
Secret payments/postgres-credentials Optional: false Error: secret not found
# The Secret "postgres-credentials" does not exist in this namespace!
$ kubectl get secrets -n payments
NAME TYPE DATA
api-tls-cert ...
# postgres-credentials is missing -- External Secrets Operator likely failed to sync
$ kubectl get externalsecret -n payments
NAME READY STATUS
postgres-credentials-sync False SecretSyncedError: AccessDenied from AWS SM
# Found it: IRSA permissions issue preventing ESO from reading the secret
Scenario 3: Service Not Reachable
A service is unreachable — curl returns connection refused, timeout, or DNS NXDOMAIN. The diagnosis ladder: DNS → ClusterIP → Endpoints → Pod.
# Step 1: Test from inside the cluster using a debug Pod
kubectl run debug --image=nicolaka/netshoot --restart=Never -n payments -- sleep 3600
kubectl exec -it debug -n payments -- bash
# Step 2: Check DNS resolution
nslookup payment-api.payments.svc.cluster.local
# NXDOMAIN --> DNS is broken (CoreDNS issue or wrong service name)
# Returns IP but connection fails --> DNS ok, issue is at Service or Pod level
# Step 3: Check the Service has Endpoints
kubectl get endpoints payment-api -n payments
# NAME ENDPOINTS AGE
# payment-api <none> 5m --> No endpoints! Pods aren't matching the selector
# payment-api 192.168.2.15:8080 5m --> Endpoints exist, issue is elsewhere
# Why no endpoints? The Service selector doesn't match any Pod labels
kubectl get service payment-api -n payments -o yaml | grep -A3 selector
# selector: app: payment-api
kubectl get pods -n payments --show-labels | grep payment-api
# Labels: app=payment-api-v2 ← label mismatch! Service selector doesn't match
# Step 4: If Endpoints exist, test direct Pod connectivity
kubectl exec -it debug -n payments -- curl -v http://192.168.2.15:8080/health
# Connection refused --> app isn't listening on that port
# 200 OK --> Pod is healthy, issue is with Service port mapping
# Step 5: Check Service port mapping
kubectl get service payment-api -n payments -o yaml
# spec.ports:
# port: 80 ← ClusterIP listens on 80
# targetPort: 8080 ← forwards to container port 8080
# If container is listening on 8081, this breaks the chain
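The ports that must line up, shown as a Service sketch using the names from this scenario:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-api
  namespace: payments
spec:
  selector:
    app: payment-api
  ports:
  - port: 80          # what clients dial: http://payment-api.payments:80
    targetPort: 8080  # must equal the port the container actually listens on
```

If the Deployment names its container port (`ports: [{name: http, containerPort: 8080}]`), you can set `targetPort: http` instead — then changing the container port only requires editing one object.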
# Step 6: Check Network Policy isn't blocking traffic
kubectl get networkpolicy -n payments
# If default-deny-all exists, ensure an ingress rule allows the traffic
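If a default-deny policy exists, traffic must be explicitly allowed back in. A sketch of an ingress rule admitting traffic to the payment-api Pods on port 8080 — the policy name and the `role: frontend` client label are hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-payment-api   # hypothetical name
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-api
  policyTypes: ["Ingress"]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend                # assumed label on the client Pods
    ports:
    - protocol: TCP
      port: 8080
```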
$ kubectl get endpoints payment-api -n payments
NAME ENDPOINTS AGE
payment-api <none> 5m ← no Pods matching the selector
$ kubectl get service payment-api -n payments -o yaml | grep -A3 selector
selector:
app: payment-api ← service expects this label
$ kubectl get pods -n payments --show-labels | grep payment
payment-api-7d9f4 1/1 Running app=payment-api-v2 ← label mismatch!
# Fix: update the Service selector to match
kubectl patch service payment-api -n payments -p '{"spec":{"selector":{"app":"payment-api-v2"}}}'
$ kubectl get endpoints payment-api -n payments
NAME ENDPOINTS AGE
payment-api 192.168.2.15:8080 5m ← endpoints populated ✓
# Verify from debug Pod
$ kubectl exec -it debug -n payments -- curl -s http://payment-api/health
{"status":"ok","version":"3.1.0"}
What just happened?
The Endpoints object is the truth — When a Service has no Endpoints, it means no Pods match its selector. This is one of the most common Kubernetes misconfiguration issues — a typo in the selector (app: payment-api vs app: payment-api-v2) silently drops all traffic. Always check Endpoints before digging deeper into DNS or network.
nicolaka/netshoot is your Swiss Army knife — This image contains curl, nslookup, dig, netstat, tcpdump, ping, traceroute, and dozens of other network tools. Running it as a debug Pod in the target namespace lets you test connectivity from the exact network position a real Pod would use — same network policy, same DNS configuration, same service account.
Scenario 4: Node NotReady
A node transitions to NotReady — Pods on it will be evicted after the node lifecycle controller timeout (default 5 minutes). You have a narrow window to diagnose before workloads are disrupted.
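The 5-minute window exists because the admission controller injects NoExecute tolerations with `tolerationSeconds: 300` into every Pod; you can tune it per Pod. A sketch:

```yaml
# Pod spec fragment (sketch) -- evict after 60s instead of the default 300s
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
```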
# Identify what's wrong with the node
kubectl describe node ip-10-0-2-44.us-east-1.compute.internal
# Look at Conditions section:
# Type Status Message
# MemoryPressure True kubelet has memory pressure --> node is low on memory
# DiskPressure True kubelet has disk pressure --> node is low on disk
# PIDPressure True kubelet has PID pressure --> too many processes
# Ready False kubelet stopped posting node status --> kubelet crashed
# For kubelet issues -- SSH to the node and check:
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 100 --no-pager
# Common: "failed to start ContainerManager" -> cgroup driver mismatch after runtime upgrade
# Common: certificate expired -> kubelet certs need rotation
# For disk pressure -- find what's consuming disk:
df -h /var/lib/kubelet # kubelet data
df -h /var/log # logs
crictl images | sort -k4 -h | tail -20 # Large container images
# Fix: remove unused images
crictl rmi --prune
# For memory pressure -- check what's consuming memory:
kubectl top pods -A --sort-by=memory | head -20
# Fix: evict the largest non-critical Pod or add a node
# Cordon the node to prevent new scheduling while you investigate
kubectl cordon ip-10-0-2-44.us-east-1.compute.internal
# If the node is unrecoverable:
kubectl drain ip-10-0-2-44.us-east-1.compute.internal \
--ignore-daemonsets --delete-emptydir-data
kubectl delete node ip-10-0-2-44.us-east-1.compute.internal
# Cluster Autoscaler will provision a replacement
$ kubectl describe node ip-10-0-2-44.us-east-1.compute.internal | grep -A10 Conditions
Conditions:
  Type            Status  Reason                      Message
  ----            ------  ------                      -------
  MemoryPressure  False   KubeletHasSufficientMemory
  DiskPressure    True    KubeletHasDiskPressure      kubelet has disk pressure
  PIDPressure     False   KubeletHasSufficientPID
  Ready           False   KubeletNotReady             container runtime not ready
# Root cause: disk full -- check largest consumers
$ df -h /var/lib/kubelet
Filesystem  Size  Used  Avail  Use%
/dev/xvda1  100G  98G   2G     98%  ← 98% full
$ crictl images | sort -k4 -rh | head -5
IMAGE              TAG     SIZE
...old-app:1.0.0   latest  2.1GB  ← stale image from 6 months ago
$ crictl rmi --prune
Deleted: sha256:abc123... (freed 8.4 GB)
$ df -h /var/lib/kubelet
Filesystem  Size  Used  Avail  Use%
/dev/xvda1  100G  60G   40G    60%  ← pressure relieved
$ kubectl uncordon ip-10-0-2-44.us-east-1.compute.internal
node/ip-10-0-2-44 uncordoned ✓
The Essential Troubleshooting Toolkit
| Command | What it reveals |
|---|---|
| kubectl describe pod <name> | Events (scheduling failures, probe failures, image pull errors), exit codes, resource usage |
| kubectl logs <pod> --previous | Last output from a crashed container before restart |
| kubectl get events -n <ns> --sort-by='.lastTimestamp' | All recent events in a namespace, chronological order |
| kubectl get endpoints <service> | Whether Pods match the Service selector |
| kubectl top pods / nodes | Real-time CPU and memory usage (requires Metrics Server) |
| kubectl describe node <name> | Node conditions, allocatable resources, taints, running Pods |
| kubectl exec -it debug -- bash | Interactive shell in the cluster for network testing |
| kubectl get all -n <ns> | Quick overview: Deployments, ReplicaSets, Pods, Services in a namespace |
| kubectl rollout status / history | Deployment rollout progress, previous revision info for rollback |
Teacher's Note: Building intuition through kubectl describe
kubectl describe is your first stop for every Kubernetes problem without exception. The Events section at the bottom is written by every Kubernetes controller and the kubelet — it tells you exactly what happened, when, and why. Most problems are diagnosed in under 60 seconds by reading this section carefully.
The second most important habit: always check the Endpoints object when a Service is unreachable before assuming a networking problem. 80% of "service unreachable" incidents are label selector mismatches that show up immediately in kubectl get endpoints.
When you are genuinely stuck — the logs are empty, events are unhelpful, the Pod crashes too fast to debug — reach for kubectl debug: kubectl debug -it payment-api-7d9f4-abc12 -n payments --image=busybox --target=payment-api. This injects an ephemeral debug container into the running Pod, sharing the target container's process namespace, letting you inspect the filesystem, environment variables, and running processes of the original container without modifying it.
Practice Questions
1. A Service returns no response. Before investigating DNS or network policies, which command shows whether any Pods are matching the Service's selector?
2. A Pod is in CrashLoopBackOff and has restarted 8 times. How do you see the logs from the container run just before the last crash?
3. kubectl describe pod shows Last State: Terminated, Exit Code: 137. What does this indicate?
Quiz
1. A Pod is stuck in Pending. What is the first command to run and where do you look in the output?
2. kubectl get endpoints payment-api -n payments shows ENDPOINTS: <none>. What does this mean and how do you fix it?
3. You need to test DNS resolution and HTTP connectivity to a Service from inside a specific namespace. Which debug image provides the best toolkit for this?
Up Next · Lesson 59
Kubernetes on AWS (EKS)
Amazon Elastic Kubernetes Service — how to create and configure an EKS cluster, integrate with AWS IAM, VPC networking, load balancers, and storage. The AWS-specific patterns every EKS operator needs to know.