Kubernetes Course
Health Checks
A container that's running is not the same as a container that's healthy. Kubernetes has three distinct probes for three distinct questions — and understanding the difference between them is what separates clusters that route traffic correctly from clusters that send requests into black holes.
Three Probes, Three Questions
Kubernetes is continuously asking questions about every container in your cluster. Three different probes answer three different questions, and each one triggers a different action when it fails.
| Probe | Question it answers | Action on failure | When to use |
|---|---|---|---|
| Liveness | Is the container still alive and functional — or deadlocked/stuck? | Kill and restart the container | Apps that can get stuck (deadlocks, infinite loops, leaked connections) |
| Readiness | Is the container ready to serve traffic right now? | Remove from Service endpoints — no restart | Always — controls when traffic reaches your Pod |
| Startup | Has the container finished its initial startup sequence? | Kill and restart — but only during startup window | Slow-starting apps (JVM, legacy apps) where liveness would kill them prematurely |
The critical distinction between liveness and readiness
A failed liveness probe says "this container is broken beyond recovery — restart it." A failed readiness probe says "this container is temporarily not ready — stop sending traffic, but leave it running." Confusing the two causes serious production problems. If you use a liveness probe for something that fails during legitimate high load (like a slow HTTP response), Kubernetes will kill your Pod when it's actually just busy — making the overload worse by restarting into the same load.
Three Probe Mechanisms
Each probe type (liveness, readiness, startup) can use any of three check mechanisms. The mechanism determines how Kubernetes tests the container — not what it does with the result.
🌐 httpGet
Kubernetes sends an HTTP GET request to a path and port on the container. HTTP status 200–399 = success. Anything else = failure.
Best for HTTP services — the most common probe type
🔌 tcpSocket
Kubernetes attempts to open a TCP connection to a port on the container. If the connection succeeds (even briefly), the probe passes.
Best for TCP services (databases, queues) that don't speak HTTP
⚡ exec
Kubernetes runs a command inside the container. Exit code 0 = success. Any non-zero exit code = failure.
Best for custom health logic that can't be expressed as HTTP
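The three mechanisms plug into the same probe stanza and share the same timing fields; only the check itself differs. A side-by-side sketch of the three alternatives (paths, ports, and the command here are illustrative placeholders, not taken from the examples below):

```yaml
# Any probe type (liveness, readiness, startup) accepts any ONE of these
# mechanism stanzas. Values are placeholders for illustration.
livenessProbe:
  httpGet:            # HTTP GET to path:port; status 200-399 = success
    path: /healthz
    port: 8080

# ...or...
livenessProbe:
  tcpSocket:          # TCP connect to port; successful connection = success
    port: 5432

# ...or...
livenessProbe:
  exec:               # run a command in the container; exit code 0 = success
    command: ["redis-cli", "ping"]
```

A container can carry only one stanza per probe type, but different probe types on the same container can freely mix mechanisms.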
Liveness Probe: Restart on Deadlock
The scenario: Your team runs a Go-based order processing service. It works great under normal load but occasionally gets into a deadlock state where it stops processing messages but doesn't crash. The process is still running, the port is still open, but it's not doing anything useful. Without a liveness probe, Kubernetes has no idea — the Pod stays running and stuck forever. With a liveness probe, it gets restarted automatically within seconds of the deadlock being detected.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
    spec:
      containers:
      - name: order-processor
        image: company/order-processor:2.4.0
        ports:
        - containerPort: 8080
        livenessProbe:              # livenessProbe: is the container still alive and functional?
          httpGet:                  # Mechanism: send an HTTP GET request
            path: /healthz          # The health check endpoint the app exposes
            port: 8080              # Port to send the request to — the port the app listens on
            httpHeaders:            # Optional: custom headers (e.g. for auth on the health endpoint)
            - name: Custom-Header
              value: liveness-check
          initialDelaySeconds: 15   # Wait 15s after container starts before first probe
                                    # Give the app time to start up before checking
          periodSeconds: 20         # Check every 20 seconds after the initial delay
          timeoutSeconds: 5         # Probe fails if no response within 5 seconds
          failureThreshold: 3       # Fail 3 times in a row before restarting the container
                                    # 3 failures × 20s = 60s of consecutive failure before restart
          successThreshold: 1       # Only 1 success needed to mark liveness as passing
                                    # (successThreshold must be 1 for liveness — Kubernetes enforces this)
```
```
$ kubectl apply -f order-processor-deployment.yaml
deployment.apps/order-processor created

$ kubectl describe pod order-processor-7f9b4d-2xkpj -n production | grep -A20 "Liveness:"
  Liveness:  http-get http://:8080/healthz delay=15s timeout=5s period=20s #success=1 #failure=3

(simulating a deadlock — liveness probe starts failing)

$ kubectl describe pod order-processor-7f9b4d-2xkpj -n production | grep -A5 "Events:"
Events:
  Warning  Unhealthy  2m    kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  100s  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  80s   kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    80s   kubelet  Container order-processor failed liveness probe, will be restarted
  Normal   Pulled     75s   kubelet  Container image already present on machine
  Normal   Started    75s   kubelet  Started container order-processor
```

What just happened?
initialDelaySeconds — Without this, Kubernetes starts probing the moment the container process starts — before the app has had a chance to bind to its port and start its HTTP server. The probe fails immediately, Kubernetes restarts the container, and you get a crash loop that has nothing to do with the app actually being broken. Set initialDelaySeconds to slightly more than your app's typical cold-start time.
failureThreshold × periodSeconds = restart window — With failureThreshold: 3 and periodSeconds: 20, the container must fail consecutively for 60 seconds before Kubernetes restarts it. This prevents transient blips (a momentary slow response, a garbage collection pause) from triggering unnecessary restarts. Tune this based on how quickly you want to recover from real failures.
Events section — The Events at the bottom of kubectl describe pod show every probe failure and the eventual restart with timestamp. This is your audit trail for why a Pod restarted — far more informative than just seeing RESTARTS: 1 in kubectl get pods.
Readiness Probe: Control When Traffic Arrives
The readiness probe is arguably the most important of the three. It controls whether a Pod is included in a Service's endpoint list — meaning it controls whether real user traffic reaches that Pod. A Pod that fails its readiness probe is silently removed from rotation. No restart. No alert (unless you set one up). Just... no more traffic.
The scenario: Your payment API caches its product catalogue in memory on startup. Until the cache is warm (typically 20–30 seconds), responses take 10–15 seconds — completely unacceptable for users. You need to prevent traffic from reaching the Pod until the cache is ready, and you also want to temporarily remove a Pod from rotation if it starts returning errors (maybe its database connection pool is exhausted).
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
      - name: payment-api
        image: company/payment-api:3.0.0
        ports:
        - containerPort: 8080
        readinessProbe:             # readinessProbe: is this container ready for traffic?
          httpGet:
            path: /ready            # Different endpoint from /healthz — reports cache warm status
            port: 8080
          initialDelaySeconds: 5    # Start checking 5s after container starts
          periodSeconds: 10         # Check every 10 seconds
          timeoutSeconds: 3         # Fail if no response within 3 seconds
          failureThreshold: 3       # Remove from endpoints after 3 consecutive failures
                                    # Pod stays running — just no traffic sent to it
          successThreshold: 2       # Require 2 consecutive successes to re-add to endpoints
                                    # successThreshold > 1 is valid for readiness (not for liveness)
                                    # This prevents flapping: Pod must prove it's stable before getting traffic back
        livenessProbe:              # Also add a liveness probe — different endpoint, different concern
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30   # More delay than readiness — give the app time to warm up
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
```
```
$ kubectl apply -f payment-api-deployment.yaml
deployment.apps/payment-api created

$ kubectl get pods -n production -w
NAME                       READY   STATUS    RESTARTS   AGE
payment-api-5c8d9f-2xkpj   0/1     Running   0          8s
payment-api-5c8d9f-7rvqn   0/1     Running   0          8s
payment-api-5c8d9f-m4czl   0/1     Running   0          8s
payment-api-5c8d9f-2xkpj   1/1     Running   0          28s   ← readiness passed, traffic enabled
payment-api-5c8d9f-7rvqn   1/1     Running   0          31s
payment-api-5c8d9f-m4czl   1/1     Running   0          34s

$ kubectl describe service payment-api-svc -n production | grep Endpoints
Endpoints:  10.244.0.5:8080,10.244.1.3:8080,10.244.2.7:8080

(simulating one Pod's readiness probe failing)

$ kubectl describe service payment-api-svc -n production | grep Endpoints
Endpoints:  10.244.0.5:8080,10.244.2.7:8080   ← only 2 endpoints — sick Pod removed silently
```
What just happened?
READY 0/1 → 1/1 transition — When the Pod first starts, it shows READY: 0/1 because the readiness probe hasn't passed yet. The container is running but receiving no traffic. After the cache warms up and /ready starts returning 200, the Pod transitions to READY: 1/1 and the Service adds it to the endpoints list. This is zero-downtime startup in action.
Endpoints removed silently — When a Pod fails its readiness probe, it's removed from the Service endpoints. The other two Pods absorb the traffic. The failing Pod is still running — Kubernetes is just waiting for it to recover. When it passes successThreshold: 2 consecutive checks, it's re-added to endpoints automatically.
successThreshold: 2 for readiness — Requiring two consecutive successes before re-adding to the endpoint list prevents flapping — where a marginally healthy Pod alternates between passing and failing, causing traffic to oscillate on and off. A Pod that passes twice in a row is demonstrably stable, not just lucky.
Startup Probe: Protecting Slow Starters
The startup probe solves a specific and common problem: you have a legacy Java application or a database that takes 90 seconds to fully start. If you set initialDelaySeconds: 90 on the liveness probe, you wait 90 seconds before checking on every healthy Pod too — wasting time and delaying detection of real problems. If you set a shorter delay, the liveness probe kills the slow-starting Pod before it's done starting.
The startup probe is designed for exactly this. While the startup probe is running, both liveness and readiness probes are disabled. Once the startup probe succeeds, liveness and readiness take over as normal. This gives slow starters a generous startup window without giving them a permanent liveness probe exemption.
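The maximum startup window is simply failureThreshold × periodSeconds. A minimal sketch for a hypothetical app that may need up to five minutes to start (endpoint and values are illustrative, separate from the scenario below):

```yaml
# Hypothetical slow starter. Path, port, and timings are illustrative.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10        # probe every 10s during startup
  failureThreshold: 30     # 30 failures × 10s = up to 300s (5 min) to start
  # While this probe is running, liveness and readiness probes are suspended.
  # A single success ends the startup window and hands control to them.
```

If the app is not up within the window, the container is killed and restarted, subject to the Pod's restart policy.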
The scenario: Your team is migrating a legacy Java monolith to Kubernetes. The JVM cold-start takes up to 120 seconds on a loaded node. You need to give it up to 2 minutes to start, but once it's running you want tight 30-second liveness checks to catch deadlocks quickly.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-monolith
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: legacy-monolith
  template:
    metadata:
      labels:
        app: legacy-monolith
    spec:
      containers:
      - name: legacy-monolith
        image: company/legacy-monolith:8.2.1
        ports:
        - containerPort: 8443
        startupProbe:               # startupProbe: has the container finished initial startup?
          httpGet:
            path: /actuator/health  # Spring Boot actuator health endpoint
            port: 8443
          failureThreshold: 24      # Allow up to 24 failures before giving up
          periodSeconds: 5          # Check every 5 seconds
                                    # 24 failures × 5s = 120s maximum startup window
                                    # If startup takes longer than 120s, container is killed
          successThreshold: 1       # One success = startup complete, hand off to liveness/readiness
        livenessProbe:              # This probe is DISABLED until startupProbe succeeds
          httpGet:
            path: /actuator/health/liveness
            port: 8443
          periodSeconds: 30         # Tight 30s check — once started, catch deadlocks fast
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:             # This probe is also DISABLED until startupProbe succeeds
          httpGet:
            path: /actuator/health/readiness
            port: 8443
          initialDelaySeconds: 0    # No extra delay — startupProbe already handled the wait
          periodSeconds: 10
          failureThreshold: 3
          successThreshold: 1
```
```
$ kubectl apply -f legacy-monolith-deployment.yaml
deployment.apps/legacy-monolith created

$ kubectl get pods -n production -w
NAME                           READY   STATUS    RESTARTS   AGE
legacy-monolith-8b4f9c-p2rkx   0/1     Running   0          10s
legacy-monolith-8b4f9c-p2rkx   0/1     Running   0          45s
legacy-monolith-8b4f9c-p2rkx   0/1     Running   0          87s
legacy-monolith-8b4f9c-p2rkx   1/1     Running   0          93s   ← startup complete at ~93s

$ kubectl describe pod legacy-monolith-8b4f9c-p2rkx -n production | grep -A5 "Startup:"
  Startup:    http-get http://:8443/actuator/health delay=0s timeout=1s period=5s #success=1 #failure=24
  Liveness:   http-get http://:8443/actuator/health/liveness delay=0s timeout=10s period=30s #success=1 #failure=3
  Readiness:  http-get http://:8443/actuator/health/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
```

What just happened?
failureThreshold × periodSeconds = max startup time — With failureThreshold: 24 and periodSeconds: 5, the startup probe allows up to 120 seconds for the container to pass before killing it. The Pod was running for 87 seconds with READY: 0/1 and nobody panicked — because we designed it that way. On the 19th check at ~93 seconds, /actuator/health returned 200. Startup probe succeeded. Liveness and readiness took over immediately.
Spring Boot actuator endpoints — Spring Boot (and many other frameworks) expose dedicated health endpoints: /actuator/health for general health, /actuator/health/liveness for liveness state, and /actuator/health/readiness for readiness state. Using separate endpoints for each probe is best practice — each endpoint can return different signals based on different internal state checks.
No extra initialDelaySeconds needed — Because the startup probe already handled the waiting period, the liveness and readiness probes can use initialDelaySeconds: 0. They'll start running immediately after the startup probe reports success. No double-waiting.
The exec and tcpSocket Probe Mechanisms
The scenario: Your platform team runs Redis as a caching layer and PostgreSQL as a primary database. Neither speaks HTTP — you need probes that test them at the protocol level. For Redis you can run redis-cli ping inside the container. For PostgreSQL you can attempt a TCP connection on port 5432.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
      - name: redis
        image: redis:7.2
        ports:
        - containerPort: 6379
        livenessProbe:
          exec:                     # exec mechanism: run a command inside the container
            command:
            - redis-cli             # The command to run — must be in the container image
            - ping                  # redis-cli ping returns "PONG" with exit code 0 if healthy
          initialDelaySeconds: 10
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          tcpSocket:                # tcpSocket mechanism: attempt a TCP connection
            port: 6379              # Connect to this port — success = container is listening
          initialDelaySeconds: 5
          periodSeconds: 10
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-db
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres-db
  template:
    metadata:
      labels:
        app: postgres-db
    spec:
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - containerPort: 5432
        livenessProbe:
          exec:
            command:
            - pg_isready            # PostgreSQL built-in readiness check utility
            - -U                    # -U flag: specify user
            - postgres              # Username to check connectivity as
            - -d                    # -d flag: specify database
            - postgres              # Database name to connect to
                                    # Returns exit code 0 if accepting connections, non-zero otherwise
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 6       # More tolerance for a database — 60s before restart
```
```
$ kubectl apply -f redis-postgres-deployments.yaml
deployment.apps/redis-cache created
deployment.apps/postgres-db created

$ kubectl get pods -n production
NAME                       READY   STATUS    RESTARTS   AGE
redis-cache-6c8b4f-9kvpm   1/1     Running   0          22s
postgres-db-4d7c9b-xr7nq   1/1     Running   0          22s

$ kubectl describe pod redis-cache-6c8b4f-9kvpm -n production | grep -A3 "Liveness:"
  Liveness:   exec [redis-cli ping] delay=10s timeout=5s period=15s #success=1 #failure=3
  Readiness:  tcp-socket :6379 delay=5s timeout=1s period=10s #success=1 #failure=3
```

What just happened?
exec probe — the command must be in the image — The exec probe runs a command inside the container using whatever binaries are in the container image. redis-cli is bundled with the official Redis image. pg_isready is bundled with the official PostgreSQL image. If you're using a minimal distroless or scratch image that has no shell utilities, exec probes won't work — use httpGet or tcpSocket instead.
tcpSocket — deceptively simple but powerful — A TCP probe only tests whether the port is accepting connections. It doesn't test whether the application is actually processing requests correctly. Redis could be listening on port 6379 but be full and refusing new keys — the TCP probe would still pass. For Redis, the redis-cli ping exec liveness probe is more meaningful. Using TCP for readiness and exec for liveness is a valid combination.
failureThreshold: 6 for postgres — Databases deserve more liveness tolerance than stateless APIs. A PostgreSQL instance doing a checkpoint or vacuum might briefly fail a health check — you don't want to restart a database mid-checkpoint. 6 failures × 10s = 60 seconds of tolerance before a restart. For most databases, set failureThreshold higher than you would for a stateless service.
Probe Timing: The Full Picture
Understanding how the timing parameters interact is essential for configuring probes that are sensitive enough to catch real problems but tolerant enough to avoid false alarms:
Probe Timeline for a Typical HTTP Service

```
container starts
   │
   ├── wait initialDelaySeconds (e.g. 15s) ──▶ first probe runs
   │
   ├── repeat every periodSeconds (e.g. 20s)
   │      each probe fails if no response within timeoutSeconds (e.g. 5s)
   │
   └── failureThreshold consecutive failures (e.g. 3 × 20s = 60s) ──▶ RESTART
```
| Parameter | Default | What it controls | Tune when... |
|---|---|---|---|
| initialDelaySeconds | 0 | Seconds to wait after container start before first probe | App has slow cold start; set to app startup time |
| periodSeconds | 10 | How often to run the probe | Increase to reduce probe load; decrease for faster detection |
| timeoutSeconds | 1 | Probe fails if no response within this time | Increase for slow responses under load — 1s default is very tight |
| failureThreshold | 3 | Consecutive failures before action is taken | Increase for flaky endpoints or transient slowness |
| successThreshold | 1 | Consecutive successes before marked healthy again | Increase on readiness probe to prevent traffic flapping |
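Putting the table's advice together, a reasonable production baseline for an HTTP service's readiness probe might look like this. The endpoint and values are illustrative starting points, not universal defaults; tune them against your service's actual response times:

```yaml
# Illustrative baseline for a typical HTTP service. Adjust to your workload.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 10       # the default cadence is usually fine
  timeoutSeconds: 3       # looser than the 1s default, to survive load spikes
  failureThreshold: 3     # ~30s of consecutive failure before traffic is shed
  successThreshold: 2     # two passes before traffic returns, to prevent flapping
```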
Teacher's Note: The four mistakes everyone makes with probes
1. Using liveness for "not ready" scenarios. A liveness probe that fails because the app is overloaded will restart the Pod into the same load — making things worse. Use readiness to shed traffic; use liveness only to restart genuinely broken containers.
2. Setting timeoutSeconds too low. The default is 1 second. Under CPU throttling or heavy load, legitimate health check responses take longer than 1 second. Your app appears unhealthy when it's just slow. Set timeoutSeconds: 3 or 5 for any production service.
3. Probing the same endpoint for liveness and readiness. If your /health endpoint checks database connectivity, a DB blip will trigger the liveness probe and restart your Pod when it doesn't need to be restarted — it just temporarily lost DB access. Your liveness endpoint should check only things a restart would actually fix (in-process state). Your readiness endpoint can check external dependencies.
4. No startup probe for slow starters. Engineers set initialDelaySeconds: 120 on liveness to handle a slow-starting app, which means every healthy Pod waits 2 minutes before getting liveness checks on every restart. A startup probe gives the slow Pod time to start while still running tight liveness checks once it's up.
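Mistake 3 in config form: a minimal sketch with separate endpoints, where /livez is assumed to check only in-process state and /readyz is assumed to also check external dependencies. The endpoint names are illustrative; the split itself is implemented in your application code, not in the manifest:

```yaml
# Endpoint names are hypothetical; the app must implement the split.
livenessProbe:
  httpGet:
    path: /livez          # assumed: checks only in-process state (deadlock, event loop)
    port: 8080            # everything this endpoint detects is fixable by a restart
readinessProbe:
  httpGet:
    path: /readyz         # assumed: also checks DB and downstream dependencies
    port: 8080            # a DB blip sheds traffic here but never triggers a restart
```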
Practice Questions
1. Which probe removes a Pod from a Service's endpoints when it fails — without restarting the container — so that traffic stops reaching a temporarily unhealthy Pod?
2. A startup probe is configured with periodSeconds: 5. You want to give the container up to 90 seconds to start before Kubernetes kills it. What value should you set for failureThreshold?
3. You want to health-check a Redis container using redis-cli ping. The Redis container doesn't expose an HTTP endpoint. Which probe mechanism do you use?
Quiz
1. A Pod has all three probes configured: startup, liveness, and readiness. The container has just started. Which statement best describes the probe behaviour during the startup window?
2. A developer configures a liveness probe that calls /health with a 1-second timeout. Under peak load, the health endpoint takes 2 seconds to respond. What is the likely outcome?
3. Which value for successThreshold does Kubernetes enforce as a hard requirement on liveness probes — and why?
Up Next · Lesson 24
Scaling Applications
Manual scaling, the Horizontal Pod Autoscaler, and the patterns that let your services handle 10x traffic spikes without anyone touching the cluster.