Kubernetes Lesson 23 – Health Checks | Dataplexa
Core Kubernetes Concepts · Lesson 23

Health Checks

A container that's running is not the same as a container that's healthy. Kubernetes has three distinct probes for three distinct questions — and understanding the difference between them is what separates clusters that route traffic correctly from clusters that send requests into black holes.

Three Probes, Three Questions

Kubernetes is continuously asking questions about every container in your cluster. Three different probes answer three different questions, and each one triggers a different action when it fails.

Liveness
  Question: Is the container still alive and functional — or deadlocked/stuck?
  On failure: kill and restart the container
  When to use: apps that can get stuck (deadlocks, infinite loops, leaked connections)

Readiness
  Question: Is the container ready to serve traffic right now?
  On failure: remove from Service endpoints — no restart
  When to use: always — it controls when traffic reaches your Pod

Startup
  Question: Has the container finished its initial startup sequence?
  On failure: kill and restart — but only during the startup window
  When to use: slow-starting apps (JVM, legacy apps) where liveness would kill them prematurely

The critical distinction between liveness and readiness

A failed liveness probe says "this container is broken beyond recovery — restart it." A failed readiness probe says "this container is temporarily not ready — stop sending traffic, but leave it running." Confusing the two causes serious production problems. If you use a liveness probe for something that fails during legitimate high load (like a slow HTTP response), Kubernetes will kill your Pod when it's actually just busy — making the overload worse by restarting into the same load.

Three Probe Mechanisms

Each probe type (liveness, readiness, startup) can use any of three check mechanisms. The mechanism determines how Kubernetes tests the container — not what it does with the result.

🌐 httpGet

Kubernetes sends an HTTP GET request to a path and port on the container. HTTP status 200–399 = success. Anything else = failure.

Best for HTTP services — the most common probe type

🔌 tcpSocket

Kubernetes attempts to open a TCP connection to a port on the container. If the connection succeeds (even briefly), the probe passes.

Best for TCP services (databases, queues) that don't speak HTTP

⚡ exec

Kubernetes runs a command inside the container. Exit code 0 = success. Any non-zero exit code = failure.

Best for custom health logic that can't be expressed as HTTP

Liveness Probe: Restart on Deadlock

The scenario: Your team runs a Go-based order processing service. It works great under normal load but occasionally gets into a deadlock state where it stops processing messages but doesn't crash. The process is still running, the port is still open, but it's not doing anything useful. Without a liveness probe, Kubernetes has no idea — the Pod stays running and stuck forever. With a liveness probe, it gets restarted automatically once consecutive probe failures cross the configured threshold (with the settings below, about a minute after the deadlock begins).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
    spec:
      containers:
        - name: order-processor
          image: company/order-processor:2.4.0
          ports:
            - containerPort: 8080
          livenessProbe:                  # livenessProbe: is the container still alive and functional?
            httpGet:                      # Mechanism: send an HTTP GET request
              path: /healthz              # The health check endpoint the app exposes
              port: 8080                  # Port to send the request to (must match containerPort)
              httpHeaders:               # Optional: custom headers (e.g. for auth on the health endpoint)
                - name: Custom-Header
                  value: liveness-check
            initialDelaySeconds: 15       # Wait 15s after container starts before first probe
                                          # Give the app time to start up before checking
            periodSeconds: 20             # Check every 20 seconds after the initial delay
            timeoutSeconds: 5             # Probe fails if no response within 5 seconds
            failureThreshold: 3           # Fail 3 times in a row before restarting the container
                                          # 3 failures × 20s = 60s of consecutive failure before restart
            successThreshold: 1           # Only 1 success needed to mark liveness as passing
                                          # (successThreshold must be 1 for liveness — Kubernetes enforces this)
$ kubectl apply -f order-processor-deployment.yaml
deployment.apps/order-processor created

$ kubectl describe pod order-processor-7f9b4d-2xkpj -n production | grep -A20 "Liveness:"
    Liveness:     http-get http://:8080/healthz delay=15s timeout=5s period=20s #success=1 #failure=3

(simulating a deadlock — liveness probe starts failing)

$ kubectl describe pod order-processor-7f9b4d-2xkpj -n production | grep -A5 "Events:"
Events:
  Warning  Unhealthy  2m   kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  100s kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  80s  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    80s  kubelet  Container order-processor failed liveness probe, will be restarted
  Normal   Pulled     75s  kubelet  Container image already present on machine
  Normal   Started    75s  kubelet  Started container order-processor

What just happened?

initialDelaySeconds — Without this, Kubernetes starts probing the moment the container process starts — before the app has had a chance to bind to its port and start its HTTP server. The probe fails immediately, Kubernetes restarts the container, and you get a crash loop that has nothing to do with the app actually being broken. Set initialDelaySeconds to slightly more than your app's typical cold-start time.

failureThreshold × periodSeconds = restart window — With failureThreshold: 3 and periodSeconds: 20, the container must fail consecutively for 60 seconds before Kubernetes restarts it. This prevents transient blips (a momentary slow response, a garbage collection pause) from triggering unnecessary restarts. Tune this based on how quickly you want to recover from real failures.

Events section — The Events at the bottom of kubectl describe pod show every probe failure and the eventual restart with timestamp. This is your audit trail for why a Pod restarted — far more informative than just seeing RESTARTS: 1 in kubectl get pods.
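The lesson doesn't show the Go service's internals, but a common way to make /healthz deadlock-sensitive is a heartbeat that the worker loop refreshes: if the loop stops making progress, the timestamp goes stale and the endpoint starts failing. A minimal sketch in Python — names like Heartbeat and healthz_status are illustrative, not from the order-processor service:

```python
import time

class Heartbeat:
    """Worker loop calls beat(); the /healthz handler calls is_alive().

    If the loop deadlocks, the timestamp goes stale, /healthz starts
    returning 503, and the kubelet eventually restarts the container.
    """

    def __init__(self, max_age_seconds: float = 30.0):
        self.max_age = max_age_seconds
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        # Call at the end of every successful work-loop iteration.
        self.last_beat = time.monotonic()

    def is_alive(self) -> bool:
        # True while the worker has beaten recently enough.
        return (time.monotonic() - self.last_beat) <= self.max_age

def healthz_status(hb: Heartbeat) -> int:
    # HTTP status code the /healthz endpoint would return.
    return 200 if hb.is_alive() else 503
```

The key design point: the probe reports on the work loop, not merely on the HTTP server, so a process that is "running but stuck" is detected.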

Readiness Probe: Control When Traffic Arrives

The readiness probe is arguably the most important of the three. It controls whether a Pod is included in a Service's endpoint list — meaning it controls whether real user traffic reaches that Pod. A Pod that fails its readiness probe is silently removed from rotation. No restart. No alert (unless you set one up). Just... no more traffic.

The scenario: Your payment API caches its product catalogue in memory on startup. Until the cache is warm (typically 20–30 seconds), responses take 10–15 seconds — completely unacceptable for users. You need to prevent traffic from reaching the Pod until the cache is ready, and you also want to temporarily remove a Pod from rotation if it starts returning errors (maybe its database connection pool is exhausted).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: company/payment-api:3.0.0
          ports:
            - containerPort: 8080
          readinessProbe:               # readinessProbe: is this container ready for traffic?
            httpGet:
              path: /ready              # Different endpoint from /healthz — reports cache warm status
              port: 8080
            initialDelaySeconds: 5      # Start checking 5s after container starts
            periodSeconds: 10           # Check every 10 seconds
            timeoutSeconds: 3           # Fail if no response within 3 seconds
            failureThreshold: 3         # Remove from endpoints after 3 consecutive failures
                                        # Pod stays running — just no traffic sent to it
            successThreshold: 2         # Require 2 consecutive successes to re-add to endpoints
                                        # successThreshold > 1 is valid for readiness (not for liveness)
                                        # This prevents flapping: Pod must prove it's stable before getting traffic back

          livenessProbe:                # Also add a liveness probe — different endpoint, different concern
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30     # More delay than readiness — give the app time to warm up
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
$ kubectl apply -f payment-api-deployment.yaml
deployment.apps/payment-api created

$ kubectl get pods -n production -w
NAME                          READY   STATUS    RESTARTS   AGE
payment-api-5c8d9f-2xkpj      0/1     Running   0          8s
payment-api-5c8d9f-7rvqn      0/1     Running   0          8s
payment-api-5c8d9f-m4czl      0/1     Running   0          8s
payment-api-5c8d9f-2xkpj      1/1     Running   0          28s   ← readiness passed, traffic enabled
payment-api-5c8d9f-7rvqn      1/1     Running   0          31s
payment-api-5c8d9f-m4czl      1/1     Running   0          34s

$ kubectl describe service payment-api-svc -n production | grep Endpoints
Endpoints:  10.244.0.5:8080,10.244.1.3:8080,10.244.2.7:8080

(simulating one Pod's readiness probe failing)

$ kubectl describe service payment-api-svc -n production | grep Endpoints
Endpoints:  10.244.0.5:8080,10.244.2.7:8080   ← only 2 endpoints — sick Pod removed silently

What just happened?

READY 0/1 → 1/1 transition — When the Pod first starts, it shows READY: 0/1 because the readiness probe hasn't passed yet. The container is running but receiving no traffic. After the cache warms up and /ready starts returning 200, the Pod transitions to READY: 1/1 and the Service adds it to the endpoints list. This is zero-downtime startup in action.

Endpoints removed silently — When a Pod fails its readiness probe, it's removed from the Service endpoints. The other two Pods absorb the traffic. The failing Pod is still running — Kubernetes is just waiting for it to recover. When it passes successThreshold: 2 consecutive checks, it's re-added to endpoints automatically.

successThreshold: 2 for readiness — Requiring two consecutive successes before re-adding to the endpoint list prevents flapping — where a marginally healthy Pod alternates between passing and failing, causing traffic to oscillate on and off. A Pod that passes twice in a row is demonstrably stable, not just lucky.
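The app side of this pattern — /ready keeps failing until the cache is warm — amounts to a simple gate that flips once startup work finishes. A minimal Python sketch; ReadinessGate and its methods are hypothetical names, not the payment API's actual code:

```python
import threading

class ReadinessGate:
    """Gates the /ready endpoint on the catalogue cache being warm."""

    def __init__(self):
        self._warm = threading.Event()

    def warm_cache(self, load_catalogue) -> None:
        # Run once at startup (typically in a background thread):
        # flips the gate only after the expensive load has finished.
        load_catalogue()
        self._warm.set()

    def ready_status(self) -> int:
        # HTTP status code the /ready endpoint would return.
        # While this is 503, the Pod shows READY 0/1 and gets no traffic.
        return 200 if self._warm.is_set() else 503
```

Kubernetes never needs to know why the Pod isn't ready; the app encodes that knowledge in the endpoint, and the readiness probe simply relays it.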

Startup Probe: Protecting Slow Starters

The startup probe solves a specific and common problem: you have a legacy Java application or a database that takes 90 seconds to fully start. If you set initialDelaySeconds: 90 on the liveness probe, you wait 90 seconds before checking on every healthy Pod too — wasting time and delaying detection of real problems. If you set a shorter delay, the liveness probe kills the slow-starting Pod before it's done starting.

The startup probe is designed for exactly this. While the startup probe is running, both liveness and readiness probes are disabled. Once the startup probe succeeds, liveness and readiness take over as normal. This gives slow starters a generous startup window without giving them a permanent liveness probe exemption.

The scenario: Your team is migrating a legacy Java monolith to Kubernetes. The JVM cold-start takes up to 120 seconds on a loaded node. You need to give it up to 2 minutes to start, but once it's running you want tight 30-second liveness checks to catch deadlocks quickly.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-monolith
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: legacy-monolith
  template:
    metadata:
      labels:
        app: legacy-monolith
    spec:
      containers:
        - name: legacy-monolith
          image: company/legacy-monolith:8.2.1
          ports:
            - containerPort: 8443
          startupProbe:                   # startupProbe: has the container finished initial startup?
            httpGet:
              path: /actuator/health      # Spring Boot actuator health endpoint
              port: 8443
            failureThreshold: 24          # Allow up to 24 failures before giving up
            periodSeconds: 5              # Check every 5 seconds
                                          # 24 failures × 5s = 120s maximum startup window
                                          # If startup takes longer than 120s, container is killed
            successThreshold: 1           # One success = startup complete, hand off to liveness/readiness

          livenessProbe:                  # This probe is DISABLED until startupProbe succeeds
            httpGet:
              path: /actuator/health/liveness
              port: 8443
            periodSeconds: 30             # Tight 30s check — once started, catch deadlocks fast
            timeoutSeconds: 10
            failureThreshold: 3

          readinessProbe:                 # This probe is also DISABLED until startupProbe succeeds
            httpGet:
              path: /actuator/health/readiness
              port: 8443
            initialDelaySeconds: 0        # No extra delay — startupProbe already handled the wait
            periodSeconds: 10
            failureThreshold: 3
            successThreshold: 1
$ kubectl apply -f legacy-monolith-deployment.yaml
deployment.apps/legacy-monolith created

$ kubectl get pods -n production -w
NAME                             READY   STATUS    RESTARTS   AGE
legacy-monolith-8b4f9c-p2rkx     0/1     Running   0          10s
legacy-monolith-8b4f9c-p2rkx     0/1     Running   0          45s
legacy-monolith-8b4f9c-p2rkx     0/1     Running   0          87s
legacy-monolith-8b4f9c-p2rkx     1/1     Running   0          93s  ← startup complete at ~93s

$ kubectl describe pod legacy-monolith-8b4f9c-p2rkx -n production | grep -A5 "Startup:"
    Startup:  http-get http://:8443/actuator/health delay=0s timeout=1s period=5s #success=1 #failure=24
    Liveness:  http-get http://:8443/actuator/health/liveness delay=0s timeout=10s period=30s #success=1 #failure=3
    Readiness: http-get http://:8443/actuator/health/readiness delay=0s timeout=1s period=10s #success=1 #failure=3

What just happened?

failureThreshold × periodSeconds = max startup time — With failureThreshold: 24 and periodSeconds: 5, the startup probe allows up to 120 seconds for the container to pass before killing it. The Pod was running for 87 seconds with READY: 0/1 and nobody panicked — because we designed it that way. On the 19th check at ~93 seconds, /actuator/health returned 200. Startup probe succeeded. Liveness and readiness took over immediately.

Spring Boot actuator endpoints — Spring Boot (and many other frameworks) expose dedicated health endpoints: /actuator/health for general health, /actuator/health/liveness for liveness state, and /actuator/health/readiness for readiness state. Using separate endpoints for each probe is best practice — each endpoint can return different signals based on different internal state checks.

No extra initialDelaySeconds needed — Because the startup probe already handled the waiting period, the liveness and readiness probes can use initialDelaySeconds: 0. They'll start running immediately after the startup probe reports success. No double-waiting.

The exec and tcpSocket Probe Mechanisms

The scenario: Your platform team runs Redis as a caching layer and PostgreSQL as a primary database. Neither speaks HTTP — you need probes that test them at the protocol level. For Redis you can run redis-cli ping inside the container. For PostgreSQL you can attempt a TCP connection on port 5432.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7.2
          ports:
            - containerPort: 6379
          livenessProbe:
            exec:                         # exec mechanism: run a command inside the container
              command:
                - redis-cli               # The command to run — must be in the container image
                - ping                    # redis-cli ping returns "PONG" with exit code 0 if healthy
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            tcpSocket:                    # tcpSocket mechanism: attempt a TCP connection
              port: 6379                  # Connect to this port — success = container is listening
            initialDelaySeconds: 5
            periodSeconds: 10

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-db
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres-db
  template:
    metadata:
      labels:
        app: postgres-db
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          livenessProbe:
            exec:
              command:
                - pg_isready               # PostgreSQL built-in readiness check utility
                - -U                       # -U flag: specify user
                - postgres                 # Username to check connectivity as
                - -d                       # -d flag: specify database
                - postgres                 # Database name to connect to
                                           # Returns exit code 0 if accepting connections, non-zero otherwise
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 6            # More tolerance for a database — 60s before restart
$ kubectl apply -f redis-postgres-deployments.yaml
deployment.apps/redis-cache created
deployment.apps/postgres-db created

$ kubectl get pods -n production
NAME                           READY   STATUS    RESTARTS   AGE
redis-cache-6c8b4f-9kvpm       1/1     Running   0          22s
postgres-db-4d7c9b-xr7nq       1/1     Running   0          22s

$ kubectl describe pod redis-cache-6c8b4f-9kvpm -n production | grep -A3 "Liveness:"
    Liveness:   exec [redis-cli ping] delay=10s timeout=5s period=15s #success=1 #failure=3
    Readiness:  tcp-socket :6379 delay=5s timeout=1s period=10s #success=1 #failure=3

What just happened?

exec probe — the command must be in the image — The exec probe runs a command inside the container using whatever binaries the container image ships. redis-cli is bundled with the official Redis image; pg_isready is bundled with the official PostgreSQL image. But if you're using a minimal distroless or scratch image, the binary may simply not be present: an exec probe only works when its command exists in the container's filesystem (no shell is required, but the executable must be there). When it isn't, use httpGet or tcpSocket instead.

tcpSocket — deceptively simple but powerful — A TCP probe only tests whether the port is accepting connections. It doesn't test whether the application is actually processing requests correctly. Redis could be listening on port 6379 but be full and refusing new keys — the TCP probe would still pass. For Redis, the redis-cli ping exec liveness probe is more meaningful. Using TCP for readiness and exec for liveness is a valid combination.
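To make that limitation concrete, here is roughly what a tcpSocket probe does, sketched in Python: open a connection, close it, report success. The kubelet's real implementation differs in details, but the semantics are the same — only "is something accepting connections on this port?" is tested.

```python
import socket

def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Open a TCP connection and immediately close it.

    Passing only proves something is accepting connections on the
    port; it says nothing about whether requests are handled well.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: the probe fails.
        return False
```

A Redis that is listening but rejecting writes would still pass this check, which is exactly why the exec-based redis-cli ping is the better liveness signal.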

failureThreshold: 6 for postgres — Databases deserve more liveness tolerance than stateless APIs. A PostgreSQL instance doing a checkpoint or vacuum might briefly fail a health check — you don't want to restart a database mid-checkpoint. 6 failures × 10s = 60 seconds of tolerance before a restart. For most databases, set failureThreshold higher than you would for a stateless service.

Probe Timing: The Full Picture

Understanding how the timing parameters interact is essential for configuring probes that are sensitive enough to catch real problems but tolerant enough to avoid false alarms:

Probe Timeline for a Typical HTTP Service

  0s     container starts
  +15s   first probe fires (initialDelaySeconds: 15)
  +35s   second probe fires (periodSeconds: 20)
  ...    a probe is marked failed if no response arrives within timeoutSeconds (5s)
         failure 1, failure 2, failure 3 (failureThreshold: 3) → container restarted
initialDelaySeconds (default: 0)
  Seconds to wait after container start before the first probe.
  Tune when the app has a slow cold start; set it to roughly the app's startup time.

periodSeconds (default: 10)
  How often to run the probe.
  Increase to reduce probe load; decrease for faster detection.

timeoutSeconds (default: 1)
  The probe fails if no response arrives within this time.
  Increase for slow responses under load — the 1s default is very tight.

failureThreshold (default: 3)
  Consecutive failures before action is taken.
  Increase for flaky endpoints or transient slowness.

successThreshold (default: 1)
  Consecutive successes before the probe is marked healthy again.
  Increase on readiness probes to prevent traffic flapping.
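The rule of thumb this lesson keeps using — failureThreshold × periodSeconds — fits in one helper. A tiny Python sketch (the function name is ours, not a Kubernetes API):

```python
def probe_window_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate consecutive-failure window before the kubelet acts:
    restart (liveness), endpoint removal (readiness), or kill at the
    end of the startup window (startup probe). Per-probe timeouts can
    add a few extra seconds on top of this rule of thumb."""
    return failure_threshold * period_seconds

# The lesson's own numbers:
print(probe_window_seconds(20, 3))   # order-processor liveness: 60
print(probe_window_seconds(5, 24))   # legacy-monolith startup window: 120
```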

Teacher's Note: The four mistakes everyone makes with probes

1. Using liveness for "not ready" scenarios. A liveness probe that fails because the app is overloaded will restart the Pod into the same load — making things worse. Use readiness to shed traffic; use liveness only to restart genuinely broken containers.

2. Setting timeoutSeconds too low. The default is 1 second. Under CPU throttling or heavy load, legitimate health check responses take longer than 1 second. Your app appears unhealthy when it's just slow. Set timeoutSeconds: 3 or 5 for any production service.

3. Probing the same endpoint for liveness and readiness. If your /health endpoint checks database connectivity, a DB blip will trigger the liveness probe and restart your Pod when it doesn't need to be restarted — it just temporarily lost DB access. Your liveness endpoint should check only things a restart would actually fix (in-process state). Your readiness endpoint can check external dependencies.

4. No startup probe for slow starters. Engineers set initialDelaySeconds: 120 on liveness to handle a slow-starting app, which means every healthy Pod waits 2 minutes before getting liveness checks on every restart. A startup probe gives the slow Pod time to start while still running tight liveness checks once it's up.
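Mistake 3 can be made concrete with two tiny handlers: liveness looks only at in-process state, while readiness also consults the external dependency. A Python sketch with hypothetical function names:

```python
def liveness_status(worker_heartbeat_fresh: bool) -> int:
    # Liveness checks only in-process state: things a restart
    # would actually fix. A database blip must NOT fail this.
    return 200 if worker_heartbeat_fresh else 503

def readiness_status(worker_heartbeat_fresh: bool, db_reachable: bool) -> int:
    # Readiness may also consult external dependencies: if the DB
    # is down, shed traffic, but leave the container running.
    return 200 if (worker_heartbeat_fresh and db_reachable) else 503
```

During a database outage the Pod stays alive (no pointless restart) but stops receiving traffic, which is exactly the split the probes were designed for.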

Practice Questions

1. Which probe removes a Pod from a Service's endpoints when it fails — without restarting the container — so that traffic stops reaching a temporarily unhealthy Pod?



2. A startup probe is configured with periodSeconds: 5. You want to give the container up to 90 seconds to start before Kubernetes kills it. What value should you set for failureThreshold?



3. You want to health-check a Redis container using redis-cli ping. The Redis container doesn't expose an HTTP endpoint. Which probe mechanism do you use?



Quiz

1. A Pod has all three probes configured: startup, liveness, and readiness. The container has just started. Which statement best describes the probe behaviour during the startup window?


2. A developer configures a liveness probe that calls /health with a 1-second timeout. Under peak load, the health endpoint takes 2 seconds to respond. What is the likely outcome?


3. Which value for successThreshold does Kubernetes enforce as a hard requirement on liveness probes — and why?


Up Next · Lesson 24

Scaling Applications

Manual scaling, the Horizontal Pod Autoscaler, and the patterns that let your services handle 10x traffic spikes without anyone touching the cluster.