Kubernetes Lesson 25 – Rolling Updates | Dataplexa
Core Kubernetes Concepts · Lesson 25

Rolling Updates

Zero-downtime deployments aren't a feature you have to build — Kubernetes gives them to you by default. The rolling update strategy is the mechanism behind every kubectl apply that doesn't interrupt users, and understanding its two control knobs is what separates a safe deploy pipeline from a risky one.

What a Rolling Update Actually Does

When you update a Deployment — change the image tag, bump a resource limit, update an env var — Kubernetes doesn't take down all your Pods at once and restart them. That would cause downtime. Instead it replaces them gradually, in waves, using the RollingUpdate strategy.

The process: create a new Pod with the new version → wait for it to pass its readiness probe → remove one old Pod → repeat until all Pods are on the new version. At every point during the rollout, there are healthy Pods serving traffic. Users never see downtime.

This works because Kubernetes creates a new ReplicaSet for the new version while keeping the old ReplicaSet running. The Deployment controller gradually shifts Pods from old to new — scaling the new ReplicaSet up and the old one down in coordinated steps.

The two update strategies Kubernetes supports

RollingUpdate (default): Replace Pods gradually. Zero downtime. Controlled by maxUnavailable and maxSurge. The right choice for almost all production workloads.

Recreate: Kill all old Pods first, then create new ones. Causes downtime. Use only when you absolutely cannot have two versions running simultaneously — for example, a database schema migration that breaks backward compatibility.

The Two Control Knobs: maxUnavailable and maxSurge

The entire rolling update behaviour is controlled by just two fields. Get these right and you control exactly how fast and how safe your deployments are.

maxUnavailable: the maximum number of Pods that can be unavailable (not ready) at any point during the rollout. Controls minimum available capacity. Default: 25%. Accepts an integer or a percentage.

maxSurge: the maximum number of extra Pods that can exist above the desired replica count during a rollout. Controls how fast new capacity comes online. Default: 25%. Accepts an integer or a percentage.

maxUnavailable is about safety — how many Pods can be down at once. maxSurge is about speed — how many extra Pods can run simultaneously. Setting maxUnavailable: 0 and maxSurge: 1 gives the safest possible rollout — never lose capacity, add one new Pod at a time. Note: the two fields cannot both be zero (or both resolve to zero after percentage rounding), as that would make progress impossible.
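A quick way to internalize the percentage defaults is to compute the bounds they imply. The sketch below is plain shell arithmetic (no cluster needed) for the 25%/25% defaults at 6 replicas; Kubernetes rounds maxUnavailable down and maxSurge up when converting percentages to absolute counts:

```shell
# Convert the percentage defaults to absolute Pod counts for replicas=6.
# Kubernetes rounds maxUnavailable DOWN and maxSurge UP.
replicas=6
max_unavailable_pct=25
max_surge_pct=25

max_unavailable=$(( replicas * max_unavailable_pct / 100 ))       # floor(1.5) = 1
max_surge=$(( (replicas * max_surge_pct + 99) / 100 ))            # ceil(1.5)  = 2

echo "min ready Pods during rollout: $(( replicas - max_unavailable ))"   # 5
echo "max total Pods during rollout: $(( replicas + max_surge ))"         # 8
```

So with the defaults, a 6-replica Deployment never drops below 5 ready Pods and never exceeds 8 total — the exact behaviour you see in the manifest walkthrough below.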

A Complete Rolling Update Manifest

The scenario: You're a platform engineer deploying a new version of the checkout API for a high-traffic e-commerce platform. The service runs 6 replicas and handles thousands of requests per minute. Any capacity reduction directly impacts conversion rates — so you need a rollout that never drops below 5 healthy Pods, but also completes within a few minutes rather than creeping along one Pod at a time.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: production
  annotations:
    kubernetes.io/change-cause: "Deploy v2.4.0: faster cart calculation — PR #512"
spec:
  replicas: 6                           # Run 6 Pods in steady state
  revisionHistoryLimit: 10              # Keep 10 revisions for rollbacks
  selector:
    matchLabels:
      app: checkout-api
  strategy:
    type: RollingUpdate                 # Default — specify explicitly for clarity
    rollingUpdate:
      maxUnavailable: 1                 # At most 1 Pod can be unavailable at any time
                                        # Guarantees minimum 5/6 Pods always serving traffic
      maxSurge: 2                       # Allow up to 2 extra Pods above 6 during rollout (max 8 total)
                                        # Lets 2 new Pods start before removing old ones → faster rollout
                                        # Higher maxSurge = faster rollout, more temporary resource usage
  template:
    metadata:
      labels:
        app: checkout-api
        version: "2.4.0"
    spec:
      containers:
        - name: checkout-api
          image: company/checkout-api:2.4.0   # Updated image — this triggers the rolling update
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "150m"
              memory: "200Mi"
            limits:
              cpu: "500m"
              memory: "350Mi"
          readinessProbe:               # readinessProbe is CRITICAL for safe rolling updates
            httpGet:                    # New Pod only enters rotation AFTER this passes
              path: /ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
            successThreshold: 1
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 20
            periodSeconds: 15
            failureThreshold: 3
$ kubectl apply -f checkout-deployment.yaml
deployment.apps/checkout-api configured

$ kubectl rollout status deployment/checkout-api -n production
Waiting for deployment "checkout-api" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "checkout-api" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "checkout-api" rollout to finish: 5 out of 6 new replicas have been updated...
Waiting for deployment "checkout-api" rollout to finish: 1 old replicas are pending termination...
deployment "checkout-api" successfully rolled out

$ kubectl get pods -n production -l app=checkout-api
NAME                             READY   STATUS    RESTARTS   AGE
checkout-api-9c4b2f-2xkpj        1/1     Running   0          2m
checkout-api-9c4b2f-7rvqn        1/1     Running   0          2m
checkout-api-9c4b2f-m4czl        1/1     Running   0          1m
checkout-api-9c4b2f-p9wxt        1/1     Running   0          1m
checkout-api-9c4b2f-r2skl        1/1     Running   0          58s
checkout-api-9c4b2f-xb4fm        1/1     Running   0          45s

What just happened?

The readiness probe gates the rollout — This is the most important sentence in this lesson. The rolling update controller does not mark a new Pod as "available" until its readiness probe passes. If the new image has a bug that prevents startup, the readiness probe keeps failing, the Pod never becomes available, and the rollout halts — with your old Pods still running. The broken deploy can't fully replace the good one. Without a readiness probe, Kubernetes considers a Pod ready the moment the container starts — regardless of whether the app is actually serving requests.

maxSurge: 2 in action — During the rollout, up to 8 Pods existed temporarily (6 desired + 2 surge). The controller created 2 new Pods, waited for them to be ready, removed 2 old ones, then repeated. The rollout completed in about 2 minutes because maxSurge let two new Pods start simultaneously rather than one at a time.

New ReplicaSet hash — All new Pods have the hash 9c4b2f. This is the new ReplicaSet created for v2.4.0. The old ReplicaSet (with 0 Pods) is kept around for rollback purposes.

How the Rolling Update Progresses — Step by Step

With replicas: 6, maxUnavailable: 1, and maxSurge: 2, here's every step the rollout controller takes:

Rolling Update Steps — 6 replicas, maxUnavailable: 1, maxSurge: 2

Start:  v2.3 ×6 ready (6 old, 0 new)
Step 1: v2.3 ×6 ready, v2.4 ×2 starting (8 total; the surge allowance of 2 is in use)
Step 2: v2.3 ×4 ready, v2.4 ×2 ready (2 new Pods passed readiness, so 2 old removed)
Step 3: v2.3 ×4 ready, v2.4 ×2 ready, v2.4 ×2 starting (surge again for the next wave)
Step 4: v2.3 ×2 ready, v2.4 ×4 ready (2 more old removed; one more wave repeats this)
Done:   v2.4 ×6 ready (all 6 on v2.4 — zero downtime)

Note that old Pods are only removed after replacement Pods pass readiness, so the ready count never drops below 5 (replicas − maxUnavailable) at any step.

What Triggers a Rolling Update

Not every change to a Deployment triggers a rolling update. Kubernetes only rolls out new Pods when the Pod template (spec.template) changes. Changes to the Deployment metadata, replica count, or strategy do not trigger a rollout.

Container image change: ✅ Yes. The Pod template changed, so a new ReplicaSet is created.
Env var added or changed: ✅ Yes. The Pod template changed.
Resource requests/limits changed: ✅ Yes. The Pod template changed.
Replica count change only: ❌ No. This only scales the existing ReplicaSet up or down.
Deployment metadata/annotations: ❌ No. Not part of the Pod template.
ConfigMap/Secret content changed: ⚠️ No. Kubernetes doesn't watch ConfigMap/Secret content; restart manually with kubectl rollout restart.
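The ConfigMap case has a common workaround beyond a manual restart: hash the config content into a Pod-template annotation, so editing the config changes the template and triggers an ordinary rolling update. The file name and annotation key below are illustrative, not a Kubernetes convention:

```shell
# Illustrative stand-in for your real ConfigMap source file.
printf 'featureFlag: true\n' > app-config.yaml

# Hash the content; any edit to the file produces a different checksum.
checksum=$(sha256sum app-config.yaml | cut -d' ' -f1)

# Stamping the checksum into the Pod template annotations changes the
# template, which triggers a rolling update. Apply it with, for example:
#   kubectl patch deployment checkout-api -n production --type merge \
#     -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"config-checksum\":\"$checksum\"}}}}}"
echo "config-checksum: $checksum"
```

Tools like Helm and Kustomize automate variants of this pattern; the point is simply that a template change is what the Deployment controller reacts to.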

Monitoring and Aborting a Live Rollout

The scenario: You've triggered a rolling update in production and you're watching it live. You need to know how to monitor progress, spot problems early, and abort if something goes wrong — all while the rollout is in flight.

kubectl rollout status deployment/checkout-api -n production
# Streams live progress — blocks until complete or failed
# Shows how many Pods have been updated and how many old ones are pending termination
# Exit code 0 = success, 1 = failed — integrate this into CI/CD pipelines for automatic abort

kubectl get pods -n production -l app=checkout-api -w
# -w: watch mode — streams Pod status changes in real time
# You should never see Ready drop below (replicas - maxUnavailable) during a healthy rollout

kubectl describe deployment checkout-api -n production
# Full rollout state: conditions, replica counts, recent events
# Look for: "Deployment does not have minimum availability" — means rollout is stalled
# Check Events at the bottom for the specific failure reason

kubectl rollout undo deployment/checkout-api -n production
# If something looks wrong mid-rollout — abort and roll back immediately
# Kubernetes reverses the rollout: scales new ReplicaSet back down, old ReplicaSet back up
$ kubectl rollout status deployment/checkout-api -n production
Waiting for deployment "checkout-api" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "checkout-api" rollout to finish: 2 out of 6 new replicas have been updated...

(new Pods stuck — readiness probe failing on the new image)

error: deployment "checkout-api" exceeded its progress deadline

$ kubectl describe deployment checkout-api -n production | grep -A6 "Conditions:"
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    False   ProgressDeadlineExceeded

$ kubectl rollout undo deployment/checkout-api -n production
deployment.apps/checkout-api rolled back

$ kubectl rollout status deployment/checkout-api -n production
deployment "checkout-api" successfully rolled out

What just happened?

ProgressDeadlineExceeded — Kubernetes has a progressDeadlineSeconds field on Deployments (default 600s — 10 minutes). If the rollout doesn't complete within that window, the Progressing condition flips to False. The rollout stalls but is not automatically aborted — you still need to run kubectl rollout undo manually, or have your CI/CD pipeline trigger it based on the non-zero exit code from kubectl rollout status.
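The non-zero exit code makes that pipeline automation straightforward to script. A hedged sketch of such a gate — the function name and arguments are my own, not a kubectl feature; it relies only on the documented behaviour that kubectl rollout status exits non-zero when the rollout fails or times out:

```shell
# Deploy, wait for the rollout, and undo automatically if it fails.
deploy_with_rollback() {
  manifest=$1 deployment=$2 namespace=$3
  kubectl apply -f "$manifest" || return 1
  if ! kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=5m; then
    echo "rollout of $deployment failed; rolling back" >&2
    kubectl rollout undo "deployment/$deployment" -n "$namespace"
    return 1
  fi
}

# Usage in a pipeline step:
#   deploy_with_rollback checkout-deployment.yaml checkout-api production
```

Returning 1 after the undo keeps the pipeline marked as failed, so a rolled-back deploy is never mistaken for a successful one.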

Available: True during a failed rollout — Even though the rollout failed, the old Pods kept running. The readiness probe prevented the broken new Pods from entering the Service endpoints. Zero user-facing downtime even for a broken deploy — this is the entire value of readiness probes in a rolling update context.

rollout undo mid-rollout — Running rollout undo during a stalled rollout scales the new (broken) ReplicaSet back to 0 and restores the old ReplicaSet to full desired count. Recovery is fast because some old ReplicaSet Pods were never terminated.

The Recreate Strategy

The scenario: You're deploying a breaking database schema change. The new app version is incompatible with the old schema and the old app version is incompatible with the new schema. You cannot run both versions simultaneously without data corruption. This is the one legitimate use case for Recreate.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: schema-migration-app
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: schema-migration-app
  strategy:
    type: Recreate               # Recreate: kill ALL old Pods first, then create new ones
                                 # WARNING: causes downtime — all Pods are unavailable during transition
                                 # No rollingUpdate block — Recreate takes no parameters
                                 # Use only when two versions running simultaneously is truly impossible
  template:
    metadata:
      labels:
        app: schema-migration-app
    spec:
      containers:
        - name: schema-migration-app
          image: company/schema-migration-app:2.0.0
          ports:
            - containerPort: 8080
$ kubectl apply -f schema-migration-deployment.yaml
deployment.apps/schema-migration-app configured

$ kubectl get pods -n production -l app=schema-migration-app -w
NAME                                    READY   STATUS        RESTARTS   AGE
schema-migration-app-6b4c9d-2xkpj       1/1     Terminating   0          3d
schema-migration-app-6b4c9d-7rvqn       1/1     Terminating   0          3d
(both old Pods fully terminated — downtime window begins here)
schema-migration-app-9f2a3c-m4czl       0/1     ContainerCreating   0   2s
schema-migration-app-9f2a3c-p9wxt       0/1     ContainerCreating   0   2s
schema-migration-app-9f2a3c-m4czl       1/1     Running             0   14s
schema-migration-app-9f2a3c-p9wxt       1/1     Running             0   16s

What just happened?

The downtime gap — The watch output shows it clearly: both old Pods terminated before any new Pods started. There's a gap of roughly 10–15 seconds (image pull + container start) where zero Pods are serving traffic. If you use Recreate, plan for this downtime — schedule it in a maintenance window, notify users, and have monitoring ready to confirm recovery.

Avoiding Recreate with migration discipline — Most teams avoid Recreate by running schema migrations as a Kubernetes Job before the new app version deploys, and writing migrations to be backward compatible so both old and new app versions can run against the updated schema simultaneously. This takes more engineering discipline but eliminates the downtime window entirely.
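That migration-Job pattern can be sketched as a plain batch Job that runs to completion before the new app version is applied. The Job name, backoff settings, and migration command below are illustrative assumptions, not part of the example above:

```yaml
# Sketch: run the schema migration as a Job before deploying the new
# app version. Names and the migration command are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: schema-migrate-v2
  namespace: production
spec:
  backoffLimit: 2                # Retry a failed migration at most twice
  template:
    spec:
      restartPolicy: Never       # Jobs must not use the default Always policy
      containers:
        - name: migrate
          image: company/schema-migration-app:2.0.0
          command: ["./migrate", "up"]   # Assumed migration entrypoint
```

A pipeline can then gate the Deployment apply on `kubectl wait --for=condition=complete job/schema-migrate-v2 -n production`, so new app Pods only ever start against the migrated schema.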

Teacher's Note: The readiness probe is not optional for safe rolling updates

I've seen teams deploy without readiness probes and then wonder why rolling updates cause intermittent errors. Here's what happens: without a readiness probe, Kubernetes marks a new Pod as "available" the instant the container starts — before the application has finished binding to its port, loading configuration, or warming caches. Traffic hits the new Pod. The Pod isn't ready. Users see errors.

With a readiness probe, Kubernetes waits until the Pod responds successfully (for an httpGet probe, any status code from 200 up to but not including 400) before adding it to Service endpoints. The rollout is a bit slower — but zero requests reach an unready Pod. This is the correct trade-off for production.

The formula for a production-safe rolling update: readiness probe (gates traffic) + maxUnavailable: 0 or 1 (preserves capacity) + maxSurge: 1 or 2 (controls speed) + progressDeadlineSeconds tuned to your startup time (auto-detects stuck rollouts). These four together give you deployments that are both fast and safe — and CI/CD pipelines that automatically abort broken deploys via the non-zero exit code from kubectl rollout status.
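Expressed as manifest fragments, that formula looks like this — the deadline value and probe path are illustrative, to be tuned to your own startup time:

```yaml
# The four safety settings together. Note progressDeadlineSeconds sits
# at the Deployment spec level, not inside strategy.
spec:
  progressDeadlineSeconds: 180      # Flag a stalled rollout after 3 minutes (default 600)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0             # Never drop below full capacity
      maxSurge: 1                   # Add one new Pod at a time
  template:
    spec:
      containers:
        - name: app
          readinessProbe:           # Gates traffic to each new Pod
            httpGet:
              path: /ready
              port: 3000
```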

Practice Questions

1. Which rollingUpdate field controls the minimum number of healthy Pods that must remain available throughout a rolling update — preventing capacity from dropping too low?



2. What prevents a new Pod from receiving traffic during a rolling update until the application inside it is fully started and ready to serve requests?



3. Which Deployment update strategy terminates all existing Pods before creating new ones — causing a downtime window — and should only be used when two versions cannot coexist simultaneously?



Quiz

1. You update a Deployment by changing spec.replicas from 3 to 6 but make no other changes. Does Kubernetes perform a rolling update?


2. A new image is deployed via rolling update but the readiness probe keeps failing on every new Pod. What happens?


3. You want the safest possible rolling update — no capacity reduction at any point, add only one new Pod at a time. Which settings achieve this?


Up Next · Lesson 26

Rollbacks

A bad deploy just hit production. Error rates are spiking. Here's how to roll back in one command, how revision history works, and the Git drift trap that re-introduces the broken version after you've already recovered.