Kubernetes Course
Rolling Updates
Zero-downtime deployments aren't a feature you have to build — Kubernetes gives them to you by default. The rolling update strategy is the mechanism behind every kubectl apply that doesn't interrupt users, and understanding its two control knobs is what separates a safe deploy pipeline from a risky one.
What a Rolling Update Actually Does
When you update a Deployment — change the image tag, bump a resource limit, update an env var — Kubernetes doesn't take down all your Pods at once and restart them. That would cause downtime. Instead it replaces them gradually, in waves, using the RollingUpdate strategy.
The process: create a new Pod with the new version → wait for it to pass its readiness probe → remove one old Pod → repeat until all Pods are on the new version. At every point during the rollout, there are healthy Pods serving traffic. Users never see downtime.
This works because Kubernetes creates a new ReplicaSet for the new version while keeping the old ReplicaSet running. The Deployment controller gradually shifts Pods from old to new — scaling the new ReplicaSet up and the old one down in coordinated steps.
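You can watch this handoff directly. Listing ReplicaSets during (or after) a rollout shows both generations, using the label and namespace from the manifest later in this lesson:

```bash
# List ReplicaSets for the app: the new one scales up as the old scales down.
kubectl get replicasets -n production -l app=checkout-api
# After the rollout completes, the old ReplicaSet remains at 0 replicas.
# It is kept around as the rollback target.
```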
The two update strategies Kubernetes supports
RollingUpdate (default): Replace Pods gradually. Zero downtime. Controlled by maxUnavailable and maxSurge. The right choice for almost all production workloads.
Recreate: Kill all old Pods first, then create new ones. Causes downtime. Use only when you absolutely cannot have two versions running simultaneously — for example, a database schema migration that breaks backward compatibility.
The Two Control Knobs: maxUnavailable and maxSurge
The entire rolling update behaviour is controlled by just two fields. Get these right and you control exactly how fast and how safe your deployments are.
| Field | What it controls | Default | Accepts |
|---|---|---|---|
| maxUnavailable | Maximum Pods that can be unavailable (not ready) at any point during the rollout. Controls minimum available capacity. | 25% | Integer or percentage |
| maxSurge | Maximum extra Pods that can exist above the desired replica count during a rollout. Controls how fast new capacity comes online. | 25% | Integer or percentage |
maxUnavailable is about safety — how many Pods can be down at once. maxSurge is about speed — how many extra Pods can run simultaneously. Setting maxUnavailable: 0 and maxSurge: 1 gives the safest possible rollout — never lose capacity, add one new Pod at a time. Note: the two fields cannot both be zero, as that would make progress impossible.
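As a concrete illustration, that safest configuration looks like this inside a Deployment spec:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never drop below the desired replica count
    maxSurge: 1         # add one new Pod at a time (slowest, safest)
```

With these settings the controller must create a new Pod and wait for it to become ready before it is allowed to terminate any old one, so the rollout is slower but serving capacity never dips.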
A Complete Rolling Update Manifest
The scenario: You're a platform engineer deploying a new version of the checkout API for a high-traffic e-commerce platform. The service runs 6 replicas and handles thousands of requests per minute. Any capacity reduction directly impacts conversion rates — so you need a rollout that never drops below 5 healthy Pods, but also completes within a few minutes rather than creeping along one Pod at a time.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: production
  annotations:
    kubernetes.io/change-cause: "Deploy v2.4.0: faster cart calculation — PR #512"
spec:
  replicas: 6                  # Run 6 Pods in steady state
  revisionHistoryLimit: 10     # Keep 10 revisions for rollbacks
  selector:
    matchLabels:
      app: checkout-api
  strategy:
    type: RollingUpdate        # Default — specify explicitly for clarity
    rollingUpdate:
      maxUnavailable: 1        # At most 1 Pod can be unavailable at any time
                               # Guarantees minimum 5/6 Pods always serving traffic
      maxSurge: 2              # Allow up to 2 extra Pods above 6 during rollout (max 8 total)
                               # Lets 2 new Pods start before removing old ones → faster rollout
                               # Higher maxSurge = faster rollout, more temporary resource usage
  template:
    metadata:
      labels:
        app: checkout-api
        version: "2.4.0"
    spec:
      containers:
        - name: checkout-api
          image: company/checkout-api:2.4.0   # Updated image — this triggers the rolling update
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "150m"
              memory: "200Mi"
            limits:
              cpu: "500m"
              memory: "350Mi"
          readinessProbe:      # readinessProbe is CRITICAL for safe rolling updates
            httpGet:           # New Pod only enters rotation AFTER this passes
              path: /ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
            successThreshold: 1
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 20
            periodSeconds: 15
            failureThreshold: 3
```
```
$ kubectl apply -f checkout-deployment.yaml
deployment.apps/checkout-api configured

$ kubectl rollout status deployment/checkout-api -n production
Waiting for deployment "checkout-api" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "checkout-api" rollout to finish: 4 out of 6 new replicas have been updated...
Waiting for deployment "checkout-api" rollout to finish: 5 out of 6 new replicas have been updated...
Waiting for deployment "checkout-api" rollout to finish: 1 old replicas are pending termination...
deployment "checkout-api" successfully rolled out

$ kubectl get pods -n production -l app=checkout-api
NAME                        READY   STATUS    RESTARTS   AGE
checkout-api-9c4b2f-2xkpj   1/1     Running   0          2m
checkout-api-9c4b2f-7rvqn   1/1     Running   0          2m
checkout-api-9c4b2f-m4czl   1/1     Running   0          1m
checkout-api-9c4b2f-p9wxt   1/1     Running   0          1m
checkout-api-9c4b2f-r2skl   1/1     Running   0          58s
checkout-api-9c4b2f-xb4fm   1/1     Running   0          45s
```
What just happened?
The readiness probe gates the rollout — This is the most important sentence in this lesson. The rolling update controller does not mark a new Pod as "available" until its readiness probe passes. If the new image has a bug that prevents startup, the readiness probe keeps failing, the Pod never becomes available, and the rollout halts — with your old Pods still running. The broken deploy can't fully replace the good one. Without a readiness probe, Kubernetes considers a Pod ready the moment the container starts — regardless of whether the app is actually serving requests.
maxSurge: 2 in action — During the rollout, up to 8 Pods existed temporarily (6 desired + 2 surge). The controller created 2 new Pods, waited for them to be ready, removed 2 old ones, then repeated. The rollout completed in about 2 minutes because maxSurge let two new Pods start simultaneously rather than one at a time.
New ReplicaSet hash — All new Pods have the hash 9c4b2f. This is the new ReplicaSet created for v2.4.0. The old ReplicaSet (with 0 Pods) is kept around for rollback purposes.
How the Rolling Update Progresses — Step by Step
With replicas: 6, maxUnavailable: 1, and maxSurge: 2, the rollout controller works in waves: it creates up to 2 new Pods at a time (never more than 8 Pods total), waits for them to pass their readiness probes, then terminates old Pods as long as at least 5 Pods stay ready — repeating until all 6 replicas run the new version.
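The exact step ordering the controller chooses is an implementation detail, but every valid sequence obeys the same two bounds. A minimal Python sketch of one valid progression (assuming each new Pod passes its readiness probe immediately):

```python
# Illustrative simulation of the two rolling-update invariants.
# This is NOT the real controller logic; it only shows that any valid
# rollout keeps ready Pods >= replicas - maxUnavailable and total
# Pods <= replicas + maxSurge at every step.
replicas, max_unavailable, max_surge = 6, 1, 2

old_ready, new_ready = replicas, 0   # start: 6 old Pods ready, 0 new
steps = []
while new_ready < replicas:
    # Surge up: create new Pods, but never exceed replicas + maxSurge total.
    create = min(replicas + max_surge - (old_ready + new_ready),
                 replicas - new_ready)
    new_ready += create              # assume each new Pod becomes ready
    # Scale down: remove old Pods, keeping >= replicas - maxUnavailable ready.
    remove = max(0, min(old_ready,
                        old_ready + new_ready - (replicas - max_unavailable)))
    old_ready -= remove
    steps.append((old_ready, new_ready))

for old, new in steps:
    assert old + new >= replicas - max_unavailable   # never fewer than 5 ready
    assert old + new <= replicas + max_surge         # never more than 8 total

print(steps)   # [(3, 2), (0, 5), (0, 6)]
```

At no point does the ready count drop below 5, and the total never exceeds 8: exactly the guarantees that maxUnavailable: 1 and maxSurge: 2 encode.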
What Triggers a Rolling Update
Not every change to a Deployment triggers a rolling update. Kubernetes only rolls out new Pods when the Pod template (spec.template) changes. Changes to the Deployment metadata, replica count, or strategy do not trigger a rollout.
| Change type | Triggers rolling update? | Why |
|---|---|---|
| Container image change | ✅ Yes | Pod template changed — new ReplicaSet created |
| Env var added or changed | ✅ Yes | Pod template changed |
| Resource requests/limits changed | ✅ Yes | Pod template changed |
| Replica count change only | ❌ No | Only scales existing ReplicaSet up or down |
| Deployment metadata/annotations | ❌ No | Not part of the Pod template |
| ConfigMap/Secret content changed | ⚠️ No — manual restart needed | Kubernetes doesn't watch ConfigMap/Secret content — use kubectl rollout restart |
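Besides kubectl rollout restart, a common way to make config changes roll out automatically (popularized by Helm charts) is to hash the ConfigMap content into a Pod-template annotation: when the config changes, the hash changes, the Pod template changes, and a normal rolling update fires. A sketch — the annotation key and hash value here are illustrative conventions, not Kubernetes built-ins:

```yaml
spec:
  template:
    metadata:
      annotations:
        # Recomputed at render/deploy time, e.g. a sha256 of the ConfigMap data.
        # Any config change alters this value, which alters the Pod template
        # and therefore triggers a rolling update.
        checksum/config: "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
```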
Monitoring and Aborting a Live Rollout
The scenario: You've triggered a rolling update in production and you're watching it live. You need to know how to monitor progress, spot problems early, and abort if something goes wrong — all while the rollout is in flight.
```bash
kubectl rollout status deployment/checkout-api -n production
# Streams live progress — blocks until complete or failed
# Shows how many Pods have been updated and how many old ones are pending termination
# Exit code 0 = success, 1 = failed — integrate this into CI/CD pipelines for automatic abort

kubectl get pods -n production -l app=checkout-api -w
# -w: watch mode — streams Pod status changes in real time
# You should never see Ready drop below (replicas - maxUnavailable) during a healthy rollout

kubectl describe deployment checkout-api -n production
# Full rollout state: conditions, replica counts, recent events
# Look for: "Deployment does not have minimum availability" — means rollout is stalled
# Check Events at the bottom for the specific failure reason

kubectl rollout undo deployment/checkout-api -n production
# If something looks wrong mid-rollout — abort and roll back immediately
# Kubernetes reverses the rollout: scales new ReplicaSet back down, old ReplicaSet back up
```
```
$ kubectl rollout status deployment/checkout-api -n production
Waiting for deployment "checkout-api" rollout to finish: 2 out of 6 new replicas have been updated...
Waiting for deployment "checkout-api" rollout to finish: 2 out of 6 new replicas have been updated...
(new Pods stuck — readiness probe failing on the new image)
error: deployment "checkout-api" exceeded its progress deadline

$ kubectl describe deployment checkout-api -n production | grep -A6 "Conditions:"
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    False   ProgressDeadlineExceeded

$ kubectl rollout undo deployment/checkout-api -n production
deployment.apps/checkout-api rolled back

$ kubectl rollout status deployment/checkout-api -n production
deployment "checkout-api" successfully rolled out
```
What just happened?
ProgressDeadlineExceeded — Kubernetes has a progressDeadlineSeconds field on Deployments (default 600s — 10 minutes). If the rollout doesn't complete within that window, the Progressing condition flips to False. The rollout stalls but is not automatically aborted — you still need to run kubectl rollout undo manually, or have your CI/CD pipeline trigger it based on the non-zero exit code from kubectl rollout status.
Available: True during a failed rollout — Even though the rollout failed, the old Pods kept running. The readiness probe prevented the broken new Pods from entering the Service endpoints. Zero user-facing downtime even for a broken deploy — this is the entire value of readiness probes in a rolling update context.
rollout undo mid-rollout — Running rollout undo during a stalled rollout scales the new (broken) ReplicaSet back to 0 and restores the old ReplicaSet to full desired count. Recovery is fast because some old ReplicaSet Pods were never terminated.
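progressDeadlineSeconds is a top-level field of the Deployment spec. The sketch below tightens it from the 600s default; 120 is an illustrative value, and it should sit comfortably above your app's worst-case startup time so healthy-but-slow rollouts aren't flagged:

```yaml
spec:
  progressDeadlineSeconds: 120   # mark the rollout stalled after 2 minutes without progress
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
```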
The Recreate Strategy
The scenario: You're deploying a breaking database schema change. The new app version is incompatible with the old schema and the old app version is incompatible with the new schema. You cannot run both versions simultaneously without data corruption. This is the one legitimate use case for Recreate.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: schema-migration-app
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: schema-migration-app
  strategy:
    type: Recreate    # Recreate: kill ALL old Pods first, then create new ones
                      # WARNING: causes downtime — all Pods are unavailable during transition
                      # No rollingUpdate block — Recreate takes no parameters
                      # Use only when two versions running simultaneously is truly impossible
  template:
    metadata:
      labels:
        app: schema-migration-app
    spec:
      containers:
        - name: schema-migration-app
          image: company/schema-migration-app:2.0.0
          ports:
            - containerPort: 8080
```
```
$ kubectl apply -f schema-migration-deployment.yaml
deployment.apps/schema-migration-app configured

$ kubectl get pods -n production -l app=schema-migration-app -w
NAME                                READY   STATUS              RESTARTS   AGE
schema-migration-app-6b4c9d-2xkpj   1/1     Terminating         0          3d
schema-migration-app-6b4c9d-7rvqn   1/1     Terminating         0          3d
(both old Pods fully terminated — downtime window begins here)
schema-migration-app-9f2a3c-m4czl   0/1     ContainerCreating   0          2s
schema-migration-app-9f2a3c-p9wxt   0/1     ContainerCreating   0          2s
schema-migration-app-9f2a3c-m4czl   1/1     Running             0          14s
schema-migration-app-9f2a3c-p9wxt   1/1     Running             0          16s
```
What just happened?
The downtime gap — The watch output shows it clearly: both old Pods terminated before any new Pods started. There's a gap of roughly 10–15 seconds (image pull + container start) where zero Pods are serving traffic. If you use Recreate, plan for this downtime — schedule it in a maintenance window, notify users, and have monitoring ready to confirm recovery.
Avoiding Recreate with migration discipline — Most teams avoid Recreate by running schema migrations as a Kubernetes Job before the new app version deploys, and writing migrations to be backward compatible so both old and new app versions can run against the updated schema simultaneously. This takes more engineering discipline but eliminates the downtime window entirely.
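A sketch of that pattern, with the migration run as a Job the pipeline waits on before applying the new Deployment. The Job name, image entrypoint, and command here are hypothetical:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-db-migrate-v2     # hypothetical name
  namespace: production
spec:
  backoffLimit: 2                  # retry a failed migration at most twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: company/checkout-api:2.4.0          # same image, migration entrypoint
          command: ["./migrate", "--to", "latest"]   # hypothetical migration command
```

The pipeline runs kubectl wait --for=condition=complete job/checkout-db-migrate-v2 -n production before deploying the new app version. Because the migration is written to be backward compatible, old Pods keep working against the updated schema while the rolling update proceeds.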
Teacher's Note: The readiness probe is not optional for safe rolling updates
I've seen teams deploy without readiness probes and then wonder why rolling updates cause intermittent errors. Here's what happens: without a readiness probe, Kubernetes marks a new Pod as "available" the instant the container starts — before the application has finished binding to its port, loading configuration, or warming caches. Traffic hits the new Pod. The Pod isn't ready. Users see errors.
With a readiness probe, Kubernetes waits until the Pod responds successfully (for an httpGet probe, any status code from 200 up to but not including 400) before adding it to Service endpoints. The rollout is a bit slower — but zero requests reach an unready Pod. This is the correct trade-off for production.
The formula for a production-safe rolling update: readiness probe (gates traffic) + maxUnavailable: 0 or 1 (preserves capacity) + maxSurge: 1 or 2 (controls speed) + progressDeadlineSeconds tuned to your startup time (auto-detects stuck rollouts). These four together give you deployments that are both fast and safe — and CI/CD pipelines that automatically abort broken deploys via the non-zero exit code from kubectl rollout status.
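A minimal pipeline step wiring that formula together — the deployment name, file, and timeout are illustrative:

```bash
#!/usr/bin/env sh
set -e
kubectl apply -f checkout-deployment.yaml
# Blocks until the rollout completes; exits non-zero on failure or timeout.
if ! kubectl rollout status deployment/checkout-api -n production --timeout=5m; then
  kubectl rollout undo deployment/checkout-api -n production
  exit 1   # fail the pipeline after restoring the previous version
fi
```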
Practice Questions
1. Which rollingUpdate field controls the minimum number of healthy Pods that must remain available throughout a rolling update — preventing capacity from dropping too low?
2. What prevents a new Pod from receiving traffic during a rolling update until the application inside it is fully started and ready to serve requests?
3. Which Deployment update strategy terminates all existing Pods before creating new ones — causing a downtime window — and should only be used when two versions cannot coexist simultaneously?
Quiz
1. You update a Deployment by changing spec.replicas from 3 to 6 but make no other changes. Does Kubernetes perform a rolling update?
2. A new image is deployed via rolling update but the readiness probe keeps failing on every new Pod. What happens?
3. You want the safest possible rolling update — no capacity reduction at any point, add only one new Pod at a time. Which settings achieve this?
Up Next · Lesson 26
Rollbacks
A bad deploy just hit production. Error rates are spiking. Here's how to roll back in one command, how revision history works, and the Git drift trap that re-introduces the broken version after you've already recovered.