Kubernetes Course
Rollbacks
You pushed a bad deployment. Error rates are spiking. Users are hitting 500s. The fastest path back to stability is a rollback — and in Kubernetes, the whole thing takes one command and about 30 seconds. This lesson covers how rollback history works, how to execute a rollback under pressure, and how to avoid the mistakes that make rollbacks fail when you need them most.
How Kubernetes Remembers Your Deployments
Every time you change a Deployment's Pod template — update the image, modify env vars, change resource limits — Kubernetes saves the previous state as a revision. Each revision corresponds to a ReplicaSet. Rolling back simply means telling Kubernetes to make a previous ReplicaSet the active one again.
The key insight: the old ReplicaSet is never deleted when you roll forward. It's kept around with its Pod template intact, scaled to zero. A rollback just scales the old ReplicaSet back up and the current one down to zero, using the same rolling update mechanism as a forward deploy. Healthy Pods from the previous revision come online before the broken Pods are terminated.
revisionHistoryLimit controls how many revisions are kept
By default, Kubernetes keeps the last 10 revisions for every Deployment. This is controlled by the revisionHistoryLimit field in the Deployment spec. Keep it at 10 for production. If you set it to 0, you lose the ability to roll back entirely — which is never worth the tiny etcd space saving.
Viewing Rollout History
The scenario: Your team has been actively deploying the payment API over the past two weeks. You're now investigating a production incident and need to understand exactly what has changed across the last several deployments — which image versions were deployed, when, and what changed.
kubectl rollout history deployment/payment-api -n production
# history: list all saved revisions for a Deployment
# Shows REVISION number and CHANGE-CAUSE (if annotated — more on this below)
# REVISION 1 = first ever deploy, higher numbers = more recent
kubectl rollout history deployment/payment-api --revision=3 -n production
# --revision=N: inspect a specific revision in detail
# Shows the full Pod template that was active at that revision:
# image, env vars, resource limits, labels — everything
$ kubectl rollout history deployment/payment-api -n production
deployment.apps/payment-api
REVISION CHANGE-CAUSE
1 <none>
2 <none>
3 <none>
4 <none>
5 <none>
$ kubectl rollout history deployment/payment-api --revision=4 -n production
deployment.apps/payment-api with revision #4
Pod Template:
Labels: app=payment-api
version=3.1.0
Containers:
payment-api:
Image: company/payment-api:3.1.0
Port: 8080/TCP
Limits: cpu: 500m, memory: 350Mi
Requests: cpu: 150m, memory: 200Mi
Environment:
APP_ENV: production
LOG_LEVEL: info
$ kubectl rollout history deployment/payment-api --revision=5 -n production
deployment.apps/payment-api with revision #5
Pod Template:
Labels: app=payment-api
version=3.2.0
Containers:
payment-api:
Image: company/payment-api:3.2.0
Port: 8080/TCP
What just happened?
CHANGE-CAUSE: <none> — This is the most common and most frustrating rollout history output. Every revision shows <none> because nobody set the kubernetes.io/change-cause annotation. You can see what image changed by inspecting each revision individually, but you can't immediately tell why the deployment happened. We'll fix this next.
--revision=N for diff-style debugging — By inspecting revision 4 and 5 individually, you can manually diff them. Revision 4 ran company/payment-api:3.1.0 and revision 5 runs company/payment-api:3.2.0. If 3.2.0 is the bad deploy, you know exactly which revision to roll back to.
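Under incident pressure it helps to script that manual diff. A minimal sketch, assuming you have saved each revision dump to a file; the inlined one-line dumps below are stand-ins for the real `kubectl rollout history --revision=N` output, which needs a live cluster:

```shell
# Stand-in dumps — in practice, redirect the real command output, e.g.:
#   kubectl rollout history deployment/payment-api --revision=4 -n production > /tmp/rev4.txt
#   kubectl rollout history deployment/payment-api --revision=5 -n production > /tmp/rev5.txt
printf 'Image: company/payment-api:3.1.0\n' > /tmp/rev4.txt
printf 'Image: company/payment-api:3.2.0\n' > /tmp/rev5.txt

# diff the two Pod templates; diff exits non-zero when they differ,
# so `|| true` keeps a pipeline script from aborting on a difference
diff /tmp/rev4.txt /tmp/rev5.txt || true
```

The diff output pinpoints exactly which fields changed between the two revisions, so you know precisely what a rollback will revert.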
Making History Useful: CHANGE-CAUSE Annotations
A rollout history full of <none> entries is almost useless under incident pressure. The kubernetes.io/change-cause annotation populates the CHANGE-CAUSE column and turns your rollout history into a readable changelog. There are two ways to set it.
The scenario: Your team is adopting a standard practice — every deployment must have a human-readable change cause so that during incidents, anyone can read the history and understand what changed and why without hunting through Slack or Jira.
kubectl annotate deployment/payment-api \
kubernetes.io/change-cause="Deploy v3.2.0: adds Apple Pay support — PR #441" \
-n production
# annotate: add or update an annotation on an existing object
# kubernetes.io/change-cause: the special annotation that populates the CHANGE-CAUSE column
# Set this AFTER kubectl apply — or bake it into your CI/CD pipeline
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-api
namespace: production
annotations:
kubernetes.io/change-cause: "Deploy v3.2.0: adds Apple Pay support — PR #441"
# Bake the change-cause into the manifest itself
# Every kubectl apply will record this as the CHANGE-CAUSE for the new revision
# CI/CD pipelines can inject the value dynamically when templating the manifest,
# e.g. kubernetes.io/change-cause: "Deploy ${IMAGE_TAG} by ${USER}"
spec:
revisionHistoryLimit: 10 # Keep the last 10 revisions — never set to 0
replicas: 3
selector:
matchLabels:
app: payment-api
template:
metadata:
labels:
app: payment-api
version: "3.2.0"
spec:
containers:
- name: payment-api
image: company/payment-api:3.2.0
ports:
- containerPort: 8080
resources:
requests:
cpu: "150m"
memory: "200Mi"
limits:
cpu: "500m"
memory: "350Mi"
$ kubectl apply -f payment-api-deployment.yaml
deployment.apps/payment-api configured

$ kubectl rollout history deployment/payment-api -n production
deployment.apps/payment-api
REVISION  CHANGE-CAUSE
1         Deploy v3.0.0: initial release — PR #388
2         Deploy v3.0.1: fix session timeout bug — PR #401
3         Deploy v3.1.0: adds card tokenisation — PR #419
4         Deploy v3.1.1: hotfix payment retry logic — PR #437
5         Deploy v3.2.0: adds Apple Pay support — PR #441
What just happened?
Readable rollout history — Now the history tells a story. At 3am with alerts firing, you can read the CHANGE-CAUSE column and immediately see that revision 5 added Apple Pay support. If errors started after that deploy, you know exactly which revision to roll back to without opening a single Jira ticket.
CI/CD pipeline integration — The best practice is to have your deployment pipeline inject the change cause dynamically: commit SHA, PR number, deployer's name, and timestamp. Something like "Deploy 3.2.0 by alice@company.com from PR #441 at 2025-03-14T10:22Z". With this, rollout history becomes a complete audit trail at no extra effort cost.
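A sketch of how a pipeline step might assemble that string. The variable names (IMAGE_TAG, DEPLOYER, PR_NUMBER) are illustrative, not real kubectl or CI built-ins; your CI system supplies the real values:

```shell
# Hypothetical pipeline variables — your CI system provides the real values
IMAGE_TAG="3.2.0"
DEPLOYER="alice@company.com"
PR_NUMBER="441"
STAMP="2025-03-14T10:22Z"   # in a real pipeline: $(date -u +%Y-%m-%dT%H:%MZ)

# Assemble the audit-trail string for the CHANGE-CAUSE column
CHANGE_CAUSE="Deploy ${IMAGE_TAG} by ${DEPLOYER} from PR #${PR_NUMBER} at ${STAMP}"
echo "$CHANGE_CAUSE"

# The pipeline would then run, after kubectl apply:
#   kubectl annotate deployment/payment-api \
#     kubernetes.io/change-cause="$CHANGE_CAUSE" -n production --overwrite
```

The `--overwrite` flag matters in a pipeline: without it, kubectl annotate refuses to replace an annotation that already exists.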
Executing a Rollback
The scenario: It's 10:31 AM. Payment API v3.2.0 deployed 8 minutes ago. Error rates jumped from 0.1% to 14% immediately after the deploy. Payments are failing. You have confirmed in the logs it's related to the Apple Pay integration — a dependency on an external Apple API that isn't available in your production environment yet. You need to roll back to v3.1.1 immediately.
kubectl rollout undo deployment/payment-api -n production
# undo: roll back to the PREVIOUS revision (one step back)
# This is the fastest path — no revision number needed
# Kubernetes reverses the last rolling update using the same rolling mechanism
# The previous ReplicaSet scales up while the current one scales down
kubectl rollout undo deployment/payment-api --to-revision=4 -n production
# --to-revision=N: roll back to a SPECIFIC revision
# Use this when the previous revision is also bad and you need to go further back
# Revision 4 = v3.1.1 (from our history above) — the last known good state
kubectl rollout status deployment/payment-api -n production
# Monitor the rollback in real time — blocks until complete
# Shows which replica set is scaling up/down
# Exit 0 = rollback complete, all Pods on previous version, all Ready
$ kubectl rollout undo deployment/payment-api --to-revision=4 -n production
deployment.apps/payment-api rolled back

$ kubectl rollout status deployment/payment-api -n production
Waiting for deployment "payment-api" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "payment-api" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "payment-api" rollout to finish: 1 old replicas are pending termination...
deployment "payment-api" successfully rolled out

$ kubectl rollout history deployment/payment-api -n production
deployment.apps/payment-api
REVISION  CHANGE-CAUSE
1         Deploy v3.0.0: initial release — PR #388
2         Deploy v3.0.1: fix session timeout bug — PR #401
3         Deploy v3.1.0: adds card tokenisation — PR #419
5         Deploy v3.2.0: adds Apple Pay support — PR #441
6         Deploy v3.1.1: hotfix payment retry logic — PR #437

$ kubectl get pods -n production -l app=payment-api
NAME                       READY   STATUS    RESTARTS   AGE
payment-api-7d9c4b-2xkpj   1/1     Running   0          38s
payment-api-7d9c4b-8rvnq   1/1     Running   0          36s
payment-api-7d9c4b-m4czl   1/1     Running   0          33s
What just happened?
Rollback in ~30 seconds — The rollback used the same RollingUpdate mechanism as a forward deploy. New Pods (running v3.1.1) were created and made ready before the v3.2.0 Pods were terminated. At no point were there zero healthy Pods — the Service kept routing traffic throughout. Total user-visible downtime: near zero.
Revision 4 became revision 6 — After the rollback, the history shows revision 4 is gone and a new revision 6 has appeared with the same content. When Kubernetes rolls back to a previous revision, it creates a new revision entry at the end of the history — it doesn't time-travel. This is important: rolling back to revision 4 doesn't mean you're "at" revision 4. You're at revision 6, which has the same Pod template as revision 4 did.
ReplicaSet hash changed — The new Pods have a different ReplicaSet hash in their names (7d9c4b) than the v3.2.0 Pods had. Rolling back reactivated the v3.1.1 ReplicaSet and created a fresh set of Pods from it; because the pod-template-hash is computed from the Pod template, these Pods carry the same hash the original v3.1.1 Pods had, not a brand-new one. Either way, you get clean, newly started Pods with no baggage from the bad deploy.
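Because kubectl rollout status exits 0 on success and non-zero on failure or timeout, CI/CD pipelines can gate on it and trigger an automatic rollback. A minimal sketch of the pattern; check_rollout here is a stand-in for the real kubectl call, which needs a live cluster:

```shell
# Stand-in for: kubectl rollout status deployment/payment-api -n production --timeout=120s
# Returns 0 when the rollout completed, non-zero on failure or timeout
check_rollout() {
  return 0
}

# Gate the pipeline on the exit code of the rollout check
if check_rollout; then
  echo "rollout ok"
else
  echo "rollout failed, rolling back"
  # kubectl rollout undo deployment/payment-api -n production
fi
```

The `--timeout` flag on rollout status is what turns a stuck rollout into a non-zero exit code instead of an indefinitely blocked pipeline.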
The Rollback Lifecycle: What Kubernetes Does Internally
Understanding what happens inside the cluster during a rollback helps you interpret the output and troubleshoot if something goes wrong:
Rollback Internals — What Kubernetes Does
1. kubectl rollout undo → the Deployment controller copies the Pod template from the target revision's ReplicaSet back into the Deployment spec.
2. The controller looks for a ReplicaSet matching that template and finds the old one (here ReplicaSet-v3.1.1, still present and scaled to zero), so it reactivates it instead of creating a new one.
3. A normal rolling update runs: the v3.1.1 ReplicaSet scales up while the v3.2.0 ReplicaSet scales down, respecting maxSurge and maxUnavailable.
4. The restored revision is renumbered: its old history entry disappears and a new entry is appended at the end (revision 4 becomes revision 6 in our example).
Pausing and Resuming a Rollout
Sometimes you want to deploy a new version but not push it all the way out immediately. You can let the rollout update a single Pod, watch it for a few minutes, and only proceed if everything looks healthy. Kubernetes lets you pause a rollout mid-way and resume it when you're satisfied.
The scenario: Your team is deploying a major refactor of the checkout service. It passed all staging tests, but the team wants to watch one production Pod for 15 minutes before rolling it out to all replicas. If anything looks wrong, you rollback before the rest of the fleet is affected.
kubectl set image deployment/checkout-api \
checkout-api=company/checkout-api:2.4.0 \
-n production
# set image: start the rolling update to the new version
# Only the first new Pod(s) exist at first (1 here, based on maxSurge)
kubectl rollout pause deployment/checkout-api -n production
# pause: freeze the rollout mid-way
# Run this immediately after set image, while the rollout is in progress
# Any existing Pods that have already been updated stay updated
# No further Pods are replaced until you resume
# Order matters: a Deployment paused BEFORE the image change ignores the
# change entirely and creates no new Pods at all
# Useful for canary-style validation: update 1 Pod, watch it, then proceed
kubectl get pods -n production -l app=checkout-api
# Verify the mixed state: two Pods from the old ReplicaSet, one from the new
# The differing pod-template-hash in the Pod names tells the versions apart
# (use -o jsonpath or -o custom-columns if you want to print the image itself)
kubectl rollout resume deployment/checkout-api -n production
# resume: unpause — Kubernetes continues the rollout to all remaining Pods
kubectl rollout undo deployment/checkout-api -n production
# If you saw problems during the pause window — abort and roll back
# Note: resume first; kubectl refuses to roll back a paused Deployment
# Only the 1 updated Pod gets rolled back — minimal blast radius
$ kubectl set image deployment/checkout-api checkout-api=company/checkout-api:2.4.0 -n production
deployment.apps/checkout-api image updated

$ kubectl rollout pause deployment/checkout-api -n production
deployment.apps/checkout-api paused

$ kubectl get pods -n production -l app=checkout-api
NAME                        READY   STATUS    RESTARTS   AGE
checkout-api-6f8b9d-2xkpj   1/1     Running   0          14d
checkout-api-6f8b9d-7rvqn   1/1     Running   0          14d
checkout-api-9c4b2f-m8nzx   1/1     Running   0          45s

(The 9c4b2f Pod is the v2.4.0 canary. 15 minutes pass, metrics look good, no errors on the new Pod.)

$ kubectl rollout resume deployment/checkout-api -n production
deployment.apps/checkout-api resumed

$ kubectl rollout status deployment/checkout-api -n production
Waiting for deployment "checkout-api" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "checkout-api" successfully rolled out
What just happened?
Pause creates a manual canary — By pausing after updating the image, you created a manual canary deployment: 1 Pod on the new version, 2 Pods on the old version. Real production traffic hits all three Pods. If the new Pod shows elevated error rates in your monitoring dashboard, you catch the problem at 33% blast radius instead of 100%.
resume vs undo — After the observation window, you have two options. resume rolls the new version out to the remaining Pods. undo reverses the single updated Pod back to the old version (resume the rollout first; kubectl refuses to roll back a paused Deployment). The undo here only affects the one Pod that was updated during the pause, a much smaller rollback than undoing a full 100% rollout.
Note on kubectl describe during pause — A paused Deployment shows a Progressing condition with status Unknown and reason DeploymentPaused in kubectl describe deployment. This is not an error; it's the expected state and means exactly what it says.
The Rollback Decision Tree
When an incident hits post-deploy, here's the decision framework used by experienced SREs:
Root cause identified, and the fix is a trivial roll-forward (a one-line config change, a known-good image tag)? → kubectl set image with the patched version. Don't introduce a second rollback into a live incident unless necessary.
Root cause unknown, or the fix needs real work? → kubectl rollout undo to last known good. Restore stability first, investigate cause second.
Teacher's Note: The Git drift problem after rollbacks
The most common mistake after a rollback is forgetting to update Git. You roll back with kubectl rollout undo, the cluster is back on v3.1.1, but the Git repo still has v3.2.0 in the Deployment manifest. The next time anyone runs kubectl apply from the repo — a routine pipeline run, a different engineer applying a config change — it silently re-deploys the broken v3.2.0. You've just re-introduced the incident that you just recovered from.
After every rollback, one of the first tasks in your post-incident checklist should be: update the Deployment manifest in Git to match the rolled-back state and merge it. Block the pipeline from running until that commit is merged.
One more tip: kubectl diff -f deployment.yaml before every kubectl apply in production. If the diff shows you're about to deploy something different from what's running, stop and understand why before proceeding. It takes 10 seconds and has prevented countless accidental re-deployments of bad versions.
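The drift check can also be scripted as a pipeline guard. A minimal sketch with the image strings inlined; in a real script the first value would come from the Git manifest (via grep or yq) and the second from the cluster via kubectl, as the comments indicate:

```shell
# Illustrative values — a real script would read them from the manifest and cluster:
#   MANIFEST_IMAGE=$(yq '.spec.template.spec.containers[0].image' deployment.yaml)
#   RUNNING_IMAGE=$(kubectl get deploy/payment-api -n production \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
MANIFEST_IMAGE="company/payment-api:3.2.0"
RUNNING_IMAGE="company/payment-api:3.1.1"

# Flag drift between what Git says and what the cluster runs
if [ "$MANIFEST_IMAGE" != "$RUNNING_IMAGE" ]; then
  echo "DRIFT: Git has $MANIFEST_IMAGE but cluster runs $RUNNING_IMAGE"
else
  echo "in sync"
fi
```

Run as a pre-apply pipeline step, a non-empty DRIFT message is your cue to stop and reconcile Git with the rolled-back state before deploying anything.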
Practice Questions
1. Write the kubectl command to roll back the payment-api Deployment in the production namespace to revision 4.
2. What annotation key do you set on a Deployment to populate the CHANGE-CAUSE column in kubectl rollout history?
3. What Deployment spec field controls how many old ReplicaSets (and therefore how many rollback revisions) Kubernetes keeps? What is the default value?
Quiz
1. You have revisions 1–5 in your rollout history. You roll back to revision 4. What does the rollout history show afterwards?
2. You want to deploy a risky new version to just one Pod in production, observe it for 15 minutes, and only proceed if it looks healthy. What is the correct approach?
3. You roll back a Deployment from v3.2.0 to v3.1.1 using kubectl rollout undo but forget to update the Deployment manifest in Git. What is the danger?
Up Next · Lesson 27
Kubernetes Volumes
Why container storage disappears when Pods restart — and the volume types that give your applications persistent, shareable, and durable storage.