Kubernetes Course
Rollbacks
You pushed a bad deployment. Error rates are spiking. Users are hitting 500s. The fastest path back to stability is a rollback — and in Kubernetes, the whole thing takes one command and about 30 seconds. This lesson covers how rollback history works, how to execute a rollback under pressure, and how to avoid the mistakes that make rollbacks fail when you need them most.
How Kubernetes Remembers Your Deployments
Every time you change a Deployment's Pod template — update the image, modify env vars, change resource limits — Kubernetes saves the previous state as a revision. Each revision corresponds to a ReplicaSet. Rolling back simply means telling Kubernetes to make a previous ReplicaSet the active one again.
The key insight: the old ReplicaSet is never deleted when you roll forward. It's kept around with its Pod template intact, scaled to zero. A rollback just scales the old ReplicaSet back up and the current one down to zero, using the same rolling update mechanism as a forward deploy. Healthy Pods from the previous revision come online before the broken Pods are terminated.
revisionHistoryLimit controls how many revisions are kept
By default, Kubernetes keeps the last 10 revisions for every Deployment. This is controlled by the revisionHistoryLimit field in the Deployment spec. Keep it at 10 for production. If you set it to 0, you lose the ability to roll back entirely — which is never worth the tiny etcd space saving.
Viewing Rollout History
The scenario: Your team has been actively deploying the payment API over the past two weeks. You're now investigating a production incident and need to understand exactly what has changed across the last several deployments — which image versions were deployed, when, and what changed.
kubectl rollout history deployment/payment-api -n production
# history: list all saved revisions for a Deployment
# Shows REVISION number and CHANGE-CAUSE (if annotated — more on this below)
# REVISION 1 = first ever deploy, higher numbers = more recent
kubectl rollout history deployment/payment-api --revision=3 -n production
# --revision=N: inspect a specific revision in detail
# Shows the full Pod template that was active at that revision:
# image, env vars, resource limits, labels — everything
$ kubectl rollout history deployment/payment-api -n production
deployment.apps/payment-api
REVISION CHANGE-CAUSE
1 <none>
2 <none>
3 <none>
4 <none>
5 <none>
$ kubectl rollout history deployment/payment-api --revision=4 -n production
deployment.apps/payment-api with revision #4
Pod Template:
Labels: app=payment-api
version=3.1.0
Containers:
payment-api:
Image: company/payment-api:3.1.0
Port: 8080/TCP
Limits: cpu: 500m, memory: 350Mi
Requests: cpu: 150m, memory: 200Mi
Environment:
APP_ENV: production
LOG_LEVEL: info
$ kubectl rollout history deployment/payment-api --revision=5 -n production
deployment.apps/payment-api with revision #5
Pod Template:
Labels: app=payment-api
version=3.2.0
Containers:
payment-api:
Image: company/payment-api:3.2.0
Port: 8080/TCP
What just happened?
CHANGE-CAUSE: <none> — This is the most common and most frustrating rollout history output. Every revision shows <none> because nobody set the kubernetes.io/change-cause annotation. You can see what image changed by inspecting each revision individually, but you can't immediately tell why the deployment happened. We'll fix this next.
--revision=N for diff-style debugging — By inspecting revision 4 and 5 individually, you can manually diff them. Revision 4 ran company/payment-api:3.1.0 and revision 5 runs company/payment-api:3.2.0. If 3.2.0 is the bad deploy, you know exactly which revision to roll back to.
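Under incident pressure it helps to script that manual diff. A minimal sketch, assuming you have saved each revision dump to a file; the inlined one-line dumps below are stand-ins for the real `kubectl rollout history --revision=N` output, which needs a live cluster:

```shell
# Stand-in dumps — in practice, redirect the real command output, e.g.:
#   kubectl rollout history deployment/payment-api --revision=4 -n production > /tmp/rev4.txt
#   kubectl rollout history deployment/payment-api --revision=5 -n production > /tmp/rev5.txt
printf 'Image: company/payment-api:3.1.0\n' > /tmp/rev4.txt
printf 'Image: company/payment-api:3.2.0\n' > /tmp/rev5.txt

# diff the two Pod templates; diff exits non-zero when they differ,
# so `|| true` keeps a pipeline script from aborting on a difference
diff /tmp/rev4.txt /tmp/rev5.txt || true
```

The diff output pinpoints exactly which fields changed between the two revisions, so you know precisely what a rollback will revert.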
Making History Useful: CHANGE-CAUSE Annotations
A rollout history full of <none> entries is almost useless under incident pressure. The kubernetes.io/change-cause annotation populates the CHANGE-CAUSE column and turns your rollout history into a readable changelog. There are two ways to set it.
The scenario: Your team is adopting a standard practice — every deployment must have a human-readable change cause so that during incidents, anyone can read the history and understand what changed and why without hunting through Slack or Jira.
kubectl annotate deployment/payment-api \
kubernetes.io/change-cause="Deploy v3.2.0: adds Apple Pay support — PR #441" \
-n production
# annotate: add or update an annotation on an existing object
# kubernetes.io/change-cause: the special annotation that populates the CHANGE-CAUSE column
# Set this AFTER kubectl apply — or bake it into your CI/CD pipeline
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-api
namespace: production
annotations:
kubernetes.io/change-cause: "Deploy v3.2.0: adds Apple Pay support — PR #441"
# Bake the change-cause into the manifest itself
# Every kubectl apply will record this as the CHANGE-CAUSE for the new revision
# CI/CD pipelines can inject the value dynamically when templating the manifest,
# e.g. kubernetes.io/change-cause: "Deploy ${IMAGE_TAG} by ${USER}"
spec:
revisionHistoryLimit: 10 # Keep the last 10 revisions — never set to 0
replicas: 3
selector:
matchLabels:
app: payment-api
template:
metadata:
labels:
app: payment-api
version: "3.2.0"
spec:
containers:
- name: payment-api
image: company/payment-api:3.2.0
ports:
- containerPort: 8080
resources:
requests:
cpu: "150m"
memory: "200Mi"
limits:
cpu: "500m"
memory: "350Mi"
$ kubectl apply -f payment-api-deployment.yaml
deployment.apps/payment-api configured

$ kubectl rollout history deployment/payment-api -n production
deployment.apps/payment-api
REVISION  CHANGE-CAUSE
1         Deploy v3.0.0: initial release — PR #388
2         Deploy v3.0.1: fix session timeout bug — PR #401
3         Deploy v3.1.0: adds card tokenisation — PR #419
4         Deploy v3.1.1: hotfix payment retry logic — PR #437
5         Deploy v3.2.0: adds Apple Pay support — PR #441
What just happened?
Readable rollout history — Now the history tells a story. At 3am with alerts firing, you can read the CHANGE-CAUSE column and immediately see that revision 5 added Apple Pay support. If errors started after that deploy, you know exactly which revision to roll back to without opening a single Jira ticket.
CI/CD pipeline integration — The best practice is to have your deployment pipeline inject the change cause dynamically: commit SHA, PR number, deployer's name, and timestamp. Something like "Deploy 3.2.0 by alice@company.com from PR #441 at 2025-03-14T10:22Z". With this, rollout history becomes a complete audit trail at no extra effort cost.
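A sketch of how a pipeline step might assemble that string. The variable names (IMAGE_TAG, DEPLOYER, PR_NUMBER) are illustrative, not real kubectl or CI built-ins; your CI system supplies the real values:

```shell
# Hypothetical pipeline variables — your CI system provides the real values
IMAGE_TAG="3.2.0"
DEPLOYER="alice@company.com"
PR_NUMBER="441"
STAMP="2025-03-14T10:22Z"   # in a real pipeline: $(date -u +%Y-%m-%dT%H:%MZ)

# Assemble the audit-trail string for the CHANGE-CAUSE column
CHANGE_CAUSE="Deploy ${IMAGE_TAG} by ${DEPLOYER} from PR #${PR_NUMBER} at ${STAMP}"
echo "$CHANGE_CAUSE"

# The pipeline would then run, after kubectl apply:
#   kubectl annotate deployment/payment-api \
#     kubernetes.io/change-cause="$CHANGE_CAUSE" -n production --overwrite
```

The `--overwrite` flag matters in a pipeline: without it, kubectl annotate refuses to replace an annotation that already exists.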
Executing a Rollback
The scenario: It's 10:31 AM. Payment API v3.2.0 deployed 8 minutes ago. Error rates jumped from 0.1% to 14% immediately after the deploy. Payments are failing. You have confirmed in the logs it's related to the Apple Pay integration — a dependency on an external Apple API that isn't available in your production environment yet. You need to roll back to v3.1.1 immediately.
kubectl rollout undo deployment/payment-api -n production
# undo: roll back to the PREVIOUS revision (one step back)
# This is the fastest path — no revision number needed
# Kubernetes reverses the last rolling update using the same rolling mechanism
# The previous ReplicaSet scales up while the current one scales down
kubectl rollout undo deployment/payment-api --to-revision=4 -n production
# --to-revision=N: roll back to a SPECIFIC revision
# Use this when the previous revision is also bad and you need to go further back
# Revision 4 = v3.1.1 (from our history above) — the last known good state
kubectl rollout status deployment/payment-api -n production
# Monitor the rollback in real time — blocks until complete
# Shows which replica set is scaling up/down
# Exit 0 = rollback complete, all Pods on previous version, all Ready
$ kubectl rollout undo deployment/payment-api --to-revision=4 -n production
deployment.apps/payment-api rolled back

$ kubectl rollout status deployment/payment-api -n production
Waiting for deployment "payment-api" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "payment-api" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "payment-api" rollout to finish: 1 old replicas are pending termination...
deployment "payment-api" successfully rolled out

$ kubectl rollout history deployment/payment-api -n production
deployment.apps/payment-api
REVISION  CHANGE-CAUSE
1         Deploy v3.0.0: initial release — PR #388
2         Deploy v3.0.1: fix session timeout bug — PR #401
3         Deploy v3.1.0: adds card tokenisation — PR #419
5         Deploy v3.2.0: adds Apple Pay support — PR #441
6         Deploy v3.1.1: hotfix payment retry logic — PR #437

$ kubectl get pods -n production -l app=payment-api
NAME                       READY   STATUS    RESTARTS   AGE
payment-api-7d9c4b-2xkpj   1/1     Running   0          38s
payment-api-7d9c4b-8rvnq   1/1     Running   0          36s
payment-api-7d9c4b-m4czl   1/1     Running   0          33s
What just happened?
Rollback in ~30 seconds — The rollback used the same RollingUpdate mechanism as a forward deploy. New Pods (running v3.1.1) were created and made ready before the v3.2.0 Pods were terminated. At no point were there zero healthy Pods — the Service kept routing traffic throughout. Total user-visible downtime: near zero.
Revision 4 became revision 6 — After the rollback, the history shows revision 4 is gone and a new revision 6 has appeared with the same content. When Kubernetes rolls back to a previous revision, it creates a new revision entry at the end of the history — it doesn't time-travel. This is important: rolling back to revision 4 doesn't mean you're "at" revision 4. You're at revision 6, which has the same Pod template as revision 4 did.
ReplicaSet hash changed — The new Pods have a different ReplicaSet hash in their names (7d9c4b) than the v3.2.0 Pods had. Rolling back reactivated the v3.1.1 ReplicaSet and created a fresh set of Pods from it; because the pod-template-hash is computed from the Pod template, these Pods carry the same hash the original v3.1.1 Pods had, not a brand-new one. Either way, you get clean, newly started Pods with no baggage from the bad deploy.
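Because kubectl rollout status exits 0 on success and non-zero on failure or timeout, CI/CD pipelines can gate on it and trigger an automatic rollback. A minimal sketch of the pattern; check_rollout here is a stand-in for the real kubectl call, which needs a live cluster:

```shell
# Stand-in for: kubectl rollout status deployment/payment-api -n production --timeout=120s
# Returns 0 when the rollout completed, non-zero on failure or timeout
check_rollout() {
  return 0
}

# Gate the pipeline on the exit code of the rollout check
if check_rollout; then
  echo "rollout ok"
else
  echo "rollout failed, rolling back"
  # kubectl rollout undo deployment/payment-api -n production
fi
```

The `--timeout` flag on rollout status is what turns a stuck rollout into a non-zero exit code instead of an indefinitely blocked pipeline.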
The Rollback Lifecycle: What Kubernetes Does Internally
Understanding what happens inside the cluster during a rollback helps you interpret the output and troubleshoot if something goes wrong:
Rollback Internals — What Kubernetes Does
1. kubectl rollout undo → the Deployment controller copies the Pod template from the target revision's ReplicaSet back into the Deployment spec.
2. The controller looks for a ReplicaSet matching that template and finds the old one (here ReplicaSet-v3.1.1, still present and scaled to zero), so it reactivates it instead of creating a new one.
3. A normal rolling update runs: the v3.1.1 ReplicaSet scales up while the v3.2.0 ReplicaSet scales down, respecting maxSurge and maxUnavailable.
4. The restored revision is renumbered: its old history entry disappears and a new entry is appended at the end (revision 4 becomes revision 6 in our example).
Pausing and Resuming a Rollout
Sometimes you want to deploy a new version but not push it all the way out immediately. You can let the rollout update a single Pod, watch it for a few minutes, and only proceed if everything looks healthy. Kubernetes lets you pause a rollout mid-way and resume it when you're satisfied.
The scenario: Your team is deploying a major refactor of the checkout service. It passed all staging tests, but the team wants to watch one production Pod for 15 minutes before rolling it out to all replicas. If anything looks wrong, you rollback before the rest of the fleet is affected.
kubectl set image deployment/checkout-api \
checkout-api=company/checkout-api:2.4.0 \
-n production
# set image: start the rolling update to the new version
# Only the first new Pod(s) exist at first (1 here, based on maxSurge)
kubectl rollout pause deployment/checkout-api -n production
# pause: freeze the rollout mid-way
# Run this immediately after set image, while the rollout is in progress
# Any existing Pods that have already been updated stay updated
# No further Pods are replaced until you resume
# Order matters: a Deployment paused BEFORE the image change ignores the
# change entirely and creates no new Pods at all
# Useful for canary-style validation: update 1 Pod, watch it, then proceed
kubectl get pods -n production -l app=checkout-api
# Verify the mixed state: two Pods from the old ReplicaSet, one from the new
# The differing pod-template-hash in the Pod names tells the versions apart
# (use -o jsonpath or -o custom-columns if you want to print the image itself)
kubectl rollout resume deployment/checkout-api -n production
# resume: unpause — Kubernetes continues the rollout to all remaining Pods
kubectl rollout undo deployment/checkout-api -n production
# If you saw problems during the pause window — abort and roll back
# Note: resume first; kubectl refuses to roll back a paused Deployment
# Only the 1 updated Pod gets rolled back — minimal blast radius
$ kubectl set image deployment/checkout-api checkout-api=company/checkout-api:2.4.0 -n production
deployment.apps/checkout-api image updated

$ kubectl rollout pause deployment/checkout-api -n production
deployment.apps/checkout-api paused

$ kubectl get pods -n production -l app=checkout-api
NAME                        READY   STATUS    RESTARTS   AGE
checkout-api-6f8b9d-2xkpj   1/1     Running   0          14d
checkout-api-6f8b9d-7rvqn   1/1     Running   0          14d
checkout-api-9c4b2f-m8nzx   1/1     Running   0          45s

(The 9c4b2f Pod is the v2.4.0 canary. 15 minutes pass, metrics look good, no errors on the new Pod.)

$ kubectl rollout resume deployment/checkout-api -n production
deployment.apps/checkout-api resumed

$ kubectl rollout status deployment/checkout-api -n production
Waiting for deployment "checkout-api" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "checkout-api" successfully rolled out
What just happened?
Pause creates a manual canary — By pausing after updating the image, you created a manual canary deployment: 1 Pod on the new version, 2 Pods on the old version. Real production traffic hits all three Pods. If the new Pod shows elevated error rates in your monitoring dashboard, you catch the problem at 33% blast radius instead of 100%.
resume vs undo — After the observation window, you have two options. resume rolls the new version out to the remaining Pods. undo reverses the single updated Pod back to the old version (resume the rollout first; kubectl refuses to roll back a paused Deployment). The undo here only affects the one Pod that was updated during the pause, a much smaller rollback than undoing a full 100% rollout.
Note on kubectl describe during pause — A paused Deployment shows a Progressing condition with status Unknown and reason DeploymentPaused in kubectl describe deployment. This is not an error; it's the expected state and means exactly what it says.
The Rollback Decision Tree
When an incident hits post-deploy, here's the decision framework used by experienced SREs:
Root cause identified, and the fix is a trivial roll-forward (a one-line config change, a known-good image tag)? → kubectl set image with the patched version. Don't introduce a second rollback into a live incident unless necessary.
Root cause unknown, or the fix needs real work? → kubectl rollout undo to last known good. Restore stability first, investigate cause second.
Teacher's Note: The Git drift problem after rollbacks
The most common mistake after a rollback is forgetting to update Git. You roll back with kubectl rollout undo, the cluster is back on v3.1.1, but the Git repo still has v3.2.0 in the Deployment manifest. The next time anyone runs kubectl apply from the repo — a routine pipeline run, a different engineer applying a config change — it silently re-deploys the broken v3.2.0. You've just re-introduced the incident that you just recovered from.
After every rollback, one of the first tasks in your post-incident checklist should be: update the Deployment manifest in Git to match the rolled-back state and merge it. Block the pipeline from running until that commit is merged.
One more tip: kubectl diff -f deployment.yaml before every kubectl apply in production. If the diff shows you're about to deploy something different from what's running, stop and understand why before proceeding. It takes 10 seconds and has prevented countless accidental re-deployments of bad versions.
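The drift check can also be scripted as a pipeline guard. A minimal sketch with the image strings inlined; in a real script the first value would come from the Git manifest (via grep or yq) and the second from the cluster via kubectl, as the comments indicate:

```shell
# Illustrative values — a real script would read them from the manifest and cluster:
#   MANIFEST_IMAGE=$(yq '.spec.template.spec.containers[0].image' deployment.yaml)
#   RUNNING_IMAGE=$(kubectl get deploy/payment-api -n production \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
MANIFEST_IMAGE="company/payment-api:3.2.0"
RUNNING_IMAGE="company/payment-api:3.1.1"

# Flag drift between what Git says and what the cluster runs
if [ "$MANIFEST_IMAGE" != "$RUNNING_IMAGE" ]; then
  echo "DRIFT: Git has $MANIFEST_IMAGE but cluster runs $RUNNING_IMAGE"
else
  echo "in sync"
fi
```

Run as a pre-apply pipeline step, a non-empty DRIFT message is your cue to stop and reconcile Git with the rolled-back state before deploying anything.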
Practice Questions
1. Write the kubectl command to roll back the payment-api Deployment in the production namespace to revision 4.
2. What annotation key do you set on a Deployment to populate the CHANGE-CAUSE column in kubectl rollout history?
3. What Deployment spec field controls how many old ReplicaSets (and therefore how many rollback revisions) Kubernetes keeps? What is the default value?
Quiz
1. You have revisions 1–5 in your rollout history. You roll back to revision 4. What does the rollout history show afterwards?
2. You want to deploy a risky new version to just one Pod in production, observe it for 15 minutes, and only proceed if it looks healthy. What is the correct approach?
3. You roll back a Deployment from v3.2.0 to v3.1.1 using kubectl rollout undo but forget to update the Deployment manifest in Git. What is the danger?
Up Next · Lesson 27
Kubernetes Volumes
Why container storage disappears when Pods restart — and the volume types that give your applications persistent, shareable, and durable storage.