CI/CD Course
Rollback Strategies
In this lesson
A rollback is the act of returning a production environment to a previously known-good state after a deployment introduces a problem. It is not a failure of CI/CD — it is evidence that the safety net works. Every team that deploys software to production will eventually need to roll back. The question is not whether rollback will be needed, but whether the team has a defined, practiced, and fast procedure for doing it when the moment comes. A rollback that takes four hours under pressure is qualitatively different from one that takes four minutes and runs automatically.
Rollback vs Roll Forward — Choosing the Right Response
Not every production problem calls for a rollback. The first decision after detecting an incident is whether to roll back to the previous version or roll forward with a fix. Each approach has a different risk profile and a different time cost, and the right choice depends on the nature and severity of the problem.
Rollback vs Roll Forward — Decision Guide
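A minimal sketch of how such a decision might be encoded, written in Python. The Incident fields and the rules are illustrative assumptions based on the scenarios in this lesson's practice questions and quiz, not a definitive policy; a real team would tune them to its own risk tolerance.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical summary of a production incident (every field is an assumption)."""
    cause_known: bool          # has the root cause been identified?
    fix_is_small: bool         # e.g. a one-line or single-file change
    severe: bool               # e.g. an error-rate spike or outage
    rollback_loses_data: bool  # would rolling back reverse a migration or drop data?


def choose_response(incident: Incident) -> str:
    """Return 'roll forward' or 'roll back' using simple illustrative rules."""
    quick_fix = incident.cause_known and incident.fix_is_small
    # Roll forward only when a quick, well-understood fix exists AND either
    # rolling back would cost data or the impact is minor enough to wait for the fix.
    if quick_fix and (incident.rollback_loses_data or not incident.severe):
        return "roll forward"
    # Otherwise restore the known-good version first and diagnose afterwards.
    return "roll back"
```

For example, a severe incident with an unknown cause yields "roll back", while a minor bug with a one-line fix and a migration that cannot be reversed without data loss yields "roll forward".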
The Emergency Exit Analogy
A fire exit is not a sign that a building is poorly designed; it is a sign that the designers understood that emergencies happen and planned for them. A fire exit that is locked, blocked, or leads somewhere unexpected is worse than having none at all, because people will count on it at the moment it fails. Rollback is the emergency exit for a deployment. The test is not whether you ever need it; the test is whether it opens quickly, leads where it should, and has been tested recently enough that everyone knows how to use it.
Artifact-Based Rollback — The Fastest Path to Recovery
The most reliable rollback mechanism in a CI/CD pipeline is artifact-based rollback: redeploying a specific previous artifact directly from the registry, identified by its commit SHA or version tag. Because the previous artifact is already built, tested, and stored — and because the deployment process is automated and scripted — the rollback is essentially a re-run of the deploy job with a different artifact reference. No code changes required, no build step, no pipeline from scratch.
This is why artifact immutability and retention policies (covered in Lesson 14) are not administrative details — they are prerequisites for fast rollback. If previous artifacts are overwritten or deleted on a short retention schedule, the option to roll back to a specific version disappears. The artifact must still exist in the registry, under its original tag, in exactly the state it was when it was deployed.
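To make the dependency on retention concrete, here is a small Python sketch. It assumes a hypothetical deployment history (a list of SHAs, oldest first) and a set of tags currently present in the registry; neither structure comes from a specific tool. The only name taken from this lesson is the deploy.sh script, and its argument order is assumed to match the workflow shown later.

```python
from typing import List, Optional, Sequence, Set


def pick_rollback_target(history: Sequence[str], current: str,
                         registry_tags: Set[str]) -> Optional[str]:
    """Return the most recent previously deployed SHA that still exists in the registry.

    history holds the SHAs of successful deployments, ordered oldest to newest.
    """
    for sha in reversed(history):
        if sha != current and sha in registry_tags:
            return sha
    return None  # every earlier artifact has been deleted or overwritten


def rollback_command(environment: str, sha: str) -> List[str]:
    """Build the same deploy command the pipeline uses, pointed at an older artifact."""
    return ["./deploy.sh", environment, sha]
```

If the retention policy has already removed the immediately preceding artifact, pick_rollback_target falls back to an older one, and returns None when nothing is left to roll back to.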
Database Migrations — The Rollback Complication
Application code is easy to roll back. Database schema changes are not. A database migration that adds a column, renames a table, or drops a field changes the persistent state of the system in ways that may be irreversible — or at least expensive to reverse. Rolling back the application code while leaving the migrated schema in place can produce a mismatch that causes the previous version to fail immediately.
The industry-standard approach to making migrations rollback-safe is the expand-contract pattern. Instead of making a breaking schema change in a single migration, the change is split across multiple deployments. First, the new column or table is added alongside the old one (expand). The application is then updated to write to both, existing data is backfilled into the new structure, and reads are moved over to it. After several deployment cycles of confirmed stability, the old column is removed (contract). At any point during the expand phase, rolling back the application code is safe because the old schema is still present and still being written to. By the time the contract migration runs, the team is confident no rollback to the old version is needed.
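The expand phase can be illustrated with a short Python sketch using the standard-library sqlite3 module and an in-memory database. The users table and the fullname and display_name columns are made up for the example; the point is only that the previous application version keeps working against the expanded schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column next to the old one. It is nullable, so code
# that does not know about it can still insert rows.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
# Backfill existing rows into the new column.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def insert_user_new_version(name: str) -> None:
    """New application version: writes to both the old and the new column."""
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


def read_names_old_version() -> list:
    """Previous application version: only knows about the old column."""
    return [row[0] for row in conn.execute("SELECT fullname FROM users ORDER BY id")]


insert_user_new_version("Grace Hopper")
# Simulated rollback: the old code path still sees every row.
old_view = read_names_old_version()

# Contract (a later deployment, once rollback to the old version is no longer expected):
# conn.execute("ALTER TABLE users DROP COLUMN fullname")  # needs SQLite 3.35 or newer
```

Because the new version writes to both columns, the rows it creates remain readable by the old version, which is what makes a code-only rollback safe during this phase.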
Automated Rollback — Making Recovery Instant
Automated rollback triggers a redeployment of the previous artifact without human intervention, based on a defined failure signal — a failed smoke test, a spike in error rate from the observability platform, a health check that stops returning 200. The pipeline (or a monitoring integration) detects the signal, identifies the last known-good artifact, and deploys it immediately.
Automated rollback is the gold standard for production stability, but it requires two preconditions: the failure signal must be reliable enough that false positives do not trigger unnecessary rollbacks, and the rollback itself must be tested regularly so the team knows it works when needed. A rollback procedure that has never been tested in production is a procedure the team cannot trust in an incident. Tooling support varies: AWS ECS offers a deployment circuit breaker that can roll back a failed deployment automatically, Argo Rollouts can abort a rollout and return to the stable version based on metric analysis, and Kubernetes Deployments keep a revision history that kubectl rollout undo can restore, although a plain Deployment does not roll itself back.
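One common way to reduce false positives is to require the failure signal to persist across several consecutive samples before acting. The Python sketch below assumes the monitoring system supplies a list of recent error rates as fractions; the 5% threshold and the three-sample window are arbitrary placeholder values, not recommendations.

```python
from typing import Sequence


def should_roll_back(error_rates: Sequence[float],
                     threshold: float = 0.05,
                     consecutive: int = 3) -> bool:
    """Return True only when the latest `consecutive` samples all exceed `threshold`.

    A single noisy sample is ignored, which guards against rolling back
    on a transient blip.
    """
    if consecutive <= 0 or len(error_rates) < consecutive:
        return False  # not enough evidence yet
    return all(rate > threshold for rate in error_rates[-consecutive:])
```

A pipeline or monitoring hook could call this after each new sample and trigger the redeploy of the last known-good artifact only when it returns True.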
Automated Rollback on Smoke Test Failure — GitHub Actions
jobs:
  deploy-production:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy new artifact
        run: ./deploy.sh production ${{ github.sha }}
      - name: Run smoke tests
        id: smoke
        run: ./smoke-test.sh https://myapp.com
        continue-on-error: true  # Don't fail the job yet; check the result first
      - name: Rollback on smoke test failure
        if: steps.smoke.outcome == 'failure'
        run: |
          echo "Smoke tests failed, rolling back to ${{ vars.LAST_GOOD_SHA }}"
          ./deploy.sh production ${{ vars.LAST_GOOD_SHA }}  # Redeploy previous known-good artifact
          ./smoke-test.sh https://myapp.com  # Verify rollback succeeded
      - name: Fail the job if smoke tests failed
        if: steps.smoke.outcome == 'failure'
        run: exit 1  # Mark the workflow as failed after rollback completes
What just happened?
The deploy job attempts the new deployment and runs smoke tests. If they fail, it immediately redeploys the previous known-good artifact — stored by SHA in a repository variable — and verifies that the rollback itself succeeded. The job is then marked as failed so the team is alerted, but production has already been restored before the failure notification is sent.
Warning: A Rollback Plan That Has Never Been Tested Is Not a Rollback Plan
Writing a rollback script and never running it is the equivalent of installing a fire extinguisher and never checking whether it is charged. The worst moment to discover that your rollback procedure fails — because the previous artifact was deleted, the deploy script has a bug, or the database migration is non-reversible — is during a production incident at 2am with users affected and a manager on the call. Rollback procedures must be tested in production on a schedule: triggered deliberately, verified to work, and documented. Teams that do this regularly find the procedure boring; teams that do not find it catastrophic.
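A scheduled drill can be as simple as the following Python sketch: deliberately deploy the previous version, verify it, restore the current version, and keep the result as a record. The deploy and smoke_test callables are placeholders for whatever the team's real scripts do; their signatures are assumptions made for this example.

```python
from typing import Callable, Dict


def run_rollback_drill(deploy: Callable[[str], None],
                       smoke_test: Callable[[], bool],
                       current_sha: str,
                       previous_sha: str) -> Dict[str, object]:
    """Roll back on purpose, verify, then return to the current version."""
    report: Dict[str, object] = {
        "rolled_back_to": previous_sha,
        "rollback_ok": False,
        "restored_ok": False,
    }
    deploy(previous_sha)                  # trigger the rollback deliberately
    report["rollback_ok"] = smoke_test()  # verify the rollback works
    deploy(current_sha)                   # put the current version back
    report["restored_ok"] = smoke_test()  # verify the restore as well
    return report                         # keep this as the documented result
```

Storing each report alongside its date gives the team evidence of when the procedure last worked.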
Key Takeaways from This Lesson
Teacher's Note
Store the last known-good deployment SHA somewhere the pipeline can read it automatically — a repository variable, a parameter store entry, or a deployment record — so a rollback never requires a human to remember or look up what version was running before.
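As one possible shape for such a deployment record, the Python sketch below keeps the last known-good SHA in a small JSON file. The file name, the last_good_sha key, and the example SHA are all hypothetical; a repository variable or parameter store entry would serve the same purpose.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def record_good_deploy(record_path: Path, sha: str) -> None:
    """Call after smoke tests pass: remember the SHA now running as known-good."""
    record_path.write_text(json.dumps({"last_good_sha": sha}))


def last_good_sha(record_path: Path) -> Optional[str]:
    """Return the recorded SHA, or None if nothing has been recorded yet."""
    if not record_path.exists():
        return None
    return json.loads(record_path.read_text()).get("last_good_sha")


# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    record = Path(tmp) / "deploy-record.json"
    before = last_good_sha(record)         # nothing recorded yet
    record_good_deploy(record, "abc1234")  # written after a healthy deploy
    after = last_good_sha(record)          # what a rollback job would read
```

The important design choice is that the record is written only after a deployment has passed its checks, so the value always points at a version known to work.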
Practice Questions
Answer in your own words — then check against the expected answer.
1. What is the name of the database migration strategy that splits a breaking schema change across multiple deployments — first adding the new structure alongside the old (expand), then removing the old structure after the new one is proven stable (contract) — making each deployment independently rollback-safe?
2. A production deployment has introduced a known, isolated bug with a clear one-line fix. Rolling back would reverse a completed database migration and cause more disruption than the bug itself. What is the appropriate response — rolling back or rolling forward — and why?
3. What two artifact registry properties are prerequisites for artifact-based rollback — the conditions that ensure a previous deployment version can always be retrieved and redeployed exactly as it was originally?
Lesson Quiz
1. A production deployment has caused a severe error rate spike. The cause is unknown. The team needs to restore service as fast as possible. What is the fastest recovery mechanism available in a CI/CD pipeline with proper artifact management?
2. A team rolls back their application to the previous artifact after a failed deployment. The rollback completes but the application still crashes immediately. The deployment included a database migration that added a new non-nullable column. What is most likely happening?
3. A deployment has introduced a minor UI bug affecting 2% of users. The fix is a single CSS change. The deployment also included a data migration that populated a new table with 48 hours of backfilled records — reversing it would mean losing that data. What is the correct response?