CI/CD Lesson 20 – Rollback Strategies | Dataplexa

Section II · Lesson 20

Rollback Strategies

In this lesson

Rollback vs Roll Forward · Artifact-Based Rollback · Database Migrations · Automated Rollback · Rollback in the Pipeline

A rollback is the act of returning a production environment to a previously known-good state after a deployment introduces a problem. It is not a failure of CI/CD — it is evidence that the safety net works. Every team that deploys software to production will eventually need to roll back. The question is not whether rollback will be needed, but whether the team has a defined, practiced, and fast procedure for doing it when the moment comes. A rollback that takes four hours under pressure is qualitatively different from one that takes four minutes and runs automatically.

Rollback vs Roll Forward — Choosing the Right Response

Not every production problem calls for a rollback. The first decision after detecting an incident is whether to roll back to the previous version or roll forward with a fix. Each approach has a different risk profile and a different time cost, and the right choice depends on the nature and severity of the problem.

Rollback vs Roll Forward — Decision Guide

Roll Back

Redeploy the previous artifact. Fastest path to stability when the cause is unclear, the impact is severe, or the fix is not immediately available. Trades new features for immediate recovery.

Best for: Severe or unclear impact, data integrity risk, high error rates, failed smoke tests

Risk: May not be possible if database migrations are non-reversible

Roll Forward

Deploy a new version with a targeted fix. Appropriate when the bug is understood, the fix is small and low-risk, and rolling back would cause more disruption than the bug itself (e.g. a completed database migration that cannot be reversed).

Best for: Known, isolated bug with a clear fix, irreversible migrations, minor impact with time to fix properly

Risk: Fix may introduce new bugs under pressure; extends the incident window
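The decision guide above can be written down as a small function. This is only a sketch: the Incident fields and the order of the checks are assumptions made for illustration, and a real team would tune them to its own runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical summary of what is known when the incident is detected."""
    cause_understood: bool      # is the bug identified and isolated?
    fix_ready: bool             # is a small, low-risk fix available now?
    severe_impact: bool         # high error rate, data integrity risk, failed smoke tests
    migration_reversible: bool  # can the previous version run against the current schema?


def choose_response(incident: Incident) -> str:
    """Return 'roll back' or 'roll forward' following the decision guide."""
    # If the previous version cannot run against the current schema,
    # rolling back is not a real option.
    if not incident.migration_reversible:
        return "roll forward"
    # Severe or unexplained problems: restore stability first, diagnose later.
    if incident.severe_impact or not incident.cause_understood:
        return "roll back"
    # Understood, minor bug: fix forward only if the fix is actually ready.
    if incident.fix_ready:
        return "roll forward"
    return "roll back"


# A known one-line bug behind an irreversible migration: fix forward.
print(choose_response(Incident(True, True, False, False)))
```

Putting the migration check first reflects the risk noted in the guide: if the schema cannot serve the old version, the rollback option is gone regardless of severity.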

The Emergency Exit Analogy

A fire exit is not a sign that a building is poorly designed. It is a sign that the designers understood that emergencies happen and planned for them. A fire exit that is locked, blocked, or leads somewhere unexpected can be worse than having no exit at all, because people will head for it expecting it to work. Rollback is the emergency exit for a deployment. The test is not whether you ever need it; the test is whether it opens quickly, leads where it should, and has been tested recently enough that everyone knows how to use it.

Artifact-Based Rollback — The Fastest Path to Recovery

The most reliable rollback mechanism in a CI/CD pipeline is artifact-based rollback: redeploying a specific previous artifact directly from the registry, identified by its commit SHA or version tag. Because the previous artifact is already built, tested, and stored — and because the deployment process is automated and scripted — the rollback is essentially a re-run of the deploy job with a different artifact reference. No code changes required, no build step, no pipeline from scratch.

This is why artifact immutability and retention policies (covered in Lesson 14) are not administrative details — they are prerequisites for fast rollback. If previous artifacts are overwritten or deleted on a short retention schedule, the option to roll back to a specific version disappears. The artifact must still exist in the registry, under its original tag, in exactly the state it was when it was deployed.
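A minimal sketch of why immutability and retention matter, assuming a toy in-memory registry. ArtifactRegistry, rollback, and the tag names are hypothetical and do not correspond to any real registry API. The point is that a rollback should refuse to proceed if the artifact is gone or no longer matches the digest recorded at deploy time.

```python
import hashlib


class ArtifactRegistry:
    """Toy in-memory stand-in for an artifact registry with immutable tags."""

    def __init__(self):
        self._artifacts = {}  # tag -> artifact bytes

    def push(self, tag, content):
        # Immutability: a tag, once published, can never be overwritten.
        if tag in self._artifacts:
            raise ValueError(f"tag {tag!r} already exists and is immutable")
        self._artifacts[tag] = content
        return hashlib.sha256(content).hexdigest()

    def digest(self, tag):
        # Raises KeyError if the artifact was never pushed or has been deleted.
        return hashlib.sha256(self._artifacts[tag]).hexdigest()


def rollback(registry, tag, expected_digest, deploy):
    """Redeploy `tag` only if it still exists and is byte-for-byte unchanged."""
    try:
        actual = registry.digest(tag)
    except KeyError:
        raise RuntimeError(f"cannot roll back: artifact {tag!r} no longer exists")
    if actual != expected_digest:
        raise RuntimeError(f"cannot roll back: artifact {tag!r} has changed")
    deploy(tag)  # same deploy routine as a normal release, different reference


registry = ArtifactRegistry()
good_digest = registry.push("sha-aaa111", b"build v1")
registry.push("sha-bbb222", b"build v2")
rollback(registry, "sha-aaa111", good_digest, deploy=print)
```

The deploy callable is passed in to show that rollback reuses the normal deployment path; nothing is rebuilt.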

Database Migrations — The Rollback Complication

Application code is easy to roll back. Database schema changes are not. A database migration that adds a column, renames a table, or drops a field changes the persistent state of the system in ways that may be irreversible — or at least expensive to reverse. Rolling back the application code while leaving the migrated schema in place can produce a mismatch that causes the previous version to fail immediately.

A widely used approach to making migrations rollback-safe is the expand-contract pattern (also called parallel change). Instead of making a breaking schema change in a single migration, the change is split across multiple deployments. First, the new column or table is added alongside the old one (expand) and existing rows are backfilled. The application is then updated to write to both the old and the new structure, and reads move to the new structure only once it is fully populated. After several deployment cycles of confirmed stability, the old column is removed (contract). At any point before the contract step, rolling back the application code is safe because the old schema is still present and still being written. The contract migration is the one step that cannot be undone cheaply, so it runs only once the team is confident that no rollback to the old version will be needed.
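The expand phase can be demonstrated with Python's built-in sqlite3 module. This is an illustrative sketch, not a migration tool: the users table and the fullname and display_name columns are invented for the example, and the two "application versions" are reduced to plain functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column next to the old one, then backfill existing rows.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def old_app_read(user_id):
    """Previous application version: only knows the old column."""
    row = conn.execute("SELECT fullname FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


def new_app_write(name):
    """New application version: writes both columns so the old version stays usable."""
    cur = conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )
    return cur.lastrowid


def new_app_read(user_id):
    """New application version: reads from the new column."""
    row = conn.execute("SELECT display_name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


new_id = new_app_write("Grace Hopper")

# Rolling back the application at this point is safe: the old code can still
# read rows written by the new code, because the old column is still populated.
assert old_app_read(new_id) == "Grace Hopper"
assert new_app_read(1) == "Ada Lovelace"  # the backfilled row

# Contract: run only once rollback to the old version is no longer needed.
# conn.execute("ALTER TABLE users DROP COLUMN fullname")  # requires SQLite 3.35+
```

The contract statement is left commented out on purpose: once it runs, old_app_read would fail, which is exactly the mismatch the pattern is designed to postpone until it no longer matters.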

Automated Rollback — Making Recovery Instant

Automated rollback triggers a redeployment of the previous artifact without human intervention, based on a defined failure signal — a failed smoke test, a spike in error rate from the observability platform, a health check that stops returning 200. The pipeline (or a monitoring integration) detects the signal, identifies the last known-good artifact, and deploys it immediately.

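Automated rollback is the gold standard for production stability, but it requires two preconditions: the failure signal must be reliable enough that false positives do not trigger unnecessary rollbacks, and the rollback itself must be tested regularly so the team knows it works when needed. A rollback procedure that has never been tested in production is a procedure the team cannot trust in an incident. Tooling support varies. AWS ECS (through its deployment circuit breaker) and Argo Rollouts (through analysis-driven aborts) can be configured to roll back a failing deployment automatically. A plain Kubernetes Deployment keeps a revision history and can be reverted with a single kubectl rollout undo command, but it does not revert itself; the pipeline or another controller has to issue that rollback.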
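One simple way to reduce false positives is to require several consecutive bad samples before acting. The sketch below assumes a monitor that periodically reports a health check status code and an error rate. RollbackTrigger, the threshold of three samples, and the 5% error-rate limit are all illustrative choices, not values from any particular platform.

```python
from collections import deque


class RollbackTrigger:
    """Fires only after `threshold` consecutive unhealthy samples.

    Requiring several bad samples in a row keeps a single transient
    blip from triggering an unnecessary rollback.
    """

    def __init__(self, threshold=3, max_error_rate=0.05):
        self.max_error_rate = max_error_rate
        self._recent = deque(maxlen=threshold)  # most recent health verdicts

    def observe(self, status_code, error_rate):
        """Record one sample; return True when a rollback should start."""
        healthy = status_code == 200 and error_rate <= self.max_error_rate
        self._recent.append(healthy)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and not any(self._recent)


trigger = RollbackTrigger(threshold=3)
samples = [(500, 0.0), (200, 0.01), (503, 0.2), (200, 0.5), (500, 0.0)]
decisions = [trigger.observe(code, rate) for code, rate in samples]
print(decisions)
```

In this run the healthy second sample resets the streak, so the trigger fires only on the fifth sample, after three unhealthy samples in a row.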

Automated Rollback on Smoke Test Failure — GitHub Actions

jobs:
  deploy-production:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Check out repository
        uses: actions/checkout@v4        # Makes ./deploy.sh and ./smoke-test.sh available in the workspace

      - name: Deploy new artifact
        run: ./deploy.sh production ${{ github.sha }}

      - name: Run smoke tests
        id: smoke
        run: ./smoke-test.sh https://myapp.com
        continue-on-error: true          # Don't fail the job yet; check the result first

      - name: Rollback on smoke test failure
        if: steps.smoke.outcome == 'failure'
        run: |
          # LAST_GOOD_SHA is a repository variable; it must be updated after each healthy deploy
          echo "Smoke tests failed; rolling back to ${{ vars.LAST_GOOD_SHA }}"
          ./deploy.sh production ${{ vars.LAST_GOOD_SHA }}   # Redeploy previous known-good artifact
          ./smoke-test.sh https://myapp.com                  # Verify rollback succeeded

      - name: Fail the job if smoke tests failed
        if: steps.smoke.outcome == 'failure'
        run: exit 1                      # Mark the workflow as failed after rollback completes

What just happened?

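The deploy job attempts the new deployment and runs smoke tests. Because the smoke test step sets continue-on-error: true, a failure does not stop the job; the later steps check steps.smoke.outcome, which holds the step's real result, to decide what to do. If the tests failed, the job immediately redeploys the previous known-good artifact (stored by SHA in a repository variable) and verifies that the rollback itself succeeded. The final step then marks the job as failed so the team is alerted, but production has already been restored before the failure notification is sent.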

Warning: A Rollback Plan That Has Never Been Tested Is Not a Rollback Plan

Writing a rollback script and never running it is the equivalent of installing a fire extinguisher and never checking whether it is charged. The worst moment to discover that your rollback procedure fails — because the previous artifact was deleted, the deploy script has a bug, or the database migration is non-reversible — is during a production incident at 2am with users affected and a manager on the call. Rollback procedures must be tested in production on a schedule: triggered deliberately, verified to work, and documented. Teams that do this regularly find the procedure boring; teams that do not find it catastrophic.

Key Takeaways from This Lesson

✓ Rollback is a planned safety net, not a failure: every team that ships to production will eventually need one. The measure of maturity is how fast and reliable the procedure is, not whether it is ever used.

✓ Artifact-based rollback is the fastest recovery mechanism: redeploying a previous artifact from the registry requires no build step and no code change. Immutability and retention policies are the prerequisites that make it possible.

✓ Database migrations are the primary rollback complication: the expand-contract pattern makes schema changes rollback-safe by splitting breaking changes across multiple deployments so the previous application version is never incompatible with the current schema.

✓ Automated rollback on smoke test failure minimises the incident window: a deployment that triggers an automatic rollback before the on-call engineer is even paged has already restored service by the time the human arrives.

✓ Rollback procedures must be tested regularly: an untested rollback is unreliable under pressure. Trigger it deliberately on a schedule, verify it works end-to-end, and document the result. Make it boring before it becomes urgent.

Teacher's Note

Store the last known-good deployment SHA somewhere the pipeline can read it automatically — a repository variable, a parameter store entry, or a deployment record — so a rollback never requires a human to remember or look up what version was running before.
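As one possible shape for such a deployment record, the sketch below appends each deployment outcome to a JSON-lines file and looks up the most recent healthy SHA. The file name, the field names, and both helper functions are assumptions for illustration. A repository variable or parameter store entry would serve the same purpose.

```python
import json
from pathlib import Path


def record_deployment(log_path, sha, healthy):
    """Append one deployment outcome to a JSON-lines record."""
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps({"sha": sha, "healthy": healthy}) + "\n")


def last_known_good(log_path, exclude=None):
    """Return the most recent healthy SHA, optionally skipping `exclude`."""
    lines = Path(log_path).read_text(encoding="utf-8").splitlines()
    for line in reversed(lines):
        entry = json.loads(line)
        if entry["healthy"] and entry["sha"] != exclude:
            return entry["sha"]
    return None  # no healthy deployment has been recorded yet
```

A pipeline would call record_deployment after the smoke tests finish and last_known_good when it needs a rollback target, so nobody has to look up which version was running before.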

Practice Questions

Answer in your own words — then check against the expected answer.

1. What is the name of the database migration strategy that splits a breaking schema change across multiple deployments — first adding the new structure alongside the old (expand), then removing the old structure after the new one is proven stable (contract) — making each deployment independently rollback-safe?

2. A production deployment has introduced a known, isolated bug with a clear one-line fix. Rolling back would reverse a completed database migration and cause more disruption than the bug itself. What is the appropriate response — rolling back or rolling forward — and why?

3. What two artifact registry properties are prerequisites for artifact-based rollback — the conditions that ensure a previous deployment version can always be retrieved and redeployed exactly as it was originally?


Up Next · Lesson 21

CI/CD Architecture Design

Section III opens with the big picture — how to design a CI/CD architecture that scales with your team, your infrastructure, and your deployment complexity without becoming a maintenance burden.
