CI/CD Lesson 33 – Blue-Green Deployment | Dataplexa
Section IV · Lesson 33

Blue-Green Deployments

In this lesson

The Mechanics · Pipeline Implementation · Database Considerations · Cost & Trade-offs · Traffic Cutover Patterns

Blue-green deployment is a release strategy that eliminates deployment downtime and enables instant rollback by maintaining two identical production environments — blue and green — at all times. One environment serves live traffic while the other is idle. When a new version is ready to deploy, it is deployed to the idle environment, verified there with smoke tests and health checks, and then traffic is switched from the active environment to the newly deployed one via a load balancer or DNS change. If anything goes wrong, traffic switches back in seconds. The previously active environment is still running and has not been touched — it becomes the instant rollback target.

The Mechanics — Two Environments, One Traffic Switch

The core mechanic is straightforward. At any given moment, blue is live and green is idle — or vice versa. The pipeline always deploys to whichever environment is currently idle. After deployment and verification, a single traffic switch — a load balancer rule update, an AWS target group swap, a Kubernetes service selector update — moves all incoming requests from the old environment to the new one. The switch is near-instantaneous for new requests. In-flight requests complete against the old environment. Users experience no interruption.
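In Kubernetes, for instance, the traffic switch can be a one-line change to a Service's label selector. A minimal sketch, assuming Deployments whose pods carry a `colour` label — the names here are illustrative, not from any specific setup:

```yaml
# Hypothetical Service: routes all traffic to whichever pods carry colour: blue.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    colour: blue   # The traffic switch is changing this value to "green", e.g.:
                   # kubectl patch svc myapp -p '{"spec":{"selector":{"colour":"green"}}}'
  ports:
    - port: 80
      targetPort: 8080
```

Because the selector change is a single atomic update to one object, new connections route to the other colour immediately while existing connections drain naturally.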

Blue-Green Deployment — Step by Step

Before
Blue environment runs v1.4.1 and receives 100% of production traffic. Green environment runs v1.4.1 and is idle — kept warm but receiving no traffic.
Deploy
Pipeline deploys v1.4.2 to the green environment. Blue continues serving all traffic unchanged. Zero user impact during deployment.
Verify
Smoke tests run against the green environment directly. Health checks pass. End-to-end tests validate critical paths. Human approval granted if required.
Switch
Load balancer updated to route 100% of traffic to green (v1.4.2). Blue becomes idle. Switch takes milliseconds. Users on v1.4.2 immediately.
Monitor
Error rates, latency, and business metrics are monitored for 10–30 minutes. Blue environment remains idle but running — it is the instant rollback target.
Rollback
If an issue is detected, the load balancer switches traffic back to blue (v1.4.1) in seconds. No redeployment, no pipeline run, no rebuilding. Blue was never torn down.

The Train Platform Analogy

A busy train station has two platforms serving the same route. Platform 1 is currently active — the train on it is loading passengers. Platform 2 has an identical train that has been prepared, inspected, and is ready to depart. When the signal changes, passengers board from Platform 2 instead. If something is wrong with Platform 2's train, the switch back to Platform 1 is immediate — it is still there, still ready. Blue-green deployment is the two-platform model for software: always have a verified, ready alternative, and make the switch fast enough that passengers never notice the change.

Pipeline Implementation — Detecting Active and Deploying to Idle

The pipeline must determine which environment is currently active before it can deploy to the idle one. This is typically achieved by querying the load balancer or a deployment state store — a parameter in AWS Parameter Store, a label on a Kubernetes service, or a variable in a deployment database — to read which colour is currently live, then deploying to the opposite.

Blue-Green Deployment Pipeline — GitHub Actions

jobs:
  blue-green-deploy:
    runs-on: ubuntu-latest
    environment: production
    permissions:
      contents: read
      id-token: write

    steps:
      - uses: actions/checkout@v4

      - name: Authenticate to AWS via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: eu-west-1

      - name: Determine active environment
        id: active
        run: |
          # Read which colour is currently serving traffic
          ACTIVE=$(aws ssm get-parameter \
            --name "/myapp/production/active-colour" \
            --query "Parameter.Value" \
            --output text)
          echo "active=$ACTIVE" >> $GITHUB_OUTPUT
          # Set idle to the opposite colour
          if [ "$ACTIVE" = "blue" ]; then
            echo "idle=green" >> $GITHUB_OUTPUT
          else
            echo "idle=blue" >> $GITHUB_OUTPUT
          fi

      - name: Deploy to idle environment
        run: |
          IDLE=${{ steps.active.outputs.idle }}
          echo "Deploying ${{ github.sha }} to $IDLE environment"
          # Assumes an earlier step registered a new task definition revision
          # for the myapp-$IDLE family pointing at this commit's image.
          # Passing the family name alone uses its latest active revision.
          aws ecs update-service \
            --cluster myapp-production \
            --service myapp-$IDLE \
            --task-definition myapp-$IDLE
          aws ecs wait services-stable \
            --cluster myapp-production \
            --services myapp-$IDLE               # Wait for ECS to confirm healthy

      - name: Smoke test idle environment
        run: |
          IDLE=${{ steps.active.outputs.idle }}
          IDLE_URL=$(aws ssm get-parameter \
            --name "/myapp/production/$IDLE-url" \
            --query "Parameter.Value" --output text)
          ./smoke-test.sh "$IDLE_URL"            # Test the new version before traffic switch

      - name: Switch traffic to idle environment
        run: |
          IDLE=${{ steps.active.outputs.idle }}
          # Update load balancer to point to the newly deployed environment
          aws elbv2 modify-listener \
            --listener-arn ${{ secrets.ALB_LISTENER_ARN }} \
            --default-actions Type=forward,TargetGroupArn=$(aws ssm get-parameter \
              --name "/myapp/production/$IDLE-tg-arn" \
              --query "Parameter.Value" --output text)
          # Record the new active colour for the next deployment
          aws ssm put-parameter \
            --name "/myapp/production/active-colour" \
            --value "$IDLE" --overwrite

What just happened?

The pipeline read the current active colour from Parameter Store, deployed to the opposite environment, waited for ECS to confirm stability, and ran smoke tests against the idle environment's direct URL before any traffic moved. Only then did it update the load balancer to route production traffic to the newly deployed environment, and record the new active colour for the next run to read. The previous environment is still running and untouched — a single load-balancer update switches traffic back instantly if needed.
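Rollback is the same switch run in reverse. A sketch of what a standalone rollback script might look like — defined as functions only, with parameter names mirroring the pipeline above; the SSM paths and environment variables are assumptions, not a fixed API:

```shell
#!/usr/bin/env bash
set -euo pipefail

# The rollback target is whichever colour is NOT currently recorded as active.
flip_colour() {
  if [ "$1" = "blue" ]; then echo "green"; else echo "blue"; fi
}

rollback() {
  local active target
  active=$(aws ssm get-parameter --name "/myapp/production/active-colour" \
    --query "Parameter.Value" --output text)
  target=$(flip_colour "$active")

  # Point the listener back at the previous environment's target group
  aws elbv2 modify-listener \
    --listener-arn "$ALB_LISTENER_ARN" \
    --default-actions Type=forward,TargetGroupArn="$(aws ssm get-parameter \
      --name "/myapp/production/$target-tg-arn" \
      --query "Parameter.Value" --output text)"

  # Record the rollback so the next pipeline run reads the correct state
  aws ssm put-parameter --name "/myapp/production/active-colour" \
    --value "$target" --overwrite
}
```

Note that the script makes no deployment calls at all — both environments are already running the right versions, which is precisely why the rollback takes seconds.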

Database Considerations — The Blue-Green Complication

Blue-green deployment handles stateless application tiers elegantly. Databases — stateful, shared, and schema-constrained — are the complication. If both blue and green share the same database, a schema migration must be compatible with both versions simultaneously: the old version (blue) must continue to function against the migrated schema while the new version (green) also functions against it, until the traffic switch is complete and blue is decommissioned.

This constraint means that during a blue-green deployment, destructive schema changes — dropping a column, renaming a table — cannot be applied in the same release as the application code that removes the reference to that column. The expand-contract pattern (introduced in Lesson 20) is the solution: add the new column first and deploy, then migrate data, then deploy the application that uses the new column, then drop the old column in a subsequent release. Each step is independently deployable and backwards-compatible with the currently running version.
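One way to make this constraint concrete is a pipeline guard that flags destructive DDL before it ships in the same release as a code change. A toy sketch — the string matching is deliberately crude, and a real check would parse the migration properly:

```shell
#!/usr/bin/env bash
# Classify a DDL statement for blue-green safety. Additive changes (the
# "expand" half of expand-contract) are safe while blue and green share the
# database; destructive changes (the "contract" half) must ship in a later
# release, after the old application version is decommissioned.
classify_migration() {
  local stmt
  stmt=$(echo "$1" | tr '[:lower:]' '[:upper:]')
  case "$stmt" in
    *"DROP COLUMN"*|*"DROP TABLE"*|*"RENAME COLUMN"*|*"RENAME TO"*)
      echo "unsafe" ;;   # breaks the version still reading the old schema
    *"ADD COLUMN"*|*"CREATE TABLE"*|*"CREATE INDEX"*)
      echo "safe" ;;     # additive — both versions keep working
    *)
      echo "review" ;;   # anything else needs a human look
  esac
}
```

A guard like this cannot prove a migration safe, but it catches the most common expand-contract violation — a drop or rename riding along with the code release that stops using the structure.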

Cost, Trade-offs, and Traffic Cutover Patterns

Blue-green deployment has one significant cost: it requires double the infrastructure for the application tier. During the deployment window — the period between deploying to idle and decommissioning the old active environment — both environments are running simultaneously. For large applications on expensive infrastructure, this cost is real. The mitigation is to keep the deployment window short — measured in minutes, not hours — so the cost of running both environments is minimal relative to the operational safety gained.

Weighted traffic routing is a variant that switches traffic gradually rather than all at once — routing 5% of traffic to green, monitoring for 10 minutes, then 20%, then 50%, then 100%. This is effectively a blend of blue-green and canary deployment (covered in Lesson 34). It provides the safety of incremental exposure while retaining the instant rollback capability of blue-green — at any point during the weighted transition, routing 100% back to blue is a single parameter change.
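On an AWS Application Load Balancer, weighted routing is a forward action with two weighted target groups. A sketch, defined as functions only — the ARN variables and the 5 → 20 → 50 → 100 schedule are illustrative assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# The next step in a hypothetical 5 -> 20 -> 50 -> 100 ramp.
next_weight() {
  case "$1" in
    0)  echo 5 ;;
    5)  echo 20 ;;
    20) echo 50 ;;
    *)  echo 100 ;;
  esac
}

# Split traffic between the blue and green target groups at the given green
# weight. Rolling back at any point is just calling this with green weight 0.
set_green_weight() {
  local green_weight=$1
  local blue_weight=$((100 - green_weight))
  aws elbv2 modify-listener \
    --listener-arn "$ALB_LISTENER_ARN" \
    --default-actions "Type=forward,ForwardConfig={TargetGroups=[{TargetGroupArn=$BLUE_TG_ARN,Weight=$blue_weight},{TargetGroupArn=$GREEN_TG_ARN,Weight=$green_weight}]}"
}
```

Between each weight increase, the pipeline would monitor error rates and latency before calling `set_green_weight` with the next step — or with 0 to abort.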

Warning: Tearing Down the Old Environment Too Quickly Eliminates the Rollback Option

The entire value of blue-green deployment comes from the old environment remaining available as an instant rollback target. A pipeline that deprovisions the old environment immediately after the traffic switch — to save infrastructure cost — has converted blue-green into a standard deployment with a slightly different sequence. The old environment must remain running for at least as long as the team's defined soak period: the window during which post-deployment monitoring would detect a problem that warrants rollback. For most teams this is 15–60 minutes. For high-traffic production systems with complex behaviour, it may be several hours. Define the soak period explicitly, enforce it in the pipeline, and only then decommission the old environment.
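In a GitHub Actions pipeline, the soak period can be enforced as a separate job that gates decommissioning. A sketch, assuming the deploy job above and a 30-minute soak window — job names, timings, and commands are illustrative, and the AWS auth step is omitted for brevity:

```yaml
  decommission-old:
    needs: blue-green-deploy
    runs-on: ubuntu-latest
    steps:
      - name: Soak period — old environment stays running as rollback target
        run: sleep 1800    # 30 minutes; set this to your team's defined soak window

      - name: Scale down the now-idle environment
        run: |
          # Only reached after the soak period has elapsed without a rollback
          ACTIVE=$(aws ssm get-parameter --name "/myapp/production/active-colour" \
            --query "Parameter.Value" --output text)
          OLD=$([ "$ACTIVE" = "blue" ] && echo green || echo blue)
          aws ecs update-service --cluster myapp-production \
            --service myapp-$OLD --desired-count 0
```

Making the wait an explicit pipeline step means the soak period is enforced by machinery rather than by discipline — it cannot be quietly skipped under cost pressure.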

Key Takeaways from This Lesson

Blue-green provides zero-downtime deployment and instant rollback — the new version is fully verified before any traffic switches, and the previous version stays running as an immediate rollback target throughout the soak period.
The pipeline determines active and idle, deploys to idle, verifies, then switches — a state store (Parameter Store, Kubernetes label, or equivalent) records which colour is live so each pipeline run targets the correct environment automatically.
Shared databases require backwards-compatible migrations — while both environments share the same database during the switch window, schema changes must be compatible with both the old and new application version simultaneously. The expand-contract pattern is the solution.
Weighted routing blends blue-green with canary safety — routing a small percentage of traffic to green first, monitoring, then incrementally increasing provides gradual exposure while retaining instant full rollback at any point in the transition.
The old environment must stay running through the full soak period — decommissioning it immediately after the traffic switch eliminates the rollback option that justifies the infrastructure cost of running two environments in the first place.

Teacher's Note

Define your soak period before you implement blue-green — "how long do we keep the old environment running after the switch?" is a business risk decision, not a technical one, and having it decided in advance means it will not be skipped under cost pressure during an incident.

Practice Questions

Answer in your own words — then check against the expected answer.

1. What is the term for the defined window of time — typically 15 to 60 minutes — during which post-deployment monitoring runs against the newly active environment, the old environment remains running as a rollback target, and the deployment is considered provisional rather than confirmed?



2. What traffic cutover pattern routes a small initial percentage — such as 5% — of production traffic to the new environment, monitors for errors, then incrementally increases the percentage until 100% of traffic has migrated, combining gradual exposure with the instant full rollback capability of blue-green?



3. When both blue and green share the same database during a blue-green deployment, a destructive schema change cannot be applied in the same release as the application code that removes the reference. What migration pattern handles this by adding the new structure first, migrating data, then removing the old structure in a later release?



Lesson Quiz

1. A team implements blue-green deployment but discovers that production issues are only detected after the traffic switch rather than before it. What step in the deployment sequence are they missing?


2. A blue-green deployment includes a database migration that drops a column the old application version still reads from. The traffic switch completes successfully, but the blue environment immediately starts logging database errors. What constraint was violated?


3. A team implements blue-green deployment and tears down the old environment immediately after the traffic switch to minimise infrastructure costs. Fifteen minutes later, monitoring alerts fire. The team attempts to roll back but discovers there is no running previous environment to switch to. What went wrong?


Up Next · Lesson 34

Canary Deployments

Blue-green switches all traffic at once. Canary goes further — routing a small percentage of real users to the new version first, validating with production traffic before committing to a full rollout.