CI/CD Course
Blue-Green Deployments
In this lesson
Blue-green deployment is a release strategy that eliminates deployment downtime and enables instant rollback by maintaining two identical production environments — blue and green — at all times. One environment serves live traffic while the other is idle. When a new version is ready to deploy, it is deployed to the idle environment, verified there with smoke tests and health checks, and then traffic is switched from the active environment to the newly deployed one via a load balancer or DNS change. If anything goes wrong, traffic switches back in seconds. The previously active environment is still running and has not been touched — it becomes the instant rollback target.
The Mechanics — Two Environments, One Traffic Switch
The core mechanic is straightforward. At any given moment, blue is live and green is idle — or vice versa. The pipeline always deploys to whichever environment is currently idle. After deployment and verification, a single traffic switch — a load balancer rule update, an AWS target group swap, a Kubernetes service selector update — moves all incoming requests from the old environment to the new one. The switch is near-instantaneous for new requests. In-flight requests complete against the old environment. Users experience no interruption.
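To make the switch concrete, here is a sketch of one common mechanism: a Kubernetes Service whose selector carries a colour label. The names and labels below are illustrative, not from a specific deployment.

```yaml
# Hypothetical Service fronting the application. Both the blue and green
# Deployments label their pods with a "colour" value; flipping the selector
# from blue to green re-routes all new requests in a single API call, e.g.:
#   kubectl patch service myapp \
#     -p '{"spec":{"selector":{"app":"myapp","colour":"green"}}}'
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    colour: blue   # the single field the pipeline flips at cutover
  ports:
    - port: 80
      targetPort: 8080
```

In-flight requests to the old pods complete normally; only new connections follow the updated selector.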
Blue-Green Deployment — Step by Step
The Train Platform Analogy
A busy train station has two platforms serving the same route. Platform 1 is currently active — the train on it is loading passengers. Platform 2 has an identical train that has been prepared, inspected, and is ready to depart. When the signal changes, passengers board from Platform 2 instead. If something is wrong with Platform 2's train, the switch back to Platform 1 is immediate — it is still there, still ready. Blue-green deployment is the two-platform model for software: always have a verified, ready alternative, and make the switch fast enough that passengers never notice the change.
Pipeline Implementation — Detecting Active and Deploying to Idle
The pipeline must determine which environment is currently active before it can deploy to the idle one. This is typically achieved by querying the load balancer or a deployment state store — a parameter in AWS Parameter Store, a label on a Kubernetes service, or a variable in a deployment database — to read which colour is currently live, then deploying to the opposite.
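The read-and-flip logic can be sketched as a pair of shell functions. The SSM parameter path is an assumption carried over from the pipeline below; substitute whichever state store your pipeline uses.

```shell
#!/usr/bin/env sh
# Sketch of active/idle colour resolution against AWS Parameter Store.
# The parameter name is illustrative.

# Read the currently live colour from the state store
active_colour() {
  aws ssm get-parameter \
    --name "/myapp/production/active-colour" \
    --query "Parameter.Value" --output text
}

# Given the active colour, return the colour to deploy to
idle_colour() {
  if [ "$1" = "blue" ]; then
    echo "green"
  else
    echo "blue"
  fi
}
```

Usage in a pipeline step would then be `IDLE=$(idle_colour "$(active_colour)")`.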
Blue-Green Deployment Pipeline — GitHub Actions
jobs:
  blue-green-deploy:
    runs-on: ubuntu-latest
    environment: production
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4

      - name: Authenticate to AWS via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: eu-west-1

      - name: Determine active environment
        id: active
        run: |
          # Read which colour is currently serving traffic
          ACTIVE=$(aws ssm get-parameter \
            --name "/myapp/production/active-colour" \
            --query "Parameter.Value" \
            --output text)
          echo "active=$ACTIVE" >> "$GITHUB_OUTPUT"
          # Set idle to the opposite colour
          if [ "$ACTIVE" = "blue" ]; then
            echo "idle=green" >> "$GITHUB_OUTPUT"
          else
            echo "idle=blue" >> "$GITHUB_OUTPUT"
          fi

      - name: Deploy to idle environment
        run: |
          IDLE=${{ steps.active.outputs.idle }}
          echo "Deploying ${{ github.sha }} to $IDLE environment"
          # Deploy the latest registered task definition revision. An earlier
          # step is assumed to have registered a revision for this commit
          # (ECS revisions are numeric, so a git SHA cannot be used as one).
          aws ecs update-service \
            --cluster myapp-production \
            --service "myapp-$IDLE" \
            --task-definition myapp
          aws ecs wait services-stable \
            --cluster myapp-production \
            --services "myapp-$IDLE" # Wait for ECS to confirm healthy

      - name: Smoke test idle environment
        run: |
          IDLE=${{ steps.active.outputs.idle }}
          IDLE_URL=$(aws ssm get-parameter \
            --name "/myapp/production/$IDLE-url" \
            --query "Parameter.Value" --output text)
          ./smoke-test.sh "$IDLE_URL" # Test the new version before traffic switch

      - name: Switch traffic to idle environment
        run: |
          IDLE=${{ steps.active.outputs.idle }}
          IDLE_TG_ARN=$(aws ssm get-parameter \
            --name "/myapp/production/$IDLE-tg-arn" \
            --query "Parameter.Value" --output text)
          # Update the load balancer to point at the newly deployed environment
          aws elbv2 modify-listener \
            --listener-arn ${{ secrets.ALB_LISTENER_ARN }} \
            --default-actions "Type=forward,TargetGroupArn=$IDLE_TG_ARN"
          # Record the new active colour for the next deployment
          aws ssm put-parameter \
            --name "/myapp/production/active-colour" \
            --value "$IDLE" --overwrite
What just happened?
The pipeline read the current active colour from Parameter Store, deployed to the opposite environment, waited for ECS to confirm stability, ran smoke tests against the idle environment's direct URL before any traffic switch, then atomically updated the load balancer to route production traffic to the newly deployed environment and recorded the new active colour for the next run. The previous environment is running and untouched — a single listener update switches traffic back instantly if needed (the parameter store write only records which colour is live; it does not move traffic itself).
Database Considerations — The Blue-Green Complication
Blue-green deployment handles stateless application tiers elegantly. Databases — stateful, shared, and schema-constrained — are the complication. If both blue and green share the same database, a schema migration must be compatible with both versions simultaneously: the old version (blue) must continue to function against the migrated schema while the new version (green) also functions against it, until the traffic switch is complete and blue is decommissioned.
This constraint means that during a blue-green deployment, destructive schema changes — dropping a column, renaming a table — cannot be applied in the same release as the application code that removes the reference to that column. The expand-contract pattern (introduced in Lesson 20) is the solution: add the new column first and deploy, then migrate data, then deploy the application that uses the new column, then drop the old column in a subsequent release. Each step is independently deployable and backwards-compatible with the currently running version.
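The expand and backfill steps can be sketched against a local SQLite database standing in for the shared production database. Table and column names are illustrative, and the contract step is shown only as a comment because it belongs in a later release.

```shell
#!/usr/bin/env sh
# Expand-contract sketch. A rename of users.full_name to users.display_name
# is split across releases so blue and green can share the schema.
DB=demo.db
rm -f "$DB"
sqlite3 "$DB" "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT);"
sqlite3 "$DB" "INSERT INTO users (full_name) VALUES ('Ada Lovelace');"

# Release 1 (expand): add the new column. The old application version
# ignores it, so blue keeps working against the migrated schema.
sqlite3 "$DB" "ALTER TABLE users ADD COLUMN display_name TEXT;"

# Release 2: backfill the new column from the old one.
sqlite3 "$DB" "UPDATE users SET display_name = full_name WHERE display_name IS NULL;"

# Release 3: deploy application code that reads and writes display_name only.
# Release 4 (contract): once no running version references full_name,
# drop it in its own release, e.g.:
#   sqlite3 "$DB" "ALTER TABLE users DROP COLUMN full_name;"

sqlite3 "$DB" "SELECT display_name FROM users;"   # -> Ada Lovelace
```

Each release above is independently deployable, and at every point both the currently live and the newly deployed application versions work against the schema as it stands.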
Cost, Trade-offs, and Traffic Cutover Patterns
Blue-green deployment has one significant cost: it requires double the infrastructure for the application tier. During the deployment window — the period between deploying to idle and decommissioning the old active environment — both environments are running simultaneously. For large applications on expensive infrastructure, this cost is real. The mitigation is to keep the deployment window short — measured in minutes, not hours — so the cost of running both environments is minimal relative to the operational safety gained.
Weighted traffic routing is a variant that switches traffic gradually rather than all at once — routing 5% of traffic to green, monitoring for 10 minutes, then 20%, then 50%, then 100%. This is effectively a blend of blue-green and canary deployment (covered in Lesson 34). It provides the safety of incremental exposure while retaining the instant rollback capability of blue-green — at any point during the weighted transition, routing 100% back to blue is a single parameter change.
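On an ALB, a weighted step is a forward action with two weighted target groups. Below is a sketch of the `--default-actions` JSON that `aws elbv2 modify-listener` accepts; the target group ARNs are placeholders.

```json
[
  {
    "Type": "forward",
    "ForwardConfig": {
      "TargetGroups": [
        { "TargetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/myapp-blue/example", "Weight": 95 },
        { "TargetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/myapp-green/example", "Weight": 5 }
      ]
    }
  }
]
```

Advancing the rollout means re-issuing the same call with updated weights; rolling back means setting the old environment's weight back to 100.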
Warning: Tearing Down the Old Environment Too Quickly Eliminates the Rollback Option
The entire value of blue-green deployment comes from the old environment remaining available as an instant rollback target. A pipeline that deprovisions the old environment immediately after the traffic switch — to save infrastructure cost — has converted blue-green into a standard deployment with a slightly different sequence. The old environment must remain running for at least as long as the team's defined soak period: the window during which post-deployment monitoring would detect a problem that warrants rollback. For most teams this is 15–60 minutes. For high-traffic production systems with complex behaviour, it may be several hours. Define the soak period explicitly, enforce it in the pipeline, and only then decommission the old environment.
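Soak enforcement can be a simple gate in the pipeline between the traffic switch and the teardown step. The sketch below is a hypothetical helper; the health-check command passed to it is a placeholder.

```shell
#!/usr/bin/env sh
# Hypothetical soak gate: poll a health check at a fixed interval across the
# whole soak window. Any failed poll leaves the deployment provisional and
# should trigger rollback instead of teardown.
soak_gate() {
  polls=$1      # number of health checks across the soak window
  interval=$2   # seconds between checks
  shift 2       # remaining arguments form the health-check command
  n=0
  while [ "$n" -lt "$polls" ]; do
    if ! "$@"; then
      echo "unhealthy"   # abort: the old environment must NOT be torn down
      return 1
    fi
    sleep "$interval"
    n=$((n + 1))
  done
  echo "healthy"         # soak period passed: safe to decommission
}

# e.g. a 30-minute soak at 60-second intervals:
#   soak_gate 30 60 curl -fsS https://myapp.example.com/healthz
```

Only after `soak_gate` reports healthy should the pipeline run the step that decommissions the old environment.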
Key Takeaways from This Lesson
1. Blue-green maintains two identical production environments; the pipeline always deploys to the idle one and verifies it with smoke tests before any traffic moves.
2. The cutover is a single, near-instant traffic switch, and the untouched old environment is the instant rollback target.
3. A shared database forces backwards-compatible migrations: destructive schema changes must be split across releases using the expand-contract pattern.
4. Define and enforce a soak period; tearing down the old environment too early destroys the rollback option that justifies the pattern's cost.
Teacher's Note
Define your soak period before you implement blue-green — "how long do we keep the old environment running after the switch?" is a business risk decision, not a technical one, and having it decided in advance means it will not be skipped under cost pressure during an incident.
Practice Questions
Answer in your own words — then check against the expected answer.
1. What is the term for the defined window of time — typically 15 to 60 minutes — during which post-deployment monitoring runs against the newly active environment, the old environment remains running as a rollback target, and the deployment is considered provisional rather than confirmed?
2. What traffic cutover pattern routes a small initial percentage — such as 5% — of production traffic to the new environment, monitors for errors, then incrementally increases the percentage until 100% of traffic has migrated, combining gradual exposure with the instant full rollback capability of blue-green?
3. When both blue and green share the same database during a blue-green deployment, a destructive schema change cannot be applied in the same release as the application code that removes the reference. What migration pattern handles this by adding the new structure first, migrating data, then removing the old structure in a later release?
Lesson Quiz
1. A team implements blue-green deployment but discovers that production issues are only detected after the traffic switch rather than before it. What step in the deployment sequence are they missing?
2. A blue-green deployment includes a database migration that drops a column the old application version still reads from. The traffic switch completes successfully, but the blue environment immediately starts logging database errors. What constraint was violated?
3. A team implements blue-green deployment and tears down the old environment immediately after the traffic switch to minimise infrastructure costs. Fifteen minutes later, monitoring alerts fire. The team attempts to roll back but discovers there is no running previous environment to switch to. What went wrong?
Up Next · Lesson 34
Canary Deployments
Blue-green switches all traffic at once. Canary goes further — routing a small percentage of real users to the new version first, validating with production traffic before committing to a full rollout.