CI/CD Course
Canary Deployments
In this lesson
Canary deployment is a release strategy that routes a small, controlled percentage of real production traffic to a new version of an application while the majority of traffic continues to the stable version — validating the new version's behaviour with actual users before committing to a full rollout. Unlike blue-green deployment, which switches all traffic at once after pre-switch verification, canary deployment uses production traffic itself as the validation signal. If the canary — the small slice of traffic on the new version — shows elevated error rates, increased latency, or degraded business metrics, the rollout is halted and traffic returns to the stable version. If the canary is healthy, traffic gradually increases until the rollout is complete.
The Canary Mechanics — Gradual Exposure with Real Traffic
The name comes from the historical practice of sending a canary into a coal mine before miners entered — if the canary showed distress, the miners knew the air was dangerous. In deployment terms, the canary is a small subset of users who experience the new version first. Their behaviour and the application's behaviour serving them are the signal that determines whether the rollout proceeds.
A typical canary progression runs in stages: 1% of traffic, then 5%, then 25%, then 50%, then 100%. At each stage the deployment pauses, observability metrics are evaluated against defined thresholds, and the rollout either advances to the next stage or rolls back entirely. The pause duration at each stage — called the bake time — gives the monitoring system time to collect meaningful signal before the decision is made.
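The staged control loop described above can be sketched in a few lines. This is a conceptual illustration only, not how any particular tool implements it: the stage weights, the 10-minute bake time, and the `set_weight` / `check_canary_health` callbacks are all placeholders standing in for the traffic-splitting layer and the observability query.

```python
import time

# Illustrative stage weights (percent of traffic) and bake time.
STAGES = [1, 5, 25, 50, 100]
BAKE_SECONDS = 600  # 10-minute bake at each stage


def run_canary(set_weight, check_canary_health,
               bake_seconds=BAKE_SECONDS, sleep=time.sleep):
    """Advance through the stages, or roll back to 0% on the first failed check."""
    for weight in STAGES:
        set_weight(weight)        # route this share of traffic to the canary
        if weight == 100:         # full rollout reached: done
            return True
        sleep(bake_seconds)       # let the monitoring system collect signal
        if not check_canary_health():
            set_weight(0)         # a metric breached its threshold: roll back
            return False
    return True
```

The key property is that rollback is a single state transition back to 0% canary traffic, so a bad version never progresses past the stage where it first showed trouble.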
The Restaurant New Menu Analogy
A restaurant testing a new dish does not replace every item on the menu simultaneously and wait to see if customers complain. They introduce the new dish to a small number of tables first — observe the reaction, listen to feedback, watch for returns to the kitchen. If those tables are happy, the dish rolls out to the whole restaurant. If not, the kitchen adjusts or pulls it. Canary deployment is the same discipline applied to software: expose a small cohort to the change first, measure their experience, and let the data decide whether to continue the rollout.
Traffic Splitting — Implementation Options
Traffic splitting can be implemented at several layers of the infrastructure stack. The right choice depends on the splitting granularity required, whether user stickiness is needed — some scenarios require the same user to always hit the same version — and what infrastructure is already in place.
Traffic Splitting Implementation Options
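As one concrete option at the ingress layer, the NGINX Ingress Controller supports weight-based canary routing through annotations: a second Ingress resource pointing at the canary Service carries the canary annotations. The resource names (`api-canary`, `api.example.com`) and the 5% weight below are illustrative placeholders.

```yaml
# Illustrative ingress-level traffic split using NGINX Ingress canary annotations.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # send 5% of traffic here
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80
```

Advancing the rollout at this layer means updating the `canary-weight` annotation at each stage; rollback means setting it back to "0" or deleting the canary Ingress.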
Metrics-Based Promotion — Letting Data Decide
The defining feature of a mature canary deployment is metrics-based promotion: the decision to advance to the next traffic percentage — or to roll back — is made automatically based on observed metrics, not by a human watching a dashboard. The pipeline queries the observability platform after each bake period and compares the canary's metrics against the stable version's baseline. If the canary's error rate is within threshold and latency has not regressed, the pipeline advances the rollout. If any metric breaches its threshold, the pipeline triggers an automatic rollback to 0% canary traffic.
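The promotion decision itself reduces to a pure comparison of metrics against thresholds. A minimal sketch, assuming the metric values have already been fetched from the observability platform after the bake period; the threshold values and dict keys are hypothetical and would come from your SLOs.

```python
# Hypothetical thresholds — real values come from your service's SLOs.
MAX_ERROR_RATE = 0.01           # canary error rate must stay below 1%
MAX_LATENCY_REGRESSION = 1.10   # canary p99 may be at most 10% above baseline


def should_promote(canary, baseline):
    """Return True if the canary's metrics permit advancing the rollout.

    `canary` and `baseline` are dicts with 'error_rate' and 'p99_latency'
    values, e.g. queried from Prometheus after the bake period.
    """
    if canary["error_rate"] >= MAX_ERROR_RATE:
        return False
    if canary["p99_latency"] > baseline["p99_latency"] * MAX_LATENCY_REGRESSION:
        return False
    return True
```

Note that latency is judged relative to the stable baseline rather than against an absolute number, so the check remains meaningful as normal traffic conditions shift.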
Automated Canary Pipeline — Argo Rollouts
# argo-rollout.yaml — defines the canary strategy for the API deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5              # Step 1 — route 5% of traffic to canary
        - pause: {duration: 10m}    # Bake for 10 minutes
        - analysis:                 # Evaluate metrics before proceeding
            templates:
              - templateName: error-rate-check
        - setWeight: 25             # Step 2 — advance to 25% if analysis passes
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 100            # Step 3 — full rollout if all checks pass
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01  # Fail if error rate exceeds 1%
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5..",version="canary"}[5m]))
            /
            sum(rate(http_requests_total{version="canary"}[5m]))
What just happened?
Argo Rollouts manages the entire canary progression automatically. It routes 5% of traffic to the new version, waits 10 minutes, then queries Prometheus to verify the canary's error rate is below 1%. If it passes, it advances to 25%, bakes again, checks again, then completes the rollout to 100%. If the error rate exceeds 1% at any stage, Argo Rollouts automatically sets the canary weight back to 0% and alerts the team — no human intervention required to stop a bad deployment.
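For observing or intervening in a rollout manually, Argo Rollouts provides a kubectl plugin. The commands below are the standard plugin subcommands; the rollout name `api` matches the manifest above.

```shell
# Watch the rollout progress through its canary steps (live view).
kubectl argo rollouts get rollout api --watch

# Manually advance past the current pause or analysis step if needed.
kubectl argo rollouts promote api

# Abort the rollout — shifts all traffic back to the stable version.
kubectl argo rollouts abort api
```

In a healthy setup these are escape hatches, not the normal path: the analysis steps should make the promote/abort decision automatically.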
Canary vs Blue-Green — Choosing the Right Strategy
Blue-green and canary are complementary strategies, not competing alternatives. They answer different questions. Blue-green answers: "Is this version fundamentally healthy?" — verified against a pre-production copy of the environment before any real user sees it. Canary answers: "Does this version behave correctly with real production traffic at scale?" — validated by routing actual users to it incrementally.
Canary vs Blue-Green — Decision Reference
Warning: A Canary Without Defined Metrics Thresholds Is Just a Slow Rollout
Canary deployment without pre-defined success and failure metrics is not a safety mechanism — it is a deployment process that takes longer. If the decision to advance from 5% to 25% is made by a human looking at a dashboard and deciding it "looks okay," the canary is providing false confidence. The value of canary deployment is the automated, objective comparison of the canary's metrics against the stable baseline. Without explicit thresholds — error rate below X%, p99 latency below Yms, conversion rate within Z% of baseline — the canary cannot make an automated promotion or rollback decision, and the safety guarantee disappears. Define thresholds before implementing canary, not after.
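In Argo Rollouts terms, each explicit threshold becomes another AnalysisTemplate referenced by the canary steps. A sketch of a latency gate alongside the error-rate check shown earlier; the 500ms limit is a placeholder to be derived from your SLOs, and the histogram metric name assumes a standard Prometheus HTTP duration histogram.

```yaml
# Illustrative p99 latency gate — threshold and metric names are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency-check
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      successCondition: result[0] < 0.5   # fail if p99 exceeds 500ms
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m])) by (le))
```

Each additional gate narrows the class of regression that can slip through to full rollout, at the cost of a slightly longer analysis step.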
Key Takeaways from This Lesson
Teacher's Note
Start your canary at 1%, not 10% — the cost of a bad deployment hitting 1% of users is an order of magnitude lower than 10%, and the signal from 1% is sufficient to detect most critical regressions within a reasonable bake time.
Practice Questions
Answer in your own words — then check against the expected answer.
1. What is the term for the pause period at each traffic percentage stage of a canary deployment — the window during which the monitoring system collects metrics from the canary before the automated decision to advance or roll back is made?
2. What Kubernetes-native tool manages the full canary deployment lifecycle — traffic weight progression, Prometheus metric evaluation at each stage, automatic promotion on success, and automatic rollback on metric threshold breach — without requiring pipeline intervention at each step?
3. What is the practice — central to mature canary deployment — where the decision to advance a rollout to the next traffic percentage is made automatically by comparing the canary's observed metrics against pre-defined thresholds, rather than by a human reviewing dashboards?
Lesson Quiz
1. A change introduces a subtle performance regression that only manifests under the specific traffic patterns of real production users — it passes all pre-production tests. Which deployment strategy would catch this before it affects all users, and why?
2. A team implements canary deployment by routing 5% of traffic to the new version, having an engineer watch Datadog for 10 minutes, and then manually advancing to 100% if it "looks okay." Why does this approach not deliver the safety guarantee of a true canary deployment?
3. A deployment includes a migration that drops a database column the previous application version still reads from. The team is deciding between canary and blue-green. Which strategy is more appropriate for this release, and why?
Up Next · Lesson 35
Feature Flags
Canary uses traffic splitting to control exposure. Feature flags go further — decoupling deployment from release entirely, so code ships to production before users ever see the feature.