Kubernetes Course
Scaling Applications
Kubernetes was built for this. Scaling from 3 Pods to 50 is a single command — or a single YAML change — and rolling back to 3 is equally trivial. This lesson covers manual scaling, the Horizontal Pod Autoscaler, and the patterns that let production clusters absorb 10x traffic spikes without a single page to on-call.
Horizontal vs Vertical Scaling
There are two directions to scale a containerised application. Horizontal scaling adds more Pods — more instances of the same container running in parallel. Vertical scaling gives existing Pods more CPU and memory. Kubernetes excels at horizontal scaling. Vertical scaling (covered in Lesson 50 via the VPA) requires a restart and is harder to automate safely.
↔ Horizontal Scaling (scale out)
More Pods. Each Pod handles a share of traffic. No downtime. Linear capacity increase. The Kubernetes-native approach.
3 Pods → 10 Pods → 3 Pods
↕ Vertical Scaling (scale up)
More CPU/memory per Pod. Requires a restart to take effect. Has physical limits. Works for apps that don't scale horizontally (stateful, single-threaded).
500m CPU → 2000m CPU (restart required)
The Kubernetes scaling contract: For horizontal scaling to work, your application must be stateless — any Pod can handle any request. If your app stores session state in memory, scaling to 10 Pods means a user's session might be on Pod 3 but their next request goes to Pod 7. Use Redis or a database for session state, keep your Pods stateless, and horizontal scaling becomes trivial.
Manual Scaling
The scenario: It's 11:45 PM on a Friday. Your company just got featured in a major publication. Traffic is spiking hard — your checkout API is running 3 Pods and CPU is pegged at 90%. Your SLO is 99.9% and you're watching error rates climb. You need to scale right now, before the weekend traffic surge takes the service down. Here's the sequence of commands you run.
kubectl get deployment checkout-api -n production
# First: confirm the current state — how many replicas are running and how many are ready
kubectl scale deployment checkout-api --replicas=12 -n production
# scale: immediately change the replica count to 12
# The ReplicaSet controller starts spinning up 9 new Pods instantly
# --replicas=12: the new desired replica count
kubectl get pods -n production -l app=checkout-api -w
# -w: watch mode — streams live updates as Pod status changes
# You'll see new Pods appear in ContainerCreating → Running in real time
kubectl rollout status deployment/checkout-api -n production
# rollout status: blocks until all 12 replicas are Running and Ready
# Exit code 0 when complete — useful in scripts to wait for scale-out to finish
kubectl top pods -n production -l app=checkout-api
# top: check actual CPU/memory after scaling to confirm load is distributed
# Each Pod should now be at roughly 1/4 of its previous CPU usage
$ kubectl get deployment checkout-api -n production
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
checkout-api   3/3     3            3           14d

$ kubectl scale deployment checkout-api --replicas=12 -n production
deployment.apps/checkout-api scaled

$ kubectl get pods -n production -l app=checkout-api -w
NAME                        READY   STATUS              RESTARTS   AGE
checkout-api-6f8b9d-2xkpj   1/1     Running             0          14d
checkout-api-6f8b9d-7rvqn   1/1     Running             0          14d
checkout-api-6f8b9d-m4czl   1/1     Running             0          14d
checkout-api-6f8b9d-4pkrn   0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-8wnqx   0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-b2zjl   0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-c9vmp   0/1     Pending             0          2s
...
checkout-api-6f8b9d-4pkrn   1/1     Running             0          18s
checkout-api-6f8b9d-8wnqx   1/1     Running             0          19s

$ kubectl top pods -n production -l app=checkout-api
NAME                        CPU(cores)   MEMORY(bytes)
checkout-api-6f8b9d-2xkpj   78m          112Mi
checkout-api-6f8b9d-7rvqn   81m          108Mi
checkout-api-6f8b9d-m4czl   76m          115Mi
checkout-api-6f8b9d-4pkrn   72m          104Mi
What just happened?
Instant scale-out — The moment kubectl scale ran, the Deployment controller updated the ReplicaSet's desired count to 12. The ReplicaSet controller immediately started creating 9 new Pods. The existing 3 Pods never restarted — they kept serving traffic throughout. New Pods began receiving traffic as soon as their readiness probes passed.
CPU distributed — Before scaling, each of the 3 Pods was handling ~300m CPU (~900m total). After scaling to 12, the same traffic is spread across 12 Pods — each handling ~75m. The application is now comfortably within its limits, and error rates drop as the load distributes.
Reconcile your YAML — After any manual scale, update the replicas: field in your deployment YAML and commit it. Otherwise, the next time someone applies the old manifest from Git, the Deployment will scale back down to 3. The cluster state and the repo must stay in sync.
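One way to catch that drift before it bites is kubectl diff. A short sketch, assuming the manifest lives in a file called checkout-deployment.yaml:

```shell
# Compare the live Deployment against the manifest in the repo.
# Exit code 0: no drift. Exit code 1: the live object differs from the file
# (e.g. replicas: 3 in Git vs 12 live after a manual scale).
kubectl diff -f checkout-deployment.yaml

# After updating replicas: in the file and merging, re-apply to reconcile:
kubectl apply -f checkout-deployment.yaml
```

Running the diff in CI against every manifest is a cheap way to surface forgotten manual scales before they get silently reverted.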
Declarative Scaling: Replicas in the Manifest
Manual scaling is for emergencies. The declarative approach — changing replicas in the Deployment manifest and applying it — is how production scaling decisions should normally happen. It's reviewable, auditable, and version-controlled.
The scenario: The Friday spike has passed. You've reviewed the traffic patterns and decided the checkout API should run 6 replicas during business hours as the new baseline — up from 3. You're updating the manifest, opening a pull request, getting it reviewed by the team, and merging.
apiVersion: apps/v1
kind: Deployment
metadata:
name: checkout-api
namespace: production
annotations:
scaling-rationale: "Increased from 3 to 6 after Friday traffic analysis" # Document WHY
last-scaled: "2025-03-14"
spec:
replicas: 6 # Changed from 3 — new production baseline
selector:
matchLabels:
app: checkout-api
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 # At most 1 Pod unavailable during a rollout or scale event
maxSurge: 2 # Allow 2 extra Pods above desired count during scale-up
# maxSurge helps scale-up go faster — more Pods can be created
# before the old ones are removed
template:
metadata:
labels:
app: checkout-api
spec:
containers:
- name: checkout-api
image: company/checkout-api:2.3.0
ports:
- containerPort: 3000
resources:
requests:
cpu: "150m"
memory: "200Mi"
limits:
cpu: "500m"
memory: "350Mi"
$ kubectl apply -f checkout-deployment.yaml
deployment.apps/checkout-api configured

$ kubectl get deployment checkout-api -n production
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
checkout-api   6/6     6            6           14d

$ kubectl describe deployment checkout-api -n production | grep -E "Replicas:|RollingUpdate"
Replicas:               6 desired | 6 updated | 6 total | 6 available | 0 unavailable
StrategyType:           RollingUpdate
RollingUpdateStrategy:  1 max unavailable, 2 max surge
What just happened?
maxUnavailable and maxSurge — These two fields on rollingUpdate control how the Deployment scales and rolls out. maxUnavailable: 1 means at most 1 Pod can be unavailable during any transition. maxSurge: 2 means up to 2 extra Pods can exist temporarily above the desired count. During a pure scale event (no image change), maxSurge lets the new Pods spin up quickly before the count settles at 6.
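The two fields put hard bounds on how many Pods can exist at any moment during a transition. A quick sketch of the arithmetic for this manifest (replicas: 6, maxUnavailable: 1, maxSurge: 2):

```shell
# Bounds on Pod count during a rolling update, given the Deployment's settings.
replicas=6
max_unavailable=1
max_surge=2

floor=$(( replicas - max_unavailable ))   # fewest Pods that may be available: 5
ceiling=$(( replicas + max_surge ))       # most Pods that may exist at once: 8

echo "Pod count stays between $floor available and $ceiling total during a rollout"
```

With absolute numbers like these, the bounds are fixed; both fields also accept percentages (e.g. maxSurge: 25%), in which case Kubernetes computes the bounds from the current replica count.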
Annotations for operational context — Adding scaling-rationale and last-scaled annotations to the Deployment metadata gives future team members context for why the replica count is what it is. Six months from now, someone will wonder why this service runs 6 replicas instead of 2 — the annotation answers it without a Slack archaeology expedition.
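Annotations can also be set imperatively, which is handy when you want to record the rationale during an incident and reconcile the manifest afterwards. A sketch using the same annotation keys as the manifest above (pick whatever keys your team standardises on):

```shell
# Imperatively set the operational annotations on the live Deployment.
# --overwrite replaces the value if the annotation already exists.
kubectl annotate deployment checkout-api -n production \
  scaling-rationale="Increased from 3 to 6 after Friday traffic analysis" \
  last-scaled="2025-03-14" \
  --overwrite
```

The declarative manifest remains the source of truth; an imperative annotate is a stopgap until the YAML catches up.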
Horizontal Pod Autoscaler (HPA)
Manual scaling is reactive — a human notices the problem and acts. The Horizontal Pod Autoscaler takes the human out of the loop — Kubernetes adjusts the replica count automatically based on real-time metrics. When CPU crosses a threshold, the HPA adds Pods. When load drops, it removes them. Nobody needs to wake up at 3am.
HPA requires the metrics-server to be installed in the cluster — it's the component that serves CPU and memory usage data to the HPA controller. GKE and AKS include it by default; on EKS you typically have to install it yourself.
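A quick way to confirm metrics-server is present and healthy before relying on an HPA (assuming it runs in kube-system, its usual location):

```shell
# Is metrics-server deployed and ready?
kubectl get deployment metrics-server -n kube-system

# Does the metrics API actually answer? If this errors, every HPA in the
# cluster will show <unknown> in its TARGETS column.
kubectl top nodes
```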
The scenario: You're a platform engineer at a B2B SaaS company. The API service sees predictable but severe daily spikes — quiet at night, high traffic during business hours across multiple timezones. You want the service to scale from a minimum of 3 Pods to a maximum of 20 Pods automatically, based on CPU utilisation. When CPU per Pod averages above 60%, add Pods. When it drops below 60%, remove them.
apiVersion: autoscaling/v2 # HPA v2: supports multiple metrics and more control
kind: HorizontalPodAutoscaler
metadata:
name: checkout-api-hpa # Name of the HPA object
namespace: production
spec:
scaleTargetRef: # Which workload to autoscale
apiVersion: apps/v1
kind: Deployment # Target a Deployment (could also be StatefulSet)
name: checkout-api # The Deployment name to control
minReplicas: 3 # Never scale below 3 Pods — baseline availability
maxReplicas: 20 # Never scale above 20 Pods — cost ceiling
metrics: # metrics: what to measure to make scaling decisions
- type: Resource # Resource: built-in CPU/memory metrics from metrics-server
resource:
name: cpu # Which resource to monitor
target:
type: Utilization # Utilization: as a percentage of the container's CPU request
averageUtilization: 60 # Target: keep average CPU utilisation at 60% across all Pods
# Below 60%: HPA may scale down
# Above 60%: HPA scales up by adding Pods
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 70 # Also scale on memory — if avg usage > 70% of request, scale up
behavior: # behavior: control the speed of scale-up and scale-down
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately — don't wait for a stabilisation window
policies:
- type: Pods
value: 4 # Add at most 4 Pods per scaling interval
periodSeconds: 15 # Scaling interval: every 15 seconds
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 minutes of low traffic before scaling down
# Prevents thrashing — scaling down too fast, then back up
policies:
- type: Percent
value: 10 # Remove at most 10% of Pods per scaling interval
periodSeconds: 60 # Scale down interval: every 60 seconds
$ kubectl apply -f checkout-hpa.yaml
horizontalpodautoscaler.autoscaling/checkout-api-hpa created

$ kubectl get hpa -n production
NAME               REFERENCE                 TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
checkout-api-hpa   Deployment/checkout-api   42%/60%, 55%/70%   3         20        6          2m

$ kubectl describe hpa checkout-api-hpa -n production
Name:            checkout-api-hpa
Namespace:       production
Reference:       Deployment/checkout-api
Metrics:         ( current / target )
  resource cpu on pods (as a percentage of request):     42% (63m) / 60%
  resource memory on pods (as a percentage of request):  55% (110Mi) / 70%
Min replicas:    3
Max replicas:    20
Deployment pods: 6 current / 6 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Normal  SuccessfulRescale  5m  horizontal-pod-autoscaler  New size: 8; reason: cpu resource above target
What just happened?
How HPA calculates replicas — The HPA controller runs every 15 seconds. It fetches current CPU usage from the metrics server, calculates the average utilisation across all current Pods, and computes the desired replica count: desiredReplicas = ceil(currentReplicas × (currentUtilization / targetUtilization)). If 6 Pods are running at 90% CPU and the target is 60%, the formula gives: ceil(6 × 90/60) = ceil(9) = 9. The HPA scales to 9 Pods.
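The formula is plain arithmetic. Here's the worked example above as a runnable sketch, using the usual shell trick for ceiling division:

```shell
# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
current_replicas=6
current_util=90    # average CPU utilisation across Pods, percent
target_util=60     # HPA target, percent

# Integer ceiling division: ceil(a/b) == (a + b - 1) / b
desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))

echo "$desired"    # 9
```

The ceiling matters: at 6 Pods and 65% utilisation the raw ratio is 6.5, and the HPA rounds up to 7 rather than leaving the service over target.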
stabilizationWindowSeconds for scale-down — The 5-minute stabilisation window prevents the classic "scale down too fast, traffic spikes again, scale back up" thrash cycle. The HPA waits until the metric has been below threshold consistently for 5 minutes before removing Pods. Scale-up has no window (0s) — you always want to add capacity immediately when traffic spikes.
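Mechanically, the scale-down window works by taking the highest replica recommendation computed during the window, so a brief dip cannot trigger removal. A small sketch with hypothetical recommendation samples:

```shell
# Desired-replica recommendations computed over the last 5 minutes
# (hypothetical samples). The HPA scales down only to the MAXIMUM of
# these, so the momentary dip to 3 is ignored while any recent sample
# still says 6.
recommendations="6 4 3 5"

max=0
for r in $recommendations; do
  if [ "$r" -gt "$max" ]; then max=$r; fi
done

echo "scale-down target: $max"   # 6, not 3
```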
TARGETS column in kubectl get hpa — 42%/60% means current average CPU utilisation is 42%, target is 60%. The cluster is healthy and has headroom. If this showed 85%/60%, the HPA would be actively scaling up right now.
HPA requires resource requests to be set — The Utilization target is a percentage of the container's CPU request. If you haven't set resources.requests.cpu on the container, the HPA has no baseline to calculate against and will show <unknown>/60% in the TARGETS column.
HPA in Action: The Full Scaling Lifecycle
Here's how the HPA and Deployment interact across a full traffic spike and recovery cycle:
[Diagram: HPA Lifecycle — Traffic Spike and Recovery. Traffic rises and average CPU crosses the 60% target; the HPA adds Pods immediately (scaleUp.stabilizationWindowSeconds: 0). When traffic falls, the HPA waits out the 5-minute scale-down stabilisation window before removing Pods.]

Checking HPA Status and Troubleshooting
The scenario: A developer reports that the HPA isn't scaling up even though they can see high CPU. You need to diagnose why the HPA isn't firing and find the root cause.
kubectl get hpa -n production
# Quick overview: TARGETS, MINPODS, MAXPODS, REPLICAS
# If TARGETS shows <unknown>/60%, the HPA can't read metrics — likely missing resource requests or a broken metrics-server
kubectl describe hpa checkout-api-hpa -n production
# Full detail: conditions, recent events, current metric values
# Look at Conditions section — ScalingActive: False means HPA is not scaling
# Check the Message field in Conditions for the exact reason
kubectl get hpa checkout-api-hpa -n production -o yaml
# Full YAML dump — see the complete status including lastScaleTime and currentMetrics
kubectl top pods -n production -l app=checkout-api
# Verify metrics-server is returning data
# If kubectl top returns error, metrics-server is not installed or not healthy
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/production/pods | python3 -m json.tool
# Raw metrics API query — confirms metrics-server is serving data for these pods
# If this 404s, metrics-server is down
$ kubectl get hpa -n production
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
checkout-api-hpa Deployment/checkout-api <unknown>/60% 3 20 3 8m
$ kubectl describe hpa checkout-api-hpa -n production
...
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get
cpu utilization: missing request for cpu
...
$ kubectl get deployment checkout-api -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'
{}
(problem found: no resources block in the Deployment spec — HPA has no baseline to calculate against)

What just happened?
The most common HPA failure: missing resource requests — The HPA Conditions section gave the exact answer: missing request for cpu. CPU utilisation percentage is calculated as actual usage divided by the CPU request. If there's no request, the denominator is zero — the calculation is undefined. The HPA can't function. Add resources.requests.cpu to the Deployment container spec and the HPA springs to life.
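The fix can be applied by editing the manifest (preferred, so Git stays in sync) or imperatively. A sketch of the imperative version, using the request values from the Deployment shown earlier:

```shell
# Add CPU/memory requests to the container so the HPA has a denominator.
# Note: this changes the Pod template and triggers a rolling restart.
kubectl set resources deployment checkout-api -n production \
  --requests=cpu=150m,memory=200Mi

# Within a minute or so the TARGETS column should replace <unknown>
# with a real percentage.
kubectl get hpa checkout-api-hpa -n production
```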
ScalingActive: False — The Conditions section of kubectl describe hpa is the first place to look when an HPA isn't behaving. ScalingActive: False with a reason and message tells you exactly what's broken. Other common reasons include FailedGetScale (RBAC issue), InvalidSelector (label selector mismatch), and SelectorRequired (Deployment has no selector).
Teacher's Note: Don't fight the HPA — design for it
The HPA works best when your application has a roughly linear relationship between request rate and CPU usage. Stateless HTTP APIs are ideal candidates. Where teams run into trouble is when they set the CPU target too high (90%+) — leaving no headroom for the HPA to respond before the service is already saturated. Setting the target at 60–70% means new Pods are spinning up while you still have 30–40% headroom. By the time the new Pods are ready and pass readiness probes (10–30 seconds), the existing Pods haven't fallen over.
The behavior.scaleDown.stabilizationWindowSeconds: 300 setting matters for any service that sees bursty traffic. (300 seconds is also the controller's default downscale stabilisation window; setting it explicitly documents the intent and protects you if the cluster-wide default is ever tuned down.) With too short a window, the HPA will aggressively scale down after each spike, only to scale back up seconds later when the next request burst arrives. This is called thrashing, and it causes exactly the kind of latency spikes you were trying to avoid.
One more thing: the HPA and manual kubectl scale fight each other. If you manually scale to 12 during an incident, the HPA will eventually bring it back down to whatever it calculates as the right number based on metrics. That's usually the right behaviour — but be aware of it when responding to incidents.
Practice Questions
1. Write the kubectl command to immediately scale the checkout-api Deployment in the production namespace to 10 replicas.
2. An HPA shows <unknown>/60% in the TARGETS column and ScalingActive: False with reason missing request for cpu. What field is missing from the Deployment container spec?
3. What HPA behavior field prevents the autoscaler from scaling down immediately after a traffic spike ends — avoiding the thrash cycle of scaling down and then back up?
Quiz
1. An HPA has targetUtilization: 50. Currently 4 Pods are running at 80% average CPU. How many replicas will the HPA calculate as the desired count?
2. You manually scale a Deployment to 15 replicas during an incident, but an HPA is also configured for it with maxReplicas: 20. Once the incident is over and CPU returns to normal, what happens?
3. What does rollingUpdate.maxSurge: 2 control in a Deployment spec?
Up Next · Lesson 25
Rolling Updates
How Kubernetes replaces old Pods with new ones without dropping a single request — and the exact parameters that control how fast and how safe that process is.