Kubernetes Lesson 24 – Scaling Applications | Dataplexa
Core Kubernetes Concepts · Lesson 24

Scaling Applications

Kubernetes was built for this. Scaling from 3 Pods to 50 is a single command — or a single YAML change — and rolling back to 3 is equally trivial. This lesson covers manual scaling, the Horizontal Pod Autoscaler, and the patterns that let production clusters absorb 10x traffic spikes without a single page to on-call.

Horizontal vs Vertical Scaling

There are two directions to scale a containerised application. Horizontal scaling adds more Pods — more instances of the same container running in parallel. Vertical scaling gives existing Pods more CPU and memory. Kubernetes excels at horizontal scaling. Vertical scaling (covered in Lesson 50 via the VPA) requires a restart and is harder to automate safely.

↔ Horizontal Scaling (scale out)

More Pods. Each Pod handles a share of traffic. No downtime. Linear capacity increase. The Kubernetes-native approach.

3 Pods → 10 Pods → 3 Pods

↕ Vertical Scaling (scale up)

More CPU/memory per Pod. Requires a restart to take effect. Has physical limits. Works for apps that don't scale horizontally (stateful, single-threaded).

500m CPU → 2000m CPU (restart required)

The Kubernetes scaling contract: For horizontal scaling to work, your application must be stateless — any Pod can handle any request. If your app stores session state in memory, scaling to 10 Pods means a user's session might be on Pod 3 but their next request goes to Pod 7. Use Redis or a database for session state, keep your Pods stateless, and horizontal scaling becomes trivial.
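To make the contract concrete, here is a minimal Python sketch of the pattern: session state lives in a shared store (a plain dict stands in for Redis here), so any Pod can serve any request. All Pod and session names are illustrative.

```python
# Shared session store -- a dict standing in for Redis or a database.
# Because state lives OUTSIDE the Pods, any Pod can resume any session.
session_store = {}

def handle_request(pod_name, session_id):
    """Stateless handler: reads and writes session state in the shared store."""
    count = session_store.get(session_id, 0) + 1
    session_store[session_id] = count
    return f"{pod_name} served request #{count} for session {session_id}"

# The same user's requests land on different Pods, and still work:
print(handle_request("pod-3", "user-42"))
print(handle_request("pod-7", "user-42"))
```

The moment the handler stops touching process memory for state, adding or removing Pods becomes a pure capacity decision.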

Manual Scaling

The scenario: It's 11:45 PM on a Friday. Your company just got featured in a major publication. Traffic is spiking hard — your checkout API is running 3 Pods and CPU is pegged at 90%. Your SLO is 99.9% and you're watching error rates climb. You need to scale right now, before weekend traffic takes the business down. Here's the sequence of commands you run.

kubectl get deployment checkout-api -n production
# First: confirm the current state — how many replicas are running and how many are ready

kubectl scale deployment checkout-api --replicas=12 -n production
# scale: immediately change the replica count to 12
# The ReplicaSet controller starts spinning up 9 new Pods instantly
# --replicas=12: the new desired replica count

kubectl get pods -n production -l app=checkout-api -w
# -w: watch mode — streams live updates as Pod status changes
# You'll see new Pods appear in ContainerCreating → Running in real time

kubectl rollout status deployment/checkout-api -n production
# rollout status: blocks until all 12 replicas are Running and Ready
# Exit code 0 when complete — useful in scripts to wait for scale-out to finish

kubectl top pods -n production -l app=checkout-api
# top: check actual CPU/memory after scaling to confirm load is distributed
# Each Pod should now be at roughly 1/4 of its previous CPU usage
$ kubectl get deployment checkout-api -n production
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
checkout-api   3/3     3            3           14d

$ kubectl scale deployment checkout-api --replicas=12 -n production
deployment.apps/checkout-api scaled

$ kubectl get pods -n production -l app=checkout-api -w
NAME                             READY   STATUS              RESTARTS   AGE
checkout-api-6f8b9d-2xkpj        1/1     Running             0          14d
checkout-api-6f8b9d-7rvqn        1/1     Running             0          14d
checkout-api-6f8b9d-m4czl        1/1     Running             0          14d
checkout-api-6f8b9d-4pkrn        0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-8wnqx        0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-b2zjl        0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-c9vmp        0/1     Pending             0          2s
...
checkout-api-6f8b9d-4pkrn        1/1     Running             0          18s
checkout-api-6f8b9d-8wnqx        1/1     Running             0          19s

$ kubectl top pods -n production -l app=checkout-api
NAME                             CPU(cores)   MEMORY(bytes)
checkout-api-6f8b9d-2xkpj        78m          112Mi
checkout-api-6f8b9d-7rvqn        81m          108Mi
checkout-api-6f8b9d-m4czl        76m          115Mi
checkout-api-6f8b9d-4pkrn        72m          104Mi

What just happened?

Instant scale-out — The moment kubectl scale ran, the Deployment controller updated the ReplicaSet's desired count to 12. The ReplicaSet controller immediately started creating 9 new Pods. The existing 3 Pods never restarted — they kept serving traffic throughout. New Pods began receiving traffic as soon as their readiness probes passed.

CPU distributed — Before scaling, each of the 3 Pods was handling ~300m CPU (~900m total). After scaling to 12, the same traffic is spread across 12 Pods — each handling ~75m. The application is now comfortably within its limits, and error rates drop as the load distributes.

Reconcile your YAML — After any manual scale, update the replicas: field in your deployment YAML and commit it. Otherwise, the next time someone applies the old YAML from Git, the Deployment scales back to 3. Cluster state and the repo must stay in sync.

Declarative Scaling: Replicas in the Manifest

Manual scaling is for emergencies. The declarative approach — changing replicas in the Deployment manifest and applying it — is how production scaling decisions should normally happen. It's reviewable, auditable, and version-controlled.

The scenario: The Friday spike has passed. You've reviewed the traffic patterns and decided the checkout API should run 6 replicas during business hours as the new baseline — up from 3. You're updating the manifest, opening a pull request, getting it reviewed by the team, and merging.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: production
  annotations:
    scaling-rationale: "Increased from 3 to 6 after Friday traffic analysis"  # Document WHY
    last-scaled: "2025-03-14"
spec:
  replicas: 6                           # Changed from 3 — new production baseline
  selector:
    matchLabels:
      app: checkout-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1                 # At most 1 Pod below desired count during a rollout
      maxSurge: 2                       # Allow 2 extra Pods above desired count during a rollout
                                        # Note: the strategy applies to rollouts (Pod template
                                        # changes such as a new image), not to pure replica changes
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: company/checkout-api:2.3.0
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "150m"
              memory: "200Mi"
            limits:
              cpu: "500m"
              memory: "350Mi"
$ kubectl apply -f checkout-deployment.yaml
deployment.apps/checkout-api configured

$ kubectl get deployment checkout-api -n production
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
checkout-api   6/6     6            6           14d

$ kubectl describe deployment checkout-api -n production | grep -E "Replicas:|RollingUpdate"
Replicas:               6 desired | 6 updated | 6 total | 6 available | 0 unavailable
StrategyType:           RollingUpdate
RollingUpdateStrategy:  1 max unavailable, 2 max surge

What just happened?

maxUnavailable and maxSurge — These two fields on rollingUpdate control how the Deployment replaces Pods during a rollout. maxUnavailable: 1 means at most 1 Pod can be below the desired count during the transition. maxSurge: 2 means up to 2 extra Pods can exist temporarily above it, so new Pods come up before old ones are torn down. Note that the strategy governs rollouts — Pod template changes such as a new image — not pure scale events: changing replicas from 3 to 6 simply creates 3 new Pods directly, with no surge logic involved.
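The bounds these two fields impose during a rollout can be computed directly. A quick sketch using the numbers from the manifest above (desired 6, maxUnavailable 1, maxSurge 2):

```python
# During a rolling update, total Pod count stays within:
#   [desired - maxUnavailable, desired + maxSurge]
desired, max_unavailable, max_surge = 6, 1, 2

min_pods = desired - max_unavailable   # never fewer than 5 Pods serving
max_pods = desired + max_surge         # never more than 8 Pods running
print(min_pods, max_pods)  # 5 8
```

Tighter bounds mean a slower, safer rollout; looser bounds trade temporary capacity swings for speed.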

Annotations for operational context — Adding scaling-rationale and last-scaled annotations to the Deployment metadata gives future team members context for why the replica count is what it is. Six months from now, someone will wonder why this service runs 6 replicas instead of 2 — the annotation answers it without a Slack archaeology expedition.

Horizontal Pod Autoscaler (HPA)

Manual scaling is reactive — a human notices the problem and acts. The Horizontal Pod Autoscaler is proactive — Kubernetes automatically adjusts the replica count based on real-time metrics. When CPU crosses a threshold, HPA adds Pods. When load drops, HPA removes them. Nobody needs to wake up at 3am.

HPA requires the metrics-server to be installed in the cluster — it's the component that feeds CPU and memory usage data to the HPA controller. Some managed Kubernetes services include it by default (GKE and AKS do); on EKS you typically install it yourself.

The scenario: You're a platform engineer at a B2B SaaS company. The API service sees predictable but severe daily spikes — quiet at night, high traffic during business hours across multiple timezones. You want the service to scale from a minimum of 3 Pods to a maximum of 20 Pods automatically, based on CPU utilisation. When CPU per Pod averages above 60%, add Pods. When it drops below 60%, remove them.

apiVersion: autoscaling/v2               # HPA v2: supports multiple metrics and more control
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api-hpa                 # Name of the HPA object
  namespace: production
spec:
  scaleTargetRef:                        # Which workload to autoscale
    apiVersion: apps/v1
    kind: Deployment                     # Target a Deployment (could also be StatefulSet)
    name: checkout-api                   # The Deployment name to control
  minReplicas: 3                         # Never scale below 3 Pods — baseline availability
  maxReplicas: 20                        # Never scale above 20 Pods — cost ceiling
  metrics:                               # metrics: what to measure to make scaling decisions
    - type: Resource                     # Resource: built-in CPU/memory metrics from metrics-server
      resource:
        name: cpu                        # Which resource to monitor
        target:
          type: Utilization              # Utilization: as a percentage of the container's CPU request
          averageUtilization: 60         # Target: keep average CPU utilisation at 60% across all Pods
                                         # Below 60%: HPA may scale down
                                         # Above 60%: HPA scales up by adding Pods
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70         # Also scale on memory — if avg usage > 70% of request, scale up
  behavior:                              # behavior: control the speed of scale-up and scale-down
    scaleUp:
      stabilizationWindowSeconds: 0      # Scale up immediately — don't wait for a stabilisation window
      policies:
        - type: Pods
          value: 4                       # Add at most 4 Pods per scaling interval
          periodSeconds: 15              # Scaling interval: every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300    # Wait 5 minutes of low traffic before scaling down
                                         # Prevents thrashing — scaling down too fast, then back up
      policies:
        - type: Percent
          value: 10                      # Remove at most 10% of Pods per scaling interval
          periodSeconds: 60              # Scale down interval: every 60 seconds
$ kubectl apply -f checkout-hpa.yaml
horizontalpodautoscaler.autoscaling/checkout-api-hpa created

$ kubectl get hpa -n production
NAME               REFERENCE                TARGETS          MINPODS   MAXPODS   REPLICAS   AGE
checkout-api-hpa   Deployment/checkout-api  42%/60%, 55%/70%   3         20        6          2m

$ kubectl describe hpa checkout-api-hpa -n production
Name:                                                  checkout-api-hpa
Namespace:                                             production
Reference:                                             Deployment/checkout-api
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  42% (63m) / 60%
  resource memory on pods (as a percentage of request): 55% (110Mi) / 70%
Min replicas:                                          3
Max replicas:                                          20
Deployment pods:                                       6 current / 6 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Normal  SuccessfulRescale  5m  horizontal-pod-autoscaler  New size: 8; reason: cpu resource above target

What just happened?

How HPA calculates replicas — The HPA controller runs every 15 seconds. It fetches current CPU usage from the metrics server, calculates the average utilisation across all current Pods, and computes the desired replica count: desiredReplicas = ceil(currentReplicas × (currentUtilization / targetUtilization)). If 6 Pods are running at 90% CPU and the target is 60%, the formula gives: ceil(6 × 90/60) = ceil(9) = 9. The HPA scales to 9 Pods.
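The calculation can be sketched in a few lines of Python. Note the real controller also applies a small tolerance band (roughly 10% by default) before acting, and clamps the result to the configured bounds — both are modelled here as a sketch, not the controller's exact code:

```python
import math

def desired_replicas(current, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """desired = ceil(current * currentUtil / targetUtil), clamped to bounds.

    If the usage ratio is within the tolerance band around 1.0, the HPA
    leaves the replica count alone to avoid churn on tiny fluctuations.
    """
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current                    # close enough: no scaling
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))

# 6 Pods at 90% CPU against a 60% target -> scale to 9
print(desired_replicas(6, 90, 60, min_replicas=3, max_replicas=20))  # 9
```

Because the formula uses ceil, the HPA always rounds up — it would rather run one Pod too many than one too few.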

stabilizationWindowSeconds for scale-down — The 5-minute stabilisation window prevents the classic "scale down too fast, traffic spikes again, scale back up" thrash cycle. The HPA waits until the metric has been below threshold consistently for 5 minutes before removing Pods. Scale-up has no window (0s) — you always want to add capacity immediately when traffic spikes.
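The scale-down side of that logic can be sketched as follows: the controller remembers recent replica recommendations and, while scaling down, honours the highest one still inside the window — so a brief lull can't shrink the Deployment. The tick-based bookkeeping here is illustrative, not the controller's actual implementation.

```python
from collections import deque

class ScaleDownStabilizer:
    """Keep the highest replica recommendation seen inside the window."""

    def __init__(self, window_ticks):
        # One entry per reconcile tick; a 300s window at 15s/tick = 20 ticks
        self.recent = deque(maxlen=window_ticks)

    def recommend(self, raw):
        self.recent.append(raw)
        return max(self.recent)   # scale down only to the window maximum

stab = ScaleDownStabilizer(window_ticks=3)
# Raw recommendations after a spike: demand drops from 9 Pods to 3
print([stab.recommend(r) for r in (9, 9, 3, 3, 3)])  # [9, 9, 9, 9, 3]
```

The Deployment only shrinks once the low recommendation has persisted for the full window — exactly the anti-thrash behaviour described above.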

TARGETS column in kubectl get hpa — 42%/60% means current average CPU utilisation is 42% against a target of 60%. The cluster is healthy and has headroom. If this showed 85%/60%, the HPA would be actively scaling up right now.

HPA requires resource requests to be set — The Utilization target is a percentage of the container's CPU request. If you haven't set resources.requests.cpu on the container, the HPA has no baseline to calculate against and will show <unknown>/60% in the TARGETS column.
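The arithmetic behind that percentage, using the numbers from the transcript above (63m average usage against the 150m CPU request set in the Deployment manifest):

```python
# Utilization = actual usage / container CPU request, as a percentage
usage_millicores = 63       # average usage from `kubectl describe hpa` above
request_millicores = 150    # resources.requests.cpu in the Deployment
print(round(100 * usage_millicores / request_millicores))  # 42
```

With no request set, the division has no denominator — which is exactly why the HPA reports `<unknown>` instead of a percentage.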

HPA in Action: The Full Scaling Lifecycle

Here's how the HPA and Deployment interact across a full traffic spike and recovery cycle:

HPA Lifecycle — Traffic Spike and Recovery

09:00
Steady state: 3 Pods running (minReplicas), CPU at 25%. Business day starts, traffic begins building.
09:30
Traffic spike: Average CPU hits 80% (above 60% target). HPA calculates: ceil(3 × 80/60) = 4 Pods. Scales to 4 immediately (scaleUp.stabilizationWindowSeconds: 0).
09:35
Still climbing: 4 Pods at 85% CPU. HPA scales again: ceil(4 × 85/60) = 6 Pods. New Pods warm up, readiness passes, traffic distributes.
09:45
Stabilised: 6 Pods running at 55% CPU (below 60% target). HPA is satisfied. No further scaling. Cluster absorbs the load without any human intervention.
18:00
Traffic drops: CPU falls to 20%. HPA wants to scale down — but the 5-minute stabilisation window starts. HPA waits, watching if CPU stays low.
18:05
Scale down begins: CPU has been below target for 5 minutes. HPA starts removing Pods at the policy rate of 10% per 60s, scaling gradually back to 3 (minReplicas). Cloud bill drops for the night.
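The scale-up half of that timeline can be replayed with the HPA formula. A small sketch, feeding in the average utilisations observed at each step of the morning spike:

```python
import math

def step(replicas, utilization, target=60):
    """One reconcile step: desired = ceil(replicas * util / target)."""
    return math.ceil(replicas * utilization / target)

replicas = 3                       # 09:00 steady state (minReplicas)
for util in (80, 85, 55):          # readings at 09:30, 09:35, 09:45
    replicas = max(replicas, step(replicas, util))   # scale-up only here
    print(replicas)                # 4, then 6, then 6 (stable)
```

The count settles at 6 once average utilisation (55%) drops under the 60% target — matching the 09:45 state in the timeline.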

Checking HPA Status and Troubleshooting

The scenario: A developer reports that the HPA isn't scaling up even though they can see high CPU. You need to diagnose why the HPA isn't firing and find the root cause.

kubectl get hpa -n production
# Quick overview: TARGETS, MINPODS, MAXPODS, REPLICAS
# If TARGETS shows <unknown>/60%, HPA can't read metrics — likely missing resource requests or metrics-server

kubectl describe hpa checkout-api-hpa -n production
# Full detail: conditions, recent events, current metric values
# Look at Conditions section — ScalingActive: False means HPA is not scaling
# Check the Message field in Conditions for the exact reason

kubectl get hpa checkout-api-hpa -n production -o yaml
# Full YAML dump — see the complete status including lastScaleTime and currentMetrics

kubectl top pods -n production -l app=checkout-api
# Verify metrics-server is returning data
# If kubectl top returns error, metrics-server is not installed or not healthy

kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/production/pods | python3 -m json.tool
# Raw metrics API query — confirms metrics-server is serving data for these pods
# If this 404s, metrics-server is down
$ kubectl get hpa -n production
NAME               REFERENCE                TARGETS          MINPODS   MAXPODS   REPLICAS   AGE
checkout-api-hpa   Deployment/checkout-api  <unknown>/60%      3         20        3          8m

$ kubectl describe hpa checkout-api-hpa -n production
...
Conditions:
  Type            Status  Reason                   Message
  ----            ------  ------                   -------
  AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetResourceMetric  the HPA was unable to compute the replica count: failed to get
                                                   cpu utilization: missing request for cpu
...

$ kubectl get deployment checkout-api -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'
{}

(problem found: no resources block in the Deployment spec — HPA has no baseline to calculate against)

What just happened?

The most common HPA failure: missing resource requests — The HPA Conditions section gave the exact answer: missing request for cpu. CPU utilisation percentage is calculated as actual usage divided by the CPU request. With no request there is no denominator, so the calculation is undefined and the HPA can't function. Add resources.requests.cpu to the Deployment container spec and the HPA springs to life.

ScalingActive: False — The Conditions section of kubectl describe hpa is the first place to look when an HPA isn't behaving. ScalingActive: False with a reason and message tells you exactly what's broken. Other common reasons include FailedGetScale (RBAC issue), InvalidSelector (label selector mismatch), and SelectorRequired (Deployment has no selector).

Teacher's Note: Don't fight the HPA — design for it

The HPA works best when your application has a roughly linear relationship between request rate and CPU usage. Stateless HTTP APIs are ideal candidates. Teams run into trouble when they set the CPU target too high (90%+), leaving no headroom for the HPA to respond before the service is already saturated. Setting the target at 60–70% means new Pods are spinning up while you still have 30–40% headroom. By the time the new Pods are ready and pass readiness probes (10–30 seconds), the existing Pods haven't fallen over.

The behavior.scaleDown.stabilizationWindowSeconds: 300 setting is not optional in production — it's mandatory for any service that sees bursty traffic. Without it, HPA will aggressively scale down after each spike, only to scale back up seconds later when the next request burst arrives. This is called thrashing and it causes exactly the kind of latency spikes you were trying to avoid.

One more thing: the HPA and manual kubectl scale fight each other. If you manually scale to 12 during an incident, the HPA will eventually bring it back down to whatever it calculates as the right number based on metrics. That's usually the right behaviour — but be aware of it when responding to incidents.

Practice Questions

1. Write the kubectl command to immediately scale the checkout-api Deployment in the production namespace to 10 replicas.



2. An HPA shows <unknown>/60% in the TARGETS column and ScalingActive: False with reason missing request for cpu. What field is missing from the Deployment container spec?



3. What HPA behavior field prevents the autoscaler from scaling down immediately after a traffic spike ends — avoiding the thrash cycle of scaling down and then back up?



Quiz

1. An HPA has targetUtilization: 50. Currently 4 Pods are running at 80% average CPU. How many replicas will the HPA calculate as the desired count?


2. You manually scale a Deployment to 15 replicas during an incident, but an HPA is also configured for it with maxReplicas: 20. Once the incident is over and CPU returns to normal, what happens?


3. What does rollingUpdate.maxSurge: 2 control in a Deployment spec?


Up Next · Lesson 25

Rolling Updates

How Kubernetes replaces old Pods with new ones without dropping a single request — and the exact parameters that control how fast and how safe that process is.