Kubernetes Course
Scaling Applications
Kubernetes was built for this. Scaling from 3 Pods to 50 is a single command — or a single YAML change — and rolling back to 3 is equally trivial. This lesson covers manual scaling, the Horizontal Pod Autoscaler, and the patterns that let production clusters absorb 10x traffic spikes without a single page to on-call.
Horizontal vs Vertical Scaling
There are two directions to scale a containerised application. Horizontal scaling adds more Pods — more instances of the same container running in parallel. Vertical scaling gives existing Pods more CPU and memory. Kubernetes excels at horizontal scaling. Vertical scaling (covered in Lesson 50 via the VPA) requires a restart and is harder to automate safely.
↔ Horizontal Scaling (scale out)
More Pods. Each Pod handles a share of traffic. No downtime. Linear capacity increase. The Kubernetes-native approach.
3 Pods → 10 Pods → 3 Pods
↕ Vertical Scaling (scale up)
More CPU/memory per Pod. Requires a restart to take effect. Has physical limits. Works for apps that don't scale horizontally (stateful, single-threaded).
500m CPU → 2000m CPU (restart required)
The Kubernetes scaling contract: For horizontal scaling to work, your application must be stateless — any Pod can handle any request. If your app stores session state in memory, scaling to 10 Pods means a user's session might be on Pod 3 but their next request goes to Pod 7. Use Redis or a database for session state, keep your Pods stateless, and horizontal scaling becomes trivial.
Manual Scaling
The scenario: It's 11:45 PM on a Friday. Your company just got featured in a major publication. Traffic is spiking hard — your checkout API is running 3 Pods and CPU is pegged at 90%. Your SLO is 99.9% and you're watching error rates climb. You need to scale right now, before the weekend traffic surge takes the service down. Here's the sequence of commands you run.
kubectl get deployment checkout-api -n production
# First: confirm the current state — how many replicas are running and how many are ready
kubectl scale deployment checkout-api --replicas=12 -n production
# scale: immediately change the replica count to 12
# The ReplicaSet controller starts spinning up 9 new Pods instantly
# --replicas=12: the new desired replica count
kubectl get pods -n production -l app=checkout-api -w
# -w: watch mode — streams live updates as Pod status changes
# You'll see new Pods appear in ContainerCreating → Running in real time
kubectl rollout status deployment/checkout-api -n production
# rollout status: blocks until all 12 replicas are Running and Ready
# Exit code 0 when complete — useful in scripts to wait for scale-out to finish
kubectl top pods -n production -l app=checkout-api
# top: check actual CPU/memory after scaling to confirm load is distributed
# Each Pod should now be at roughly 1/4 of its previous CPU usage
$ kubectl get deployment checkout-api -n production
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
checkout-api   3/3     3            3           14d

$ kubectl scale deployment checkout-api --replicas=12 -n production
deployment.apps/checkout-api scaled

$ kubectl get pods -n production -l app=checkout-api -w
NAME                        READY   STATUS              RESTARTS   AGE
checkout-api-6f8b9d-2xkpj   1/1     Running             0          14d
checkout-api-6f8b9d-7rvqn   1/1     Running             0          14d
checkout-api-6f8b9d-m4czl   1/1     Running             0          14d
checkout-api-6f8b9d-4pkrn   0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-8wnqx   0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-b2zjl   0/1     ContainerCreating   0          2s
checkout-api-6f8b9d-c9vmp   0/1     Pending             0          2s
...
checkout-api-6f8b9d-4pkrn   1/1     Running             0          18s
checkout-api-6f8b9d-8wnqx   1/1     Running             0          19s

$ kubectl top pods -n production -l app=checkout-api
NAME                        CPU(cores)   MEMORY(bytes)
checkout-api-6f8b9d-2xkpj   78m          112Mi
checkout-api-6f8b9d-7rvqn   81m          108Mi
checkout-api-6f8b9d-m4czl   76m          115Mi
checkout-api-6f8b9d-4pkrn   72m          104Mi
What just happened?
Instant scale-out — The moment kubectl scale ran, the Deployment controller updated the ReplicaSet's desired count to 12. The ReplicaSet controller immediately started creating 9 new Pods. The existing 3 Pods never restarted — they kept serving traffic throughout. New Pods began receiving traffic as soon as their readiness probes passed.
CPU distributed — Before scaling, each of the 3 Pods was handling ~300m CPU (~900m total). After scaling to 12, the same traffic is spread across 12 Pods — each handling ~75m. The application is now comfortably within its limits, and error rates drop as the load distributes.
Reconcile your YAML — After any manual scale, update the replicas: field in your deployment YAML and commit it. Otherwise, the next time someone applies the old manifest from Git, the Deployment will scale back down to 3. The cluster state and the repo must stay in sync.
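One way to catch that drift before it bites is kubectl diff. A short sketch, assuming the manifest lives in a file called checkout-deployment.yaml:

```shell
# Compare the live Deployment against the manifest in the repo.
# Exit code 0: no drift. Exit code 1: the live object differs from the file
# (e.g. replicas: 3 in Git vs 12 live after a manual scale).
kubectl diff -f checkout-deployment.yaml

# After updating replicas: in the file and merging, re-apply to reconcile:
kubectl apply -f checkout-deployment.yaml
```

Running the diff in CI against every manifest is a cheap way to surface forgotten manual scales before they get silently reverted.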
Declarative Scaling: Replicas in the Manifest
Manual scaling is for emergencies. The declarative approach — changing replicas in the Deployment manifest and applying it — is how production scaling decisions should normally happen. It's reviewable, auditable, and version-controlled.
The scenario: The Friday spike has passed. You've reviewed the traffic patterns and decided the checkout API should run 6 replicas during business hours as the new baseline — up from 3. You're updating the manifest, opening a pull request, getting it reviewed by the team, and merging.
apiVersion: apps/v1
kind: Deployment
metadata:
name: checkout-api
namespace: production
annotations:
scaling-rationale: "Increased from 3 to 6 after Friday traffic analysis" # Document WHY
last-scaled: "2025-03-14"
spec:
replicas: 6 # Changed from 3 — new production baseline
selector:
matchLabels:
app: checkout-api
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 # At most 1 Pod unavailable during a rollout or scale event
maxSurge: 2 # Allow 2 extra Pods above desired count during scale-up
# maxSurge helps scale-up go faster — more Pods can be created
# before the old ones are removed
template:
metadata:
labels:
app: checkout-api
spec:
containers:
- name: checkout-api
image: company/checkout-api:2.3.0
ports:
- containerPort: 3000
resources:
requests:
cpu: "150m"
memory: "200Mi"
limits:
cpu: "500m"
memory: "350Mi"
$ kubectl apply -f checkout-deployment.yaml
deployment.apps/checkout-api configured

$ kubectl get deployment checkout-api -n production
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
checkout-api   6/6     6            6           14d

$ kubectl describe deployment checkout-api -n production | grep -E "Replicas:|RollingUpdate"
Replicas:               6 desired | 6 updated | 6 total | 6 available | 0 unavailable
StrategyType:           RollingUpdate
RollingUpdateStrategy:  1 max unavailable, 2 max surge
What just happened?
maxUnavailable and maxSurge — These two fields on rollingUpdate control how the Deployment scales and rolls out. maxUnavailable: 1 means at most 1 Pod can be unavailable during any transition. maxSurge: 2 means up to 2 extra Pods can exist temporarily above the desired count. During a pure scale event (no image change), maxSurge lets the new Pods spin up quickly before the count settles at 6.
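The two fields put hard bounds on how many Pods can exist at any moment during a transition. A quick sketch of the arithmetic for this manifest (replicas: 6, maxUnavailable: 1, maxSurge: 2):

```shell
# Bounds on Pod count during a rolling update, given the Deployment's settings.
replicas=6
max_unavailable=1
max_surge=2

floor=$(( replicas - max_unavailable ))   # fewest Pods that may be available: 5
ceiling=$(( replicas + max_surge ))       # most Pods that may exist at once: 8

echo "Pod count stays between $floor available and $ceiling total during a rollout"
```

With absolute numbers like these, the bounds are fixed; both fields also accept percentages (e.g. maxSurge: 25%), in which case Kubernetes computes the bounds from the current replica count.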
Annotations for operational context — Adding scaling-rationale and last-scaled annotations to the Deployment metadata gives future team members context for why the replica count is what it is. Six months from now, someone will wonder why this service runs 6 replicas instead of 2 — the annotation answers it without a Slack archaeology expedition.
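Annotations can also be set imperatively, which is handy when you want to record the rationale during an incident and reconcile the manifest afterwards. A sketch using the same annotation keys as the manifest above (pick whatever keys your team standardises on):

```shell
# Imperatively set the operational annotations on the live Deployment.
# --overwrite replaces the value if the annotation already exists.
kubectl annotate deployment checkout-api -n production \
  scaling-rationale="Increased from 3 to 6 after Friday traffic analysis" \
  last-scaled="2025-03-14" \
  --overwrite
```

The declarative manifest remains the source of truth; an imperative annotate is a stopgap until the YAML catches up.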
Horizontal Pod Autoscaler (HPA)
Manual scaling is reactive — a human notices the problem and acts. The Horizontal Pod Autoscaler takes the human out of the loop — Kubernetes adjusts the replica count automatically based on real-time metrics. When CPU crosses a threshold, the HPA adds Pods. When load drops, it removes them. Nobody needs to wake up at 3am.
HPA requires the metrics-server to be installed in the cluster — it's the component that serves CPU and memory usage data to the HPA controller. GKE and AKS include it by default; on EKS you typically have to install it yourself.
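A quick way to confirm metrics-server is present and healthy before relying on an HPA (assuming it runs in kube-system, its usual location):

```shell
# Is metrics-server deployed and ready?
kubectl get deployment metrics-server -n kube-system

# Does the metrics API actually answer? If this errors, every HPA in the
# cluster will show <unknown> in its TARGETS column.
kubectl top nodes
```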
The scenario: You're a platform engineer at a B2B SaaS company. The API service sees predictable but severe daily spikes — quiet at night, high traffic during business hours across multiple timezones. You want the service to scale from a minimum of 3 Pods to a maximum of 20 Pods automatically, based on CPU utilisation. When CPU per Pod averages above 60%, add Pods. When it drops below 60%, remove them.
apiVersion: autoscaling/v2 # HPA v2: supports multiple metrics and more control
kind: HorizontalPodAutoscaler
metadata:
name: checkout-api-hpa # Name of the HPA object
namespace: production
spec:
scaleTargetRef: # Which workload to autoscale
apiVersion: apps/v1
kind: Deployment # Target a Deployment (could also be StatefulSet)
name: checkout-api # The Deployment name to control
minReplicas: 3 # Never scale below 3 Pods — baseline availability
maxReplicas: 20 # Never scale above 20 Pods — cost ceiling
metrics: # metrics: what to measure to make scaling decisions
- type: Resource # Resource: built-in CPU/memory metrics from metrics-server
resource:
name: cpu # Which resource to monitor
target:
type: Utilization # Utilization: as a percentage of the container's CPU request
averageUtilization: 60 # Target: keep average CPU utilisation at 60% across all Pods
# Below 60%: HPA may scale down
# Above 60%: HPA scales up by adding Pods
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 70 # Also scale on memory — if avg usage > 70% of request, scale up
behavior: # behavior: control the speed of scale-up and scale-down
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately — don't wait for a stabilisation window
policies:
- type: Pods
value: 4 # Add at most 4 Pods per scaling interval
periodSeconds: 15 # Scaling interval: every 15 seconds
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 minutes of low traffic before scaling down
# Prevents thrashing — scaling down too fast, then back up
policies:
- type: Percent
value: 10 # Remove at most 10% of Pods per scaling interval
periodSeconds: 60 # Scale down interval: every 60 seconds
$ kubectl apply -f checkout-hpa.yaml
horizontalpodautoscaler.autoscaling/checkout-api-hpa created

$ kubectl get hpa -n production
NAME               REFERENCE                 TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
checkout-api-hpa   Deployment/checkout-api   42%/60%, 55%/70%   3         20        6          2m

$ kubectl describe hpa checkout-api-hpa -n production
Name:            checkout-api-hpa
Namespace:       production
Reference:       Deployment/checkout-api
Metrics:         ( current / target )
  resource cpu on pods (as a percentage of request):     42% (63m) / 60%
  resource memory on pods (as a percentage of request):  55% (110Mi) / 70%
Min replicas:    3
Max replicas:    20
Deployment pods: 6 current / 6 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Normal  SuccessfulRescale  5m  horizontal-pod-autoscaler  New size: 8; reason: cpu resource above target
What just happened?
How HPA calculates replicas — The HPA controller runs every 15 seconds. It fetches current CPU usage from the metrics server, calculates the average utilisation across all current Pods, and computes the desired replica count: desiredReplicas = ceil(currentReplicas × (currentUtilization / targetUtilization)). If 6 Pods are running at 90% CPU and the target is 60%, the formula gives: ceil(6 × 90/60) = ceil(9) = 9. The HPA scales to 9 Pods.
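The formula is plain arithmetic. Here's the worked example above as a runnable sketch, using the usual shell trick for ceiling division:

```shell
# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
current_replicas=6
current_util=90    # average CPU utilisation across Pods, percent
target_util=60     # HPA target, percent

# Integer ceiling division: ceil(a/b) == (a + b - 1) / b
desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))

echo "$desired"    # 9
```

The ceiling matters: at 6 Pods and 65% utilisation the raw ratio is 6.5, and the HPA rounds up to 7 rather than leaving the service over target.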
stabilizationWindowSeconds for scale-down — The 5-minute stabilisation window prevents the classic "scale down too fast, traffic spikes again, scale back up" thrash cycle. The HPA waits until the metric has been below threshold consistently for 5 minutes before removing Pods. Scale-up has no window (0s) — you always want to add capacity immediately when traffic spikes.
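Mechanically, the scale-down window works by taking the highest replica recommendation computed during the window, so a brief dip cannot trigger removal. A small sketch with hypothetical recommendation samples:

```shell
# Desired-replica recommendations computed over the last 5 minutes
# (hypothetical samples). The HPA scales down only to the MAXIMUM of
# these, so the momentary dip to 3 is ignored while any recent sample
# still says 6.
recommendations="6 4 3 5"

max=0
for r in $recommendations; do
  if [ "$r" -gt "$max" ]; then max=$r; fi
done

echo "scale-down target: $max"   # 6, not 3
```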
TARGETS column in kubectl get hpa — 42%/60% means current average CPU utilisation is 42%, target is 60%. The cluster is healthy and has headroom. If this showed 85%/60%, the HPA would be actively scaling up right now.
HPA requires resource requests to be set — The Utilization target is a percentage of the container's CPU request. If you haven't set resources.requests.cpu on the container, the HPA has no baseline to calculate against and will show <unknown>/60% in the TARGETS column.
HPA in Action: The Full Scaling Lifecycle
Here's how the HPA and Deployment interact across a full traffic spike and recovery cycle:
[Diagram: HPA Lifecycle — Traffic Spike and Recovery. Traffic rises and average CPU crosses the 60% target; the HPA adds Pods immediately (scaleUp.stabilizationWindowSeconds: 0). When traffic falls, the HPA waits out the 5-minute scale-down stabilisation window before removing Pods.]

Checking HPA Status and Troubleshooting
The scenario: A developer reports that the HPA isn't scaling up even though they can see high CPU. You need to diagnose why the HPA isn't firing and find the root cause.
kubectl get hpa -n production
# Quick overview: TARGETS, MINPODS, MAXPODS, REPLICAS
# If TARGETS shows <unknown>/60%, the HPA can't read metrics — likely missing resource requests or a broken metrics-server
kubectl describe hpa checkout-api-hpa -n production
# Full detail: conditions, recent events, current metric values
# Look at Conditions section — ScalingActive: False means HPA is not scaling
# Check the Message field in Conditions for the exact reason
kubectl get hpa checkout-api-hpa -n production -o yaml
# Full YAML dump — see the complete status including lastScaleTime and currentMetrics
kubectl top pods -n production -l app=checkout-api
# Verify metrics-server is returning data
# If kubectl top returns error, metrics-server is not installed or not healthy
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/production/pods | python3 -m json.tool
# Raw metrics API query — confirms metrics-server is serving data for these pods
# If this 404s, metrics-server is down
$ kubectl get hpa -n production
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
checkout-api-hpa Deployment/checkout-api <unknown>/60% 3 20 3 8m
$ kubectl describe hpa checkout-api-hpa -n production
...
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get
cpu utilization: missing request for cpu
...
$ kubectl get deployment checkout-api -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'
{}
(problem found: no resources block in the Deployment spec — HPA has no baseline to calculate against)

What just happened?
The most common HPA failure: missing resource requests — The HPA Conditions section gave the exact answer: missing request for cpu. CPU utilisation percentage is calculated as actual usage divided by the CPU request. If there's no request, the denominator is zero — the calculation is undefined. The HPA can't function. Add resources.requests.cpu to the Deployment container spec and the HPA springs to life.
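The fix can be applied by editing the manifest (preferred, so Git stays in sync) or imperatively. A sketch of the imperative version, using the request values from the Deployment shown earlier:

```shell
# Add CPU/memory requests to the container so the HPA has a denominator.
# Note: this changes the Pod template and triggers a rolling restart.
kubectl set resources deployment checkout-api -n production \
  --requests=cpu=150m,memory=200Mi

# Within a minute or so the TARGETS column should replace <unknown>
# with a real percentage.
kubectl get hpa checkout-api-hpa -n production
```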
ScalingActive: False — The Conditions section of kubectl describe hpa is the first place to look when an HPA isn't behaving. ScalingActive: False with a reason and message tells you exactly what's broken. Other common reasons include FailedGetScale (RBAC issue), InvalidSelector (label selector mismatch), and SelectorRequired (Deployment has no selector).
Teacher's Note: Don't fight the HPA — design for it
The HPA works best when your application has a roughly linear relationship between request rate and CPU usage. Stateless HTTP APIs are ideal candidates. Where teams run into trouble is when they set the CPU target too high (90%+) — leaving no headroom for the HPA to respond before the service is already saturated. Setting the target at 60–70% means new Pods are spinning up while you still have 30–40% headroom. By the time the new Pods are ready and pass readiness probes (10–30 seconds), the existing Pods haven't fallen over.
The behavior.scaleDown.stabilizationWindowSeconds: 300 setting matters for any service that sees bursty traffic. (300 seconds is also the controller's default downscale stabilisation window; setting it explicitly documents the intent and protects you if the cluster-wide default is ever tuned down.) With too short a window, the HPA will aggressively scale down after each spike, only to scale back up seconds later when the next request burst arrives. This is called thrashing, and it causes exactly the kind of latency spikes you were trying to avoid.
One more thing: the HPA and manual kubectl scale fight each other. If you manually scale to 12 during an incident, the HPA will eventually bring it back down to whatever it calculates as the right number based on metrics. That's usually the right behaviour — but be aware of it when responding to incidents.
Practice Questions
1. Write the kubectl command to immediately scale the checkout-api Deployment in the production namespace to 10 replicas.
2. An HPA shows <unknown>/60% in the TARGETS column and ScalingActive: False with reason missing request for cpu. What field is missing from the Deployment container spec?
3. What HPA behavior field prevents the autoscaler from scaling down immediately after a traffic spike ends — avoiding the thrash cycle of scaling down and then back up?
Quiz
1. An HPA has targetUtilization: 50. Currently 4 Pods are running at 80% average CPU. How many replicas will the HPA calculate as the desired count?
2. You manually scale a Deployment to 15 replicas during an incident, but an HPA is also configured for it with maxReplicas: 20. Once the incident is over and CPU returns to normal, what happens?
3. What does rollingUpdate.maxSurge: 2 control in a Deployment spec?
Up Next · Lesson 25
Rolling Updates
How Kubernetes replaces old Pods with new ones without dropping a single request — and the exact parameters that control how fast and how safe that process is.