Kubernetes Lesson 49 – Horizontal Pod Autoscaler | Dataplexa
Advanced Workloads & Operations · Lesson 49

Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler (HPA) watches a metric — CPU utilisation, memory, or a custom application metric — and automatically adjusts the number of Pod replicas in a Deployment or StatefulSet to keep that metric near a target value. When traffic spikes, more Pods appear. When traffic drops, excess Pods are removed.

How HPA Works

The HPA controller runs a control loop every 15 seconds (the default sync period). Each iteration it reads the current metric value, computes the desired replica count, and updates the target's replicas field through its scale subresource. The formula:

desiredReplicas = ceil(currentReplicas × (currentMetricValue ÷ targetMetricValue))

Example: 3 replicas at 80% CPU, target is 50% CPU → ceil(3 × 80/50) = ceil(4.8) = 5 replicas. The HPA also requires the Metrics Server to be installed in the cluster — it collects CPU and memory usage from the kubelet and exposes them via the metrics.k8s.io API.
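The arithmetic can be sketched in a few lines of Python. This is a hedged illustration of the formula above, not the controller's actual code — the real controller also applies a small tolerance band around the target to avoid churn:

```python
from math import ceil

def desired_replicas(current, current_metric, target_metric):
    # desiredReplicas = ceil(currentReplicas × (currentMetricValue ÷ targetMetricValue))
    return ceil(current * current_metric / target_metric)

# The worked example: 3 replicas at 80% CPU, target 50%
print(desired_replicas(3, 80, 50))  # 5
```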

Creating an HPA

The scenario: Your payment API sees variable traffic — quiet at night, busy during business hours and flash sales. You want it to scale between 2 and 20 replicas based on CPU utilisation, targeting 60% average CPU across all Pods.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api             # The Deployment to scale

  minReplicas: 2                  # Never scale below 2 — always some capacity
  maxReplicas: 20                 # Cap at 20 — cost ceiling and node capacity limit

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization       # Utilization: percentage of the Pod's CPU request
          averageUtilization: 60  # Scale to keep average CPU at 60% of requests

    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # Also scale on memory — the metric demanding the most replicas wins

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0       # Scale up immediately — don't wait
      policies:
        - type: Percent
          value: 100                      # Can double replicas per scaling interval
          periodSeconds: 15
        - type: Pods
          value: 4                        # Or add at most 4 Pods per interval
          periodSeconds: 15
      selectPolicy: Max                   # Use whichever policy allows more Pods (aggressive scale-up)

    scaleDown:
      stabilizationWindowSeconds: 300     # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 10                       # Remove at most 10% of replicas per interval
          periodSeconds: 60              # Gradual scale-down to avoid oscillation

$ kubectl apply -f payment-api-hpa.yaml
horizontalpodautoscaler.autoscaling/payment-api created

$ kubectl get hpa payment-api -n payments
NAME          REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS
payment-api   Deployment/payment-api   42%/60%   2         20        3

# Watch autoscaling in action during a load spike:
$ kubectl get hpa payment-api -n payments -w
NAME          TARGETS     REPLICAS
payment-api   42%/60%     3         ← steady state
payment-api   91%/60%     3         ← traffic spike
payment-api   91%/60%     5         ← scale up: ceil(3 × 91/60) = ceil(4.55) = 5, within policy limits
payment-api   64%/60%     5
payment-api   61%/60%     5         ← approaching target
payment-api   58%/60%     5         ← at target — stabilizing
[5 minutes later — traffic drops]
payment-api   22%/60%     5         ← underutilized
payment-api   22%/60%     4         ← gradual scale-down (10% per 60s ≈ 1 Pod)
payment-api   20%/60%     3         ← back to steady state

What just happened?

Scale up fast, scale down slowly — The asymmetric behavior is intentional. Reacting slowly to a traffic spike is expensive (users see errors and timeouts), while keeping a few extra Pods around for five minutes is cheap. The stabilizationWindowSeconds: 300 on scale-down prevents "flapping", where a brief traffic dip triggers a scale-down that is immediately reversed when traffic recovers.
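The effect of the stabilization window can be sketched as follows — a simplified model, not the controller's code: for scale-down, the HPA acts on the highest desired-replica recommendation computed during the window, so a brief dip cannot shrink the Deployment:

```python
def scale_down_target(window_recommendations):
    # During the scale-down stabilization window, the controller keeps the
    # *highest* recommendation it computed, so a short traffic dip is ignored.
    return max(window_recommendations)

# Recommendations over a 300s window: traffic dipped briefly (3), then recovered (6)
print(scale_down_target([6, 6, 3, 6]))  # 6 — the dip does not remove Pods
```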

Resource requests are required for Utilization-type metrics — The HPA computes utilization as current usage ÷ request. If a Pod has no CPU request set, the HPA cannot compute utilization and will show <unknown>/60%. Always set resource requests on Pods that use HPA.
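The utilization math behind that requirement can be sketched as follows (a hedged illustration with hypothetical names, not the controller's implementation):

```python
def cpu_utilization_pct(usage_millicores, request_millicores):
    # Utilization = current usage ÷ request, as a percentage.
    # With no request set there is no denominator, so the HPA reports <unknown>.
    if not request_millicores:
        return "<unknown>"
    return f"{round(100 * usage_millicores / request_millicores)}%"

print(cpu_utilization_pct(420, 500))   # 84%
print(cpu_utilization_pct(420, None))  # <unknown>
```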

Custom and External Metrics

CPU and memory are not always the right scaling signal. A queue-based worker should scale on queue depth. An API should scale on requests-per-second. The HPA autoscaling/v2 API supports custom metrics (from Prometheus via the Prometheus Adapter) and external metrics (from cloud services like SQS queue depth).

  metrics:
    # Custom metric: HTTP requests per second (from Prometheus Adapter)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second      # Metric name in Prometheus
        target:
          type: AverageValue
          averageValue: 100                   # Scale to keep avg RPS per Pod at 100

    # External metric: SQS queue depth
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: payment-processing       # Which queue
        target:
          type: AverageValue
          averageValue: 50                    # Scale to keep ~50 messages per Pod
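For an AverageValue target like the queue example, the desired count works out to the metric total divided by the per-Pod target, clamped to the HPA's bounds. A hedged sketch with illustrative numbers:

```python
from math import ceil

def desired_for_average_value(metric_total, per_pod_target, min_replicas, max_replicas):
    # AverageValue targets: spread the metric total across Pods so each
    # handles roughly per_pod_target, then clamp to the HPA's min/max.
    desired = ceil(metric_total / per_pod_target)
    return max(min_replicas, min(max_replicas, desired))

# 240 messages in the queue, target ~50 per Pod, bounds 2–20
print(desired_for_average_value(240, 50, 2, 20))  # 5
```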

Teacher's Note: HPA and Cluster Autoscaler

HPA scales Pods. But if all nodes are full, those new Pods go Pending. The Cluster Autoscaler (or Karpenter on AWS) fills this gap — it watches for Pending Pods caused by resource pressure and provisions new nodes, then removes nodes when utilisation drops. The two work together: HPA handles the Pod dimension, Cluster Autoscaler handles the node dimension.

One important interaction: don't set replicas in a Deployment manifest that is managed by an HPA. After the first apply, the HPA owns the replica count, and your manifest's replicas: 3 will fight it on every redeploy — each CI run snaps the count back to 3 until the HPA corrects it. Remove replicas from the Deployment spec entirely and let the HPA manage it.

Practice Questions

1. HPA requires an add-on to be installed that collects CPU and memory from kubelets and exposes them via the metrics.k8s.io API. What is this add-on called?



2. Which HPA behavior field prevents scale-down flapping by requiring the metric to stay low for a period before removing Pods?



3. An HPA shows <unknown>/60% for its CPU target and never scales. What is the most likely cause?



Quiz

1. Your HPA has averageUtilization: 60. Current replicas: 4, current average CPU: 90%. How many replicas does the HPA calculate?


2. You have an HPA managing a Deployment. Every time CI deploys the Deployment with replicas: 3 in the manifest, it overrides the HPA's current count. How do you fix this?


3. HPA scales Pods up but they stay Pending because all nodes are at capacity. Which complementary component adds new nodes to the cluster?


Up Next · Lesson 50

Vertical Pod Autoscaler

VPA automatically adjusts CPU and memory requests on individual Pods based on actual usage — eliminating the guesswork of right-sizing and preventing both OOMKills and resource waste. How it differs from HPA and when to use each.