Kubernetes Course
Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) watches a metric — CPU utilisation, memory, or a custom application metric — and automatically adjusts the number of Pod replicas in a Deployment or StatefulSet to keep that metric near a target value. When traffic spikes, more Pods appear. When traffic drops, excess Pods are removed.
How HPA Works
The HPA controller runs a control loop (every 15 seconds by default). Each iteration it reads the current metric value, computes the desired replica count, and updates the Deployment's replicas field. The formula:

desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)
Example: 3 replicas at 80% CPU, target is 50% CPU → ceil(3 × 80/50) = ceil(4.8) = 5 replicas. The HPA also requires the Metrics Server to be installed in the cluster — it collects CPU and memory usage from the kubelet and exposes them via the metrics.k8s.io API.
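The formula above can be sketched in a few lines of Python. This is a simplified model — the function name and clamping parameters are illustrative, and the real controller also applies a tolerance band and stabilization logic:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core HPA formula: ceil(currentReplicas * currentMetric / targetMetric),
    clamped to the [minReplicas, maxReplicas] range."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# The example from the text: 3 replicas at 80% CPU, target 50%
print(desired_replicas(3, 80, 50))  # → 5
```

Note that the result is clamped: even if the formula yields 1, an HPA with minReplicas: 2 never drops below 2.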
Creating an HPA
The scenario: Your payment API sees variable traffic — quiet at night, busy during business hours and flash sales. You want it to scale between 2 and 20 replicas based on CPU utilisation, targeting 60% average CPU across all Pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api          # The Deployment to scale
  minReplicas: 2               # Never scale below 2 — always some capacity
  maxReplicas: 20              # Cap at 20 — cost ceiling and node capacity limit
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization      # Utilization: percentage of the Pod's CPU request
        averageUtilization: 60 # Scale to keep average CPU at 60% of requests
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80 # Also scale on memory — the metric demanding more replicas wins
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # Scale up immediately — don't wait
      policies:
      - type: Percent
        value: 100             # Can double replicas per scaling interval
        periodSeconds: 15
      - type: Pods
        value: 4               # Or add at most 4 Pods per interval
        periodSeconds: 15
      selectPolicy: Max        # Use whichever policy allows more Pods (aggressive scale-up)
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 10              # Remove at most 10% of replicas per interval
        periodSeconds: 60      # Gradual scale-down to avoid oscillation
$ kubectl apply -f payment-api-hpa.yaml
horizontalpodautoscaler.autoscaling/payment-api created

$ kubectl get hpa payment-api -n payments
NAME          REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS
payment-api   Deployment/payment-api   42%/60%   2         20        3

# Watch autoscaling in action during a load spike:
$ kubectl get hpa payment-api -n payments -w
NAME          TARGETS   REPLICAS
payment-api   42%/60%   3   ← steady state
payment-api   91%/60%   3   ← traffic spike
payment-api   91%/60%   5   ← scale up: ceil(3 × 91/60) = 5
payment-api   67%/60%   5
payment-api   61%/60%   5   ← approaching target
payment-api   58%/60%   5   ← at target — stabilizing
[5 minutes later — traffic drops]
payment-api   22%/60%   5   ← underutilized
payment-api   22%/60%   4   ← gradual scale-down (10% per 60s = 1 Pod)
payment-api   20%/60%   3   ← back to steady state
What just happened?
Scale up fast, scale down slowly — The asymmetric behavior is intentional. Reacting slowly to a traffic spike is expensive (users see errors and timeouts), while a premature scale-down is cheap (a few extra Pods for 5 minutes). The stabilizationWindowSeconds: 300 on scale-down prevents "flapping", where a brief traffic dip triggers a scale-down that is immediately reversed when traffic recovers.
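A sketch of why the window prevents flapping: during scale-down, the controller considers the desired replica counts computed over the whole stabilization window and acts on the maximum, so a brief dip cannot shrink the Deployment. The function name is illustrative:

```python
def scale_down_target(desired_history):
    """Scale-down stabilization: act on the HIGHEST desired replica
    count computed during the stabilization window, so a momentary
    traffic dip is ignored."""
    return max(desired_history)

# Desired counts computed every 15s across a window — brief dip in the middle:
window = [6, 6, 3, 3, 6, 6]
print(scale_down_target(window))  # → 6 — no flapping

# Only once the metric stays low for the whole window does the count drop:
print(scale_down_target([3, 3, 3, 3]))  # → 3
```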
Resource requests are required for Utilization-type metrics — The HPA computes utilization as current usage ÷ request. If a Pod has no CPU request set, the HPA cannot compute utilization and will show <unknown>/60%. Always set resource requests on Pods that use HPA.
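Concretely, the containers behind the payment-api HPA would need requests along these lines — the values here are illustrative, not prescriptive:

```yaml
# In the payment-api Deployment's container spec:
resources:
  requests:
    cpu: 250m        # 60% utilization target → scale when usage nears 150m
    memory: 256Mi    # 80% utilization target → scale when usage nears ~205Mi
  limits:
    memory: 512Mi
```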
Custom and External Metrics
CPU and memory are not always the right scaling signal. A queue-based worker should scale on queue depth. An API should scale on requests-per-second. The HPA autoscaling/v2 API supports custom metrics (from Prometheus via the Prometheus Adapter) and external metrics (from cloud services like SQS queue depth).
metrics:
# Custom metric: HTTP requests per second (from Prometheus Adapter)
- type: Pods
  pods:
    metric:
      name: http_requests_per_second # Metric name in Prometheus
    target:
      type: AverageValue
      averageValue: 100              # Scale to keep avg RPS per Pod at 100
# External metric: SQS queue depth
- type: External
  external:
    metric:
      name: sqs_queue_depth
      selector:
        matchLabels:
          queue: payment-processing  # Which queue
    target:
      type: AverageValue
      averageValue: 50               # Scale to keep ~50 messages per Pod
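For AverageValue targets the arithmetic reduces to dividing the metric total by the per-Pod target. A minimal sketch — the function name is illustrative:

```python
import math

def desired_from_average_value(metric_total, target_average):
    """For AverageValue targets the HPA effectively asks: how many Pods
    are needed so that metric_total / replicas ≈ target_average?"""
    return math.ceil(metric_total / target_average)

# 800 total RPS, target 100 RPS per Pod → 8 replicas
print(desired_from_average_value(800, 100))
# 620 queued messages, target ~50 messages per Pod → 13 replicas
print(desired_from_average_value(620, 50))
```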
Teacher's Note: HPA and Cluster Autoscaler
HPA scales Pods. But if all nodes are full, those new Pods go Pending. The Cluster Autoscaler (or Karpenter on AWS) fills this gap — it watches for Pending Pods caused by resource pressure and provisions new nodes, then removes nodes when utilisation drops. The two work together: HPA handles the Pod dimension, Cluster Autoscaler handles the node dimension.
One important interaction: don't manually set replicas in a Deployment manifest that is managed by an HPA. After the first apply, the HPA owns the replica count, and your manifest's replicas: 3 will fight with the HPA every time you redeploy. Instead, remove replicas from the Deployment spec entirely and let the HPA manage it — or use server-side apply (kubectl apply --server-side), which leaves ownership of the replicas field with the HPA controller as long as your manifest omits that field.
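In practice the HPA-managed manifest simply omits replicas. A sketch of what that looks like, with illustrative labels and image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: payments
spec:
  # No "replicas" field — the HPA owns the count after the first apply.
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
      - name: api
        image: registry.example.com/payment-api:1.4.2  # illustrative
        resources:
          requests:
            cpu: 250m      # required for Utilization-type HPA metrics
            memory: 256Mi
```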
Practice Questions
1. HPA requires an add-on to be installed that collects CPU and memory from kubelets and exposes them via the metrics.k8s.io API. What is this add-on called?
2. Which HPA behavior field prevents scale-down flapping by requiring the metric to stay low for a period before removing Pods?
3. An HPA shows <unknown>/60% for its CPU target and never scales. What is the most likely cause?
Quiz
1. Your HPA has averageUtilization: 60. Current replicas: 4, current average CPU: 90%. How many replicas does the HPA calculate?
2. You have an HPA managing a Deployment. Every time CI deploys the Deployment with replicas: 3 in the manifest, it overrides the HPA's current count. How do you fix this?
3. HPA scales Pods up but they stay Pending because all nodes are at capacity. Which complementary component adds new nodes to the cluster?
Up Next · Lesson 50
Vertical Pod Autoscaler
VPA automatically adjusts CPU and memory requests on individual Pods based on actual usage — eliminating the guesswork of right-sizing and preventing both OOMKills and resource waste. How it differs from HPA and when to use each.