Kubernetes Course
Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) watches a metric — CPU utilisation, memory, or a custom application metric — and automatically adjusts the number of Pod replicas in a Deployment or StatefulSet to keep that metric near a target value. When traffic spikes, more Pods appear. When traffic drops, excess Pods are removed.
How HPA Works
The HPA controller runs a control loop (every 15 seconds by default). Each iteration it reads the current metric value, computes the desired replica count, and updates the Deployment's replicas field. The formula:

desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)
Example: 3 replicas at 80% CPU, target is 50% CPU → ceil(3 × 80/50) = ceil(4.8) = 5 replicas. The HPA also requires the Metrics Server to be installed in the cluster — it collects CPU and memory usage from the kubelet and exposes them via the metrics.k8s.io API.
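The formula above can be sketched in a few lines of Python. This is a simplified model — the function name and clamping parameters are illustrative, and the real controller also applies a tolerance band and stabilization logic:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core HPA formula: ceil(currentReplicas * currentMetric / targetMetric),
    clamped to the [minReplicas, maxReplicas] range."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# The example from the text: 3 replicas at 80% CPU, target 50%
print(desired_replicas(3, 80, 50))  # → 5
```

Note that the result is clamped: even if the formula yields 1, an HPA with minReplicas: 2 never drops below 2.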
Creating an HPA
The scenario: Your payment API sees variable traffic — quiet at night, busy during business hours and flash sales. You want it to scale between 2 and 20 replicas based on CPU utilisation, targeting 60% average CPU across all Pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api          # The Deployment to scale
  minReplicas: 2               # Never scale below 2 — always some capacity
  maxReplicas: 20              # Cap at 20 — cost ceiling and node capacity limit
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization      # Utilization: percentage of the Pod's CPU request
        averageUtilization: 60 # Scale to keep average CPU at 60% of requests
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80 # Also scale on memory — the metric demanding more replicas wins
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # Scale up immediately — don't wait
      policies:
      - type: Percent
        value: 100             # Can double replicas per scaling interval
        periodSeconds: 15
      - type: Pods
        value: 4               # Or add at most 4 Pods per interval
        periodSeconds: 15
      selectPolicy: Max        # Use whichever policy allows more Pods (aggressive scale-up)
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 10              # Remove at most 10% of replicas per interval
        periodSeconds: 60      # Gradual scale-down to avoid oscillation
$ kubectl apply -f payment-api-hpa.yaml
horizontalpodautoscaler.autoscaling/payment-api created

$ kubectl get hpa payment-api -n payments
NAME          REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS
payment-api   Deployment/payment-api   42%/60%   2         20        3

# Watch autoscaling in action during a load spike:
$ kubectl get hpa payment-api -n payments -w
NAME          TARGETS   REPLICAS
payment-api   42%/60%   3   ← steady state
payment-api   91%/60%   3   ← traffic spike
payment-api   91%/60%   5   ← scale up: ceil(3 × 91/60) = 5
payment-api   67%/60%   5
payment-api   61%/60%   5   ← approaching target
payment-api   58%/60%   5   ← at target — stabilizing
[5 minutes later — traffic drops]
payment-api   22%/60%   5   ← underutilized
payment-api   22%/60%   4   ← gradual scale-down (10% per 60s = 1 Pod)
payment-api   20%/60%   3   ← back to steady state
What just happened?
Scale up fast, scale down slowly — The asymmetric behavior is intentional. Reacting slowly to a traffic spike is expensive (users see errors and timeouts), while a premature scale-down is cheap (a few extra Pods for 5 minutes). The stabilizationWindowSeconds: 300 on scale-down prevents "flapping", where a brief traffic dip triggers a scale-down that is immediately reversed when traffic recovers.
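A sketch of why the window prevents flapping: during scale-down, the controller considers the desired replica counts computed over the whole stabilization window and acts on the maximum, so a brief dip cannot shrink the Deployment. The function name is illustrative:

```python
def scale_down_target(desired_history):
    """Scale-down stabilization: act on the HIGHEST desired replica
    count computed during the stabilization window, so a momentary
    traffic dip is ignored."""
    return max(desired_history)

# Desired counts computed every 15s across a window — brief dip in the middle:
window = [6, 6, 3, 3, 6, 6]
print(scale_down_target(window))  # → 6 — no flapping

# Only once the metric stays low for the whole window does the count drop:
print(scale_down_target([3, 3, 3, 3]))  # → 3
```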
Resource requests are required for Utilization-type metrics — The HPA computes utilization as current usage ÷ request. If a Pod has no CPU request set, the HPA cannot compute utilization and will show <unknown>/60%. Always set resource requests on Pods that use HPA.
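Concretely, the containers behind the payment-api HPA would need requests along these lines — the values here are illustrative, not prescriptive:

```yaml
# In the payment-api Deployment's container spec:
resources:
  requests:
    cpu: 250m        # 60% utilization target → scale when usage nears 150m
    memory: 256Mi    # 80% utilization target → scale when usage nears ~205Mi
  limits:
    memory: 512Mi
```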
Custom and External Metrics
CPU and memory are not always the right scaling signal. A queue-based worker should scale on queue depth. An API should scale on requests-per-second. The HPA autoscaling/v2 API supports custom metrics (from Prometheus via the Prometheus Adapter) and external metrics (from cloud services like SQS queue depth).
metrics:
# Custom metric: HTTP requests per second (from Prometheus Adapter)
- type: Pods
  pods:
    metric:
      name: http_requests_per_second # Metric name in Prometheus
    target:
      type: AverageValue
      averageValue: 100              # Scale to keep avg RPS per Pod at 100
# External metric: SQS queue depth
- type: External
  external:
    metric:
      name: sqs_queue_depth
      selector:
        matchLabels:
          queue: payment-processing  # Which queue
    target:
      type: AverageValue
      averageValue: 50               # Scale to keep ~50 messages per Pod
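For AverageValue targets the arithmetic reduces to dividing the metric total by the per-Pod target. A minimal sketch — the function name is illustrative:

```python
import math

def desired_from_average_value(metric_total, target_average):
    """For AverageValue targets the HPA effectively asks: how many Pods
    are needed so that metric_total / replicas ≈ target_average?"""
    return math.ceil(metric_total / target_average)

# 800 total RPS, target 100 RPS per Pod → 8 replicas
print(desired_from_average_value(800, 100))
# 620 queued messages, target ~50 messages per Pod → 13 replicas
print(desired_from_average_value(620, 50))
```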
Teacher's Note: HPA and Cluster Autoscaler
HPA scales Pods. But if all nodes are full, those new Pods go Pending. The Cluster Autoscaler (or Karpenter on AWS) fills this gap — it watches for Pending Pods caused by resource pressure and provisions new nodes, then removes nodes when utilisation drops. The two work together: HPA handles the Pod dimension, Cluster Autoscaler handles the node dimension.
One important interaction: don't manually set replicas in a Deployment manifest that is managed by an HPA. After the first apply, the HPA owns the replica count, and your manifest's replicas: 3 will fight with the HPA every time you redeploy. Instead, remove replicas from the Deployment spec entirely and let the HPA manage it — or use server-side apply (kubectl apply --server-side), which leaves ownership of the replicas field with the HPA controller as long as your manifest omits that field.
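In practice the HPA-managed manifest simply omits replicas. A sketch of what that looks like, with illustrative labels and image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: payments
spec:
  # No "replicas" field — the HPA owns the count after the first apply.
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
      - name: api
        image: registry.example.com/payment-api:1.4.2  # illustrative
        resources:
          requests:
            cpu: 250m      # required for Utilization-type HPA metrics
            memory: 256Mi
```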
Practice Questions
1. HPA requires an add-on to be installed that collects CPU and memory from kubelets and exposes them via the metrics.k8s.io API. What is this add-on called?
2. Which HPA behavior field prevents scale-down flapping by requiring the metric to stay low for a period before removing Pods?
3. An HPA shows <unknown>/60% for its CPU target and never scales. What is the most likely cause?
Quiz
1. Your HPA has averageUtilization: 60. Current replicas: 4, current average CPU: 90%. How many replicas does the HPA calculate?
2. You have an HPA managing a Deployment. Every time CI deploys the Deployment with replicas: 3 in the manifest, it overrides the HPA's current count. How do you fix this?
3. HPA scales Pods up but they stay Pending because all nodes are at capacity. Which complementary component adds new nodes to the cluster?
Up Next · Lesson 50
Vertical Pod Autoscaler
VPA automatically adjusts CPU and memory requests on individual Pods based on actual usage — eliminating the guesswork of right-sizing and preventing both OOMKills and resource waste. How it differs from HPA and when to use each.