Kubernetes Course
Cluster Autoscaling
HPA scales the number of Pods. But if every node is full, new Pods go Pending. Cluster autoscaling solves the node dimension — automatically adding nodes when workloads can't be scheduled, and removing underutilised nodes to cut costs. This lesson covers the Cluster Autoscaler and its modern alternative Karpenter, and how they interact with HPA and VPA.
The Node Scaling Problem
Without cluster autoscaling, node capacity is static. You either over-provision (expensive, wasteful) or under-provision (Pods go Pending, services degrade during traffic spikes). The right answer is dynamic node capacity that tracks workload demand — but provisioning a cloud instance takes 2–5 minutes, so the autoscaler must act the moment Pods become unschedulable, and latency-sensitive teams often keep a small capacity buffer to ride out that gap.
Without Cluster Autoscaler
HPA scales Pods → nodes are full → new Pods go Pending → users see errors or timeouts during traffic spikes. You pay for idle nodes overnight to have spare capacity during the day.
With Cluster Autoscaler
HPA scales Pods → Pods go Pending → CA provisions a new node in 2–4 minutes → Pods schedule → capacity meets demand. Idle nodes are removed automatically — you only pay for what you use.
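Even 2–4 minutes can be too long for a sharp spike. A common complement is a "headroom" Deployment of low-priority placeholder Pods: real Pods preempt them instantly, and CA adds a node to reschedule the evicted placeholders. A minimal sketch (the names, replica count, and resource sizes are illustrative, not prescriptive):

```yaml
# Headroom buffer: negative-priority placeholder Pods.
# Real Pods preempt these immediately; CA then adds a node for the evicted placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning          # illustrative name
value: -10                        # below the default Pod priority of 0
globalDefault: false
description: "Placeholder Pods that reserve spare node capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation      # illustrative name
  namespace: kube-system
spec:
  replicas: 2                     # buffer size: two placeholder Pods
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing; only holds the resource request
        resources:
          requests:
            cpu: "1"              # each placeholder reserves 1 CPU / 2Gi of headroom
            memory: 2Gi
```

Tune the replica count and requests so the buffer roughly matches one spike's worth of capacity; too large and you are back to paying for idle nodes.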
Cluster Autoscaler
The Cluster Autoscaler (CA) is the original Kubernetes node autoscaler. It watches for Pending Pods and simulates whether adding a node from a configured node group would allow them to schedule. If yes, it scales up the node group. It also periodically checks for underutilised nodes and scales them down when their Pods can safely move elsewhere.
Scale-Up Decision Loop (runs every 10 seconds by default)
Every scan interval, CA looks for Pods the scheduler marked unschedulable, simulates placing them on a template node from each configured node group, picks a group using its expander strategy (random by default; least-waste, most-pods, price, and priority are alternatives), and increases that group's desired capacity.
Installing Cluster Autoscaler on EKS
The scenario: You're running EKS with a managed node group. Traffic peaks at midday and drops at night. You want the cluster to automatically scale between 2 and 20 nodes based on workload demand, using IRSA for authentication to the AWS Auto Scaling Group API.
# Step 1: Tag the Auto Scaling Group so CA can discover it
# In your eksctl cluster config or Terraform, add these tags to the ASG:
# k8s.io/cluster-autoscaler/enabled = true
# k8s.io/cluster-autoscaler/CLUSTER_NAME = owned
# Step 2: Create the IAM policy for CA (IRSA)
aws iam create-policy \
--policy-name ClusterAutoscalerPolicy \
--policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"ec2:DescribeImages",
"ec2:DescribeInstanceTypes",
"ec2:GetInstanceTypesFromInstanceRequirements",
"eks:DescribeNodegroup"
],
"Resource": "*"
}]
}'
# Step 3: Create IRSA ServiceAccount
eksctl create iamserviceaccount \
--cluster=production \
--namespace=kube-system \
--name=cluster-autoscaler \
--attach-policy-arn=arn:aws:iam::123456789012:policy/ClusterAutoscalerPolicy \
--approve
# Step 4: Install via Helm
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=production \
--set awsRegion=us-east-1 \
--set rbac.serviceAccount.create=false \
--set rbac.serviceAccount.name=cluster-autoscaler \
--set extraArgs.balance-similar-node-groups=true \
--set extraArgs.skip-nodes-with-system-pods=false \
--set extraArgs.scale-down-delay-after-add=5m \
--set extraArgs.scale-down-unneeded-time=10m
$ kubectl get pods -n kube-system | grep cluster-autoscaler
cluster-autoscaler-6d9f4-xkp2m   1/1   Running   0   2m

# Watch CA logs during a scale-up event:
$ kubectl logs -n kube-system -l app.kubernetes.io/name=cluster-autoscaler -f
I scale_up.go:468] Scale-up: setting group eks-production-app-nodes-abc123 size to 5
I clusterstate.go:215] Scale up in group eks-production-app-nodes: 3→5
I factory.go:84] waiting for 2 nodes

# 3 minutes later — new nodes joined:
I nodes.go:102] Added node ip-10-0-3-45.us-east-1.compute.internal
I nodes.go:102] Added node ip-10-0-3-67.us-east-1.compute.internal
I scheduler_based_predicates.go:88] Pod payment-api-7d9f4-abc scheduled

# Scale-down after 10 minutes of low utilisation:
I scale_down.go:612] ip-10-0-3-45 is unneeded since 10m — removing
I deleter.go:98] Successfully added ToBeDeletedByClusterAutoscaler taint
What just happened?
CA uses simulation, not guesswork — Before triggering a scale-up, CA runs the scheduler's filtering logic in memory against each configured node group. It only adds a node if that node type would actually allow the Pending Pods to schedule — respecting taints, affinities, resource requirements, and node labels. This prevents wasted scale-ups where the wrong node type is provisioned.
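Concretely, the simulation respects everything in the Pod spec. A workload like the sketch below (names and values are illustrative) can only trigger a scale-up of a node group whose template carries the matching label and whose instance type fits the requests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api            # illustrative workload
spec:
  replicas: 5
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      nodeSelector:
        workload-type: general   # only node groups labelled like this qualify
      containers:
      - name: api
        image: payment-api:1.4   # illustrative image
        resources:
          requests:
            cpu: "2"             # a node group of 2-CPU instances can never fit
            memory: 4Gi          # this plus DaemonSets, so CA skips that group
```

If no configured node group passes the simulation, CA logs the Pods as unschedulable and does nothing — a common source of "why didn't it scale up?" confusion.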
Scale-down is conservative by design — A node is only removed if it has been "unneeded" (all its Pods can be rescheduled elsewhere) for the full scale-down-unneeded-time (10 minutes here). CA also respects PodDisruptionBudgets — if removing a node would violate a PDB, the scale-down is deferred. By default CA additionally skips nodes running non-DaemonSet system Pods such as CoreDNS (the install above relaxes this with skip-nodes-with-system-pods=false), and it never removes nodes whose Pods use local storage or nodes hosting a Pod annotated cluster-autoscaler.kubernetes.io/safe-to-evict: "false".
balance-similar-node-groups — When multiple node groups can satisfy a scale-up request, this flag distributes nodes evenly across them — preventing all new capacity from landing in a single AZ and maintaining zone balance.
Karpenter: The Modern Alternative
Karpenter is a newer node autoscaler built by AWS, now a CNCF project, that works on EKS and is being adopted on other clouds. It takes a fundamentally different approach from CA: instead of managing pre-configured node groups, Karpenter looks at each Pending Pod's requirements and provisions the most appropriate instance type on demand — from the full EC2 catalogue.
| | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node selection | Picks from pre-configured node groups (fixed instance type per group) | Selects the optimal instance type from the full EC2 catalogue for each workload |
| Provisioning speed | 2–5 minutes (ASG launch + node join) | ~60 seconds (direct EC2 API, bypasses ASG) |
| Consolidation | Removes empty/underutilised nodes one at a time | Actively replaces multiple small nodes with fewer larger ones (bin packing) |
| Spot support | Requires separate spot node groups | Mix on-demand and spot within the same NodePool |
| Config overhead | Many node groups to maintain for different workload types | One or a few NodePool definitions cover all workloads |
The scenario: You want Karpenter to provision nodes for your EKS cluster — using a mix of on-demand and spot instances, picking the best instance type automatically, and actively consolidating underutilised nodes to cut costs.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
metadata:
labels:
managed-by: karpenter
spec:
nodeClassRef:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"] # Try spot first, fall back to on-demand
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"] # Compute, memory, and general purpose families
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["2"] # Only 3rd generation and newer
limits:
cpu: "200" # Hard cap — Karpenter won't provision beyond this
memory: 400Gi
disruption:
    consolidationPolicy: WhenUnderutilized # Actively consolidate underutilised nodes
    # Note: in v1beta1, consolidateAfter can only be set with the WhenEmpty
    # policy — it is not valid alongside WhenUnderutilized
expireAfter: 720h # Rotate nodes every 30 days (security patching)
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: AL2 # Amazon Linux 2 — Karpenter manages AMI selection
role: KarpenterNodeRole # IAM role for nodes to join the cluster
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: production # Which subnets to place nodes in
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: production
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
encrypted: true
$ kubectl get nodeclaims
NAME             TYPE         CAPACITY    ZONE         NODE
default-abc123   m5.xlarge    spot        us-east-1a   ip-10-0-2-44
default-def456   c5.2xlarge   spot        us-east-1b   ip-10-0-3-12
default-ghi789   r5.large     on-demand   us-east-1c   ip-10-0-4-88
# Karpenter selected different instance types for different workload shapes ✓

# Consolidation in action — two small nodes → one larger node:
$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep consolidat
Consolidating 2 nodes (m5.large x2) into 1 node (m5.xlarge)
Saving: $0.048/hr (38% cost reduction for these nodes)

$ kubectl get nodes
NAME                            STATUS   INSTANCE-TYPE   CAPACITY-TYPE
ip-10-0-2-44.compute.internal   Ready    m5.xlarge       spot        ← new consolidated node
ip-10-0-3-12.compute.internal   Ready    c5.2xlarge      spot
ip-10-0-4-88.compute.internal   Ready    r5.large        on-demand
What just happened?
Karpenter bypasses the ASG entirely — It calls the EC2 RunInstances API directly, which is why provisioning takes ~60 seconds instead of 2–5 minutes. It also selects the instance type in real time — if your workload needs 3.5 CPU and 7Gi memory, Karpenter picks the cheapest instance with at least that much capacity from the full EC2 catalogue, not just the fixed type in a node group.
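Karpenter's instance-type choice boils down to "cheapest type whose capacity satisfies the Pod's requests". The sketch below imitates that decision in plain shell against a tiny hypothetical catalogue — the four types and all prices are placeholder values, not real EC2 pricing, and this is an illustration of the idea, not Karpenter's actual algorithm:

```shell
# Hypothetical catalogue: name vcpu mem_gi hourly_price (placeholder prices).
CATALOGUE='m5.large 2 8 0.096
m5.xlarge 4 16 0.192
c5.xlarge 4 8 0.170
r5.large 2 16 0.126'

NEED_CPU=3.5   # the Pod requests 3.5 vCPU
NEED_MEM=7     # and 7Gi of memory

# Pick the cheapest type with at least the requested CPU and memory.
BEST=$(printf '%s\n' "$CATALOGUE" | awk -v cpu="$NEED_CPU" -v mem="$NEED_MEM" '
  $2 >= cpu && $3 >= mem && (best == "" || $4 < best) { best = $4; name = $1 }
  END { print name }')

echo "$BEST"   # cheapest type from the hypothetical list that fits the request
```

Here the two m5/c5 xlarge types both fit, and the cheaper one wins; the real selection additionally weighs spot availability and zone placement across the full catalogue.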
Consolidation is proactive, not reactive — Rather than waiting for a node to be empty, Karpenter simulates whether it can move all Pods from two underutilised nodes onto one larger (but cheaper) node. If PodDisruptionBudgets allow it, it evicts the Pods and terminates both small nodes, replacing them with one right-sized one. This typically cuts node costs by 20–40% compared to CA's passive scale-down.
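If aggressive consolidation worries you, newer Karpenter releases let you rate-limit it with disruption budgets on the NodePool. A sketch of the `spec.disruption` block (field names follow the v1beta1 API; verify support against your installed version, and the schedule shown is just an example):

```yaml
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    budgets:
    - nodes: "10%"                 # at most 10% of nodes may be disrupted at once
    - nodes: "0"                   # freeze voluntary disruption entirely...
      schedule: "0 9 * * mon-fri"  # ...starting 09:00 on weekdays
      duration: 8h                 # ...for 8 hours (business hours)
```

Multiple budgets combine restrictively: whichever budget is most limiting at a given moment wins.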
Protecting Workloads During Scale-Down
Both CA and Karpenter respect PodDisruptionBudgets when deciding which nodes to remove. But some Pods should never be evicted — long-running batch jobs, Pods with local state, or critical single-replica services.
# PodDisruptionBudget: protect payment-api during scale-down
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: payment-api-pdb
namespace: payments
spec:
minAvailable: 2 # At least 2 Pods must remain available during disruption
# OR: maxUnavailable: 1 # At most 1 Pod can be unavailable at a time
selector:
matchLabels:
app: payment-api
# CA/Karpenter check this before evicting Pods from a node being removed
# If evicting a Pod would violate the PDB, the node is not removed
---
# Prevent a specific Pod from being evicted (e.g. a long-running batch job)
# Add these annotations to the Pod template:
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # CA never removes the node hosting this Pod
    karpenter.sh/do-not-disrupt: "true"                      # Karpenter's equivalent protection
$ kubectl apply -f pdb-and-annotations.yaml
poddisruptionbudget.policy/payment-api-pdb created

$ kubectl get pdb payment-api-pdb -n payments
NAME              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
payment-api-pdb   2               N/A               1                     5s
# With 3 replicas and minAvailable: 2, CA can evict 1 Pod at a time during scale-down ✓

# Verify the safe-to-evict annotation prevents node removal
$ kubectl describe pod batch-job-xyz -n default | grep safe-to-evict
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
# CA will never evict this Pod -- its node is pinned until the Job completes
Teacher's Note: Cluster Autoscaler vs Karpenter — which to use
Use Cluster Autoscaler if: you're on a non-EKS cloud (GKE, AKS both have native CA integrations), your team is familiar with node group management and wants predictable instance types per group, or you're running a workload with very specific CPU/memory ratios that benefit from a fixed instance type.
Use Karpenter if: you're on EKS and want the fastest provisioning, you want to maximise spot instance savings without maintaining separate node groups, or you have diverse workloads (small API Pods, large batch jobs, GPU workloads) that benefit from instance type flexibility. Karpenter is increasingly the recommended default for new EKS clusters.
One important operational note for both: always set PodDisruptionBudgets on any stateful or critical workload before enabling cluster autoscaling. Without PDBs, scale-down can evict all replicas of a service simultaneously, causing an outage. The autoscaler doesn't know which services are critical — PDBs are how you tell it.
Practice Questions
1. The Cluster Autoscaler triggers a scale-up when it detects Pods in which state — indicating they cannot be scheduled onto existing nodes?
2. Which Kubernetes resource does both Cluster Autoscaler and Karpenter check before evicting Pods from a node being removed — ensuring a minimum number of replicas remain available?
3. What is the Karpenter feature called that proactively replaces multiple underutilised small nodes with fewer, better-packed larger nodes to reduce costs?
Quiz
1. Why does Karpenter provision new nodes significantly faster than the Cluster Autoscaler?
2. How does the Cluster Autoscaler decide which node group to scale up when multiple groups are available?
3. A long-running data processing Job must not be interrupted by cluster scale-down. What annotation prevents CA from evicting the Pod?
Up Next · Lesson 52
Kubernetes Logging
Container logs are ephemeral — when a Pod dies, its logs disappear. This lesson covers the Kubernetes logging architecture, shipping logs with Fluentd and Fluent Bit, structured logging patterns, and centralising logs in Elasticsearch and CloudWatch.