Kubernetes Course
Cluster Autoscaling
HPA scales the number of Pods. But if every node is full, new Pods go Pending. Cluster autoscaling solves the node dimension — automatically adding nodes when workloads can't be scheduled, and removing underutilised nodes to cut costs. This lesson covers the Cluster Autoscaler and its modern alternative Karpenter, and how they interact with HPA and VPA.
The Node Scaling Problem
Without cluster autoscaling, node capacity is static. You either over-provision (expensive, wasteful) or under-provision (Pods go Pending, services degrade during traffic spikes). The right answer is dynamic node capacity that tracks workload demand — but provisioning a cloud instance takes 2–5 minutes, so the autoscaler must act the moment Pods become unschedulable, and latency-sensitive teams often keep a small capacity buffer to ride out that gap.
Without Cluster Autoscaler
HPA scales Pods → nodes are full → new Pods go Pending → users see errors or timeouts during traffic spikes. You pay for idle nodes overnight to have spare capacity during the day.
With Cluster Autoscaler
HPA scales Pods → Pods go Pending → CA provisions a new node in 2–4 minutes → Pods schedule → capacity meets demand. Idle nodes are removed automatically — you only pay for what you use.
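Even 2–4 minutes can be too long for a sharp spike. A common complement is a "headroom" Deployment of low-priority placeholder Pods: real Pods preempt them instantly, and CA adds a node to reschedule the evicted placeholders. A minimal sketch (the names, replica count, and resource sizes are illustrative, not prescriptive):

```yaml
# Headroom buffer: negative-priority placeholder Pods.
# Real Pods preempt these immediately; CA then adds a node for the evicted placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning          # illustrative name
value: -10                        # below the default Pod priority of 0
globalDefault: false
description: "Placeholder Pods that reserve spare node capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation      # illustrative name
  namespace: kube-system
spec:
  replicas: 2                     # buffer size: two placeholder Pods
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing; only holds the resource request
        resources:
          requests:
            cpu: "1"              # each placeholder reserves 1 CPU / 2Gi of headroom
            memory: 2Gi
```

Tune the replica count and requests so the buffer roughly matches one spike's worth of capacity; too large and you are back to paying for idle nodes.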
Cluster Autoscaler
The Cluster Autoscaler (CA) is the original Kubernetes node autoscaler. It watches for Pending Pods and simulates whether adding a node from a configured node group would allow them to schedule. If yes, it scales up the node group. It also periodically checks for underutilised nodes and scales them down when their Pods can safely move elsewhere.
Scale-Up Decision Loop (runs every 10 seconds by default)
Every scan interval, CA looks for Pods the scheduler marked unschedulable, simulates placing them on a template node from each configured node group, picks a group using its expander strategy (random by default; least-waste, most-pods, price, and priority are alternatives), and increases that group's desired capacity.
Installing Cluster Autoscaler on EKS
The scenario: You're running EKS with a managed node group. Traffic peaks at midday and drops at night. You want the cluster to automatically scale between 2 and 20 nodes based on workload demand, using IRSA for authentication to the AWS Auto Scaling Group API.
# Step 1: Tag the Auto Scaling Group so CA can discover it
# In your eksctl cluster config or Terraform, add these tags to the ASG:
# k8s.io/cluster-autoscaler/enabled = true
# k8s.io/cluster-autoscaler/CLUSTER_NAME = owned
# Step 2: Create the IAM policy for CA (IRSA)
aws iam create-policy \
--policy-name ClusterAutoscalerPolicy \
--policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"ec2:DescribeImages",
"ec2:DescribeInstanceTypes",
"ec2:GetInstanceTypesFromInstanceRequirements",
"eks:DescribeNodegroup"
],
"Resource": "*"
}]
}'
# Step 3: Create IRSA ServiceAccount
eksctl create iamserviceaccount \
--cluster=production \
--namespace=kube-system \
--name=cluster-autoscaler \
--attach-policy-arn=arn:aws:iam::123456789012:policy/ClusterAutoscalerPolicy \
--approve
# Step 4: Install via Helm
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=production \
--set awsRegion=us-east-1 \
--set rbac.serviceAccount.create=false \
--set rbac.serviceAccount.name=cluster-autoscaler \
--set extraArgs.balance-similar-node-groups=true \
--set extraArgs.skip-nodes-with-system-pods=false \
--set extraArgs.scale-down-delay-after-add=5m \
--set extraArgs.scale-down-unneeded-time=10m
$ kubectl get pods -n kube-system | grep cluster-autoscaler
cluster-autoscaler-6d9f4-xkp2m   1/1   Running   0   2m

# Watch CA logs during a scale-up event:
$ kubectl logs -n kube-system -l app.kubernetes.io/name=cluster-autoscaler -f
I scale_up.go:468] Scale-up: setting group eks-production-app-nodes-abc123 size to 5
I clusterstate.go:215] Scale up in group eks-production-app-nodes: 3→5
I factory.go:84] waiting for 2 nodes

# 3 minutes later — new nodes joined:
I nodes.go:102] Added node ip-10-0-3-45.us-east-1.compute.internal
I nodes.go:102] Added node ip-10-0-3-67.us-east-1.compute.internal
I scheduler_based_predicates.go:88] Pod payment-api-7d9f4-abc scheduled

# Scale-down after 10 minutes of low utilisation:
I scale_down.go:612] ip-10-0-3-45 is unneeded since 10m — removing
I deleter.go:98] Successfully added ToBeDeletedByClusterAutoscaler taint
What just happened?
CA uses simulation, not guesswork — Before triggering a scale-up, CA runs the scheduler's filtering logic in memory against each configured node group. It only adds a node if that node type would actually allow the Pending Pods to schedule — respecting taints, affinities, resource requirements, and node labels. This prevents wasted scale-ups where the wrong node type is provisioned.
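Concretely, the simulation respects everything in the Pod spec. A workload like the sketch below (names and values are illustrative) can only trigger a scale-up of a node group whose template carries the matching label and whose instance type fits the requests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api            # illustrative workload
spec:
  replicas: 5
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      nodeSelector:
        workload-type: general   # only node groups labelled like this qualify
      containers:
      - name: api
        image: payment-api:1.4   # illustrative image
        resources:
          requests:
            cpu: "2"             # a node group of 2-CPU instances can never fit
            memory: 4Gi          # this plus DaemonSets, so CA skips that group
```

If no configured node group passes the simulation, CA logs the Pods as unschedulable and does nothing — a common source of "why didn't it scale up?" confusion.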
Scale-down is conservative by design — A node is only removed if it has been "unneeded" (all its Pods can be rescheduled elsewhere) for the full scale-down-unneeded-time (10 minutes here). CA also respects PodDisruptionBudgets — if removing a node would violate a PDB, the scale-down is deferred. By default CA additionally skips nodes running non-DaemonSet system Pods such as CoreDNS (the install above relaxes this with skip-nodes-with-system-pods=false), and it never removes nodes whose Pods use local storage or nodes hosting a Pod annotated cluster-autoscaler.kubernetes.io/safe-to-evict: "false".
balance-similar-node-groups — When multiple node groups can satisfy a scale-up request, this flag distributes nodes evenly across them — preventing all new capacity from landing in a single AZ and maintaining zone balance.
Karpenter: The Modern Alternative
Karpenter is a newer node autoscaler built by AWS, now a CNCF project, that works on EKS and is being adopted on other clouds. It takes a fundamentally different approach from CA: instead of managing pre-configured node groups, Karpenter looks at each Pending Pod's requirements and provisions the most appropriate instance type on demand — from the full EC2 catalogue.
| | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node selection | Picks from pre-configured node groups (fixed instance type per group) | Selects the optimal instance type from the full EC2 catalogue for each workload |
| Provisioning speed | 2–5 minutes (ASG launch + node join) | ~60 seconds (direct EC2 API, bypasses ASG) |
| Consolidation | Removes empty/underutilised nodes one at a time | Actively replaces multiple small nodes with fewer larger ones (bin packing) |
| Spot support | Requires separate spot node groups | Mix on-demand and spot within the same NodePool |
| Config overhead | Many node groups to maintain for different workload types | One or a few NodePool definitions cover all workloads |
The scenario: You want Karpenter to provision nodes for your EKS cluster — using a mix of on-demand and spot instances, picking the best instance type automatically, and actively consolidating underutilised nodes to cut costs.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
metadata:
labels:
managed-by: karpenter
spec:
nodeClassRef:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"] # Try spot first, fall back to on-demand
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"] # Compute, memory, and general purpose families
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["2"] # Only 3rd generation and newer
limits:
cpu: "200" # Hard cap — Karpenter won't provision beyond this
memory: 400Gi
disruption:
    consolidationPolicy: WhenUnderutilized # Actively consolidate underutilised nodes
    # Note: in v1beta1, consolidateAfter can only be set with the WhenEmpty
    # policy — it is not valid alongside WhenUnderutilized
expireAfter: 720h # Rotate nodes every 30 days (security patching)
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: AL2 # Amazon Linux 2 — Karpenter manages AMI selection
role: KarpenterNodeRole # IAM role for nodes to join the cluster
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: production # Which subnets to place nodes in
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: production
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
encrypted: true
$ kubectl get nodeclaims
NAME             TYPE         CAPACITY    ZONE         NODE
default-abc123   m5.xlarge    spot        us-east-1a   ip-10-0-2-44
default-def456   c5.2xlarge   spot        us-east-1b   ip-10-0-3-12
default-ghi789   r5.large     on-demand   us-east-1c   ip-10-0-4-88
# Karpenter selected different instance types for different workload shapes ✓

# Consolidation in action — two small nodes → one larger node:
$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep consolidat
Consolidating 2 nodes (m5.large x2) into 1 node (m5.xlarge)
Saving: $0.048/hr (38% cost reduction for these nodes)

$ kubectl get nodes
NAME                            STATUS   INSTANCE-TYPE   CAPACITY-TYPE
ip-10-0-2-44.compute.internal   Ready    m5.xlarge       spot        ← new consolidated node
ip-10-0-3-12.compute.internal   Ready    c5.2xlarge      spot
ip-10-0-4-88.compute.internal   Ready    r5.large        on-demand
What just happened?
Karpenter bypasses the ASG entirely — It calls the EC2 RunInstances API directly, which is why provisioning takes ~60 seconds instead of 2–5 minutes. It also selects the instance type in real time — if your workload needs 3.5 CPU and 7Gi memory, Karpenter picks the cheapest instance with at least that much capacity from the full EC2 catalogue, not just the fixed type in a node group.
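Karpenter's instance-type choice boils down to "cheapest type whose capacity satisfies the Pod's requests". The sketch below imitates that decision in plain shell against a tiny hypothetical catalogue — the four types and all prices are placeholder values, not real EC2 pricing, and this is an illustration of the idea, not Karpenter's actual algorithm:

```shell
# Hypothetical catalogue: name vcpu mem_gi hourly_price (placeholder prices).
CATALOGUE='m5.large 2 8 0.096
m5.xlarge 4 16 0.192
c5.xlarge 4 8 0.170
r5.large 2 16 0.126'

NEED_CPU=3.5   # the Pod requests 3.5 vCPU
NEED_MEM=7     # and 7Gi of memory

# Pick the cheapest type with at least the requested CPU and memory.
BEST=$(printf '%s\n' "$CATALOGUE" | awk -v cpu="$NEED_CPU" -v mem="$NEED_MEM" '
  $2 >= cpu && $3 >= mem && (best == "" || $4 < best) { best = $4; name = $1 }
  END { print name }')

echo "$BEST"   # cheapest type from the hypothetical list that fits the request
```

Here the two m5/c5 xlarge types both fit, and the cheaper one wins; the real selection additionally weighs spot availability and zone placement across the full catalogue.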
Consolidation is proactive, not reactive — Rather than waiting for a node to be empty, Karpenter simulates whether it can move all Pods from two underutilised nodes onto one larger (but cheaper) node. If PodDisruptionBudgets allow it, it evicts the Pods and terminates both small nodes, replacing them with one right-sized one. This typically cuts node costs by 20–40% compared to CA's passive scale-down.
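If aggressive consolidation worries you, newer Karpenter releases let you rate-limit it with disruption budgets on the NodePool. A sketch of the `spec.disruption` block (field names follow the v1beta1 API; verify support against your installed version, and the schedule shown is just an example):

```yaml
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    budgets:
    - nodes: "10%"                 # at most 10% of nodes may be disrupted at once
    - nodes: "0"                   # freeze voluntary disruption entirely...
      schedule: "0 9 * * mon-fri"  # ...starting 09:00 on weekdays
      duration: 8h                 # ...for 8 hours (business hours)
```

Multiple budgets combine restrictively: whichever budget is most limiting at a given moment wins.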
Protecting Workloads During Scale-Down
Both CA and Karpenter respect PodDisruptionBudgets when deciding which nodes to remove. But some Pods should never be evicted — long-running batch jobs, Pods with local state, or critical single-replica services.
# PodDisruptionBudget: protect payment-api during scale-down
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: payment-api-pdb
namespace: payments
spec:
minAvailable: 2 # At least 2 Pods must remain available during disruption
# OR: maxUnavailable: 1 # At most 1 Pod can be unavailable at a time
selector:
matchLabels:
app: payment-api
# CA/Karpenter check this before evicting Pods from a node being removed
# If evicting a Pod would violate the PDB, the node is not removed
---
# Prevent a specific Pod from being evicted (e.g. a long-running batch job)
# Add these annotations to the Pod template:
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # CA never removes the node hosting this Pod
    karpenter.sh/do-not-disrupt: "true"                      # Karpenter's equivalent protection
$ kubectl apply -f pdb-and-annotations.yaml
poddisruptionbudget.policy/payment-api-pdb created

$ kubectl get pdb payment-api-pdb -n payments
NAME              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
payment-api-pdb   2               N/A               1                     5s
# With 3 replicas and minAvailable: 2, CA can evict 1 Pod at a time during scale-down ✓

# Verify the safe-to-evict annotation prevents node removal
$ kubectl describe pod batch-job-xyz -n default | grep safe-to-evict
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
# CA will never evict this Pod -- its node is pinned until the Job completes
Teacher's Note: Cluster Autoscaler vs Karpenter — which to use
Use Cluster Autoscaler if: you're on a non-EKS cloud (GKE, AKS both have native CA integrations), your team is familiar with node group management and wants predictable instance types per group, or you're running a workload with very specific CPU/memory ratios that benefit from a fixed instance type.
Use Karpenter if: you're on EKS and want the fastest provisioning, you want to maximise spot instance savings without maintaining separate node groups, or you have diverse workloads (small API Pods, large batch jobs, GPU workloads) that benefit from instance type flexibility. Karpenter is increasingly the recommended default for new EKS clusters.
One important operational note for both: always set PodDisruptionBudgets on any stateful or critical workload before enabling cluster autoscaling. Without PDBs, scale-down can evict all replicas of a service simultaneously, causing an outage. The autoscaler doesn't know which services are critical — PDBs are how you tell it.
Practice Questions
1. The Cluster Autoscaler triggers a scale-up when it detects Pods in which state — indicating they cannot be scheduled onto existing nodes?
2. Which Kubernetes resource does both Cluster Autoscaler and Karpenter check before evicting Pods from a node being removed — ensuring a minimum number of replicas remain available?
3. What is the Karpenter feature called that proactively replaces multiple underutilised small nodes with fewer, better-packed larger nodes to reduce costs?
Quiz
1. Why does Karpenter provision new nodes significantly faster than the Cluster Autoscaler?
2. How does the Cluster Autoscaler decide which node group to scale up when multiple groups are available?
3. A long-running data processing Job must not be interrupted by cluster scale-down. What annotation prevents CA from evicting the Pod?
Up Next · Lesson 52
Kubernetes Logging
Container logs are ephemeral — when a Pod dies, its logs disappear. This lesson covers the Kubernetes logging architecture, shipping logs with Fluentd and Fluent Bit, structured logging patterns, and centralising logs in Elasticsearch and CloudWatch.