Kubernetes Course
Kubernetes Scheduling
When a Pod is created and has no node assigned, the kube-scheduler decides where it runs. This decision involves filtering nodes that can't run the Pod, scoring the remaining candidates, and binding the Pod to the best one. Understanding this process lets you influence placement for performance, cost, and reliability.
The Scheduling Pipeline
The scheduler runs two phases for every unscheduled Pod:
Phase 1 — Filtering
Eliminates nodes that cannot run the Pod. Checks: sufficient CPU/memory, required node labels (nodeSelector), taints the Pod can't tolerate, affinity hard rules, port availability, volume topology.
Phase 2 — Scoring
Ranks surviving nodes. Prefers: most available resources, spreading Pods across nodes and zones, nodes already caching the image, affinity soft rules. Highest score wins.
If no node passes filtering: Pod stays Pending. kubectl describe pod → Events section shows why.
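The "affinity hard rules" checked during filtering and the "affinity soft rules" checked during scoring map directly onto the two node-affinity fields. A minimal sketch (the label keys and values here are illustrative, not from the cluster in this lesson):

```yaml
spec:
  affinity:
    nodeAffinity:
      # Hard rule — evaluated in the filtering phase; nodes that fail are eliminated
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]
      # Soft rule — evaluated in the scoring phase; matching nodes score higher
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values: ["compute-optimized"]
```

A Pod with this spec can only land in the two listed zones, and among the zones' nodes the scheduler prefers (but does not require) `node-type=compute-optimized`.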
nodeSelector: Simple Label-Based Placement
nodeSelector is the simplest scheduling constraint — a map of labels that the node must have. The Pod is only scheduled onto nodes where all specified labels match.
The scenario: Your GPU-intensive ML inference Pods must run on GPU nodes. You've labelled your GPU nodes with accelerator=nvidia-tesla-t4 and want to prevent inference Pods from landing on CPU-only nodes where they'd fail to start.
# Label your GPU nodes
kubectl label node gpu-node-1 accelerator=nvidia-tesla-t4
kubectl label node gpu-node-2 accelerator=nvidia-tesla-t4
# Verify node labels
kubectl get nodes --show-labels | grep accelerator
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-t4   # Pod only scheduled on nodes with this label
        # If no matching node exists → Pod stays Pending
      containers:
      - name: inference
        image: registry.company.com/ml-inference:2.0.0
        resources:
          limits:
            nvidia.com/gpu: 1          # Request 1 GPU from the device plugin
$ kubectl get pods -o wide
NAME                       READY   NODE
ml-inference-7d9f4-xkp2m   1/1     gpu-node-1   ← scheduled on GPU node ✓
ml-inference-7d9f4-rvqn2   1/1     gpu-node-2   ✓
# If no GPU node has capacity:
$ kubectl describe pod ml-inference-7d9f4-abc12
Events:
Warning  FailedScheduling  0/5 nodes are available:
                           3 node(s) didn't match Pod's node affinity/selector,
                           2 node(s) had insufficient nvidia.com/gpu.
nodeName: Direct Node Assignment
For debugging or very specific operational needs, you can bypass the scheduler entirely and specify the exact node name. The Pod is placed directly — no scheduling logic runs.
spec:
  nodeName: ip-10-0-1-45.us-east-1.compute.internal  # Skip scheduler — place directly on this node
  # Use cases: debugging a node-specific issue, DaemonSet-like behaviour for a single Pod
  # Avoid in production deployments — bypasses scheduling checks (node capacity, taints)
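As a complete example, a one-off debug Pod pinned to a specific node might look like this (the node name matches the snippet above; the image and Pod name are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-debug
spec:
  nodeName: ip-10-0-1-45.us-east-1.compute.internal  # kubelet on this node admits the Pod directly
  restartPolicy: Never           # one-off debugging Pod — don't restart when it exits
  containers:
  - name: debug
    image: busybox:1.36
    command: ["sleep", "3600"]   # keep the Pod alive for an hour of kubectl exec sessions
```

Because the scheduler never sees this Pod, it will be admitted even if the node is cordoned, though the kubelet can still reject it for insufficient resources.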
Resource-Based Scheduling and Priority
The scheduler uses resources.requests — not limits — to make placement decisions. A node with 4 CPU cores that already has Pods requesting 3.5 cores can only accept new Pods requesting 0.5 cores or less, regardless of actual CPU usage at that moment.
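You can see this requests-based accounting on a live cluster by inspecting a node's allocatable capacity and its summed requests (the node name here is illustrative):

```shell
# Allocatable = node capacity minus system reservations; this is the
# budget the scheduler works against
kubectl describe node gpu-node-1 | grep -A 6 "Allocatable"

# The "Allocated resources" section sums the requests of every Pod on
# the node — once CPU requests reach 100%, no further Pods fit,
# no matter how idle the node looks in monitoring
kubectl describe node gpu-node-1 | grep -A 8 "Allocated resources"
```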
# PriorityClass: higher priority Pods can preempt lower priority ones
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-payments
value: 1000000                           # Higher value = higher priority (default is 0)
globalDefault: false                     # Don't make this the default for all Pods
preemptionPolicy: PreemptLowerPriority   # Evict lower-priority Pods if needed to schedule this one
description: "For payment processing — must always be schedulable"
---
# Use in a Pod:
spec:
  priorityClassName: high-priority-payments
  containers:
  - name: payment-processor
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
# If the cluster is full, Pods with lower PriorityClass values are evicted
# to make room for this Pod
$ kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT
system-cluster-critical   2000000000   false            ← built-in: kube-dns, metrics-server
system-node-critical      2000001000   false            ← built-in: kubelet, node daemons
high-priority-payments    1000000      false            ← your custom class
(default)                 0            true             ← all Pods without priorityClassName

# Preemption in action — cluster at capacity, high-priority Pod pending:
$ kubectl describe pod payment-processor-pending
Events:
  Normal  Preempting  Preempted Pod batch-job/data-exporter on node ip-10-0-1-12
  Normal  Scheduled   Successfully assigned to ip-10-0-1-12
What just happened?
Scheduling is based on requests, not usage — A node showing 10% CPU utilisation can still reject a new Pod if the sum of all Pod requests already equals node capacity. Always set meaningful resource requests — a Pod with no requests is effectively invisible to the scheduler's capacity model and can be scheduled onto a completely overloaded node.
PriorityClass and preemption for critical workloads — In a cluster under pressure, preemption ensures critical Pods like payment processors are always schedulable, even at the cost of evicting batch jobs or development workloads. Built-in system priorities (system-cluster-critical) ensure core cluster components like CoreDNS are never evicted by application workloads.
Teacher's Note: Debugging Pending Pods
kubectl describe pod <name> is your first tool for a Pending Pod. The Events section at the bottom tells you exactly why scheduling failed. Common messages and their fixes:
"Insufficient cpu/memory" — No node has enough allocatable capacity. Add nodes, reduce requests, or check whether resources reserved for system components (kube-reserved/system-reserved) are shrinking the node's allocatable capacity.
"didn't match node selector" — No node has the required labels. Check kubectl get nodes --show-labels — the label may be misspelled or the labelling step was missed.
"had taint ... that the pod didn't tolerate" — The Pod needs a toleration for a node taint. Covered in depth in the next lesson.
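Beyond kubectl describe, a few read-only commands help narrow down a Pending Pod (the Pod name and label are placeholders to substitute):

```shell
# Show only the events for one Pod, newest last
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp

# Survey all Pending Pods across every namespace
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Confirm the label your nodeSelector needs actually exists on some node
kubectl get nodes -l accelerator=nvidia-tesla-t4
```

An empty result from the last command immediately confirms the "didn't match node selector" case above.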
Practice Questions
1. The kube-scheduler uses which field — requests or limits — to determine whether a node has enough capacity to run a new Pod?
2. Which Pod spec field specifies a map of node labels that must all match for the Pod to be scheduled onto that node?
3. Which Kubernetes resource lets you assign a numeric priority to Pods so that higher-priority Pods can evict lower-priority ones when the cluster is at capacity?
Quiz
1. Describe the two phases the kube-scheduler runs for each unscheduled Pod.
2. A node shows 10% CPU utilisation in your monitoring dashboard but new Pods are still failing to schedule there with "Insufficient cpu". Why?
3. A Pod has been Pending for 10 minutes. What is the first command to run to diagnose why it hasn't been scheduled?
Up Next · Lesson 47
Taints and Tolerations
Taints let node operators mark nodes as unsuitable for general workloads — only Pods with matching tolerations can be scheduled there. The mechanism behind dedicated node pools, spot instance protection, and GPU node isolation.