Kubernetes Course
Kubernetes Scheduling
When a Pod is created and has no node assigned, the kube-scheduler decides where it runs. This decision involves filtering nodes that can't run the Pod, scoring the remaining candidates, and binding the Pod to the best one. Understanding this process lets you influence placement for performance, cost, and reliability.
The Scheduling Pipeline
The scheduler runs two phases for every unscheduled Pod:
Phase 1 — Filtering
Eliminates nodes that cannot run the Pod. Checks: sufficient CPU/memory, required node labels (nodeSelector), taints the Pod can't tolerate, affinity hard rules, port availability, volume topology.
Phase 2 — Scoring
Ranks surviving nodes. Prefers: most available resources, spreading Pods across nodes and zones, nodes already caching the image, affinity soft rules. Highest score wins.
If no node passes filtering: Pod stays Pending. kubectl describe pod → Events section shows why.
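The "affinity hard rules" checked during filtering and the "affinity soft rules" checked during scoring map directly onto the two node-affinity fields. A minimal sketch (the label keys and values here are illustrative, not from the cluster in this lesson):

```yaml
spec:
  affinity:
    nodeAffinity:
      # Hard rule — evaluated in the filtering phase; nodes that fail are eliminated
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]
      # Soft rule — evaluated in the scoring phase; matching nodes score higher
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values: ["compute-optimized"]
```

A Pod with this spec can only land in the two listed zones, and among the zones' nodes the scheduler prefers (but does not require) `node-type=compute-optimized`.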
nodeSelector: Simple Label-Based Placement
nodeSelector is the simplest scheduling constraint — a map of labels that the node must have. The Pod is only scheduled onto nodes where all specified labels match.
The scenario: Your GPU-intensive ML inference Pods must run on GPU nodes. You've labelled your GPU nodes with accelerator=nvidia-tesla-t4 and want to prevent inference Pods from landing on CPU-only nodes where they'd fail to start.
# Label your GPU nodes
kubectl label node gpu-node-1 accelerator=nvidia-tesla-t4
kubectl label node gpu-node-2 accelerator=nvidia-tesla-t4
# Verify node labels
kubectl get nodes --show-labels | grep accelerator
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-t4   # Pod only scheduled on nodes with this label
        # If no matching node exists → Pod stays Pending
      containers:
      - name: inference
        image: registry.company.com/ml-inference:2.0.0
        resources:
          limits:
            nvidia.com/gpu: 1          # Request 1 GPU from the device plugin
$ kubectl get pods -o wide
NAME                       READY   NODE
ml-inference-7d9f4-xkp2m   1/1     gpu-node-1   ← scheduled on GPU node ✓
ml-inference-7d9f4-rvqn2   1/1     gpu-node-2   ✓
# If no GPU node has capacity:
$ kubectl describe pod ml-inference-7d9f4-abc12
Events:
Warning  FailedScheduling  0/5 nodes are available:
                           3 node(s) didn't match Pod's node affinity/selector,
                           2 node(s) had insufficient nvidia.com/gpu.
nodeName: Direct Node Assignment
For debugging or very specific operational needs, you can bypass the scheduler entirely and specify the exact node name. The Pod is placed directly — no scheduling logic runs.
spec:
  nodeName: ip-10-0-1-45.us-east-1.compute.internal  # Skip scheduler — place directly on this node
  # Use cases: debugging a node-specific issue, DaemonSet-like behaviour for a single Pod
  # Avoid in production deployments — bypasses scheduling checks (node capacity, taints)
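As a complete example, a one-off debug Pod pinned to a specific node might look like this (the node name matches the snippet above; the image and Pod name are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-debug
spec:
  nodeName: ip-10-0-1-45.us-east-1.compute.internal  # kubelet on this node admits the Pod directly
  restartPolicy: Never           # one-off debugging Pod — don't restart when it exits
  containers:
  - name: debug
    image: busybox:1.36
    command: ["sleep", "3600"]   # keep the Pod alive for an hour of kubectl exec sessions
```

Because the scheduler never sees this Pod, it will be admitted even if the node is cordoned, though the kubelet can still reject it for insufficient resources.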
Resource-Based Scheduling and Priority
The scheduler uses resources.requests — not limits — to make placement decisions. A node with 4 CPU cores that already has Pods requesting 3.5 cores can only accept new Pods requesting 0.5 cores or less, regardless of actual CPU usage at that moment.
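You can see this requests-based accounting on a live cluster by inspecting a node's allocatable capacity and its summed requests (the node name here is illustrative):

```shell
# Allocatable = node capacity minus system reservations; this is the
# budget the scheduler works against
kubectl describe node gpu-node-1 | grep -A 6 "Allocatable"

# The "Allocated resources" section sums the requests of every Pod on
# the node — once CPU requests reach 100%, no further Pods fit,
# no matter how idle the node looks in monitoring
kubectl describe node gpu-node-1 | grep -A 8 "Allocated resources"
```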
# PriorityClass: higher priority Pods can preempt lower priority ones
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-payments
value: 1000000                           # Higher value = higher priority (default is 0)
globalDefault: false                     # Don't make this the default for all Pods
preemptionPolicy: PreemptLowerPriority   # Evict lower-priority Pods if needed to schedule this one
description: "For payment processing — must always be schedulable"
---
# Use in a Pod:
spec:
  priorityClassName: high-priority-payments
  containers:
  - name: payment-processor
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
# If the cluster is full, Pods with lower PriorityClass values are evicted
# to make room for this Pod
$ kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT
system-cluster-critical   2000000000   false            ← built-in: kube-dns, metrics-server
system-node-critical      2000001000   false            ← built-in: kubelet, node daemons
high-priority-payments    1000000      false            ← your custom class
(default)                 0            true             ← all Pods without priorityClassName

# Preemption in action — cluster at capacity, high-priority Pod pending:
$ kubectl describe pod payment-processor-pending
Events:
  Normal  Preempting  Preempted Pod batch-job/data-exporter on node ip-10-0-1-12
  Normal  Scheduled   Successfully assigned to ip-10-0-1-12
What just happened?
Scheduling is based on requests, not usage — A node showing 10% CPU utilisation can still reject a new Pod if the sum of all Pod requests already equals node capacity. Always set meaningful resource requests — a Pod with no requests is effectively invisible to the scheduler's capacity model and can be scheduled onto a completely overloaded node.
PriorityClass and preemption for critical workloads — In a cluster under pressure, preemption ensures critical Pods like payment processors are always schedulable, even at the cost of evicting batch jobs or development workloads. Built-in system priorities (system-cluster-critical) ensure core cluster components like CoreDNS are never evicted by application workloads.
Teacher's Note: Debugging Pending Pods
kubectl describe pod <name> is your first tool for a Pending Pod. The Events section at the bottom tells you exactly why scheduling failed. Common messages and their fixes:
"Insufficient cpu/memory" — No node has enough allocatable capacity. Add nodes, reduce requests, or check whether resources reserved for system components (kube-reserved/system-reserved) are shrinking the node's allocatable capacity.
"didn't match node selector" — No node has the required labels. Check kubectl get nodes --show-labels — the label may be misspelled or the labelling step was missed.
"had taint ... that the pod didn't tolerate" — The Pod needs a toleration for a node taint. Covered in depth in the next lesson.
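Beyond kubectl describe, a few read-only commands help narrow down a Pending Pod (the Pod name and label are placeholders to substitute):

```shell
# Show only the events for one Pod, newest last
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp

# Survey all Pending Pods across every namespace
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Confirm the label your nodeSelector needs actually exists on some node
kubectl get nodes -l accelerator=nvidia-tesla-t4
```

An empty result from the last command immediately confirms the "didn't match node selector" case above.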
Practice Questions
1. The kube-scheduler uses which field — requests or limits — to determine whether a node has enough capacity to run a new Pod?
2. Which Pod spec field specifies a map of node labels that must all match for the Pod to be scheduled onto that node?
3. Which Kubernetes resource lets you assign a numeric priority to Pods so that higher-priority Pods can evict lower-priority ones when the cluster is at capacity?
Quiz
1. Describe the two phases the kube-scheduler runs for each unscheduled Pod.
2. A node shows 10% CPU utilisation in your monitoring dashboard but new Pods are still failing to schedule there with "Insufficient cpu". Why?
3. A Pod has been Pending for 10 minutes. What is the first command to run to diagnose why it hasn't been scheduled?
Up Next · Lesson 47
Taints and Tolerations
Taints let node operators mark nodes as unsuitable for general workloads — only Pods with matching tolerations can be scheduled there. The mechanism behind dedicated node pools, spot instance protection, and GPU node isolation.