Kubernetes Course
Node & Pod Affinity
nodeSelector matches simple label equality. Affinity rules go further: label expressions with operators (In, NotIn, Exists), hard requirements vs soft preferences, and Pod-to-Pod co-location or anti-co-location — keeping replicas spread across zones for high availability.
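To make the comparison concrete, here is a minimal sketch of the same hard constraint written both ways. The disktype: ssd label is a hypothetical example, not something defined in this lesson, and the two fragments are alternatives (shown as two separate YAML documents), not meant to be combined.

```yaml
# Option A: nodeSelector, plain label equality
spec:
  nodeSelector:
    disktype: ssd              # hypothetical node label
---
# Option B: the same constraint as a hard node affinity rule
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype      # same hypothetical label
            operator: In
            values:
            - ssd
```

Option B is longer, but the values list can hold several entries and the operator can be swapped, which nodeSelector cannot express.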
Required vs Preferred
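Both node and Pod affinity support two modes. In both names, IgnoredDuringExecution means the rule is checked only at scheduling time, so a Pod that is already running is not evicted if labels change later. The two modes are: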
requiredDuringSchedulingIgnoredDuringExecution
Hard requirement. Pod stays Pending if no matching node exists. Like nodeSelector but with richer expressions.
preferredDuringSchedulingIgnoredDuringExecution
Soft preference. Scheduler tries to satisfy it but falls back to any available node. Has a weight (1–100).
Node Affinity
The scenario: Your payment API must run in the us-east-1a or us-east-1b AZs (data residency requirement) and should prefer large instance types for better performance.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In           # In: zone must be one of these values
            values:
            - us-east-1a
            - us-east-1b
            # NotIn: must NOT be one of the values
            # Exists: key must exist (any value)
            # DoesNotExist: key must not exist
            # Gt / Lt: label value, parsed as an integer, must be greater / less than the single given value
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                 # Higher weight = stronger preference (1–100)
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - m5.2xlarge
            - m5.4xlarge           # Prefer large instances but don't require them
      - weight: 20
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a           # Slightly prefer 1a over 1b within the required set
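The fragment above starts at the Pod spec. In a Deployment, that corresponds to spec.template.spec. The following is a minimal sketch of where the rule sits, assuming the payments namespace and app: payment-api label used elsewhere in this lesson; the image name is a hypothetical placeholder.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:                          # Pod spec: affinity rules go here
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
                - us-east-1b
      containers:
      - name: payment-api
        image: registry.example.com/payment-api:1.0   # hypothetical image
```

When combining conditions, note that multiple entries under nodeSelectorTerms are ORed, while multiple matchExpressions inside one term are ANDed.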
Pod Anti-Affinity: Spreading Across Zones
The scenario: Your payment API has 3 replicas. If all 3 land in the same AZ and that AZ fails, the service goes down. Pod anti-affinity ensures replicas are spread across availability zones — a hard requirement for high availability.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: payment-api       # Match other Pods with this label
        topologyKey: topology.kubernetes.io/zone
        # topologyKey defines the "scope" of the anti-affinity
        # zone: no two payment-api Pods in the same AZ (hard HA requirement)
        # kubernetes.io/hostname: no two payment-api Pods on the same node
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: payment-api
          topologyKey: kubernetes.io/hostname
          # Also prefer different nodes within each zone (belt and suspenders)
$ kubectl get pods -o wide -n payments
NAME                READY   STATUS    NODE            ZONE
payment-api-abc-1   1/1     Running   node-1a-large   us-east-1a   ← one per zone ✓
payment-api-abc-2   1/1     Running   node-1b-large   us-east-1b   ✓
payment-api-abc-3   0/1     Pending   <none>          <none>       ← 3rd Pod, only 2 zones available
# Output simplified; the ZONE column is added for illustration (kubectl does not print it by default)
# Hard anti-affinity per zone is satisfied for the first two Pods: 1a and 1b used
# The 3rd Pod needs a 3rd zone but only 2 exist, so it stays Pending (a required rule never falls back)
# For strict 1-per-zone, set replicas ≤ number of zones
What just happened?
topologyKey is the spreading domain. Setting topology.kubernetes.io/zone means "no two of these Pods in the same zone." Setting kubernetes.io/hostname means "no two on the same node." Zone spreading matters most for availability: a node failure takes out one Pod, but a zone failure takes out every Pod in that zone.
Pod Affinity (co-location) is the inverse of anti-affinity: it schedules this Pod close to Pods with a given label. Use cases: a cache Pod that must be on the same node as the service that uses it (topologyKey: kubernetes.io/hostname), or an ML feature server that benefits from low latency to the model server in the same AZ (topologyKey: topology.kubernetes.io/zone).
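The co-location case can be sketched as follows. This is an illustrative fragment for the cache Pod's spec, assuming the service Pods carry the app: payment-api label used in the anti-affinity example.

```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: payment-api                  # assumed label on the service Pods
        topologyKey: kubernetes.io/hostname   # same node as a matching Pod
```

Because this rule is required, the cache Pod stays Pending until a matching Pod is running on a node where the cache also fits. Changing topologyKey to topology.kubernetes.io/zone relaxes the rule to "same AZ", and moving it under preferredDuringSchedulingIgnoredDuringExecution (with a weight and a podAffinityTerm) turns it into a soft preference.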
topologySpreadConstraints: The Modern Spread Primitive
Pod anti-affinity for spreading has a limitation: with required, any replica beyond the number of topology domains goes Pending. topologySpreadConstraints is the modern, more flexible alternative — it spreads Pods as evenly as possible across domains while allowing overflow.
spec:
  topologySpreadConstraints:
  - maxSkew: 1                         # Max allowed difference in Pod count between zones
                                       # maxSkew: 1 → if zone-a has 3, zone-b can have at most 4
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # Hard: Pod stays Pending if skew would exceed maxSkew
                                       # ScheduleAnyway: Soft, schedule even if skew is exceeded
    labelSelector:
      matchLabels:
        app: payment-api               # Only count Pods with this label toward skew
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway  # Soft: spread across nodes best-effort
    labelSelector:
      matchLabels:
        app: payment-api
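To see how maxSkew plays out, here is a worked trace under simplifying assumptions: two eligible zones (zone-a and zone-b), four replicas, only the zone constraint, and spare capacity everywhere. The skew of a zone is its count of matching Pods minus the lowest count in any eligible zone. The replica count is an assumed value chosen for the trace, and the fragment omits the selector and containers.

```yaml
# Deployment fragment (selector and containers omitted for brevity)
spec:
  replicas: 4                          # assumed value for this trace
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: payment-api
# Placement trace, counts shown as zone-a / zone-b after each Pod:
#   Pod 1 -> zone-a   1 / 0   skew 1, allowed (zone-b would be equally valid)
#   Pod 2 -> zone-b   1 / 1   zone-a would give 2 / 0 (skew 2), so it is rejected
#   Pod 3 -> zone-a   2 / 1   skew 1, allowed (zone-b would be equally valid)
#   Pod 4 -> zone-b   2 / 2   zone-a would give 3 / 1 (skew 2), so it is rejected
```

All four Pods are scheduled even though only two zones exist, whereas a required zone-level anti-affinity rule would have left Pods 3 and 4 Pending.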
Teacher's Note: Which spread mechanism to use
nodeSelector — Use for simple, mandatory node label constraints. GPU pools, OS type, bare metal vs virtualised.
Node affinity — Use when you need expressions (In, NotIn, Gt, Lt) or soft preferences with weights. Data residency (must be in these AZs), prefer large instance types.
Pod anti-affinity — Use when the domain you want to spread across is determined by other Pods (not node labels). "No two replicas on the same node."
topologySpreadConstraints — Use for even distribution across zones or nodes without the hard-limit problem of required anti-affinity. The best default for multi-replica Deployments in multi-AZ clusters.
Practice Questions
1. Which affinity mode causes a Pod to stay Pending indefinitely if no node satisfies the constraint — a hard requirement?
2. In a podAntiAffinity rule, which field defines whether Pods are spread across availability zones vs individual nodes?
3. Which Pod spec field provides flexible, even spreading across topology domains without causing excess replicas to go Pending when the number of replicas exceeds the number of domains?
Quiz
1. All 3 replicas of your payment API land in us-east-1a. Why is this a problem and which affinity mechanism solves it?
2. You want your app and its Redis cache to run in the same AZ for latency, but the app must still deploy even if the same-AZ preference can't be satisfied. Which mode do you use?
3. In topologySpreadConstraints, what does maxSkew: 1 mean?