Kubernetes Course
StatefulSets
Deployments are great for stateless workloads — Pods are interchangeable, replaceable, and freely scalable. StatefulSets exist for workloads that are not interchangeable: databases, message brokers, and distributed systems where each instance has a stable identity, a stable hostname, and its own persistent storage that must survive Pod restarts.
Why Deployments Break for Stateful Workloads
Imagine running a PostgreSQL cluster with a Deployment. Three Pods start, all with random names like postgres-7d9f4-xkp2m. Do they all mount the same PVC? No — a volume with ReadWriteOnce access mode can only be mounted by a single node. Separate PVCs, then? Fine, but which Pod owns which volume? Pod names change on restart — the primary could restart and attach to a different volume, coming back as a replica that thinks it's the primary. And services that connect to "the primary by DNS name" have no stable address to use.
StatefulSets solve all four of these problems with four guarantees that Deployments don't provide:
Stable, unique Pod names
postgres-0, postgres-1, postgres-2 — ordinal index, always. After restart, postgres-0 is still postgres-0.
Stable network identity
Each Pod gets a predictable DNS name: postgres-0.postgres.payments.svc. Clients always know how to reach the primary.
Stable, dedicated storage
Each Pod gets its own PVC via a volumeClaimTemplate. postgres-0's volume follows postgres-0 across restarts and rescheduling.
Ordered, graceful operations
Scale up: 0 ready → 1 starts. Scale down: highest ordinal deleted first. Upgrades proceed one Pod at a time, lowest to highest.
Headless Services: The DNS Foundation
StatefulSets require a headless Service — a Service with clusterIP: None. Unlike a normal Service that load-balances across Pods, a headless Service tells CoreDNS to return the IP addresses of all matching Pods directly. This is what gives each StatefulSet Pod its own DNS name.
apiVersion: v1
kind: Service
metadata:
  name: postgres        # This name becomes part of every Pod's DNS record
  namespace: payments
  labels:
    app: postgres
spec:
  clusterIP: None       # Headless: no virtual IP — CoreDNS returns Pod IPs directly
  selector:
    app: postgres
  ports:
    - port: 5432
      name: postgres

# With clusterIP: None, DNS for this service works like this:
#   postgres.payments.svc.cluster.local → returns all Pod IPs (for clients that do their own LB)
#   postgres-0.postgres.payments.svc    → returns postgres-0's IP only
#   postgres-1.postgres.payments.svc    → returns postgres-1's IP only
# This is how the primary (postgres-0) gets a stable, routable DNS name
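The per-Pod DNS records follow a fixed template. A minimal Python sketch, using the Service and namespace names from this lesson's example, shows how the records are composed:

```python
def pod_dns(pod: str, service: str, namespace: str,
            cluster_domain: str = "cluster.local") -> str:
    """Compose the stable DNS name a headless Service gives a StatefulSet Pod:
    <pod>.<service>.<namespace>.svc.<cluster-domain>"""
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"

# Each replica of the postgres StatefulSet gets its own record:
for ordinal in range(3):
    print(pod_dns(f"postgres-{ordinal}", "postgres", "payments"))
# postgres-0.postgres.payments.svc.cluster.local
# postgres-1.postgres.payments.svc.cluster.local
# postgres-2.postgres.payments.svc.cluster.local
```

Note that `cluster.local` is the default cluster domain; clusters configured with a different domain would see that suffix instead.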
Writing a StatefulSet
The scenario: You need to run a PostgreSQL primary with two read replicas. The primary (postgres-0) must always have its own dedicated PVC. Replicas (postgres-1, postgres-2) must start only after the primary is ready. App services connect to the primary by name.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: payments
spec:
  serviceName: postgres  # REQUIRED: name of the headless Service above.
                         # This is what enables per-Pod DNS records
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  updateStrategy:
    type: RollingUpdate  # RollingUpdate: update Pods one at a time, highest ordinal first
    rollingUpdate:
      partition: 0       # Update all Pods (partition=0). Set to 2 to update only postgres-2,
                         # leaving postgres-0 and postgres-1 at the old version (canary pattern)
  podManagementPolicy: OrderedReady  # OrderedReady: wait for each Pod to be Ready before starting the next.
                                     # Parallel: start all Pods simultaneously (for non-dependent replicas)
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60  # Give postgres time to finish transactions on shutdown
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: payments
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: POSTGRES_PASSWORD
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
              # IMPORTANT: the pgdata subdirectory avoids the lost+found problem.
              # PostgreSQL won't start if PGDATA contains unexpected files, and
              # lost+found appears when mounting an ext4 volume — so PGDATA goes in a subdirectory
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data  # Mount the PVC here
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
  volumeClaimTemplates:  # THE key StatefulSet feature: a dedicated PVC per Pod
    - metadata:
        name: postgres-data  # PVC name prefix — becomes postgres-data-postgres-0, etc.
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 50Gi
# Kubernetes creates: postgres-data-postgres-0, postgres-data-postgres-1, postgres-data-postgres-2
# Each Pod mounts ONLY its own PVC — data isolation is guaranteed by Kubernetes
$ kubectl apply -f postgres-statefulset.yaml
service/postgres created
statefulset.apps/postgres created

# Watch ordered startup: 0 → 1 → 2 (OrderedReady policy)
$ kubectl get pods -n payments -w
NAME         READY   STATUS              RESTARTS
postgres-0   0/1     ContainerCreating
postgres-0   0/1     Running
postgres-0   1/1     Running             ← postgres-0 Ready — NOW postgres-1 starts
postgres-1   0/1     ContainerCreating
postgres-1   1/1     Running             ← postgres-1 Ready — NOW postgres-2 starts
postgres-2   0/1     ContainerCreating
postgres-2   1/1     Running

$ kubectl get pvc -n payments
NAME                       STATUS   VOLUME       CAPACITY   STORAGECLASS
postgres-data-postgres-0   Bound    pvc-a1b2c3   50Gi       gp3-encrypted
postgres-data-postgres-1   Bound    pvc-d4e5f6   50Gi       gp3-encrypted
postgres-data-postgres-2   Bound    pvc-g7h8i9   50Gi       gp3-encrypted

# Verify per-Pod DNS records from within the cluster
$ kubectl run dns-test --image=busybox --restart=Never -n payments -- \
    nslookup postgres-0.postgres.payments.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      postgres-0.postgres.payments.svc.cluster.local
Address 1: 192.168.2.15   ← postgres-0's Pod IP ✓
What just happened?
volumeClaimTemplates creates one PVC per Pod — Kubernetes automatically creates postgres-data-postgres-0, postgres-data-postgres-1, and postgres-data-postgres-2. Each PVC is bound to exactly one Pod — the naming convention is {volumeClaimTemplate.name}-{statefulset.name}-{ordinal}. If postgres-1 is rescheduled to a different node, it re-attaches to postgres-data-postgres-1. Its data follows it.
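That naming convention is simple enough to express as a one-liner. A sketch, mirroring the pattern described above (nothing here is a Kubernetes API call, just the string convention):

```python
def pvc_name(template_name: str, statefulset_name: str, ordinal: int) -> str:
    """PVC created from a volumeClaimTemplate:
    {volumeClaimTemplate.name}-{statefulset.name}-{ordinal}"""
    return f"{template_name}-{statefulset_name}-{ordinal}"

# The three PVCs this lesson's StatefulSet produces:
for i in range(3):
    print(pvc_name("postgres-data", "postgres", i))
# postgres-data-postgres-0
# postgres-data-postgres-1
# postgres-data-postgres-2
```

Because the name is a pure function of the StatefulSet name and ordinal, a rescheduled Pod always computes back to the same claim, which is why its data follows it.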
OrderedReady ensures safe cluster formation — postgres-0 must be Ready before postgres-1 starts. In a real PostgreSQL HA setup, postgres-0 is the primary. Replicas need to connect to an already-running primary during their startup (streaming replication handshake). OrderedReady enforces this sequence without any application-level coordination.
The PGDATA subdirectory trick — When Kubernetes mounts a new EBS volume, the Linux kernel may create a lost+found directory. PostgreSQL refuses to initialise in a non-empty directory. Setting PGDATA=/var/lib/postgresql/data/pgdata (a subdirectory of the mount point) sidesteps this — lost+found lives in the parent directory, PGDATA initialises cleanly in the child.
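The effect of the subdirectory trick can be simulated with plain filesystem operations. This is only a sketch: `initdb`'s actual check is more involved, but it does refuse a non-empty data directory, and that is the behavior modeled here:

```python
import os
import tempfile

mount = tempfile.mkdtemp()                    # stands in for the freshly mounted volume
os.mkdir(os.path.join(mount, "lost+found"))   # the ext4 artifact that trips up initdb

def dir_is_empty(path: str) -> bool:
    """True if path exists, is a directory, and contains no entries."""
    return os.path.isdir(path) and not os.listdir(path)

# Using the mount point directly as PGDATA: initdb would refuse
print(dir_is_empty(mount))    # False, because lost+found is in the way

# Using PGDATA=<mount>/pgdata: the subdirectory starts empty
pgdata = os.path.join(mount, "pgdata")
os.makedirs(pgdata)
print(dir_is_empty(pgdata))   # True, so PostgreSQL initialises cleanly
```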
Scaling, Updates, and the Partition Field
StatefulSet scaling and updates are ordered — always highest ordinal first for scale-down and updates, lowest ordinal first for scale-up. The partition field in the update strategy enables canary-style rollouts: only Pods with ordinal ≥ partition are updated.
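Which Pods a rolling update touches is just an ordinal comparison. A minimal sketch of that rule, using this lesson's Pod names (the function name is ours, not a Kubernetes API):

```python
def pods_to_update(replicas: int, partition: int) -> list[str]:
    """RollingUpdate with a partition: only Pods with ordinal >= partition
    are updated, and the controller proceeds from the highest ordinal down."""
    return [f"postgres-{i}" for i in range(replicas - 1, partition - 1, -1)]

print(pods_to_update(3, 0))  # ['postgres-2', 'postgres-1', 'postgres-0'], full rollout
print(pods_to_update(3, 2))  # ['postgres-2'], canary: only the last replica
```

Lowering the partition back to 0 after the canary checks out lets the remaining Pods update in the same highest-first order.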
# Scale up: postgres-3 starts after postgres-2 is Ready
kubectl scale statefulset postgres --replicas=4 -n payments
# Scale down: postgres-3 deleted first, then postgres-2, etc.
# The primary (postgres-0) is always the last to be deleted
kubectl scale statefulset postgres --replicas=2 -n payments
# Rolling update: update image version
# With partition=0, updates all Pods: highest ordinal (postgres-2) first
kubectl set image statefulset/postgres postgres=postgres:16 -n payments
# Watch ordered update (highest ordinal first):
kubectl rollout status statefulset/postgres -n payments
# Waiting for 3 pods to be ready...
# Waiting for partitioned roll out to finish: 1 out of 3 new pods have been updated...
# Waiting for partitioned roll out to finish: 2 out of 3 new pods have been updated...
# statefulset rolling update complete 3 pods at revision postgres-6d9f4...
# Canary pattern: update only postgres-2 first, test, then update the rest
kubectl patch statefulset postgres -n payments \
-p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset/postgres postgres=postgres:16 -n payments
# Only postgres-2 updates — postgres-0 and postgres-1 remain on postgres:15
# Test postgres-2. If healthy:
kubectl patch statefulset postgres -n payments \
-p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
# Now all Pods update to postgres:16
$ kubectl scale statefulset postgres --replicas=1 -n payments
$ kubectl get pods -n payments -w
postgres-2   1/1   Running
postgres-1   1/1   Running
postgres-0   1/1   Running
postgres-2   1/1   Terminating   ← highest ordinal deleted first ✓
postgres-2   0/1   Terminating
postgres-2   Deleted
postgres-1   1/1   Terminating   ← then postgres-1
postgres-1   Deleted
# postgres-0 (primary) unaffected ✓

$ kubectl get pvc -n payments
NAME                       STATUS   VOLUME       CAPACITY
postgres-data-postgres-0   Bound    pvc-a1b2c3   50Gi   ← PVCs persist after Pod deletion
postgres-data-postgres-1   Bound    pvc-d4e5f6   50Gi   ← data is NOT deleted with scale-down
postgres-data-postgres-2   Bound    pvc-g7h8i9   50Gi   ← must delete PVCs manually if unwanted
What just happened?
PVCs are NOT deleted when scaling down — This is intentional and important. Deleting a Pod does not delete its data. If you scale from 3 replicas to 1, postgres-data-postgres-1 and postgres-data-postgres-2 persist. When you scale back up, postgres-1 and postgres-2 re-attach to their original volumes with all data intact. To actually free the storage, delete the PVCs manually after scaling down.
The partition canary pattern for databases — Running a new PostgreSQL minor version on postgres-2 (a replica) first lets you test replication compatibility, performance, and extension compatibility before updating the primary. If postgres-2 at the new version replicates correctly from postgres-0 at the old version for 24 hours, you have strong confidence the update is safe. Then update the primary.
StatefulSet vs Deployment: When to Use Each
| | Deployment | StatefulSet |
|---|---|---|
| Pod identity | Random names (app-7d9f4-xkp). Interchangeable. | Stable ordinal names (app-0, app-1). Unique. |
| Storage | Shared PVC or ephemeral. All Pods see the same volume. | Per-Pod PVC via volumeClaimTemplates. Data isolation. |
| DNS | Single Service DNS → load balances across all Pods. | Per-Pod DNS via headless Service. Direct addressing. |
| Scaling | Simultaneous. Add/remove any Pod. | Ordered. Highest ordinal first on scale-down. |
| Use for | Web servers, APIs, workers — anything stateless. | PostgreSQL, MySQL, Redis Sentinel, Kafka, ZooKeeper, Elasticsearch — stateful systems that need stable identity. |
Teacher's Note: Should you run databases in Kubernetes?
This is one of the most debated questions in the Kubernetes community. The short answer: StatefulSets are capable enough, but managed cloud databases (RDS, Cloud SQL, Aurora) often make more sense for production unless you have a strong operational reason to self-manage.
When Kubernetes-hosted databases make sense: you need database types not available as managed services (a specific Postgres extension, a custom Redis fork), you're running on-premises with no cloud option, your team has deep database operations expertise and owns the full stack, or you're using a Kubernetes operator (CloudNativePG, Percona Operator) that automates HA, backups, and failover — turning the StatefulSet complexity into a higher-level abstraction.
When to use managed services instead: your team is not a database operations team, you need point-in-time recovery without building it yourself, you want automated minor version upgrades and security patching, or your SLA requires multi-AZ failover in under 30 seconds. RDS can do all of these. A self-managed StatefulSet cannot — unless you build that capability on top of it.
Practice Questions
1. Which StatefulSet field automatically creates a dedicated PVC for each Pod, named after the Pod's ordinal index?
2. What setting on a Service makes it a headless Service — enabling per-Pod DNS records for StatefulSet Pods?
3. A StatefulSet named postgres is deployed in namespace payments with headless Service also named postgres. What is the full DNS name for the first Pod?
Quiz
1. You scale a StatefulSet from 3 replicas to 1. What happens to the PVCs that were bound to Pods 1 and 2?
2. You want to test a new database image on only the last replica of a 3-Pod StatefulSet before rolling it out to the primary. How do you use the partition field?
3. Why is podManagementPolicy: OrderedReady important for a PostgreSQL primary-replica StatefulSet?
Up Next · Lesson 31
Kubernetes Networking Deep Dive
Section III begins. How does traffic actually flow between Pods on different nodes? This lesson covers the flat network model, CNI plugins, kube-proxy iptables rules, and how to debug connectivity at every layer.