Kubernetes Lesson 30 – StatefulSets | Dataplexa
Core Concepts · Lesson 30

StatefulSets

Deployments are great for stateless workloads — Pods are interchangeable, replaceable, and scalable freely. StatefulSets exist for workloads that are not interchangeable: databases, message brokers, and distributed systems where each instance has a stable identity, a stable hostname, and its own persistent storage that must survive Pod restarts.

Why Deployments Break for Stateful Workloads

Imagine running a PostgreSQL cluster with a Deployment. Three Pods start — all with random names like postgres-7d9f4-xkp2m. Do they all mount the same PVC? No — a single PVC with the ReadWriteOnce (RWO) access mode can only be attached to one node. Separate PVCs? Fine, but which Pod owns which volume? Pod names change on restart — the primary could come back attached to a replica's volume while still believing it is the primary. And other services that connect to "the primary by DNS name" have no stable address to use.

StatefulSets solve all four of these problems with four guarantees that Deployments don't provide:

Stable, unique Pod names

postgres-0, postgres-1, postgres-2 — ordinal index, always. After restart, postgres-0 is still postgres-0.

Stable network identity

Each Pod gets a predictable DNS name: postgres-0.postgres.payments.svc. Clients always know how to reach the primary.

Stable, dedicated storage

Each Pod gets its own PVC via a volumeClaimTemplate. postgres-0's volume follows postgres-0 across restarts and rescheduling.

Ordered, graceful operations

Scale up proceeds in order: postgres-0 must be Ready before postgres-1 starts. Scale down deletes the highest ordinal first. Rolling updates proceed one Pod at a time, from highest ordinal to lowest.
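
To make the stable-identity guarantee concrete, here is a sketch of how an application might pin its write connection to the primary's DNS name. The ConfigMap name and keys are hypothetical — they are not part of this lesson's manifests:

```yaml
# Sketch: an app pins its write connection to the primary's stable DNS name.
apiVersion: v1
kind: ConfigMap
metadata:
  name: payments-db-config        # hypothetical name
  namespace: payments
data:
  # postgres-0 keeps this name across restarts and rescheduling,
  # so the app never needs to rediscover the primary.
  DB_WRITE_HOST: postgres-0.postgres.payments.svc.cluster.local
  DB_READ_HOST: postgres.payments.svc.cluster.local   # resolves to all Pod IPs; client-side LB
  DB_PORT: "5432"
```

This is exactly what a Deployment cannot offer: there is no Deployment Pod whose DNS name is guaranteed to stay the same across restarts.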

Headless Services: The DNS Foundation

StatefulSets require a headless Service — a Service with clusterIP: None. Unlike a normal Service that load-balances across Pods, a headless Service tells CoreDNS to return the IP addresses of all matching Pods directly. This is what gives each StatefulSet Pod its own DNS name.

apiVersion: v1
kind: Service
metadata:
  name: postgres                     # This name becomes part of every Pod's DNS record
  namespace: payments
  labels:
    app: postgres
spec:
  clusterIP: None                    # Headless: no virtual IP — CoreDNS returns Pod IPs directly
  selector:
    app: postgres
  ports:
    - port: 5432
      name: postgres
  # With clusterIP: None, DNS for this service works like this:
  # postgres.payments.svc.cluster.local        → returns all Pod IPs (for clients that do their own LB)
  # postgres-0.postgres.payments.svc           → returns postgres-0's IP only
  # postgres-1.postgres.payments.svc           → returns postgres-1's IP only
  # This is how the primary (postgres-0) gets a stable, routable DNS name

Writing a StatefulSet

The scenario: You need to run a PostgreSQL primary with two read replicas. The primary (postgres-0) must always have its own dedicated PVC. Replicas (postgres-1, postgres-2) must start only after the primary is ready. App services connect to the primary by name.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: payments
spec:
  serviceName: postgres              # REQUIRED: name of the headless Service above
                                     # This is what enables per-Pod DNS records
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  updateStrategy:
    type: RollingUpdate              # RollingUpdate: update Pods one at a time, highest ordinal first
    rollingUpdate:
      partition: 0                   # Update all Pods (partition=0). Set to 2 to update only pod-2,
                                     # leaving pod-0 and pod-1 at the old version (canary pattern)
  podManagementPolicy: OrderedReady  # OrderedReady: wait for each Pod to be Ready before starting next
                                     # Parallel: start all Pods simultaneously (for non-dependent replicas)
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60   # Give postgres time to finish transactions on shutdown
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: payments
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: POSTGRES_PASSWORD
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
              # IMPORTANT: the pgdata subdir avoids the lost+found problem.
              # PostgreSQL won't initialise if PGDATA contains unexpected files, and
              # formatting a volume as ext4 creates lost+found at its root — so put PGDATA in a subdirectory
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data   # Mount the PVC here
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3

  volumeClaimTemplates:              # THE key StatefulSet feature: per-Pod PVC
    - metadata:
        name: postgres-data          # PVC name prefix — becomes postgres-data-postgres-0, etc.
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 50Gi
        # Kubernetes creates: postgres-data-postgres-0, postgres-data-postgres-1, postgres-data-postgres-2
        # Each Pod mounts ONLY its own PVC — data isolation is guaranteed by Kubernetes

$ kubectl apply -f postgres-statefulset.yaml
service/postgres created
statefulset.apps/postgres created

# Watch ordered startup: 0 → 1 → 2 (OrderedReady policy)
$ kubectl get pods -n payments -w
NAME         READY   STATUS              RESTARTS
postgres-0   0/1     ContainerCreating
postgres-0   0/1     Running
postgres-0   1/1     Running             ← postgres-0 Ready — NOW postgres-1 starts
postgres-1   0/1     ContainerCreating
postgres-1   1/1     Running             ← postgres-1 Ready — NOW postgres-2 starts
postgres-2   0/1     ContainerCreating
postgres-2   1/1     Running

$ kubectl get pvc -n payments
NAME                     STATUS   VOLUME        CAPACITY   STORAGECLASS
postgres-data-postgres-0 Bound    pvc-a1b2c3    50Gi       gp3-encrypted
postgres-data-postgres-1 Bound    pvc-d4e5f6    50Gi       gp3-encrypted
postgres-data-postgres-2 Bound    pvc-g7h8i9    50Gi       gp3-encrypted

# Verify per-Pod DNS records from within the cluster
$ kubectl run dns-test --image=busybox --restart=Never --rm -it -n payments -- \
  nslookup postgres-0.postgres.payments.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name:      postgres-0.postgres.payments.svc.cluster.local
Address 1: 192.168.2.15   ← postgres-0's Pod IP ✓

What just happened?

volumeClaimTemplates creates one PVC per Pod — Kubernetes automatically creates postgres-data-postgres-0, postgres-data-postgres-1, and postgres-data-postgres-2. Each PVC is bound to exactly one Pod — the naming convention is {volumeClaimTemplate.name}-{statefulset.name}-{ordinal}. If postgres-1 is rescheduled to a different node, it re-attaches to postgres-data-postgres-1. Its data follows it.

OrderedReady ensures safe cluster formation — postgres-0 must be Ready before postgres-1 starts. In a real PostgreSQL HA setup, postgres-0 is the primary. Replicas need to connect to an already-running primary during their startup (streaming replication handshake). OrderedReady enforces this sequence without any application-level coordination.
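
OrderedReady sequences Pod startup, but it does not by itself verify that the primary is accepting connections when a replica's main container begins its replication handshake. A common belt-and-braces pattern, sketched here under the assumption that the role is derived from the Pod ordinal, is an initContainer that blocks until the primary answers:

```yaml
# Sketch only — not part of the manifest above. Add under the Pod template's spec.
initContainers:
  - name: wait-for-primary
    image: postgres:15
    command:
      - sh
      - -c
      - |
        # The Pod hostname ends in the ordinal (postgres-0, postgres-1, ...).
        # The primary (ordinal 0) skips the check entirely.
        [ "${HOSTNAME##*-}" = "0" ] && exit 0
        # Replicas block until the primary's stable DNS name answers.
        until pg_isready -h postgres-0.postgres.payments.svc -p 5432; do
          echo "waiting for primary..."
          sleep 2
        done
```

Because the initContainer addresses the primary by its stable DNS name, the check survives primary restarts and rescheduling without any discovery logic.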

The PGDATA subdirectory trick — When a new EBS volume is formatted as ext4 (which the storage driver does automatically on first use), the filesystem creates a lost+found directory at its root. PostgreSQL refuses to initialise in a non-empty directory. Setting PGDATA=/var/lib/postgresql/data/pgdata (a subdirectory of the mount point) sidesteps this — lost+found lives in the parent directory, and PGDATA initialises cleanly in the child.
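
An alternative to overriding PGDATA, sketched below, is a subPath mount: the kubelet creates the named subdirectory on the volume and bind-mounts only that into the container, so lost+found at the volume root is never visible to PostgreSQL. Either approach works; pick one, not both:

```yaml
# Sketch: subPath alternative to overriding PGDATA (replaces the volumeMounts above).
volumeMounts:
  - name: postgres-data
    mountPath: /var/lib/postgresql/data
    subPath: pgdata      # container sees only the child dir; lost+found stays at the volume root
```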

Scaling, Updates, and the Partition Field

StatefulSet scaling and updates are ordered — always highest ordinal first for scale-down and updates, lowest ordinal first for scale-up. The partition field in the update strategy enables canary-style rollouts: only Pods with ordinal ≥ partition are updated.

# Scale up: postgres-3 starts after postgres-2 is Ready
kubectl scale statefulset postgres --replicas=4 -n payments

# Scale down: postgres-3 deleted first, then postgres-2, etc.
# The primary (postgres-0) is only ever removed if you scale to zero
kubectl scale statefulset postgres --replicas=2 -n payments

# Rolling update: update image version
# With partition=0, updates all Pods: highest ordinal (postgres-2) first
kubectl set image statefulset/postgres postgres=postgres:16 -n payments

# Watch ordered update (highest ordinal first):
kubectl rollout status statefulset/postgres -n payments
# Waiting for 3 pods to be ready...
# waiting for statefulset rolling update to complete 2 pods at revision postgres-6d9f4...
# statefulset rolling update complete 3 pods at revision postgres-6d9f4...

# Canary pattern: update only postgres-2 first, test, then update the rest
kubectl patch statefulset postgres -n payments \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset/postgres postgres=postgres:16 -n payments
# Only postgres-2 updates — postgres-0 and postgres-1 remain on postgres:15
# Test postgres-2. If healthy:
kubectl patch statefulset postgres -n payments \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
# Now all Pods update to postgres:16

# Scale down to a single Pod and watch the ordered teardown:
$ kubectl scale statefulset postgres --replicas=1 -n payments

$ kubectl get pods -n payments -w
NAME         READY   STATUS
postgres-2   1/1     Running
postgres-1   1/1     Running
postgres-0   1/1     Running
postgres-2   1/1     Terminating   ← highest ordinal deleted first ✓
postgres-2   0/1     Terminating
postgres-1   1/1     Terminating   ← then postgres-1
postgres-1   0/1     Terminating
# postgres-0 (primary) unaffected ✓

$ kubectl get pvc -n payments
NAME                     STATUS   VOLUME       CAPACITY
postgres-data-postgres-0 Bound    pvc-a1b2c3   50Gi   ← PVCs persist after Pod deletion
postgres-data-postgres-1 Bound    pvc-d4e5f6   50Gi   ← data is NOT deleted with scale-down
postgres-data-postgres-2 Bound    pvc-g7h8i9   50Gi   ← must delete PVCs manually if unwanted

What just happened?

PVCs are NOT deleted when scaling down — This is intentional and important. Deleting a Pod does not delete its data. If you scale from 3 to 2 replicas, postgres-data-postgres-2 persists. When you scale back up, postgres-2 re-attaches to its original volume with all data intact. To actually free the storage, delete the PVCs manually after scaling down.
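
If you do want Kubernetes to clean up PVCs automatically, recent versions support an opt-in retention policy on the StatefulSet (beta since Kubernetes 1.27). The sketch below shows both fields; the whenScaled value shown is the default, matching the retain-everything behaviour described above:

```yaml
# Sketch: opt-in PVC cleanup (requires a cluster where
# persistentVolumeClaimRetentionPolicy is available).
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # delete PVCs when the StatefulSet itself is deleted
    whenScaled: Retain    # keep PVCs on scale-down (the default, shown explicitly)
```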

The partition canary pattern for databases — Running a new PostgreSQL minor version on postgres-2 (a replica) first lets you test replication compatibility, performance, and extension compatibility before updating the primary. If postgres-2 at the new version replicates correctly from postgres-0 at the old version for 24 hours, you have strong confidence the update is safe. Then update the primary.

StatefulSet vs Deployment: When to Use Each

Pod identity
Deployment: random names (app-7d9f4-xkp); Pods are interchangeable.
StatefulSet: stable ordinal names (app-0, app-1); each Pod is unique.

Storage
Deployment: shared PVC or ephemeral storage; all Pods see the same volume.
StatefulSet: per-Pod PVC via volumeClaimTemplates; data isolation.

DNS
Deployment: a single Service DNS name, load-balanced across all Pods.
StatefulSet: per-Pod DNS via a headless Service; direct addressing.

Scaling
Deployment: simultaneous; add or remove any Pod.
StatefulSet: ordered; highest ordinal first on scale-down.

Use for
Deployment: web servers, APIs, workers — anything stateless.
StatefulSet: PostgreSQL, MySQL, Redis Sentinel, Kafka, ZooKeeper, Elasticsearch — stateful systems that need stable identity.

Teacher's Note: Should you run databases in Kubernetes?

This is one of the most debated questions in the Kubernetes community. The short answer: StatefulSets are capable enough, but managed cloud databases (RDS, Cloud SQL, Aurora) often make more sense for production unless you have a strong operational reason to self-manage.

When Kubernetes-hosted databases make sense: you need database types not available as managed services (a specific Postgres extension, a custom Redis fork), you're running on-premises with no cloud option, your team has deep database operations expertise and owns the full stack, or you're using a Kubernetes operator (CloudNativePG, Percona Operator) that automates HA, backups, and failover — turning the StatefulSet complexity into a higher-level abstraction.

When to use managed services instead: your team is not a database operations team, you need point-in-time recovery without building it yourself, you want automated minor version upgrades and security patching, or your SLA requires multi-AZ failover in under 30 seconds. RDS can do all of these. A self-managed StatefulSet cannot — unless you build that capability on top of it.

Practice Questions

1. Which StatefulSet field automatically creates a dedicated PVC for each Pod, named after the Pod's ordinal index?



2. What setting on a Service makes it a headless Service — enabling per-Pod DNS records for StatefulSet Pods?



3. A StatefulSet named postgres is deployed in namespace payments with headless Service also named postgres. What is the full DNS name for the first Pod?



Quiz

1. You scale a StatefulSet from 3 replicas to 1. What happens to the PVCs that were bound to Pods 1 and 2?


2. You want to test a new database image on only the last replica of a 3-Pod StatefulSet before rolling it out to the primary. How do you use the partition field?


3. Why is podManagementPolicy: OrderedReady important for a PostgreSQL primary-replica StatefulSet?


Up Next · Lesson 31

Kubernetes Networking Deep Dive

Section III begins. How does traffic actually flow between Pods on different nodes? This lesson covers the flat network model, CNI plugins, kube-proxy iptables rules, and how to debug connectivity at every layer.