Kubernetes Course
Backup and Restore
etcd is the brain of your cluster — every resource definition, Secret, ConfigMap, and RBAC policy lives there. If etcd is lost without a backup, the cluster cannot be recovered. This lesson covers etcd snapshots with etcdctl, application-level backup and restore with Velero, and the disaster recovery runbooks for common failure scenarios.
What Needs to Be Backed Up
A complete Kubernetes disaster recovery strategy covers two distinct layers — cluster state and application data. Both need independent backup strategies because they fail independently and recover differently.
Cluster State — etcd
All Kubernetes objects: Deployments, Services, ConfigMaps, Secrets, RBAC, CRDs, Namespaces. Backed up with etcdctl or managed service snapshots. Restored to recover the control plane after catastrophic failure.
Application Data — PersistentVolumes
Database files, uploaded content, stateful application data stored on PVCs. etcd only stores the PVC/PV metadata — not the actual data. Backed up with Velero or cloud-native snapshots.
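To make the metadata/data split concrete, here is a hypothetical PersistentVolume object as etcd stores it. Every value below is illustrative (name, size, volume ID); the only link to the real bytes is the volumeHandle, which points at an EBS volume etcd never reads:

```yaml
# Hypothetical PV object -- this YAML is everything etcd holds about the volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-data-pv                  # illustrative name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0abc1234example     # pointer to the EBS volume -- the actual data lives there
```

Restoring etcd brings back this object, but if the EBS volume itself is gone, so is the data. That is why the two layers need separate backup strategies.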
etcd Backup with etcdctl
The scenario: You are running a self-managed cluster. You need to take a regular etcd snapshot and store it off-cluster so that if the control plane is destroyed, you can restore the cluster state. On managed clusters (EKS, GKE, AKS), the control plane etcd is managed and backed up by the cloud provider — this section applies to self-managed kubeadm clusters.
# Take a snapshot -- run on the control plane node
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is valid
ETCDCTL_API=3 etcdctl snapshot status \
  /backup/etcd-snapshot-20250310-140000.db \
  --write-out=table
# Output:
# +----------+----------+------------+------------+
# |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | abc12345 |   184923 |       3847 |      28 MB |
# +----------+----------+------------+------------+

# Copy snapshot off-cluster immediately
aws s3 cp /backup/etcd-snapshot-20250310-140000.db \
  s3://company-cluster-backups/etcd/production/

# Automate with a CronJob
kubectl apply -f etcd-backup-cronjob.yaml
$ ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-20250310-140000.db \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
{"level":"info","msg":"created checkpoint"}
Snapshot saved at /backup/etcd-snapshot-20250310-140000.db

$ ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250310-140000.db --write-out=table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| abc12345 |   184923 |       3847 |      28 MB |
+----------+----------+------------+------------+

$ aws s3 cp /backup/etcd-snapshot-20250310-140000.db \
    s3://company-cluster-backups/etcd/production/
upload: /backup/etcd-snapshot-20250310-140000.db to s3://company-cluster-backups/etcd/production/etcd-snapshot-20250310-140000.db

# etcd-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"              # Every 6 hours
  concurrencyPolicy: Forbid            # Don't run overlapping backups
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true            # Need access to etcd on 127.0.0.1
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""   # Run ONLY on control plane
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: etcd-backup
              # NOTE: the image must provide both etcdctl and the aws CLI;
              # bitnami/etcd alone does not ship the aws CLI
              image: bitnami/etcd:3.5.9
              command:
                - /bin/sh
                - -c
                - |
                  SNAPSHOT=/tmp/etcd-$(date +%Y%m%d-%H%M%S).db
                  etcdctl snapshot save $SNAPSHOT \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/etcd/ca.crt \
                    --cert=/etc/etcd/server.crt \
                    --key=/etc/etcd/server.key
                  aws s3 cp $SNAPSHOT \
                    s3://company-cluster-backups/etcd/production/$(basename $SNAPSHOT)
                  echo "Backup complete: $(basename $SNAPSHOT)"
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/etcd
                  readOnly: true
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure
          serviceAccountName: etcd-backup-sa
$ kubectl apply -f etcd-backup-cronjob.yaml
cronjob.batch/etcd-backup created
$ kubectl get cronjob etcd-backup -n kube-system
NAME          SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
etcd-backup   0 */6 * * *   False     0        <none>          10s
# After first run (6 hours later or trigger manually):
$ kubectl create job etcd-backup-manual --from=cronjob/etcd-backup -n kube-system
job.batch/etcd-backup-manual created
$ kubectl logs job/etcd-backup-manual -n kube-system
{"level":"info","msg":"created checkpoint"}
Snapshot saved at /tmp/etcd-20250310-140000.db
upload: /tmp/etcd-20250310-140000.db to s3://company-cluster-backups/etcd/production/etcd-20250310-140000.db
Backup complete: etcd-20250310-140000.db ✓
$ aws s3 ls s3://company-cluster-backups/etcd/production/
2025-03-10 08:00:05   29360128 etcd-20250310-080000.db
2025-03-10 14:00:03   29491200 etcd-20250310-140000.db

etcd Restore
The scenario: The control plane node was destroyed. You have an etcd snapshot from 6 hours ago. You have rebuilt the control plane node and need to restore cluster state.
# Step 1: Download the snapshot from S3
aws s3 cp s3://company-cluster-backups/etcd/production/etcd-20250310-140000.db \
  /tmp/etcd-restore.db

# Step 2: Stop the API server and etcd static Pods
# (move their manifests out of /etc/kubernetes/manifests temporarily)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for API server and etcd to stop (~30 seconds)

# Step 3: Restore the snapshot to a new data directory
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-restore.db \
  --data-dir=/var/lib/etcd-restored \
  --name=master-node \
  --initial-cluster=master-node=https://MASTER_IP:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://MASTER_IP:2380

# Step 4: Update the etcd manifest to use the restored data dir
# In /tmp/etcd.yaml, change:
#   --data-dir=/var/lib/etcd  ->  --data-dir=/var/lib/etcd-restored
# Also update the volume hostPath accordingly

# Step 5: Restore the static Pod manifests
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# kubelet picks them up and starts etcd + API server within 60 seconds

# Step 6: Verify the cluster is restored
kubectl get nodes
kubectl get pods -A
# Cluster state reflects the snapshot point in time
$ aws s3 cp s3://company-cluster-backups/etcd/production/etcd-20250310-140000.db /tmp/etcd-restore.db
download: s3://...etcd-20250310-140000.db to /tmp/etcd-restore.db

# Stopping control plane static Pods:
$ mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
$ mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait ~30s for processes to stop...

$ ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-restore.db \
    --data-dir=/var/lib/etcd-restored \
    --name=master-node \
    --initial-cluster=master-node=https://10.0.1.10:2380
{"level":"info","msg":"restored snapshot","path":"/tmp/etcd-restore.db","wal-dir":"/var/lib/etcd-restored/member/wal"}

# After updating etcd.yaml to use new data-dir and restoring manifests:
$ kubectl get nodes
NAME            STATUS   VERSION
master-node     Ready    v1.30.2
worker-node-1   Ready    v1.30.2
worker-node-2   Ready    v1.30.2   ← cluster state restored from snapshot ✓
$ kubectl get pods -n payments
NAME                      READY   STATUS
payment-api-7d9f4-xkp2m   1/1     Running   ← workloads restored ✓

What just happened?
etcd restore does not modify running etcd — The restore command writes a new data directory from the snapshot. You must stop etcd before restoring so it doesn't conflict with running state. The restore creates a fresh etcd cluster member — it does not "merge" with existing data.
The restored state is the snapshot point in time — Any changes made after the snapshot was taken are lost. This is why frequent snapshots matter — a 6-hour snapshot cadence means at most 6 hours of cluster state is lost. On managed clusters (EKS), AWS takes etcd snapshots automatically and restores are handled by AWS Support — you would rebuild worker nodes and re-sync workloads from Git instead.
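A 6-hour cadence also means snapshots pile up quickly, so a retention step usually sits next to the backup job. Below is a minimal sketch in plain shell, assuming the etcd-snapshot-YYYYMMDD-HHMMSS.db naming used above; the directory and keep-count are illustrative, and for the S3 copies an S3 lifecycle rule would do the same job:

```shell
#!/bin/sh
# Keep only the newest $2 snapshots in directory $1; delete the rest.
# Timestamped names sort lexically, so newest-first is just a reverse sort.
prune_snapshots() {
  dir="$1"
  keep="$2"
  ls "$dir"/etcd-snapshot-*.db 2>/dev/null | sort -r | tail -n +"$((keep + 1))" | \
    while read -r old; do
      rm -f -- "$old"
    done
}

# Example: keep the two newest of three snapshots
mkdir -p /tmp/snap-demo
touch /tmp/snap-demo/etcd-snapshot-20250308-020000.db \
      /tmp/snap-demo/etcd-snapshot-20250309-020000.db \
      /tmp/snap-demo/etcd-snapshot-20250310-020000.db
prune_snapshots /tmp/snap-demo 2
ls /tmp/snap-demo    # only the 20250309 and 20250310 snapshots remain
```

The lexical-sort trick is the reason the snapshot filenames embed a zero-padded timestamp: no date parsing is needed to find the oldest files.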
Velero: Application-Level Backup and Restore
Velero is the standard tool for application-level backup in Kubernetes. Unlike etcd snapshots which capture everything, Velero lets you back up specific namespaces, take volume snapshots, and restore individual applications — even migrating them between clusters. It stores backup data in object storage (S3, GCS).
# Install Velero with an AWS S3 backend
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket company-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./velero-credentials   # AWS credentials file

# Create an immediate backup of the payments namespace
velero backup create payments-backup-manual \
  --include-namespaces payments \
  --wait
# Backs up: all Kubernetes objects in the namespace + EBS volume snapshots for all PVCs

# Create a scheduled backup (nightly at 1am)
velero schedule create payments-nightly \
  --schedule="0 1 * * *" \
  --include-namespaces payments \
  --ttl 720h   # Retain for 30 days

# List all backups
velero backup get
# NAME                     STATUS      ERRORS   WARNINGS   CREATED
# payments-backup-manual   Completed   0        0          2025-03-10 14:00:01

# Describe a backup to see what was included
velero backup describe payments-backup-manual --details
$ velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.9.0 \
    --bucket company-velero-backups \
    --backup-location-config region=us-east-1 \
    --snapshot-location-config region=us-east-1 \
    --secret-file ./velero-credentials
CustomResourceDefinition/backups.velero.io: created
CustomResourceDefinition/restores.velero.io: created
CustomResourceDefinition/schedules.velero.io: created
Deployment/velero: created

$ kubectl get pods -n velero
NAME                 READY   STATUS
velero-6d9f4-xkp2m   1/1     Running ✓

$ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX            PHASE
default   aws        company-velero-backups   Available ✓

$ velero backup create payments-backup-manual --include-namespaces payments --wait
Backup request "payments-backup-manual" submitted.
Backup completed with status: Completed ✓
# Restore from a backup -- full namespace restore
velero restore create payments-restore \
  --from-backup payments-backup-manual \
  --wait

# Restore to a different namespace (useful for DR testing)
velero restore create payments-dr-test \
  --from-backup payments-backup-manual \
  --namespace-mappings payments:payments-restored \
  --wait

# Restore only specific resources (e.g. just the ConfigMaps and Secrets)
velero restore create payments-configs-only \
  --from-backup payments-backup-manual \
  --include-resources configmaps,secrets \
  --wait

# Check restore status
velero restore describe payments-restore --details
# Phase: Completed
# Warnings: 0  Errors: 0
kubectl get pods -n payments   # Pods are running from the restored state
$ velero backup create payments-backup-manual \
    --include-namespaces payments --wait
Backup request "payments-backup-manual" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting.
...........
Backup completed with status: Completed.

$ velero backup get
NAME                      STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES
payments-backup-manual    Completed   0        0          2025-03-10 14:00:01 +0000 UTC   29d
payments-nightly-abc123   Completed   0        0          2025-03-10 01:00:05 +0000 UTC   29d

$ velero restore create payments-restore \
    --from-backup payments-backup-manual --wait
Restore request "payments-restore" submitted successfully.
Waiting for restore to complete...
Restore completed with status: Completed
Warnings: 0  Errors: 0

$ kubectl get pods -n payments
NAME                      READY   STATUS    RESTARTS
payment-api-7d9f4-xkp2m   1/1     Running   0   ← restored from backup ✓
payment-api-7d9f4-rvqn2   1/1     Running   0
What just happened?
Velero backs up objects and volumes independently — For Kubernetes objects (Deployments, Services, ConfigMaps, Secrets), Velero queries the API server and serialises them to JSON in S3. For PersistentVolumes, it triggers a cloud provider volume snapshot (EBS snapshot for AWS). The restore process re-creates the objects and restores the volumes from snapshots.
Namespace mapping for DR testing — The --namespace-mappings flag lets you restore a production backup into a test namespace without touching production. This is how you validate that your backups actually work — restore to payments-restored, run smoke tests, then tear it down. A backup you have never tested restoring is not a backup.
Velero vs etcd snapshots — etcd snapshots recover the entire cluster after control plane failure. Velero recovers specific applications and their data after accidental deletion, namespace corruption, or migration. They are complementary — you need both.
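The velero schedule create command used earlier is shorthand for creating a Schedule custom resource, which can equally live in your GitOps repo. A sketch of the equivalent object, assuming the Velero v1 API and the nightly payments backup from above (values are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: payments-nightly
  namespace: velero          # Schedules live in Velero's own namespace
spec:
  schedule: "0 1 * * *"      # Same cron syntax as the --schedule flag
  template:                  # The spec of each Backup this schedule generates
    includedNamespaces:
      - payments
    ttl: 720h0m0s            # Retain for 30 days, same as --ttl 720h
```

Checking this into Git means the backup policy itself survives cluster loss: rebuilding from the GitOps repo re-applies the Schedule along with everything else.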
Disaster Recovery Scenarios
| Scenario | Recovery approach | RTO |
|---|---|---|
| Accidental kubectl delete on a namespace | Velero restore from last backup. Objects re-created, volumes re-attached. | 5–15 min |
| Single node failure | Kubernetes reschedules Pods automatically. Cluster Autoscaler replaces the node. No manual intervention needed with PDBs set. | 2–5 min |
| Control plane destruction (self-managed) | Rebuild control plane node, restore etcd from snapshot, uncordon worker nodes. | 30–60 min |
| Full cluster loss (managed) | Rebuild cluster with IaC (Terraform/eksctl). Restore workloads with Velero. Re-sync remaining objects from GitOps repo. | 1–3 hrs |
| PVC data corruption | Stop the workload. Restore PVC from Velero volume snapshot. Restart the workload. | 15–30 min |
Teacher's Note: The backup you never restore is not a backup
Teams set up Velero, see green backups running nightly, and feel safe. Then a disaster happens and the restore fails — wrong IAM permissions, Velero version mismatch, missing CSI snapshot class, or volume snapshot not available in the target region. The backup ran. The restore does not.
Schedule a quarterly DR test: delete the payments namespace in a staging cluster, restore from the latest Velero backup, run smoke tests. Time the recovery. Fix anything that breaks. A 15-minute quarterly exercise reveals restore issues before they become production incidents. Document the result — compliance auditors love DR test evidence.
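The drill is only useful if you measure it. A small plain-shell helper for timing each recovery step against an RTO budget; the run_timed name and the limits are illustrative, not part of Velero or kubectl:

```shell
#!/bin/sh
# Run a command, report how long it took, and fail if it blew the RTO budget.
run_timed() {
  label="$1"
  limit_s="$2"
  shift 2
  start=$(date +%s)
  "$@" || { echo "FAIL: $label: command failed"; return 1; }
  elapsed=$(( $(date +%s) - start ))
  echo "$label took ${elapsed}s (budget ${limit_s}s)"
  if [ "$elapsed" -gt "$limit_s" ]; then
    echo "FAIL: $label exceeded RTO budget"
    return 1
  fi
}

# In a real drill the steps would be the restore itself, e.g. (hypothetical backup name):
#   run_timed "velero-restore" 900 velero restore create dr-test \
#     --from-backup payments-nightly-latest --wait
run_timed "demo-step" 5 sleep 1
```

Wrapping each runbook step this way turns "the restore felt fast enough" into a recorded number you can hand to auditors and compare across quarters.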
For managed clusters (EKS): AWS handles etcd. Your recovery focus is workloads and data. Keep your infrastructure-as-code (eksctl / Terraform) and GitOps repo clean enough that rebuilding the cluster from scratch takes under 30 minutes — then Velero handles the data layer. A cluster you can rebuild from code is more resilient than a cluster with backups you've never tested.
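What "rebuildable from code" looks like in practice: the cluster definition itself is a file. A minimal eksctl sketch, with every value illustrative:

```yaml
# cluster.yaml -- illustrative; `eksctl create cluster -f cluster.yaml` rebuilds it
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
managedNodeGroups:
  - name: workers
    instanceType: m5.large
    desiredCapacity: 3
    minSize: 3
    maxSize: 6
```

With this in Git alongside your manifests, the full-cluster-loss runbook is: create the cluster from the file, let GitOps re-sync the objects, then run Velero restores for the data layer.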
Practice Questions
1. Which command creates a point-in-time snapshot of etcd — the backing store for all Kubernetes cluster state?
2. Which Velero restore flag lets you restore a backup into a different namespace — enabling DR testing without affecting production?
3. What is the key difference between an etcd snapshot and a Velero backup — when would you use each?
Quiz
1. Your etcd snapshot is 6 hours old. Your PostgreSQL StatefulSet's PVC contains the last 6 months of transaction data. Is the database data included in the etcd snapshot?
2. To restore a self-managed cluster from an etcd snapshot, what must you do before running etcdctl snapshot restore?
3. Your team has Velero running nightly backups. What should you do to ensure these backups actually work when needed?
Up Next · Lesson 58
Kubernetes Troubleshooting
A systematic approach to diagnosing the most common Kubernetes failures: Pods that won't start, services that can't be reached, nodes that go NotReady, and deployments that roll back unexpectedly. The diagnostic commands and mental model every on-call engineer needs.