Kubernetes Course
Backup and Restore
etcd is the brain of your cluster — every resource definition, Secret, ConfigMap, and RBAC policy lives there. If etcd is lost without a backup, the cluster cannot be recovered. This lesson covers etcd snapshots with etcdctl, application-level backup and restore with Velero, and the disaster recovery runbooks for common failure scenarios.
What Needs to Be Backed Up
A complete Kubernetes disaster recovery strategy covers two distinct layers — cluster state and application data. Both need independent backup strategies because they fail independently and recover differently.
Cluster State — etcd
All Kubernetes objects: Deployments, Services, ConfigMaps, Secrets, RBAC, CRDs, Namespaces. Backed up with etcdctl or managed service snapshots. Restored to recover the control plane after catastrophic failure.
Application Data — PersistentVolumes
Database files, uploaded content, stateful application data stored on PVCs. etcd only stores the PVC/PV metadata — not the actual data. Backed up with Velero or cloud-native snapshots.
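To make the metadata/data split concrete, here is a hypothetical PersistentVolume object as etcd stores it. Every value below is illustrative (name, size, volume ID); the only link to the real bytes is the volumeHandle, which points at an EBS volume etcd never reads:

```yaml
# Hypothetical PV object -- this YAML is everything etcd holds about the volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-data-pv                  # illustrative name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0abc1234example     # pointer to the EBS volume -- the actual data lives there
```

Restoring etcd brings back this object, but if the EBS volume itself is gone, so is the data. That is why the two layers need separate backup strategies.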
etcd Backup with etcdctl
The scenario: You are running a self-managed cluster. You need to take a regular etcd snapshot and store it off-cluster so that if the control plane is destroyed, you can restore the cluster state. On managed clusters (EKS, GKE, AKS), the control plane etcd is managed and backed up by the cloud provider — this section applies to self-managed kubeadm clusters.
# Take a snapshot -- run on the control plane node
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is valid
ETCDCTL_API=3 etcdctl snapshot status \
  /backup/etcd-snapshot-20250310-140000.db \
  --write-out=table
# Output:
# +----------+----------+------------+------------+
# |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | abc12345 |   184923 |       3847 |      28 MB |
# +----------+----------+------------+------------+

# Copy snapshot off-cluster immediately
aws s3 cp /backup/etcd-snapshot-20250310-140000.db \
  s3://company-cluster-backups/etcd/production/

# Automate with a CronJob
kubectl apply -f etcd-backup-cronjob.yaml
$ ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-20250310-140000.db \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
{"level":"info","msg":"created checkpoint"}
Snapshot saved at /backup/etcd-snapshot-20250310-140000.db

$ ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250310-140000.db --write-out=table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| abc12345 |   184923 |       3847 |      28 MB |
+----------+----------+------------+------------+

$ aws s3 cp /backup/etcd-snapshot-20250310-140000.db \
    s3://company-cluster-backups/etcd/production/
upload: /backup/etcd-snapshot-20250310-140000.db to s3://company-cluster-backups/etcd/production/etcd-snapshot-20250310-140000.db

# etcd-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"              # Every 6 hours
  concurrencyPolicy: Forbid            # Don't run overlapping backups
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true            # Need access to etcd on 127.0.0.1
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""   # Run ONLY on control plane
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: etcd-backup
              # NOTE: the image must provide both etcdctl and the aws CLI;
              # bitnami/etcd alone does not ship the aws CLI
              image: bitnami/etcd:3.5.9
              command:
                - /bin/sh
                - -c
                - |
                  SNAPSHOT=/tmp/etcd-$(date +%Y%m%d-%H%M%S).db
                  etcdctl snapshot save $SNAPSHOT \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/etcd/ca.crt \
                    --cert=/etc/etcd/server.crt \
                    --key=/etc/etcd/server.key
                  aws s3 cp $SNAPSHOT \
                    s3://company-cluster-backups/etcd/production/$(basename $SNAPSHOT)
                  echo "Backup complete: $(basename $SNAPSHOT)"
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/etcd
                  readOnly: true
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure
          serviceAccountName: etcd-backup-sa
$ kubectl apply -f etcd-backup-cronjob.yaml
cronjob.batch/etcd-backup created
$ kubectl get cronjob etcd-backup -n kube-system
NAME          SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
etcd-backup   0 */6 * * *   False     0        <none>          10s
# After first run (6 hours later or trigger manually):
$ kubectl create job etcd-backup-manual --from=cronjob/etcd-backup -n kube-system
job.batch/etcd-backup-manual created
$ kubectl logs job/etcd-backup-manual -n kube-system
{"level":"info","msg":"created checkpoint"}
Snapshot saved at /tmp/etcd-20250310-140000.db
upload: /tmp/etcd-20250310-140000.db to s3://company-cluster-backups/etcd/production/etcd-20250310-140000.db
Backup complete: etcd-20250310-140000.db ✓
$ aws s3 ls s3://company-cluster-backups/etcd/production/
2025-03-10 08:00:05   29360128 etcd-20250310-080000.db
2025-03-10 14:00:03   29491200 etcd-20250310-140000.db

etcd Restore
The scenario: The control plane node was destroyed. You have an etcd snapshot from 6 hours ago. You have rebuilt the control plane node and need to restore cluster state.
# Step 1: Download the snapshot from S3
aws s3 cp s3://company-cluster-backups/etcd/production/etcd-20250310-140000.db \
  /tmp/etcd-restore.db

# Step 2: Stop the API server and etcd static Pods
# (move their manifests out of /etc/kubernetes/manifests temporarily)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for API server and etcd to stop (~30 seconds)

# Step 3: Restore the snapshot to a new data directory
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-restore.db \
  --data-dir=/var/lib/etcd-restored \
  --name=master-node \
  --initial-cluster=master-node=https://MASTER_IP:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://MASTER_IP:2380

# Step 4: Update the etcd manifest to use the restored data dir
# In /tmp/etcd.yaml, change:
#   --data-dir=/var/lib/etcd  ->  --data-dir=/var/lib/etcd-restored
# Also update the volume hostPath accordingly

# Step 5: Restore the static Pod manifests
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# kubelet picks them up and starts etcd + API server within 60 seconds

# Step 6: Verify the cluster is restored
kubectl get nodes
kubectl get pods -A
# Cluster state reflects the snapshot point in time
$ aws s3 cp s3://company-cluster-backups/etcd/production/etcd-20250310-140000.db /tmp/etcd-restore.db
download: s3://...etcd-20250310-140000.db to /tmp/etcd-restore.db

# Stopping control plane static Pods:
$ mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
$ mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait ~30s for processes to stop...

$ ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-restore.db \
    --data-dir=/var/lib/etcd-restored \
    --name=master-node \
    --initial-cluster=master-node=https://10.0.1.10:2380
{"level":"info","msg":"restored snapshot","path":"/tmp/etcd-restore.db","wal-dir":"/var/lib/etcd-restored/member/wal"}

# After updating etcd.yaml to use new data-dir and restoring manifests:
$ kubectl get nodes
NAME            STATUS   VERSION
master-node     Ready    v1.30.2
worker-node-1   Ready    v1.30.2
worker-node-2   Ready    v1.30.2   ← cluster state restored from snapshot ✓
$ kubectl get pods -n payments
NAME                      READY   STATUS
payment-api-7d9f4-xkp2m   1/1     Running   ← workloads restored ✓

What just happened?
etcd restore does not modify running etcd — The restore command writes a new data directory from the snapshot. You must stop etcd before restoring so it doesn't conflict with running state. The restore creates a fresh etcd cluster member — it does not "merge" with existing data.
The restored state is the snapshot point in time — Any changes made after the snapshot was taken are lost. This is why frequent snapshots matter — a 6-hour snapshot cadence means at most 6 hours of cluster state is lost. On managed clusters (EKS), AWS takes etcd snapshots automatically and restores are handled by AWS Support — you would rebuild worker nodes and re-sync workloads from Git instead.
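A 6-hour cadence also means snapshots pile up quickly, so a retention step usually sits next to the backup job. Below is a minimal sketch in plain shell, assuming the etcd-snapshot-YYYYMMDD-HHMMSS.db naming used above; the directory and keep-count are illustrative, and for the S3 copies an S3 lifecycle rule would do the same job:

```shell
#!/bin/sh
# Keep only the newest $2 snapshots in directory $1; delete the rest.
# Timestamped names sort lexically, so newest-first is just a reverse sort.
prune_snapshots() {
  dir="$1"
  keep="$2"
  ls "$dir"/etcd-snapshot-*.db 2>/dev/null | sort -r | tail -n +"$((keep + 1))" | \
    while read -r old; do
      rm -f -- "$old"
    done
}

# Example: keep the two newest of three snapshots
mkdir -p /tmp/snap-demo
touch /tmp/snap-demo/etcd-snapshot-20250308-020000.db \
      /tmp/snap-demo/etcd-snapshot-20250309-020000.db \
      /tmp/snap-demo/etcd-snapshot-20250310-020000.db
prune_snapshots /tmp/snap-demo 2
ls /tmp/snap-demo    # only the 20250309 and 20250310 snapshots remain
```

The lexical-sort trick is the reason the snapshot filenames embed a zero-padded timestamp: no date parsing is needed to find the oldest files.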
Velero: Application-Level Backup and Restore
Velero is the standard tool for application-level backup in Kubernetes. Unlike etcd snapshots which capture everything, Velero lets you back up specific namespaces, take volume snapshots, and restore individual applications — even migrating them between clusters. It stores backup data in object storage (S3, GCS).
# Install Velero with an AWS S3 backend
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket company-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./velero-credentials   # AWS credentials file

# Create an immediate backup of the payments namespace
velero backup create payments-backup-manual \
  --include-namespaces payments \
  --wait
# Backs up: all Kubernetes objects in the namespace + EBS volume snapshots for all PVCs

# Create a scheduled backup (nightly at 1am)
velero schedule create payments-nightly \
  --schedule="0 1 * * *" \
  --include-namespaces payments \
  --ttl 720h   # Retain for 30 days

# List all backups
velero backup get
# NAME                     STATUS      ERRORS   WARNINGS   CREATED
# payments-backup-manual   Completed   0        0          2025-03-10 14:00:01

# Describe a backup to see what was included
velero backup describe payments-backup-manual --details
$ velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.9.0 \
    --bucket company-velero-backups \
    --backup-location-config region=us-east-1 \
    --snapshot-location-config region=us-east-1 \
    --secret-file ./velero-credentials
CustomResourceDefinition/backups.velero.io: created
CustomResourceDefinition/restores.velero.io: created
CustomResourceDefinition/schedules.velero.io: created
Deployment/velero: created

$ kubectl get pods -n velero
NAME                 READY   STATUS
velero-6d9f4-xkp2m   1/1     Running ✓

$ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX            PHASE
default   aws        company-velero-backups   Available ✓

$ velero backup create payments-backup-manual --include-namespaces payments --wait
Backup request "payments-backup-manual" submitted.
Backup completed with status: Completed ✓
# Restore from a backup -- full namespace restore
velero restore create payments-restore \
  --from-backup payments-backup-manual \
  --wait

# Restore to a different namespace (useful for DR testing)
velero restore create payments-dr-test \
  --from-backup payments-backup-manual \
  --namespace-mappings payments:payments-restored \
  --wait

# Restore only specific resources (e.g. just the ConfigMaps and Secrets)
velero restore create payments-configs-only \
  --from-backup payments-backup-manual \
  --include-resources configmaps,secrets \
  --wait

# Check restore status
velero restore describe payments-restore --details
# Phase: Completed
# Warnings: 0  Errors: 0
kubectl get pods -n payments   # Pods are running from the restored state
$ velero backup create payments-backup-manual \
    --include-namespaces payments --wait
Backup request "payments-backup-manual" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting.
...........
Backup completed with status: Completed.

$ velero backup get
NAME                      STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES
payments-backup-manual    Completed   0        0          2025-03-10 14:00:01 +0000 UTC   29d
payments-nightly-abc123   Completed   0        0          2025-03-10 01:00:05 +0000 UTC   29d

$ velero restore create payments-restore \
    --from-backup payments-backup-manual --wait
Restore request "payments-restore" submitted successfully.
Waiting for restore to complete...
Restore completed with status: Completed
Warnings: 0  Errors: 0

$ kubectl get pods -n payments
NAME                      READY   STATUS    RESTARTS
payment-api-7d9f4-xkp2m   1/1     Running   0   ← restored from backup ✓
payment-api-7d9f4-rvqn2   1/1     Running   0
What just happened?
Velero backs up objects and volumes independently — For Kubernetes objects (Deployments, Services, ConfigMaps, Secrets), Velero queries the API server and serialises them to JSON in S3. For PersistentVolumes, it triggers a cloud provider volume snapshot (EBS snapshot for AWS). The restore process re-creates the objects and restores the volumes from snapshots.
Namespace mapping for DR testing — The --namespace-mappings flag lets you restore a production backup into a test namespace without touching production. This is how you validate that your backups actually work — restore to payments-restored, run smoke tests, then tear it down. A backup you have never tested restoring is not a backup.
Velero vs etcd snapshots — etcd snapshots recover the entire cluster after control plane failure. Velero recovers specific applications and their data after accidental deletion, namespace corruption, or migration. They are complementary — you need both.
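The velero schedule create command used earlier is shorthand for creating a Schedule custom resource, which can equally live in your GitOps repo. A sketch of the equivalent object, assuming the Velero v1 API and the nightly payments backup from above (values are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: payments-nightly
  namespace: velero          # Schedules live in Velero's own namespace
spec:
  schedule: "0 1 * * *"      # Same cron syntax as the --schedule flag
  template:                  # The spec of each Backup this schedule generates
    includedNamespaces:
      - payments
    ttl: 720h0m0s            # Retain for 30 days, same as --ttl 720h
```

Checking this into Git means the backup policy itself survives cluster loss: rebuilding from the GitOps repo re-applies the Schedule along with everything else.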
Disaster Recovery Scenarios
| Scenario | Recovery approach | RTO |
|---|---|---|
| Accidental kubectl delete on a namespace | Velero restore from last backup. Objects re-created, volumes re-attached. | 5–15 min |
| Single node failure | Kubernetes reschedules Pods automatically. Cluster Autoscaler replaces the node. No manual intervention needed with PDBs set. | 2–5 min |
| Control plane destruction (self-managed) | Rebuild control plane node, restore etcd from snapshot, uncordon worker nodes. | 30–60 min |
| Full cluster loss (managed) | Rebuild cluster with IaC (Terraform/eksctl). Restore workloads with Velero. Re-sync remaining objects from GitOps repo. | 1–3 hrs |
| PVC data corruption | Stop the workload. Restore PVC from Velero volume snapshot. Restart the workload. | 15–30 min |
Teacher's Note: The backup you never restore is not a backup
Teams set up Velero, see green backups running nightly, and feel safe. Then a disaster happens and the restore fails — wrong IAM permissions, Velero version mismatch, missing CSI snapshot class, or volume snapshot not available in the target region. The backup ran. The restore does not.
Schedule a quarterly DR test: delete the payments namespace in a staging cluster, restore from the latest Velero backup, run smoke tests. Time the recovery. Fix anything that breaks. A 15-minute quarterly exercise reveals restore issues before they become production incidents. Document the result — compliance auditors love DR test evidence.
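The drill is only useful if you measure it. A small plain-shell helper for timing each recovery step against an RTO budget; the run_timed name and the limits are illustrative, not part of Velero or kubectl:

```shell
#!/bin/sh
# Run a command, report how long it took, and fail if it blew the RTO budget.
run_timed() {
  label="$1"
  limit_s="$2"
  shift 2
  start=$(date +%s)
  "$@" || { echo "FAIL: $label: command failed"; return 1; }
  elapsed=$(( $(date +%s) - start ))
  echo "$label took ${elapsed}s (budget ${limit_s}s)"
  if [ "$elapsed" -gt "$limit_s" ]; then
    echo "FAIL: $label exceeded RTO budget"
    return 1
  fi
}

# In a real drill the steps would be the restore itself, e.g. (hypothetical backup name):
#   run_timed "velero-restore" 900 velero restore create dr-test \
#     --from-backup payments-nightly-latest --wait
run_timed "demo-step" 5 sleep 1
```

Wrapping each runbook step this way turns "the restore felt fast enough" into a recorded number you can hand to auditors and compare across quarters.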
For managed clusters (EKS): AWS handles etcd. Your recovery focus is workloads and data. Keep your infrastructure-as-code (eksctl / Terraform) and GitOps repo clean enough that rebuilding the cluster from scratch takes under 30 minutes — then Velero handles the data layer. A cluster you can rebuild from code is more resilient than a cluster with backups you've never tested.
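What "rebuildable from code" looks like in practice: the cluster definition itself is a file. A minimal eksctl sketch, with every value illustrative:

```yaml
# cluster.yaml -- illustrative; `eksctl create cluster -f cluster.yaml` rebuilds it
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
managedNodeGroups:
  - name: workers
    instanceType: m5.large
    desiredCapacity: 3
    minSize: 3
    maxSize: 6
```

With this in Git alongside your manifests, the full-cluster-loss runbook is: create the cluster from the file, let GitOps re-sync the objects, then run Velero restores for the data layer.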
Practice Questions
1. Which command creates a point-in-time snapshot of etcd — the backing store for all Kubernetes cluster state?
2. Which Velero restore flag lets you restore a backup into a different namespace — enabling DR testing without affecting production?
3. What is the key difference between an etcd snapshot and a Velero backup — when would you use each?
Quiz
1. Your etcd snapshot is 6 hours old. Your PostgreSQL StatefulSet's PVC contains the last 6 months of transaction data. Is the database data included in the etcd snapshot?
2. To restore a self-managed cluster from an etcd snapshot, what must you do before running etcdctl snapshot restore?
3. Your team has Velero running nightly backups. What should you do to ensure these backups actually work when needed?
Up Next · Lesson 58
Kubernetes Troubleshooting
A systematic approach to diagnosing the most common Kubernetes failures: Pods that won't start, services that can't be reached, nodes that go NotReady, and deployments that roll back unexpectedly. The diagnostic commands and mental model every on-call engineer needs.