Kubernetes Lesson 40 – Security Contexts | Dataplexa

Networking, Ingress & Security · Lesson 40

Security Contexts

RBAC controls what you can do to the Kubernetes API. Security contexts control what a container can do to the Linux kernel. A container running as root with all capabilities can escape to the host, read other containers' data, and pivot across your entire cluster. Security contexts are how you stop that.

Why Container Isolation Is Not Enough by Default

Containers share the host kernel. Unlike virtual machines, there is no hypervisor between your container and the Linux kernel. Unless user namespaces are in use, root (UID 0) inside the container is the same UID as root on the host, and with the right capabilities a root container can mount the host filesystem, load kernel modules, modify iptables rules, and read other processes' memory.

A Docker image built without a USER instruction (and whose base image does not set one) runs as root. Most container images you'll encounter, including many official images, run as root by default. In a production Kubernetes cluster, every one of those containers is a potential escape path unless you explicitly constrain them.

A security context is a set of Linux security settings applied to a Pod or container. It controls: which UID/GID the process runs as, whether privilege escalation is allowed, which Linux capabilities the process has, whether the filesystem is read-only, and which seccomp/AppArmor profile is applied.

Two levels of security context

spec.securityContext (Pod-level): applies to all containers in the Pod. Sets shared settings like fsGroup, sysctls, seccompProfile.

spec.containers[].securityContext (container-level): applies to one specific container. Sets per-container settings like runAsUser, capabilities, readOnlyRootFilesystem. Container-level settings override Pod-level settings.
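
As a rough illustration of the override rule, the sketch below sets a Pod-wide default UID and overrides it for one container. The Pod name, container names, image, and UID values are placeholders chosen for this example; they are not part of the lesson's deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: context-levels-demo          # hypothetical name
spec:
  securityContext:
    runAsUser: 1000                  # Pod-level default for every container
    runAsNonRoot: true
  containers:
    - name: inherits-pod-settings    # no container-level context: runs as UID 1000
      image: busybox:1.36
      command: ["sh", "-c", "id && sleep 3600"]
    - name: overrides-pod-settings   # container-level value wins: runs as UID 2000
      image: busybox:1.36
      command: ["sh", "-c", "id && sleep 3600"]
      securityContext:
        runAsUser: 2000
```

Each container prints its effective UID on startup, so comparing the two containers' logs with kubectl logs should show which level took effect.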

The Hardened Pod Template

The scenario: You're deploying a payment API to a cluster that handles PCI DSS-regulated card data. The security team requires that all containers run as non-root, cannot escalate privileges, have a read-only root filesystem, and have minimal Linux capabilities. Here's the fully hardened manifest.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      # --- Pod-level security context ---
      securityContext:
        runAsNonRoot: true            # Kubernetes rejects any container that would run as UID 0
                                      # If the image's USER is root, the Pod fails to start with a clear error
        runAsUser: 1000               # Run all containers as UID 1000 (unless overridden per-container)
        runAsGroup: 3000              # Primary group GID 3000
        fsGroup: 2000                 # Files created in mounted volumes are owned by GID 2000
                                      # fsGroup is critical for shared volume access in multi-container Pods
        seccompProfile:               # Seccomp: filter which syscalls the container can make
          type: RuntimeDefault        # RuntimeDefault: the container runtime's built-in secure profile
                                      # Blocks dangerous syscalls while allowing all normal operations
                                      # Alternative: Localhost (custom profile) or Unconfined (no filter)
        sysctls: []                   # Kernel parameter overrides — leave empty unless specifically needed
                                      # Modifying sysctls is an advanced and risky operation

      containers:
        - name: payment-api
          image: company/payment-api:3.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "150m"
              memory: "200Mi"
            limits:
              cpu: "500m"
              memory: "350Mi"

          # --- Container-level security context ---
          securityContext:
            runAsNonRoot: true            # Belt and suspenders: enforce at container level too
            runAsUser: 1000               # Must match the UID the image's process actually runs as
            runAsGroup: 3000
            allowPrivilegeEscalation: false  # Sets no_new_privs: the process can never gain more privileges than its parent
                                              # The single most important security context setting
                                              # Neutralizes sudo, su, and any other SUID/SGID binary (they run without elevated rights)
            readOnlyRootFilesystem: true  # Mount container's root filesystem as read-only
                                          # Process cannot write to /etc, /bin, /usr, etc.
                                          # Forces explicit volume mounts for any writable paths needed
            capabilities:
              drop:
                - ALL                     # Drop ALL Linux capabilities first (clean slate)
              add:
                - NET_BIND_SERVICE        # Re-add only what's needed: bind to ports < 1024
                                          # Most apps don't even need this — remove if port >= 1024
                                          # Never add: SYS_ADMIN, NET_ADMIN, SYS_PTRACE, DAC_OVERRIDE

          volumeMounts:
            - name: tmp-volume           # Writable /tmp for the application
              mountPath: /tmp
            - name: cache-volume         # Writable cache directory
              mountPath: /app/cache

      volumes:
        - name: tmp-volume
          emptyDir: {}                   # Writable scratch space since root filesystem is read-only
        - name: cache-volume
          emptyDir:
            medium: Memory               # Cache in RAM (tmpfs): fast and never touches disk, but counts against the container's memory limit

$ kubectl apply -f payment-api-hardened.yaml
deployment.apps/payment-api created

$ kubectl describe pod payment-api-8c4f7d-j9pkx -n payments | grep -A20 "Security Context:"
    Security Context:
      Allow Privilege Escalation:  false
      Capabilities:
        Drop: ALL
        Add: NET_BIND_SERVICE
      Read Only Root Filesystem:  true
      Run As Group:               3000
      Run As Non Root:            true
      Run As User:                1000
      Seccomp Profile Type:       RuntimeDefault

$ kubectl exec -it payment-api-8c4f7d-j9pkx -n payments -- id
uid=1000 gid=3000 groups=3000,2000

$ kubectl exec -it payment-api-8c4f7d-j9pkx -n payments -- touch /etc/hacked
touch: /etc/hacked: Read-only file system   ← root filesystem is read-only ✓

$ kubectl exec -it payment-api-8c4f7d-j9pkx -n payments -- touch /tmp/scratch
(succeeds — /tmp is writable via the emptyDir volume mount)

What just happened?

allowPrivilegeEscalation: false is the most critical setting. This single field blocks one of the most common privilege escalation paths inside a container: SUID binaries. A SUID binary executes with the file owner's privileges rather than the caller's, so if a binary owned by root has the SUID bit set, any user can run it as root. allowPrivilegeEscalation: false sets the kernel's no_new_privs flag on the container process, which makes the kernel ignore SUID and SGID bits (and file capabilities) for that process and its children. Even if an attacker finds a SUID binary, running it grants no extra privileges.

capabilities drop ALL then add back selectively — Linux capabilities are fine-grained subdivisions of root privileges. CAP_NET_ADMIN allows iptables modifications. CAP_SYS_ADMIN is essentially root. Dropping ALL and adding only NET_BIND_SERVICE (for ports <1024) means the container has the absolute minimum Linux kernel privileges needed. Most web applications running on ports ≥1024 can drop ALL with nothing added back.
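
For a service that listens on an unprivileged port, the capabilities block can shrink to a bare drop. A minimal sketch, assuming the application binds to 8080 as in the manifest above and needs no other kernel privileges:

```yaml
# Container-level securityContext fragment (goes under spec.containers[])
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL        # no "add" list: the process keeps no capabilities at all
```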

readOnlyRootFilesystem + emptyDir volumes. A read-only root filesystem prevents an attacker from modifying binaries, dropping scripts into system directories, or persisting backdoors in the container's filesystem. The explicit emptyDir volumes for /tmp and /app/cache provide the writable space the application legitimately needs; they are writable to an attacker too, but their contents are deleted when the Pod is removed. This forces you to explicitly enumerate every writable directory, a discipline that reveals whether applications are writing to unexpected places.
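
Because the emptyDir volumes are the only writable paths left, it can help to bound them. The fragment below is a sketch that adds sizeLimit to the two volumes from the manifest above; the limit values are arbitrary assumptions and should be tuned to the application.

```yaml
# Fragment of the Pod spec: bounded writable scratch space
volumes:
  - name: tmp-volume
    emptyDir:
      sizeLimit: 100Mi     # assumed value; the Pod is evicted if disk usage exceeds it
  - name: cache-volume
    emptyDir:
      medium: Memory       # tmpfs: usage counts against the container's memory limit
      sizeLimit: 64Mi      # assumed value; keep it well below the 350Mi memory limit
```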

Linux Capabilities Reference

Understanding which capabilities are dangerous helps you make informed decisions about what to drop and what to add back.

Capability	What it allows	In containers: drop it?
SYS_ADMIN	Almost everything root can do — mount filesystems, modify kernel settings, etc.	Always drop — this is essentially root
NET_ADMIN	Configure network interfaces, iptables, routing tables	Always drop unless you're a network CNI pod
SYS_PTRACE	Attach to and inspect other processes — used for debugging but dangerous in containers	Always drop — can read other processes' memory
DAC_OVERRIDE	Bypass file permission checks — read any file regardless of permissions	Always drop — bypasses file system security
NET_BIND_SERVICE	Bind to ports below 1024 (privileged ports)	Keep only if your app uses ports < 1024 (HTTP=80, HTTPS=443)
CHOWN	Change file ownership	Usually safe to drop — most apps don't chown files at runtime

runAsNonRoot and Image Compatibility

Setting runAsNonRoot: true only works if the container image is built to run as a non-root user. Many official images — including older versions of nginx, Redis, and PostgreSQL — default to root. Here's how to detect and fix this.

The scenario: You're applying security contexts to your fleet but some Pods are failing to start with a "container has runAsNonRoot and image will run as root" error. You need to identify which images are the problem and either switch to non-root variants or fix them in your Dockerfile.

kubectl get pods -n payments --field-selector=status.phase=Pending
# List Pods stuck in Pending; ones blocked by runAsNonRoot typically show STATUS CreateContainerConfigError

kubectl describe pod payment-api-8c4f7d-xkpjz -n payments | grep -A5 "Events:"
# Events will show: "container has runAsNonRoot and image will run as root"
# This means the image's CMD/ENTRYPOINT runs as UID 0

docker inspect company/payment-api:3.0.0 | jq '.[0].Config.User'
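# Check what user the image runs as: an empty string, "0", or "root" means root
# A numeric UID such as "1000" is non-root and compatible with runAsNonRoot: true
# A named user such as "appuser" cannot be verified by the kubelet; pair it with an explicit numeric runAsUser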

# To check without docker:
kubectl run check-image --image=company/payment-api:3.0.0 -i --rm --restart=Never \
  --command -- id
# Run the 'id' command in the container (-i attaches, which --rm requires); shows the UID/GID it uses by default

# In the Dockerfile — fix the image to run as non-root:
# FROM node:18-alpine
# RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -s /bin/sh -D appuser
# USER 1001                 ← set the default user
# EXPOSE 3000
# CMD ["node", "server.js"]

# If you can't change the image, override the user in the security context:
# securityContext:
#   runAsUser: 1001         ← Kubernetes runs the container as UID 1001 regardless of image default
#   runAsGroup: 1001
#   runAsNonRoot: true      ← Validates that the effective UID is not 0

$ kubectl describe pod payment-api-8c4f7d-xkpjz -n payments | grep -A3 Events
Events:
  Warning  Failed  5s  kubelet  Error: container has runAsNonRoot
                                and image will run as root
                                (pod: "payment-api-8c4f7d-xkpjz", container: payment-api)

$ docker inspect company/payment-api:2.9.0 | jq '.[0].Config.User'
""   ← empty string = runs as root by default

$ docker inspect company/payment-api:3.0.0 | jq '.[0].Config.User'
"1000"   ← v3.0.0 fixed it — runs as UID 1000 ✓

(solution: update the image tag to 3.0.0 or add runAsUser: 1001 override)

What just happened?

runAsNonRoot validates, runAsUser overrides. runAsNonRoot: true is a validation: the kubelet checks that the effective UID is not 0 and refuses to start the container if it would be. runAsUser: 1001 is an override: it tells the container runtime to start the container process as UID 1001 regardless of what the image specifies. Using both together gives you a hard guarantee: the container must run as a non-root UID, and you're specifying exactly which UID.

Fix the image, not just the security context — Overriding runAsUser in the security context is a band-aid. The proper fix is to add a USER instruction to the Dockerfile so the image itself runs as non-root. This way the image behaves correctly even without security context overrides — useful for local development and other environments where security contexts aren't applied.

fsGroup and Volume Permissions

When a container runs as a non-root user and mounts a volume, file ownership matters. The volume's files might be owned by root — and your non-root process can't write to them. The fsGroup field solves this by making Kubernetes change the ownership of mounted volumes to the specified GID at mount time.

spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000                     # All mounted volumes will have GID 2000 applied
                                      # New files created in volumes get GID 2000 automatically
                                      # The process (GID 3000) can read/write volumes owned by 2000
                                      # if the volume permissions include group read/write

    fsGroupChangePolicy: OnRootMismatch  # OnRootMismatch: only change ownership if root dir has wrong ownership
                                          # Always (the default): recursively chown all files on every mount (slow for large volumes)
                                          # OnRootMismatch must be set explicitly and is faster for large PVCs

  containers:
    - name: app
      securityContext:
        runAsUser: 1000
      volumeMounts:
        - name: data-vol
          mountPath: /data            # Kubernetes will chown /data to GID 2000 before starting the container

  volumes:
    - name: data-vol
      persistentVolumeClaim:
        claimName: app-data-pvc

$ kubectl exec -it app-pod-7f9b4d-2xkpj -- ls -la /data
total 8
drwxrwsr-x 2 root  2000 4096 Mar 10 09:44 .
drwxr-xr-x 1 root  root 4096 Mar 10 09:44 ..
-rw-r--r-- 1 root  2000 0    Mar 10 09:44 existing-file.db

$ kubectl exec -it app-pod-7f9b4d-2xkpj -- touch /data/newfile.db
(succeeds: the process's supplementary GID 2000 matches the directory's GID 2000, and the setgid bit means new files inherit that GID)

$ kubectl exec -it app-pod-7f9b4d-2xkpj -- id
uid=1000 gid=3000 groups=3000,2000   ← added to group 2000 via fsGroup

What just happened?

fsGroup adds the process to a supplementary group — The id output shows the process is in GID 3000 (primary group) AND GID 2000 (supplementary group added by fsGroup). Kubernetes makes the process a member of the fsGroup GID — which lets it access files owned by that GID even though the primary group is different.

OnRootMismatch vs Always — fsGroupChangePolicy: Always recursively chowns every file in every mounted volume at Pod start time. For a 100GB database volume with millions of files, this can take minutes. OnRootMismatch only runs the chown if the root directory's GID doesn't already match — much faster for volumes that are already correctly owned.

Security Context Quick Reference

The security settings that matter most, at a glance:

Setting	Level	Recommended	Why
allowPrivilegeEscalation	Container	false	Blocks SUID/SGID escalation — most critical setting
runAsNonRoot	Pod / Container	true	Enforced by the kubelet at container start: a root image fails to start instead of running as UID 0
runAsUser	Pod / Container	1000 (or your app UID)	Explicit UID — more predictable than relying on image USER
readOnlyRootFilesystem	Container	true	Prevents writing backdoors, modifying binaries — pair with writable emptyDirs
capabilities.drop ALL	Container	Always	Clean slate — then add back only what's needed
seccompProfile RuntimeDefault	Pod / Container	RuntimeDefault	Blocks dozens of rarely needed, dangerous syscalls with no application changes
fsGroup	Pod	2000 (or your volume GID)	Required when non-root user needs to write to mounted volumes

Teacher's Note: Start with a restricted baseline, then fix what breaks

The common objection to security contexts is "it's too hard to figure out what breaks." Here's the approach that works: apply the hardened template to one service in staging, watch what fails, fix it. Repeat for the next service. Build a library of known-good configurations for your common application types (Node.js API, Java service, Python worker).

The most common failure modes after hardening: (1) App writes to /tmp — add an emptyDir for /tmp. (2) App writes log files to its install directory — add an emptyDir for the log directory. (3) App needs to write to /var/run for PID files — add an emptyDir. (4) App uses the wrong UID — check the image and either fix it or override with runAsUser.
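
The first three failure modes share one fix pattern, sketched below. The container name, volume names, and the /app/logs path are hypothetical; substitute whatever paths your application actually writes to.

```yaml
# Fragment of a Pod spec: writable scratch paths for a read-only root filesystem
containers:
  - name: app
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp
        mountPath: /tmp          # (1) temp files
      - name: logs
        mountPath: /app/logs     # (2) hypothetical log directory
      - name: run
        mountPath: /var/run      # (3) PID files
volumes:
  - name: tmp
    emptyDir: {}
  - name: logs
    emptyDir: {}
  - name: run
    emptyDir: {}
```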

For new Kubernetes clusters: set the Pod Security Admission controller (successor to PodSecurityPolicy, which was removed in Kubernetes 1.25) to baseline on all namespaces and restricted on namespaces with sensitive data. The restricted profile enforces most of the settings in this lesson automatically, rejecting Pods that don't comply.
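
Pod Security Admission is configured per namespace through labels. A minimal sketch for the payments namespace used in this lesson, assuming you want violations rejected and also surfaced as client warnings and audit annotations:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant Pods
    pod-security.kubernetes.io/warn: restricted      # warn the client on apply
    pod-security.kubernetes.io/audit: restricted     # annotate audit log entries
```

A less sensitive namespace would carry baseline in the enforce label instead. Setting warn and audit to restricted there is one way to preview what would break before tightening enforcement.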

Practice Questions

1. Which single security context setting is most critical for preventing privilege escalation via SUID binaries — making the kernel ignore the SUID/SGID bit for processes in the container?

2. You want to prevent a container from writing to its root filesystem (like /etc, /bin, /tmp), making it impossible for an attacker to persist backdoors. Which security context field achieves this?

3. Your container runs as UID 1000 / GID 3000 but needs to write to a mounted PVC. The PVC's files are owned by root. Which Pod-level security context field makes Kubernetes change the GID ownership of mounted volumes so the non-root process can write to them?

Up Next · Lesson 41

Pod Security Policies

Cluster-wide security context enforcement — how Pod Security Admission (the PodSecurityPolicy successor) automatically validates and enforces security baselines across all namespaces.
