Docker Course
Docker Security Hardening
A container running a public-facing API was compromised through an unpatched library vulnerability. The attacker had code execution inside the container. With the default Docker runtime configuration, they were able to read /proc/sysrq-trigger, send signals to host processes, and use kernel capabilities that had no business being available to a web application. Some vulnerability will always get through: patches lag and zero-days happen. What was avoidable was giving the attacker a fully equipped workshop once they were inside.
Lesson 32 covered the basics — non-root user, read-only filesystem, capability dropping. This lesson goes deeper: the kernel-level controls that restrict what system calls a container process can make, the Linux security modules that enforce mandatory access control, rootless Docker that removes the Daemon's root privileges entirely, and the runtime flags that make a compromised container a dead end rather than a launchpad. These are the controls that contain the blast radius when — not if — a vulnerability is exploited.
Basic Hardening vs Deep Hardening
Lesson 32 basics — good start
- Non-root user inside the container
- Read-only root filesystem
- --cap-drop ALL with selective adds
- --security-opt no-new-privileges
- No Docker socket mount
- Vulnerability scanning in CI
This lesson — defence in depth
- Seccomp profile — restrict allowed system calls to a whitelist
- AppArmor profile — mandatory access control on file and network access
- Rootless Docker — Daemon itself runs without root on the host
- User namespace remapping — container root maps to unprivileged host user
- PIDs limit — prevent fork bombs from exhausting the host
- Runtime security scanning — detect threats in running containers
The Vault Analogy
Basic Docker hardening is like hiring a security guard for your building — it deters casual intruders and stops obvious attacks. Deep security hardening is like building a vault inside the building: even if someone gets past the guard, breaks through the door, and reaches the safe room, they find a series of progressively harder barriers — a time-lock, a secondary combination, a silent alarm — each one requiring a different kind of bypass. The attacker who gets code execution inside a hardened container finds themselves in a room with no tools, no network access to unexpected destinations, no kernel calls beyond a narrow whitelist, and every action logged. The compromise happened. The damage did not.
Seccomp — Restricting System Calls
Every action a process takes — reading a file, opening a socket, forking a child process — is a system call to the Linux kernel. Docker applies a default seccomp profile that blocks around 44 dangerous syscalls out of ~300 available. A custom profile goes further: it whitelists only the syscalls your specific application actually needs, blocking everything else — including syscalls that no web server should ever make, like ptrace, mount, and kexec_load.
# seccomp-profile.json — whitelist approach: deny everything, allow only what's needed
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
"syscalls": [
{
"names": [
"read", "write", "open", "close", "stat", "fstat",
"lstat", "poll", "lseek", "mmap", "mprotect", "munmap",
"brk", "rt_sigaction", "rt_sigprocmask", "ioctl",
"pread64", "pwrite64", "readv", "writev", "access",
"pipe", "select", "sched_yield", "mremap", "msync",
"dup", "dup2", "nanosleep", "getitimer", "alarm",
"setitimer", "getpid", "socket", "connect", "accept",
"sendto", "recvfrom", "sendmsg", "recvmsg", "bind",
"listen", "getsockname", "getpeername", "setsockopt",
"getsockopt", "clone", "fork", "vfork", "execve",
"exit", "wait4", "kill", "getppid", "getuid", "getgid",
"geteuid", "getegid", "futex", "getcwd", "chdir",
"rename", "mkdir", "rmdir", "unlink", "readlink",
"chmod", "getrlimit", "getrusage", "sysinfo", "times",
"getgroups", "setgroups", "uname", "arch_prctl",
"prctl", "capget", "set_tid_address", "set_robust_list",
"epoll_create", "epoll_ctl", "epoll_wait", "clock_gettime",
"exit_group", "openat", "getdents64", "newfstatat"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
# Only these syscalls are permitted. Everything else returns EPERM.
# Notably absent: ptrace, mount, kexec_load, create_module,
# init_module, delete_module, all used in container escapes.
# Keep these notes out of the profile file itself: it must be valid JSON.
# Apply the custom seccomp profile at runtime:
docker run -d \
--name payment-api \
--security-opt seccomp=./seccomp-profile.json \
-p 3000:3000 \
payment-api:v1.2.0
# --security-opt seccomp= → path to the JSON profile
# replaces Docker's default profile entirely
# Verify seccomp is active:
docker inspect payment-api \
--format '{{.HostConfig.SecurityOpt}}'
[seccomp={"defaultAction":"SCMP_ACT_ERRNO"...}]
# In Docker Compose:
services:
api:
image: payment-api:v1.2.0
security_opt:
- seccomp:./seccomp-profile.json
# Test that a blocked syscall is denied:
docker exec payment-api \
strace -e ptrace ls 2>&1 | grep -i "operation not permitted"
ptrace(PTRACE_TRACEME): Operation not permitted
# ptrace is blocked. An attacker cannot trace or inspect other processes.
# Attempt a container escape syscall from inside the hardened container:
docker exec payment-api python3 -c "
import ctypes
# Attempt kexec_load — used in some container escape techniques:
libc = ctypes.CDLL('libc.so.6', use_errno=True)
ret = libc.syscall(246) # syscall number 246 = kexec_load on x86_64
print(f'kexec_load returned: {ret}, errno: {ctypes.get_errno()}')
"
kexec_load returned: -1, errno: 1
# errno 1 = EPERM: Operation not permitted.
# The seccomp filter rejected the call at syscall entry, inside the kernel.
# The process received an error. The kexec_load handler never ran.
# (Without the filter, kexec_load would still require CAP_SYS_BOOT; seccomp stops it earlier.)
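# Default Docker profile blocks around 44 syscalls.
# The custom allowlist above permits 87 syscalls and blocks every other one.
# Measured against the ~300 available, that is roughly a third of the
# kernel entry points the default profile leaves reachable.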
What just happened?
The kexec_load syscall, used in some container escape techniques, was blocked before it could execute. The seccomp filter runs inside the kernel at syscall entry: it rejected the call and returned EPERM, so the kexec_load handler never ran. This is enforced by the Linux kernel itself, not by Docker, not by the application, and not by any userspace tool that an attacker could disable. Even if the attacker has root inside the container, a seccomp filter cannot be removed once it is installed on a process, and child processes inherit it.
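Before deploying an allowlist profile, it can help to load the JSON and confirm that none of the escape-related syscalls slipped into an allow rule. The sketch below assumes the profile layout shown above (a `defaultAction` plus a `syscalls` list of `names`/`action` rules) and ignores argument-level conditions. The `audit_profile` helper and the trimmed `SAMPLE` profile are hypothetical and exist only for illustration; neither is part of Docker.

```python
import json

# Syscalls this lesson calls out as container-escape vectors.
DANGEROUS = {"ptrace", "mount", "kexec_load",
             "create_module", "init_module", "delete_module"}


def audit_profile(profile: dict) -> dict:
    """Summarise a seccomp profile: how much is allowed, and which risky calls leak through."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return {
        "deny_by_default": profile.get("defaultAction") != "SCMP_ACT_ALLOW",
        "allowed_count": len(allowed),
        "dangerous_allowed": sorted(allowed & DANGEROUS),
    }


# Hypothetical, heavily trimmed profile used only to exercise the helper.
SAMPLE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "openat", "close"], "action": "SCMP_ACT_ALLOW"}
  ]
}
""")

report = audit_profile(SAMPLE)
print(report)
```

To check a real profile, replace `SAMPLE` with `json.load(open("seccomp-profile.json"))`. A non-empty `dangerous_allowed` list is a reason to revisit the allowlist before it ships.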
AppArmor — Mandatory Access Control
AppArmor is a Linux Security Module that enforces mandatory access control policies on top of standard Linux permissions. Where seccomp restricts which system calls can be made, AppArmor restricts what files can be read or written, what network operations are permitted, and what capabilities can be used — regardless of what the process's user or group permissions say. Docker applies a default AppArmor profile automatically on systems where AppArmor is enabled. A custom profile tightens this further for specific workloads.
# /etc/apparmor.d/docker-payment-api — custom AppArmor profile
#include <tunables/global>
profile docker-payment-api flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
#include
# Allow read access to the application directory only:
/app/** r,
/app/dist/** r,
# No write access to /app — read-only from AppArmor's perspective too.
# Allow writes only to temp directories (matches --tmpfs from Lesson 32):
/tmp/** rw,
/app/tmp/** rw,
# Allow network — the app needs to listen and make outbound calls:
network inet tcp,
network inet udp,
# Explicitly deny sensitive paths — belt and suspenders:
deny /proc/sysrq-trigger rw,
deny /proc/*/mem rw,
deny /sys/kernel/security/** rw,
deny /etc/shadow r,
deny /etc/passwd w,
# Allow necessary capabilities only:
capability net_bind_service,
deny capability sys_admin,
deny capability sys_ptrace,
deny capability sys_rawio,
}
# Load the AppArmor profile into the kernel:
sudo apparmor_parser -r -W /etc/apparmor.d/docker-payment-api
# Verify the profile is loaded:
sudo aa-status | grep docker-payment-api
docker-payment-api
# Apply the profile to a container:
docker run -d \
--name payment-api \
--security-opt apparmor=docker-payment-api \
-p 3000:3000 \
payment-api:v1.2.0
# In Docker Compose:
services:
api:
image: payment-api:v1.2.0
security_opt:
- apparmor:docker-payment-api
- no-new-privileges:true
- seccomp:./seccomp-profile.json
# All three security options stack — each adds an independent layer.
# Test AppArmor enforcement from inside the container:
# Attempt to read /etc/shadow (explicitly denied in profile):
docker exec payment-api cat /etc/shadow
cat: /etc/shadow: Permission denied
# AppArmor blocked access, even if the file permissions would have allowed it.
# Attempt to write to /proc/sysrq-trigger:
docker exec payment-api sh -c "echo b > /proc/sysrq-trigger"
sh: /proc/sysrq-trigger: Permission denied
# AppArmor denied the write. This would trigger a host reboot if allowed.
# Legitimate app operation, reading from /app/dist:
docker exec payment-api ls /app/dist/
server.js routes/ middleware/ utils/
# Permitted: matches /app/dist/** r, in the profile.
# AppArmor violations are logged to /var/log/syslog:
sudo grep "apparmor" /var/log/syslog | tail -3
kernel: audit: apparmor="DENIED" operation="open" profile="docker-payment-api" name="/etc/shadow" pid=8821 comm="cat" requested_mask="r" denied_mask="r"
# Every denial logged with the process name, PID, and attempted operation.
What just happened?
AppArmor denied access to /etc/shadow and /proc/sysrq-trigger regardless of the process's user permissions — mandatory access control overrides discretionary permissions. Writing to /proc/sysrq-trigger with the letter b would have triggered an immediate host reboot — a common denial-of-service technique from inside containers. AppArmor logged every denied access attempt with full context. Seccomp blocked the syscall layer. AppArmor blocked the file and capability layer. Together they cover different dimensions of the attack surface.
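Because every denial is logged as a line of key=value pairs, the audit trail is easy to turn into structured events for alerting. The sketch below parses one line in the format shown in the syslog excerpt above. The `parse_apparmor_event` helper is a hypothetical name, and the regular expression assumes values are either double-quoted or contain no spaces.

```python
import re

# Matches key="quoted value" or key=bare_value pairs.
PAIR = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_apparmor_event(line: str) -> dict:
    """Turn one AppArmor audit line into a dict of its key=value fields."""
    fields = {}
    for key, raw, quoted in PAIR.findall(line):
        # Use the unquoted capture when the value was wrapped in double quotes.
        fields[key] = quoted if raw.startswith('"') else raw
    return fields


LINE = ('kernel: audit: apparmor="DENIED" operation="open" '
        'profile="docker-payment-api" name="/etc/shadow" pid=8821 '
        'comm="cat" requested_mask="r" denied_mask="r"')

event = parse_apparmor_event(LINE)
if event.get("apparmor") == "DENIED":
    print(f'{event["comm"]} (pid {event["pid"]}) was denied {event["operation"]} on {event["name"]}')
```

All values come back as strings, so `pid` would need an `int()` conversion before any numeric comparison.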
User Namespace Remapping
Inside a container, processes run as root (uid 0) by default — or as a specific user if USER is set in the Dockerfile. Either way, if a container escape succeeds, the attacker's effective user on the host depends on how the container's user maps to the host's user namespace. With user namespace remapping, container root (uid 0 inside) maps to an unprivileged user (uid 100000+ outside) on the host — so even a successful container escape lands the attacker as an unprivileged user with no host access.
# Enable user namespace remapping globally in /etc/docker/daemon.json:
{
"userns-remap": "default"
}
# daemon.json must be valid JSON, so these notes stay outside the file:
# "default" → Docker creates a dedicated "dockremap" user and group
# and maps container UIDs to a safe unprivileged range on the host.
# Container uid 0 → host uid 231072 (unprivileged)
# Container uid 1000 → host uid 232072 (unprivileged)
# Restart Docker to apply:
sudo systemctl restart docker
# Verify the mapping:
cat /etc/subuid | grep dockremap
dockremap:231072:65536
# dockremap user gets UIDs 231072 through 296607 on the host.
# Container root (uid 0) maps to host uid 231072 — not root.
# Without user namespace remapping:
docker run --rm alpine id
uid=0(root) gid=0(root) groups=0(root)
# Container reports root. On the host, this process also runs as root (uid 0).
# A container escape = root on the host.
# With user namespace remapping enabled:
docker run --rm alpine id
uid=0(root) gid=0(root) groups=0(root)
# Container still sees itself as root; the application is unaware of remapping.
# But on the host, check the actual process UID:
ps aux | grep "node server.js"
231072 8821 0.3 1.2 ... node server.js
# Host sees uid 231072, an unprivileged user with no host access.
# A container escape now lands the attacker as uid 231072.
# They cannot read /etc/shadow, cannot write to /var/lib/docker,
# cannot interact with other processes: they're a nobody on the host.
What just happened?
The container process still sees itself as root — application behaviour is unchanged. But on the host, the process runs as uid 231072, an unprivileged account with no meaningful host permissions. A complete container escape — the kind that requires chaining multiple vulnerabilities — now lands the attacker in an unprivileged shell on the host rather than a root shell. The most dangerous possible outcome of a container compromise just became significantly less dangerous, with one line in daemon.json.
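The mapping itself is plain arithmetic: the host UID is the start of the subordinate range plus the UID inside the container. As a minimal sketch, the hypothetical `host_uid` helper below applies that rule to a single /etc/subuid entry in the `name:start:count` form shown above.

```python
def host_uid(container_uid: int, subuid_entry: str) -> int:
    """Map a container UID to its host UID using one /etc/subuid entry (name:start:count)."""
    _name, start, count = subuid_entry.strip().split(":")
    start, count = int(start), int(count)
    if not 0 <= container_uid < count:
        # UIDs outside the subordinate range have no mapping on the host.
        raise ValueError(f"uid {container_uid} is outside the {count}-id mapped range")
    return start + container_uid


entry = "dockremap:231072:65536"
print(host_uid(0, entry))      # 231072: container root
print(host_uid(1000, entry))   # 232072: a typical application user
print(host_uid(65535, entry))  # 296607: last UID in the range
```

The same arithmetic explains why a container that runs as a non-root user shows up on the host under a UID above 231072 rather than at 231072 exactly.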
PIDs Limit and Fork Bomb Prevention
A fork bomb is a denial-of-service attack where a process recursively spawns copies of itself until the host runs out of process table entries — crashing every service on the machine. Without a PIDs limit, a single container can trigger a fork bomb that takes down the entire host. The fix is one flag.
# Set a maximum process limit on the container:
docker run -d \
--name payment-api \
--pids-limit 100 \
-p 3000:3000 \
payment-api:v1.2.0
# --pids-limit 100 → the container can hold at most 100 tasks at once.
# The limit counts threads as well as processes (the kernel pids controller counts tasks).
# A Node.js web server typically uses 5–15 processes.
# Setting 100 gives ample headroom for normal operation
# while stopping a fork bomb at the container boundary.
# In Docker Compose:
services:
api:
image: payment-api:v1.2.0
pids_limit: 100
# Set globally in daemon.json for all containers:
{
"default-pids-limit": 100
}
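# Test fork bomb is contained (run in a throwaway container).
# A named function is used because POSIX shells such as Alpine's ash may reject ":" as a function name:
docker run --rm --pids-limit 20 alpine \
sh -c 'bomb(){ bomb|bomb& };bomb' 2>&1 | head -5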
sh: fork: Resource temporarily unavailable
sh: fork: Resource temporarily unavailable
# Fork bomb hit the PIDs limit at 20 processes.
# The host is completely unaffected. The container absorbed the attack.
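To pick a sensible limit rather than guess, it helps to observe how many tasks the container actually uses. The kernel exposes this through the pids controller's `pids.current` and `pids.max` files. The sketch below reads both from a cgroup directory; the `pids_headroom` helper is a hypothetical name, and the assumption that the files live at /sys/fs/cgroup inside the container holds only on cgroup v2 hosts. The demo runs against a fake directory so it works anywhere.

```python
import tempfile
from pathlib import Path


def pids_headroom(cgroup_dir: str) -> dict:
    """Read pids.current / pids.max from a cgroup directory and report usage."""
    base = Path(cgroup_dir)
    current = int((base / "pids.current").read_text().strip())
    raw_max = (base / "pids.max").read_text().strip()
    limit = None if raw_max == "max" else int(raw_max)  # "max" means no limit is set
    return {
        "current": current,
        "limit": limit,
        "remaining": None if limit is None else limit - current,
    }


# Demo against a fake cgroup directory (values are made up for illustration).
with tempfile.TemporaryDirectory() as fake:
    Path(fake, "pids.current").write_text("12\n")
    Path(fake, "pids.max").write_text("100\n")
    demo = pids_headroom(fake)

print(demo)
```

Inside a running container the call would be `pids_headroom("/sys/fs/cgroup")` under the cgroup v2 assumption. Sampling `current` under peak load gives a baseline for choosing `--pids-limit`.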
A Complete Hardened Container
The scenario: You're hardening a public-facing payment API. It runs as a non-root user, has a read-only filesystem, and is already vulnerability-scanned from Lesson 32. You now apply all the deep hardening controls from this lesson — seccomp, AppArmor, user namespace remapping, and a PIDs limit — and verify each layer is active.
# The fully hardened run command, all controls stacked.
# A comment between backslash-continued lines would end the command early,
# so the groups are listed here instead:
#   From Lesson 32 (basics):           --user, --read-only, --tmpfs, --cap-drop, --cap-add, no-new-privileges
#   From this lesson (deep hardening): seccomp, apparmor, --pids-limit
#   Resource limits (Lesson 34):       --memory, --memory-swap, --cpus
#   Logging (Lesson 35):               --log-driver, --log-opt
docker run -d \
--name payment-api \
--user appuser \
--read-only \
--tmpfs /tmp \
--cap-drop ALL \
--cap-add NET_BIND_SERVICE \
--security-opt no-new-privileges \
--security-opt seccomp=./seccomp-profile.json \
--security-opt apparmor=docker-payment-api \
--pids-limit 100 \
--memory 512m \
--memory-swap 512m \
--cpus 1.5 \
--log-driver json-file \
--log-opt max-size=10m \
--log-opt max-file=5 \
--restart unless-stopped \
-p 3000:3000 \
payment-api:v1.2.0
# Verify every hardening layer is active:
# 1. Non-root user:
docker exec payment-api whoami
appuser ✓
# 2. Read-only filesystem:
docker exec payment-api touch /test 2>&1
touch: /test: Read-only file system ✓
# 3. Capabilities dropped:
docker exec payment-api cat /proc/self/status | grep CapBnd
CapBnd: 0000000000000400 ✓ (only NET_BIND_SERVICE; a non-root user's CapEff is all zeros)
# 4. Seccomp active:
docker inspect payment-api \
--format '{{.HostConfig.SecurityOpt}}' | grep -c seccomp
1 ✓
# 5. AppArmor enforcing:
docker exec payment-api cat /proc/self/attr/current
docker-payment-api (enforce) ✓
# 6. PIDs limit:
docker inspect payment-api --format '{{.HostConfig.PidsLimit}}'
100 ✓
# 7. User namespace (on host):
ps aux | grep "[n]ode server" | awk '{print $1}'
232072 ✓ (unprivileged host user: 231072 + appuser's uid, assumed here to be 1000)
# Simulate post-exploitation — attacker has code execution inside the container.
# What can they do?
# Try to install a tool (read-only filesystem):
docker exec payment-api apk add curl
ERROR: Unable to lock database: Read-only file system ✗ blocked
# Try to read the container's password hashes (AppArmor):
docker exec payment-api cat /etc/shadow
Permission denied ✗ blocked
# Try to trace another process (seccomp + capabilities):
docker exec payment-api strace -p 1
strace: attach: ptrace(PTRACE_SEIZE, 1): Operation not permitted ✗ blocked
# Try to write to /proc/sysrq-trigger (AppArmor):
docker exec payment-api sh -c "echo b > /proc/sysrq-trigger"
Permission denied ✗ blocked
# Try a fork bomb (PIDs limit):
docker exec payment-api sh -c 'bomb(){ bomb|bomb& };bomb'
sh: fork: Resource temporarily unavailable ✗ blocked
# Legitimate app operation — still works:
curl http://localhost:3000/health
{"status":"healthy","uptime":847} ✓ working
# The attacker is inside a room with no tools, no exits, and every action logged.
What just happened?
Every post-exploitation technique was blocked by a different layer — the read-only filesystem stopped tool installation, AppArmor blocked sensitive file access, seccomp denied the ptrace syscall, AppArmor blocked the sysrq trigger, and the PIDs limit contained the fork bomb. Each control is independent: disabling one does not disable the others. An attacker who finds a way past AppArmor still hits seccomp. An attacker who finds a way past seccomp still hits the capability restrictions. This is defence in depth working as designed — not a single wall, but a series of independent barriers each requiring a different bypass.
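The manual checks above can also be run as one automated audit over `docker inspect` output, which is convenient in CI. The sketch below examines a single inspect object. `audit_container` and `SAMPLE` are hypothetical names; the field paths (`Config.User`, `HostConfig.ReadonlyRootfs`, `HostConfig.CapDrop`, `HostConfig.SecurityOpt`, `HostConfig.PidsLimit`) follow the inspect output as used earlier in this lesson, and the sample values are invented to mirror the hardened run command.

```python
def audit_container(inspect: dict) -> dict:
    """Map each hardening control to True/False for one `docker inspect` object."""
    host = inspect.get("HostConfig") or {}
    opts = host.get("SecurityOpt") or []
    user = ((inspect.get("Config") or {}).get("User") or "").split(":")[0]

    def custom(prefix: str, defaults: tuple) -> bool:
        # True when an option with this prefix is set to something other than a default.
        values = [o[len(prefix):] for o in opts if o.startswith(prefix)]
        return any(v not in defaults for v in values)

    nnp = ("no-new-privileges", "no-new-privileges=true", "no-new-privileges:true")
    return {
        "non_root_user": user not in ("", "0", "root"),
        "read_only_rootfs": bool(host.get("ReadonlyRootfs")),
        "cap_drop_all": any(c.upper() == "ALL" for c in host.get("CapDrop") or []),
        "no_new_privileges": any(o in nnp for o in opts),
        "custom_seccomp": custom("seccomp=", ("unconfined",)),
        "custom_apparmor": custom("apparmor=", ("unconfined", "docker-default")),
        "pids_limit": (host.get("PidsLimit") or 0) > 0,
    }


# Invented inspect fragment mirroring the hardened run command above.
SAMPLE = {
    "Config": {"User": "appuser"},
    "HostConfig": {
        "ReadonlyRootfs": True,
        "CapDrop": ["ALL"],
        "SecurityOpt": [
            "no-new-privileges",
            'seccomp={"defaultAction":"SCMP_ACT_ERRNO"}',
            "apparmor=docker-payment-api",
        ],
        "PidsLimit": 100,
    },
}

missing = [name for name, ok in audit_container(SAMPLE).items() if not ok]
print(missing or "all controls present")
```

Against a live container, the input would be the first element of the JSON array printed by `docker inspect payment-api`. User namespace remapping is a daemon-level setting, so this per-container check does not cover it.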
Security controls — what each blocks
Teacher's Note
Apply these controls in order of effort. The basics from Lesson 32 (non-root, read-only, --cap-drop ALL, no-new-privileges) take about ten minutes and give you most of the protection. Add --pids-limit next: one flag, one minute, and it prevents a whole class of DoS attacks. Seccomp and AppArmor require workload-specific profiling to avoid breaking legitimate operations. Start with Docker's default seccomp profile (already active) and use aa-genprof to generate an AppArmor profile from observed application behaviour. User namespace remapping is a daemon-level change that affects all containers, so test it on a non-production host first. Defence in depth doesn't mean applying every control immediately. It means adding layers progressively, each one independently valuable.
Practice Questions
1. The Linux kernel security mechanism that filters which system calls a container process is allowed to make — blocking dangerous syscalls like ptrace and kexec_load at the kernel level — is called what?
2. To prevent a fork bomb inside a container from exhausting the host's process table — by capping the total number of processes the container can spawn — which docker run flag is used?
3. The daemon.json setting that maps container root (uid 0) to an unprivileged user on the host — so that a container escape does not grant root access to the host — is called what?
Quiz
1. An attacker achieves root inside a container that has a custom seccomp profile. They attempt to disable the seccomp filter from inside the container. What happens?
2. A container process runs as root and attempts to read /etc/shadow. The file permissions allow root to read it. An AppArmor profile has deny /etc/shadow r. What happens?
3. Without user namespace remapping, a successful container escape gives the attacker root on the host. With userns-remap: default configured, what does a successful container escape give the attacker instead?
Up Next · Lesson 42
Docker Troubleshooting
Containers hardened — now the operational reality: things will still break. A container exits immediately with code 1. A service is unreachable from outside. A build cache keeps getting busted. Docker troubleshooting is a systematic process — the right sequence of commands turns a mystery into a root cause in minutes rather than hours.