Docker Course
Docker Troubleshooting
A container exits two seconds after starting with no output and exit code 1. A service is running but completely unreachable from outside the host. A build that worked yesterday fails today with a cryptic layer error. A database keeps restarting every ninety seconds without explanation. Each of these looks like a different problem — and each has a systematic sequence of commands that turns it from a mystery into a root cause in under five minutes. Most Docker problems are not exotic. They are the same six or seven failure modes, over and over, in different disguises.
This lesson is a troubleshooting playbook: the exact commands, in the exact order, for the failures that appear most often in practice. Not theory — the specific flags, the specific output patterns to look for, and the specific fixes that resolve them.
The Diagnostic Ladder Analogy
A doctor diagnosing a patient doesn't start by ordering every test available. They start with the cheapest, fastest test that rules out the most common causes — temperature, pulse, blood pressure — and only move to expensive imaging if the basics don't explain the symptom. Docker troubleshooting follows the same discipline. Start at the bottom of the ladder: is the container running? If not, what did the logs say before it exited? If it's running, can you reach it from inside? From outside? Each rung of the ladder costs more time but covers a narrower problem space. Most issues are resolved on the first two rungs and never require the top.
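The ladder, written out as a command sequence. This is a sketch using this lesson's running example (the payment-api container and port 3000); substitute your own names, and note these commands need a live Docker daemon:

```shell
# Rung 1 — is the container running at all? (cheapest possible check)
docker ps -a --filter name=payment-api

# Rung 2 — what did it say before it stopped?
docker logs --tail 50 payment-api

# Rung 3 — is the service reachable from INSIDE the container?
docker exec payment-api wget -qO- http://localhost:3000/health

# Rung 4 — is it reachable from the host, through the port mapping?
curl -fsS http://localhost:3000/health

# Rung 5 — only now reach for the expensive, verbose tools:
docker inspect payment-api
docker network inspect bridge
```

Stop at the first rung that explains the symptom; each rung costs more time and covers a narrower problem space than the one before it.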
Failure 1 — Container Exits Immediately
The most common failure for new Docker users: a container starts and stops within seconds. docker ps shows nothing. The container ran, failed, and exited. The cause is almost always in the logs — but docker ps only shows running containers. The key is docker ps -a, which shows all containers including exited ones, and docker logs on the stopped container.
# Step 1 — find the exited container:
docker ps -a
CONTAINER ID NAME STATUS IMAGE
a1b2c3d4e5f6 payment-api Exited (1) 3 seconds ago payment-api:v1.2.0
# STATUS: Exited (1) — exit code 1 means application error.
# Exited (137) = OOM kill. Exited (143) = graceful SIGTERM. Exited (0) = clean exit.
# Step 2 — read the logs from the exited container:
docker logs a1b2c3d4e5f6
Error: Cannot find module '/app/dist/server.js'
at Function.Module._resolveFilename (node:internal/modules/cjs/loader:1039:15)
# The compiled output doesn't exist — the build step was skipped or failed.
# Step 3 — check the exit code for more context:
docker inspect a1b2c3d4e5f6 \
--format 'ExitCode={{.State.ExitCode}} Error={{.State.Error}}'
ExitCode=1 Error=
# Step 4 — if logs are empty, run interactively to see what happens:
docker run -it --entrypoint sh payment-api:v1.2.0
# Drops into a shell inside the image — run the CMD manually to see the error.
/app # node dist/server.js
Error: Cannot find module '/app/dist/server.js'
# Confirmed — dist/ directory is missing from the image.
# Common exit codes and their meanings:
# Exit 0 → clean shutdown — process exited normally (not a crash)
# Exit 1 → application error — check docker logs for the error message
# Exit 2 → misuse of shell built-in — usually a shell script error in CMD
# Exit 125 → docker run itself failed — invalid flag or missing image
# Exit 126 → CMD found but not executable — file permission problem
# Exit 127 → CMD not found — wrong path, missing binary, or wrong base image
# Exit 137 → OOM kill (128 + SIGKILL 9) — increase memory limit
# Exit 139 → segmentation fault (128 + SIGSEGV 11) — application bug
# Exit 143 → graceful shutdown (128 + SIGTERM 15) — Docker asked it to stop
# Fix for the missing dist/ error:
# The Dockerfile's build step was missing — rebuild with the production target,
# which includes the npm run build step:
docker build --target production -t payment-api:v1.2.0 .
# Remove the exited container so the name is free, then run the new image:
docker rm payment-api
docker run -d --name payment-api payment-api:v1.2.0
docker ps
CONTAINER ID NAME STATUS
b2c3d4e5f6a7 payment-api Up 4 seconds ← running
What just happened?
docker ps -a revealed the exited container that docker ps hides. docker logs on the container ID showed the exact error — a missing compiled output file. Running with --entrypoint sh replaced the container's default command with a shell, allowing manual inspection of the filesystem and reproduction of the failure interactively. The fix was a correct build — not a Docker configuration change. Most exit-immediately failures are application errors visible in the logs within thirty seconds of starting the investigation.
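The 128 + signal arithmetic behind the exit-code table above can be automated. A minimal sketch; the explain_exit helper is hypothetical, not a Docker command:

```shell
# Hypothetical helper: translate a container exit code into a human hint.
# Codes above 128 encode 128 + the number of the fatal signal.
explain_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error - check docker logs" ;;
    125) echo "docker run itself failed" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    *)   if [ "$1" -gt 128 ]; then
           echo "killed by signal $(( $1 - 128 ))"
         else
           echo "unknown exit code: $1"
         fi ;;
  esac
}

explain_exit 137   # prints: killed by signal 9
explain_exit 143   # prints: killed by signal 15
```

Feed it the value from `docker inspect --format '{{.State.ExitCode}}'` on a stopped container.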
Failure 2 — Service Unreachable from Outside
The container is running. The application is listening. But requests from outside the host — or even from the host itself — get connection refused or time out. This failure has three possible causes, each checked in sequence: the port mapping is wrong, the application is bound to the wrong interface, or a firewall is blocking the traffic. Check them in that order.
# Step 1 — confirm the port mapping is correct:
docker ps --format "table {{.Names}}\t{{.Ports}}"
NAME PORTS
payment-api 0.0.0.0:3000->3000/tcp
# 0.0.0.0:3000 → listening on all interfaces on the host.
# If this shows nothing, the -p flag was missing from docker run.
docker port payment-api
3000/tcp -> 0.0.0.0:3000
# Confirms mapping: container port 3000 → host port 3000.
# Step 2 — confirm the application is reachable from INSIDE the container:
docker exec payment-api wget -qO- http://localhost:3000/health
{"status":"healthy"}
# If this works but external access fails, the problem is the port mapping or firewall.
# If this ALSO fails, the application itself is not listening — check its bind address.
# Step 3 — check what address the application is binding to:
docker exec payment-api ss -tlnp
State Recv-Q Send-Q Local Address:Port
LISTEN 0 128 127.0.0.1:3000 ← WRONG — only accepts local connections
# Application is bound to 127.0.0.1 — only accepts connections from inside the container.
# Port mapping forwards traffic from the host to the container's network interface,
# not to 127.0.0.1. Traffic arrives but the application ignores it.
# Fix — bind to 0.0.0.0 in the application:
# Node.js: app.listen(3000, '0.0.0.0')
# Python: uvicorn main:app --host 0.0.0.0 --port 8000
# Go: http.ListenAndServe(":8000", handler) ← colon prefix = 0.0.0.0
# After fix — correct bind address:
docker exec payment-api ss -tlnp
State Recv-Q Send-Q Local Address:Port
LISTEN 0 128 0.0.0.0:3000 ← CORRECT — accepts all connections
# Full diagnostic sequence — service unreachable:
# 1. Port mapped correctly?
docker port payment-api
3000/tcp -> 0.0.0.0:3000
# ✓ mapping exists
# 2. Reachable from inside the container?
docker exec payment-api wget -qO- http://localhost:3000/health
wget: can't connect to remote host (127.0.0.1): Connection refused
# ✗ not reachable
# 3. Application actually running?
docker exec payment-api ps aux
PID USER COMMAND
1 node node server.js
# ✓ process is running
# 4. What is it binding to?
docker exec payment-api ss -tlnp
(nothing — no listening sockets)
# The process is running but not yet listening — still starting up.
# Wait 5 seconds and retry:
docker exec payment-api ss -tlnp
LISTEN 127.0.0.1:3000
# Now listening — but on loopback only. Bind address is the bug.
# Root cause confirmed: application started with default localhost binding.
# Fix: update server.js to listen on 0.0.0.0.
# After fix: external requests reach the container successfully.
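Once the bind address is fixed, the final check runs from the host, outside Docker entirely. A sketch, assuming curl is installed on the host and the same 3000 mapping as above:

```shell
# From the HOST — exercises the full path: host port -> mapping -> 0.0.0.0 bind.
curl -fsS http://localhost:3000/health

# From another machine, substitute the host's address for localhost.
# If that still fails while the in-container check passes, suspect a host
# firewall (e.g. ufw rules or cloud security groups) rather than Docker.
```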
Failure 3 — Container Keeps Restarting
A container with restart: unless-stopped that keeps cycling through start, crash, restart is called a crash loop. The container comes up, fails, Docker restarts it, it fails again, with Docker doubling the back-off delay between attempts. The restart policy is doing its job, but masking the underlying failure. The critical fact: Docker keeps the same container (and the same log file) across these restarts, so docker logs still contains the output from the runs that crashed — the error is in there, near the end.
# Detect a crash loop:
docker ps
CONTAINER ID NAME STATUS
a1b2c3d4e5f6 payment-api Restarting (1) 3 seconds ago
docker inspect payment-api --format '{{.RestartCount}}'
7
# Restarted 7 times — this is a crash loop.
# Read the logs — Docker retains the same container and log file across
# restart-policy restarts, so the crashed run's output is still present:
docker logs --tail 50 payment-api
Error: connect ECONNREFUSED 127.0.0.1:5432
at TCPConnectWrap.afterConnect
# Application cannot connect to Postgres on 127.0.0.1:5432.
# Inside a container, 127.0.0.1 is the container itself — not the database container.
# The DB_HOST environment variable is wrong.
# Check the environment variable that was set:
docker inspect payment-api \
--format '{{range .Config.Env}}{{println .}}{{end}}' | grep DB
DB_HOST=127.0.0.1
# Wrong — should be the Compose service name: DB_HOST=db
# Fix — remove the failing container, then re-run with the corrected variable:
docker rm -f payment-api
docker run -d \
--name payment-api \
-e DB_HOST=db \
--network myapp_backend \
payment-api:v1.2.0
docker ps
CONTAINER ID NAME STATUS
b2c3d4e5f6a7 payment-api Up 12 seconds ← stable
# Common crash loop causes and their log signatures:
# 1. Wrong DB host:
Error: connect ECONNREFUSED 127.0.0.1:5432
# Fix: DB_HOST should be the service name (e.g. "db"), not 127.0.0.1
# 2. Missing environment variable:
Error: DATABASE_URL is not defined
# Fix: add the missing -e flag or env_file entry
# 3. OOM kill (exit code 137):
docker inspect payment-api --format '{{.State.OOMKilled}}'
true
# Fix: increase --memory limit or reduce application memory usage
# 4. Port already in use:
Error: listen EADDRINUSE :::3000
# Fix: stop whatever is already using port 3000 on the host,
# or map to a different host port: -p 3001:3000
# 5. Permission denied on mounted volume:
Error: EACCES: permission denied, open '/data/config.json'
# Fix: check volume ownership — chown to match the container's USER uid
What just happened?
docker logs exposed the exact error from the crashed runs: because a restart policy restarts the same container rather than creating a new one, its log file survives each crash, and the failure message was still there after seven restarts. The restart count of 7 confirmed a crash loop. The error message pointed directly to the wrong DB_HOST value: 127.0.0.1 inside a container resolves to the container itself, not the database service. Correcting the environment variable to the Compose service name resolved the crash loop immediately. Knowing that container logs survive restarts is the single most important fact for diagnosing crash loops.
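To confirm a crash-loop fix actually took hold, watch the restart counter rather than eyeballing docker ps. A sketch against a live daemon; the interval and repetition count are arbitrary choices:

```shell
# If RestartCount stops climbing and Status stays "running",
# the crash loop is broken.
for attempt in 1 2 3; do
  docker inspect payment-api \
    --format 'restarts={{.RestartCount}} status={{.State.Status}}'
  sleep 20
done
```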
Failure 4 — Networking Between Containers
Two containers are running but cannot reach each other by service name. DNS resolution fails. This almost always means the containers are on different networks — either they were started without a shared network, or the Compose file has network configuration errors. The diagnostic tool is docker network inspect.
# Step 1 — list all networks:
docker network ls
NETWORK ID NAME DRIVER
a1b2c3d4e5f6 bridge bridge ← default — no DNS between containers
b2c3d4e5f6a7 myapp_backend bridge ← Compose-created — has DNS
c3d4e5f6a7b8 myapp_frontend bridge
# Step 2 — check which containers are on which network:
docker network inspect myapp_backend \
--format '{{range .Containers}}{{.Name}}: {{.IPv4Address}}{{"\n"}}{{end}}'
payment-api: 172.18.0.3/16
postgres-db: 172.18.0.2/16
# Both containers are on the same network — DNS should work.
# Step 3 — test DNS resolution from inside the container:
docker exec payment-api nslookup postgres-db
nslookup: can't resolve 'postgres-db'
# DNS fails even though both containers are on the same network.
# Step 4 — inspect the container's network config:
docker inspect payment-api \
--format '{{json .NetworkSettings.Networks}}' | python3 -m json.tool
{
"bridge": { ← connected to the DEFAULT bridge, not myapp_backend
"IPAddress": "172.17.0.2"
}
}
# The container was started with docker run without --network myapp_backend.
# It joined the default bridge which has no DNS.
# Fix — connect the container to the correct network:
docker network connect myapp_backend payment-api
# Verify DNS now works:
docker exec payment-api nslookup postgres-db
Name: postgres-db
Address: 172.18.0.2 ✓
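The repair above (docker network connect) fixes an already-running container. To avoid the default-bridge trap in the first place, create the user-defined network before starting anything. A sketch: the postgres:16 tag is an assumption, the other names match this lesson's example:

```shell
# User-defined bridge networks provide DNS between containers;
# the default "bridge" network does not.
docker network create myapp_backend
docker run -d --name postgres-db --network myapp_backend postgres:16
docker run -d --name payment-api --network myapp_backend \
  -e DB_HOST=postgres-db payment-api:v1.2.0
```

Compose creates and attaches such a network automatically, which is why this problem appears mostly with hand-run containers.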
Failure 5 — Volume and Permission Problems
A container starts successfully but fails to read or write files on a mounted volume. The application reports Permission denied. This happens because the host directory or named volume was created with different ownership than the user the container process runs as. The container user's UID must match the ownership of the mounted path.
# Diagnose the permission problem:
# Step 1 — check what user the container process runs as:
docker exec payment-api id
uid=1001(appuser) gid=1001(appgroup) groups=1001(appgroup)
# Step 2 — check the ownership of the mounted path inside the container:
docker exec payment-api ls -la /data
drwxr-xr-x 2 root root 4096 Jan 15 02:00 .
# Owned by root:root — appuser (uid 1001) cannot write here.
# Step 3 — check the host directory ownership:
ls -la ./data/
drwxr-xr-x 2 root root 4096 Jan 15 02:00 ./
# Created as root on the host — mounted into the container as root-owned.
# Fix Option A — chown the host directory to match the container user's UID:
sudo chown -R 1001:1001 ./data/
# The container's appuser (uid 1001) can now write to the mounted path.
# Fix Option B — chown inside the Dockerfile BEFORE switching USER:
RUN mkdir -p /data && chown -R appuser:appgroup /data
# Creates the directory with correct ownership before the container starts.
# An empty named volume is seeded on first mount from the image's content
# at that path — including that content's ownership.
# Fix Option C — for named volumes, initialise ownership with a one-off container:
docker run --rm \
-v myapp_data:/data \
alpine chown -R 1001:1001 /data
# Runs alpine, mounts the named volume, fixes ownership, container exits.
# All subsequent containers mounting this volume see correct ownership.
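A fourth option, for bind mounts, is to sidestep the mismatch entirely by running the container as the host user, so ownership matches by construction. A sketch; whether the image tolerates running as an arbitrary UID is an assumption that must be checked:

```shell
# Files the process creates under /data are owned by the host user,
# and host-owned files are writable by the process.
docker run -d \
  --name payment-api \
  --user "$(id -u):$(id -g)" \
  -v "$PWD/data:/data" \
  payment-api:v1.2.0
```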
The Complete Troubleshooting Reference
Symptom → first command → what to look for
Container missing from docker ps → docker ps -a → exit code in the STATUS column
Container exited with an error → docker logs <id> → error message on the last lines
Crash loop (Restarting status) → docker logs --tail 50 <name> → error from the crashed run (logs persist across restarts)
Unreachable from outside → docker port <name> → mapping exists and shows 0.0.0.0
Reachable inside only → docker exec <name> ss -tlnp → bind address — must be 0.0.0.0, not 127.0.0.1
Containers can't resolve each other → docker network inspect <net> → both containers listed under Containers
Exited (137) with empty logs → docker inspect <name> | grep OOM → OOMKilled: true → increase memory limit
Permission denied on a volume → docker exec <name> ls -la /mount → ownership UID must match container user UID
Builds slow, cache never hits → review the Dockerfile → COPY . . placed before dependency install
Host running out of disk → docker system df → images, containers, volumes consuming space
Disk Space — Cleaning Up Docker Artefacts
Docker accumulates images, stopped containers, unused volumes, and dangling build cache over time. On a developer machine or a CI server running many builds, this can exhaust disk space within days. The docker system commands give visibility and control over all of it.
# See how much disk Docker is using — broken down by type:
docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 23 4 14.7GB 11.2GB (76%)
Containers 12 3 2.1MB 1.8MB (86%)
Local Volumes 8 3 4.2GB 2.1GB (50%)
Build Cache - - 3.8GB 3.8GB
# Remove stopped containers, unused networks, dangling images, build cache:
docker system prune
WARNING! This will remove:
- all stopped containers
- all networks not used by at least one container
- all dangling images
- all dangling build cache
Are you sure you want to continue? [y/N] y
Total reclaimed space: 11.4GB
# More aggressive — also removes unused images (not just dangling):
docker system prune -a
# Use with care on a production server — this removes ALL images not currently
# used by a running container. The next deployment will need to pull them again.
# Targeted cleanup — more surgical:
docker container prune # stopped containers only
docker image prune # dangling images only (untagged)
docker image prune -a # all unused images
docker volume prune # volumes not attached to any container
docker builder prune # build cache only
# DANGER — remove everything not in use by a RUNNING container, volumes included:
docker system prune -a --volumes --force
# Only run this if you intend to start completely from scratch on this host.
docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 23 4 14.7GB 11.2GB (76%)
Containers 12 3 2.1MB 1.8MB (86%)
Local Volumes 8 3 4.2GB 2.1GB (50%)
Build Cache - - 3.8GB 3.8GB
# 76% of image storage is reclaimable — old builds, previous versions, test images.
# 3.8 GB build cache — accumulated over weeks of development.
docker system prune
Deleted Containers:
a1b2c3d4e5f6
b2c3d4e5f6a7
Deleted Images:
sha256:3a7f... payment-api:v0.9.1
sha256:8b1c... payment-api:v1.0.0
sha256:c9d1... payment-api:v1.1.0
← 19 old images removed
Deleted build cache objects: 847 entries
Total reclaimed space: 11.4GB
# 11.4 GB freed. Build cache cleared. Disk pressure resolved.
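For routine automated cleanup on a CI server, the until filter keeps recent artefacts while clearing old ones. A sketch: the 72-hour window is an arbitrary assumption, tune it to your rebuild and re-pull cost:

```shell
# Remove only artefacts older than 72 hours; recent builds stay cached.
docker container prune --force --filter "until=72h"
docker image prune --all --force --filter "until=72h"
docker builder prune --force --filter "until=72h"
```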
Never Run docker system prune -a on a Production Server Without Checking
docker system prune -a removes all images not currently in use by a running container — including the image for the service you're about to roll back to. On a production server, always run docker system df first to understand what will be removed, and prefer targeted commands (docker image prune, docker builder prune) over the nuclear option. The build cache is usually the largest reclaimable category and the safest to clear.
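A quick pre-prune check: prune -a keeps only the images backing running containers, so listing those shows exactly what is protected. A sketch:

```shell
# Images that prune -a will KEEP — those in use by running containers:
docker ps --format '{{.Image}}' | sort -u

# Everything present on the host, for comparison; anything in this list
# but not the one above is a candidate for removal:
docker images --format '{{.Repository}}:{{.Tag}}'
```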
Teacher's Note
Keep this troubleshooting reference bookmarked — not because the commands are hard to remember, but because when something is broken at 2 AM the last thing you want to do is think about which command comes next. The sequence is always the same: is the container running (docker ps -a), what did it say before it failed (docker logs --tail), what is its current internal state (docker exec + docker inspect), and what does the network look like (docker network inspect). Start at the top of that list every time and work down — you will almost always find the cause before you reach the bottom.
Practice Questions
1. A container in a crash loop has restarted six times. You need to read the error from the run that just crashed — not merely see that the container is restarting. Which command retrieves it, and why does the crashed run's output still appear there?
2. A container is not visible in docker ps. Which command shows all containers including those that have exited — so you can find the container and read its exit code?
3. A CI server is running low on disk space. Before running any cleanup command, which Docker command shows a breakdown of how much space is consumed by images, containers, volumes, and build cache?
Quiz
1. A container shows Exited (137) in docker ps -a. The logs show no error message. What happened and what is the fix?
2. A container has a correct port mapping (0.0.0.0:3000->3000/tcp). The app is reachable from inside the container with wget localhost:3000 but not from outside. ss -tlnp shows 127.0.0.1:3000. What is the cause and fix?
3. Two containers are running. docker network inspect myapp_backend shows both in the Containers list. But nslookup db from inside the API container returns "can't resolve". What is the most likely cause?
Up Next · Lesson 43
Docker with CI/CD
Troubleshooting covered — now the automation layer: how Docker plugs into a CI/CD pipeline. Build, test, scan, push, and deploy — all triggered by a git push, all reproducible, all without a human touching a server.