Docker Lesson 35 – Logging & Monitoring | Dataplexa
Section III · Lesson 35

Logging & Monitoring

A payment service went down at 2:47 AM on a Friday. By the time the on-call engineer connected, the container had already been replaced three times, and each replacement discarded the previous container's logs. The crash cause was gone. The team spent four hours reproducing the failure in staging. The root cause turned out to be a single malformed JSON payload hitting an unhandled exception. It would have taken thirty seconds to find in the logs, if the logs had been preserved anywhere other than alongside a container that kept being thrown away.

This lesson covers how Docker logging works, why the default setup loses data at the worst moment, and how to build a logging and monitoring setup that gives you the full picture before, during, and after an incident. It works through concrete commands, real driver configurations, and the docker stats patterns that tend to show up before a failure.

The Default Setup vs A Production Setup

Default Docker logging

  • Logs stored on the host as JSON files, one set per container
  • No size limit — logs grow until the disk fills up
  • Logs disappear when the container is removed
  • Restart wipes nothing — but docker rm loses everything
  • No centralised view across multiple containers
  • No alerting — problems discovered by users, not engineers

Production logging setup

  • Logs shipped to a persistent external destination
  • Rotation configured — disk usage is capped and predictable
  • Logs survive container restarts and removal
  • Centralised view across all containers and hosts
  • Structured JSON logs — searchable and filterable
  • Alerting on error rates, memory thresholds, and crash loops

How Docker Logging Works

Everything a container writes to stdout and stderr is captured by the Docker Daemon and handled by a logging driver. The driver determines where the output goes — a local JSON file, a syslog server, an external aggregator, or nowhere. The default driver is json-file, which writes to a file on the host at /var/lib/docker/containers/<id>/<id>-json.log. The application itself does not need to know anything about the logging destination — it just writes to stdout.
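
To make the json-file format concrete: each line in that host file is a JSON object with log, stream, and time fields. The sketch below parses two such lines in Node.js. The sample lines and the parseJsonFileLine helper are illustrative assumptions, not output taken from a real host.

```javascript
// Hypothetical sample of two lines as the json-file driver would store them.
const sampleLines = [
  '{"log":"{\\"level\\":\\"info\\",\\"msg\\":\\"POST /payment 200\\"}\\n","stream":"stdout","time":"2024-01-15T02:44:11.000000000Z"}',
  '{"log":"DB connection timeout\\n","stream":"stderr","time":"2024-01-15T02:44:31.000000000Z"}',
];

// Parse one line written by the json-file driver into a flat record.
function parseJsonFileLine(line) {
  const entry = JSON.parse(line);
  return {
    time: entry.time,
    stream: entry.stream,
    // The driver stores the raw output including its trailing newline.
    message: entry.log.replace(/\n$/, ""),
  };
}

const records = sampleLines.map(parseJsonFileLine);
for (const r of records) {
  console.log(`${r.time} [${r.stream}] ${r.message}`);
}
```

The application's own output sits untouched inside the log field, which is why a structured JSON line from the application comes back out of docker logs exactly as it was written.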

# The basic command — stream logs from a running container
docker logs payment-api
# Outputs everything the container has written to stdout and stderr
# since it started. Follows the same order the process wrote it.

# Follow logs in real time (like tail -f):
docker logs -f payment-api

# Show only the last N lines:
docker logs --tail 100 payment-api

# Filter by time — logs since a specific point:
docker logs --since 2h payment-api
docker logs --since 2024-01-15T02:00:00 payment-api

# Include timestamps (Docker adds them, the app doesn't need to):
docker logs -t payment-api

# Combine flags — last 50 lines with timestamps, live follow:
docker logs -f -t --tail 50 payment-api
docker logs -t --tail 5 payment-api

2024-01-15T02:44:11Z {"level":"info","msg":"POST /payment 200","ms":42}
2024-01-15T02:44:18Z {"level":"info","msg":"POST /payment 200","ms":38}
2024-01-15T02:44:31Z {"level":"error","msg":"DB connection timeout","retry":1}
2024-01-15T02:44:32Z {"level":"error","msg":"DB connection timeout","retry":2}
2024-01-15T02:44:33Z {"level":"fatal","msg":"max retries exceeded — exiting"}

# The container then restarted. These five lines explain exactly what happened.
# A restart keeps the same container, so they are still readable afterwards.
# If the container is removed and recreated instead, they are gone unless
# they were shipped somewhere external.

What just happened?

Docker captured every line the process wrote to stdout and stderr, added timestamps, and made them available via docker logs. The application just called console.log() — Docker handled the rest. The --since and --tail flags let you zero in on the exact window of a failure without scrolling through thousands of lines. This is the first place to look the moment something goes wrong.
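
If the output of docker logs -t has been saved to a file, the same time-window filtering can be done after the fact. The following is a rough sketch of that idea in Node.js; parseTimestampedLine and inWindow are hypothetical helper names, and the sample lines mirror the output shown above.

```javascript
// Split a `docker logs -t` line into its timestamp and the original payload.
function parseTimestampedLine(line) {
  const space = line.indexOf(" ");
  return { time: new Date(line.slice(0, space)), payload: line.slice(space + 1) };
}

// Keep only lines whose timestamp falls inside [from, to].
function inWindow(lines, from, to) {
  const start = new Date(from);
  const end = new Date(to);
  return lines
    .map(parseTimestampedLine)
    .filter((l) => l.time >= start && l.time <= end);
}

const sample = [
  '2024-01-15T02:44:18Z {"level":"info","msg":"POST /payment 200","ms":38}',
  '2024-01-15T02:44:31Z {"level":"error","msg":"DB connection timeout","retry":1}',
  '2024-01-15T02:44:33Z {"level":"fatal","msg":"max retries exceeded"}',
];

const hits = inWindow(sample, "2024-01-15T02:44:30Z", "2024-01-15T02:44:40Z");
for (const h of hits) {
  console.log(h.time.toISOString(), JSON.parse(h.payload).level);
}
```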

Log Rotation — Preventing Disk Exhaustion

The default json-file driver has no size limit. A busy service writing hundreds of lines per second will fill a disk in hours. The fix is two options on the logging driver: max-size caps how large a single log file can grow, and max-file controls how many rotated files are kept before the oldest is deleted. Set these on every container.

# Set log rotation on a single container:
docker run -d \
  --name payment-api \
  --log-driver json-file \
  --log-opt max-size=10m \
  --log-opt max-file=5 \
  -p 3000:3000 \
  payment-api:v1.2.0
# max-size=10m  → each log file is capped at 10 MB
#                 when the file hits 10 MB it is rotated
# max-file=5   → keep at most 5 log files (the active one plus 4 rotated)
#                 total maximum disk usage: 10m × 5 = 50 MB per container
#                 the oldest file is deleted when a 6th would be created
# Set log rotation globally for all containers on the host.
# Edit /etc/docker/daemon.json (create it if it doesn't exist):
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5"
  }
}
# Restart the Docker Daemon to apply:
sudo systemctl restart docker
# All new containers on this host now inherit these defaults automatically.
# Existing containers are unaffected — they must be recreated to pick up the change.
# Confirm log rotation is configured on a running container:
docker inspect payment-api | grep -A6 '"LogConfig"'
"LogConfig": {
  "Type": "json-file",
  "Config": {
    "max-file": "5",
    "max-size": "10m"
  }
}

# View the physical log files on the host:
ls -lh /var/lib/docker/containers/a1b2c3d4*/
-rw-r----- 1 root root 7.2M a1b2c3d4-json.log
-rw-r----- 1 root root 10M  a1b2c3d4-json.log.1
-rw-r----- 1 root root 10M  a1b2c3d4-json.log.2
# Three files: the active log plus two full rotated files. Rotation is working.
# Total disk usage: 27.2 MB, well under the 50 MB cap.

What just happened?

Log rotation is now enforced. No matter how much traffic the service sees, its logs will never consume more than 50 MB on disk. When the container is busy, old log files are silently deleted to make room for new ones. The most recent logs are always available via docker logs. Set this globally in daemon.json once and every container on the host inherits it automatically.
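
The cap is simple arithmetic: max-size multiplied by max-file, summed over every container on the host. As a rough sketch, the Node.js snippet below works it out for a hypothetical host running three containers. The container list and both helper names are assumptions made for illustration, and the result is an approximate upper bound rather than an exact figure.

```javascript
// Convert a Docker size string such as "10m" or "1g" into bytes.
// Assumption: only the k/m/g suffixes (binary multiples) are handled here.
function sizeToBytes(size) {
  const match = /^(\d+)([kmg])$/i.exec(size);
  if (!match) throw new Error(`unsupported size: ${size}`);
  const units = { k: 1024, m: 1024 ** 2, g: 1024 ** 3 };
  return Number(match[1]) * units[match[2].toLowerCase()];
}

// Approximate upper bound on json-file disk usage for a set of containers.
function logDiskCapBytes(containers) {
  return containers.reduce(
    (total, c) => total + sizeToBytes(c.maxSize) * c.maxFile,
    0
  );
}

// Hypothetical host with three containers and their rotation settings.
const containers = [
  { name: "api", maxSize: "10m", maxFile: 5 },
  { name: "db", maxSize: "20m", maxFile: 3 },
  { name: "redis", maxSize: "5m", maxFile: 3 },
];

const capMiB = logDiskCapBytes(containers) / 1024 ** 2;
console.log(`worst-case log usage: ${capMiB} MiB`);
```

Running the same calculation before choosing values makes it easier to pick a max-size and max-file that fit the disk you actually have.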

Logging Drivers — Sending Logs Off the Host

For production, logs need to survive beyond the host. The json-file driver writes to the local disk — if the host goes down, the logs go with it. Docker supports a range of drivers that ship logs directly to external systems. The two most commonly used in production are syslog for traditional infrastructure and awslogs for AWS deployments. Switching the driver requires no changes to the application — it still just writes to stdout.

# Send logs to a remote syslog server:
docker run -d \
  --name payment-api \
  --log-driver syslog \
  --log-opt syslog-address=tcp://logs.acmecorp.internal:514 \
  --log-opt syslog-facility=daemon \
  --log-opt tag="payment-api/{{.ID}}" \
  -p 3000:3000 \
  payment-api:v1.2.0
# syslog-address  → the remote syslog server — TCP for reliability over UDP
# syslog-facility → categorises the log source for the syslog server
# tag             → identifies the container in the syslog stream
#                   {{.ID}} is replaced with the container ID at runtime

# Send logs to AWS CloudWatch Logs:
docker run -d \
  --name payment-api \
  --log-driver awslogs \
  --log-opt awslogs-region=ap-south-1 \
  --log-opt awslogs-group=/acmecorp/production/payment-api \
  --log-opt awslogs-stream=payment-api-1 \
  -p 3000:3000 \
  payment-api:v1.2.0
# awslogs-group   → the CloudWatch log group; it must already exist unless you
#                   also pass --log-opt awslogs-create-group=true
# awslogs-stream  → the stream within that group, typically one per container

Docker logging drivers — when to use each

json-file: Default. Local file storage. Always set max-size and max-file. Good for development and single-host setups.
syslog: Ships to a remote syslog server. Good for teams already running centralised syslog infrastructure.
awslogs: Ships to AWS CloudWatch Logs. The standard choice for containers running on EC2, ECS, or Docker on AWS.
none: Discards all logs. Only for containers that must produce zero output. Never use in production services.

Logging in Docker Compose

Logging configuration belongs in the Compose file alongside everything else — declared once, applied consistently on every deployment. No manual flags required when starting services.

version: "3.8"

services:
  api:
    image: payment-api:v1.2.0
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"
    ports:
      - "3000:3000"

  db:
    image: postgres:15-alpine
    logging:
      driver: json-file
      options:
        max-size: "20m"
        # Databases can be chattier — give them more room before rotation
        max-file: "3"
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    logging:
      driver: json-file
      options:
        max-size: "5m"
        max-file: "3"

volumes:
  pgdata:

Structured Logging — Making Logs Searchable

Plain text logs are readable but unsearchable at scale. Structured logs — JSON written to stdout — can be filtered, aggregated, and alerted on without regex. The application produces them; Docker captures them verbatim. At scale, this is the difference between finding a bug in 30 seconds and spending an hour grepping.

// Plain text — readable but impossible to query at scale:
console.log(`POST /payment 200 42ms user=u_8821 order=ord_4492`);

// Structured JSON — every field is queryable:
console.log(JSON.stringify({
  level:    "info",
  msg:      "payment processed",
  method:   "POST",
  path:     "/payment",
  status:   200,
  ms:       42,
  userId:   "u_8821",
  orderId:  "ord_4492",
  ts:       new Date().toISOString()
}));
// Docker captures this line to stdout exactly as written.
// CloudWatch, Splunk, Datadog, Loki — any aggregator can now:
// • Filter all errors: level="error"
// • Measure average latency: avg(ms) where path="/payment"
// • Alert on failures: count(status=500) > 10 in 1 minute
// None of this is possible with plain text.
docker logs payment-api | tail -4

{"level":"info","msg":"payment processed","method":"POST","path":"/payment","status":200,"ms":42,"userId":"u_8821","orderId":"ord_4492","ts":"2024-01-15T02:44:11Z"}
{"level":"info","msg":"payment processed","method":"POST","path":"/payment","status":200,"ms":38,"userId":"u_9104","orderId":"ord_4493","ts":"2024-01-15T02:44:18Z"}
{"level":"error","msg":"DB connection timeout","retry":1,"ts":"2024-01-15T02:44:31Z"}
{"level":"fatal","msg":"max retries exceeded — exiting","ts":"2024-01-15T02:44:33Z"}

# In a log aggregator, finding all errors in the last 24h:
# level="error" OR level="fatal"  →  2 results, exact timestamps, full context.
# Without structured logs, the same query would be: grep -i "error\|fatal\|exception\|fail"
# — returning thousands of false positives and missing fields.
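
To show how little code this takes, here is a minimal sketch of a structured logger in Node.js, followed by the kind of field-based query an aggregator would run. The formatLog and log helpers are hypothetical names, not part of any library, and the in-memory filter only stands in for a real aggregator's query language.

```javascript
// Minimal structured logger: one JSON object per line on stdout.
function formatLog(level, msg, fields = {}) {
  return JSON.stringify({ level, msg, ...fields, ts: new Date().toISOString() });
}

function log(level, msg, fields) {
  console.log(formatLog(level, msg, fields));
}

log("info", "payment processed", { status: 200, ms: 42 });
log("error", "DB connection timeout", { retry: 1 });

// The same kind of lines, parsed back and queried by field.
const lines = [
  formatLog("info", "payment processed", { path: "/payment", status: 200, ms: 42 }),
  formatLog("info", "payment processed", { path: "/payment", status: 200, ms: 38 }),
  formatLog("error", "DB connection timeout", { retry: 1 }),
];
const entries = lines.map((l) => JSON.parse(l));

const errors = entries.filter((e) => e.level === "error" || e.level === "fatal");
const timed = entries.filter((e) => e.path === "/payment");
const avgMs = timed.reduce((sum, e) => sum + e.ms, 0) / timed.length;

console.log(`errors=${errors.length} avgMs=${avgMs}`);
```

In a real service an established logging library would normally replace these helpers. The point here is only that every field stays individually addressable.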

Monitoring with docker stats

Logs tell you what happened. docker stats tells you what is happening right now and what is about to happen. The patterns below are the early warning signs that appear in metrics minutes or hours before a container crashes or degrades — if you know what to look for.

# Real-time view of all containers:
docker stats

# Snapshot for scripts and alerting:
docker stats --no-stream

# Custom format — output only what matters:
docker stats --no-stream \
  --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}"

# Check a specific container:
docker stats payment-api --no-stream

# Sample output of the custom-format command above, covering all containers:
NAME          CPU %   MEM USAGE / LIMIT    MEM %   NET I/O
payment-api   3.2%    201MiB / 512MiB      39.3%   14.5MB / 8.2MB
postgres-db   1.1%    318MiB / 1024MiB     31.1%   2.1MB / 1.4MB
redis         0.1%    8MiB / 128MiB         6.3%   980kB / 720kB

# Pattern: memory creeping up steadily over time → memory leak
# Healthy:   39% → 40% → 39% → 41%  (flat, bouncing around a baseline)
# Leaking:   39% → 47% → 56% → 64%  (rising without request spike — leak)

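# Pattern: CPU sustained at its limit → throttling, the service is degraded
# (CPU % is relative to one core; this example assumes a limit of --cpus=1.5, so the ceiling is 150%)
# Healthy:   3% → 8% → 4% → 6%   (spiky, returns to baseline)
# Throttled: 149% → 150% → 150%  (pegged at ceiling, requests queuing up)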

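# Pattern: NET I/O growing much faster than request traffic → possible data exfiltration or a runaway client
# (NET I/O is a cumulative counter, so compare successive snapshots rather than a single reading)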

What just happened?

The patterns in docker stats output are the most actionable signal available without external monitoring infrastructure. A memory percentage that rises steadily without traffic growth points to a leak, and catching it at 64% leaves time for a graceful restart before an OOM kill. CPU pegged at its limit means requests are already queuing and customers are experiencing slow responses right now. These signals are only meaningful because resource limits are set; without a memory limit, MEM % is measured against total host memory and says little about the container itself.
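
The memory pattern can be checked mechanically. Below is a rough sketch in Node.js that parses one line of docker stats --no-stream --format '{{json .}}' output and applies a simple rising-trend heuristic to a series of MEM % samples. The looksLikeLeak rule and its 10-point threshold are assumptions chosen for illustration, not a Docker feature, and the sample JSON line is hand-written.

```javascript
// Parse one line of `docker stats --no-stream --format '{{json .}}'`.
// Percentages arrive as strings such as "39.30%"; parseFloat drops the sign.
function parseStatsLine(line) {
  const s = JSON.parse(line);
  return {
    name: s.Name,
    memPerc: parseFloat(s.MemPerc),
    cpuPerc: parseFloat(s.CPUPerc),
  };
}

// Heuristic (an assumption): flag a leak when every sample is higher than
// the previous one and the total rise is at least minRise percentage points.
function looksLikeLeak(samples, minRise = 10) {
  if (samples.length < 3) return false;
  const rising = samples.every((v, i) => i === 0 || v > samples[i - 1]);
  return rising && samples[samples.length - 1] - samples[0] >= minRise;
}

const sample = parseStatsLine(
  '{"Name":"payment-api","CPUPerc":"3.20%","MemPerc":"39.30%","MemUsage":"201MiB / 512MiB"}'
);
console.log(sample.name, sample.memPerc);

console.log(looksLikeLeak([39, 40, 39, 41])); // flat baseline
console.log(looksLikeLeak([39, 47, 56, 64])); // steady climb
```

A production alert would usually also take request volume into account, since memory that rises along with traffic is expected behaviour.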

Inspecting a Crashed Container

When a container crashes and is restarted by its restart policy, it is still the same container, so docker logs shows the output of every run in one stream, including the run that crashed. The task during an incident is to find where the crashed run ends, and the container's start time gives you that boundary.

# Check how many times the container has restarted:
docker inspect payment-api | grep -i restartcount
"RestartCount": 3,
# This container has crashed and been restarted 3 times.

# View the logs written BEFORE the current run started, i.e. the crash log:
docker logs \
  --until "$(docker inspect --format '{{.State.StartedAt}}' payment-api)" \
  payment-api 2>&1 | tail -n 20
# .State.StartedAt is the start time of the current run.
# --until cuts the output off at that point, so the last lines shown are
# the ones written just before the fatal exit.
# (--previous is a kubectl logs flag; docker logs does not have it.)

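# Check the exit code of the last crash. These State fields describe the most
# recent exit, so read them while the container is stopped; once it is running
# again they may have been reset:
docker inspect payment-api \
  --format '{{.State.ExitCode}} oom={{.State.OOMKilled}}'
137 oom=true
# Exit code 137 = SIGKILL (128 + signal 9), most often the kernel OOM killer;
#                 OOMKilled=true confirms it
# Exit code 1   = application error
# Exit code 0   = clean shutdown
# Exit code 143 = SIGTERM (128 + signal 15), Docker asked it to stop gracefully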
docker logs --until "$(docker inspect --format '{{.State.StartedAt}}' payment-api)" payment-api 2>&1 | tail -n 4

{"level":"info","msg":"payment processed","status":200,"ms":41}
{"level":"info","msg":"payment processed","status":200,"ms":39}
{"level":"warn","msg":"memory pressure detected","heapUsed":"480MB","limit":"512MB"}
{"level":"error","msg":"heap allocation failed — out of memory"}
# Container killed by OOM. Exit code 137. No further output.

# Armed with this, the investigation is immediate:
# → Memory limit is 512 MB, heap was at 480 MB (93%) before the kill
# → The process was not leaking — it hit the limit under peak load
# → Fix: increase the memory limit, or reduce per-request memory allocation
# Time to root cause: 90 seconds.
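
Exit codes above 128 follow the shell convention of 128 plus the signal number, and that is easy to decode in a script. The following Node.js sketch shows one way to do it. The describeExitCode helper is a hypothetical name, the signal table covers only a few common Linux signals, and codes between 1 and 128 are all lumped together as application errors for simplicity.

```javascript
// Signal numbers for a few signals commonly seen in container exits (Linux).
const SIGNALS = { 2: "SIGINT", 9: "SIGKILL", 15: "SIGTERM" };

// Describe a container exit code using the 128 + signal convention.
function describeExitCode(code) {
  if (code === 0) return "clean exit";
  if (code > 128) {
    const signal = code - 128;
    return `terminated by ${SIGNALS[signal] ?? `signal ${signal}`}`;
  }
  return "application error";
}

for (const code of [0, 1, 137, 143]) {
  console.log(code, describeExitCode(code));
}
```

A SIGKILL on its own does not prove an OOM kill, since docker kill and a stop timeout produce the same code, so it is worth pairing this with the OOMKilled field from docker inspect.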

The Logging & Monitoring Checklist

Before any container goes to production

Log rotation configured: max-size and max-file set on every container or globally in daemon.json
Logs shipped off-host: syslog, awslogs, or a log collector ensures logs survive host failure
Structured JSON logging: application writes JSON to stdout for searchable, alertable log data
No secrets in logs: application code must never log tokens, passwords, or full request bodies
Resource limits set: docker stats percentages are only actionable when limits are enforced
Restart count monitored: a container restarting repeatedly signals a crash loop that needs immediate attention
Crash logs retrievable: engineers know to use docker logs --until with the container's StartedAt time to read the output of the run that crashed
Exit codes understood: 137 means SIGKILL (usually an OOM kill), 1 means app error, 0 means clean exit, 143 means graceful stop via SIGTERM
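
The "no secrets in logs" item is the one most easily enforced in code. As a minimal sketch, the Node.js snippet below masks sensitive fields before they are serialised. The SENSITIVE_KEYS list and the redact helper are assumptions for illustration, and a real service would need a list that matches its own payloads.

```javascript
// Hypothetical list of field names that must never reach the logs.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "cardNumber"]);

// Return a copy of `fields` with sensitive values masked, recursing into
// nested plain objects. Arrays and other values are passed through unchanged.
function redact(fields) {
  const out = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SENSITIVE_KEYS.has(key)) {
      out[key] = "[REDACTED]";
    } else if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      out[key] = redact(value);
    } else {
      out[key] = value;
    }
  }
  return out;
}

const safe = redact({
  userId: "u_8821",
  token: "abc123",
  card: { cardNumber: "4111111111111111", last4: "1111" },
});
console.log(JSON.stringify({ level: "info", msg: "payment processed", ...safe }));
```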

Teacher's Note

Start with two things: log rotation in daemon.json and structured JSON output from your application. Those two together mean logs are always bounded in size and always searchable when you need them. Add off-host shipping when you have more than one server. Knowing how to cut docker logs off at the restart boundary with --until, and knowing that exit code 137 means SIGKILL (usually an OOM kill), are the two pieces of knowledge that will save you the most time during your first 2 AM incident.

Practice Questions

1. A container has crashed and been restarted by its restart policy. You need to read only the logs written before the current run started. Which docker logs flag, combined with the container's StartedAt time, retrieves them?



2. You inspect a crashed container and see "ExitCode": 137. What caused this exit?



3. To prevent the default json-file logging driver from filling up the host disk, which log option caps how large a single log file can grow before it is rotated?



Quiz

1. You observe a container's MEM % in docker stats over 30 minutes: 39%, 47%, 56%, 64%. Traffic has not increased. What is happening and what does it predict?


2. A host is running ten containers with default Docker logging settings. The disk is at 94% capacity. What is the cause and the correct fix?


3. A team wants to be able to filter logs by HTTP status code, measure average response time, and alert when error rate exceeds 1% — all from their log aggregator. What logging approach makes this possible?


Up Next · Lesson 36

Docker in Dev vs Prod

Section III complete. Now the environments question that trips up almost every team: why does it work on my machine? Docker in development and Docker in production need different configurations — the same image, but very different runtime setups. Section IV shows you how to bridge that gap cleanly.