API Monitoring & Logging
Instrument a live API with structured logs, request tracing, uptime checks, and alerting so failures surface in seconds — not in a customer support ticket.
In October 2021, a faulty configuration change to Facebook's backbone routers withdrew the BGP routes to its DNS servers and took down Facebook, Instagram, WhatsApp, and Messenger simultaneously for roughly six hours. The engineers who needed to fix it could not even reach the internal tools to diagnose it, because those tools lived on the same network that was down. Published estimates put the lost revenue at around $60 million, and the cause was a configuration error, not a code bug.
The lesson everyone took from that incident was not "write better code." It was: know what your system is doing at all times, from somewhere outside the system. That is exactly what monitoring and logging give you. An API without observability is a black box. Something can go wrong inside it right now, and you would not know until a user tells you.
Monitoring & Logging — Concept Anatomy
The Three Pillars of Observability
"Observability" is not a tool — it is a property of a system. A system is observable if you can understand its internal state from its external outputs alone. For APIs, that external output comes in three forms: logs, metrics, and traces. Each answers a different question.
Logs — "What happened?"
Discrete, timestamped records of individual events. A request arrived, a database query ran, an error was thrown. Logs are the narrative of your system — precise but voluminous. Stripe logs every API call for 90 days.
Metrics — "How is it behaving over time?"
Numeric measurements aggregated over time — request rate, error rate, response latency, CPU usage. Metrics are cheap to store and fast to query. They power dashboards and alert thresholds.
Traces — "Where did the time go?"
A trace follows one request across every service it touches, recording how long each step took. When a response takes 800ms instead of 80ms, a trace tells you whether the delay was in your API, the database, or a third-party call.
Uptime Checks — "Is it reachable at all?"
External probes that ping your endpoints from outside your infrastructure every 30–60 seconds. If your server goes down, these catch it before any internal system does — because they see what your users see.
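As a rough illustration of an external probe, here is a minimal sketch. It assumes Node.js 18 or later (for the global fetch and AbortSignal.timeout), and the health URL, the timeout, and the 2000ms "degraded" threshold are illustrative assumptions rather than APIForge settings. A hosted uptime service does the same thing from several regions at once.

```javascript
// Classify one probe result. The thresholds are illustrative assumptions.
function classifyProbe(result, maxLatencyMs = 2000) {
  if (result.error) return "down";          // network error or timeout
  if (result.status >= 500) return "down";  // server answered, but is failing
  if (result.latencyMs > maxLatencyMs) return "degraded";
  return "up";
}

// Probe a URL once and report status plus round-trip time.
// Assumes Node.js 18+ (global fetch and AbortSignal.timeout).
async function probe(url, timeoutMs = 5000) {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { status: res.status, latencyMs: Date.now() - started };
  } catch (err) {
    return { error: err.message, latencyMs: Date.now() - started };
  }
}

// Hypothetical usage: run this from a machine OUTSIDE your own infrastructure.
// setInterval(async () => {
//   const result = await probe("https://api.example.com/health");
//   console.log(JSON.stringify({ ...result, state: classifyProbe(result) }));
// }, 30_000);
```

Keeping the classification separate from the network call makes the decision logic easy to unit test without a live endpoint.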
Observability Features in Practice
| Concept | What it does | APIForge use case |
|---|---|---|
| Structured Logging | Writes logs as JSON objects instead of plain text strings — machine-parseable and searchable | APIForge Backend logs every request with method, path, status, latency, and user ID |
| Log Levels | Categorise log entries as DEBUG, INFO, WARN, or ERROR — filter by severity in production | Production only ships INFO and above — DEBUG logs stay in dev to avoid noise |
| Request ID / Correlation ID | A unique ID stamped on every incoming request and threaded through all logs for that request | Support team pastes a request ID from a user complaint — DevOps finds every log line in seconds |
| RED Metrics | Rate (requests/sec), Errors (error rate %), Duration (response time p50/p95/p99) | Dashboard shows real-time RED for every APIForge endpoint — anomalies visible instantly |
| Alerting | Rules that fire a notification (Slack, PagerDuty, email) when a metric crosses a threshold | Alert fires if error rate on /projects exceeds 1% for 2 consecutive minutes |
| Distributed Tracing | Follows one request across multiple services, recording each span's duration and metadata | APIForge traces show whether latency spikes are in the API, Postgres, or Redis cache layer |
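The log-level row in the table above can be sketched as a small threshold filter. This is a minimal sketch, not the APIForge logger: LOG_LEVEL is an assumed environment variable name, and the numeric ranks are arbitrary as long as they preserve the ordering DEBUG < INFO < WARN < ERROR.

```javascript
// Numeric ranks make "this level and above" a simple comparison.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };

// True when an entry at `level` should be emitted for the given threshold.
// Unknown values fall back to INFO rather than silencing everything.
function shouldLog(level, threshold) {
  const min = LEVELS[threshold] ?? LEVELS.INFO;
  return (LEVELS[level] ?? LEVELS.INFO) >= min;
}

// LOG_LEVEL is an assumed environment variable name (hypothetical).
function createLogger(
  threshold = process.env.LOG_LEVEL ?? "INFO",
  write = (line) => process.stdout.write(line + "\n")
) {
  const log = (level) => (message, fields = {}) => {
    if (!shouldLog(level, threshold)) return;
    write(JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...fields }));
  };
  return { debug: log("DEBUG"), info: log("INFO"), warn: log("WARN"), error: log("ERROR") };
}

// Usage: with the threshold at INFO, the debug line is dropped.
const logger = createLogger("INFO");
logger.debug("cache lookup", { key: "projects:42" });     // not written
logger.info("project created", { project_id: "prj_1" });  // written as one JSON line
```

Setting the threshold from the environment lets the same build run with DEBUG in development and INFO in production without a code change.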
The Observability Pipeline
Raw observability data is worthless if it sits isolated on a server. The pipeline matters — getting data from your API into a place where it can be searched, visualised, and acted on. Here is how the APIForge DevOps team routes all three signal types.
Figure: APIForge observability pipeline. The API emits logs + metrics + traces → OpenTelemetry / Fluentd → Loki · Prometheus · Tempo → Grafana dashboards → Slack · PagerDuty.
The Loki + Prometheus + Tempo Stack
These three open-source tools form a popular self-hosted observability stack. Loki and Tempo come from Grafana Labs; Prometheus is an independent CNCF project that Grafana integrates with closely. Loki stores logs without indexing the full text, only a small set of labels, which keeps it cheap to run. Prometheus scrapes and stores time-series metrics. Tempo stores distributed traces. All three are queryable from a single Grafana dashboard. Cloud-managed alternatives include Datadog, New Relic, and AWS CloudWatch: same concepts, different pricing models.
Step 1 — Structured Logging in Your API
Plain-text logs look like this: GET /projects 200 48ms. They are readable to a human. But when you have ten thousand log lines and need to find every request that took more than 500ms last Tuesday from a specific user, a plain-text log is almost useless — you cannot query it without writing a regex nightmare.
Structured logs are JSON objects. Every field is a named key with a typed value. A log aggregator like Loki or Datadog can parse those fields and answer queries like { latency_ms > 500, user_id = "usr_9x2" } across millions of records without a hand-written regex.
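To make the difference concrete, here is a minimal sketch of the same query run in plain JavaScript over newline-delimited JSON. The sample entries and the queryLogs helper are invented for illustration; a real aggregator does this at scale with its own query language.

```javascript
// Parse newline-delimited JSON log lines and keep the entries that match.
function queryLogs(ndjson, predicate) {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .filter(predicate);
}

// Invented sample data: three requests, two of them slow.
const sample = [
  '{"request_id":"a1","user_id":"usr_9x2","path":"/projects","latency_ms":48}',
  '{"request_id":"b2","user_id":"usr_9x2","path":"/projects","latency_ms":730}',
  '{"request_id":"c3","user_id":"usr_4k7","path":"/projects","latency_ms":910}',
].join("\n");

// "Slow requests from one user" becomes a field comparison, not a regex.
const slowForUser = queryLogs(
  sample,
  (entry) => entry.latency_ms > 500 && entry.user_id === "usr_9x2"
);
console.log(slowForUser.map((entry) => entry.request_id)); // [ 'b2' ]
```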
What fields should every request log contain?
At minimum: timestamp (ISO 8601), level (INFO/WARN/ERROR), request_id (unique per request), method, path, status_code, latency_ms, user_id (if authenticated), ip. Add service and version if you run multiple deployments — they tell you which build produced the log line.
The APIForge Backend team uses a request logging middleware — a function that wraps every incoming request, captures timing and metadata, and writes a structured JSON log entry when the response is sent. Here is their implementation.
// WHAT: APIForge - Express.js structured logging middleware
// File: src/middleware/requestLogger.js
// Logs every request as a JSON object with timing and context
const { randomUUID } = require("crypto");

function requestLogger(req, res, next) {
  // Generate a unique ID for this request
  const requestId = randomUUID();
  const startTime = Date.now();

  // Attach the request ID to the request object
  // so downstream code can reference it in their own logs
  req.requestId = requestId;

  // Also expose it as a response header so clients can
  // include it in bug reports
  res.setHeader("X-Request-ID", requestId);

  // Hook into the response finish event: fires after
  // the response is fully sent to the client
  res.on("finish", () => {
    const latency = Date.now() - startTime;
    const logEntry = {
      timestamp: new Date().toISOString(),
      level: res.statusCode >= 500 ? "ERROR"
        : res.statusCode >= 400 ? "WARN"
        : "INFO",
      request_id: requestId,
      method: req.method,
      path: req.path,
      status_code: res.statusCode,
      latency_ms: latency,
      user_id: req.user?.id ?? null,
      ip: req.headers["x-forwarded-for"] ?? req.socket.remoteAddress,
      user_agent: req.headers["user-agent"] ?? null,
      service: "apiforge-backend",
      version: process.env.APP_VERSION ?? "unknown"
    };

    // Write JSON to stdout: the collector picks it up from there
    process.stdout.write(JSON.stringify(logEntry) + "\n");
  });

  next();
}

module.exports = requestLogger;

// Usage in app.js:
// const requestLogger = require("./middleware/requestLogger");
// app.use(requestLogger);

What just happened?
The middleware attaches a requestId to every request before it hits any route handler. When the response finishes, it calculates actual latency and writes one JSON line. The log level is derived automatically from the status code — 5xx becomes ERROR, 4xx becomes WARN, everything else is INFO. No manual level-setting needed.
Writing to stdout keeps the API decoupled from the logging infrastructure. A collector (Fluentd, Vector, or the platform's own agent) picks up stdout, parses the JSON, and ships it to Loki or Datadog. Swap the collector without touching application code.
Try this: Add req.requestId to your error handler so every thrown error log also contains the request ID. Now a single ID connects the access log entry to the error log entry for the same request.
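One way to approach that exercise is an Express error-handling middleware that reuses req.requestId. This is a sketch under assumptions: the field names error_name, error_message, and type are invented here, and the 500 response body is only one possible shape.

```javascript
// Build the structured entry separately so it can be tested without Express.
function buildErrorEntry(err, req) {
  return {
    timestamp: new Date().toISOString(),
    level: "ERROR",
    type: "unhandled_error",             // hypothetical field to tell it apart from access logs
    request_id: req.requestId ?? null,   // same ID the request logger attached
    method: req.method,
    path: req.path,
    error_name: err.name,
    error_message: err.message,
    stack: err.stack
  };
}

// Express treats a middleware with four arguments as an error handler.
function errorLogger(err, req, res, next) {
  process.stdout.write(JSON.stringify(buildErrorEntry(err, req)) + "\n");

  // If the response already started, let Express's default handler finish up.
  if (res.headersSent) return next(err);

  // Return the ID so the client can quote it in a bug report.
  res.status(500).json({
    error: "Internal Server Error",
    request_id: req.requestId ?? null
  });
}

// Usage in app.js, registered AFTER all routes:
// app.use(errorLogger);
```

Searching the log store for one request_id now returns both the access log line and the error line for that request.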
Step 2 — Exposing Metrics with the RED Method
The RED method (Rate, Errors, Duration) is a framework for choosing which metrics to instrument on every service. It was popularised by Tom Wilkie at Weaveworks and has become a common mental model across engineering teams. The idea is simple: if all three RED metrics are healthy, your service is almost certainly healthy.
Rate tells you how busy the service is (requests per second). Errors tells you how many of those are failing. Duration tells you how long requests take, tracked as percentiles, not averages. The p99 latency (the value that 99% of requests come in under, so only the slowest 1% exceed it) is often more actionable than the mean, because a degraded tail can indicate a resource exhaustion problem that the average masks completely.
Why p99, not average?
If 99 out of 100 requests take 10ms and one takes 5000ms, the average is about 60ms — which looks fine. But 1% of your users are waiting 5 seconds. At 1,000 requests per second, that is 10 people every second experiencing a broken-feeling API. Percentile metrics (p50, p95, p99) expose this; averages hide it. Stripe monitors p99 latency on every API endpoint as a core SLA metric.
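The arithmetic is easy to check in code. The sketch below uses an assumed sample with a slightly heavier tail than the example above (2% slow instead of 1%), because with exactly 1% slow requests a nearest-rank p99 sits right on the boundary. The percentile helper uses the nearest-rank definition; monitoring systems typically estimate percentiles from histograms instead.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of
// all samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumed sample: 1,000 requests, 980 at 10ms and 20 (2%) at 5000ms.
const latencies = [...Array(980).fill(10), ...Array(20).fill(5000)];

console.log(mean(latencies));            // 109.8
console.log(percentile(latencies, 50));  // 10
console.log(percentile(latencies, 95));  // 10
console.log(percentile(latencies, 99));  // 5000
```

The mean moves a little, p50 and p95 do not move at all, and only p99 shows that some users are waiting five seconds.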
The APIForge Backend team exposes a /metrics endpoint in Prometheus format. Prometheus scrapes it every 15 seconds. Here is the middleware that tracks the three RED signals.
// WHAT: APIForge - RED metrics middleware using prom-client
// npm install prom-client
// File: src/middleware/metrics.js
const promClient = require("prom-client");

// Collect default Node.js metrics (event loop lag, heap, etc.)
promClient.collectDefaultMetrics({ prefix: "apiforge_" });

// Rate: counter increments on every request
const httpRequestsTotal = new promClient.Counter({
  name: "apiforge_http_requests_total",
  help: "Total HTTP requests received",
  labelNames: ["method", "path", "status_code"]
});

// Errors: counter for 4xx and 5xx responses
const httpErrorsTotal = new promClient.Counter({
  name: "apiforge_http_errors_total",
  help: "Total HTTP error responses (4xx and 5xx)",
  labelNames: ["method", "path", "status_code"]
});

// Duration: histogram tracks response time distribution
const httpRequestDuration = new promClient.Histogram({
  name: "apiforge_http_request_duration_ms",
  help: "HTTP request latency in milliseconds",
  labelNames: ["method", "path", "status_code"],
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
});

function metricsMiddleware(req, res, next) {
  const start = Date.now();

  res.on("finish", () => {
    const duration = Date.now() - start;
    const labels = {
      method: req.method,
      // Prefer the route pattern (e.g. /projects/:id) over the raw URL
      // so each distinct ID does not create its own time series
      path: req.route?.path ?? req.path,
      status_code: res.statusCode
    };

    httpRequestsTotal.inc(labels);
    httpRequestDuration.observe(labels, duration);

    if (res.statusCode >= 400) {
      httpErrorsTotal.inc(labels);
    }
  });

  next();
}

// Expose /metrics endpoint for Prometheus to scrape
async function metricsEndpoint(req, res) {
  res.set("Content-Type", promClient.register.contentType);
  res.end(await promClient.register.metrics());
}

module.exports = { metricsMiddleware, metricsEndpoint };

// Usage in app.js:
// app.use(metricsMiddleware);
// app.get("/metrics", metricsEndpoint);

What just happened?
Three instruments track the three RED signals. The Counter for total requests measures rate over time (Prometheus computes the per-second rate using rate()). The error Counter is separate so you can calculate error percentage directly. The Histogram records every response time into pre-defined buckets, and the bucket counts let Prometheus estimate percentiles across any time window. The estimate is only as fine-grained as the bucket boundaries, so choose buckets around the latencies you care about.
Labels (method, path, status_code) mean one metric definition covers the entire API. A Grafana query like histogram_quantile(0.99, sum by (le, path) (rate(apiforge_http_request_duration_ms_bucket[5m]))) gives you the live p99 latency for each endpoint without additional code. The sum by (le, path) merges the series for different methods and status codes so there is one percentile per path.
Try this: After running a few requests, hit /metrics in your browser. You should see the raw Prometheus text format. The bucket counts show you exactly how request times are distributed — most should land in the low buckets.
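To see how a percentile comes out of bucket counts, here is a simplified sketch of the linear interpolation that Prometheus documents for histogram_quantile. The function name, the input shape, and the bucket counts are assumptions made for this illustration, and the real implementation handles more edge cases.

```javascript
// buckets: cumulative counts keyed by upper bound `le`, sorted ascending,
// with the last bucket being +Infinity (as in the /metrics output).
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;  // how many observations lie at or below the answer
  let prevBound = 0;       // lower edge of the current bucket
  let prevCount = 0;       // cumulative count below the current bucket

  for (const { le, count } of buckets) {
    if (count >= rank) {
      // The open-ended bucket has no upper edge to interpolate towards.
      if (le === Infinity) return prevBound;
      const inBucket = count - prevCount;
      if (inBucket === 0) return prevBound;
      // Assume observations are spread evenly inside the bucket.
      return prevBound + (le - prevBound) * ((rank - prevCount) / inBucket);
    }
    prevBound = le;
    prevCount = count;
  }
  return NaN;
}

// Invented counts: 900 requests at or under 50ms, 980 at or under 100ms, and so on.
const buckets = [
  { le: 50, count: 900 },
  { le: 100, count: 980 },
  { le: 250, count: 1000 },
  { le: 500, count: 1000 },
  { le: Infinity, count: 1000 }
];

// The 990th observation falls halfway through the 100-250ms bucket,
// so the estimate is about 175ms.
console.log(histogramQuantile(0.99, buckets));
```

The true p99 could be anywhere between 100ms and 250ms here, which is why bucket boundaries near your alert thresholds matter.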
Step 3 — Distributed Tracing with Correlation IDs
A metric can tell you that p99 latency spiked from 80ms to 900ms at 14:32. But it cannot tell you why. That is where tracing comes in. A trace reconstructs the exact path of one specific request — every function call, every database query, every external HTTP call — with the time spent at each step.
In a single-service API, a trace is straightforward. In a microservices architecture — where one user request might touch an auth service, a projects service, a notifications service, and a billing service — the correlation ID is the thread that ties it all together. Every service logs the same ID so the full journey is reconstructable.
// WHAT: APIForge - Request tracing with span timing
// Wraps individual operations inside a request with timing data
// File: src/utils/tracer.js
const { randomUUID } = require("crypto");

class Tracer {
  constructor(requestId) {
    this.requestId = requestId;
    this.traceId = randomUUID();
    this.spans = [];
    this.startTime = Date.now();
  }

  // Start timing a named operation
  startSpan(name) {
    const span = {
      name,
      traceId: this.traceId,
      requestId: this.requestId,
      startMs: Date.now()
    };
    this.spans.push(span);
    return span;
  }

  // End timing for a span and record its duration
  endSpan(span, metadata = {}) {
    span.durationMs = Date.now() - span.startMs;
    span.metadata = metadata;
    span.status = "ok";
  }

  // Mark a span as failed
  errorSpan(span, error) {
    span.durationMs = Date.now() - span.startMs;
    span.status = "error";
    span.error = error.message;
  }

  // Emit the full trace as a structured log
  flush() {
    const totalDuration = Date.now() - this.startTime;
    const traceLog = {
      timestamp: new Date().toISOString(),
      level: "INFO",
      type: "trace",
      trace_id: this.traceId,
      request_id: this.requestId,
      total_duration_ms: totalDuration,
      spans: this.spans
    };
    process.stdout.write(JSON.stringify(traceLog) + "\n");
  }
}

// Example usage inside a route handler.
// verifyJWT, db, and redis are assumed to be defined elsewhere in the app.
async function getProjectsHandler(req, res) {
  const tracer = new Tracer(req.requestId);

  // Span 1: Auth check
  const authSpan = tracer.startSpan("auth.verify_token");
  const user = await verifyJWT(req.headers.authorization);
  tracer.endSpan(authSpan, { userId: user.id });

  // Span 2: Database query
  const dbSpan = tracer.startSpan("db.query_projects");
  const projects = await db.query(
    "SELECT * FROM projects WHERE team_id = $1", [user.teamId]
  );
  tracer.endSpan(dbSpan, { rowCount: projects.rows.length });

  // Span 3: Cache write
  const cacheSpan = tracer.startSpan("cache.set_projects");
  await redis.setex(`projects:${user.teamId}`, 300, JSON.stringify(projects.rows));
  tracer.endSpan(cacheSpan, { ttlSeconds: 300 });

  tracer.flush();
  res.json({ data: projects.rows });
}

What just happened?
The trace output immediately answers the latency question that the metric raised. Total duration was 847ms — but drilling into the spans shows the database query consumed 821ms of that. Auth was 12ms, cache was 8ms. Without the trace, debugging would start with guessing. With it, the next step is obvious: check the query plan for a missing index on projects.team_id.
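The trace_id and request_id are both present. In a multi-service system, the trace_id has to be forwarded to each downstream service, typically in a request header, so that every service records its spans under the same ID. This Tracer generates a fresh trace_id per instance and does not do that forwarding itself. Once the ID is shared, the entire distributed call chain is reconstructable in Tempo or Jaeger.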
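Sharing IDs across services comes down to two small steps: reuse an inbound ID when a caller supplied one, and attach the IDs to every outgoing call. The sketch below shows both under stated assumptions. X-Trace-ID is a hypothetical header name chosen for this example, and the validation pattern is an arbitrary guard against oversized or malformed values. Production tracing libraries such as OpenTelemetry use the W3C Trace Context traceparent header for the same purpose.

```javascript
const { randomUUID } = require("crypto");

// Accept only short, plain IDs from callers (an illustrative guard).
const SAFE_ID = /^[A-Za-z0-9-]{8,64}$/;

// Reuse the caller's request ID when it looks sane, otherwise start a new one.
// Node lower-cases incoming header names, hence "x-request-id".
function resolveRequestId(headers) {
  const inbound = headers["x-request-id"];
  return typeof inbound === "string" && SAFE_ID.test(inbound)
    ? inbound
    : randomUUID();
}

// Headers to attach to calls this service makes to other services.
// "X-Trace-ID" is a hypothetical name, not a standard header.
function downstreamHeaders(req, traceId) {
  return {
    "X-Request-ID": req.requestId,
    "X-Trace-ID": traceId
  };
}

// Hypothetical usage inside a handler (Node.js 18+ for global fetch):
// const res = await fetch("https://billing.internal.example/invoices", {
//   headers: downstreamHeaders(req, tracer.traceId)
// });
```

The receiving service runs the same resolveRequestId logic, so one ID follows the request through every hop.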
Try this: Add a Redis cache check before the database span. If the cache hits, the database span never runs and total duration drops to ~20ms. Then look at your traces for both code paths side by side — the before/after is immediately visible in the span breakdown.
Step 4 — Alerting That Pages You Before Users Notice
A monitoring system with no alerts is just an expensive dashboard for incidents you hear about from Twitter. Alerts are the mechanism that closes the loop — they turn a metric crossing a threshold into a human being waking up or a Slack message appearing at 2 AM.
Good alerts are specific, actionable, and low-noise. An alert that fires twenty times a day gets ignored. An alert that fires twice a week and always indicates a real problem gets acted on. The APIForge team follows one rule: every alert must have a runbook — a short document that says what the alert means and what to do about it.
# WHAT: APIForge - Prometheus alerting rules
# File: prometheus/alerts/apiforge-api.yml
# These rules fire when API health metrics cross thresholds
groups:
  - name: apiforge-api-alerts
    rules:
      # Alert 1: High error rate on any endpoint
      # Both sides are summed by (method, path) so the status_code label is
      # dropped; without that, each error series would only be divided by the
      # request series with the same status code and the ratio would always be 1.
      - alert: APIHighErrorRate
        expr: |
          (
            sum by (method, path) (rate(apiforge_http_errors_total[5m]))
            /
            sum by (method, path) (rate(apiforge_http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.path }}"
          description: >
            Error rate is {{ $value | humanizePercentage }} on
            {{ $labels.method }} {{ $labels.path }}.
            This has been above 1% for 2 minutes.
          runbook: "https://wiki.apiforge.io/runbooks/high-error-rate"

      # Alert 2: p99 latency spike
      - alert: APIHighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, method, path) (
              rate(apiforge_http_request_duration_ms_bucket[5m])
            )
          ) > 500
        for: 3m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p99 latency above 500ms on {{ $labels.path }}"
          description: >
            99th percentile latency is {{ $value }}ms on
            {{ $labels.method }} {{ $labels.path }}.
          runbook: "https://wiki.apiforge.io/runbooks/high-latency"

      # Alert 3: API completely unreachable (uptime)
      - alert: APIDown
        expr: up{job="apiforge-backend"} == 0
        for: 1m
        labels:
          severity: critical
          team: devops
        annotations:
          summary: "APIForge backend is unreachable"
          description: "The /metrics scrape target has been down for 1 minute."
          runbook: "https://wiki.apiforge.io/runbooks/api-down"

What just happened?
The first alert calculates error rate as a ratio of error requests to total requests over a 5-minute rolling window. The for: 2m clause means the condition must stay true for 2 consecutive minutes before the alert fires — this prevents a brief spike from waking someone up at 3 AM for a transient blip.
Every alert includes a runbook URL. The on-call engineer who receives the PagerDuty page has immediate context: what the alert means, what to check first, and what actions to take. This turns a panic into a procedure.
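Try this: Write a fourth alert rule for when the request rate drops to zero: sum(rate(apiforge_http_requests_total[5m])) == 0 for 5 minutes during business hours. The sum() collapses the per-endpoint series into one total, so the rule fires only when the whole API goes quiet. A sudden silence in traffic is often more alarming than a spike.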
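The effect of the for: clause is easier to see in a small simulation. This is a simplified model written for illustration, not Prometheus code: it assumes one sample per evaluation and tracks how long the condition has held without interruption.

```javascript
// samples: [{ t: secondsSinceStart, value: errorRatio }], one per evaluation.
// Returns the alert state at each evaluation:
//   inactive - condition is false
//   pending  - condition is true, but not yet for `forSeconds`
//   firing   - condition has held continuously for at least `forSeconds`
function evaluateAlert(samples, threshold, forSeconds) {
  let breachStart = null;
  return samples.map(({ t, value }) => {
    if (value > threshold) {
      if (breachStart === null) breachStart = t;
      const held = t - breachStart;
      return { t, state: held >= forSeconds ? "firing" : "pending" };
    }
    breachStart = null; // any healthy sample resets the clock
    return { t, state: "inactive" };
  });
}

// A one-minute blip at t=0 never fires; the sustained breach from t=120 does.
const states = evaluateAlert(
  [
    { t: 0, value: 0.02 },
    { t: 60, value: 0.005 },
    { t: 120, value: 0.03 },
    { t: 180, value: 0.03 },
    { t: 240, value: 0.03 }
  ],
  0.01, // 1% error rate threshold
  120   // equivalent of for: 2m
).map((s) => s.state);

console.log(states.join(" ")); // pending inactive pending pending firing
```

A longer for: value trades detection speed for fewer false pages, which is the tuning decision behind every alert rule.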
What Observability Actually Changes
The difference between an instrumented API and a blind one is not just about tooling. It is about how fast you recover when something goes wrong — and whether you find out before or after your users do.
Without Monitoring
A database index is dropped during a migration. Queries slow from 40ms to 4000ms. Users start seeing timeouts.
First notification: a user support ticket 45 minutes later. Engineer checks logs — plain text, no timestamps on individual queries. Guessing begins.
Mean time to resolution: 2–3 hours. Outage declared, post-mortem required, trust damaged.
With Monitoring
Same migration drops the index. p99 latency alert fires 3 minutes later. On-call engineer is paged.
Trace shows: db.query_projects span jumped from 38ms to 3900ms at exactly 14:32. Migration ran at 14:30. Root cause: obvious.
Index recreated at 14:38. Total user-facing degradation: 8 minutes. No ticket needed. Runbook updated.
The APIForge Incident Response Flow
From Alert to Resolution
Alert fires — Prometheus detects threshold breach, Alertmanager sends Slack message and PagerDuty page to the on-call engineer.
Check the dashboard — open Grafana, look at the RED metrics panel. Is it one endpoint or all of them? Is the error rate rising or steady? Is latency also spiking?
Query the logs — filter Loki by level=ERROR and the affected path. Read the error messages. Look for a pattern in the request IDs.
Pull a trace — take a request ID from an error log, find the trace in Tempo. The span breakdown shows exactly where time is being spent or where errors are thrown.
Fix and verify — deploy the fix (or rollback), watch the error rate and latency metrics return to baseline on the dashboard in real time.
Alert resolves — Alertmanager sends an automatic "resolved" notification. The incident timeline is captured in the logs for the post-mortem.
Observability Tool Landscape
| Tool | Signal type | Hosting | Free tier | Best for |
|---|---|---|---|---|
| Prometheus + Grafana | Metrics + dashboards | Self-hosted | Yes (open source) | Teams with own infra, full control |
| Datadog | Logs + metrics + traces + APM | SaaS | 14-day trial | Enterprise teams, all-in-one platform |
| Grafana Cloud | Logs (Loki) + metrics + traces | Managed cloud | Yes (generous) | Startups wanting managed OSS stack |
| New Relic | Full-stack APM + alerts | SaaS | 100GB/month free | Full-stack visibility, generous free tier |
| AWS CloudWatch | Logs + metrics + alarms | AWS-native | Yes (limited) | APIs already hosted on AWS |
What not to log
Never log passwords, credit card numbers, social security numbers, full authentication tokens, or any data regulated under GDPR (personal data), HIPAA (protected health information), or PCI-DSS (cardholder data). Structured logging makes it easy to accidentally include req.body wholesale, and request bodies often contain sensitive data. Log only the fields you explicitly name. A data breach caused by over-logging is a compliance nightmare and a legal liability.
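"Log only the fields you explicitly name" translates naturally into an allowlist. The sketch below is one way to do it; the field names in SAFE_BODY_FIELDS are hypothetical and would need to match your own request schema.

```javascript
// Hypothetical allowlist: the only body fields that may appear in logs.
const SAFE_BODY_FIELDS = ["project_id", "name", "visibility"];

// Copy allowlisted fields only. Anything not named is dropped, so a new
// sensitive field added to the API later is excluded by default.
function pickLoggable(body, allowed = SAFE_BODY_FIELDS) {
  const out = {};
  if (body === null || typeof body !== "object") return out;
  for (const key of allowed) {
    if (Object.prototype.hasOwnProperty.call(body, key)) {
      out[key] = body[key];
    }
  }
  return out;
}

// Usage: the password and card number never reach the log entry.
const loggable = pickLoggable({
  name: "Website redesign",
  password: "hunter2",
  card_number: "4111111111111111"
});
console.log(JSON.stringify(loggable)); // {"name":"Website redesign"}
```

An allowlist fails safe: forgetting to add a field means a missing log value, whereas a blocklist that misses a field means leaked data.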
Self-Hosted vs Managed Observability
Self-Hosted (Prometheus + Loki + Tempo + Grafana)
You run the storage and query infrastructure yourself — either on your own servers or in Kubernetes. Data never leaves your environment.
Pros: zero data egress cost, full control, no vendor lock-in. Cons: operational overhead — you maintain the observability stack as its own service.
Managed SaaS (Datadog / Grafana Cloud / New Relic)
You ship logs and metrics to a third-party platform. They handle storage, scaling, retention, and the query interface.
Pros: no infra to manage, scales automatically, usually easier to set up. Cons: ongoing cost scales with data volume, data leaves your environment, vendor dependency.
The APIForge team started with Grafana Cloud on the free tier — one install command, five minutes of configuration, and they had logs and metrics flowing. As they scale past the free tier limits, they plan to migrate the critical metrics storage to self-hosted Prometheus while keeping Loki in Grafana Cloud for log search.
Quiz
1. The APIForge p99 latency alert fires at 14:32. A Prometheus metric shows the spike, but does not explain it. An engineer pulls a distributed trace for one of the slow requests. What does the trace reveal that the metric alone could not?
2. The APIForge error rate alert uses for: 2m in its Prometheus rule definition. What is the purpose of this field?
3. An APIForge support engineer receives a user complaint containing the header value X-Request-ID: f47ac10b-58cc. They search Loki for that value and find every log line from that request instantly. What does the request logging middleware do to make this possible?