Web APIs Lesson 29 – Monitoring and Logging | Dataplexa

Web APIs · Lesson 29

API Monitoring & Logging

Instrument a live API with structured logs, request tracing, uptime checks, and alerting so failures surface in seconds — not in a customer support ticket.

In October 2021, a faulty configuration change to Facebook's backbone routers cut its data centers off from the internet, which also made its DNS servers unreachable. Facebook, Instagram, WhatsApp, and Messenger went down simultaneously for roughly six hours. The engineers who needed to fix it could not even reach the internal tools to diagnose it, because those tools lived on the same network that was down. The outage cost an estimated $60 million in lost revenue, and it was a configuration error, not a code bug.

The lesson everyone took from that incident was not "write better code." It was: know what your system is doing at all times, from somewhere outside the system. That is exactly what monitoring and logging give you. An API without observability is a black box. Something can go wrong inside it right now, and you would not know until a user tells you.

Monitoring & Logging — Concept Anatomy

Concept: API Observability

Type: Infrastructure + Code Practice

Used for: Debugging, SLA tracking, capacity planning, incident response

Pillars: Logs · Metrics · Traces

Standard: OpenTelemetry (CNCF)

The Three Pillars of Observability

"Observability" is not a tool — it is a property of a system. A system is observable if you can understand its internal state from its external outputs alone. For APIs, that external output comes in three forms: logs, metrics, and traces. Each answers a different question.

Logs — "What happened?"

Discrete, timestamped records of individual events. A request arrived, a database query ran, an error was thrown. Logs are the narrative of your system — precise but voluminous. Stripe logs every API call for 90 days.

Metrics — "How is it behaving over time?"

Numeric measurements aggregated over time — request rate, error rate, response latency, CPU usage. Metrics are cheap to store and fast to query. They power dashboards and alert thresholds.

Traces — "Where did the time go?"

A trace follows one request across every service it touches, recording how long each step took. When a response takes 800ms instead of 80ms, a trace tells you whether the delay was in your API, the database, or a third-party call.

Uptime Checks — "Is it reachable at all?"

External probes that ping your endpoints from outside your infrastructure every 30–60 seconds. If your server goes down, these catch it before any internal system does — because they see what your users see.
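
A minimal sketch of what one such probe might do, assuming Node 18 or later (global fetch and AbortSignal.timeout). The function name probe, the 5-second timeout, and the health URL in the usage comment are illustrative assumptions, not part of APIForge.

```javascript
// Hypothetical external uptime probe. `fetchImpl` is injectable so the probe
// can be exercised without a network; it defaults to Node's global fetch.
async function probe(url, { timeoutMs = 5000, fetchImpl = fetch } = {}) {
  const started = Date.now();
  try {
    const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { url, up: res.ok, status: res.status, latencyMs: Date.now() - started };
  } catch (err) {
    // DNS failure, refused connection, or timeout: the endpoint is unreachable
    return { url, up: false, status: null, latencyMs: Date.now() - started, error: err.message };
  }
}

// Usage (run from a machine outside your own infrastructure):
// setInterval(async () => console.log(await probe("https://api.example.com/health")), 30000);
```

The probe reports a failure instead of throwing, so a scheduler can record every result, including the ones where the server never answered.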

Observability Features in Practice

Concept	What it does	APIForge use case
Structured Logging	Writes logs as JSON objects instead of plain text strings — machine-parseable and searchable	APIForge Backend logs every request with method, path, status, latency, and user ID
Log Levels	Categorise log entries as DEBUG, INFO, WARN, or ERROR — filter by severity in production	Production only ships INFO and above — DEBUG logs stay in dev to avoid noise
Request ID / Correlation ID	A unique ID stamped on every incoming request and threaded through all logs for that request	Support team pastes a request ID from a user complaint — DevOps finds every log line in seconds
RED Metrics	Rate (requests/sec), Errors (error rate %), Duration (response time p50/p95/p99)	Dashboard shows real-time RED for every APIForge endpoint — anomalies visible instantly
Alerting	Rules that fire a notification (Slack, PagerDuty, email) when a metric crosses a threshold	Alert fires if error rate on /projects exceeds 1% for 2 consecutive minutes
Distributed Tracing	Follows one request across multiple services, recording each span's duration and metadata	APIForge traces show whether latency spikes are in the API, Postgres, or Redis cache layer
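
Log levels from the table can be sketched in a few lines. This is an illustration only: createLogger, the numeric severities, and the LOG_LEVEL environment variable are assumptions made for the example, not APIForge code.

```javascript
// Numeric severities let us compare levels with a simple >=
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };

// Hypothetical helper: returns a log function that drops entries below minLevel
function createLogger(
  minLevel = process.env.LOG_LEVEL ?? "INFO",
  write = (line) => process.stdout.write(line + "\n")
) {
  const threshold = LEVELS[minLevel] ?? LEVELS.INFO;
  return function log(level, message, fields = {}) {
    if ((LEVELS[level] ?? LEVELS.INFO) < threshold) return false; // filtered out
    write(JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...fields }));
    return true;
  };
}

// Usage:
// const log = createLogger("INFO");
// log("DEBUG", "cache miss");                      // dropped in production
// log("ERROR", "db timeout", { request_id: "r1" }); // written as one JSON line
```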

The Observability Pipeline

Raw observability data is worthless if it sits isolated on a server. The pipeline matters — getting data from your API into a place where it can be searched, visualised, and acted on. Here is how the APIForge DevOps team routes all three signal types.

APIForge — Observability Pipeline

API Server (emits logs + metrics + traces) → Collector (OpenTelemetry / Fluentd) → Storage (Loki · Prometheus · Tempo) → Visualisation (Grafana dashboards) → Alerting (Slack · PagerDuty)

The Loki + Prometheus + Tempo Stack

These three open-source tools form a popular self-hosted observability stack. Loki and Tempo come from Grafana Labs; Prometheus is a separate CNCF project that Grafana integrates with. Loki stores logs without indexing the full text, which keeps it cheap and fast. Prometheus scrapes and stores time-series metrics. Tempo stores distributed traces. All three are queryable from a single Grafana dashboard. Cloud-managed alternatives include Datadog, New Relic, and AWS CloudWatch: same concepts, different pricing models.

Step 1 — Structured Logging in Your API

Plain-text logs look like this: GET /projects 200 48ms. They are readable to a human. But when you have ten thousand log lines and need to find every request that took more than 500ms last Tuesday from a specific user, a plain-text log is almost useless — you cannot query it without writing a regex nightmare.

Structured logs are JSON objects. Every field is a named key with a typed value. A log aggregator like Loki or Datadog can index those fields and answer queries like { latency_ms > 500, user_id = "usr_9x2" } across millions of records in milliseconds.
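
To see why named, typed fields matter, here is a small sketch that answers the same kind of query in plain Node.js. The three log lines and the function name slowRequestsForUser are invented for illustration; a real aggregator does this with indexes rather than a linear scan.

```javascript
// Three sample NDJSON log lines (one JSON object per line), invented for illustration
const rawLogs = [
  '{"request_id":"r1","user_id":"usr_9x2","path":"/projects","latency_ms":48}',
  '{"request_id":"r2","user_id":"usr_9x2","path":"/projects","latency_ms":812}',
  '{"request_id":"r3","user_id":"usr_4k7","path":"/projects","latency_ms":950}'
].join("\n");

// Parse each line, then filter on typed fields: no regex needed
function slowRequestsForUser(ndjson, userId, minLatencyMs) {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.user_id === userId && entry.latency_ms > minLatencyMs);
}

const slow = slowRequestsForUser(rawLogs, "usr_9x2", 500);
console.log(slow.map((entry) => entry.request_id)); // [ 'r2' ]
```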

What fields should every request log contain?

At minimum: timestamp (ISO 8601), level (INFO/WARN/ERROR), request_id (unique per request), method, path, status_code, latency_ms, user_id (if authenticated), ip. Add service and version if you run multiple deployments — they tell you which build produced the log line.

The APIForge Backend team uses a request logging middleware — a function that wraps every incoming request, captures timing and metadata, and writes a structured JSON log entry when the response is sent. Here is their implementation.

// WHAT: APIForge — Express.js structured logging middleware
// File: src/middleware/requestLogger.js
// Logs every request as a JSON object with timing and context

const { randomUUID } = require("crypto");

function requestLogger(req, res, next) {
  // Generate a unique ID for this request
  const requestId = randomUUID();
  const startTime = Date.now();

  // Attach the request ID to the request object
  // so downstream code can reference it in their own logs
  req.requestId = requestId;

  // Also expose it as a response header so clients can
  // include it in bug reports
  res.setHeader("X-Request-ID", requestId);

  // Hook into the response finish event — fires after
  // the response is fully sent to the client
  res.on("finish", () => {
    const latency = Date.now() - startTime;

    const logEntry = {
      timestamp: new Date().toISOString(),
      level: res.statusCode >= 500 ? "ERROR"
           : res.statusCode >= 400 ? "WARN"
           : "INFO",
      request_id: requestId,
      method: req.method,
      path: req.path,
      status_code: res.statusCode,
      latency_ms: latency,
      user_id: req.user?.id ?? null,
      ip: req.headers["x-forwarded-for"] ?? req.socket.remoteAddress,
      user_agent: req.headers["user-agent"] ?? null,
      service: "apiforge-backend",
      version: process.env.APP_VERSION ?? "unknown"
    };

    // Write JSON to stdout — the collector picks it up from there
    process.stdout.write(JSON.stringify(logEntry) + "\n");
  });

  next();
}

module.exports = requestLogger;

// Usage in app.js:
// const requestLogger = require("./middleware/requestLogger");
// app.use(requestLogger);

Incoming: GET /api/v1/projects (user: usr_9x2m, team: team_abc123)
stdout → {"timestamp":"2025-07-15T11:04:37.221Z","level":"INFO","request_id":"f47ac10b-58cc-4372-a567-0e02b2c3d479","method":"GET","path":"/api/v1/projects","status_code":200,"latency_ms":48,"user_id":"usr_9x2m","ip":"203.0.113.45","user_agent":"PostmanRuntime/7.36.0","service":"apiforge-backend","version":"2.4.1"}

Incoming: DELETE /api/v1/projects/proj_none (invalid ID)
stdout → {"timestamp":"2025-07-15T11:04:38.104Z","level":"WARN","request_id":"a3bb189e-8bf9-3888-9021-d7bf7b8e5a6e","method":"DELETE","path":"/api/v1/projects/proj_none","status_code":404,"latency_ms":12,"user_id":"usr_9x2m","ip":"203.0.113.45","user_agent":"PostmanRuntime/7.36.0","service":"apiforge-backend","version":"2.4.1"}

Incoming: GET /api/v1/projects (database connection dropped)
stdout → {"timestamp":"2025-07-15T11:04:39.887Z","level":"ERROR","request_id":"c9bf9d55-4da9-4254-8b95-abead23ad9bd","method":"GET","path":"/api/v1/projects","status_code":503,"latency_ms":5003,"user_id":"usr_9x2m","ip":"203.0.113.45","user_agent":"PostmanRuntime/7.36.0","service":"apiforge-backend","version":"2.4.1"}

What just happened?

The middleware attaches a requestId to every request before it hits any route handler. When the response finishes, it calculates actual latency and writes one JSON line. The log level is derived automatically from the status code — 5xx becomes ERROR, 4xx becomes WARN, everything else is INFO. No manual level-setting needed.

Writing to stdout keeps the API decoupled from the logging infrastructure. A collector (Fluentd, Vector, or the platform's own agent) picks up stdout, parses the JSON, and ships it to Loki or Datadog. Swap the collector without touching application code.

Try this: Add req.requestId to your error handler so every thrown error log also contains the request ID. Now a single ID connects the access log entry to the error log entry for the same request.
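
One possible shape for that error handler, as a sketch. It assumes the requestLogger middleware has already set req.requestId; the name errorLogger and the type: "exception" field are made up for this example.

```javascript
// Hypothetical Express error-handling middleware (four arguments).
// Assumes requestLogger ran earlier and set req.requestId.
function errorLogger(err, req, res, next) {
  const entry = {
    timestamp: new Date().toISOString(),
    level: "ERROR",
    type: "exception",
    request_id: req.requestId ?? null,
    method: req.method,
    path: req.path,
    error: err.message,
    stack: err.stack
  };
  process.stdout.write(JSON.stringify(entry) + "\n");
  next(err); // let the next error handler build the HTTP response
}

// Usage in app.js, after all routes:
// app.use(errorLogger);
```

Searching the aggregator for one request_id then returns both the access log line and the exception line for that request.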

Step 2 — Exposing Metrics with the RED Method

The RED method (Rate, Errors, Duration) is a framework for choosing which metrics to instrument on every service. It was popularised by Tom Wilkie at Weaveworks and is now a common mental model on engineering teams. The idea is simple: if all three RED metrics are healthy, your service is almost certainly healthy.

Rate tells you how busy the service is (requests per second). Errors tells you how many of those are failing. Duration tells you how long requests take, tracked as percentiles, not averages. The p99 latency (the threshold that the slowest 1% of requests exceed) is often more actionable than the mean, because a degraded tail can indicate a resource exhaustion problem that the average masks completely.

Why p99, not average?

If 99 out of 100 requests take 10ms and one takes 5000ms, the average is about 60ms — which looks fine. But 1% of your users are waiting 5 seconds. At 1,000 requests per second, that is 10 people every second experiencing a broken-feeling API. Percentile metrics (p50, p95, p99) expose this; averages hide it. Stripe monitors p99 latency on every API endpoint as a core SLA metric.
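
The effect is easy to reproduce. The sketch below uses a nearest-rank percentile and an assumed sample of 200 requests with 3 slow ones. The tail is deliberately a little above 1% so that the p99 lands inside it; with a tail of exactly 1%, p99 sits right at the boundary.

```javascript
// 197 requests at 10 ms and 3 at 5000 ms (invented sample)
const latencies = [...Array(197).fill(10), 5000, 5000, 5000];

const mean = latencies.reduce((sum, v) => sum + v, 0) / latencies.length;

// Nearest-rank percentile: smallest value with at least p% of samples at or below it
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

console.log(mean);                      // 84.85
console.log(percentile(latencies, 50)); // 10
console.log(percentile(latencies, 99)); // 5000
```

The mean suggests a mildly slow API, the median suggests a fast one, and only the p99 shows that some users wait five seconds.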

The APIForge Backend team exposes a /metrics endpoint in Prometheus format. Prometheus scrapes it every 15 seconds. Here is the middleware that tracks the three RED signals.

// WHAT: APIForge — RED metrics middleware using prom-client
// npm install prom-client
// File: src/middleware/metrics.js

const promClient = require("prom-client");

// Collect default Node.js metrics (event loop lag, heap, etc.)
promClient.collectDefaultMetrics({ prefix: "apiforge_" });

// Rate — counter increments on every request
const httpRequestsTotal = new promClient.Counter({
  name: "apiforge_http_requests_total",
  help: "Total HTTP requests received",
  labelNames: ["method", "path", "status_code"]
});

// Errors: counter for 4xx and 5xx responses
const httpErrorsTotal = new promClient.Counter({
  name: "apiforge_http_errors_total",
  help: "Total HTTP error responses (4xx and 5xx)",
  labelNames: ["method", "path", "status_code"]
});

// Duration — histogram tracks response time distribution
const httpRequestDuration = new promClient.Histogram({
  name: "apiforge_http_request_duration_ms",
  help: "HTTP request latency in milliseconds",
  labelNames: ["method", "path", "status_code"],
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
});

function metricsMiddleware(req, res, next) {
  const start = Date.now();

  res.on("finish", () => {
    const duration = Date.now() - start;
    const labels = {
      method: req.method,
      path: req.route?.path ?? req.path,
      status_code: res.statusCode
    };

    httpRequestsTotal.inc(labels);
    httpRequestDuration.observe(labels, duration);

    if (res.statusCode >= 400) {
      httpErrorsTotal.inc(labels);
    }
  });

  next();
}

// Expose /metrics endpoint for Prometheus to scrape
async function metricsEndpoint(req, res) {
  res.set("Content-Type", promClient.register.contentType);
  res.end(await promClient.register.metrics());
}

module.exports = { metricsMiddleware, metricsEndpoint };

// Usage in app.js:
// app.use(metricsMiddleware);
// app.get("/metrics", metricsEndpoint);

GET /metrics → 200 OK (Prometheus scrape)

# HELP apiforge_http_requests_total Total HTTP requests received
# TYPE apiforge_http_requests_total counter
apiforge_http_requests_total{method="GET",path="/api/v1/projects",status_code="200"} 8423
apiforge_http_requests_total{method="POST",path="/api/v1/projects",status_code="201"} 1204
apiforge_http_requests_total{method="GET",path="/api/v1/projects",status_code="500"} 7

# HELP apiforge_http_errors_total Total HTTP error responses (4xx and 5xx)
# TYPE apiforge_http_errors_total counter
apiforge_http_errors_total{method="GET",path="/api/v1/projects",status_code="500"} 7
apiforge_http_errors_total{method="POST",path="/api/v1/projects",status_code="422"} 34

# HELP apiforge_http_request_duration_ms HTTP request latency in milliseconds
# TYPE apiforge_http_request_duration_ms histogram
apiforge_http_request_duration_ms_bucket{method="GET",path="/api/v1/projects",status_code="200",le="25"} 2198
apiforge_http_request_duration_ms_bucket{method="GET",path="/api/v1/projects",status_code="200",le="50"} 7804
apiforge_http_request_duration_ms_bucket{method="GET",path="/api/v1/projects",status_code="200",le="100"} 8389
apiforge_http_request_duration_ms_bucket{method="GET",path="/api/v1/projects",status_code="200",le="500"} 8419
apiforge_http_request_duration_ms_bucket{method="GET",path="/api/v1/projects",status_code="200",le="+Inf"} 8423
apiforge_http_request_duration_ms_sum{method="GET",path="/api/v1/projects",status_code="200"} 387504
apiforge_http_request_duration_ms_count{method="GET",path="/api/v1/projects",status_code="200"} 8423

# Node.js runtime metrics (sample)
apiforge_nodejs_heap_size_used_bytes 42300416
apiforge_nodejs_eventloop_lag_seconds 0.00041

What just happened?

Three instruments track the three RED signals. The Counter for total requests measures rate over time (Prometheus computes the per-second rate using rate()). The error Counter is separate so you can calculate error percentage directly. The Histogram records every response time into pre-defined buckets — the bucket counts let Prometheus compute accurate percentiles across any time window.
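
What rate() does with a counter can be approximated by hand. The sketch below is a simplification (the real function also extrapolates to the window edges), and the scrape values are invented.

```javascript
// Per-second rate between two scrapes of a counter, tolerating a counter reset
function perSecondRate(prev, curr) {
  const elapsedSeconds = (curr.timestampMs - prev.timestampMs) / 1000;
  // A counter that went down was reset (process restart): count from zero
  const delta = curr.value >= prev.value ? curr.value - prev.value : curr.value;
  return delta / elapsedSeconds;
}

// Invented sample values: two scrapes 15 seconds apart
const requestRate = perSecondRate({ timestampMs: 0, value: 8000 }, { timestampMs: 15000, value: 8423 });
const errorRate = perSecondRate({ timestampMs: 0, value: 5 }, { timestampMs: 15000, value: 7 });

console.log(requestRate);                     // 28.2 requests per second
console.log((errorRate / requestRate) * 100); // error percentage over the same window
```

Because counters only ever go up, the raw numbers mean little on their own; the difference between two scrapes is what carries the signal.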

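Labels (method, path, status_code) mean one metric definition covers the entire API. A Grafana query like histogram_quantile(0.99, sum by (le, path) (rate(apiforge_http_request_duration_ms_bucket[5m]))) gives you the live p99 latency for each endpoint without additional code. The sum by (le, path) step merges the per-status-code series first; without it you get a separate p99 for every label combination.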
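
To make the percentile calculation less of a black box, here is a simplified sketch of the linear interpolation that histogram_quantile performs inside a bucket. It works on the raw cumulative counts from the sample output rather than on rates, and estimateQuantile is a name invented for this example.

```javascript
// Cumulative bucket counts copied from the sample /metrics output for GET /api/v1/projects
const buckets = [
  { le: 25, count: 2198 },
  { le: 50, count: 7804 },
  { le: 100, count: 8389 },
  { le: 500, count: 8419 },
  { le: Infinity, count: 8423 }
];

// Find the bucket containing the target rank, then interpolate linearly inside it
function estimateQuantile(q, cumulative) {
  const total = cumulative[cumulative.length - 1].count;
  const rank = q * total;
  for (let i = 0; i < cumulative.length; i++) {
    if (cumulative[i].count >= rank) {
      // Quantile falls in the +Inf bucket: report the highest finite bound
      if (cumulative[i].le === Infinity) return cumulative[i - 1].le;
      const lowerBound = i === 0 ? 0 : cumulative[i - 1].le;
      const lowerCount = i === 0 ? 0 : cumulative[i - 1].count;
      const inBucket = cumulative[i].count - lowerCount;
      return lowerBound + (cumulative[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
  }
}

console.log(estimateQuantile(0.99, buckets).toFixed(1)); // 95.7
```

The estimate can only be as precise as the bucket boundaries, which is why the bucket list should be dense around the latencies you care about.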

Try this: After running a few requests, hit /metrics in your browser. You should see the raw Prometheus text format. The bucket counts show you exactly how request times are distributed — most should land in the low buckets.

Step 3 — Distributed Tracing with Correlation IDs

A metric can tell you that p99 latency spiked from 80ms to 900ms at 14:32. But it cannot tell you why. That is where tracing comes in. A trace reconstructs the exact path of one specific request — every function call, every database query, every external HTTP call — with the time spent at each step.

In a single-service API, a trace is straightforward. In a microservices architecture — where one user request might touch an auth service, a projects service, a notifications service, and a billing service — the correlation ID is the thread that ties it all together. Every service logs the same ID so the full journey is reconstructable.

// WHAT: APIForge — Request tracing with span timing
// Wraps individual operations inside a request with timing data
// File: src/utils/tracer.js

const { randomUUID } = require("crypto");

class Tracer {
  constructor(requestId) {
    this.requestId = requestId;
    this.traceId = randomUUID();
    this.spans = [];
    this.startTime = Date.now();
  }

  // Start timing a named operation
  startSpan(name) {
    const span = {
      name,
      traceId: this.traceId,
      requestId: this.requestId,
      startMs: Date.now()
    };
    this.spans.push(span);
    return span;
  }

  // End timing for a span and record its duration
  endSpan(span, metadata = {}) {
    span.durationMs = Date.now() - span.startMs;
    span.metadata = metadata;
    span.status = "ok";
  }

  // Mark a span as failed
  errorSpan(span, error) {
    span.durationMs = Date.now() - span.startMs;
    span.status = "error";
    span.error = error.message;
  }

  // Emit the full trace as a structured log
  flush() {
    const totalDuration = Date.now() - this.startTime;
    const traceLog = {
      timestamp: new Date().toISOString(),
      level: "INFO",
      type: "trace",
      trace_id: this.traceId,
      request_id: this.requestId,
      total_duration_ms: totalDuration,
      spans: this.spans
    };
    process.stdout.write(JSON.stringify(traceLog) + "\n");
  }
}

// Example usage inside a route handler.
// verifyJWT, db, and redis are assumed to be defined elsewhere in the app.
async function getProjectsHandler(req, res, next) {
  const tracer = new Tracer(req.requestId);
  let span; // the span currently in progress

  try {
    // Span 1: Auth check
    span = tracer.startSpan("auth.verify_token");
    const user = await verifyJWT(req.headers.authorization);
    tracer.endSpan(span, { userId: user.id });

    // Span 2: Database query
    span = tracer.startSpan("db.query_projects");
    const projects = await db.query(
      "SELECT * FROM projects WHERE team_id = $1", [user.teamId]
    );
    tracer.endSpan(span, { rowCount: projects.rows.length });

    // Span 3: Cache write
    span = tracer.startSpan("cache.set_projects");
    await redis.setex(`projects:${user.teamId}`, 300, JSON.stringify(projects.rows));
    tracer.endSpan(span, { ttlSeconds: 300 });

    res.json({ data: projects.rows });
  } catch (err) {
    // Mark the span that was running when the failure happened
    if (span && !span.status) tracer.errorSpan(span, err);
    next(err);
  } finally {
    // Emit the trace on both the success and the failure path
    tracer.flush();
  }
}

stdout → (formatted for readability)

{
  "timestamp": "2025-07-15T14:32:07.441Z",
  "level": "INFO",
  "type": "trace",
  "trace_id": "b9c3a1e2-4f77-4d88-b5e2-c0f1d8a93b21",
  "request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "total_duration_ms": 847,
  "spans": [
    { "name": "auth.verify_token", "durationMs": 12, "status": "ok", "metadata": { "userId": "usr_9x2m" } },
    { "name": "db.query_projects", "durationMs": 821, "status": "ok", "metadata": { "rowCount": 14 } },
    { "name": "cache.set_projects", "durationMs": 8, "status": "ok", "metadata": { "ttlSeconds": 300 } }
  ]
}

Diagnosis: total_duration_ms=847, but auth=12ms and cache=8ms. The database query consumed 821ms (97% of total time).
Action: check for missing index on projects.team_id column.

What just happened?

The trace output immediately answers the latency question that the metric raised. Total duration was 847ms — but drilling into the spans shows the database query consumed 821ms of that. Auth was 12ms, cache was 8ms. Without the trace, debugging would start with guessing. With it, the next step is obvious: check the query plan for a missing index on projects.team_id.

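The trace_id and request_id are both present. In a multi-service system, the trace_id has to be passed along with each outbound call (for example in a request header) so that every downstream service records its spans under the same ID instead of generating a new one, as this Tracer does in its constructor. With that in place, the entire distributed call chain is reconstructable in Tempo or Jaeger.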
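
A minimal sketch of how the IDs might travel between services. The x-trace-id header name and both function names are assumptions made for this example; OpenTelemetry-based setups typically use the W3C traceparent header instead.

```javascript
const { randomUUID } = require("crypto");

// Reuse IDs sent by an upstream service; mint new ones only at the edge.
// "x-trace-id" is a hypothetical header name used for this sketch.
function extractTraceContext(headers) {
  return {
    requestId: headers["x-request-id"] ?? randomUUID(),
    traceId: headers["x-trace-id"] ?? randomUUID()
  };
}

// Headers to attach to every outbound call to a downstream service
function outboundHeaders(context) {
  return { "x-request-id": context.requestId, "x-trace-id": context.traceId };
}

// Usage inside a handler that calls another service:
// const context = extractTraceContext(req.headers);
// await fetch(billingUrl, { headers: outboundHeaders(context) });
```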

Try this: Add a Redis cache check before the database span. If the cache hits, the database span never runs and total duration drops to ~20ms. Then look at your traces for both code paths side by side — the before/after is immediately visible in the span breakdown.
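
One way to approach that exercise, sketched with injected dependencies so it runs without Redis or Postgres. The cache.get, cache.set, and db.findProjectsByTeam method names are stand-ins invented here, not the real client APIs.

```javascript
// Cache-aside lookup. `cache` and `db` are injected stand-ins for the Redis
// and Postgres clients, so their method names here are assumptions.
async function getProjects(teamId, { cache, db, tracer }) {
  const key = `projects:${teamId}`;

  const cacheSpan = tracer.startSpan("cache.get_projects");
  const cached = await cache.get(key);
  tracer.endSpan(cacheSpan, { hit: cached !== null });
  if (cached !== null) return JSON.parse(cached); // hit: skip the database

  const dbSpan = tracer.startSpan("db.query_projects");
  const rows = await db.findProjectsByTeam(teamId);
  tracer.endSpan(dbSpan, { rowCount: rows.length });

  await cache.set(key, JSON.stringify(rows), 300); // 300-second TTL
  return rows;
}
```

On a hit the trace contains only the cache span, which is exactly the difference you should see when comparing the two code paths.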

Step 4 — Alerting That Pages You Before Users Notice

A monitoring system with no alerts is just an expensive dashboard for incidents you hear about from Twitter. Alerts are the mechanism that closes the loop — they turn a metric crossing a threshold into a human being waking up or a Slack message appearing at 2 AM.

Good alerts are specific, actionable, and low-noise. An alert that fires twenty times a day gets ignored. An alert that fires twice a week and always indicates a real problem gets acted on. The APIForge team follows one rule: every alert must have a runbook — a short document that says what the alert means and what to do about it.

# WHAT: APIForge — Prometheus alerting rules
# File: prometheus/alerts/apiforge-api.yml
# These rules fire when API health metrics cross thresholds

groups:
  - name: apiforge-api-alerts
    rules:

      # Alert 1: High error rate on any endpoint
      - alert: APIHighErrorRate
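        # Aggregate away status_code first. Dividing the raw series would match
        # each error series with the request series for the same status code,
        # and the ratio would always be 1.
        expr: |
          (
            sum by (method, path) (rate(apiforge_http_errors_total[5m]))
            /
            sum by (method, path) (rate(apiforge_http_requests_total[5m]))
          ) > 0.01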
        for: 2m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.path }}"
          description: >
            Error rate is {{ $value | humanizePercentage }} on
            {{ $labels.method }} {{ $labels.path }}.
            This has been above 1% for 2 minutes.
          runbook: "https://wiki.apiforge.io/runbooks/high-error-rate"

      # Alert 2: p99 latency spike
      - alert: APIHighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, method, path) (
              rate(apiforge_http_request_duration_ms_bucket[5m])
            )
          ) > 500
        for: 3m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p99 latency above 500ms on {{ $labels.path }}"
          description: >
            99th percentile latency is {{ $value }}ms on
            {{ $labels.method }} {{ $labels.path }}.
          runbook: "https://wiki.apiforge.io/runbooks/high-latency"

      # Alert 3: API completely unreachable (uptime)
      - alert: APIDown
        expr: up{job="apiforge-backend"} == 0
        for: 1m
        labels:
          severity: critical
          team: devops
        annotations:
          summary: "APIForge backend is unreachable"
          description: "The /metrics scrape target has been down for 1 minute."
          runbook: "https://wiki.apiforge.io/runbooks/api-down"

Prometheus Alertmanager: Firing Alerts
────────────────────────────────────────────────────
ALERT: APIHighErrorRate
State: FIRING
Severity: warning
Team: backend
Started: 2025-07-15T14:34:22Z (2m 14s ago)
Labels:
  method = GET
  path = /api/v1/projects
Value: 0.0312 (3.1% error rate; threshold: 1%)
Annotations:
  Summary: High error rate on /api/v1/projects
  Description: Error rate is 3.1% on GET /api/v1/projects. This has been above 1% for 2 minutes.
  Runbook: https://wiki.apiforge.io/runbooks/high-error-rate
Notification sent:
  → Slack #incidents ✓
  → PagerDuty on-call ✓
────────────────────────────────────────────────────
Resolved alerts: 0
Pending alerts: 0
Firing alerts: 1

What just happened?

The first alert calculates error rate as a ratio of error requests to total requests over a 5-minute rolling window. The for: 2m clause means the condition must stay true for 2 consecutive minutes before the alert fires — this prevents a brief spike from waking someone up at 3 AM for a transient blip.
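
The effect of the for: clause can be simulated in a few lines. This is a simplified model of the inactive, pending, and firing states; the 30-second evaluation interval and the sample data are assumptions for the example.

```javascript
// Simulates how a `for:` duration gates an alert. Each sample is one rule
// evaluation: { t: seconds, breached: boolean }.
function alertStates(samples, forSeconds) {
  let pendingSince = null;
  return samples.map(({ t, breached }) => {
    if (!breached) {
      pendingSince = null; // condition cleared: reset the timer
      return "inactive";
    }
    if (pendingSince === null) pendingSince = t;
    return t - pendingSince >= forSeconds ? "firing" : "pending";
  });
}

// A 60-second blip (t=30..60) never fires; the sustained breach from t=120 does
const samples = [0, 30, 60, 90, 120, 150, 180, 210, 240].map((t) => ({
  t,
  breached: t === 30 || t === 60 || t >= 120
}));

console.log(alertStates(samples, 120));
// [ 'inactive', 'pending', 'pending', 'inactive',
//   'pending', 'pending', 'pending', 'pending', 'firing' ]
```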

Every alert includes a runbook URL. The on-call engineer who receives the PagerDuty page has immediate context: what the alert means, what to check first, and what actions to take. This turns a panic into a procedure.

Try this: Write a fourth alert rule for when the request rate drops to zero: sum(rate(apiforge_http_requests_total[5m])) == 0 for 5 minutes during business hours. The sum() matters, because without it a single rarely used endpoint going quiet would trigger the alert. A sudden silence in traffic is often more alarming than a spike.

What Observability Actually Changes

The difference between an instrumented API and a blind one is not just about tooling. It is about how fast you recover when something goes wrong — and whether you find out before or after your users do.

Without Monitoring

A database index is dropped during a migration. Queries slow from 40ms to 4000ms. Users start seeing timeouts.

First notification: a user support ticket 45 minutes later. Engineer checks logs — plain text, no timestamps on individual queries. Guessing begins.

Mean time to resolution: 2–3 hours. Outage declared, post-mortem required, trust damaged.

With Monitoring

Same migration drops the index. p99 latency alert fires 3 minutes later. On-call engineer is paged.

Trace shows: db.query_projects span jumped from 38ms to 3900ms at exactly 14:32. Migration ran at 14:30. Root cause: obvious.

Index recreated at 14:38. Total user-facing degradation: 8 minutes. No ticket needed. Runbook updated.

The APIForge Incident Response Flow

From Alert to Resolution

Alert fires — Prometheus detects threshold breach, Alertmanager sends Slack message and PagerDuty page to the on-call engineer.

Check the dashboard — open Grafana, look at the RED metrics panel. Is it one endpoint or all of them? Is the error rate rising or steady? Is latency also spiking?

Query the logs — filter Loki by level=ERROR and the affected path. Read the error messages. Look for a pattern in the request IDs.

Pull a trace — take a request ID from an error log, find the trace in Tempo. The span breakdown shows exactly where time is being spent or where errors are thrown.

Fix and verify — deploy the fix (or rollback), watch the error rate and latency metrics return to baseline on the dashboard in real time.

Alert resolves — Alertmanager sends an automatic "resolved" notification. The incident timeline is captured in the logs for the post-mortem.

Observability Tool Landscape

Tool	Signal type	Hosting	Free tier	Best for
Prometheus + Grafana	Metrics + dashboards	Self-hosted	Yes (open source)	Teams with own infra, full control
Datadog	Logs + metrics + traces + APM	SaaS	14-day trial	Enterprise teams, all-in-one platform
Grafana Cloud	Logs (Loki) + metrics + traces	Managed cloud	Yes (generous)	Startups wanting managed OSS stack
New Relic	Full-stack APM + alerts	SaaS	100GB/month free	Full-stack visibility, generous free tier
AWS CloudWatch	Logs + metrics + alarms	AWS-native	Yes (limited)	APIs already hosted on AWS

What not to log

Never log passwords, credit card numbers, social security numbers, full authentication tokens, or any field covered by GDPR, HIPAA, or PCI-DSS as personally identifiable information. Structured logging makes it easy to accidentally include req.body wholesale — and request bodies often contain sensitive data. Log only the fields you explicitly name. A data breach caused by over-logging is a compliance nightmare and a legal liability.
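
"Log only the fields you explicitly name" can be enforced with an allowlist. A minimal sketch follows; the field names and the pickLoggable helper are hypothetical and would need to match your own payloads.

```javascript
// Fields that are safe to log; everything else in the source object is dropped
const LOGGABLE_FIELDS = ["project_id", "name", "visibility"];

function pickLoggable(body, allowed = LOGGABLE_FIELDS) {
  const safe = {};
  if (body === null || typeof body !== "object") return safe;
  for (const field of allowed) {
    if (Object.prototype.hasOwnProperty.call(body, field)) safe[field] = body[field];
  }
  return safe;
}

const requestBody = { name: "Demo", visibility: "private", password: "hunter2", card_number: "4111111111111111" };
console.log(pickLoggable(requestBody)); // { name: 'Demo', visibility: 'private' }
```

An allowlist fails safe: a new sensitive field added to the request body later stays out of the logs until someone deliberately adds it to the list.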

Self-Hosted vs Managed Observability

Self-Hosted (Prometheus + Loki + Tempo + Grafana)

You run the storage and query infrastructure yourself — either on your own servers or in Kubernetes. Data never leaves your environment.

Pros: zero data egress cost, full control, no vendor lock-in. Cons: operational overhead — you maintain the observability stack as its own service.

Managed SaaS (Datadog / Grafana Cloud / New Relic)

You ship logs and metrics to a third-party platform. They handle storage, scaling, retention, and the query interface.

Pros: no infra to manage, scales automatically, usually easier to set up. Cons: ongoing cost scales with data volume, data leaves your environment, vendor dependency.

The APIForge team started with Grafana Cloud on the free tier — one install command, five minutes of configuration, and they had logs and metrics flowing. As they scale past the free tier limits, they plan to migrate the critical metrics storage to self-hosted Prometheus while keeping Loki in Grafana Cloud for log search.


Up Next

API Performance

The APIForge Backend team profiles response bottlenecks, implements caching strategies, and load-tests their API to find the breaking point before real traffic does.
