WEB API's Lesson 30 – API Performance | Dataplexa
Web APIs · Lesson 30

API Performance

Profile response bottlenecks, apply caching strategies, and load-test your API to find the breaking point before real traffic does.

Amazon calculated that every 100ms of latency costs them 1% in sales. Google found that a half-second delay in search results dropped traffic by 20%. These are not edge cases — they are the documented economic cost of a slow API. And in both companies, the culprit was almost never the application code itself. It was database queries, missing cache layers, and synchronous calls that could have run in parallel.

Performance work is not about rewriting everything in a faster language. It is about finding the one thing that is slow — the query missing an index, the external call that blocks the response, the endpoint loading 500 rows when it needs 20 — and fixing that one thing. Measure first. Optimise second. Always.

API Performance — Concept Anatomy

Concept: API Throughput & Latency Optimisation
Type: Architecture + Code Practice
Used for: Scaling, cost reduction, reliability under load
Key metrics: p50 / p95 / p99 latency · RPS · error rate under load
Tools: k6 · Autocannon · wrk · Artillery

Performance Techniques at a Glance

Concept What it does APIForge use case
Response Caching Stores a computed response and serves it directly on repeat requests — skips the full processing pipeline GET /projects list cached in Redis for 60s — drops DB load by 80% at peak traffic
Database Query Optimisation Indexes, query plans, and selective field fetching eliminate the most common source of API latency Adding index on projects.team_id drops query time from 820ms to 3ms
Connection Pooling Reuses open database connections instead of creating a new one per request — removes ~15ms overhead per call APIForge pg pool of 20 connections handles 200 concurrent requests without queue buildup
Parallel Execution Runs independent async operations concurrently with Promise.all() instead of awaiting each in sequence Fetching projects + user profile + team metadata simultaneously cuts response time by 60%
Payload Trimming Returns only the fields the client actually needs — reduces serialisation time and network transfer size Project list strips internal fields — payload drops from 14KB to 2KB per page
Load Testing Simulates concurrent users to find the throughput ceiling and latency degradation curve before production traffic does k6 test reveals APIForge breaks at 450 RPS — capacity planned ahead of a product launch

The Performance Optimisation Workflow

Performance work without measurement is guesswork. The workflow below is the same process used at companies like GitHub and Cloudflare — profile first to find the bottleneck, fix exactly that, then verify the improvement with hard numbers before moving on.

APIForge — Performance Optimisation Cycle

1. Baseline
measure current p99
2. Profile
find the slow span
3. Fix one thing
cache / index / pool
4. Verify
re-run load test
5. Repeat
next bottleneck
6. Load test
find ceiling RPS

Step 1 — Profiling: Find the Bottleneck First

The APIForge Backend team's GET /api/v1/projects endpoint was returning in 847ms at p99. Their SLA target is 200ms. A trace already identified the database query as the culprit — but before writing any fix, they need a full picture of every problem in the handler.

Here is the original handler in its slow state — four performance problems visible in the code before running any profiler.

// WHAT: APIForge — original slow GET /projects handler
// Four performance problems visible in plain sight

async function getProjectsHandler(req, res) {
  const userId = req.user.id;

  // Problem 1: Three sequential awaits that are fully independent
  // Each one blocks until the previous finishes — pure wasted time
  const user = await db.query(
    "SELECT * FROM users WHERE id = $1", [userId]
  );
  const team = await db.query(
    "SELECT * FROM teams WHERE id = $1", [user.rows[0].team_id]
  );
  const projects = await db.query(
    "SELECT * FROM projects WHERE team_id = $1", [team.rows[0].id]
  );

  // Problem 2: SELECT * fetches every column including
  // internal_notes, audit_log, raw_config — fields the response never uses

  // Problem 3: No cache — every request hits the database
  // even when the project list has not changed in hours

  // Problem 4: No pagination — returns every project ever created
  // A team with 800 projects sends 800 rows over the wire

  const response = projects.rows.map(p => ({
    id: p.id,
    name: p.name,
    team_id: p.team_id,
    visibility: p.visibility,
    created_at: p.created_at
    // ...and 14 more fields the frontend ignores
  }));

  res.json({ data: response });
}
Trace breakdown — GET /api/v1/projects (slow version): Span: db.query_user → 38ms Span: db.query_team → 31ms Span: db.query_projects → 821ms ← missing index on team_id Span: serialize_response → 44ms ← 800 rows, 19 fields each Total: 934ms Response payload: Rows returned: 800 Fields per row: 19 (client uses 5) Payload size: 187KB Transfer time: ~62ms on a 25Mbps connection p99 over 1 hour: 847ms p50 over 1 hour: 412ms Problems identified: [1] Sequential awaits: ~69ms wasted waiting in line [2] SELECT *: 14 unused fields serialised and transferred [3] No cache: database hit on 100% of requests [4] No pagination: 800 rows when 20 would serve the UI

What just happened?

Four performance problems in one handler. The database query is the biggest — 821ms for a query that should take under 5ms with a proper index. But even after fixing the index, the other three problems eat into the remaining budget: sequential awaits that could run in parallel, a bloated SELECT that fetches 19 columns when 5 are needed, and no cache to avoid repeating this work at all.

The 187KB payload is also significant. On a mobile connection, that adds hundreds of milliseconds of transfer time that the server-side latency number never captures.

Try this: Run EXPLAIN ANALYZE SELECT * FROM projects WHERE team_id = 'team_abc123' in your Postgres client. If you see "Seq Scan," there is no index on that column. Adding one is a single SQL statement that typically takes under a second and drops query time by 100x on large tables.

Step 2 — Database Optimisation and Parallel Queries

Database queries are the number one source of API latency in the vast majority of web applications. Not network. Not serialisation. The database. And the fix is almost always one of three things: add an index, select fewer columns, or stop making the database do work it already did five seconds ago.

What is a database index?

An index is a separate data structure the database maintains — sorted by the indexed column — so it can jump directly to matching rows instead of scanning every row in the table. Without an index on projects.team_id, a query for one team's projects reads every single project row in the database. With the index, it reads only the rows for that team. On a table with 500,000 rows, the difference is the entire table vs a few dozen rows — milliseconds vs seconds.

The refactored handler below fixes all four problems: a proper index, a targeted SELECT, Promise.all() for parallel execution, and a Redis cache layer in front of the database.

// WHAT: APIForge — optimised GET /projects handler
// Fixes: parallel queries + selective fields + Redis cache + pagination

const redis = require("./redis"); // ioredis client
const db    = require("./db");    // pg Pool

// Run once at startup — not per request:
// CREATE INDEX CONCURRENTLY idx_projects_team_id ON projects(team_id);

async function getProjectsHandler(req, res) {
  const userId = req.user.id;
  const page   = parseInt(req.query.page)  || 1;
  const limit  = Math.min(parseInt(req.query.limit) || 20, 100);
  const offset = (page - 1) * limit;

  // Check cache first — key includes pagination so pages cache independently
  const cacheKey = `projects:${userId}:page:${page}:limit:${limit}`;
  const cached   = await redis.get(cacheKey);

  if (cached) {
    res.setHeader("X-Cache", "HIT");
    return res.json(JSON.parse(cached));
  }

  // Fetch user (to get team_id), then run projects + count in parallel
  const userResult = await db.query(
    "SELECT id, team_id FROM users WHERE id = $1", [userId]
  );
  const teamId = userResult.rows[0].team_id;

  // Fix 1: Run data query and count query in parallel
  // Fix 2: SELECT only the five fields the API returns
  // Fix 3: LIMIT + OFFSET — never return unbounded rows
  const [projects, countResult] = await Promise.all([
    db.query(
      `SELECT id, name, team_id, visibility, created_at
       FROM projects
       WHERE team_id = $1
       ORDER BY created_at DESC
       LIMIT $2 OFFSET $3`,
      [teamId, limit, offset]
    ),
    db.query(
      "SELECT COUNT(*) FROM projects WHERE team_id = $1",
      [teamId]
    )
  ]);

  const total    = parseInt(countResult.rows[0].count);
  const response = {
    data: projects.rows,
    meta: {
      total,
      page,
      per_page:    limit,
      total_pages: Math.ceil(total / limit)
    }
  };

  // Cache the result for 60 seconds
  await redis.setex(cacheKey, 60, JSON.stringify(response));
  res.setHeader("X-Cache", "MISS");
  res.json(response);
}
First request — cache MISS: Trace breakdown — GET /api/v1/projects?page=1&limit=20 Span: redis.get → 1ms (miss) Span: db.query_user → 6ms (pool reuse, no reconnect) Span: db.query_projects → 4ms (index hit: idx_projects_team_id) Span: db.query_count → 3ms (runs in parallel with above) Span: redis.setex → 1ms Span: serialize_response → 1ms (20 rows, 5 fields) Total: 16ms (was 934ms) Headers: X-Cache: MISS Response payload: Rows returned: 20 (was 800) Fields per row: 5 (was 19) Payload size: 1.2KB (was 187KB) Subsequent request — cache HIT: Span: redis.get → 1ms (hit — returns immediately) Total: 1ms Headers: X-Cache: HIT p99 after optimisation: 18ms (was 847ms) p50 after optimisation: 4ms (was 412ms) Improvement: 47x faster at p99

What just happened?

The database query dropped from 821ms to 4ms — purely from adding an index on team_id. Switching from SELECT * to five explicit fields trimmed serialisation time and payload size by 99%. Running the count query in parallel with the data query removed sequential wait time. Total first-request latency: 16ms. Cached request latency: 1ms.

The X-Cache header lets the frontend team and monitoring instantly see whether a response came from Redis or the database — without touching traces.

Try this: Change the cache TTL from 60 seconds to 5, then to 300, and watch the Prometheus metric for database query rate shift in each case. The longer the TTL, the fewer DB hits — but the more stale the data. There is no universal right answer; it depends on how often the underlying data actually changes.

Step 3 — Connection Pooling and Resource Limits

Opening a database connection takes time — typically 10–50ms depending on the database, network, and TLS handshake. At 100 requests per second, creating a new connection per request means 100 connection setups every second. That overhead compounds into hundreds of milliseconds of unnecessary latency and saturates the database's connection limit long before the actual query load does.

A connection pool keeps a set of connections open and ready. Incoming requests borrow a connection, use it, and return it to the pool. The pool handles the lifecycle — opening connections on startup, replacing dead ones, and queuing requests when all connections are busy.

// WHAT: APIForge — PostgreSQL connection pool configuration
// npm install pg
// File: src/db.js

const { Pool } = require("pg");

const pool = new Pool({
  host:     process.env.DB_HOST,
  port:     process.env.DB_PORT || 5432,
  database: process.env.DB_NAME,
  user:     process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  ssl:      { rejectUnauthorized: true },

  // Pool size: how many connections to keep open simultaneously
  // Rule of thumb: (2 x CPU cores) + effective_io_concurrency
  // For a 4-core server: start at 10, tune up with load testing
  max: 20,

  // How long a client waits for a free connection before throwing
  connectionTimeoutMillis: 3000,

  // How long an idle connection stays open before being closed
  idleTimeoutMillis: 30000,

  // Max uses per connection — forces reconnect to pick up config changes
  maxUses: 7500
});

pool.on("connect", () => {
  console.log(JSON.stringify({
    level: "INFO", event: "db.pool.connect",
    totalCount: pool.totalCount, idleCount: pool.idleCount
  }));
});

pool.on("error", (err) => {
  console.error(JSON.stringify({
    level: "ERROR", event: "db.pool.error", message: err.message
  }));
});

module.exports = pool;
Application startup: {"level":"INFO","event":"db.pool.connect","totalCount":1,"idleCount":0} {"level":"INFO","event":"db.pool.connect","totalCount":2,"idleCount":1} Under load — 200 concurrent requests: Pool status: totalCount: 20 (max connections open) idleCount: 3 (available immediately) waitingCount: 0 (no requests queued) Per-request connection overhead: Without pool: ~22ms per request (new TCP + TLS + auth each time) With pool: ~0ms per request (borrowed from idle pool instantly) Connection reuse: request_1 → borrows conn_04 → query 4ms → returns conn_04 request_2 → borrows conn_11 → query 3ms → returns conn_11 request_3 → borrows conn_04 → query 5ms → returns conn_04 At 300 concurrent requests (pool exhausted): waitingCount: 280 Avg wait for connection: 8ms Signal: increase pool size or add a read replica

What just happened?

With 20 pooled connections, 200 concurrent requests cycle through them without any connection setup overhead. Each request borrows a connection that is already authenticated and ready — the ~22ms cost is paid once per connection at startup, not once per request.

waitingCount spiking in your Prometheus metrics is an early warning sign before latency visibly degrades. It means your pool is saturated and requests are queuing. Either the pool size needs increasing or queries need to run faster so connections are returned sooner.

Try this: Set max: 1 in your pool config and run a load test with 50 concurrent users. Every request queues behind the single connection — you will feel exactly what a saturated pool looks like before tuning your real configuration.

Step 4 — Cache Invalidation on Writes

A cache is not a magic speed button you add to everything. Cached data is stale data — by definition, it is a snapshot from a moment in the past. The design question is always: how stale is too stale for this particular data, and what triggers a refresh?

TTL-based expiry

Cache entries expire after a fixed time. Simple to implement, no invalidation logic. Best for data that updates on a predictable schedule or where slight staleness is acceptable — like a project list or a dashboard count.

Write-through invalidation

When data is written (POST/PUT/DELETE), the corresponding cache key is deleted immediately. The next read is a cache miss and repopulates from the database. Keeps data accurate but requires invalidation code at every write path.

Cache-aside pattern

The application checks the cache first. On a miss, it reads from the database and writes to the cache before returning. The cache never proactively loads data — it only fills gaps. This is what the APIForge handler in Step 2 implements.

What not to cache

Never cache user-specific sensitive data across users. Never cache POST/PUT/DELETE responses — only GET. Never cache auth or permission checks — a user whose role changes should see the change immediately, not after a TTL expires.

// WHAT: APIForge — cache invalidation when a project is created
// Deletes cached project list pages so the next GET fetches fresh data

async function createProjectHandler(req, res) {
  const userId = req.user.id;
  const { name, visibility, tags } = req.body;

  const userResult = await db.query(
    "SELECT team_id FROM users WHERE id = $1", [userId]
  );
  const teamId = userResult.rows[0].team_id;

  // Write the new project to the database
  const result = await db.query(
    `INSERT INTO projects (name, team_id, visibility, tags, created_at)
     VALUES ($1, $2, $3, $4, NOW())
     RETURNING id, name, team_id, visibility, created_at`,
    [name, teamId, visibility, tags]
  );

  const newProject = result.rows[0];

  // Invalidate all cached project list pages for this user
  const pattern = `projects:${userId}:page:*`;
  const keys    = await redis.keys(pattern);

  if (keys.length > 0) {
    await redis.del(...keys);
    console.log(JSON.stringify({
      level: "INFO", event: "cache.invalidated",
      pattern, keys_deleted: keys.length
    }));
  }

  res.status(201).json({ data: newProject });
}
// Same invalidation logic applies to updateProject and deleteProject
POST /api/v1/projects Status: 201 Created | Time: 19ms stdout: {"level":"INFO","event":"cache.invalidated","pattern":"projects:usr_9x2m:page:*","keys_deleted":4} Response: { "data": { "id": "proj_new88", "name": "Mobile SDK v2", "team_id": "team_abc123", "visibility": "internal", "created_at": "2025-07-15T15:22:07Z" } } Subsequent GET /api/v1/projects?page=1: X-Cache: MISS (cache was cleared — database queried) Time: 16ms (index still fast) GET /api/v1/projects?page=1 (one request later): X-Cache: HIT Time: 1ms (cache repopulated from previous miss)

What just happened?

When the POST creates a new project, it deletes all four cached page keys for that user's project list. The next GET is a cache miss, hits the database (still fast at 16ms), and automatically repopulates the cache. The request after that is a cache hit at 1ms. Data is never stale by more than one request cycle.

The log line cache.invalidated: keys_deleted: 4 tells you exactly which entries were cleared — useful for debugging stale data reports from users.

Try this: Create a project, then immediately GET the list before any other request runs. You should see X-Cache: MISS and the new project in the response. Without invalidation, you would see X-Cache: HIT and a list missing the project you just created.

Step 5 — Load Testing with k6

You have optimised the code. But how many users can it serve simultaneously before latency degrades or errors appear? Manual Postman testing uses one virtual user. Production might send 500 at once.

k6 is an open-source load testing tool from Grafana Labs. You write scripts in JavaScript that define how virtual users behave — what endpoints they call, how long they pause, how many users ramp up and over what duration. k6 executes these scripts and reports latency percentiles, throughput, and error rates across the full run.

Load test stages explained

A good load test has three phases. Ramp-up: gradually add virtual users so the server is not hit all at once (which does not reflect real traffic patterns). Sustained load: hold peak users long enough to detect memory leaks, connection pool exhaustion, or gradual latency drift. Ramp-down: reduce users gradually and observe recovery. A server that does not fully recover after load is removed has a resource leak.

// WHAT: APIForge — k6 load test for GET /projects
// Install: brew install k6  or  https://k6.io/docs/getting-started/installation
// Run: k6 run --env BASE_URL=https://staging.apiforge.io \
//             --env AUTH_TOKEN=eyJ... load-test.js

import http from "k6/http";
import { check, sleep } from "k6";
import { Rate, Trend } from "k6/metrics";

const errorRate       = new Rate("custom_errors");
const projectsLatency = new Trend("projects_endpoint_latency");

export const options = {
  stages: [
    { duration: "30s",  target: 50  }, // Ramp up to 50 users
    { duration: "60s",  target: 200 }, // Ramp up to 200 users
    { duration: "120s", target: 200 }, // Hold at 200 users
    { duration: "60s",  target: 400 }, // Push to 400 users (stress)
    { duration: "30s",  target: 0   }, // Ramp down
  ],
  thresholds: {
    // Test FAILS if p95 latency exceeds 200ms at any point
    "http_req_duration{name:list_projects}": ["p(95)<200"],
    // Test FAILS if error rate exceeds 1%
    "custom_errors": ["rate<0.01"],
  }
};

const BASE_URL   = __ENV.BASE_URL   || "http://localhost:3000";
const AUTH_TOKEN = __ENV.AUTH_TOKEN;

export default function () {
  const headers = {
    Authorization: `Bearer ${AUTH_TOKEN}`,
    Accept:        "application/json"
  };

  const res = http.get(
    `${BASE_URL}/api/v1/projects?page=1&limit=20`,
    { headers, tags: { name: "list_projects" } }
  );

  projectsLatency.add(res.timings.duration);
  errorRate.add(res.status !== 200);

  check(res, {
    "status is 200":          (r) => r.status === 200,
    "has data array":         (r) => r.json("data") !== null,
    "response under 200ms":   (r) => r.timings.duration < 200,
    "X-Cache header present": (r) => r.headers["X-Cache"] !== undefined,
  });

  // Simulate realistic think time between requests
  sleep(Math.random() * 2 + 1); // 1-3 second pause
}
k6 run load-test.js — APIForge Staging scenarios: 1, max VUs: 400, duration: 5m00s check results: ✓ status is 200 [ 99.31% ] ✓ has data array [ 99.31% ] ✗ response under 200ms [ 87.42% — fails during 400 VU stress stage ] ✓ X-Cache header present [ 99.31% ] projects_endpoint_latency: min: 1ms (cache hit) med: 4ms p(90): 31ms p(95): 148ms ← threshold p(95) < 200ms PASS at 200 VUs p(99): 892ms ← degrades sharply at 400 VUs custom_errors: rate: 0.69% ← threshold rate < 1% FAIL at 400 VUs http_reqs: 2,028 total peak throughput: 438 req/s at 200 VUs breaking point: ~380 req/s at 400 VUs (pool exhaustion + p99 spike) thresholds: ✓ http_req_duration{name:list_projects}: p(95)<200 PASS ✗ custom_errors: rate<0.01 FAIL at stress stage Recommendation: Safe operating capacity: ~200 concurrent users / 380 RPS Action before next launch: add second API server instance or increase DB pool max + add read replica

What just happened?

The load test ran five stages over five minutes, peaking at 400 virtual users. At 200 users the API held — p95 latency was 148ms, well inside the 200ms threshold. At 400 users the connection pool saturated, p99 latency spiked to 892ms, and 0.69% of requests errored. The threshold was 1% — so the test technically passed on errors but clearly showed 400 VUs is beyond comfortable capacity.

This is exactly the information you want before a product launch. The team now knows: safe at 200 concurrent users, needs horizontal scaling to handle 400. That conversation happens in a planning meeting, not during a live incident.

Try this: Run the same k6 test twice — once with the cache disabled (comment out the Redis block) and once with it enabled. Compare the p95 latency and peak RPS between both runs. The delta shows you exactly what caching is contributing to your throughput ceiling.

Before and After — The Full Picture

Every optimisation in this lesson built on the previous one. The total improvement was not from one big change — it was from fixing four independent problems, each of which compounded.

Before optimisation

p99 latency: 847ms

p50 latency: 412ms

Payload size: 187KB (800 rows, 19 fields)

DB hit rate: 100% of requests

Peak throughput: ~40 RPS before latency degrades

After optimisation

p99 latency: 18ms (47x improvement)

p50 latency: 1ms (cache hit)

Payload size: 1.2KB (20 rows, 5 fields)

DB hit rate: ~5% (95% cache hit rate)

Peak throughput: 438 RPS at 200 concurrent users

What drove each gain

1

Database index on team_id — query time from 821ms to 4ms. Single biggest gain. One SQL statement, no application code change required.

2

SELECT specific fields + pagination — payload from 187KB to 1.2KB. Cuts serialisation time, network transfer, and client parsing cost simultaneously.

3

Promise.all() for parallel queries — data and count queries run simultaneously instead of sequentially. Saves the full duration of whichever is faster (3ms here, more with complex counts).

4

Redis cache with write-through invalidation — 95% of requests served in 1ms with zero database involvement. Database load reduced by ~95% at steady state.

5

Connection pool — eliminates 22ms per-request connection overhead. Foundational — without a pool, every optimisation above would still lose ~22ms before a single query ran.

Performance Anti-Patterns to Avoid

Knowing what slows APIs down is as useful as knowing how to speed them up. These are the patterns that appear most often in the traces of genuinely slow APIs — each one is a specific mistake with a specific fix.

Anti-pattern What it costs Fix
N+1 queries Fetching a list of 50 items then querying the DB once per item — 51 queries instead of 1 JOIN or batch fetch with WHERE id IN (...)
Synchronous external calls Calling a third-party API (Stripe, SendGrid) inline during a request — adds their latency to yours Queue the work (Bull, SQS) and return 202 Accepted immediately
Missing pagination Returning every row in a table — one user with 10,000 records can OOM your server Always default LIMIT, cap max page size, return meta.total
Caching mutable auth data Caching permission lookups — a revoked user continues to have access until TTL expires Never cache auth checks, or use event-driven invalidation on role changes
Over-fetching in SELECT * Fetching and serialising columns the response never uses — wastes DB I/O, memory, and network Always name exactly the columns you need in every query

The N+1 query problem deserves special attention

N+1 is the single most common API performance bug. It looks like this: you fetch a list of 50 projects, then inside the loop you fetch the owner's name for each one — 50 additional queries. The fix is a single JOIN or a batch query with WHERE id IN (id1, id2, ...id50) that returns all 50 owners in one round trip. An ORM like Prisma or Sequelize can generate N+1 patterns automatically if you are not careful with eager loading — always check your query logs when using an ORM.

Choosing the Right Caching Layer

In-process memory cache

Store cached values in a JavaScript Map or a library like node-cache. Zero network overhead — sub-millisecond access.

Limitation: not shared across server instances. If you run three API servers, each has its own cache. Cache invalidation on write becomes complex — you would need to broadcast to all instances. Best for immutable reference data (config, feature flags) that rarely changes.

Redis (shared cache)

A single Redis instance serves all API servers. Cache invalidation on write works correctly regardless of how many instances are running — they all read from and write to the same cache.

Adds ~1ms network round trip but handles multi-instance deployments correctly. The right choice for any API that runs more than one server process — which is most production APIs. APIForge uses Redis from day one.

HTTP Caching with Cache-Control Headers

Everything above is server-side caching — the server avoids doing work it already did. But APIs can also instruct the client to cache responses, using standard HTTP headers. This pushes caching even further out — to the browser, to a CDN, or to the API client's own cache layer — and eliminates the round trip to your server entirely for repeat requests.

// WHAT: APIForge — HTTP Cache-Control headers on GET /projects
// Tells browsers, CDNs, and API clients how long to cache the response

async function getProjectsHandler(req, res) {
  // ... (fetch data with Redis + DB as before)

  // Cache-Control header breakdown:
  // private      — only the requesting client may cache this (not a shared CDN)
  //               because the data is user-specific
  // max-age=60   — client may use the cached copy for 60 seconds
  // stale-while-revalidate=30 — after max-age, client may serve the stale copy
  //               for up to 30 more seconds while fetching a fresh one in the background
  res.setHeader("Cache-Control", "private, max-age=60, stale-while-revalidate=30");

  // ETag: a fingerprint of the response body
  // Client sends it back as If-None-Match on subsequent requests
  // If the data has not changed, server returns 304 Not Modified — zero body transfer
  const etag = `"${Buffer.from(JSON.stringify(response)).length}-${Date.now()}"`;
  res.setHeader("ETag", etag);

  if (req.headers["if-none-match"] === etag) {
    return res.status(304).end();
  }

  res.json(response);
}

// For truly public, non-user-specific endpoints (e.g. GET /api/v1/status):
// res.setHeader("Cache-Control", "public, max-age=300, s-maxage=600");
// s-maxage applies to CDN caches specifically — longer than browser cache
First request — client has no cache: GET /api/v1/projects?page=1&limit=20 Response: Status: 200 OK Cache-Control: private, max-age=60, stale-while-revalidate=30 ETag: "1228-1752658927441" X-Cache: MISS Body: { data: [...], meta: {...} } (1.2KB) Second request — within 60 seconds (client serves from local cache): No network request made — browser/client returns cached copy instantly Latency: 0ms (no server involved at all) Third request — after 60 seconds (client revalidates): GET /api/v1/projects?page=1&limit=20 If-None-Match: "1228-1752658927441" Response (data unchanged): Status: 304 Not Modified Body: (empty) — 0 bytes transferred Latency: 8ms — just the round trip, no payload Response (data changed): Status: 200 OK ETag: "1356-1752659001882" (new fingerprint) Body: { data: [...] } — fresh data transferred

What just happened?

The Cache-Control: private, max-age=60 header tells the client it can serve this response from its own cache for 60 seconds without contacting the server at all. Zero latency, zero server load, zero bandwidth consumed for the most common case — a user refreshing a page they just loaded.

The ETag and 304 Not Modified pattern handles the case where the cache has expired but the data has not changed — the server confirms "still the same" with an 8ms round trip and zero bytes of body. Compared to sending 1.2KB of JSON, that is a meaningful saving at scale.

Try this: Open your browser DevTools, go to the Network tab, and hit an API endpoint twice. On the second request, look for "from cache" or status 304 in the response. If neither appears, the API is missing Cache-Control headers and the browser cannot cache the response even if the data is identical.

Quiz

1. The APIForge Backend team profiles their GET /projects endpoint and sees the trace breakdown: db.query_user 38ms, db.query_team 31ms, db.query_projects 821ms, serialize_response 44ms. Following the rule of measuring before optimising, which problem should they fix first and why?

2. An APIForge client cached a GET /projects response and received an ETag of "1228-abc". 90 seconds later it sends the same request with the header If-None-Match: "1228-abc". The project list has not changed. What does the server return and why is this better than a normal 200 response?

3. During a k6 load test at 300 concurrent users, the APIForge Prometheus dashboard shows waitingCount spiking to 280 on the database connection pool while query times remain fast at under 5ms each. What does this indicate and what should the team do?

Up Next
Mini Project: REST API
The APIForge Backend team builds a complete REST API from scratch — routes, middleware, auth, error handling, and a live test suite — applying every technique from the module in one cohesive project.