NoSQL Lesson 32 – Performance Optimization | Dataplexa
Data Modeling & Design · Lesson 32

NoSQL Performance Optimization

A slow NoSQL system is almost never a hardware problem. More RAM, faster disks, and bigger nodes mask symptoms for weeks before the same patterns resurface at the next scale milestone. Real performance fixes target root causes — the wrong partition key, a missing index, an unbounded scan, a consistency level set carelessly. This lesson is a systematic playbook for diagnosing and fixing performance problems in production NoSQL systems.

The Optimization Hierarchy

Performance problems in NoSQL systems almost always belong to one of four layers. Always start at the top — a schema fix gives orders-of-magnitude improvement. Hardware upgrades at the bottom give 20–50% improvement at 10× the cost.

1 — Schema & Data Model

Wrong partition key, missing table for access pattern, low-cardinality keys, unbounded partitions. Fix here = 100–10,000× improvement.

2 — Indexes

Missing index on filtered fields, wrong compound index field order (ESR), over-indexed collection adding write overhead. Fix here = 100–1,000× improvement.

3 — Query Patterns

N+1 query problems, fetching more data than needed, missing projections, scatter-gather queries that could hit one partition. Fix here = 5–50× improvement.

4 — Infrastructure & Configuration

JVM heap sizing, compaction strategy, connection pool sizing, caching configuration, consistency level tuning. Fix here = 20–50% improvement.

Diagnosing MongoDB Performance — The Toolkit

The scenario: You are on call at 2 AM. The ops dashboard shows MongoDB p99 query latency climbing from 8ms to 4.2 seconds over the past hour. Traffic is normal. You have four minutes before the SLA is breached. You need to identify the slow query, understand why it is slow, and apply a fix.

// Step 1: find currently running slow operations
// filter to active ops that have been running for more than 1 second
db.currentOp({ "active": true, "secs_running": { "$gt": 1 } })
{
  inprog: [
    {
      opid: 8821,
      op: "query",
      ns: "marketplace.orders",
      command: { find: "orders", filter: { status: "pending" },
                 sort: { created_at: -1 } },
      secs_running: 3,
      planSummary: "COLLSCAN",     ← full collection scan
      numYields: 1204,
      docsExamined: 48200000       ← 48M docs examined so far
    }
  ]
}
db.currentOp() — live query inspection

currentOp shows every operation currently executing on the server. secs_running > 1 filters to anything that has been running for more than a second — those are your immediate suspects. planSummary: "COLLSCAN" confirms the problem instantly: a full collection scan over 48 million orders. numYields: 1204 means the query has paused and released its locks over a thousand times to let other operations proceed — the signature of a long, I/O-bound scan that keeps contending for the collection.

Emergency kill while you fix the index

If a runaway query is blocking other operations, kill it with db.killOp(8821) to restore normal latency immediately. This buys time to apply the proper fix — the missing index — without the query continuing to degrade the cluster for another 3 minutes.
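This triage step can be scripted so it is ready before the next 2 AM page. A minimal sketch — the helper name and the 1-second threshold are my own choices, and automatically killing ops is deliberately left out:

```python
def find_slow_ops(inprog, min_secs=1):
    """Filter a currentOp-style op list down to long-running active queries.

    `inprog` is the list under the "inprog" key of db.currentOp() output
    (or the documents returned by a $currentOp aggregation).
    """
    return [
        op for op in inprog
        if op.get("active") and op.get("secs_running", 0) > min_secs
    ]

# Feeding it the output shown above surfaces the COLLSCAN query:
ops = [{"opid": 8821, "active": True, "secs_running": 3,
        "planSummary": "COLLSCAN"}]
suspects = find_slow_ops(ops)   # -> the opid 8821 operation
```

Against a live cluster you would populate the list from a `$currentOp` aggregation on the admin database (in pymongo, `client.admin.aggregate([{"$currentOp": {}}])`) and pass each suspect's `opid` to the `killOp` admin command — and kill only operations you have positively identified.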

// Step 2: confirm diagnosis with explain()
db.orders.find(
  { status: "pending" }
).sort(
  { created_at: -1 }
).explain("executionStats")

// Step 3: apply the fix — compound index following ESR rule
// E: status (equality)  S: created_at (sort)
db.orders.createIndex(
  { status: 1, created_at: -1 },
  { name: "orders_status_date_idx", background: true }
)
explain() before index:
  stage: COLLSCAN  |  docsExamined: 48200000  |  executionTimeMs: 4180

createIndex:
  "orders_status_date_idx" — building in background...
  Build complete in 62 seconds (one-time cost)

explain() after index:
  stage: IXSCAN  |  docsExamined: 1840  |  executionTimeMs: 3
  p99 latency restored to 6ms  ✓
background: true — safe for live clusters

Building an index in the foreground on a live collection holds a write lock for the entire build duration — completely blocking writes for 62 seconds. background: true builds the index while allowing concurrent reads and writes, at the cost of a slightly slower build. In MongoDB 4.2+ all index builds are concurrent by default, but on older versions always specify background: true on production clusters.

Diagnosing Cassandra Performance — The Toolkit

The scenario: Your Cassandra-backed order service has been running fine for eight months. Over the past two weeks, read latency on the orders_by_customer table has climbed steadily from 2ms to 180ms. No traffic change, no schema change. You suspect a growing data problem rather than a query problem.

# Check table-level read/write latency stats
nodetool tablestats orders_keyspace.orders_by_customer

# Check the partition size distribution — the tail reveals oversized partitions
# (cfhistograms is named tablehistograms in newer Cassandra releases)
nodetool cfhistograms orders_keyspace orders_by_customer
nodetool tablestats:
  Table: orders_by_customer
  Read Count:        8,204,112
  Read Latency:      184.2 ms avg  ← was 2ms eight months ago
  Write Count:       142,891,044
  Write Latency:     1.1 ms avg    ← writes still fast
  SSTable Count:     284           ← very high, compaction lagging
  Space Used (Live): 892 GB

nodetool cfhistograms (partition size):
  Percentile  Partition Size
  50%         12 KB
  75%         48 KB
  95%         8.4 MB
  99%         892 MB    ← enormous partitions at the tail
  Max:        4.2 GB    ← way beyond 100MB recommendation
SSTable Count: 284 — compaction is falling behind

Every time the memtable flushes, a new SSTable file is written. Compaction merges SSTables in the background. 284 SSTables means compaction cannot keep up with write volume — each read must consult more files, and each file costs a Bloom filter check and potentially a disk seek. Healthy Cassandra tables typically sit well under 20 SSTables; 284 is a compaction emergency.

Max partition: 4.2 GB — the real root cause

The 99th percentile partition at 892MB and the max at 4.2GB tell the real story. A handful of extremely large partitions are causing both problems: reads on large partitions must scan gigabytes of data, and compaction struggles to merge 4GB partition files efficiently. The partition key is unbounded — high-volume customers have accumulated years of order history in a single partition.
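This kind of growth is predictable long before it pages anyone. A back-of-envelope sketch — the per-row size and order rate are illustrative assumptions; measure yours from nodetool tablestats:

```python
def partition_size_mb(orders_per_day: int, days: int, avg_row_bytes: int) -> float:
    """Rough partition size for one customer, ignoring compression and overhead."""
    return orders_per_day * days * avg_row_bytes / 1_000_000

# A heavy customer: ~1,500 orders/day for 3 years at ~300 bytes per row
size = partition_size_mb(1_500, 3 * 365, 300)   # ~492 MB
needs_bucketing = size > 100                     # past the ~100MB guideline
```

Running this projection at schema-design time, against your largest expected tenant, tells you whether an unbounded partition key will blow past the ~100MB guideline within the table's lifetime.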

The scenario continues: You identify the root cause — the partition key is customer_id with no time bucketing. Your top 50 customers have been on the platform for three years and have accumulated millions of orders each. You redesign the partition key to add a monthly bucket, capping each partition at one month of orders per customer.

-- BEFORE: unbounded partition — grows forever per customer
CREATE TABLE orders_by_customer (
  customer_id TEXT,
  ordered_at  TIMESTAMP,
  order_id    UUID,
  total       DECIMAL,
  PRIMARY KEY (customer_id, ordered_at, order_id)
);

-- AFTER: monthly bucket — caps partition at ~30 days of orders per customer
-- order_month derived from ordered_at in application: "2025-01"
CREATE TABLE orders_by_customer_v2 (
  customer_id  TEXT,
  order_month  TEXT,        -- "YYYY-MM" bucket
  ordered_at   TIMESTAMP,
  order_id     UUID,
  total        DECIMAL,
  PRIMARY KEY ((customer_id, order_month), ordered_at, order_id)
) WITH CLUSTERING ORDER BY (ordered_at DESC, order_id ASC);
Created table orders_by_customer_v2
Took 0.644 seconds

After migration and compaction:
  SSTable Count:     8       ← down from 284
  Read Latency:      1.8ms   ← down from 184ms
  Max Partition:     18 MB   ← down from 4.2 GB  ✓
Monthly bucket as part of composite partition key

Adding order_month to the partition key creates one partition per customer per month. The busiest customer with 50,000 orders per month creates a partition of roughly 5–15MB — well within Cassandra's sweet spot. The application derives order_month from the order timestamp before writing. When querying the last 3 months, the app fires 3 partition queries and merges results.
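The application-side half of this design — deriving the bucket at write time and enumerating buckets at read time — can be sketched as follows (helper names are mine, not from any driver library):

```python
from datetime import datetime

def order_month(ts: datetime) -> str:
    """Derive the "YYYY-MM" partition bucket from an order timestamp."""
    return ts.strftime("%Y-%m")

def recent_months(n: int, today: datetime) -> list[str]:
    """Buckets covering the last n months, newest first — one partition query each."""
    year, month = today.year, today.month
    buckets = []
    for _ in range(n):
        buckets.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return buckets

# order_month(datetime(2025, 1, 17))       -> "2025-01"
# recent_months(3, datetime(2025, 1, 17))  -> ["2025-01", "2024-12", "2024-11"]
```

At read time the application issues one SELECT per bucket against orders_by_customer_v2 — each hitting exactly one partition — and merges the results in memory.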

The N+1 Query Problem — A Universal NoSQL Tax

The N+1 problem is not unique to SQL — it is just as common in NoSQL applications that use referencing. You fetch a list of N items, then fire one additional query per item to retrieve related data. The result: N+1 round trips to the database for what should be 1 or 2.

The scenario: A product listing page fetches 20 products, then for each product fires a separate query to get the seller's name. 21 database round trips per page load. Under load this becomes the primary bottleneck.

# BAD: N+1 — 1 query for products + 20 queries for sellers
products = list(db.products.find(
    {"category": "Electronics"},
    limit=20
))

for product in products:
    # One extra query per product — 20 extra round trips
    seller = db.sellers.find_one({"_id": product["seller_id"]})
    product["seller_name"] = seller["name"]

# GOOD: 2 queries total — collect IDs, fetch in one $in query
products = list(db.products.find(
    {"category": "Electronics"},
    limit=20
))
seller_ids = list({p["seller_id"] for p in products})

sellers = {
    s["_id"]: s for s in
    db.sellers.find({"_id": {"$in": seller_ids}})
}
for product in products:
    product["seller_name"] = sellers[product["seller_id"]]["name"]
N+1 approach:   21 round trips  |  total latency: 62ms
$in approach:    2  round trips  |  total latency: 4ms

At 500 req/sec: N+1 = 10,500 DB queries/sec
                $in  = 1,000  DB queries/sec  ✓
$in with a set of IDs — the standard fix

The $in operator fetches all matching documents in a single query. Building a set of seller IDs first (a Python set comprehension deduplicates them), then fetching all sellers in one shot, collapses 20 queries into 1. The sellers collection is indexed on _id, so MongoDB resolves the 20 lookups with cheap index seeks inside that one round trip. The extended reference pattern from Lesson 25 goes further still — had you embedded seller_name in the product document, you would need zero extra queries.
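The round-trip arithmetic above generalizes to any request rate. A small sketch of the cost of the three patterns — the function and its pattern labels are illustrative, not library API:

```python
def db_queries_per_sec(req_per_sec: int, page_size: int, pattern: str) -> int:
    """Database round trips per second for each access pattern."""
    trips = {
        "n_plus_1": 1 + page_size,   # 1 list query + 1 query per item
        "in_batch": 2,               # 1 list query + 1 $in batch query
        "embedded": 1,               # extended reference: no extra queries
    }
    return req_per_sec * trips[pattern]

# At 500 req/sec with 20 products per page:
# n_plus_1 -> 10,500 queries/sec, in_batch -> 1,000, embedded -> 500
```

The ratio between patterns is fixed by page size, so the N+1 tax grows linearly with traffic — which is why it only becomes visible under load.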

Connection Pool Sizing — The Hidden Bottleneck

Application servers connect to database nodes through a connection pool. If the pool is too small, application threads queue waiting for a free connection — query latency climbs even though the database itself is underloaded. If the pool is too large, the database is overwhelmed with concurrent connections — each connection consumes memory and CPU on the database node.

The scenario: Your MongoDB service has 10 application server instances. Database CPU is at 15%. Application server threads are waiting for connections 40% of the time. Adding more database nodes has not helped. The problem is pool sizing, not database capacity.

from pymongo import MongoClient

# BEFORE: pool left at the pymongo default of 100 per client
# maxPoolSize applies per MongoClient (per app server), NOT cluster-wide
# With ~500 request threads per server, ~400 threads queue for a connection
client_bad = MongoClient(
    "mongodb://localhost:27017/",
    maxPoolSize=100    # the default — never sized to the workload
)

# AFTER: sized to match actual concurrency
# Rule of thumb: maxPoolSize ≈ (peak concurrent DB operations per server) * 1.2
# Profiling shows ~50 queries in flight per server at peak: 50 * 1.2 = 60
client_good = MongoClient(
    "mongodb://localhost:27017/",
    maxPoolSize=60,       # sized to actual concurrency
    minPoolSize=10,       # keep warm connections ready
    waitQueueTimeoutMS=2000  # fail fast rather than hang
)

# Monitor pool usage — check server status
db = client_good["marketplace"]
server_status = db.command("serverStatus")
pool_stats = server_status["connections"]
print(f"Current: {pool_stats['current']}  "
      f"Available: {pool_stats['available']}  "
      f"Total created: {pool_stats['totalCreated']}")
Before (maxPoolSize=100, 10 servers):
  Connections in use: 100/100 — SATURATED
  Thread wait queue: 487 threads waiting  ← bottleneck
  p99 latency: 2,400ms

After (maxPoolSize=60, 10 servers = 600 total DB connections):
  Connections in use: 48/60 — healthy headroom
  Thread wait queue: 0
  p99 latency: 11ms  ✓

DB server connections: current=612  available=388  (1000 max)
waitQueueTimeoutMS: 2000

If all pool connections are in use, a new request waits up to waitQueueTimeoutMS for a free connection before raising a timeout error. Without this, requests queue indefinitely — a traffic spike causes threads to pile up, consuming server memory until the process crashes. 2 seconds is a reasonable ceiling: fast enough to fail before the HTTP request timeout triggers, slow enough to allow brief spikes to clear.

Total DB connections = maxPoolSize × app server count

With 10 app servers at maxPoolSize=60, the database receives up to 600 connections total. MongoDB's default connection limit (net.maxIncomingConnections) is 65,536 — not the constraint here. The constraint is the database's memory: each MongoDB connection costs roughly 1MB. 600 connections ≈ 600MB of connection overhead on the database server. Size your pool to match actual concurrency, not the maximum the database can theoretically accept.
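That sizing arithmetic is worth encoding as a pre-deployment sanity check. A sketch — the ~1MB-per-connection figure comes from the text above, and the 65,536 ceiling is MongoDB's default net.maxIncomingConnections:

```python
def pool_check(max_pool_size, app_servers, peak_threads_per_server,
               conn_limit=65_536, mb_per_conn=1):
    """Sanity-check a pool configuration before rolling it out."""
    total = max_pool_size * app_servers
    return {
        "total_db_connections": total,
        "connection_memory_mb": total * mb_per_conn,
        "within_server_limit": total < conn_limit,
        # Threads beyond the per-server pool will queue for a connection:
        "threads_that_will_queue": max(0, peak_threads_per_server - max_pool_size),
    }

# 10 app servers, maxPoolSize=60, ~50 peak concurrent DB ops each:
# -> 600 total connections, ~600 MB overhead, no thread queueing
```

Run it against both your steady-state and worst-case thread counts; a non-zero `threads_that_will_queue` at peak is the 2 AM latency spike waiting to happen.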

Compaction Strategy in Cassandra — Choosing the Right Mode

Cassandra's compaction strategy determines how SSTables are merged. The wrong strategy for your workload pattern is one of the most common sources of read latency degradation over time.

STCS — SizeTieredCompactionStrategy
  Best for:            write-heavy workloads, time-series inserts
  Avoid when:          read-heavy with random access — SSTables multiply
  Space amplification: high — needs up to 2× free space during compaction

LCS — LeveledCompactionStrategy
  Best for:            read-heavy tables, frequent updates, consistent latency
  Avoid when:          very high write throughput — compaction can't keep up
  Space amplification: low — small fixed-size SSTables, predictable reads

TWCS — TimeWindowCompactionStrategy
  Best for:            time-series with TTL — IoT, metrics
  Avoid when:          data has no clear time windows or no TTL
  Space amplification: low — expired time windows dropped as whole files
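The decision table above reduces to a short rule of thumb. A sketch — the workload flags and their priority order are my own simplification, not official Cassandra guidance:

```python
def choose_compaction(read_heavy: bool, time_series: bool, has_ttl: bool) -> str:
    """Rule-of-thumb mapping from workload shape to compaction strategy."""
    if time_series and has_ttl:
        return "TimeWindowCompactionStrategy"   # whole windows expire as files
    if read_heavy:
        return "LeveledCompactionStrategy"      # predictable read latency
    return "SizeTieredCompactionStrategy"       # cheapest for heavy ingest

# An IoT metrics table with TTL -> "TimeWindowCompactionStrategy"
# A read-heavy profile table    -> "LeveledCompactionStrategy"
```

The strategy is set per table in CQL via `ALTER TABLE ... WITH compaction = {'class': ...}`; changing it triggers a recompaction of existing SSTables, so schedule the switch off-peak.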

A Performance Optimization Checklist

MongoDB

Run explain("executionStats") on every slow query
Check for COLLSCAN — add compound index following ESR
Check totalDocsExamined / nReturned ratio — target 1:1
Look for N+1 patterns — replace loops with $in
Check db.currentOp() during latency spikes
Verify connection pool is sized to actual concurrency
Add projections — fetch only the fields you need
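Several of the MongoDB checks above can be automated against explain("executionStats") output. A sketch — the field names follow MongoDB's explain document format, but the 10:1 ratio threshold is an assumed cutoff, not an official one:

```python
def audit_explain(stats: dict, max_ratio: float = 10.0) -> list[str]:
    """Flag common problems in an explain("executionStats") result."""
    problems = []
    exec_stats = stats["executionStats"]
    plan = stats["queryPlanner"]["winningPlan"]
    if "COLLSCAN" in str(plan):
        problems.append("COLLSCAN: add a compound index (ESR order)")
    returned = exec_stats.get("nReturned", 0)
    examined = exec_stats.get("totalDocsExamined", 0)
    if returned and examined / returned > max_ratio:
        problems.append(f"examined/returned = {examined // returned}:1 (target 1:1)")
    return problems

# The pathological query from earlier in this lesson trips both checks:
sample = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {"nReturned": 1840, "totalDocsExamined": 48_200_000},
}
```

Wiring a check like this into CI — run explain on each known query shape against a staging dataset — catches missing indexes before they reach a 48-million-document production collection.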

Cassandra

Run nodetool tablestats — check read/write latency
Run nodetool cfhistograms — check partition size tail
SSTable count > 20 — trigger manual compaction
Max partition > 100MB — add time bucket to partition key
Check nodetool status — verify even data distribution
Review compaction strategy — TWCS for time-series with TTL
Never use ALLOW FILTERING — create a dedicated table instead

Teacher's Note

The fastest path to a performance improvement is almost always finding one query that is doing a full scan and adding the right index or table. Before tuning JVM settings, compaction strategies, or anything at the infrastructure layer — spend ten minutes with explain() in MongoDB or nodetool cfhistograms in Cassandra. The schema fix is almost always hiding in plain sight, and it is worth 100× more than any configuration tweak.

Practice Questions — You're the Engineer

Scenario:

You run db.currentOp() during a latency spike and find a query that has been running for 4 seconds. It shows planSummary: "COLLSCAN" and docsExamined: 61,000,000. You also notice a field in the output that shows the query has paused and resumed over 2,400 times to yield execution to other operations — evidence that it is blocking concurrent reads. What is this field called?


Scenario:

Your Cassandra table stores order history partitioned by customer_id. After 18 months in production, nodetool cfhistograms shows the 99th percentile partition is 2.1GB. Your five biggest enterprise customers each have a 4GB+ partition and their reads take 12 seconds. You need to bound partition growth without changing the access pattern. You add a second field to the composite partition key derived from the order date — limiting each partition to one month of data. What is this technique called?


Scenario:

Your product listing API returns 20 products per page. A code review reveals that for each product returned, the handler fires a separate db.sellers.find_one({"_id": product["seller_id"]}) call to retrieve the seller name. Under load at 300 requests per second, your MongoDB server is receiving 6,300 queries per second instead of the 600 it should. You need to collapse the 20 individual seller queries into a single query. Which MongoDB operator do you use?


Quiz — Performance Optimization in Production

Scenario:

Your Cassandra user profile table has been live for a year. nodetool tablestats shows read latency at 340ms — up from 4ms at launch. Write latency is still 1ms. SSTable count is 312. The table receives mostly reads with occasional profile updates. The partition sizes are healthy (under 1MB each). What is the most likely root cause and the correct fix?

Scenario:

Your MongoDB service runs across 8 application server instances. Database CPU is at 12%. Network throughput is normal. But p99 API latency has climbed to 3.4 seconds. db.serverStatus() shows connections.current: 100 and connections.available: 0. Application server logs show threads waiting an average of 3.1 seconds for a database connection. Adding more MongoDB nodes has not helped. What is the root cause?

Scenario:

A Cassandra cluster is degrading under load. An infrastructure engineer immediately proposes upgrading all 12 nodes from 64GB to 128GB RAM — at a cost of £40,000. You push back. Following the optimization hierarchy, what should be investigated first, and why is hardware the wrong first step?

Up Next · Lesson 33

NoSQL in Microservices

Each microservice owns its own database — and the decisions you make about which database each service gets determine whether your system scales or becomes a distributed monolith.