NoSQL Lesson 29 – Indexing in NoSQL | Dataplexa
Data Modeling & Design · Lesson 29

Indexing in NoSQL

A MongoDB collection with 500 million documents and no index on the field you are filtering by will scan every single document — 500 million reads — to return 20 results. Add the right index and that same query scans 20 entries. The difference is not incremental. It is the difference between a query that takes 8 minutes and one that takes 2 milliseconds. Indexing is the single highest-leverage performance tool in any NoSQL database, and most teams use it badly.

Indexes Are a Trade-off, Not a Free Lunch

Every index you create speeds up reads on the indexed field and slows down every write to that collection. On every insert and delete, MongoDB must update the document and every index on the collection; on updates, every index whose keys changed. A collection with 8 indexes can pay 9 write operations per document change — one for the document, eight for the indexes. At high write throughput, an over-indexed collection becomes the bottleneck.

The Golden Rule of Indexing

Index the fields you filter and sort on in your most frequent, most latency-sensitive queries. Do not index fields that are only queried occasionally — the write overhead is not worth the occasional read benefit. Every index should be justified by a specific, measured query pattern.

Index Types Across NoSQL Databases

Different NoSQL databases offer different index types. Understanding what each one does — and what it costs — prevents the common mistake of reaching for the most powerful type when a simpler one would do.

Index Type     Database                  Supports                                    Cost
Single field   MongoDB, DynamoDB (GSI)   Equality and range on one field             Low — one B-tree per index
Compound       MongoDB                   Multi-field filter + sort in one scan       Medium — larger B-tree, covers more queries
Multikey       MongoDB                   Querying inside arrays (each element)       Medium-high — one entry per array element
Text           MongoDB                   Full-text search, stemming, stop words      High — inverted index, large on disk
TTL            MongoDB                   Auto-expire documents after a time period   Low — background thread scans periodically
Secondary      Cassandra                 Filter on non-partition-key columns         High — distributed, requires querying all nodes
GSI / LSI      DynamoDB                  Alternative partition key for new queries   High — duplicates data, additional write capacity

Compound Indexes — The ESR Rule

In MongoDB, compound indexes cover multiple fields in a single B-tree. The order of fields in a compound index is critical — it determines which queries the index can serve. The ESR rule is the practical guide: put Equality fields first, then Sort fields, then Range fields last.

ESR in action — an order query:

// Query: orders for customer X, in status 'shipped', newest first
db.orders.find({
  customer_id: "cust_441",   // E — equality
  status: "shipped"         // E — equality
}).sort({ created_at: -1 });      // S — sort

// Correct index — E fields first, S field last:
db.orders.createIndex({
  customer_id: 1,     // E
  status: 1,         // E
  created_at: -1     // S
});

// WRONG order — sort field first violates ESR:
// { created_at: -1, customer_id: 1, status: 1 }
// MongoDB must walk the whole index in sort order and filter every
// entry, instead of seeking directly to the matching equality range
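To make the rule mechanical, here is a small, hypothetical Python helper (names invented for illustration) that sorts fields into ESR order. Python's stable sort keeps same-class fields in the order you wrote them:

```python
# Order compound-index fields by the ESR rule: Equality, Sort, Range.
ESR_RANK = {"equality": 0, "sort": 1, "range": 2}

def esr_order(fields):
    """fields: (name, usage, direction) tuples -> ordered index spec."""
    ranked = sorted(fields, key=lambda f: ESR_RANK[f[1]])  # stable sort
    return [(name, direction) for name, _, direction in ranked]

spec = esr_order([
    ("created_at",  "sort",     -1),
    ("customer_id", "equality",  1),
    ("status",      "equality",  1),
])
# Equality fields first, sort field last:
# [("customer_id", 1), ("status", 1), ("created_at", -1)]
```

Feed the result straight into createIndex and the field order matches the orders example above.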

Hands-on — Diagnosing and Fixing Slow Queries

The scenario: You are the backend engineer for a job board platform. The jobs listing page filters by city and job_type, then sorts by posted_at descending. There are 40 million job documents. The page is taking 6 seconds to load. Your first job is to understand why, then fix it.

// Step 1: diagnose with explain() — winningPlan reveals what MongoDB is doing
db.jobs.find(
  { city: "London", job_type: "Engineering" }
).sort(
  { posted_at: -1 }
).explain("executionStats")
{
  winningPlan: {
    stage: "SORT",
    inputStage: {
      stage: "COLLSCAN",       ← scanning all 40 million documents
      filter: { city: "London", job_type: "Engineering" }
    }
  },
  executionStats: {
    executionTimeMillis: 6240,
    totalDocsExamined:   40000000,   ← 40M docs scanned
    nReturned:           847          ← 847 results returned
  }
}
stage: "COLLSCAN"

Collection scan — MongoDB is reading every document in the collection because there is no index covering city or job_type. This is the worst possible execution plan. 40 million documents examined to return 847 results: an examination ratio of 47,000:1. The goal after indexing is an examination ratio close to 1:1.
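The examination ratio is simple arithmetic over two explain() counters. A quick sketch (function name is ours):

```python
# Sketch: compute the examination ratio from explain() counters.
def examination_ratio(stats: dict) -> float:
    """totalDocsExamined / nReturned — 1.0 is the ideal."""
    return stats["totalDocsExamined"] / max(stats["nReturned"], 1)

before = {"totalDocsExamined": 40_000_000, "nReturned": 847}
after  = {"totalDocsExamined": 847,        "nReturned": 847}
print(int(examination_ratio(before)))  # ~47,000:1 before indexing
print(examination_ratio(after))        # 1.0 after indexing
```

A check like this is easy to wire into a monitoring script that flags queries whose ratio drifts away from 1:1.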

stage: "SORT" on top of COLLSCAN

After scanning all 40 million documents, MongoDB must sort the 847 results by posted_at in memory. This is an in-memory sort — it consumes RAM and blocks until all matching documents are gathered. An index whose last field matches the sort order eliminates it entirely: index keys are stored pre-sorted, so results stream out in the correct order with no sort stage at all.

The scenario continues: You apply the ESR rule — two equality fields (city, job_type) first, then the sort field (posted_at) last. You create the index and re-run the query.

// Create the compound index following ESR order
// E: city (equality)
// E: job_type (equality)
// S: posted_at (sort — descending matches query sort direction)
db.jobs.createIndex(
  { city: 1, job_type: 1, posted_at: -1 },
  { name: "jobs_city_type_date_idx" }
)
"jobs_city_type_date_idx"
Index built successfully — scanned 40M documents to build the B-tree.
Build time: 48 seconds (one-time cost)
Index size on disk: 1.2 GB

Creating an index on an existing large collection takes time — MongoDB scans all documents to build the B-tree. This is a one-time cost. In production, avoid blocking builds: on MongoDB 4.0 and earlier, pass { background: true }; from 4.2 onward, index builds use an optimized process that locks the collection only briefly at the start and end. For replica sets, the rolling build procedure is safest — build the index on each secondary one at a time, then step down the primary. Never create indexes on a live primary without a plan.

Index size: 1.2 GB

Indexes consume disk space and — when hot — RAM (MongoDB's WiredTiger engine keeps frequently accessed index pages in cache). Monitor index size in proportion to your available RAM. If indexes exceed available RAM, index pages must be read from disk on every query — negating much of the performance benefit.
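A rough capacity check along those lines — the 80% headroom figure is our assumption for illustration, not a MongoDB default:

```python
# Rule of thumb sketch: hot indexes should fit comfortably inside the
# WiredTiger cache, with headroom left for data pages.
def indexes_fit_in_cache(total_index_bytes: int,
                         cache_bytes: int,
                         headroom: float = 0.8) -> bool:
    return total_index_bytes <= cache_bytes * headroom

GB = 1024 ** 3
print(indexes_fit_in_cache(int(1.2 * GB), 4 * GB))  # True  — plenty of room
print(indexes_fit_in_cache(int(1.2 * GB), 1 * GB))  # False — index pages spill to disk
```

Compare db.stats().totalIndexSize against your configured cache size to feed real numbers into a check like this.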

// Re-run explain() after creating the index
db.jobs.find(
  { city: "London", job_type: "Engineering" }
).sort(
  { posted_at: -1 }
).explain("executionStats")
{
  winningPlan: {
    stage: "FETCH",
    inputStage: {
      stage: "IXSCAN",         ← index scan (not collection scan)
      indexName: "jobs_city_type_date_idx",
      direction: "forward"
    }
  },
  executionStats: {
    executionTimeMillis: 2,      ← 6240ms → 2ms
    totalDocsExamined:   847,    ← exactly the docs returned
    nReturned:           847,
    totalKeysExamined:   847     ← 1:1 examination ratio ✓
  }
}
stage: "IXSCAN" — the target execution plan

Index scan — MongoDB walks the B-tree to find the 847 matching entries directly. No in-memory sort because the index stores posted_at in descending order, so the results already arrive in the correct order. The SORT stage is completely gone from the plan. 6,240ms → 2ms. This is the leverage that makes indexing worth understanding deeply.

totalDocsExamined: 847 = nReturned: 847

A 1:1 examination ratio — every document examined was returned. This is the ideal. Before indexing it was 47,000:1. This ratio is the single best indicator of index health — the closer it is to 1:1, the better your index is serving the query.

Secondary Indexes in Cassandra — Use With Caution

Cassandra's secondary indexes are fundamentally different from MongoDB's. In MongoDB, an index is a separate B-tree structure stored on the same node as the data. In Cassandra, a secondary index is a distributed index — each node maintains a local secondary index only for the data it owns. A query on a secondary index must hit every node in the cluster and merge results — a scatter-gather operation.

Cassandra Secondary Indexes — When They Are Acceptable

A Cassandra secondary index is acceptable only when: (1) the query always includes the partition key (so the secondary index lookup happens on one node, not all of them), AND (2) the indexed column has high cardinality. A secondary index on a low-cardinality column like status is catastrophically slow — it returns millions of rows from every node. The correct solution for a new query pattern is a new table, not a secondary index.
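The two criteria collapse into a single decision sketch (function name is ours, purely illustrative):

```python
# Encode the two acceptability criteria from the text above.
def cassandra_index_advice(query_includes_partition_key: bool,
                           indexed_column_high_cardinality: bool) -> str:
    if query_includes_partition_key and indexed_column_high_cardinality:
        return "secondary index acceptable"
    return "create a dedicated query table instead"

print(cassandra_index_advice(True, True))    # the only acceptable combination
print(cassandra_index_advice(False, True))   # scatter-gather — new table instead
```

Any other combination of answers points at a dedicated query table.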

The scenario: Your team has a Cassandra table of user sessions partitioned by user_id. The security team needs to quickly find all active sessions for a specific device fingerprint — without scanning all users. A secondary index on device_fingerprint would scatter across all nodes. The right solution is a dedicated lookup table.

-- WRONG approach: secondary index on device_fingerprint
-- Hits every node in the cluster, merges results — slow at scale
CREATE INDEX ON user_sessions (device_fingerprint);  -- avoid this

-- RIGHT approach: a dedicated lookup table for this query pattern
-- "Find all sessions for a given device fingerprint"
CREATE TABLE sessions_by_device (
  device_fingerprint TEXT,
  created_at         TIMESTAMP,
  session_id         UUID,
  user_id            TEXT,
  ip_address         TEXT,
  PRIMARY KEY (device_fingerprint, created_at, session_id)
) WITH CLUSTERING ORDER BY (created_at DESC, session_id ASC)
  AND default_time_to_live = 86400;  -- sessions expire after 24h
Created table sessions_by_device
Took 0.612 seconds

Query: "all active sessions on device fp_XK99"
  SELECT * FROM sessions_by_device
  WHERE device_fingerprint = 'fp_XK99'
  LIMIT 10;

  Hits: 1 partition on 1 node  ✓
  Query time: 1.8ms
New table instead of secondary index

The security query gets its own table shaped around the query — device_fingerprint as partition key, created_at as clustering column. The query hits one partition on one node. The trade-off: every session write must also write to this table. This is the standard Cassandra pattern — accept write duplication, guarantee single-partition reads.
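The dual-write itself is trivial application code. A pure-Python sketch using dicts as stand-ins for the two tables (no Cassandra driver involved; in production you would wrap both writes in a logged batch):

```python
# In-memory stand-ins for the two Cassandra tables described above.
sessions_by_user   = {}   # partition key: user_id
sessions_by_device = {}   # partition key: device_fingerprint

def write_session(session: dict) -> None:
    """Every session write fans out to both query tables."""
    sessions_by_user.setdefault(session["user_id"], []).append(session)
    sessions_by_device.setdefault(session["device_fingerprint"], []).append(session)

write_session({"user_id": "u1", "device_fingerprint": "fp_XK99", "session_id": "s1"})
write_session({"user_id": "u2", "device_fingerprint": "fp_XK99", "session_id": "s2"})

# The security query is now a single-partition lookup:
print(len(sessions_by_device["fp_XK99"]))  # 2
```

The write path does twice the work so that each read path touches exactly one partition.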

DynamoDB Global Secondary Indexes

DynamoDB's Global Secondary Index (GSI) is the DynamoDB equivalent of a separate query table — it projects a copy of selected attributes under a different partition key, allowing you to query the same data by a completely different attribute. Unlike Cassandra's secondary indexes, a GSI is maintained automatically by DynamoDB and provides single-partition performance for the indexed query.

The scenario: Your DynamoDB orders table is partitioned by customer_id. The warehouse team needs to query orders by warehouse_id to plan daily pick lists. This is a completely different access pattern — you add a GSI on warehouse_id.

import boto3

dynamodb = boto3.client('dynamodb', region_name='eu-west-1')

# Add a GSI to an existing table — DynamoDB backfills it automatically
dynamodb.update_table(
    TableName='orders',
    AttributeDefinitions=[
        {'AttributeName': 'warehouse_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_date',   'AttributeType': 'S'}
    ],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'warehouse-orders-idx',
            'KeySchema': [
                {'AttributeName': 'warehouse_id', 'KeyType': 'HASH'},
                {'AttributeName': 'order_date',   'KeyType': 'RANGE'}
            ],
            'Projection': {'ProjectionType': 'INCLUDE',
                           'NonKeyAttributes': ['status', 'item_count']},
            'ProvisionedThroughput': {'ReadCapacityUnits': 10,
                                      'WriteCapacityUnits': 10}
        }
    }]
)
GSI 'warehouse-orders-idx' creation initiated.
DynamoDB is backfilling the index from existing table data.
Status: CREATING → ACTIVE (typically 5-30 minutes for large tables)

Query via GSI:
  warehouse_id = 'wh_london_3'
  order_date between '2025-01-15' and '2025-01-15'
  → 142 orders returned in 4.1ms  ✓
ProjectionType: 'INCLUDE'

A GSI can project ALL attributes (full copy — maximum flexibility, maximum storage cost), KEYS_ONLY (just the primary key and index key — minimum storage), or INCLUDE (the index keys plus a specified list of attributes). For the warehouse pick list, the team only needs status and item_count alongside the keys — projecting only those two fields keeps the GSI small and read costs low.

GSI write cost — every order write also writes to the GSI

Every write to the base table that affects the GSI's projected attributes consumes write capacity on both the base table and the GSI. A table with three GSIs requires four write operations per item — one base, three indexes. Always factor GSI write capacity into your cost estimates, especially on high-throughput tables.
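The arithmetic in that warning, sketched with the scenario's numbers:

```python
# Each GSI whose projected attributes are touched adds one extra write
# (per the text above — base table plus one write per affected GSI).
def dynamo_writes_per_item(num_gsis: int) -> int:
    return 1 + num_gsis

orders_per_minute = 10_000
total = orders_per_minute * dynamo_writes_per_item(3)
print(total)  # 40,000 index-inclusive writes per minute
```

At that volume, every additional GSI adds another 10,000 billable writes per minute.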

TTL Indexes — Automatic Data Expiry

The scenario: Your platform stores user sessions in MongoDB. Sessions older than 30 days must be deleted. A nightly cron job was running deleteMany({ created_at: { $lt: thirtyDaysAgo } }) — but it was locking the collection for 40 seconds at 2 AM and causing latency spikes. You replace the cron job with a TTL index.

// TTL index — MongoDB auto-deletes documents after 30 days
// The background thread runs every 60 seconds checking for expired docs
db.sessions.createIndex(
  { created_at: 1 },
  { expireAfterSeconds: 2592000 }  // 30 days in seconds
)

// Verify the TTL index was created
db.sessions.getIndexes()
[
  { v: 2, key: { _id: 1 }, name: '_id_' },
  {
    v: 2, key: { created_at: 1 },
    name: 'created_at_1',
    expireAfterSeconds: 2592000
  }
]

TTL monitor running every 60 seconds.
Expired documents deleted in small batches — no collection lock.
No more 2 AM latency spikes ✓
expireAfterSeconds: 2592000

MongoDB's TTL monitor is a background thread that wakes up every 60 seconds and deletes documents where the TTL field value plus expireAfterSeconds is in the past. Deletions happen in small batches — no collection-level lock, no latency spike. The 60-second granularity means documents expire within a minute of their deadline, not exactly at it.
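The monitor's expiry test is just date arithmetic. A sketch of the check it performs on each document:

```python
from datetime import datetime, timedelta, timezone

EXPIRE_AFTER = timedelta(seconds=2_592_000)   # 30 days, as in the index above

def is_expired(created_at: datetime, now: datetime) -> bool:
    """A document expires once created_at + expireAfterSeconds has passed."""
    return created_at + EXPIRE_AFTER <= now

now = datetime(2025, 2, 15, tzinfo=timezone.utc)
print(is_expired(datetime(2025, 1, 1, tzinfo=timezone.utc), now))  # True  — over 30 days old
print(is_expired(datetime(2025, 2, 1, tzinfo=timezone.utc), now))  # False — only 14 days old
```

The TTL field must hold a BSON date for this to work — documents whose indexed field is missing or not a date are never expired.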

Teacher's Note

In Cassandra, reaching for a secondary index is almost always the wrong move. The right move is a new table. In MongoDB, the opposite is true — indexes are the primary optimisation tool and they work extremely well. Knowing which database you are in and applying the right mental model is what separates engineers who fix slow queries in 10 minutes from ones who spend three days adding secondary indexes that make things worse.

Practice Questions — You're the Engineer

Scenario:

You run explain("executionStats") on a slow MongoDB query. The output shows totalDocsExamined: 80000000 and nReturned: 312 — an examination ratio of 256,000:1. The query takes 9.4 seconds. You look at the winningPlan.stage field to confirm the problem. What stage name tells you MongoDB is reading every document in the collection?


Scenario:

A junior engineer creates a compound index as { posted_at: -1, city: 1, job_type: 1 } for the query find({ city: "London", job_type: "Engineering" }).sort({ posted_at: -1 }). The query still shows an in-memory SORT stage in the execution plan. You tell them the index field order violates a three-letter rule for ordering equality, sort, and range fields in compound indexes. What rule did they break?


Scenario:

Your Cassandra table is partitioned by user_id. A new product requirement needs you to query sessions by device_fingerprint — a field that is not the partition key. A colleague suggests adding a secondary index on device_fingerprint. You push back because a secondary index query without the partition key hits every node in the cluster. What should you create instead?


Quiz — Indexing in Production

Scenario:

Your e-commerce product search query filters by status: "active" and category: "Electronics", then returns products with price between £50 and £500. It currently runs as a COLLSCAN on 20 million products — taking 4.2 seconds. No indexes exist beyond _id. Applying the ESR rule, what compound index should you create?

Scenario:

Your MongoDB sessions collection has 200 million documents. A nightly cron job deletes sessions older than 7 days using deleteMany(). It locks the collection for 3 minutes every night at 2 AM, causing timeouts for users in Asia-Pacific. You want automatic, continuous expiry with no locking. What is the correct solution?

Scenario:

Your team has added five Global Secondary Indexes to a DynamoDB orders table to support five different access patterns. Write throughput is 10,000 orders per minute. Your AWS bill has tripled since the GSIs were added and you are getting throttling errors on writes. Your architect asks you to explain to the team why GSIs are expensive at this write volume.

Up Next · Lesson 30

Transactions in NoSQL

NoSQL said transactions were impossible. Then it quietly added them back. Here is exactly what you get, what it costs, and when to use them.