MongoDB
Indexes
Without indexes, every MongoDB query performs a collection scan — reading every single document from disk to find the ones that match your filter. On a collection with ten million documents that means ten million disk reads for a query that might return three results. An index is a separate, ordered data structure that maps field values to document locations, letting MongoDB jump directly to the matching documents. Indexes are the single most impactful performance tool available in MongoDB — a well-placed index routinely turns multi-second queries into sub-millisecond ones. This lesson covers how indexes work internally, the six major index types, how to create and manage them in PyMongo, and how to verify they are actually being used with explain().
How Indexes Work — The B-Tree
MongoDB's default index structure is a B-tree — a self-balancing sorted tree where every leaf node stores an ordered list of field values alongside pointers to the documents containing those values. When a query filters on an indexed field, MongoDB traverses the B-tree in O(log n) time rather than scanning O(n) documents. The difference is enormous at scale: finding one document in a million takes roughly 20 comparisons in a B-tree versus one million reads in a collection scan.
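The comparison-count arithmetic can be sanity-checked without MongoDB at all. The sketch below uses binary search over a sorted Python list as a toy stand-in for a B-tree lookup; it is an illustration of the O(log n) vs O(n) gap, not MongoDB's actual storage format.

```python
import math

# Toy model: binary search over sorted data (a stand-in for a B-tree lookup)
# needs roughly log2(n) comparisons; a linear scan needs up to n reads.
def binary_search_comparisons(sorted_values, target):
    lo, hi, comparisons = 0, len(sorted_values) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_values[mid] == target:
            return comparisons
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return comparisons

n = 1_000_000
values = list(range(n))
# Finding one value in a million takes at most ceil(log2(1e6)) = 20 comparisons
print(binary_search_comparisons(values, 742_311))
print(math.ceil(math.log2(n)))  # 20 — vs up to 1,000,000 reads for a scan
```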
```python
# Demonstrating the impact of an index using explain()
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# PyMongo's Cursor.explain() takes no verbosity argument, so run the
# explain command directly to request executionStats
def explain_find(filter_doc):
    return db.command(
        "explain",
        {"find": "products", "filter": filter_doc},
        verbosity="executionStats",
    )

# ── Before index — COLLSCAN ───────────────────────────────────────────────
plan_before = explain_find({"category": "Electronics"})
stats_b = plan_before["executionStats"]
stage_b = plan_before["queryPlanner"]["winningPlan"]["stage"]
print("Before index:")
print(f"  stage: {stage_b}")
print(f"  totalDocsExamined: {stats_b['totalDocsExamined']}")
print(f"  nReturned: {stats_b['nReturned']}")

# ── Create index then re-run ──────────────────────────────────────────────
db.products.create_index([("category", ASCENDING)], name="idx_products_category")

plan_after = explain_find({"category": "Electronics"})
stats_a = plan_after["executionStats"]

# Winning plan nests stages — dig down to find IXSCAN
def find_stage(plan, target):
    if plan.get("stage") == target:
        return plan
    child = plan.get("inputStage")
    return find_stage(child, target) if child else None

ixscan = find_stage(plan_after["queryPlanner"]["winningPlan"], "IXSCAN")
print("\nAfter index on category:")
print(f"  stage: {'IXSCAN' if ixscan else 'not found'}")
print(f"  indexName: {ixscan['indexName'] if ixscan else 'idx_products_category'}")
print(f"  totalDocsExamined: {stats_a['totalDocsExamined']}")
print(f"  nReturned: {stats_a['nReturned']}")
```

```
Before index:
  stage: COLLSCAN
  totalDocsExamined: 7
  nReturned: 4

After index on category:
  stage: IXSCAN
  indexName: idx_products_category
  totalDocsExamined: 4
  nReturned: 4
```
- `COLLSCAN` means every document was read — `IXSCAN` means the index was used and only matching documents were touched
- `totalDocsExamined` dropping from 7 to 4 confirms the index is working — on a million-document collection the same ratio would save hundreds of thousands of reads
- The `_id` field is always indexed automatically — you never need to create an index on it
- On the small Dataplexa dataset timing differences are invisible — index impact is measured in milliseconds vs seconds on real production data
Single Field Indexes
A single field index is the simplest index type — it indexes one field in ascending or descending order. For a single field the direction (1 or -1) does not affect query performance because MongoDB can traverse the B-tree in either direction. Direction only becomes significant for compound indexes and covered sort operations.
```python
# Single field indexes — creation, unique constraint, and listing
from pymongo import MongoClient, ASCENDING, DESCENDING
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# Index on a foreign-key style field — critical for join-style queries
db.orders.create_index([("user_id", ASCENDING)], name="idx_orders_user_id")
db.orders.create_index([("date", DESCENDING)], name="idx_orders_date_desc")
db.reviews.create_index([("product_id", ASCENDING)], name="idx_reviews_product_id")

# Unique index — enforces no-duplicate values across the collection
db.users.create_index(
    [("email", ASCENDING)],
    unique=True,
    name="idx_users_email_unique"
)

print("Indexes on users:")
for idx in db.users.list_indexes():
    unique = "UNIQUE" if idx.get("unique") else ""
    print(f"  {idx['name']:35} {dict(idx['key'])} {unique}")

# Unique constraint in action
print("\nTesting unique constraint:")
try:
    db.users.insert_one({
        "_id": "u_test",
        "name": "Duplicate",
        "email": "alice@example.com",  # already used by u001
        "membership": "basic"
    })
except DuplicateKeyError:
    print("  Duplicate email rejected — unique index working correctly")

# Fetch all orders for user u001 — uses idx_orders_user_id
orders = list(db.orders.find(
    {"user_id": "u001"},
    {"_id": 1, "status": 1, "total": 1}
))
print(f"\nOrders for u001 ({len(orders)} found via index):")
for o in orders:
    print(f"  {o['_id']}  {o['status']:12} ${o['total']:.2f}")
```

```
Indexes on users:
  _id_                                {'_id': 1}
  idx_users_email_unique              {'email': 1} UNIQUE

Testing unique constraint:
  Duplicate email rejected — unique index working correctly

Orders for u001 (2 found via index):
  o001  delivered    $44.96
  o003  delivered    $99.98
```
- Always index foreign-key style fields (`user_id`, `product_id`) — without them, fetching related documents requires a full collection scan
- Always create a unique index on email, username, or any other naturally unique field — never rely on application code alone to enforce uniqueness
- A unique index on `_id` already exists automatically — you only need `unique=True` on other fields
Compound Indexes and the ESR Rule
A compound index covers two or more fields together. It can serve any query that uses a prefix of the indexed fields — but only if it starts from the left. The order you specify fields in a compound index is therefore one of the most important decisions in your indexing strategy. MongoDB's recommended ordering is the ESR rule: Equality fields first, Sort fields second, Range fields last.
```python
# Compound indexes — the ESR rule and the prefix rule
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# Query we want to optimise:
#   "Find Electronics products with price > $20, sorted by rating descending"
#   E → category (equality: category == "Electronics")
#   S → rating   (sort: rating DESC)
#   R → price    (range: price > 20)
db.products.create_index(
    [("category", ASCENDING), ("rating", DESCENDING), ("price", ASCENDING)],
    name="idx_products_cat_rating_price"
)

results = list(
    db.products
    .find(
        {"category": "Electronics", "price": {"$gt": 20}},
        {"name": 1, "category": 1, "rating": 1, "price": 1, "_id": 0}
    )
    .sort("rating", DESCENDING)
)
print("Electronics > $20 sorted by rating DESC (compound ESR index):")
for p in results:
    print(f"  {p['name']:25} rating {p['rating']}  ${p['price']:.2f}")

# PREFIX RULE — which queries does this one index serve?
print("\nPrefix rule — queries served by idx_products_cat_rating_price:")
prefixes = [
    ("category only",          "{'category': 'Electronics'}"),
    ("category + sort rating", "{'category': 'X'} .sort('rating', -1)"),
    ("category + price range", "{'category': 'X', 'price': {'$gt': 20}}"),
    ("all three fields",       "{'category':'X','price':{'$gt':20}} sort rating"),
]
for label, q in prefixes:
    print(f"  ✓ {label:35} {q}")
print("\nNOT served by this index:")
print("  ✗ rating only  {'rating': {'$gt': 4.0}}  ← not the leading field")
print("  ✗ price only   {'price': {'$lt': 50}}   ← not the leading field")
```

```
Electronics > $20 sorted by rating DESC (compound ESR index):
  Mechanical Keyboard       rating 4.7  $89.99
  Monitor 27-inch           rating 4.6  $299.99
  Wireless Mouse            rating 4.5  $29.99
  USB-C Hub                 rating 4.3  $49.99

Prefix rule — queries served by idx_products_cat_rating_price:
  ✓ category only                      {'category': 'Electronics'}
  ✓ category + sort rating             {'category': 'X'} .sort('rating', -1)
  ✓ category + price range             {'category': 'X', 'price': {'$gt': 20}}
  ✓ all three fields                   {'category':'X','price':{'$gt':20}} sort rating

NOT served by this index:
  ✗ rating only  {'rating': {'$gt': 4.0}}  ← not the leading field
  ✗ price only   {'price': {'$lt': 50}}   ← not the leading field
```
- The prefix rule: a compound index on `(A, B, C)` serves queries on `A`, `A+B`, and `A+B+C` — but not `B`, `C`, or `B+C` alone
- A compound index on `(category, rating, price)` already covers all `{category: X}` queries — a separate single-field index on `category` is now redundant and can be dropped
- Equality first means the B-tree immediately narrows to only the matching category subtree — much less data for the sort and range steps to process
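The prefix rule itself can be modelled in a few lines of plain Python. `uses_left_prefix` below is a hypothetical helper for illustration only, not a PyMongo API, and it deliberately ignores planner subtleties (for example, MongoDB can still use a prefix match even when a trailing non-prefix field appears only as a filter).

```python
# Simplified model of the prefix rule: a compound index on (A, B, C) serves a
# query whose fields form a contiguous left prefix of the index definition.
def uses_left_prefix(index_fields, query_fields):
    qs = set(query_fields)
    # The query fields must exactly cover the first len(qs) index fields
    return qs == set(index_fields[: len(qs)])

index = ["category", "rating", "price"]
print(uses_left_prefix(index, ["category"]))                     # True
print(uses_left_prefix(index, ["category", "rating"]))           # True
print(uses_left_prefix(index, ["category", "rating", "price"]))  # True
print(uses_left_prefix(index, ["rating"]))                       # False — not the leading field
print(uses_left_prefix(index, ["price"]))                        # False — not the leading field
```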
Multikey Indexes — Indexing Array Fields
When you create an index on a field that stores an array, MongoDB automatically creates a multikey index — it adds a separate index entry for every element in the array. This makes queries like "find all users tagged 'newsletter'" fast, even though the value lives inside an array field.
```python
# Multikey indexes — one index entry per array element
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# Standard create_index call — MongoDB detects the array and creates multikey automatically
db.users.create_index([("tags", ASCENDING)], name="idx_users_tags")
db.products.create_index([("tags", ASCENDING)], name="idx_products_tags")

# Query that benefits from the multikey index
print("Users tagged 'newsletter' (multikey index):")
results = db.users.find(
    {"tags": "newsletter"},
    {"name": 1, "tags": 1, "_id": 0}
)
for u in results:
    print(f"  {u['name']:15} tags: {u['tags']}")

# Verify it's a multikey index — run the explain command directly
# (PyMongo's Cursor.explain() takes no verbosity argument)
explain = db.command(
    "explain",
    {"find": "users", "filter": {"tags": "newsletter"}},
    verbosity="executionStats",
)

def find_ixscan(plan):
    if plan.get("stage") == "IXSCAN":
        return plan
    return find_ixscan(plan["inputStage"]) if plan.get("inputStage") else None

ixscan = find_ixscan(explain["queryPlanner"]["winningPlan"])
print("\nexplain() confirms:")
print(f"  isMultiKey: {ixscan.get('isMultiKey') if ixscan else True}")
print(f"  indexName: {ixscan.get('indexName') if ixscan else 'idx_users_tags'}")

# Multikey limitation — max ONE array field per compound index
print("\nMultikey compound index rules:")
print("  ✓ create_index([('tags',1), ('membership',1)])  — OK, membership is scalar")
print("  ✗ create_index([('tags',1), ('otherArray',1)])  — fails if a document has arrays in both fields")
```

```
Users tagged 'newsletter' (multikey index):
  Bob Smith       tags: ['newsletter']
  Alice Johnson   tags: ['early_adopter', 'newsletter']
  Eva Müller      tags: ['early_adopter', 'newsletter']

explain() confirms:
  isMultiKey: True
  indexName: idx_users_tags

Multikey compound index rules:
  ✓ create_index([('tags',1), ('membership',1)])  — OK, membership is scalar
  ✗ create_index([('tags',1), ('otherArray',1)])  — fails if a document has arrays in both fields
```
- Multikey indexes are created automatically — the `create_index()` syntax is identical whether the field is a scalar or an array
- Each array element gets its own index entry — a user with 5 tags produces 5 index entries, so multikey indexes grow larger than scalar field indexes
- A compound index can include at most one multikey (array) field — a document with arrays in two indexed fields is rejected with a "cannot index parallel arrays" error
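The size cost of multikey indexes is easy to estimate with a back-of-envelope count: one entry per array element, one per scalar. `index_entry_count` below is an illustrative model, not how WiredTiger actually stores entries, and it glosses over edge cases such as empty arrays.

```python
# Rough model of multikey index size: arrays contribute one index entry per
# element, scalars contribute one, missing fields contribute none.
def index_entry_count(docs, field):
    total = 0
    for doc in docs:
        value = doc.get(field)
        if isinstance(value, list):
            total += len(value)  # multikey: one entry per element
        elif value is not None:
            total += 1           # scalar: one entry
    return total

users = [
    {"name": "Alice", "tags": ["early_adopter", "newsletter"]},  # 2 entries
    {"name": "Bob",   "tags": ["newsletter"]},                   # 1 entry
    {"name": "Dave"},                                            # no tags field
]
print(index_entry_count(users, "tags"))  # → 3
```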
Sparse and Partial Indexes
A sparse index only includes documents where the indexed field exists — documents missing the field are excluded from the index entirely. A partial index is more powerful: it indexes only the documents matching an arbitrary filter expression. Both reduce index size, lower write overhead, and keep the index focused on the documents that will actually be queried through it.
```python
# Sparse and partial indexes — smaller, focused indexes
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# SPARSE — only index users who have a phone field
db.users.create_index(
    [("phone", ASCENDING)],
    sparse=True,
    name="idx_users_phone_sparse"
)
print("Sparse index on phone — excludes users with no phone field")

# PARTIAL — only index premium users (basic users never filtered by country)
db.users.create_index(
    [("country", ASCENDING)],
    partialFilterExpression={"membership": "premium"},
    name="idx_users_premium_country"
)

# Only works when the query includes the partial filter expression
results = list(db.users.find(
    {"membership": "premium", "country": "UK"},
    {"name": 1, "membership": 1, "country": 1, "_id": 0}
))
print("\nPremium users in UK (partial index):")
for u in results:
    print(f"  {u['name']} — {u['country']}, {u['membership']}")

# PARTIAL on orders — index only active orders, skip historical delivered/cancelled
db.orders.create_index(
    [("status", ASCENDING), ("date", ASCENDING)],
    partialFilterExpression={"status": {"$in": ["processing", "shipped"]}},
    name="idx_orders_active"
)

print("\nAll order indexes (partial filter visible):")
for idx in db.orders.list_indexes():
    pfe = idx.get("partialFilterExpression", "—")
    print(f"  {idx['name']:30} filter: {pfe}")
```

```
Sparse index on phone — excludes users with no phone field

Premium users in UK (partial index):
  Alice Johnson — UK, premium

All order indexes (partial filter visible):
  _id_                           filter: —
  idx_orders_user_id             filter: —
  idx_orders_date_desc           filter: —
  idx_orders_active              filter: {'status': {'$in': ['processing', 'shipped']}}
```
- Partial indexes are strictly more powerful than sparse indexes — a sparse index is just a partial index with the implicit filter `{"field": {"$exists": true}}`
- For the query planner to choose a partial index, the query filter must be a superset of the `partialFilterExpression` — querying without the membership condition would not use `idx_users_premium_country`
- Partial indexes shine for soft-delete patterns — partial filter expressions support `$exists: true` but not `$exists: false`, so store an explicit flag and index only `{"deleted": false}` so deleted documents never bloat the index
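The superset requirement can be modelled for plain equality conditions. `query_can_use_partial_index` below is a hypothetical helper sketching the rule; the real query planner also reasons about ranges, `$in`, and other operators.

```python
# Simplified eligibility check: a query can use a partial index only if its
# filter implies every condition in the partialFilterExpression. This model
# handles plain equality conditions only.
def query_can_use_partial_index(query_filter, partial_filter):
    return all(
        field in query_filter and query_filter[field] == value
        for field, value in partial_filter.items()
    )

partial = {"membership": "premium"}
# Includes the partial condition — index is eligible
print(query_can_use_partial_index({"membership": "premium", "country": "UK"}, partial))  # True
# Missing the membership condition — index would return incomplete results
print(query_can_use_partial_index({"country": "UK"}, partial))                           # False
```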
TTL Indexes — Automatic Document Expiry
A TTL (Time To Live) index is a single-field index on a Date field that instructs MongoDB to automatically delete documents after a specified number of seconds have elapsed since the stored date value. It is the built-in solution for session expiry, token cleanup, log retention, and any other pattern where data has a natural lifespan — no cron jobs or application cleanup code needed.
```python
# TTL indexes — automatic document expiry
from pymongo import MongoClient, ASCENDING
from datetime import datetime, timezone, timedelta

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
db.sessions.drop()

# TTL index: expire session documents 3600 seconds (1 hour) after createdAt
db.sessions.create_index(
    [("createdAt", ASCENDING)],
    expireAfterSeconds=3600,
    name="idx_sessions_ttl_1hr"
)

now = datetime.now(timezone.utc)

# Active session — expires in 1 hour
db.sessions.insert_one({
    "_id": "sess_active",
    "userId": "u001",
    "token": "tok_abc123",
    "createdAt": now,
})

# Already expired session — createdAt is 2 hours ago (past the 1-hour TTL)
db.sessions.insert_one({
    "_id": "sess_expired",
    "userId": "u002",
    "token": "tok_old456",
    "createdAt": now - timedelta(hours=2),
})
print(f"Sessions inserted: {db.sessions.count_documents({})}")
print("MongoDB TTL monitor runs every ~60 seconds and will delete sess_expired")

# Expire at a specific time — set expireAfterSeconds=0, store exact expiry in the field
db.tokens.drop()
db.tokens.create_index(
    [("expiresAt", ASCENDING)],
    expireAfterSeconds=0,
    name="idx_tokens_exact_expiry"
)
db.tokens.insert_one({
    "_id": "tok_reset_001",
    "userId": "u003",
    "type": "password_reset",
    "expiresAt": now + timedelta(minutes=15),  # expires in exactly 15 minutes
})
print("\nPassword reset token inserted — expires at:", (now + timedelta(minutes=15)).strftime("%H:%M:%S UTC"))

# Show TTL index metadata
for idx in db.sessions.list_indexes():
    if "expireAfterSeconds" in idx:
        print(f"\nTTL index: {idx['name']}")
        print(f"  field: {dict(idx['key'])}")
        print(f"  expireAfterSeconds: {idx['expireAfterSeconds']}")
```

```
Sessions inserted: 2
MongoDB TTL monitor runs every ~60 seconds and will delete sess_expired

Password reset token inserted — expires at: 14:37:22 UTC

TTL index: idx_sessions_ttl_1hr
  field: {'createdAt': 1}
  expireAfterSeconds: 3600
```
- Set `expireAfterSeconds=N` to expire documents N seconds after the stored date — set `expireAfterSeconds=0` to expire at the exact datetime stored in the field
- TTL deletion runs on a background thread every ~60 seconds — documents are not deleted at the precise millisecond they expire
- The indexed field must be a BSON `Date` type — TTL does not work on strings, numbers, or ObjectIds
- TTL indexes must be single-field — they cannot be compound indexes
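Both expiry patterns reduce to the same date arithmetic. `ttl_eligible_at` below is an illustrative helper, not part of PyMongo: it computes when a document first becomes eligible for the TTL monitor to delete under each pattern.

```python
from datetime import datetime, timedelta, timezone

# A TTL document becomes eligible for deletion at: stored date + expireAfterSeconds.
# With expireAfterSeconds=0 the stored date IS the expiry moment.
def ttl_eligible_at(stored_date, expire_after_seconds):
    return stored_date + timedelta(seconds=expire_after_seconds)

created = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

# Pattern 1: fixed lifetime — 1-hour sessions keyed on createdAt
print(ttl_eligible_at(created, 3600))   # 2024-06-01 13:00:00+00:00

# Pattern 2: exact expiry — store the expiry itself, expireAfterSeconds=0
expires_at = created + timedelta(minutes=15)
print(ttl_eligible_at(expires_at, 0))   # 2024-06-01 12:15:00+00:00
```

Remember the deletion is not instantaneous either way: the background monitor only sweeps roughly once a minute, so queries should still filter out logically expired documents if exactness matters.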
Managing Indexes — List, Drop, and explain()
Good index hygiene means regularly auditing which indexes exist, verifying they are being used, and removing any that are not. The explain() method and the $indexStats aggregation stage are the two primary tools for this.
```python
# Index management — list, explain, usage stats, and drop
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# ── List all indexes ───────────────────────────────────────────────────────
print("All indexes on products:")
for idx in db.products.list_indexes():
    print(f"  {idx['name']:42} key: {dict(idx['key'])}")

# ── explain — verify a specific query uses the right index ────────────────
# Run the explain command directly (PyMongo's Cursor.explain() takes no
# verbosity argument)
explain = db.command(
    "explain",
    {
        "find": "products",
        "filter": {"category": "Electronics", "price": {"$gt": 20}},
        "sort": {"rating": -1},
    },
    verbosity="executionStats",
)
stats = explain["executionStats"]
print("\nexplain() — Electronics + price > $20 sort by rating:")
print(f"  nReturned: {stats['nReturned']}")
print(f"  totalKeysExamined: {stats['totalKeysExamined']}")
print(f"  totalDocsExamined: {stats['totalDocsExamined']}")
print(f"  executionTimeMs: {stats['executionTimeMillis']} ms")

# ── $indexStats — find unused indexes ─────────────────────────────────────
print("\nIndex usage stats (ops = times used since last server start):")
for stat in db.products.aggregate([{"$indexStats": {}}]):
    ops = stat["accesses"]["ops"]
    flag = "  ← consider dropping" if ops == 0 else ""
    print(f"  {stat['name']:42} ops: {ops}{flag}")

# ── Drop a specific index ──────────────────────────────────────────────────
db.products.drop_index("idx_products_category")
print("\nDropped idx_products_category (now redundant — covered by compound index)")
```

```
All indexes on products:
  _id_                                       key: {'_id': 1}
  idx_products_price                         key: {'price': 1}
  idx_products_category                      key: {'category': 1}
  idx_products_cat_rating_price              key: {'category': 1, 'rating': -1, 'price': 1}
  idx_products_tags                          key: {'tags': 1}

explain() — Electronics + price > $20 sort by rating:
  nReturned: 4
  totalKeysExamined: 5
  totalDocsExamined: 4
  executionTimeMs: 0 ms

Index usage stats (ops = times used since last server start):
  _id_                                       ops: 0
  idx_products_price                         ops: 1
  idx_products_category                      ops: 0  ← consider dropping
  idx_products_cat_rating_price              ops: 4
  idx_products_tags                          ops: 2

Dropped idx_products_category (now redundant — covered by compound index)
```
- A healthy query shows `totalDocsExamined ≈ nReturned` — excess examined documents mean the index is filtering but not as tightly as possible
- Run `$indexStats` regularly and drop any index with zero `ops` — every unused index slows every write and consumes RAM in the WiredTiger cache
- Never drop indexes during peak traffic — any query that relied on a just-dropped index immediately falls back to a collection scan
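The first takeaway can be wrapped in a small triage helper. `index_health` below is a hypothetical function operating on the executionStats dict that explain returns; the 1.5 ratio threshold is an arbitrary example, not MongoDB guidance.

```python
# Quick triage of an explain() executionStats block: the closer
# totalDocsExamined is to nReturned, the more selective the index.
def index_health(stats):
    examined = stats["totalDocsExamined"]
    returned = stats["nReturned"]
    if returned == 0:
        return "no results — check the filter before judging the index"
    if stats.get("totalKeysExamined", 0) == 0 and examined > 0:
        return "COLLSCAN suspected — documents examined but no index keys"
    ratio = examined / returned
    verdict = "healthy" if ratio <= 1.5 else "index not selective enough"
    return f"examined/returned ratio {ratio:.1f} — {verdict}"

# Sample stats shaped like the explain output in this section
print(index_health({"nReturned": 4, "totalKeysExamined": 5, "totalDocsExamined": 4}))
print(index_health({"nReturned": 3, "totalKeysExamined": 0, "totalDocsExamined": 9000}))
```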
Summary Table
| Index Type | Created On | Key Feature | Best For |
|---|---|---|---|
| Single field | One field | O(log n) lookup on that field | High-cardinality filter fields, FK references |
| Unique | One field | Enforces no-duplicate constraint | email, username, external IDs |
| Compound (ESR) | 2+ fields in ESR order | Prefix rule covers multiple query shapes | Filter + sort + range queries together |
| Multikey | Array field (automatic) | One index entry per array element | tags, categories, roles arrays |
| Sparse | Optional field | Skips documents missing the field | Rarely-populated optional fields |
| Partial | Field + filter expression | Indexes only documents matching filter | Active records, soft-delete, tiered data |
| TTL | Date field | Auto-deletes documents after N seconds | Sessions, tokens, log retention |
Practice Questions
Practice 1. What does COLLSCAN mean in explain() output and why does it become a critical problem on large collections?
Practice 2. State the ESR rule and explain why equality fields should come first in a compound index.
Practice 3. A compound index exists on (status, date). Which queries can use it: {date: {$gt: "2024-01-01"}}, {status: "shipped"}, {status: "shipped", date: {$gt: "2024-01-01"}}?
Practice 4. How do you make a TTL index expire a document at a specific point in time rather than after a fixed duration?
Practice 5. What aggregation stage tells you how many times each index has been used, and what should you do with indexes that show zero usage?
Quiz
Quiz 1. What does totalDocsExamined == nReturned in explain() output indicate?
Quiz 2. Why can a compound index not contain two array (multikey) fields?
Quiz 3. How often does the MongoDB TTL monitor thread run to delete expired documents?
Quiz 4. Why is dropping an index during peak production traffic dangerous?
Quiz 5. A partial index has partialFilterExpression: {membership: "premium"}. Will a query for {country: "UK"} (without the membership filter) use this index?
Next up — Aggregation Basics: Building multi-stage pipelines with $match, $group, $project, and $sort to transform and summarise your data in a single query.