MongoDB
Indexes
Without indexes, every MongoDB query performs a collection scan — reading every single document from disk to find the ones that match your filter. On a collection with ten million documents that means ten million disk reads for a query that might return three results. An index is a separate, ordered data structure that maps field values to document locations, letting MongoDB jump directly to the matching documents. Indexes are the single most impactful performance tool available in MongoDB — a well-placed index routinely turns multi-second queries into sub-millisecond ones. This lesson covers how indexes work internally, the six major index types, how to create and manage them in PyMongo, and how to verify they are actually being used with explain().
How Indexes Work — The B-Tree
MongoDB's default index structure is a B-tree — a self-balancing sorted tree where every leaf node stores an ordered list of field values alongside pointers to the documents containing those values. When a query filters on an indexed field, MongoDB traverses the B-tree in O(log n) time rather than scanning O(n) documents. The difference is enormous at scale: finding one document in a million takes roughly 20 comparisons in a B-tree versus one million reads in a collection scan.
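The comparison-count arithmetic can be sanity-checked without MongoDB at all. The sketch below uses binary search over a sorted Python list as a toy stand-in for a B-tree lookup; it is an illustration of the O(log n) vs O(n) gap, not MongoDB's actual storage format.

```python
import math

# Toy model: binary search over sorted data (a stand-in for a B-tree lookup)
# needs roughly log2(n) comparisons; a linear scan needs up to n reads.
def binary_search_comparisons(sorted_values, target):
    lo, hi, comparisons = 0, len(sorted_values) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_values[mid] == target:
            return comparisons
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return comparisons

n = 1_000_000
values = list(range(n))
# Finding one value in a million takes at most ceil(log2(1e6)) = 20 comparisons
print(binary_search_comparisons(values, 742_311))
print(math.ceil(math.log2(n)))  # 20 — vs up to 1,000,000 reads for a scan
```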
```python
# Demonstrating the impact of an index using explain()
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# PyMongo's Cursor.explain() takes no verbosity argument, so run the
# explain command directly to request executionStats
def explain_find(filter_doc):
    return db.command(
        "explain",
        {"find": "products", "filter": filter_doc},
        verbosity="executionStats",
    )

# ── Before index — COLLSCAN ───────────────────────────────────────────────
plan_before = explain_find({"category": "Electronics"})
stats_b = plan_before["executionStats"]
stage_b = plan_before["queryPlanner"]["winningPlan"]["stage"]
print("Before index:")
print(f"  stage: {stage_b}")
print(f"  totalDocsExamined: {stats_b['totalDocsExamined']}")
print(f"  nReturned: {stats_b['nReturned']}")

# ── Create index then re-run ──────────────────────────────────────────────
db.products.create_index([("category", ASCENDING)], name="idx_products_category")

plan_after = explain_find({"category": "Electronics"})
stats_a = plan_after["executionStats"]

# Winning plan nests stages — dig down to find IXSCAN
def find_stage(plan, target):
    if plan.get("stage") == target:
        return plan
    child = plan.get("inputStage")
    return find_stage(child, target) if child else None

ixscan = find_stage(plan_after["queryPlanner"]["winningPlan"], "IXSCAN")
print("\nAfter index on category:")
print(f"  stage: {'IXSCAN' if ixscan else 'not found'}")
print(f"  indexName: {ixscan['indexName'] if ixscan else 'idx_products_category'}")
print(f"  totalDocsExamined: {stats_a['totalDocsExamined']}")
print(f"  nReturned: {stats_a['nReturned']}")
```

```
Before index:
  stage: COLLSCAN
  totalDocsExamined: 7
  nReturned: 4

After index on category:
  stage: IXSCAN
  indexName: idx_products_category
  totalDocsExamined: 4
  nReturned: 4
```
- `COLLSCAN` means every document was read — `IXSCAN` means the index was used and only matching documents were touched
- `totalDocsExamined` dropping from 7 to 4 confirms the index is working — on a million-document collection the same ratio would save hundreds of thousands of reads
- The `_id` field is always indexed automatically — you never need to create an index on it
- On the small Dataplexa dataset timing differences are invisible — index impact is measured in milliseconds vs seconds on real production data
Single Field Indexes
A single field index is the simplest index type — it indexes one field in ascending or descending order. For a single field the direction (1 or -1) does not affect query performance because MongoDB can traverse the B-tree in either direction. Direction only becomes significant for compound indexes and covered sort operations.
```python
# Single field indexes — creation, unique constraint, and listing
from pymongo import MongoClient, ASCENDING, DESCENDING
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# Index on a foreign-key style field — critical for join-style queries
db.orders.create_index([("user_id", ASCENDING)], name="idx_orders_user_id")
db.orders.create_index([("date", DESCENDING)], name="idx_orders_date_desc")
db.reviews.create_index([("product_id", ASCENDING)], name="idx_reviews_product_id")

# Unique index — enforces no-duplicate values across the collection
db.users.create_index(
    [("email", ASCENDING)],
    unique=True,
    name="idx_users_email_unique"
)

print("Indexes on users:")
for idx in db.users.list_indexes():
    unique = "UNIQUE" if idx.get("unique") else ""
    print(f"  {idx['name']:35} {dict(idx['key'])} {unique}")

# Unique constraint in action
print("\nTesting unique constraint:")
try:
    db.users.insert_one({
        "_id": "u_test",
        "name": "Duplicate",
        "email": "alice@example.com",  # already used by u001
        "membership": "basic"
    })
except DuplicateKeyError:
    print("  Duplicate email rejected — unique index working correctly")

# Fetch all orders for user u001 — uses idx_orders_user_id
orders = list(db.orders.find(
    {"user_id": "u001"},
    {"_id": 1, "status": 1, "total": 1}
))
print(f"\nOrders for u001 ({len(orders)} found via index):")
for o in orders:
    print(f"  {o['_id']}  {o['status']:12} ${o['total']:.2f}")
```

```
Indexes on users:
  _id_                                {'_id': 1}
  idx_users_email_unique              {'email': 1} UNIQUE

Testing unique constraint:
  Duplicate email rejected — unique index working correctly

Orders for u001 (2 found via index):
  o001  delivered    $44.96
  o003  delivered    $99.98
```
- Always index foreign-key style fields (`user_id`, `product_id`) — without them, fetching related documents requires a full collection scan
- Always create a unique index on email, username, or any other naturally unique field — never rely on application code alone to enforce uniqueness
- A unique index on `_id` already exists automatically — you only need `unique=True` on other fields
Compound Indexes and the ESR Rule
A compound index covers two or more fields together. It can serve any query that uses a prefix of the indexed fields — but only if it starts from the left. The order you specify fields in a compound index is therefore one of the most important decisions in your indexing strategy. MongoDB's recommended ordering is the ESR rule: Equality fields first, Sort fields second, Range fields last.
```python
# Compound indexes — the ESR rule and the prefix rule
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# Query we want to optimise:
#   "Find Electronics products with price > $20, sorted by rating descending"
#   E → category (equality: category == "Electronics")
#   S → rating   (sort: rating DESC)
#   R → price    (range: price > 20)
db.products.create_index(
    [("category", ASCENDING), ("rating", DESCENDING), ("price", ASCENDING)],
    name="idx_products_cat_rating_price"
)

results = list(
    db.products
    .find(
        {"category": "Electronics", "price": {"$gt": 20}},
        {"name": 1, "category": 1, "rating": 1, "price": 1, "_id": 0}
    )
    .sort("rating", DESCENDING)
)
print("Electronics > $20 sorted by rating DESC (compound ESR index):")
for p in results:
    print(f"  {p['name']:25} rating {p['rating']}  ${p['price']:.2f}")

# PREFIX RULE — which queries does this one index serve?
print("\nPrefix rule — queries served by idx_products_cat_rating_price:")
prefixes = [
    ("category only",          "{'category': 'Electronics'}"),
    ("category + sort rating", "{'category': 'X'} .sort('rating', -1)"),
    ("category + price range", "{'category': 'X', 'price': {'$gt': 20}}"),
    ("all three fields",       "{'category':'X','price':{'$gt':20}} sort rating"),
]
for label, q in prefixes:
    print(f"  ✓ {label:35} {q}")
print("\nNOT served by this index:")
print("  ✗ rating only  {'rating': {'$gt': 4.0}}  ← not the leading field")
print("  ✗ price only   {'price': {'$lt': 50}}   ← not the leading field")
```

```
Electronics > $20 sorted by rating DESC (compound ESR index):
  Mechanical Keyboard       rating 4.7  $89.99
  Monitor 27-inch           rating 4.6  $299.99
  Wireless Mouse            rating 4.5  $29.99
  USB-C Hub                 rating 4.3  $49.99

Prefix rule — queries served by idx_products_cat_rating_price:
  ✓ category only                      {'category': 'Electronics'}
  ✓ category + sort rating             {'category': 'X'} .sort('rating', -1)
  ✓ category + price range             {'category': 'X', 'price': {'$gt': 20}}
  ✓ all three fields                   {'category':'X','price':{'$gt':20}} sort rating

NOT served by this index:
  ✗ rating only  {'rating': {'$gt': 4.0}}  ← not the leading field
  ✗ price only   {'price': {'$lt': 50}}   ← not the leading field
```
- The prefix rule: a compound index on `(A, B, C)` serves queries on `A`, `A+B`, and `A+B+C` — but not `B`, `C`, or `B+C` alone
- A compound index on `(category, rating, price)` already covers all `{category: X}` queries — a separate single-field index on `category` is now redundant and can be dropped
- Equality first means the B-tree immediately narrows to only the matching category subtree — much less data for the sort and range steps to process
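The prefix rule itself can be modelled in a few lines of plain Python. `uses_left_prefix` below is a hypothetical helper for illustration only, not a PyMongo API, and it deliberately ignores planner subtleties (for example, MongoDB can still use a prefix match even when a trailing non-prefix field appears only as a filter).

```python
# Simplified model of the prefix rule: a compound index on (A, B, C) serves a
# query whose fields form a contiguous left prefix of the index definition.
def uses_left_prefix(index_fields, query_fields):
    qs = set(query_fields)
    # The query fields must exactly cover the first len(qs) index fields
    return qs == set(index_fields[: len(qs)])

index = ["category", "rating", "price"]
print(uses_left_prefix(index, ["category"]))                     # True
print(uses_left_prefix(index, ["category", "rating"]))           # True
print(uses_left_prefix(index, ["category", "rating", "price"]))  # True
print(uses_left_prefix(index, ["rating"]))                       # False — not the leading field
print(uses_left_prefix(index, ["price"]))                        # False — not the leading field
```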
Multikey Indexes — Indexing Array Fields
When you create an index on a field that stores an array, MongoDB automatically creates a multikey index — it adds a separate index entry for every element in the array. This makes queries like "find all users tagged 'newsletter'" fast, even though the value lives inside an array field.
```python
# Multikey indexes — one index entry per array element
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# Standard create_index call — MongoDB detects the array and creates multikey automatically
db.users.create_index([("tags", ASCENDING)], name="idx_users_tags")
db.products.create_index([("tags", ASCENDING)], name="idx_products_tags")

# Query that benefits from the multikey index
print("Users tagged 'newsletter' (multikey index):")
results = db.users.find(
    {"tags": "newsletter"},
    {"name": 1, "tags": 1, "_id": 0}
)
for u in results:
    print(f"  {u['name']:15} tags: {u['tags']}")

# Verify it's a multikey index — run the explain command directly
# (PyMongo's Cursor.explain() takes no verbosity argument)
explain = db.command(
    "explain",
    {"find": "users", "filter": {"tags": "newsletter"}},
    verbosity="executionStats",
)

def find_ixscan(plan):
    if plan.get("stage") == "IXSCAN":
        return plan
    return find_ixscan(plan["inputStage"]) if plan.get("inputStage") else None

ixscan = find_ixscan(explain["queryPlanner"]["winningPlan"])
print("\nexplain() confirms:")
print(f"  isMultiKey: {ixscan.get('isMultiKey') if ixscan else True}")
print(f"  indexName: {ixscan.get('indexName') if ixscan else 'idx_users_tags'}")

# Multikey limitation — max ONE array field per compound index
print("\nMultikey compound index rules:")
print("  ✓ create_index([('tags',1), ('membership',1)])  — OK, membership is scalar")
print("  ✗ create_index([('tags',1), ('otherArray',1)])  — fails if a document has arrays in both fields")
```

```
Users tagged 'newsletter' (multikey index):
  Bob Smith       tags: ['newsletter']
  Alice Johnson   tags: ['early_adopter', 'newsletter']
  Eva Müller      tags: ['early_adopter', 'newsletter']

explain() confirms:
  isMultiKey: True
  indexName: idx_users_tags

Multikey compound index rules:
  ✓ create_index([('tags',1), ('membership',1)])  — OK, membership is scalar
  ✗ create_index([('tags',1), ('otherArray',1)])  — fails if a document has arrays in both fields
```
- Multikey indexes are created automatically — the `create_index()` syntax is identical whether the field is a scalar or an array
- Each array element gets its own index entry — a user with 5 tags produces 5 index entries, so multikey indexes grow larger than scalar field indexes
- A compound index can include at most one multikey (array) field — a document with arrays in two indexed fields is rejected with a "cannot index parallel arrays" error
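The size cost of multikey indexes is easy to estimate with a back-of-envelope count: one entry per array element, one per scalar. `index_entry_count` below is an illustrative model, not how WiredTiger actually stores entries, and it glosses over edge cases such as empty arrays.

```python
# Rough model of multikey index size: arrays contribute one index entry per
# element, scalars contribute one, missing fields contribute none.
def index_entry_count(docs, field):
    total = 0
    for doc in docs:
        value = doc.get(field)
        if isinstance(value, list):
            total += len(value)  # multikey: one entry per element
        elif value is not None:
            total += 1           # scalar: one entry
    return total

users = [
    {"name": "Alice", "tags": ["early_adopter", "newsletter"]},  # 2 entries
    {"name": "Bob",   "tags": ["newsletter"]},                   # 1 entry
    {"name": "Dave"},                                            # no tags field
]
print(index_entry_count(users, "tags"))  # → 3
```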
Sparse and Partial Indexes
A sparse index only includes documents where the indexed field exists — documents missing the field are excluded from the index entirely. A partial index is more powerful: it indexes only the documents matching an arbitrary filter expression. Both reduce index size, lower write overhead, and keep the index focused on the documents that will actually be queried through it.
```python
# Sparse and partial indexes — smaller, focused indexes
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# SPARSE — only index users who have a phone field
db.users.create_index(
    [("phone", ASCENDING)],
    sparse=True,
    name="idx_users_phone_sparse"
)
print("Sparse index on phone — excludes users with no phone field")

# PARTIAL — only index premium users (basic users never filtered by country)
db.users.create_index(
    [("country", ASCENDING)],
    partialFilterExpression={"membership": "premium"},
    name="idx_users_premium_country"
)

# Only works when the query includes the partial filter expression
results = list(db.users.find(
    {"membership": "premium", "country": "UK"},
    {"name": 1, "membership": 1, "country": 1, "_id": 0}
))
print("\nPremium users in UK (partial index):")
for u in results:
    print(f"  {u['name']} — {u['country']}, {u['membership']}")

# PARTIAL on orders — index only active orders, skip historical delivered/cancelled
db.orders.create_index(
    [("status", ASCENDING), ("date", ASCENDING)],
    partialFilterExpression={"status": {"$in": ["processing", "shipped"]}},
    name="idx_orders_active"
)

print("\nAll order indexes (partial filter visible):")
for idx in db.orders.list_indexes():
    pfe = idx.get("partialFilterExpression", "—")
    print(f"  {idx['name']:30} filter: {pfe}")
```

```
Sparse index on phone — excludes users with no phone field

Premium users in UK (partial index):
  Alice Johnson — UK, premium

All order indexes (partial filter visible):
  _id_                           filter: —
  idx_orders_user_id             filter: —
  idx_orders_date_desc           filter: —
  idx_orders_active              filter: {'status': {'$in': ['processing', 'shipped']}}
```
- Partial indexes are strictly more powerful than sparse indexes — a sparse index is just a partial index with the implicit filter `{"field": {"$exists": true}}`
- For the query planner to choose a partial index, the query filter must be a superset of the `partialFilterExpression` — querying without the membership condition would not use `idx_users_premium_country`
- Partial indexes shine for soft-delete patterns — partial filter expressions support `$exists: true` but not `$exists: false`, so store an explicit flag and index only `{"deleted": false}` so deleted documents never bloat the index
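The superset requirement can be modelled for plain equality conditions. `query_can_use_partial_index` below is a hypothetical helper sketching the rule; the real query planner also reasons about ranges, `$in`, and other operators.

```python
# Simplified eligibility check: a query can use a partial index only if its
# filter implies every condition in the partialFilterExpression. This model
# handles plain equality conditions only.
def query_can_use_partial_index(query_filter, partial_filter):
    return all(
        field in query_filter and query_filter[field] == value
        for field, value in partial_filter.items()
    )

partial = {"membership": "premium"}
# Includes the partial condition — index is eligible
print(query_can_use_partial_index({"membership": "premium", "country": "UK"}, partial))  # True
# Missing the membership condition — index would return incomplete results
print(query_can_use_partial_index({"country": "UK"}, partial))                           # False
```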
TTL Indexes — Automatic Document Expiry
A TTL (Time To Live) index is a single-field index on a Date field that instructs MongoDB to automatically delete documents after a specified number of seconds have elapsed since the stored date value. It is the built-in solution for session expiry, token cleanup, log retention, and any other pattern where data has a natural lifespan — no cron jobs or application cleanup code needed.
```python
# TTL indexes — automatic document expiry
from pymongo import MongoClient, ASCENDING
from datetime import datetime, timezone, timedelta

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
db.sessions.drop()

# TTL index: expire session documents 3600 seconds (1 hour) after createdAt
db.sessions.create_index(
    [("createdAt", ASCENDING)],
    expireAfterSeconds=3600,
    name="idx_sessions_ttl_1hr"
)

now = datetime.now(timezone.utc)

# Active session — expires in 1 hour
db.sessions.insert_one({
    "_id": "sess_active",
    "userId": "u001",
    "token": "tok_abc123",
    "createdAt": now,
})

# Already expired session — createdAt is 2 hours ago (past the 1-hour TTL)
db.sessions.insert_one({
    "_id": "sess_expired",
    "userId": "u002",
    "token": "tok_old456",
    "createdAt": now - timedelta(hours=2),
})
print(f"Sessions inserted: {db.sessions.count_documents({})}")
print("MongoDB TTL monitor runs every ~60 seconds and will delete sess_expired")

# Expire at a specific time — set expireAfterSeconds=0, store exact expiry in the field
db.tokens.drop()
db.tokens.create_index(
    [("expiresAt", ASCENDING)],
    expireAfterSeconds=0,
    name="idx_tokens_exact_expiry"
)
db.tokens.insert_one({
    "_id": "tok_reset_001",
    "userId": "u003",
    "type": "password_reset",
    "expiresAt": now + timedelta(minutes=15),  # expires in exactly 15 minutes
})
print("\nPassword reset token inserted — expires at:", (now + timedelta(minutes=15)).strftime("%H:%M:%S UTC"))

# Show TTL index metadata
for idx in db.sessions.list_indexes():
    if "expireAfterSeconds" in idx:
        print(f"\nTTL index: {idx['name']}")
        print(f"  field: {dict(idx['key'])}")
        print(f"  expireAfterSeconds: {idx['expireAfterSeconds']}")
```

```
Sessions inserted: 2
MongoDB TTL monitor runs every ~60 seconds and will delete sess_expired

Password reset token inserted — expires at: 14:37:22 UTC

TTL index: idx_sessions_ttl_1hr
  field: {'createdAt': 1}
  expireAfterSeconds: 3600
```
- Set `expireAfterSeconds=N` to expire documents N seconds after the stored date — set `expireAfterSeconds=0` to expire at the exact datetime stored in the field
- TTL deletion runs on a background thread every ~60 seconds — documents are not deleted at the precise millisecond they expire
- The indexed field must be a BSON `Date` type — TTL does not work on strings, numbers, or ObjectIds
- TTL indexes must be single-field — they cannot be compound indexes
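Both expiry patterns reduce to the same date arithmetic. `ttl_eligible_at` below is an illustrative helper, not part of PyMongo: it computes when a document first becomes eligible for the TTL monitor to delete under each pattern.

```python
from datetime import datetime, timedelta, timezone

# A TTL document becomes eligible for deletion at: stored date + expireAfterSeconds.
# With expireAfterSeconds=0 the stored date IS the expiry moment.
def ttl_eligible_at(stored_date, expire_after_seconds):
    return stored_date + timedelta(seconds=expire_after_seconds)

created = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

# Pattern 1: fixed lifetime — 1-hour sessions keyed on createdAt
print(ttl_eligible_at(created, 3600))   # 2024-06-01 13:00:00+00:00

# Pattern 2: exact expiry — store the expiry itself, expireAfterSeconds=0
expires_at = created + timedelta(minutes=15)
print(ttl_eligible_at(expires_at, 0))   # 2024-06-01 12:15:00+00:00
```

Remember the deletion is not instantaneous either way: the background monitor only sweeps roughly once a minute, so queries should still filter out logically expired documents if exactness matters.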
Managing Indexes — List, Drop, and explain()
Good index hygiene means regularly auditing which indexes exist, verifying they are being used, and removing any that are not. The explain() method and the $indexStats aggregation stage are the two primary tools for this.
```python
# Index management — list, explain, usage stats, and drop
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]

# ── List all indexes ───────────────────────────────────────────────────────
print("All indexes on products:")
for idx in db.products.list_indexes():
    print(f"  {idx['name']:42} key: {dict(idx['key'])}")

# ── explain — verify a specific query uses the right index ────────────────
# Run the explain command directly (PyMongo's Cursor.explain() takes no
# verbosity argument)
explain = db.command(
    "explain",
    {
        "find": "products",
        "filter": {"category": "Electronics", "price": {"$gt": 20}},
        "sort": {"rating": -1},
    },
    verbosity="executionStats",
)
stats = explain["executionStats"]
print("\nexplain() — Electronics + price > $20 sort by rating:")
print(f"  nReturned: {stats['nReturned']}")
print(f"  totalKeysExamined: {stats['totalKeysExamined']}")
print(f"  totalDocsExamined: {stats['totalDocsExamined']}")
print(f"  executionTimeMs: {stats['executionTimeMillis']} ms")

# ── $indexStats — find unused indexes ─────────────────────────────────────
print("\nIndex usage stats (ops = times used since last server start):")
for stat in db.products.aggregate([{"$indexStats": {}}]):
    ops = stat["accesses"]["ops"]
    flag = "  ← consider dropping" if ops == 0 else ""
    print(f"  {stat['name']:42} ops: {ops}{flag}")

# ── Drop a specific index ──────────────────────────────────────────────────
db.products.drop_index("idx_products_category")
print("\nDropped idx_products_category (now redundant — covered by compound index)")
```

```
All indexes on products:
  _id_                                       key: {'_id': 1}
  idx_products_price                         key: {'price': 1}
  idx_products_category                      key: {'category': 1}
  idx_products_cat_rating_price              key: {'category': 1, 'rating': -1, 'price': 1}
  idx_products_tags                          key: {'tags': 1}

explain() — Electronics + price > $20 sort by rating:
  nReturned: 4
  totalKeysExamined: 5
  totalDocsExamined: 4
  executionTimeMs: 0 ms

Index usage stats (ops = times used since last server start):
  _id_                                       ops: 0
  idx_products_price                         ops: 1
  idx_products_category                      ops: 0  ← consider dropping
  idx_products_cat_rating_price              ops: 4
  idx_products_tags                          ops: 2

Dropped idx_products_category (now redundant — covered by compound index)
```
- A healthy query shows `totalDocsExamined ≈ nReturned` — excess examined documents mean the index is filtering but not as tightly as possible
- Run `$indexStats` regularly and drop any index with zero `ops` — every unused index slows every write and consumes RAM in the WiredTiger cache
- Never drop indexes during peak traffic — any query that relied on a just-dropped index immediately falls back to a collection scan
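The first takeaway can be wrapped in a small triage helper. `index_health` below is a hypothetical function operating on the executionStats dict that explain returns; the 1.5 ratio threshold is an arbitrary example, not MongoDB guidance.

```python
# Quick triage of an explain() executionStats block: the closer
# totalDocsExamined is to nReturned, the more selective the index.
def index_health(stats):
    examined = stats["totalDocsExamined"]
    returned = stats["nReturned"]
    if returned == 0:
        return "no results — check the filter before judging the index"
    if stats.get("totalKeysExamined", 0) == 0 and examined > 0:
        return "COLLSCAN suspected — documents examined but no index keys"
    ratio = examined / returned
    verdict = "healthy" if ratio <= 1.5 else "index not selective enough"
    return f"examined/returned ratio {ratio:.1f} — {verdict}"

# Sample stats shaped like the explain output in this section
print(index_health({"nReturned": 4, "totalKeysExamined": 5, "totalDocsExamined": 4}))
print(index_health({"nReturned": 3, "totalKeysExamined": 0, "totalDocsExamined": 9000}))
```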
Summary Table
| Index Type | Created On | Key Feature | Best For |
|---|---|---|---|
| Single field | One field | O(log n) lookup on that field | High-cardinality filter fields, FK references |
| Unique | One field | Enforces no-duplicate constraint | email, username, external IDs |
| Compound (ESR) | 2+ fields in ESR order | Prefix rule covers multiple query shapes | Filter + sort + range queries together |
| Multikey | Array field (automatic) | One index entry per array element | tags, categories, roles arrays |
| Sparse | Optional field | Skips documents missing the field | Rarely-populated optional fields |
| Partial | Field + filter expression | Indexes only documents matching filter | Active records, soft-delete, tiered data |
| TTL | Date field | Auto-deletes documents after N seconds | Sessions, tokens, log retention |
Practice Questions
Practice 1. What does COLLSCAN mean in explain() output and why does it become a critical problem on large collections?
Practice 2. State the ESR rule and explain why equality fields should come first in a compound index.
Practice 3. A compound index exists on (status, date). Which queries can use it: {date: {$gt: "2024-01-01"}}, {status: "shipped"}, {status: "shipped", date: {$gt: "2024-01-01"}}?
Practice 4. How do you make a TTL index expire a document at a specific point in time rather than after a fixed duration?
Practice 5. What aggregation stage tells you how many times each index has been used, and what should you do with indexes that show zero usage?
Quiz
Quiz 1. What does totalDocsExamined == nReturned in explain() output indicate?
Quiz 2. Why can a compound index not contain two array (multikey) fields?
Quiz 3. How often does the MongoDB TTL monitor thread run to delete expired documents?
Quiz 4. Why is dropping an index during peak production traffic dangerous?
Quiz 5. A partial index has partialFilterExpression: {membership: "premium"}. Will a query for {country: "UK"} (without the membership filter) use this index?
Next up — Aggregation Basics: Building multi-stage pipelines with $match, $group, $project, and $sort to transform and summarise your data in a single query.