NoSQL
Document Databases
In 2010, Foursquare was one of the fastest-growing apps on the internet, logging millions of check-ins a day, each one a rich JSON-shaped record: location, venue, user, tags, comments, points. Their relational database cluster was hitting the wall. Schema changes required all-hands migrations, and assembling a single check-in meant JOINing half a dozen tables. They moved to MongoDB, and within weeks their engineering team was shipping features in hours instead of weeks, because the data they were storing was already shaped like documents. This lesson is about understanding exactly why that works.
What a Document Database Actually Stores
A document database stores data as self-describing documents — typically JSON or a binary variant of it. Each document contains its own structure. No fixed columns. No foreign keys. No JOINs required to reassemble related data.
SQL table vs Document — the same data, two very different shapes:
❌ SQL — 4 tables, 3 JOINs needed
-- users table
id | name | email
1 | Priya | p@x.com
-- posts table
id | user_id | title | body
1 | 1 | Hello world | ...
-- tags table
post_id | tag
1 | tech
1 | intro
-- comments table
post_id | user_id | text
1 | 2 | Great post!
✅ Document — 1 document, 0 JOINs
"_id": "post_001",
"author": {
"id": "u_1",
"name": "Priya"
},
"title": "Hello world",
"tags": ["tech", "intro"],
"comments": [
{"user":"u_2",
"text":"Great post!"}
]
}
How MongoDB Stores Documents — BSON
MongoDB doesn't store plain JSON on disk. It uses BSON (Binary JSON) — a binary-encoded serialisation of JSON-like documents. BSON adds three things JSON doesn't have:
Additional types
Date, ObjectId, Binary, Decimal128, Int32/Int64 — richer than JSON's number/string/bool.
Length-prefixed
Each field stores its own length. MongoDB can skip fields without parsing them — fast field access.
Traversable
Nested documents and arrays are directly traversable without full deserialisation.
The ObjectId — MongoDB's Auto-Generated Primary Key
Every MongoDB document gets an _id field automatically. If you don't provide one, MongoDB generates an ObjectId — a 12-byte value that encodes useful metadata:
// ObjectId anatomy: 507f1f77bcf86cd799439011
507f1f77   4 bytes   Unix timestamp
bcf86c     3 bytes   Machine ID
d799       2 bytes   Process ID
439011     3 bytes   Counter (random start)
Globally unique across all machines without a central coordinator, and roughly sortable by creation time: an ObjectId generated in a later second always compares greater. (Since MongoDB 3.4 the machine-ID and process-ID bytes have been replaced by a single 5-byte random value per process, but the timestamp-first layout is unchanged.)
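The anatomy above can be verified in plain JavaScript (not mongosh): a sketch that slices the 24-character hex string into its fields and decodes the leading timestamp. The helper name decodeObjectId is ours for illustration, not a MongoDB API.

```javascript
// Sketch: decode the classic 12-byte ObjectId layout from its hex string.
// decodeObjectId is a hypothetical helper, not part of MongoDB's API.
function decodeObjectId(hex) {
  return {
    timestamp: parseInt(hex.slice(0, 8), 16),   // 4 bytes: Unix seconds
    machineId: hex.slice(8, 14),                // 3 bytes: machine ID
    processId: hex.slice(14, 18),               // 2 bytes: process ID
    counter:   hex.slice(18, 24),               // 3 bytes: counter
  };
}

const parts = decodeObjectId("507f1f77bcf86cd799439011");
console.log(new Date(parts.timestamp * 1000).toISOString());
// → 2012-10-17T21:13:27.000Z
```

The embedded timestamp is exactly why ObjectIds sort roughly by creation time: the highest-order bytes are the creation second.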
CRUD Operations — The Full Set
The scenario: You're building a product catalogue for a multi-category marketplace. Products have completely different attributes per category. Here's every CRUD operation you'll use daily:
// CREATE — insertOne and insertMany
db.products.insertOne({
name: "ProBook X14 Laptop",
category: "electronics",
price: 1199.00,
specs: { ram: 16, storage: "512GB", display: "14-inch" },
tags: ["laptop", "work", "featured"],
in_stock: true,
created: new Date()
})
// Insert multiple documents at once
db.products.insertMany([
{ name: "Classic Tee", category: "apparel", price: 29.99, sizes: ["S","M","L"] },
{ name: "Running Hat", category: "apparel", price: 19.99, sizes: ["M","L","XL"] }
])
insertMany — sends all documents in one network call. Faster than calling insertOne in a loop. Returns an object with insertedIds — a map of array index to generated ObjectId for each document.
new Date() — MongoDB's BSON Date type stores milliseconds since epoch. Stored as 8 bytes, more efficient than an ISO string. Supports date comparison operators natively: $gte, $lt, etc.
// READ — find with filters, projection, sort, limit
// Find all featured electronics under £800
db.products.find(
{
category: "electronics",
price: { $lt: 800 },
tags: "featured" // match any document where tags array contains "featured"
},
{
name: 1, price: 1, specs: 1, // projection: only return these fields
_id: 0 // exclude _id from results
}
).sort({ price: 1 }) // sort by price ascending
.limit(10) // return max 10 results
[
{
name: "UltraBook Z13",
price: 449,
specs: { ram: 8, storage: "256GB", display: "13-inch" }
},
{
name: "WorkStation M1",
price: 699,
specs: { ram: 16, storage: "512GB", display: "15-inch" }
}
]
tags: "featured"
When you query an array field with a scalar value, MongoDB matches any document where the array contains that value. No special operator needed. {"{"} tags: "featured" {"}"} means "tags array includes the string 'featured'."
Projection: {"{ name: 1, price: 1, _id: 0 }"}
Projections reduce network transfer. Only request the fields you actually need. 1 = include, 0 = exclude. You cannot mix includes and excludes in one projection except for _id which can always be explicitly excluded.
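The array-matching rule can be sketched in plain JavaScript (not mongosh). fieldMatches is a hypothetical helper that mimics how a scalar query value matches either a scalar field or any element of an array field.

```javascript
// Sketch of MongoDB's scalar-vs-array matching rule.
// fieldMatches is a hypothetical helper, not a MongoDB API.
function fieldMatches(fieldValue, queryValue) {
  if (Array.isArray(fieldValue)) return fieldValue.includes(queryValue);
  return fieldValue === queryValue;
}

fieldMatches(["laptop", "work", "featured"], "featured"); // array contains value → match
fieldMatches("featured", "featured");                     // plain scalar equality → match
fieldMatches(["laptop", "work"], "featured");             // → no match
```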
// UPDATE — updateOne, updateMany, findOneAndUpdate
// Update one product's price and mark it as on sale
db.products.updateOne(
{ name: "ProBook X14 Laptop" }, // filter
{
$set: { price: 999, on_sale: true }, // set these fields
$push: { tags: "sale" }, // append to tags array
$currentDate: { updated_at: true } // set updated_at to now
}
)
// Give all apparel items a 10% discount
db.products.updateMany(
{ category: "apparel" },
{ $mul: { price: 0.90 } } // multiply price by 0.90
)
$push: { tags: "sale" }
Appends a value to an array field. The array grows; existing values are untouched. Use $addToSet instead if you need uniqueness guaranteed (it won't add a value that already exists). Use $pop to remove from the start or end of the array, and $pull to remove specific values.
$mul: { price: 0.90 }
Multiplies the field by the given value in a single atomic operation, no read-then-write needed. $mul: { price: 0.90 } applies a 10% discount. Other arithmetic operators: $inc (add/subtract), $min/$max (update only if the new value is lower/higher).
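What these operators do to a document can be made concrete with a plain-JavaScript sketch. applyUpdate is a hypothetical, heavily simplified interpreter for $set, $mul, and $push, here only to show the semantics; the real server applies these atomically on its own copy of the document.

```javascript
// Simplified, hypothetical interpreter for three update operators.
function applyUpdate(doc, update) {
  const out = { ...doc };
  for (const [field, value] of Object.entries(update.$set ?? {})) out[field] = value;
  for (const [field, factor] of Object.entries(update.$mul ?? {})) out[field] *= factor;
  for (const [field, value] of Object.entries(update.$push ?? {}))
    out[field] = [...(out[field] ?? []), value];
  return out;
}

const doc = { price: 100, tags: ["apparel"] };
const updated = applyUpdate(doc, { $mul: { price: 0.9 }, $push: { tags: "sale" } });
// updated.price → 90, updated.tags → ["apparel", "sale"]
```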
// DELETE — deleteOne, deleteMany
// Remove a specific product
db.products.deleteOne({ name: "Running Hat" })
// Remove all out-of-stock products older than 90 days
db.products.deleteMany({
in_stock: false,
created: { $lt: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) }
})
{ acknowledged: true, deletedCount: 1 }
{ acknowledged: true, deletedCount: 14 }
-- 14 out-of-stock products older than 90 days removed
deletedCount — always check this. If it's 0, either your filter matched nothing or the document was already gone. A common bug: using deleteOne when deleteMany was intended — only the first matching document is removed.
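The 90-day cutoff used above is plain date arithmetic: a BSON Date is milliseconds since the Unix epoch, so subtracting 90 days' worth of milliseconds from Date.now() gives the threshold the $lt comparison runs against.

```javascript
// 90 days expressed in milliseconds: the arithmetic behind the $lt cutoff above.
const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000; // 7,776,000,000 ms
const cutoff = new Date(Date.now() - NINETY_DAYS_MS);
// Any document whose created date is earlier than cutoff is older than 90 days.
```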
Indexing — The Difference Between Fast and Catastrophic
Without an index, every query does a collection scan — MongoDB reads every document to find matches. With 10 documents this is fine. With 10 million documents this means your API times out.
The scenario: Your products collection has 2 million documents. You query by category and price constantly. Here's what happens with and without an index:
// Check query performance BEFORE indexing
db.products.find({ category: "electronics", price: { $lt: 800 } })
.explain("executionStats")
executionStats: {
executionTimeMillis: 4820, // 4.8 seconds
totalDocsExamined: 2000000, // read every document
nReturned: 1847, // only 1,847 matched
stage: "COLLSCAN" // COLLSCAN = full collection scan = bad
}
// Create a compound index on category + price
db.products.createIndex(
{ category: 1, price: 1 }, // 1 = ascending, -1 = descending
{ name: "category_price_idx" } // optional name for management
)
// Same query AFTER indexing
db.products.find({ category: "electronics", price: { $lt: 800 } })
.explain("executionStats")
executionStats: {
executionTimeMillis: 3, // 3ms (was 4,820ms)
totalDocsExamined: 1847, // only examined matching docs
nReturned: 1847, // same result
stage: "IXSCAN" // IXSCAN = index scan = good
}
-- 1,607x faster with one index
{"{ category: 1, price: 1 }"} — compound index field order matters
Put the equality field (category) first, the range field (price) second. This is called the ESR rule: Equality fields first, then Sort fields, then Range fields. An index on {"{"} price: 1, category: 1 {"}"} would be less efficient for this query because MongoDB can't narrow down by category first.
.explain("executionStats")
Always run explain on slow queries. Look for COLLSCAN (bad: no index used) vs IXSCAN (good: index used). Compare totalDocsExamined with nReturned; a large ratio means your index isn't selective enough.
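The examined-vs-returned check reduces to one ratio: documents scanned per document returned. A value near 1 means the index is doing its job. The helper below is ours, not a MongoDB API, and takes the two counts as plain numbers.

```javascript
// Hypothetical helper: how many documents were scanned per document returned.
function scanRatio(examined, returned) {
  return examined / Math.max(returned, 1); // guard against 0 returned
}

scanRatio(2000000, 1847); // ≈ 1083 — COLLSCAN territory, index missing or unused
scanRatio(1847, 1847);    // 1 — a perfectly selective index
```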
The Aggregation Pipeline — Complex Queries Without JOINs
MongoDB's aggregation pipeline lets you process documents through a sequence of stages — filter, group, sort, transform, join — each stage passing its output to the next. This is how you run analytics, generate reports, and compute summaries without SQL.
The scenario: Your product manager wants a sales dashboard: total revenue and average price per category, sorted by revenue descending:
db.products.aggregate([
// Stage 1: filter — only in-stock products
{ $match: { in_stock: true } },
// Stage 2: group by category, compute stats
{ $group: {
_id: "$category", // group key
total_revenue: { $sum: "$price" }, // sum all prices in this group
avg_price: { $avg: "$price" }, // average price
count: { $sum: 1 } // count documents in group
}},
// Stage 3: sort by total revenue descending
{ $sort: { total_revenue: -1 } },
// Stage 4: reshape the output document
{ $project: {
_id: 0,
category: "$_id",
total_revenue: { $round: ["$total_revenue", 2] },
avg_price: { $round: ["$avg_price", 2] },
count: 1
}}
])
[
{ category: "electronics", total_revenue: 4821099.50, avg_price: 849.32, count: 5676 },
{ category: "apparel", total_revenue: 892440.20, avg_price: 44.12, count: 20229 },
{ category: "books", total_revenue: 341200.80, avg_price: 18.90, count: 18053 }
]
Each pipeline stage explained:
$match
Filters documents, like a WHERE clause. Always put $match as early as possible in the pipeline; it reduces the number of documents flowing through later stages. When $match is the first stage and the filtered field is indexed, the query planner can use the index.
$group
Groups documents by a field and computes aggregate values. $sum, $avg, $min, $max, $count, $push (collect into array) are the main accumulators. The _id field inside $group defines the grouping key — equivalent to SQL's GROUP BY.
$project
Reshapes documents — rename fields, compute derived values, include/exclude fields. Here we rename _id to category and round the numbers. $round: ["$total_revenue", 2] rounds to 2 decimal places — the array syntax is MongoDB's expression language.
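A good way to internalise the pipeline is to reproduce it over a plain in-memory array. The sketch below mimics $match, $group, and $sort with ordinary JavaScript; the sample data is invented for illustration.

```javascript
// Plain-JS sketch of $match → $group → $sort over an in-memory array.
const products = [
  { category: "electronics", price: 500, in_stock: true },
  { category: "electronics", price: 300, in_stock: true },
  { category: "apparel",     price: 40,  in_stock: true },
  { category: "apparel",     price: 20,  in_stock: false },
];

// $match: keep only in-stock products
const matched = products.filter(p => p.in_stock);

// $group: one accumulator object per category ($sum price, $sum 1)
const groups = {};
for (const p of matched) {
  const g = groups[p.category] ??= { _id: p.category, total_revenue: 0, count: 0 };
  g.total_revenue += p.price;
  g.count += 1;
}

// $sort: total_revenue descending
const report = Object.values(groups).sort((a, b) => b.total_revenue - a.total_revenue);
// report[0] → { _id: "electronics", total_revenue: 800, count: 2 }
```

The real pipeline streams documents through these stages on the server instead of materialising each intermediate array, but the per-stage logic is exactly this.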
$lookup — Joining Collections When You Need To
Document databases discourage JOINs but don't forbid them. MongoDB's $lookup stage performs a left outer join between two collections. Use it sparingly — for reports and admin queries, not for hot paths.
The scenario: You need to enrich order documents with the full user profile for a monthly report. Orders reference users by user_id:
db.orders.aggregate([
{ $match: { status: "delivered" } },
// Join users collection on user_id
{ $lookup: {
from: "users", // collection to join
localField: "user_id", // field in orders
foreignField: "_id", // field in users
as: "user_info" // name for the joined array
}},
// $lookup returns an array — unwind to get a single object
{ $unwind: "$user_info" },
{ $project: {
order_id: 1,
total: 1,
"user_info.name": 1,
"user_info.email": 1
}}
])
[
{
order_id: "ord_8821",
total: 149.99,
user_info: { name: "Priya Sharma", email: "priya@example.com" }
},
{
order_id: "ord_8990",
total: 89.50,
user_info: { name: "Carlos Ruiz", email: "carlos@example.com" }
}
]
When NOT to use $lookup:
$lookup is a full collection scan on the joined collection unless you have an index on foreignField. On hot API paths — user profile pages, product detail pages — embed the data you need instead of joining. Use $lookup for background reports, admin dashboards, and batch jobs where a few extra milliseconds don't matter.
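What $lookup computes is an ordinary left outer join. A plain-JavaScript sketch over two in-memory arrays (invented sample data) makes the result shape concrete, including why unmatched orders keep an empty user_info array.

```javascript
// Plain-JS sketch of $lookup's left outer join semantics.
const orders = [
  { order_id: "ord_1", user_id: "u_1", total: 149.99 },
  { order_id: "ord_2", user_id: "u_9", total: 89.50 }, // no matching user
];
const users = [{ _id: "u_1", name: "Priya Sharma" }];

const joined = orders.map(o => ({
  ...o,
  // from: users, localField: user_id, foreignField: _id, as: user_info
  user_info: users.filter(u => u._id === o.user_id),
}));
// joined[0].user_info → [{ _id: "u_1", name: "Priya Sharma" }]
// joined[1].user_info → []  (left outer join: the unmatched order is kept)
```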
Embedding vs Referencing — The Core Design Decision
Every time you model data in a document database, you face one decision: embed the related data inside the document, or reference it by ID and fetch it separately. There is no universal answer — it depends on your access patterns.
Embed when:
The related data is read together with its parent almost every time, belongs to exactly one parent, and is bounded in size — a handful of tags, a shipping address, a specs object.
Reference when:
The related data is accessed independently, shared between many parents, or can grow without bound — users referenced by thousands of orders, comments on a popular post.
The 16MB limit rule:
MongoDB documents have a hard 16MB size limit. If you embed an unbounded array (like all comments ever on a popular post), you will eventually hit this limit. For any relationship where the child count could grow to thousands, always reference — never embed.
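A rough way to reason about the limit before embedding an unbounded array: estimate the document's size and project its growth. The sketch below uses JSON string length as a crude proxy for BSON size (real BSON encoding is close but not identical); roomForComments is a hypothetical helper.

```javascript
// Crude capacity check: JSON length as a stand-in for BSON size.
const MAX_BSON_BYTES = 16 * 1024 * 1024; // MongoDB's hard 16MB document limit

function roomForComments(post, avgCommentBytes) {
  const used = JSON.stringify(post).length;
  return Math.floor((MAX_BSON_BYTES - used) / avgCommentBytes);
}

// With ~200-byte comments a post tops out around 80k embedded comments;
// a viral post can blow past that, so reference comments in their own collection.
```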
Teacher's Note
Document databases reward engineers who think about access patterns first and data structure second. The question is never "what is the most normalised way to store this?" It is always "how will I read this data most often?" If the answer is "together as one unit" — embed it. If the answer is "independently, from multiple places" — reference it. Every MongoDB performance problem I've ever debugged traced back to a data model designed like a SQL schema — normalised, referenced, JOIN-heavy. When you model documents like documents, queries become trivially simple.
Practice Questions — You're the Engineer
Scenario:
You run db.orders.find({ status: "pending" }).explain("executionStats") and see totalDocsExamined: 4000000 and nReturned: 142. The query took 6,200ms. What does the stage field almost certainly show, and what does it mean?
Scenario:
Your aggregation pipeline contains $group, $sort, $limit, $project, and a filter stage. The filter stage reduces 10 million documents to 50,000 before the expensive $group runs. For best performance, which stage should come first in the pipeline, and what is it called?
Quiz — Document Database Decisions
Scenario:
You run db.products.find({ category: "electronics", price: { $lt: 500 } }). The collection has 5 million documents. The query is slow. You want to create one index that optimally serves this query. What should it be, and in what field order?
Scenario:
Your monthly-revenue pipeline runs $group, then $lookup to join user data, then $sort by revenue. The pipeline processes all 20 million orders in the collection even though only 180,000 were placed in the target month. The pipeline takes 4 minutes. What single change would have the biggest performance impact?
Up Next · Lesson 15
MongoDB in the NoSQL Ecosystem
Where MongoDB fits in the broader NoSQL landscape, how it compares to CouchDB and Firestore, and the real-world architectures where MongoDB is the obvious choice versus where engineers regret using it.