NoSQL Lesson 14 – Document Databases | Dataplexa
NoSQL Database Types · Lesson 14

Document Databases

In 2010, Foursquare was one of the fastest-growing apps on the internet: 10 million check-ins a day, each one a rich JSON object — location, venue, user, tags, comments, points. Their MySQL cluster was hitting the wall: schema changes required all-hands migrations, and assembling a single check-in took a six-table JOIN. They switched to MongoDB, and within weeks the engineering team was shipping features in hours instead of weeks — because the data they were storing was already shaped like documents. This lesson is about understanding exactly why that works.

What a Document Database Actually Stores

A document database stores data as self-describing documents — typically JSON or a binary variant of it. Each document contains its own structure. No fixed columns. No foreign keys. No JOINs required to reassemble related data.

SQL table vs Document — the same data, two very different shapes:

❌ SQL — 4 tables, 3 JOINs needed

-- users table
id | name | email
1 | Priya | p@x.com

-- posts table
id | user_id | title | body
1 | 1 | Hello world | ...

-- tags table
post_id | tag
1 | tech
1 | intro

-- comments table
post_id | user_id | text
1 | 2 | Great post!

✅ Document — 1 document, 0 JOINs

{
 "_id": "post_001",
 "author": {
  "id": "u_1",
  "name": "Priya"
 },
 "title": "Hello world",
 "tags": ["tech", "intro"],
 "comments": [
  {"user":"u_2",
   "text":"Great post!"}
 ]
}

How MongoDB Stores Documents — BSON

MongoDB doesn't store plain JSON on disk. It uses BSON (Binary JSON) — a binary-encoded serialisation of JSON-like documents. BSON adds three things JSON doesn't have:

Additional types

Date, ObjectId, Binary, Decimal128, Int32/Int64 — richer than JSON's number/string/bool.

Length-prefixed

Each field stores its own length. MongoDB can skip fields without parsing them — fast field access.

Traversable

Nested documents and arrays are directly traversable without full deserialisation.
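From mongosh, these richer BSON types are written with type helpers — a small sketch using the standard Date, NumberDecimal, and NumberLong helpers (the "inventory" collection and its fields are hypothetical):

```javascript
// Sketch: storing BSON-specific types from mongosh.
db.inventory.insertOne({
  sku:     "X14-512",
  price:   NumberDecimal("1199.00"),  // Decimal128 — exact decimal arithmetic, safe for money
  views:   NumberLong("0"),           // Int64 — explicit 64-bit integer
  created: new Date()                 // BSON Date — 8-byte millisecond timestamp
})

// Range queries compare Dates natively — no string parsing needed:
db.inventory.find({ created: { $gte: new Date("2024-01-01") } })
```

Note that plain JSON numbers would silently become doubles; NumberDecimal avoids binary floating-point rounding on prices.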

The ObjectId — MongoDB's Auto-Generated Primary Key

Every MongoDB document gets an _id field automatically. If you don't provide one, MongoDB generates an ObjectId — a 12-byte value that encodes useful metadata:

// ObjectId anatomy: 507f1f77bcf86cd799439011

507f1f77 — 4 bytes — Unix timestamp (seconds)
bcf86c — 3 bytes — Machine ID
d799 — 2 bytes — Process ID
439011 — 3 bytes — Incrementing counter

Globally unique across all machines without a central coordinator, and roughly sortable by creation time — an ObjectId generated later almost always compares lexicographically greater. (Since MongoDB 3.4, the machine-ID and process-ID bytes are replaced by a single 5-byte random value generated once per process; the 12-byte layout and leading timestamp are unchanged.)
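Because the timestamp leads the ObjectId, the creation time can be recovered from the hex string alone. A minimal plain-JavaScript sketch of the idea (mongosh provides this built in as ObjectId().getTimestamp()):

```javascript
// Decode the creation time embedded in an ObjectId's hex string.
// The first 8 hex characters are a big-endian Unix timestamp in seconds.
function objectIdTimestamp(hex) {
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// The example ObjectId above was generated in October 2012:
objectIdTimestamp("507f1f77bcf86cd799439011").toISOString()
// → "2012-10-17T21:13:27.000Z"
```

This is also why sorting a collection by _id approximates sorting by insertion time.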

CRUD Operations — The Full Set

The scenario: You're building a product catalogue for a multi-category marketplace. Products have completely different attributes per category. Here's every CRUD operation you'll use daily:

// CREATE — insertOne and insertMany
db.products.insertOne({
  name:     "ProBook X14 Laptop",
  category: "electronics",
  price:    1199.00,
  specs:    { ram: 16, storage: "512GB", display: "14-inch" },
  tags:     ["laptop", "work", "featured"],
  in_stock: true,
  created:  new Date()
})

// Insert multiple documents at once
db.products.insertMany([
  { name: "Classic Tee", category: "apparel", price: 29.99, sizes: ["S","M","L"] },
  { name: "Running Hat", category: "apparel", price: 19.99, sizes: ["M","L","XL"] }
])

insertMany — sends all documents in one network call. Faster than calling insertOne in a loop. Returns an object with insertedIds — a map of array index to generated ObjectId for each document.

new Date() — MongoDB's BSON Date type stores milliseconds since epoch. Stored as 8 bytes, more efficient than an ISO string. Supports date comparison operators natively: $gte, $lt, etc.

// READ — find with filters, projection, sort, limit
// Find all featured electronics under £800
db.products.find(
  {
    category: "electronics",
    price:    { $lt: 800 },
    tags:     "featured"         // match any document where tags array contains "featured"
  },
  {
    name: 1, price: 1, specs: 1, // projection: only return these fields
    _id:  0                      // exclude _id from results
  }
).sort({ price: 1 })             // sort by price ascending
 .limit(10)                      // return max 10 results
[
  {
    name:  "UltraBook Z13",
    price: 449,
    specs: { ram: 8, storage: "256GB", display: "13-inch" }
  },
  {
    name:  "WorkStation M1",
    price: 699,
    specs: { ram: 16, storage: "512GB", display: "15-inch" }
  }
]
tags: "featured"

When you query an array field with a scalar value, MongoDB matches any document where the array contains that value. No special operator needed. { tags: "featured" } means "the tags array includes the string 'featured'."

Projection: { name: 1, price: 1, _id: 0 }

Projections reduce network transfer — only request the fields you actually need. 1 = include, 0 = exclude. You cannot mix includes and excludes in one projection, except for _id, which can always be explicitly excluded.

// UPDATE — updateOne, updateMany, findOneAndUpdate
// Update one product's price and mark it as on sale
db.products.updateOne(
  { name: "ProBook X14 Laptop" },             // filter
  {
    $set:  { price: 999, on_sale: true },      // set these fields
    $push: { tags: "sale" },                   // append to tags array
    $currentDate: { updated_at: true }         // set updated_at to now
  }
)

// Give all apparel items a 10% discount
db.products.updateMany(
  { category: "apparel" },
  { $mul: { price: 0.90 } }                   // multiply price by 0.90
)

$push: { tags: "sale" }

Appends a value to an array field. The array grows — existing values are untouched. Use $addToSet instead if you want uniqueness guaranteed (won't add if the value already exists). Use $pop to remove from start or end. Use $pull to remove specific values.

$mul: { price: 0.90 }

Multiplies the field by the given value in a single atomic operation. No read-then-write needed. $mul: 0.90 applies 10% discount. Other arithmetic operators: $inc (add/subtract), $min/$max (only update if new value is lower/higher).
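These operators compose into useful atomic patterns. A sketch of a read-free stock decrement with a floor, and a running minimum — the quantity and lowest_price fields are hypothetical, not from the catalogue above:

```javascript
// Sketch: atomic updates with no read-then-write race.
// Decrement stock, with the filter guarding against going below zero:
db.products.updateOne(
  { name: "ProBook X14 Laptop", quantity: { $gt: 0 } },
  { $inc: { quantity: -1 } }
)

// $min only writes if the new value is lower — track the lowest price ever offered:
db.products.updateOne(
  { name: "ProBook X14 Laptop" },
  { $min: { lowest_price: 999 } }
)
```

Putting the stock check in the filter rather than in application code is what makes the decrement safe under concurrency.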

// DELETE — deleteOne, deleteMany
// Remove a specific product
db.products.deleteOne({ name: "Running Hat" })

// Remove all out-of-stock products older than 90 days
db.products.deleteMany({
  in_stock: false,
  created:  { $lt: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) }
})
{ acknowledged: true, deletedCount: 1 }

{ acknowledged: true, deletedCount: 14 }
-- 14 out-of-stock products older than 90 days removed

deletedCount — always check this. If it's 0, either your filter matched nothing or the document was already gone. A common bug: using deleteOne when deleteMany was intended — only the first matching document is removed.
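A defensive pattern, sketched in mongosh, that turns a silent no-op delete into a visible warning:

```javascript
// Sketch: fail loudly when a delete matched nothing.
const result = db.products.deleteOne({ name: "Running Hat" })
if (result.deletedCount === 0) {
  // Nothing matched — wrong filter, or the document was already removed.
  print("WARNING: deleteOne matched no documents")
}
```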

Indexing — The Difference Between Fast and Catastrophic

Without an index, every query does a collection scan — MongoDB reads every document to find matches. With 10 documents this is fine. With 10 million documents this means your API times out.

The scenario: Your products collection has 2 million documents. You query by category and price constantly. Here's what happens with and without an index:

// Check query performance BEFORE indexing
db.products.find({ category: "electronics", price: { $lt: 800 } })
           .explain("executionStats")
executionStats: {
  executionTimeMillis: 4820,      // 4.8 seconds
  totalDocsExamined:  2000000,    // read every document
  nReturned:          1847,       // only 1,847 matched
  stage: "COLLSCAN"               // COLLSCAN = full collection scan = bad
}
// Create a compound index on category + price
db.products.createIndex(
  { category: 1, price: 1 },     // 1 = ascending, -1 = descending
  { name: "category_price_idx" } // optional name for management
)

// Same query AFTER indexing
db.products.find({ category: "electronics", price: { $lt: 800 } })
           .explain("executionStats")
executionStats: {
  executionTimeMillis: 3,         // 3ms (was 4,820ms)
  totalDocsExamined:  1847,       // only examined matching docs
  nReturned:          1847,       // same result
  stage: "IXSCAN"                 // IXSCAN = index scan = good
}
-- 1,607x faster with one index
{ category: 1, price: 1 } — compound index field order matters

Put the equality field (category) first and the range field (price) second. This follows the ESR rule: Equality fields first, then Sort fields, then Range fields. An index on { price: 1, category: 1 } would be less efficient for this query because MongoDB can't narrow down by category first.

.explain("executionStats")

Always run explain on slow queries. Look for COLLSCAN (bad — no index used) vs IXSCAN (good — index used). Check totalDocsExamined against nReturned — a large ratio means your index isn't selective enough.
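Indexes aren't free — each one costs write throughput and RAM — so audit them periodically. These are standard mongosh index-management commands:

```javascript
// List every index on the collection, including the automatic _id index.
db.products.getIndexes()

// Drop an index by the name given at creation time.
db.products.dropIndex("category_price_idx")

// Hide an index from the query planner to test whether it is actually
// needed before dropping it for real (available since MongoDB 4.4).
db.products.hideIndex("category_price_idx")
db.products.unhideIndex("category_price_idx")
```

Hiding first is safer than dropping: rebuilding a dropped index on a 2-million-document collection is expensive, while unhiding is instant.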

The Aggregation Pipeline — Complex Queries Without JOINs

MongoDB's aggregation pipeline lets you process documents through a sequence of stages — filter, group, sort, transform, join — each stage passing its output to the next. This is how you run analytics, generate reports, and compute summaries without SQL.

The scenario: Your product manager wants a sales dashboard: total revenue and average price per category, sorted by revenue descending:

db.products.aggregate([

  // Stage 1: filter — only in-stock products
  { $match: { in_stock: true } },

  // Stage 2: group by category, compute stats
  { $group: {
      _id:          "$category",          // group key
      total_revenue: { $sum: "$price" },  // sum all prices in this group
      avg_price:     { $avg: "$price" },  // average price
      count:         { $sum: 1 }          // count documents in group
  }},

  // Stage 3: sort by total revenue descending
  { $sort: { total_revenue: -1 } },

  // Stage 4: reshape the output document
  { $project: {
      _id:           0,
      category:      "$_id",
      total_revenue: { $round: ["$total_revenue", 2] },
      avg_price:     { $round: ["$avg_price", 2] },
      count:         1
  }}

])
[
  { category: "electronics", total_revenue: 4821099.50, avg_price: 849.32, count: 5676 },
  { category: "apparel",     total_revenue: 892440.20,  avg_price: 44.12,  count: 20229 },
  { category: "books",       total_revenue: 341200.80,  avg_price: 18.90,  count: 18053 }
]

Each pipeline stage explained:

$match

Filters documents — like a WHERE clause. Always put $match as early as possible in the pipeline. It reduces the number of documents flowing through subsequent stages. If you have an index on the match field, it will be used here.

$group

Groups documents by a field and computes aggregate values. $sum, $avg, $min, $max, $count, $push (collect into array) are the main accumulators. The _id field inside $group defines the grouping key — equivalent to SQL's GROUP BY.

$project

Reshapes documents — rename fields, compute derived values, include/exclude fields. Here we rename _id to category and round the numbers. $round: ["$total_revenue", 2] rounds to 2 decimal places — the array syntax is MongoDB's expression language.

$lookup — Joining Collections When You Need To

Document databases discourage JOINs but don't forbid them. MongoDB's $lookup stage performs a left outer join between two collections. Use it sparingly — for reports and admin queries, not for hot paths.

The scenario: You need to enrich order documents with the full user profile for a monthly report. Orders reference users by user_id:

db.orders.aggregate([
  { $match: { status: "delivered" } },

  // Join users collection on user_id
  { $lookup: {
      from:         "users",      // collection to join
      localField:   "user_id",    // field in orders
      foreignField: "_id",        // field in users
      as:           "user_info"   // name for the joined array
  }},

  // $lookup returns an array — unwind to get a single object
  { $unwind: "$user_info" },

  { $project: {
      order_id:   1,
      total:      1,
      "user_info.name":  1,
      "user_info.email": 1
  }}
])
[
  {
    order_id: "ord_8821",
    total: 149.99,
    user_info: { name: "Priya Sharma", email: "priya@example.com" }
  },
  {
    order_id: "ord_8990",
    total: 89.50,
    user_info: { name: "Carlos Ruiz", email: "carlos@example.com" }
  }
]

When NOT to use $lookup:

$lookup is a full collection scan on the joined collection unless foreignField is indexed (the _id field used above is always indexed; any other join key needs an explicit index). On hot API paths — user profile pages, product detail pages — embed the data you need instead of joining. Use $lookup for background reports, admin dashboards, and batch jobs where a few extra milliseconds don't matter.

Embedding vs Referencing — The Core Design Decision

Every time you model data in a document database, you face one decision: embed the related data inside the document, or reference it by ID and fetch it separately. There is no universal answer — it depends on your access patterns.

Embed when:

✅ Data is always read together (order + line items)
✅ Data belongs to one parent only (address in a user)
✅ The embedded data is small and bounded
✅ 1-to-few relationship (user has 3-5 addresses)

Reference when:

🔗 Data is shared across many parents (a product in many orders)
🔗 The related data grows unboundedly (all comments on a post)
🔗 The related data is updated frequently and independently
🔗 1-to-many or many-to-many relationship at scale

The 16MB limit rule:

MongoDB documents have a hard 16MB size limit. If you embed an unbounded array (like all comments ever on a popular post), you will eventually hit this limit. For any relationship where the child count could grow to thousands, always reference — never embed.
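Applied to the comments example, the reference pattern might be sketched like this — one document per comment, keyed to its parent post and indexed for the common read path (the schema is illustrative):

```javascript
// Sketch: unbounded comments as a separate collection.
db.comments.insertOne({
  post_id: "post_001",          // reference back to the parent post
  user_id: "u_2",
  text:    "Great post!",
  created: new Date()
})

// Compound index serves "latest comments for a post" directly:
db.comments.createIndex({ post_id: 1, created: -1 })

// Page through a popular post's comments without ever loading all 50,000:
db.comments.find({ post_id: "post_001" })
           .sort({ created: -1 })
           .limit(20)
```

The post document stays small forever, and comment volume only grows the comments collection — which MongoDB handles fine.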

Teacher's Note

Document databases reward engineers who think about access patterns first and data structure second. The question is never "what is the most normalised way to store this?" It is always "how will I read this data most often?" If the answer is "together as one unit" — embed it. If the answer is "independently, from multiple places" — reference it. Every MongoDB performance problem I've ever debugged traced back to a data model designed like a SQL schema — normalised, referenced, JOIN-heavy. When you model documents like documents, queries become trivially simple.

Practice Questions — You're the Engineer

Scenario:

You're building an e-commerce order system in MongoDB. Each order has a fixed set of line items — typically 1 to 10 products per order. When you fetch an order, you always need the line items at the same time. Line items belong exclusively to one order and are never updated independently. Should you embed the line items inside the order document or store them in a separate collection?


Scenario:

You run db.orders.find({ status: "pending" }).explain("executionStats") and see totalDocsExamined: 4000000 and nReturned: 142. The query took 6,200ms. What does the stage field almost certainly show, and what does it mean?


Scenario:

You have an aggregation pipeline with 5 stages: $group, $sort, $limit, $project, and a filter stage. The filter stage reduces 10 million documents to 50,000 before the expensive $group runs. For best performance, which stage should come first in the pipeline — and what is it called?


Quiz — Document Database Decisions

Scenario:

You're building a blogging platform. Each blog post can receive thousands of comments over its lifetime. Popular posts on your platform have received over 50,000 comments. You're deciding whether to embed comments inside the post document or store them in a separate comments collection. What is the correct design and why?

Scenario:

Your most frequent MongoDB query is: db.products.find({ category: "electronics", price: { $lt: 500 } }). The collection has 5 million documents. The query is slow. You want to create one index that optimally serves this query. What should it be and in what field order?

Scenario:

Your aggregation pipeline generates a monthly revenue report. It runs $group, then $lookup to join user data, then $sort by revenue. The pipeline processes all 20 million orders in the collection even though only 180,000 were placed in the target month. The pipeline takes 4 minutes. What single change would have the biggest performance impact?

Up Next · Lesson 15

MongoDB in the NoSQL Ecosystem

Where MongoDB fits in the broader NoSQL landscape, how it compares to CouchDB and Firestore, and the real-world architectures where MongoDB is the obvious choice versus where engineers regret using it.