NoSQL
Embedding vs Referencing
Every MongoDB schema decision eventually comes down to one question: should this related data live inside the document, or in a separate collection with a reference? Get it right and your app reads everything it needs in a single query. Get it wrong and you are making three round trips to assemble one page — or worse, storing gigabytes of duplicated data that drifts out of sync.
The Two Strategies — What They Actually Mean
Embedding means storing related data directly inside the parent document — an array, a nested object, a sub-document. One document, one read, everything you need. This is MongoDB's equivalent of pre-joining at write time.
Referencing means storing a reference (typically an _id) that points to a document in another collection — like a foreign key in SQL. Two documents, two reads, application code to assemble the result.
Same order data — two completely different document shapes:
✅ Embedding — one document, one read
{"{"} _id: ObjectId("..."),
customer: "Alice",
status: "shipped",
items: [
{"{"} name: "Keyboard", qty: 1, price: 79.99 {"}"},
{"{"} name: "Mouse", qty: 2, price: 29.99 {"}"}
]
{"}"}
One findOne() returns the order AND all its items. No second query.
🔗 Referencing — two collections, two reads
{"{"} _id: ObjectId("order_1"),
customer: "Alice",
status: "shipped",
item_ids: ["itm_1", "itm_2"]
{"}"}
// order_items collection (separate)
{"{"} _id: "itm_1", name: "Keyboard", ... {"}"}
{"{"} _id: "itm_2", name: "Mouse", ... {"}"}
Fetch the order, then fetch each item separately. Two round trips minimum.
When to Embed
Embedding is the right choice when the related data is always read together with the parent, the child data belongs exclusively to one parent, and the child list has a predictable, bounded size.
Always read together
You always fetch the parent and its children at the same time. Embedding eliminates the second query entirely.
Order + line items ✓
Owned exclusively
The child data belongs to one and only one parent — it would never make sense to share it across multiple documents.
User + address list ✓
Bounded size
The embedded array will not grow without limit. MongoDB's document size limit is 16MB — an unbounded array will eventually hit it.
Blog post + comments (max ~500) ✓
Infrequently updated together
The child data does not change constantly. If every write touches the embedded array, the parent document churns — fragmenting the storage and increasing write amplification.
Product + static specs ✓
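When you do embed a list, you can enforce the bound at write time. MongoDB supports this natively with `$push` plus `$slice`, which keeps only the newest N elements. The sketch below simulates that capped-push behaviour in plain JavaScript (the 500-entry cap is an assumed figure, matching the comments example above):

```javascript
// Simulates MongoDB's { $push: { items: { $each: [item], $slice: -500 } } }:
// append the new entry, then keep only the newest `cap` elements.
function cappedPush(arr, item, cap) {
  return [...arr, item].slice(-cap); // drops the oldest entries beyond the cap
}

let comments = [];
for (let i = 0; i < 600; i++) {
  comments = cappedPush(comments, { id: i }, 500);
}
console.log(comments.length); // 500: the array never exceeds the cap
console.log(comments[0].id);  // 100: the oldest retained entry
```

The same idea in the shell is a single update with `$push`/`$each`/`$slice`, so the bound is maintained atomically on every write rather than cleaned up later.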
When to Reference
Referencing is the right choice when the related data is shared across multiple parents, the child list could grow without a clear upper bound, or the child data is large and only sometimes needed.
Shared across parents
A tag, a category, or an author belongs to many documents. Embedding a copy in every document means updating one tag requires rewriting thousands of documents.
Post → author (author has many posts) ✓
Unbounded growth
A user's activity log, a product's review history, all comments ever on a viral post. Arrays that grow forever will breach the 16MB document limit and cause write performance to degrade as the document grows.
User → all tweets ever ✗ embed
Rarely read together
If the app almost never needs the child data when reading the parent, embedding wastes bandwidth and memory transferring data you will ignore most of the time.
Product → full audit log ✗ embed
Frequently updated independently
If the child data changes constantly and independently from the parent, embedding means every child update rewrites the entire parent document. Referencing lets you update one small document instead.
Order → live inventory count ✗ embed
Hands-on — Modeling a Blog Platform
The scenario: You are the backend engineer for a publishing platform. The product has three main entities: Authors, Posts, and Comments. You need to decide the embedding strategy for each relationship. Authors write many posts. Posts have many comments. Some posts have thousands of comments from a viral moment — the engineering team learnt this the hard way when a post about a major outage hit the front page and its embedded comments array grew to 22MB, breaking MongoDB's 16MB document limit.
// authors collection — referenced from posts, not embedded
// Reason: one author → many posts. Embedding author in every
// post means updating a bio rewrites thousands of post documents.
db.authors.insertOne({
  _id: "auth_101",
  name: "Sarah Chen",
  bio: "Staff engineer. Writes about distributed systems.",
  twitter: "@schen_eng",
  joined: new Date("2021-03-15")
});
{ acknowledged: true, insertedId: 'auth_101' }

Author stored in its own collection — referenced by _id
The author document is the single source of truth. Every post stores author_id: "auth_101". When Sarah updates her bio, one document update propagates everywhere. If the bio were embedded in every post, updating it would require a multi-document updateMany touching potentially thousands of post documents — slow, error-prone, and easy to forget.
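To make the single-source-of-truth point concrete, here is a minimal in-memory sketch (plain JavaScript, hypothetical data, no database): because posts hold only an author_id, one write to the author record is visible through every post at read time.

```javascript
// In-memory stand-ins for the two collections.
const authors = new Map([["auth_101", { name: "Sarah Chen", bio: "old bio" }]]);
const posts = Array.from({ length: 3 }, (_, i) => ({
  _id: `post_${i}`,
  author_id: "auth_101", // reference, not a copy
}));

// One write: the equivalent of db.authors.updateOne({ _id }, { $set: { bio } }).
authors.get("auth_101").bio = "new bio";

// Every post resolves the reference at read time and sees the new bio.
const rendered = posts.map((p) => ({
  post: p._id,
  bio: authors.get(p.author_id).bio,
}));
console.log(rendered.every((r) => r.bio === "new bio")); // true
```

With embedding, the same change would be an updateMany rewriting every post document that carries a copy of the bio.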
// posts collection — embeds tags and metadata (bounded, owned)
// References author_id (shared, not owned)
// Does NOT embed comments (unbounded growth risk)
db.posts.insertOne({
  _id: "post_882",
  title: "Why We Migrated from Postgres to Cassandra",
  slug: "postgres-to-cassandra-migration",
  author_id: "auth_101",                     // reference — not embedded
  published: new Date("2025-01-14"),
  tags: ["cassandra", "nosql", "migration"], // embedded — bounded
  stats: {                                   // embedded — always read with post
    views: 142839,
    likes: 4201,
    shares: 892
  },
  body: "Last year our write throughput hit 800k/sec..."
  // comments NOT embedded — stored in separate collection
});
{ acknowledged: true, insertedId: 'post_882' }

tags: ["cassandra", "nosql", "migration"]
Tags are embedded because they are small, bounded (a post rarely has more than 10), always read with the post, and are specific to this post's content — not shared entities that need their own lifecycle. Searching posts by tag uses a MongoDB index on the tags array field — MongoDB indexes each array element individually.
stats: {"{"} views, likes, shares {"}"}
Stats are embedded because they are always displayed alongside the post and owned exclusively by it. Frequent updates to stats.views are fine — MongoDB's $inc operator updates a single field in place without rewriting the entire document, so the churn cost is minimal.
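As a rough illustration, this plain-JavaScript sketch mirrors what an `$inc` on the nested stats.views field does: exactly one numeric field changes, and sibling fields are untouched. (The `inc` helper is purely illustrative; it is not a MongoDB API.)

```javascript
// Mirrors: db.posts.updateOne({ _id: "post_882" },
//                             { $inc: { "stats.views": 1 } })
const post = { _id: "post_882", stats: { views: 142839, likes: 4201 } };

// Walks a dotted path and increments the final field in place.
function inc(doc, path, by) {
  const keys = path.split(".");
  const last = keys.pop();
  const target = keys.reduce((obj, key) => obj[key], doc);
  target[last] += by;
}

inc(post, "stats.views", 1);
console.log(post.stats.views); // 142840
console.log(post.stats.likes); // 4201: unchanged
```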
Comments deliberately excluded
A post with 50,000 comments embedded would be hundreds of megabytes — beyond MongoDB's 16MB document limit and far beyond what any page needs to load. Comments are referenced via post_id in their own collection and loaded separately with pagination.
The scenario continues: You design the comments collection. Comments are referenced from posts — each comment stores the post_id it belongs to. This lets you paginate, sort, and query comments independently without touching the post document at all.
// comments collection — references post_id and author_id
// Reason: unbounded growth (viral posts), paginated independently,
// needs its own query pattern (newest first, per post)
db.comments.insertMany([
  {
    _id: "cmt_001",
    post_id: "post_882",   // reference to parent post
    author_id: "auth_204", // reference to author
    author_name: "Bob K.", // denormalised snapshot — fast reads
    body: "Great write-up! We had the same hotspot issues.",
    posted_at: new Date("2025-01-14T09:22:00Z"),
    likes: 142
  },
  {
    _id: "cmt_002",
    post_id: "post_882",
    author_id: "auth_317",
    author_name: "Diana R.",
    body: "Did you consider DynamoDB before Cassandra?",
    posted_at: new Date("2025-01-14T10:05:00Z"),
    likes: 38
  }
]);
{ acknowledged: true, insertedIds: { '0': 'cmt_001', '1': 'cmt_002' } }

author_name: "Bob K." — denormalised snapshot
This is a deliberate, controlled duplication. The comment stores the author's name at the time of writing — a snapshot — so rendering the comments list needs only one collection query. Fetching the full author document for every comment would require one extra query per comment per page load. The trade-off: if Bob changes his display name, old comments show the old name. This is usually acceptable for historical comments — it is the same pattern Twitter uses for retweet attribution.
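A minimal sketch of that trade-off (plain JavaScript, hypothetical data): the snapshot captures the name at write time and is deliberately not rewritten when the author document changes later.

```javascript
const author = { _id: "auth_204", name: "Bob K." };

// The comment copies the display name at the moment it is written.
const comment = {
  post_id: "post_882",
  author_id: author._id,
  author_name: author.name, // snapshot taken now
  body: "Great write-up!",
};

author.name = "Robert K."; // the author renames themselves later

console.log(comment.author_name); // "Bob K.": the historical snapshot
console.log(author.name);         // "Robert K.": the source of truth
```

If current names must always be shown, the app can resolve author_id at read time instead, at the cost of the extra lookup the snapshot was added to avoid.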
post_id: "post_882" — the reference that enables pagination
With comments in their own collection, loading page 2 of comments is a simple find({ post_id: "post_882" }).sort({ posted_at: -1 }).skip(20).limit(20). With embedding, pagination would require loading the entire document with all comments and slicing in application code — transferring megabytes to return 20 rows.
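Under the hood, skip/limit pagination is just an offset of (page - 1) * pageSize into the sorted result. A plain-JavaScript sketch of the same query shape, using an in-memory stand-in for the collection:

```javascript
// 55 hypothetical comments on one post; larger posted_at = newer.
const allComments = Array.from({ length: 55 }, (_, i) => ({
  post_id: "post_882",
  posted_at: i,
}));

// Equivalent of .find({ post_id }).sort({ posted_at: -1 })
//               .skip((page - 1) * pageSize).limit(pageSize)
function commentPage(comments, postId, page, pageSize = 20) {
  return comments
    .filter((c) => c.post_id === postId)
    .sort((a, b) => b.posted_at - a.posted_at) // newest first
    .slice((page - 1) * pageSize, page * pageSize);
}

console.log(commentPage(allComments, "post_882", 2).length); // 20
console.log(commentPage(allComments, "post_882", 3).length); // 15 (the remainder)
```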
The scenario continues: You now write the queries the frontend team will use. The post detail page needs the post and its author in one response. Comments are loaded separately in a paginated query. Both must be fast on a collection with 10 million posts and 500 million comments.
// Query 1: fetch post + author for the post detail page
// Two separate finds — the slug lookup needs a (unique) index on slug;
// the author fetch is a direct indexed _id lookup
const post = await db.posts.findOne({ slug: "postgres-to-cassandra-migration" });
const author = await db.authors.findOne({ _id: post.author_id });
// Query 2: first page of comments, newest first
// Requires an index on { post_id: 1, posted_at: -1 }
const comments = await db.comments
  .find({ post_id: post._id })
  .sort({ posted_at: -1 })
  .limit(20)
  .toArray();
post: { _id: 'post_882', title: 'Why We Migrated...', stats: {...}, ... }
author: { _id: 'auth_101', name: 'Sarah Chen', bio: '...', ... }
comments (page 1, newest first):
[0] Bob K. — "Great write-up! We had the same hotspot issues."
[1] Diana R. — "Did you consider DynamoDB before Cassandra?"
...
Query times: post 1.2ms | author 0.8ms | comments 2.1ms

Three queries, all indexed, all under 3ms
The author fetch is a direct _id lookup — the fastest possible operation in MongoDB — and the post lookup is just as fast given a unique index on slug. The comments query uses a compound index on { post_id: 1, posted_at: -1 }. Without that index, it would scan all 500 million comment documents — always create the index that matches your sort + filter combination exactly.
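For reference, that compound index is created once with a standard createIndex call (run in the mongo shell or a migration script; the direction of posted_at matches the query's descending sort):

```javascript
// Compound index matching the filter (post_id) and the sort (posted_at desc)
db.comments.createIndex({ post_id: 1, posted_at: -1 });
```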
The Hybrid Pattern — Extended Reference
In practice, the choice is rarely pure embedding or pure referencing. The extended reference pattern sits in between — store the reference ID plus a small snapshot of the most-used fields from the referenced document. This eliminates the second query for the common case while keeping the referenced document as the source of truth.
// Extended reference: store author_id + snapshot of display fields
// The full author document still exists in the authors collection
db.posts.insertOne({
  _id: "post_999",
  title: "Indexing Strategies in MongoDB",
  // Extended reference — id + snapshot of display fields only
  author: {
    _id: "auth_101",    // reference — can always fetch full doc
    name: "Sarah Chen", // snapshot — avoids second query on render
    avatar: "https://cdn.../sarah.jpg" // snapshot
  },
  published: new Date("2025-01-20"),
  tags: ["mongodb", "indexing"]
});
{ acknowledged: true, insertedId: 'post_999' }
// Rendering the post list now needs ZERO extra author queries:
// post.author.name → "Sarah Chen" (from snapshot)
// post.author.avatar → "https://..." (from snapshot)
// post.author._id → available if full profile link needed

Extended reference — the sweet spot
The post list page shows 20 posts with author names and avatars. With pure referencing, that is 20 extra author queries per page load. With extended reference, all 20 posts carry the name and avatar snapshot — zero extra queries. The _id is still there so the full author profile page is one click away. Snapshot fields should be the ones that rarely change (name, avatar) — not the ones that change frequently (follower count, last active).
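A small in-memory sketch (plain JavaScript, hypothetical data) of that list render: every displayed field comes from the post document itself, so the count of extra author lookups stays at zero.

```javascript
// Posts carrying an extended reference: id plus a display-field snapshot.
const posts = [
  { _id: "post_999", title: "Indexing Strategies in MongoDB",
    author: { _id: "auth_101", name: "Sarah Chen", avatar: "sarah.jpg" } },
  { _id: "post_998", title: "Schema Design Basics",
    author: { _id: "auth_101", name: "Sarah Chen", avatar: "sarah.jpg" } },
];

let authorLookups = 0; // with pure referencing this would equal posts.length

const listView = posts.map((p) => ({
  title: p.title,
  byline: p.author.name,   // from the snapshot, no extra query
  avatar: p.author.avatar, // from the snapshot
  profileUrl: `/authors/${p.author._id}`, // the reference is still available
}));

console.log(authorLookups);      // 0
console.log(listView[0].byline); // "Sarah Chen"
```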
Decision Framework — One Rule Per Situation
| Situation | Strategy | Example |
|---|---|---|
| Data always read together, owned by one parent, bounded size | Embed | Order + line items |
| Data shared across many parents, or updated independently | Reference | Post → author_id |
| Child array could grow without bound | Reference | Post + comments collection |
| Referenced, but a few fields needed on every parent read | Extended reference | Post + author name snapshot |
| Many-to-many relationship | Reference arrays on both sides | Post ↔ tags collection |
| Child data large but rarely needed alongside parent | Reference | Product → full audit log |
Teacher's Note
The extended reference pattern is the one most engineers discover too late — usually after they have already caused an N+1 query problem in production by using pure referencing. When you catch yourself writing a loop that fetches one document per item in a list, that is your signal to add a snapshot. Embed just the fields you always display, reference the rest.
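The N+1 shape the note warns about is easy to spot in code: a loop that performs one fetch per list item. The sketch below (plain JavaScript, hypothetical in-memory data) shows the loop collapsed into a single batched fetch — in MongoDB, one find with $in — and the extended-reference snapshot then removes even that single query for display fields.

```javascript
const sellers = new Map([
  ["s1", { name: "Acme" }],
  ["s2", { name: "Globex" }],
]);
const products = [
  { name: "Widget", seller_id: "s1" },
  { name: "Gadget", seller_id: "s2" },
  { name: "Gizmo",  seller_id: "s1" },
];

// N+1 shape: sellers.get(...) once per product inside the render loop.
// Batch fix: collect the distinct ids, fetch them in ONE query, join in memory.
const ids = [...new Set(products.map((p) => p.seller_id))];
// ≈ db.sellers.find({ _id: { $in: ids } }) — a single round trip
const batch = new Map(ids.map((id) => [id, sellers.get(id)]));

const page = products.map((p) => ({
  product: p.name,
  seller: batch.get(p.seller_id).name,
}));
console.log(page.length);    // 3
console.log(page[0].seller); // "Acme"
```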
Practice Questions — You're the Engineer
Scenario:
A marketplace page displays a grid of 40 products, and each product document stores its seller only as a seller_id. When rendering the page, the app fires 40 separate queries to the sellers collection to get each seller's name and logo for display — one query per product. Your tech lead says there is a pattern that would reduce this to zero extra queries while keeping the seller document as the single source of truth. What pattern is it?
Scenario:
A recipe app gives every recipe a cuisine field — values like "Italian", "Japanese", "Mexican". There are only 25 cuisine types but they each have a description, an image URL, and a list of featured ingredients. A million recipes exist. A product manager proposes embedding the full cuisine object in every recipe document. You push back. What modeling strategy should you use instead, and why?
Quiz — Embedding vs Referencing in Production
Scenario:
Your team is debating whether each order's line items should embed the full product document or store only a product_id. A product's name, image, and price can change at any time. There are 50 million orders referencing 200,000 products. You advocate for referencing. What is the strongest argument for your position?
Scenario:
find({"{"} post_id: "post_882" {"}"}).sort({"{"} posted_at: -1 {"}"}).limit(20) is taking 4 seconds. You run explain() and see COLLSCAN — a full collection scan — followed by an in-memory sort. The comments collection has no indexes other than the default _id index. What single index would fix both the scan and the sort?
Up Next · Lesson 26
Denormalization
How deliberately duplicating data is not a design flaw in NoSQL — it is the strategy that keeps reads fast at any scale.