NoSQL
Embedding vs Referencing
Every MongoDB schema decision eventually comes down to one question: should this related data live inside the document, or in a separate collection with a reference? Get it right and your app reads everything it needs in a single query. Get it wrong and you are making three round trips to assemble one page — or worse, storing gigabytes of duplicated data that drifts out of sync.
The Two Strategies — What They Actually Mean
Embedding means storing related data directly inside the parent document — an array, a nested object, a sub-document. One document, one read, everything you need. This is MongoDB's equivalent of pre-joining at write time.
Referencing means storing a reference (typically an _id) that points to a document in another collection — like a foreign key in SQL. Two documents, two reads, application code to assemble the result.
Same order data — two completely different document shapes:
✅ Embedding — one document, one read
{"{"} _id: ObjectId("..."),
customer: "Alice",
status: "shipped",
items: [
{"{"} name: "Keyboard", qty: 1, price: 79.99 {"}"},
{"{"} name: "Mouse", qty: 2, price: 29.99 {"}"}
]
{"}"}
One findOne() returns the order AND all its items. No second query.
🔗 Referencing — two collections, two reads
{"{"} _id: ObjectId("order_1"),
customer: "Alice",
status: "shipped",
item_ids: ["itm_1", "itm_2"]
{"}"}
// order_items collection (separate)
{"{"} _id: "itm_1", name: "Keyboard", ... {"}"}
{"{"} _id: "itm_2", name: "Mouse", ... {"}"}
Fetch the order, then fetch each item separately. Two round trips minimum.
When to Embed
Embedding is the right choice when the related data is always read together with the parent, the child data belongs exclusively to one parent, and the child list has a predictable, bounded size.
Always read together
You always fetch the parent and its children at the same time. Embedding eliminates the second query entirely.
Order + line items ✓
Owned exclusively
The child data belongs to one and only one parent — it would never make sense to share it across multiple documents.
User + address list ✓
Bounded size
The embedded array will not grow without limit. MongoDB's document size limit is 16MB — an unbounded array will eventually hit it.
Blog post + comments (max ~500) ✓
Infrequently updated together
The child data does not change constantly. If every write touches the embedded array, the parent document churns — fragmenting the storage and increasing write amplification.
Product + static specs ✓
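When you do embed a list, you can enforce the bound at write time. MongoDB supports this natively with `$push` plus `$slice`, which keeps only the newest N elements. The sketch below simulates that capped-push behaviour in plain JavaScript (the 500-entry cap is an assumed figure, matching the comments example above):

```javascript
// Simulates MongoDB's { $push: { items: { $each: [item], $slice: -500 } } }:
// append the new entry, then keep only the newest `cap` elements.
function cappedPush(arr, item, cap) {
  return [...arr, item].slice(-cap); // drops the oldest entries beyond the cap
}

let comments = [];
for (let i = 0; i < 600; i++) {
  comments = cappedPush(comments, { id: i }, 500);
}
console.log(comments.length); // 500: the array never exceeds the cap
console.log(comments[0].id);  // 100: the oldest retained entry
```

The same idea in the shell is a single update with `$push`/`$each`/`$slice`, so the bound is maintained atomically on every write rather than cleaned up later.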
When to Reference
Referencing is the right choice when the related data is shared across multiple parents, the child list could grow without a clear upper bound, or the child data is large and only sometimes needed.
Shared across parents
A tag, a category, or an author belongs to many documents. Embedding a copy in every document means updating one tag requires rewriting thousands of documents.
Post → author (author has many posts) ✓
Unbounded growth
A user's activity log, a product's review history, all comments ever on a viral post. Arrays that grow forever will breach the 16MB document limit and cause write performance to degrade as the document grows.
User → all tweets ever ✗ embed
Rarely read together
If the app almost never needs the child data when reading the parent, embedding wastes bandwidth and memory transferring data you will ignore most of the time.
Product → full audit log ✗ embed
Frequently updated independently
If the child data changes constantly and independently from the parent, embedding means every child update rewrites the entire parent document. Referencing lets you update one small document instead.
Order → live inventory count ✗ embed
Hands-on — Modeling a Blog Platform
The scenario: You are the backend engineer for a publishing platform. The product has three main entities: Authors, Posts, and Comments. You need to decide the embedding strategy for each relationship. Authors write many posts. Posts have many comments. Some posts have thousands of comments from a viral moment — the engineering team learnt this the hard way when a post about a major outage hit the front page and its embedded comments array grew to 22MB, breaking MongoDB's 16MB document limit.
// authors collection — referenced from posts, not embedded
// Reason: one author → many posts. Embedding author in every
// post means updating a bio rewrites thousands of post documents.
db.authors.insertOne({
  _id: "auth_101",
  name: "Sarah Chen",
  bio: "Staff engineer. Writes about distributed systems.",
  twitter: "@schen_eng",
  joined: new Date("2021-03-15")
});
{ acknowledged: true, insertedId: 'auth_101' }

Author stored in its own collection — referenced by _id
The author document is the single source of truth. Every post stores author_id: "auth_101". When Sarah updates her bio, one document update propagates everywhere. If the bio were embedded in every post, updating it would require a multi-document updateMany touching potentially thousands of post documents — slow, error-prone, and easy to forget.
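To make the single-source-of-truth point concrete, here is a minimal in-memory sketch (plain JavaScript, hypothetical data, no database): because posts hold only an author_id, one write to the author record is visible through every post at read time.

```javascript
// In-memory stand-ins for the two collections.
const authors = new Map([["auth_101", { name: "Sarah Chen", bio: "old bio" }]]);
const posts = Array.from({ length: 3 }, (_, i) => ({
  _id: `post_${i}`,
  author_id: "auth_101", // reference, not a copy
}));

// One write: the equivalent of db.authors.updateOne({ _id }, { $set: { bio } }).
authors.get("auth_101").bio = "new bio";

// Every post resolves the reference at read time and sees the new bio.
const rendered = posts.map((p) => ({
  post: p._id,
  bio: authors.get(p.author_id).bio,
}));
console.log(rendered.every((r) => r.bio === "new bio")); // true
```

With embedding, the same change would be an updateMany rewriting every post document that carries a copy of the bio.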
// posts collection — embeds tags and metadata (bounded, owned)
// References author_id (shared, not owned)
// Does NOT embed comments (unbounded growth risk)
db.posts.insertOne({
  _id: "post_882",
  title: "Why We Migrated from Postgres to Cassandra",
  slug: "postgres-to-cassandra-migration",
  author_id: "auth_101",                     // reference — not embedded
  published: new Date("2025-01-14"),
  tags: ["cassandra", "nosql", "migration"], // embedded — bounded
  stats: {                                   // embedded — always read with post
    views: 142839,
    likes: 4201,
    shares: 892
  },
  body: "Last year our write throughput hit 800k/sec..."
  // comments NOT embedded — stored in separate collection
});
{ acknowledged: true, insertedId: 'post_882' }

tags: ["cassandra", "nosql", "migration"]
Tags are embedded because they are small, bounded (a post rarely has more than 10), always read with the post, and are specific to this post's content — not shared entities that need their own lifecycle. Searching posts by tag uses a MongoDB index on the tags array field — MongoDB indexes each array element individually.
stats: {"{"} views, likes, shares {"}"}
Stats are embedded because they are always displayed alongside the post and owned exclusively by it. Frequent updates to stats.views are fine — MongoDB's $inc operator updates a single field in place without rewriting the entire document, so the churn cost is minimal.
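As a rough illustration, this plain-JavaScript sketch mirrors what an `$inc` on the nested stats.views field does: exactly one numeric field changes, and sibling fields are untouched. (The `inc` helper is purely illustrative; it is not a MongoDB API.)

```javascript
// Mirrors: db.posts.updateOne({ _id: "post_882" },
//                             { $inc: { "stats.views": 1 } })
const post = { _id: "post_882", stats: { views: 142839, likes: 4201 } };

// Walks a dotted path and increments the final field in place.
function inc(doc, path, by) {
  const keys = path.split(".");
  const last = keys.pop();
  const target = keys.reduce((obj, key) => obj[key], doc);
  target[last] += by;
}

inc(post, "stats.views", 1);
console.log(post.stats.views); // 142840
console.log(post.stats.likes); // 4201: unchanged
```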
Comments deliberately excluded
A post with 50,000 comments embedded would be hundreds of megabytes — beyond MongoDB's 16MB document limit and far beyond what any page needs to load. Comments are referenced via post_id in their own collection and loaded separately with pagination.
The scenario continues: You design the comments collection. Comments are referenced from posts — each comment stores the post_id it belongs to. This lets you paginate, sort, and query comments independently without touching the post document at all.
// comments collection — references post_id and author_id
// Reason: unbounded growth (viral posts), paginated independently,
// needs its own query pattern (newest first, per post)
db.comments.insertMany([
  {
    _id: "cmt_001",
    post_id: "post_882",   // reference to parent post
    author_id: "auth_204", // reference to author
    author_name: "Bob K.", // denormalised snapshot — fast reads
    body: "Great write-up! We had the same hotspot issues.",
    posted_at: new Date("2025-01-14T09:22:00Z"),
    likes: 142
  },
  {
    _id: "cmt_002",
    post_id: "post_882",
    author_id: "auth_317",
    author_name: "Diana R.",
    body: "Did you consider DynamoDB before Cassandra?",
    posted_at: new Date("2025-01-14T10:05:00Z"),
    likes: 38
  }
]);
{ acknowledged: true, insertedIds: { '0': 'cmt_001', '1': 'cmt_002' } }

author_name: "Bob K." — denormalised snapshot
This is a deliberate, controlled duplication. The comment stores the author's name at the time of writing — a snapshot — so rendering the comments list needs only one collection query. Fetching the full author document for every comment would require one extra query per comment per page load. The trade-off: if Bob changes his display name, old comments show the old name. This is usually acceptable for historical comments — it is the same pattern Twitter uses for retweet attribution.
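A minimal sketch of that trade-off (plain JavaScript, hypothetical data): the snapshot captures the name at write time and is deliberately not rewritten when the author document changes later.

```javascript
const author = { _id: "auth_204", name: "Bob K." };

// The comment copies the display name at the moment it is written.
const comment = {
  post_id: "post_882",
  author_id: author._id,
  author_name: author.name, // snapshot taken now
  body: "Great write-up!",
};

author.name = "Robert K."; // the author renames themselves later

console.log(comment.author_name); // "Bob K.": the historical snapshot
console.log(author.name);         // "Robert K.": the source of truth
```

If current names must always be shown, the app can resolve author_id at read time instead, at the cost of the extra lookup the snapshot was added to avoid.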
post_id: "post_882" — the reference that enables pagination
With comments in their own collection, loading page 2 of comments is a simple find({ post_id: "post_882" }).sort({ posted_at: -1 }).skip(20).limit(20). With embedding, pagination would require loading the entire document with all comments and slicing in application code — transferring megabytes to return 20 rows.
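Under the hood, skip/limit pagination is just an offset of (page - 1) * pageSize into the sorted result. A plain-JavaScript sketch of the same query shape, using an in-memory stand-in for the collection:

```javascript
// 55 hypothetical comments on one post; larger posted_at = newer.
const allComments = Array.from({ length: 55 }, (_, i) => ({
  post_id: "post_882",
  posted_at: i,
}));

// Equivalent of .find({ post_id }).sort({ posted_at: -1 })
//               .skip((page - 1) * pageSize).limit(pageSize)
function commentPage(comments, postId, page, pageSize = 20) {
  return comments
    .filter((c) => c.post_id === postId)
    .sort((a, b) => b.posted_at - a.posted_at) // newest first
    .slice((page - 1) * pageSize, page * pageSize);
}

console.log(commentPage(allComments, "post_882", 2).length); // 20
console.log(commentPage(allComments, "post_882", 3).length); // 15 (the remainder)
```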
The scenario continues: You now write the queries the frontend team will use. The post detail page needs the post and its author in one response. Comments are loaded separately in a paginated query. Both must be fast on a collection with 10 million posts and 500 million comments.
// Query 1: fetch post + author for the post detail page
// Two separate finds — the slug lookup needs a (unique) index on slug;
// the author fetch is a direct indexed _id lookup
const post = await db.posts.findOne({ slug: "postgres-to-cassandra-migration" });
const author = await db.authors.findOne({ _id: post.author_id });
// Query 2: first page of comments, newest first
// Requires an index on { post_id: 1, posted_at: -1 }
const comments = await db.comments
  .find({ post_id: post._id })
  .sort({ posted_at: -1 })
  .limit(20)
  .toArray();
post: { _id: 'post_882', title: 'Why We Migrated...', stats: {...}, ... }
author: { _id: 'auth_101', name: 'Sarah Chen', bio: '...', ... }
comments (page 1, newest first):
[0] Bob K. — "Great write-up! We had the same hotspot issues."
[1] Diana R. — "Did you consider DynamoDB before Cassandra?"
...
Query times: post 1.2ms | author 0.8ms | comments 2.1ms

Three queries, all indexed, all under 3ms
The author fetch is a direct _id lookup — the fastest possible operation in MongoDB — and the post lookup is just as fast given a unique index on slug. The comments query uses a compound index on { post_id: 1, posted_at: -1 }. Without that index, it would scan all 500 million comment documents — always create the index that matches your sort + filter combination exactly.
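For reference, that compound index is created once with a standard createIndex call (run in the mongo shell or a migration script; the direction of posted_at matches the query's descending sort):

```javascript
// Compound index matching the filter (post_id) and the sort (posted_at desc)
db.comments.createIndex({ post_id: 1, posted_at: -1 });
```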
The Hybrid Pattern — Extended Reference
In practice, the choice is rarely pure embedding or pure referencing. The extended reference pattern sits in between — store the reference ID plus a small snapshot of the most-used fields from the referenced document. This eliminates the second query for the common case while keeping the referenced document as the source of truth.
// Extended reference: store author_id + snapshot of display fields
// The full author document still exists in the authors collection
db.posts.insertOne({
  _id: "post_999",
  title: "Indexing Strategies in MongoDB",
  // Extended reference — id + snapshot of display fields only
  author: {
    _id: "auth_101",    // reference — can always fetch full doc
    name: "Sarah Chen", // snapshot — avoids second query on render
    avatar: "https://cdn.../sarah.jpg" // snapshot
  },
  published: new Date("2025-01-20"),
  tags: ["mongodb", "indexing"]
});
{ acknowledged: true, insertedId: 'post_999' }
// Rendering the post list now needs ZERO extra author queries:
// post.author.name → "Sarah Chen" (from snapshot)
// post.author.avatar → "https://..." (from snapshot)
// post.author._id → available if full profile link needed

Extended reference — the sweet spot
The post list page shows 20 posts with author names and avatars. With pure referencing, that is 20 extra author queries per page load. With extended reference, all 20 posts carry the name and avatar snapshot — zero extra queries. The _id is still there so the full author profile page is one click away. Snapshot fields should be the ones that rarely change (name, avatar) — not the ones that change frequently (follower count, last active).
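A small in-memory sketch (plain JavaScript, hypothetical data) of that list render: every displayed field comes from the post document itself, so the count of extra author lookups stays at zero.

```javascript
// Posts carrying an extended reference: id plus a display-field snapshot.
const posts = [
  { _id: "post_999", title: "Indexing Strategies in MongoDB",
    author: { _id: "auth_101", name: "Sarah Chen", avatar: "sarah.jpg" } },
  { _id: "post_998", title: "Schema Design Basics",
    author: { _id: "auth_101", name: "Sarah Chen", avatar: "sarah.jpg" } },
];

let authorLookups = 0; // with pure referencing this would equal posts.length

const listView = posts.map((p) => ({
  title: p.title,
  byline: p.author.name,   // from the snapshot, no extra query
  avatar: p.author.avatar, // from the snapshot
  profileUrl: `/authors/${p.author._id}`, // the reference is still available
}));

console.log(authorLookups);      // 0
console.log(listView[0].byline); // "Sarah Chen"
```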
Decision Framework — One Rule Per Situation
| Situation | Strategy | Example |
|---|---|---|
| Data always read together, owned by one parent, bounded size | Embed | Order + line items |
| Data shared across many parents, or updated independently | Reference | Post → author_id |
| Child array could grow without bound | Reference | Post + comments collection |
| Referenced, but a few fields needed on every parent read | Extended reference | Post + author name snapshot |
| Many-to-many relationship | Reference arrays on both sides | Post ↔ tags collection |
| Child data large but rarely needed alongside parent | Reference | Product → full audit log |
Teacher's Note
The extended reference pattern is the one most engineers discover too late — usually after they have already caused an N+1 query problem in production by using pure referencing. When you catch yourself writing a loop that fetches one document per item in a list, that is your signal to add a snapshot. Embed just the fields you always display, reference the rest.
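The N+1 shape the note warns about is easy to spot in code: a loop that performs one fetch per list item. The sketch below (plain JavaScript, hypothetical in-memory data) shows the loop collapsed into a single batched fetch — in MongoDB, one find with $in — and the extended-reference snapshot then removes even that single query for display fields.

```javascript
const sellers = new Map([
  ["s1", { name: "Acme" }],
  ["s2", { name: "Globex" }],
]);
const products = [
  { name: "Widget", seller_id: "s1" },
  { name: "Gadget", seller_id: "s2" },
  { name: "Gizmo",  seller_id: "s1" },
];

// N+1 shape: sellers.get(...) once per product inside the render loop.
// Batch fix: collect the distinct ids, fetch them in ONE query, join in memory.
const ids = [...new Set(products.map((p) => p.seller_id))];
// ≈ db.sellers.find({ _id: { $in: ids } }) — a single round trip
const batch = new Map(ids.map((id) => [id, sellers.get(id)]));

const page = products.map((p) => ({
  product: p.name,
  seller: batch.get(p.seller_id).name,
}));
console.log(page.length);    // 3
console.log(page[0].seller); // "Acme"
```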
Practice Questions — You're the Engineer
Scenario:
A marketplace page displays a grid of 40 products, and each product document stores its seller only as a seller_id. When rendering the page, the app fires 40 separate queries to the sellers collection to get each seller's name and logo for display — one query per product. Your tech lead says there is a pattern that would reduce this to zero extra queries while keeping the seller document as the single source of truth. What pattern is it?
Scenario:
A recipe app gives every recipe a cuisine field — values like "Italian", "Japanese", "Mexican". There are only 25 cuisine types but they each have a description, an image URL, and a list of featured ingredients. A million recipes exist. A product manager proposes embedding the full cuisine object in every recipe document. You push back. What modeling strategy should you use instead, and why?
Quiz — Embedding vs Referencing in Production
Scenario:
Your team is debating whether each order's line items should embed the full product document or store only a product_id. A product's name, image, and price can change at any time. There are 50 million orders referencing 200,000 products. You advocate for referencing. What is the strongest argument for your position?
Scenario:
find({"{"} post_id: "post_882" {"}"}).sort({"{"} posted_at: -1 {"}"}).limit(20) is taking 4 seconds. You run explain() and see COLLSCAN — a full collection scan — followed by an in-memory sort. The comments collection has no indexes other than the default _id index. What single index would fix both the scan and the sort?
Up Next · Lesson 26
Denormalization
How deliberately duplicating data is not a design flaw in NoSQL — it is the strategy that keeps reads fast at any scale.