MongoDB Lesson 24 – Embedded vs Referenced | Dataplexa

Embedded vs Referenced

The single most consequential decision in any MongoDB schema is whether to embed related data inside a document or store it in a separate collection and link it with a reference. Embed and your reads are fast, atomic, and require no joins — but updates to shared data become expensive. Reference and your data stays normalised and independently queryable — but every read that needs both sides requires either a $lookup aggregation or two separate queries. There is no universally correct answer. This lesson gives you the decision framework, the trade-off matrix, and the patterns you need to make the right call every time, illustrated throughout with the Dataplexa Store dataset.

Embedding — Data Inside the Document

Embedding places related data directly inside the parent document as a sub-document or an array of sub-documents. The entire entity — parent and children — is stored and retrieved as one unit. This is MongoDB's default recommendation and the approach that gives you the most benefit from the document model.

# Embedding — related data lives inside the parent document

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Example: order with embedded line items (already our Dataplexa model)
# The items array is embedded — retrieved in the same query as the order
order = db.orders.find_one(
    {"_id": "o001"},
)
print("Order with embedded items — single query:")
print(f"  order id:  {order['_id']}")
print(f"  user:      {order['user_id']}")
print(f"  status:    {order['status']}")
print(f"  total:     ${order['total']:.2f}")
print(f"  items:")
for item in order["items"]:
    print(f"    product: {item['product_id']}  qty: {item['qty']}  price: ${item['price']:.2f}")

# Embedding is ideal here because:
# 1. Items are ALWAYS needed with the order — never queried independently
# 2. The item count per order is BOUNDED (realistic max ~100 items)
# 3. Writing the order and items is one ATOMIC operation — no partial writes
# 4. A single network round trip: one query returns everything

print("\nBenefits of embedding for order items:")
benefits = [
    "One query returns order + all items in a single round trip",
    "Atomic write — order and items succeed or fail together",
    "No $lookup required — simpler query code",
    "Items are always consistent with their parent order",
]
for b in benefits:
    print(f"  ✓ {b}")

Order with embedded items — single query:
  order id:  o001
  user:      u001
  status:    delivered
  total:     $44.96
  items:
    product: p001  qty: 1  price: $29.99
    product: p003  qty: 3  price: $4.99

Benefits of embedding for order items:
  ✓ One query returns order + all items in a single round trip
  ✓ Atomic write — order and items succeed or fail together
  ✓ No $lookup required — simpler query code
  ✓ Items are always consistent with their parent order

Embedding is the default choice — start by embedding and only switch to referencing when a specific problem forces you to
Single-document atomicity is free with embedding — MongoDB guarantees that writes to one document are atomic without requiring a multi-document transaction
Reads are faster with embedding because MongoDB reads one document from disk instead of performing an index lookup followed by a second read on another collection
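
As a minimal sketch of the single-document atomicity point above, the helper below builds the filter and update document for appending one line item to an order and adjusting its total in the same write. The function name `build_add_item_update` and the product id `p002` are hypothetical; the field names (`items`, `product_id`, `qty`, `price`, `total`) follow the order document shown earlier. No database connection is needed to inspect the result.

```python
def build_add_item_update(order_id, product_id, qty, price):
    """Return (filter, update) for appending one embedded line item.

    Hypothetical helper: both changes target the same order document,
    so MongoDB applies them together or not at all.
    """
    line_total = round(qty * price, 2)
    query = {"_id": order_id}
    update = {
        # Append the new line item to the embedded array
        "$push": {"items": {"product_id": product_id, "qty": qty, "price": price}},
        # Keep the stored total in step with the items array
        "$inc": {"total": line_total},
    }
    return query, update


query, update = build_add_item_update("o001", "p002", 2, 9.99)
print(query)
print(update)

# With a live connection this would be sent as one write:
#   db.orders.update_one(query, update)
```

Because `$push` and `$inc` travel in one update on one document, no multi-document transaction is involved. The same change against a referenced items collection would need two writes.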

Referencing — Links Between Collections

Referencing stores the related data in a separate collection and links the two documents with a shared identifier — typically the _id of one document stored as a field in the other. The application then issues two queries or uses a $lookup aggregation stage to join the data at query time.

# Referencing — linked documents in separate collections

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Example: reviews referencing both user and product
# The review stores user_id and product_id as foreign-key-style references
review = db.reviews.find_one({"_id": "r001"})
print("Review document (references only — no embedded data):")
print(f"  review id:  {review['_id']}")
print(f"  product_id: {review['product_id']}  ← reference to products collection")
print(f"  user_id:    {review['user_id']}      ← reference to users collection")
print(f"  rating:     {review['rating']}")
print(f"  comment:    {review['comment']}")

# To display the full review with product name and user name
# requires two additional queries (or a $lookup aggregation)
product = db.products.find_one({"_id": review["product_id"]}, {"name": 1, "_id": 0})
user    = db.users.find_one({"_id": review["user_id"]},    {"name": 1, "_id": 0})

print(f"\nFull review (after two additional queries):")
print(f"  User:    {user['name']}")
print(f"  Product: {product['name']}")
print(f"  Rating:  {review['rating']} — \"{review['comment']}\"")

# Why reviews are referenced — not embedded in the product:
print("\nWhy reviews are referenced (not embedded in product):")
reasons = [
    "Unbounded growth — a popular product can have 100,000+ reviews",
    "Queried independently — 'all reviews by user u001' spans products",
    "Embedding would bloat product documents beyond useful size",
    "Reviews can be paginated, sorted, filtered as a standalone collection",
]
for r in reasons:
    print(f"  ✓ {r}")

Review document (references only — no embedded data):
  review id:  r001
  product_id: p001  ← reference to products collection
  user_id:    u001      ← reference to users collection
  rating:     5
  comment:    Fast and responsive, great for gaming.

Full review (after two additional queries):
  User:    Alice Johnson
  Product: Wireless Mouse
  Rating:  5 — "Fast and responsive, great for gaming."

Why reviews are referenced (not embedded in product):
  ✓ Unbounded growth — a popular product can have 100,000+ reviews
  ✓ Queried independently — 'all reviews by user u001' spans products
  ✓ Embedding would bloat product documents beyond useful size
  ✓ Reviews can be paginated, sorted, filtered as a standalone collection

Referencing is the right default when child data can grow without bound or when children are frequently queried independently of their parent
MongoDB does not enforce referential integrity — if you delete a product, its orphaned reviews remain. The application must manage consistency
Use $lookup in an aggregation pipeline to join referenced collections server-side, avoiding multiple round trips
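
Since MongoDB will not stop a product from being deleted while reviews still point at it, the application has to find or clean up the dependents itself. The sketch below shows one way to detect orphans, using in-memory lists as stand-ins for query results. The function name `find_orphaned_reviews`, the extra review ids, and the product id `p999` are assumptions made for illustration.

```python
def find_orphaned_reviews(reviews, product_ids):
    """Return the _id of every review whose product_id matches no product."""
    known = set(product_ids)
    return [r["_id"] for r in reviews if r["product_id"] not in known]


# Stand-ins for db.reviews.find(...) and db.products.distinct("_id")
reviews = [
    {"_id": "r001", "product_id": "p001"},
    {"_id": "r002", "product_id": "p001"},
    {"_id": "r003", "product_id": "p999"},  # p999 is a made-up, deleted product
]
product_ids = ["p001", "p003"]

orphans = find_orphaned_reviews(reviews, product_ids)
print(orphans)

# Deleting a product together with its reviews could then look like:
#   db.reviews.delete_many({"product_id": "p001"})
#   db.products.delete_one({"_id": "p001"})
```

The two deletes in the closing comment are separate writes to separate collections, so they are only atomic as a pair when wrapped in a multi-document transaction.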

$lookup — Server-Side Join for Referenced Data

$lookup is MongoDB's aggregation stage for joining a local collection with a foreign collection. It is the equivalent of a SQL LEFT OUTER JOIN and is the standard way to retrieve referenced data in one server round trip.

# $lookup — joining referenced collections server-side

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Join reviews with their product and user in one aggregation pipeline
pipeline = [
    # Stage 1: filter to reviews for product p001
    {"$match": {"product_id": "p001"}},

    # Stage 2: join with products collection
    {"$lookup": {
        "from":         "products",     # foreign collection
        "localField":   "product_id",   # field in reviews
        "foreignField": "_id",          # field in products
        "as":           "product_info"  # output array field name
    }},

    # Stage 3: join with users collection
    {"$lookup": {
        "from":         "users",
        "localField":   "user_id",
        "foreignField": "_id",
        "as":           "user_info"
    }},

    # Stage 4: flatten the single-element lookup arrays
    {"$unwind": "$product_info"},
    {"$unwind": "$user_info"},

    # Stage 5: project only the fields we need
    {"$project": {
        "rating":       1,
        "comment":      1,
        "product_name": "$product_info.name",
        "reviewer":     "$user_info.name",
        "_id":          0
    }}
]

results = list(db.reviews.aggregate(pipeline))
print(f"Reviews for p001 with product and user names ($lookup):\n")
for r in results:
    print(f"  [{r['rating']}★] {r['reviewer']:15} — \"{r['comment']}\"")
    print(f"       Product: {r['product_name']}")

Reviews for p001 with product and user names ($lookup):

  [5★] Alice Johnson   — "Fast and responsive, great for gaming."
       Product: Wireless Mouse
  [4★] David Lee       — "Good value for money, works well."
       Product: Wireless Mouse

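$lookup produces an array field. Use $unwind to flatten it when you expect exactly one match per document, and note that a plain $unwind drops any document whose lookup array is empty; set preserveNullAndEmptyArrays: true to keep the left-outer-join behaviour
$lookup is a server-side operation, so it avoids the extra round trips and data transfer of fetching documents and joining them in Python application code
Index the foreign collection's join field. _id is indexed automatically; any other foreignField without an index forces a scan of the foreign collection for each input document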
Frequent $lookup usage is a signal that your schema may benefit from partial denormalisation — embedding the most-used fields from the foreign document to avoid the join
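
The pipeline shown earlier uses the short form of $unwind, which silently discards reviews whose user no longer exists. As a hedged sketch, the helper below builds a variant that uses the document form of $unwind so the join keeps its left-outer behaviour. The function name `build_review_pipeline` and its `keep_unmatched` flag are hypothetical; the stage operators and their options are standard aggregation syntax.

```python
def build_review_pipeline(product_id, keep_unmatched=True):
    """Build a reviews pipeline that joins users and flattens the result."""
    return [
        {"$match": {"product_id": product_id}},
        {"$lookup": {
            "from": "users",
            "localField": "user_id",
            "foreignField": "_id",
            "as": "user_info",
        }},
        # Document form of $unwind: optionally keep reviews with no matching user
        {"$unwind": {
            "path": "$user_info",
            "preserveNullAndEmptyArrays": keep_unmatched,
        }},
        {"$project": {
            "_id": 0,
            "rating": 1,
            "comment": 1,
            "reviewer": "$user_info.name",
        }},
    ]


pipeline = build_review_pipeline("p001")
for stage in pipeline:
    print(stage)

# With a live connection: list(db.reviews.aggregate(pipeline))
```

With `keep_unmatched=True`, a review whose user is missing still appears in the results, just without a `reviewer` field, so display code should read it with `r.get("reviewer")` rather than `r["reviewer"]`.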

The Subset Pattern — Partial Embedding

The Subset Pattern is a practical middle ground between full embedding and full referencing. You embed a frequently-needed subset of the related data directly in the parent document — enough to display the common view without a join — while keeping the full data in a separate collection for when the complete picture is needed.

# Subset pattern — embed the hot data, reference the rest

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Scenario: product page needs the 3 most recent reviews inline
# Full review collection may have thousands per product — don't embed all of them
# Instead: embed the top 3 reviews as a "reviews_preview" subset on the product

# What the product document would look like with the subset pattern:
product_with_subset = {
    "_id":      "p001",
    "name":     "Wireless Mouse",
    "price":    29.99,
    "category": "Electronics",
    "rating":   4.5,
    # Hot data — embedded for instant page load without $lookup
    "reviews_preview": [
        {"user_id": "u001", "reviewer": "Alice Johnson", "rating": 5,
         "comment": "Fast and responsive, great for gaming.", "date": "2024-01-15"},
        {"user_id": "u004", "reviewer": "David Lee",     "rating": 4,
         "comment": "Good value for money, works well.",    "date": "2024-02-10"},
    ],
    "review_count": 42,   # computed — updated on each new review
}

# Fast product page load — no join needed for the preview
print("Product page — instant load with subset:")
print(f"  {product_with_subset['name']}  ${product_with_subset['price']}")
print(f"  Rating: {product_with_subset['rating']} ({product_with_subset['review_count']} reviews)")
print(f"  Recent reviews:")
for r in product_with_subset["reviews_preview"]:
    print(f"    [{r['rating']}★] {r['reviewer']} — \"{r['comment']}\"")

# Full reviews page — query the reviews collection with pagination
print("\nFull review page — query reviews collection:")
all_reviews = db.reviews.find(
    {"product_id": "p001"},
    {"user_id": 1, "rating": 1, "comment": 1, "_id": 0}
).sort("_id", 1).limit(10)   # first page only; the page size of 10 is illustrative
for r in all_reviews:
    print(f"  [{r['rating']}★] user: {r['user_id']} — \"{r['comment']}\"")

Product page — instant load with subset:
  Wireless Mouse  $29.99
  Rating: 4.5 (42 reviews)
  Recent reviews:
    [5★] Alice Johnson — "Fast and responsive, great for gaming."
    [4★] David Lee — "Good value for money, works well."

Full review page — query reviews collection:
  [5★] user: u001 — "Fast and responsive, great for gaming."
  [4★] user: u004 — "Good value for money, works well."

The Subset Pattern is a common fit for e-commerce product cards and similar listing pages, where the card embeds just enough review data for the listing view
The embedded subset becomes stale when the underlying data changes. Update it alongside every write to the canonical data in the reviews collection, and wrap both writes in a multi-document transaction if they must succeed or fail together
Keep the subset small and bounded — typically the top 3–5 most recent or most relevant items
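
One way to keep the preview small and bounded is to let the update itself trim the array. The sketch below builds such an update with the `$push` modifiers `$each`, `$sort`, and `$slice`, and pairs it with a pure-Python imitation so the effect can be checked without a server. The helper names, the preview size of 3, and the two extra reviews are assumptions made for this example.

```python
def build_preview_update(review, preview_size=3):
    """Return an update that adds `review` to a bounded reviews_preview array."""
    return {
        "$push": {
            "reviews_preview": {
                "$each": [review],        # the new preview entry
                "$sort": {"date": -1},    # newest first
                "$slice": preview_size,   # keep only the first N after sorting
            }
        },
        "$inc": {"review_count": 1},
    }


def apply_locally(preview, review, preview_size=3):
    """Pure-Python imitation of the $push modifiers above, for illustration."""
    merged = sorted(preview + [review], key=lambda r: r["date"], reverse=True)
    return merged[:preview_size]


preview = [
    {"reviewer": "Alice Johnson", "rating": 5, "date": "2024-01-15"},
    {"reviewer": "David Lee", "rating": 4, "date": "2024-02-10"},
]
# Hypothetical new reviews, invented for this sketch
for new_review in (
    {"reviewer": "Reviewer A", "rating": 3, "date": "2024-03-01"},
    {"reviewer": "Reviewer B", "rating": 5, "date": "2024-03-05"},
):
    preview = apply_locally(preview, new_review)

print([r["date"] for r in preview])
print(build_preview_update(new_review))

# With a live connection:
#   db.products.update_one({"_id": "p001"}, build_preview_update(new_review))
```

The update to the product and the insert into the reviews collection remain two separate writes, so the preview can lag briefly unless both run inside a transaction.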

The Decision Framework

Five questions determine whether to embed or reference. Work through them in order; the first question that gives a definitive answer stops the process.

# Decision framework — embed or reference?

decision_tree = [
    {
        "question": "Can the child data grow without bound?",
        "if_yes":   "REFERENCE — unbounded arrays hit the 16 MB limit",
        "if_no":    "Continue to next question",
        "example":  "Orders per user → YES → reference",
    },
    {
        "question": "Is the child data queried independently of the parent?",
        "if_yes":   "REFERENCE — child needs its own collection for independent queries",
        "if_no":    "Continue to next question",
        "example":  "Reviews queried by user_id across all products → YES → reference",
    },
    {
        "question": "Is the child data always retrieved with the parent?",
        "if_yes":   "EMBED — co-retrieval means embedding eliminates round trips",
        "if_no":    "Continue to next question",
        "example":  "Order items always fetched with order → YES → embed",
    },
    {
        "question": "Is the data written atomically with the parent?",
        "if_yes":   "EMBED — single-document atomicity is free",
        "if_no":    "Continue to next question",
        "example":  "Order + items written in one operation → YES → embed",
    },
    {
        "question": "Does the child data change frequently and independently?",
        "if_yes":   "REFERENCE — frequent updates to embedded data cause document churn",
        "if_no":    "EMBED — stable data is safe to embed",
        "example":  "User profile updated independently of that user's orders → YES → reference it via user_id",
    },
]

print("Embed vs Reference decision framework:\n")
for i, step in enumerate(decision_tree, 1):
    print(f"  Step {i}: {step['question']}")
    print(f"    YES → {step['if_yes']}")
    print(f"    NO  → {step['if_no']}")
    print(f"    e.g. {step['example']}")
    print()

Embed vs Reference decision framework:

  Step 1: Can the child data grow without bound?
    YES → REFERENCE — unbounded arrays hit the 16 MB limit
    NO  → Continue to next question
    e.g. Orders per user → YES → reference

  Step 2: Is the child data queried independently of the parent?
    YES → REFERENCE — child needs its own collection for independent queries
    NO  → Continue to next question
    e.g. Reviews queried by user_id across all products → YES → reference

  Step 3: Is the child data always retrieved with the parent?
    YES → EMBED — co-retrieval means embedding eliminates round trips
    NO  → Continue to next question
    e.g. Order items always fetched with order → YES → embed

  Step 4: Is the data written atomically with the parent?
    YES → EMBED — single-document atomicity is free
    NO  → Continue to next question
    e.g. Order + items written in one operation → YES → embed

  Step 5: Does the child data change frequently and independently?
    YES → REFERENCE — frequent updates to embedded data cause document churn
    NO  → EMBED — stable data is safe to embed
    e.g. User profile updated independently of that user's orders → YES → reference it via user_id

Work through the questions in order — stop at the first definitive answer
Most schemas use a mix of both patterns — embedding for some relationships and referencing for others is not only normal but expected
Document churn occurs when updating an embedded sub-document requires rewriting the entire parent document — on high-write collections this degrades performance
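
The ordered questions above can be sketched as a small function that returns at the first definitive answer. This is only an illustration of the framework's control flow; the function name `embed_or_reference` and its boolean parameters are hypothetical, and real decisions usually involve more judgement than five flags.

```python
def embed_or_reference(unbounded, queried_independently,
                       always_read_with_parent, written_with_parent,
                       changes_independently):
    """Walk the five questions in order and stop at the first definitive answer."""
    if unbounded:                 # Step 1
        return "reference"
    if queried_independently:     # Step 2
        return "reference"
    if always_read_with_parent:   # Step 3
        return "embed"
    if written_with_parent:       # Step 4
        return "embed"
    # Step 5 always decides: frequent independent change favours referencing
    return "reference" if changes_independently else "embed"


# Order items: bounded, never queried alone, always read with the order
order_items = embed_or_reference(False, False, True, True, False)

# Reviews: can grow without bound, so step 1 already decides
reviews = embed_or_reference(True, True, False, False, True)

print("order items:", order_items)
print("reviews:    ", reviews)
```

Because the checks run in a fixed order, an unbounded relationship is referenced even if it is also always read with its parent, which mirrors the "stop at the first definitive answer" rule.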

Trade-Off Matrix

# Embed vs Reference trade-off comparison

tradeoffs = {
    "Dimension": [
        "Read performance",
        "Write performance (single parent)",
        "Write performance (shared child data)",
        "Atomicity",
        "Data duplication",
        "Max data size",
        "Independent queries on child",
        "Schema flexibility",
        "Referential integrity",
    ],
    "Embed": [
        "Excellent — one document read",
        "Good — one document write",
        "Poor — must update every parent copy",
        "Free — single-document atomic",
        "High if child shared across many parents",
        "Limited to 16 MB per document",
        "Only via array scan inside document",
        "Lower — parent shape changes with child",
        "Automatic — child is part of parent",
    ],
    "Reference": [
        "Moderate — requires $lookup or two queries",
        "Moderate — may need two writes",
        "Excellent — update child once",
        "Requires multi-doc transaction for atomicity",
        "None — single source of truth",
        "Unlimited — child is its own document",
        "Full — index and query child collection freely",
        "Higher — child schema evolves independently",
        "Manual — application must maintain consistency",
    ],
}

header = f"  {'Dimension':45} {'Embed':45} {'Reference'}"
print(header)
print("  " + "─" * 130)
for i, dim in enumerate(tradeoffs["Dimension"]):
    embed = tradeoffs["Embed"][i]
    ref   = tradeoffs["Reference"][i]
    print(f"  {dim:45} {embed:45} {ref}")

  Dimension                                     Embed                                         Reference
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Read performance                              Excellent — one document read                 Moderate — requires $lookup or two queries
  Write performance (single parent)             Good — one document write                     Moderate — may need two writes
  Write performance (shared child data)         Poor — must update every parent copy          Excellent — update child once
  Atomicity                                     Free — single-document atomic                 Requires multi-doc transaction for atomicity
  Data duplication                              High if child shared across many parents      None — single source of truth
  Max data size                                 Limited to 16 MB per document                 Unlimited — child is its own document
  Independent queries on child                  Only via array scan inside document           Full — index and query child collection freely
  Schema flexibility                            Lower — parent shape changes with child       Higher — child schema evolves independently
  Referential integrity                         Automatic — child is part of parent           Manual — application must maintain consistency

The "shared child data" row is the most important embedding anti-pattern — if the same child data (e.g. a product name) is embedded in thousands of parent documents, updating it requires updating every one of those documents
The atomicity row shows the key advantage of embedding — referencing requires a multi-document transaction if you need atomic writes across two collections
In practice, read performance usually wins — most applications read far more than they write, so embedding's read advantage outweighs referencing's write advantage
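
To make the "shared child data" row concrete, the sketch below builds the arguments for both versions of a product rename. It assumes a hypothetical `name` field copied into each order line item (the Dataplexa orders shown earlier store only `product_id`, `qty`, and `price`), and the helper names and the new product name are invented for the example.

```python
def build_rename_in_orders(product_id, new_name):
    """Return (filter, update, array_filters) for renaming an embedded copy.

    Assumes a hypothetical `name` field duplicated into every order line item.
    """
    query = {"items.product_id": product_id}
    # $[line] targets only the array elements matched by array_filters
    update = {"$set": {"items.$[line].name": new_name}}
    array_filters = [{"line.product_id": product_id}]
    return query, update, array_filters


def build_rename_in_products(product_id, new_name):
    """Return (filter, update) for the referenced design: one document changes."""
    return {"_id": product_id}, {"$set": {"name": new_name}}


embedded_args = build_rename_in_orders("p001", "Wireless Mouse Pro")
referenced_args = build_rename_in_products("p001", "Wireless Mouse Pro")
print(embedded_args)
print(referenced_args)

# With a live connection:
#   q, u, af = embedded_args
#   db.orders.update_many(q, u, array_filters=af)   # touches every matching order
#   db.products.update_one(*referenced_args)        # touches one document
```

Both calls are a single statement in application code, but the embedded version rewrites every order that contains the product, while the referenced version rewrites exactly one document.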

Summary Table

Signal	Embed	Reference
Array growth	Bounded (< 100 items)	Unbounded or unknown
Query pattern	Always with parent	Often independently
Write pattern	Written with parent atomically	Written independently
Data sharing	Unique to this parent	Shared across many parents
Update frequency	Rarely changes	Changes frequently
Dataplexa example	Order items in order doc	Reviews in own collection

Practice Questions

Practice 1. A social network wants to store the list of a user's friends. Each user has between 0 and 5,000 friends. Should friends be embedded or referenced — and why?

Practice 2. A blog post has exactly one author. The author's name is displayed on every post page. Should the author data be embedded or referenced?

Practice 3. What is the "document churn" problem and which pattern causes it?

Practice 4. What is the Subset Pattern and when should you use it?

Practice 5. MongoDB does not enforce referential integrity. What practical consequence does this have for referenced data?

Quiz

Quiz 1. Which of the following is the strongest signal to reference rather than embed?

The child data can grow without bound — embedding unbounded arrays risks exceeding the 16 MB document limit
The child data is always retrieved with the parent
The child data is written atomically with the parent
The child data rarely changes

Quiz 2. What does $unwind do after a $lookup stage?

It flattens the array produced by $lookup into individual documents — one output document per matched element
It removes the joined collection from the pipeline
It sorts the lookup results by _id
It deduplicates documents returned by the join

Quiz 3. A product's name is embedded in 50,000 order documents. The product is renamed. What is the embedding cost?

50,000 order documents must each be updated — an expensive bulk update. Referencing would require updating the product document once.
MongoDB automatically propagates the name change to all embedded copies
Only the product document needs updating — embedded copies are references
The change requires dropping and recreating the orders collection

Quiz 4. When does embedding provide free atomicity that referencing cannot match without a transaction?

When both the parent and child data need to be written in one operation — a single-document write in MongoDB is always atomic, but writing to two separate collections requires a multi-document transaction
When using write concern w='majority'
Only when using Atlas and not a local server
Embedding never provides atomicity guarantees

Quiz 5. What is the key maintenance cost of using the Subset Pattern?

The embedded subset can become stale: whenever the canonical data in the referenced collection changes, the embedded copy must be updated as well, either in a second write or inside the same multi-document transaction
The subset requires a special index type to query efficiently
Subset documents cannot be larger than 1 KB
MongoDB charges extra storage for duplicated subset data

Next up — Schema Design Best Practices: Validation rules, schema versioning, anti-patterns to avoid, and evolving your schema safely in production.
