MongoDB
Embedded vs Referenced
The single most consequential decision in any MongoDB schema is whether to embed related data inside a document or store it in a separate collection and link it with a reference. Embed and your reads are fast, atomic, and require no joins — but updates to shared data become expensive. Reference and your data stays normalised and independently queryable — but every read that needs both sides requires either a $lookup aggregation or two separate queries. There is no universally correct answer. This lesson gives you the decision framework, the trade-off matrix, and the patterns you need to make the right call every time, illustrated throughout with the Dataplexa Store dataset.
Embedding — Data Inside the Document
Embedding places related data directly inside the parent document as a sub-document or an array of sub-documents. The entire entity — parent and children — is stored and retrieved as one unit. This is MongoDB's default recommendation and the approach that gives you the most benefit from the document model.
# Embedding — related data lives inside the parent document
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Example: order with embedded line items (already our Dataplexa model)
# The items array is embedded — retrieved in the same query as the order
order = db.orders.find_one(
{"_id": "o001"},
)
print("Order with embedded items — single query:")
print(f" order id: {order['_id']}")
print(f" user: {order['user_id']}")
print(f" status: {order['status']}")
print(f" total: ${order['total']:.2f}")
print(f" items:")
for item in order["items"]:
    print(f" product: {item['product_id']} qty: {item['qty']} price: ${item['price']:.2f}")
# Embedding is ideal here because:
# 1. Items are ALWAYS needed with the order — never queried independently
# 2. The item count per order is BOUNDED (realistic max ~100 items)
# 3. Writing the order and items is one ATOMIC operation — no partial writes
# 4. Zero network round trips — one query returns everything
print("\nBenefits of embedding for order items:")
benefits = [
"One query returns order + all items — zero round trips",
"Atomic write — order and items succeed or fail together",
"No $lookup required — simpler query code",
"Items are always consistent with their parent order",
]
for b in benefits:
    print(f" ✓ {b}")

order id: o001
user: u001
status: delivered
total: $44.96
items:
product: p001 qty: 1 price: $29.99
product: p003 qty: 3 price: $4.99
Benefits of embedding for order items:
✓ One query returns order + all items — zero round trips
✓ Atomic write — order and items succeed or fail together
✓ No $lookup required — simpler query code
✓ Items are always consistent with their parent order
- Embedding is the default choice — start by embedding and only switch to referencing when a specific problem forces you to
- Single-document atomicity is free with embedding — MongoDB guarantees that writes to one document are atomic without requiring a multi-document transaction
- Reads are faster with embedding because MongoDB reads one document from disk instead of performing an index lookup followed by a second read on another collection
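The atomicity point can be illustrated with a small sketch. It assumes the order shape shown above (an items array plus a total field); the helper name build_add_item_update is hypothetical and only builds the update document, so it runs without a database connection.

```python
def build_add_item_update(product_id, qty, price):
    """Return an update document that appends one line item and adjusts the total.

    Both operators target the same order document, so MongoDB applies them
    together or not at all when they are sent in one update.
    """
    line_total = round(qty * price, 2)
    return {
        "$push": {"items": {"product_id": product_id, "qty": qty, "price": price}},
        "$inc": {"total": line_total},
    }


update = build_add_item_update("p002", 2, 9.99)
print(update)

# With a live connection this would be sent as one single-document write:
# db.orders.update_one({"_id": "o001"}, update)
```

Because the new item and the new total travel in one update to one document, no reader should see the item without the matching total. No multi-document transaction is needed for that guarantee.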
Referencing — Links Between Collections
Referencing stores the related data in a separate collection and links the two documents with a shared identifier — typically the _id of one document stored as a field in the other. The application then issues two queries or uses a $lookup aggregation stage to join the data at query time.
# Referencing — linked documents in separate collections
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Example: reviews referencing both user and product
# The review stores user_id and product_id as foreign-key-style references
review = db.reviews.find_one({"_id": "r001"})
print("Review document (references only — no embedded data):")
print(f" review id: {review['_id']}")
print(f" product_id: {review['product_id']} ← reference to products collection")
print(f" user_id: {review['user_id']} ← reference to users collection")
print(f" rating: {review['rating']}")
print(f" comment: {review['comment']}")
# To display the full review with product name and user name
# requires two additional queries (or a $lookup aggregation)
product = db.products.find_one({"_id": review["product_id"]}, {"name": 1, "_id": 0})
user = db.users.find_one({"_id": review["user_id"]}, {"name": 1, "_id": 0})
print(f"\nFull review (after two additional queries):")
print(f" User: {user['name']}")
print(f" Product: {product['name']}")
print(f" Rating: {review['rating']} — \"{review['comment']}\"")
# Why reviews are referenced — not embedded in the product:
print("\nWhy reviews are referenced (not embedded in product):")
reasons = [
"Unbounded growth — a popular product can have 100,000+ reviews",
"Queried independently — 'all reviews by user u001' spans products",
"Embedding would bloat product documents beyond useful size",
"Reviews can be paginated, sorted, filtered as a standalone collection",
]
for r in reasons:
    print(f" ✓ {r}")

review id: r001
product_id: p001 ← reference to products collection
user_id: u001 ← reference to users collection
rating: 5
comment: Fast and responsive, great for gaming.
Full review (after two additional queries):
User: Alice Johnson
Product: Wireless Mouse
Rating: 5 — "Fast and responsive, great for gaming."
Why reviews are referenced (not embedded in product):
✓ Unbounded growth — a popular product can have 100,000+ reviews
✓ Queried independently — 'all reviews by user u001' spans products
✓ Embedding would bloat product documents beyond useful size
✓ Reviews can be paginated, sorted, filtered as a standalone collection
- Referencing is the right default when child data can grow without bound or when children are frequently queried independently of their parent
- MongoDB does not enforce referential integrity — if you delete a product, its orphaned reviews remain. The application must manage consistency
- Use $lookup in an aggregation pipeline to join referenced collections server-side, avoiding multiple round trips
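Since nothing stops a review from pointing at a product that no longer exists, an application may want a periodic check for orphans. The sketch below builds one possible aggregation pipeline for that check; the function name orphaned_reviews_pipeline is an assumption, and the collection and field names follow the Dataplexa examples above.

```python
def orphaned_reviews_pipeline():
    """Build a pipeline that keeps only reviews whose product no longer exists."""
    return [
        # Attach the matching product (if any) as an array
        {"$lookup": {
            "from": "products",
            "localField": "product_id",
            "foreignField": "_id",
            "as": "product_info",
        }},
        # An empty array means the referenced product was not found
        {"$match": {"product_info": {"$size": 0}}},
        # Return just enough to identify and clean up the orphan
        {"$project": {"_id": 1, "product_id": 1}},
    ]


pipeline = orphaned_reviews_pipeline()
for stage in pipeline:
    print(stage)

# With a live connection:
# orphans = list(db.reviews.aggregate(pipeline))
```

What to do with the orphans (delete them, archive them, or flag them) is an application decision; MongoDB will not make it for you.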
$lookup — Server-Side Join for Referenced Data
$lookup is MongoDB's aggregation stage for joining a local collection with a foreign collection. It is the equivalent of a SQL LEFT OUTER JOIN and is the standard way to retrieve referenced data in one server round trip.
# $lookup — joining referenced collections server-side
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Join reviews with their product and user in one aggregation pipeline
pipeline = [
# Stage 1: filter to reviews for product p001
{"$match": {"product_id": "p001"}},
# Stage 2: join with products collection
{"$lookup": {
"from": "products", # foreign collection
"localField": "product_id", # field in reviews
"foreignField": "_id", # field in products
"as": "product_info" # output array field name
}},
# Stage 3: join with users collection
{"$lookup": {
"from": "users",
"localField": "user_id",
"foreignField": "_id",
"as": "user_info"
}},
# Stage 4: flatten the single-element lookup arrays
{"$unwind": "$product_info"},
{"$unwind": "$user_info"},
# Stage 5: project only the fields we need
{"$project": {
"rating": 1,
"comment": 1,
"product_name": "$product_info.name",
"reviewer": "$user_info.name",
"_id": 0
}}
]
results = list(db.reviews.aggregate(pipeline))
print(f"Reviews for p001 with product and user names ($lookup):\n")
for r in results:
    print(f" [{r['rating']}★] {r['reviewer']:15} — \"{r['comment']}\"")
    print(f" Product: {r['product_name']}")

[5★] Alice Johnson — "Fast and responsive, great for gaming."
Product: Wireless Mouse
[4★] David Lee — "Good value for money, works well."
Product: Wireless Mouse
- $lookup produces an array field; use $unwind to flatten it when you expect exactly one match per document
- $lookup is a server-side operation, so it is usually more efficient than fetching documents and joining them in Python application code
- Index the foreign collection's join field (_id in most cases); without an index, each lookup performs a full collection scan
- Frequent $lookup usage is a signal that your schema may benefit from partial denormalisation: embedding the most-used fields from the foreign document to avoid the join
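One subtlety of pairing $lookup with $unwind: by default $unwind drops any document whose array is empty, which quietly turns the left outer join into an inner join. The pure-Python emulation below is only an illustration of that behaviour, not how the server implements it. The functions lookup and unwind, and the review r002 pointing at a deleted product p999, are made up for the example.

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Emulate $lookup: attach an array of matching foreign documents."""
    joined = []
    for doc in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})
    return joined


def unwind(docs, field, preserve_empty=False):
    """Emulate $unwind: emit one document per array element."""
    out = []
    for doc in docs:
        values = doc.get(field) or []
        if not values:
            if preserve_empty:
                # Keep the document but omit the empty array field
                out.append({k: v for k, v in doc.items() if k != field})
            continue
        for value in values:
            out.append({**doc, field: value})
    return out


products = [{"_id": "p001", "name": "Wireless Mouse"}]
reviews = [
    {"_id": "r001", "product_id": "p001", "rating": 5},
    {"_id": "r002", "product_id": "p999", "rating": 3},  # hypothetical orphan
]

joined = lookup(reviews, products, "product_id", "_id", "product_info")
inner = unwind(joined, "product_info")
outer = unwind(joined, "product_info", preserve_empty=True)
print(len(inner), len(outer))
```

In a real pipeline the equivalent of preserve_empty=True is the document form of the stage: {"$unwind": {"path": "$product_info", "preserveNullAndEmptyArrays": True}}.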
The Subset Pattern — Partial Embedding
The Subset Pattern is a practical middle ground between full embedding and full referencing. You embed a frequently-needed subset of the related data directly in the parent document — enough to display the common view without a join — while keeping the full data in a separate collection for when the complete picture is needed.
# Subset pattern — embed the hot data, reference the rest
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Scenario: product page needs the 3 most recent reviews inline
# Full review collection may have thousands per product — don't embed all of them
# Instead: embed the top 3 reviews as a "reviews_preview" subset on the product
# What the product document would look like with the subset pattern:
product_with_subset = {
"_id": "p001",
"name": "Wireless Mouse",
"price": 29.99,
"category": "Electronics",
"rating": 4.5,
# Hot data — embedded for instant page load without $lookup
"reviews_preview": [
{"user_id": "u001", "reviewer": "Alice Johnson", "rating": 5,
"comment": "Fast and responsive, great for gaming.", "date": "2024-01-15"},
{"user_id": "u004", "reviewer": "David Lee", "rating": 4,
"comment": "Good value for money, works well.", "date": "2024-02-10"},
],
"review_count": 42, # computed — updated on each new review
}
# Fast product page load — no join needed for the preview
print("Product page — instant load with subset:")
print(f" {product_with_subset['name']} ${product_with_subset['price']}")
print(f" Rating: {product_with_subset['rating']} ({product_with_subset['review_count']} reviews)")
print(f" Recent reviews:")
for r in product_with_subset["reviews_preview"]:
    print(f" [{r['rating']}★] {r['reviewer']} — \"{r['comment']}\"")
# Full reviews page: query the reviews collection (add sort, skip and limit to paginate)
print("\nFull review page — query reviews collection:")
all_reviews = db.reviews.find(
{"product_id": "p001"},
{"user_id": 1, "rating": 1, "comment": 1, "_id": 0}
)
for r in all_reviews:
    print(f" [{r['rating']}★] user: {r['user_id']} — \"{r['comment']}\"")

Wireless Mouse $29.99
Rating: 4.5 (42 reviews)
Recent reviews:
[5★] Alice Johnson — "Fast and responsive, great for gaming."
[4★] David Lee — "Good value for money, works well."
Full review page — query reviews collection:
[5★] user: u001 — "Fast and responsive, great for gaming."
[4★] user: u004 — "Good value for money, works well."
- The Subset Pattern suits e-commerce and catalogue-style pages, where the product card embeds just enough review data for the listing page
- The embedded subset becomes stale when the underlying data changes. Because the preview and the canonical review live in different collections, keeping them in step takes a second write, or a multi-document transaction if the two must change atomically
- Keep the subset small and bounded — typically the top 3–5 most recent or most relevant items
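Keeping the preview small and current can be done in a single update to the product document. The sketch below builds such an update using the $push modifiers $each, $sort and $slice; the helper name build_preview_update and the choice of three entries are assumptions, and the field names follow the reviews_preview example above.

```python
def build_preview_update(review, keep=3):
    """Build an update that adds a review to the preview and trims it to `keep` entries."""
    entry = {k: review[k] for k in ("user_id", "reviewer", "rating", "comment", "date")}
    return {
        "$push": {
            "reviews_preview": {
                "$each": [entry],       # add the new review
                "$sort": {"date": -1},  # newest first
                "$slice": keep,         # keep only the first `keep` entries
            }
        },
        "$inc": {"review_count": 1},
    }


new_review = {
    "user_id": "u007",            # hypothetical reviewer
    "reviewer": "Sample User",
    "rating": 4,
    "comment": "Does the job.",
    "date": "2024-03-01",
}
update = build_preview_update(new_review)
print(update)

# With a live connection, after inserting the review into db.reviews:
# db.products.update_one({"_id": "p001"}, update)
```

This covers only the product side. The insert into the reviews collection is a separate write, and the sketch does not recompute the average rating.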
The Decision Framework
Five questions determine whether to embed or reference. Work through them in order; the first question that gives a definitive answer stops the process.
# Decision framework — embed or reference?
decision_tree = [
{
"question": "Can the child data grow without bound?",
"if_yes": "REFERENCE — unbounded arrays hit the 16 MB limit",
"if_no": "Continue to next question",
"example": "Orders per user → YES → reference",
},
{
"question": "Is the child data queried independently of the parent?",
"if_yes": "REFERENCE — child needs its own collection for independent queries",
"if_no": "Continue to next question",
"example": "Reviews queried by user_id across all products → YES → reference",
},
{
"question": "Is the child data always retrieved with the parent?",
"if_yes": "EMBED — co-retrieval means embedding eliminates round trips",
"if_no": "Continue to next question",
"example": "Order items always fetched with order → YES → embed",
},
{
"question": "Is the data written atomically with the parent?",
"if_yes": "EMBED — single-document atomicity is free",
"if_no": "Continue to next question",
"example": "Order + items written in one operation → YES → embed",
},
{
"question": "Does the child data change frequently and independently?",
"if_yes": "REFERENCE — frequent updates to embedded data cause document churn",
"if_no": "EMBED — stable data is safe to embed",
"example": "User profile fields updated independently → YES → reference each field separately",
},
]
print("Embed vs Reference decision framework:\n")
for i, step in enumerate(decision_tree, 1):
    print(f" Step {i}: {step['question']}")
    print(f" YES → {step['if_yes']}")
    print(f" NO → {step['if_no']}")
    print(f" e.g. {step['example']}")
    print()

Step 1: Can the child data grow without bound?
YES → REFERENCE — unbounded arrays hit the 16 MB limit
NO → Continue to next question
e.g. Orders per user → YES → reference
Step 2: Is the child data queried independently of the parent?
YES → REFERENCE — child needs its own collection for independent queries
NO → Continue to next question
e.g. Reviews queried by user_id across all products → YES → reference
Step 3: Is the child data always retrieved with the parent?
YES → EMBED — co-retrieval means embedding eliminates round trips
NO → Continue to next question
e.g. Order items always fetched with order → YES → embed
Step 4: Is the data written atomically with the parent?
YES → EMBED — single-document atomicity is free
NO → Continue to next question
e.g. Order + items written in one operation → YES → embed
Step 5: Does the child data change frequently and independently?
YES → REFERENCE — frequent updates to embedded data cause document churn
NO → EMBED — stable data is safe to embed
- Work through the questions in order — stop at the first definitive answer
- Most schemas use a mix of both patterns — embedding for some relationships and referencing for others is not only normal but expected
- Document churn occurs when updating an embedded sub-document requires rewriting the entire parent document — on high-write collections this degrades performance
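The ordered questions translate directly into a short function. This is a sketch of the framework as described in this lesson, with a hypothetical function name and boolean parameters standing in for the answers; real schema decisions usually need more nuance than five flags.

```python
def embed_or_reference(unbounded, queried_independently, always_with_parent,
                       written_atomically, changes_independently):
    """Walk the five questions in order and stop at the first definitive answer."""
    if unbounded:
        return "reference"   # Step 1: unbounded growth risks the 16 MB limit
    if queried_independently:
        return "reference"   # Step 2: child needs its own collection
    if always_with_parent:
        return "embed"       # Step 3: co-retrieval favours embedding
    if written_atomically:
        return "embed"       # Step 4: single-document atomicity
    # Step 5: frequent independent changes favour referencing
    return "reference" if changes_independently else "embed"


# Order items: bounded, never queried alone, always fetched with the order
order_items = embed_or_reference(False, False, True, True, False)
# Reviews: unbounded and queried by user across products
reviews = embed_or_reference(True, True, False, False, True)
print(order_items, reviews)
```

The order of the checks matters: an unbounded child is referenced even if it is always read with its parent, because the first question already settles the matter.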
Trade-Off Matrix
# Embed vs Reference trade-off comparison
tradeoffs = {
"Dimension": [
"Read performance",
"Write performance (single parent)",
"Write performance (shared child data)",
"Atomicity",
"Data duplication",
"Max data size",
"Independent queries on child",
"Schema flexibility",
"Referential integrity",
],
"Embed": [
"Excellent — one document read",
"Good — one document write",
"Poor — must update every parent containing the copy",
"Free — single-document atomic",
"High if child shared across many parents",
"Limited to 16 MB per document",
"Only via array scan inside document",
"Lower — parent shape changes with child",
"Automatic — child is part of parent",
],
"Reference": [
"Moderate — requires $lookup or two queries",
"Moderate — may need two writes",
"Excellent — update child once",
"Requires multi-doc transaction for atomicity",
"None — single source of truth",
"Unlimited — child is its own document",
"Full — index and query child collection freely",
"Higher — child schema evolves independently",
"Manual — application must maintain consistency",
],
}
header = f" {'Dimension':45} {'Embed':45} {'Reference'}"
print(header)
print(" " + "─" * 130)
for i, dim in enumerate(tradeoffs["Dimension"]):
    embed = tradeoffs["Embed"][i]
    ref = tradeoffs["Reference"][i]
    print(f" {dim:45} {embed:45} {ref}")

| Dimension | Embed | Reference |
|---|---|---|
| Read performance | Excellent — one document read | Moderate — requires $lookup or two queries |
| Write performance (single parent) | Good — one document write | Moderate — may need two writes |
| Write performance (shared child data) | Poor — must update every parent copy | Excellent — update child once |
| Atomicity | Free — single-document atomic | Requires multi-doc transaction for atomicity |
| Data duplication | High if child shared across many parents | None — single source of truth |
| Max data size | Limited to 16 MB per document | Unlimited — child is its own document |
| Independent queries on child | Only via array scan inside document | Full — index and query child collection freely |
| Schema flexibility | Lower — parent shape changes with child | Higher — child schema evolves independently |
| Referential integrity | Automatic — child is part of parent | Manual — application must maintain consistency |
- The "shared child data" row is the most important embedding anti-pattern — if the same child data (e.g. a product name) is embedded in thousands of parent documents, updating it requires updating every one of those documents
- The atomicity row shows the key advantage of embedding — referencing requires a multi-document transaction if you need atomic writes across two collections
- In read-heavy applications, embedding's read advantage usually outweighs referencing's write advantage; write-heavy workloads on shared data tip the balance the other way
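To make the "shared child data" cost concrete, here is a sketch of what a product rename would involve if each order's line items carried an embedded copy of the product name. The product_name field is hypothetical (the Dataplexa items shown earlier store only product_id, qty and price), and build_product_rename is an illustrative helper that only assembles the arguments.

```python
def build_product_rename(product_id, new_name):
    """Assemble the filter, update and array filters for renaming an embedded product copy."""
    # Match every order that contains the product at least once
    query = {"items.product_id": product_id}
    # $[line] updates only the array elements selected by array_filters
    update = {"$set": {"items.$[line].product_name": new_name}}
    array_filters = [{"line.product_id": product_id}]
    return query, update, array_filters


query, update, array_filters = build_product_rename("p001", "Wireless Mouse v2")
print(query)
print(update)
print(array_filters)

# With a live connection this touches every matching order document:
# result = db.orders.update_many(query, update, array_filters=array_filters)
# print(result.modified_count)
```

With referencing, the same rename is a single update to one document in the products collection.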
Summary Table
| Signal | Embed | Reference |
|---|---|---|
| Array growth | Bounded (< 100 items) | Unbounded or unknown |
| Query pattern | Always with parent | Often independently |
| Write pattern | Written with parent atomically | Written independently |
| Data sharing | Unique to this parent | Shared across many parents |
| Update frequency | Rarely changes | Changes frequently |
| Dataplexa example | Order items in order doc | Reviews in own collection |
Practice Questions
Practice 1. A social network wants to store the list of a user's friends. Each user has between 0 and 5,000 friends. Should friends be embedded or referenced — and why?
Practice 2. A blog post has exactly one author. The author's name is displayed on every post page. Should the author data be embedded or referenced?
Practice 3. What is the "document churn" problem and which pattern causes it?
Practice 4. What is the Subset Pattern and when should you use it?
Practice 5. MongoDB does not enforce referential integrity. What practical consequence does this have for referenced data?
Quiz
Quiz 1. Which of the following is the strongest signal to reference rather than embed?
Quiz 2. What does $unwind do after a $lookup stage?
Quiz 3. A product's name is embedded in 50,000 order documents. The product is renamed. What is the embedding cost?
Quiz 4. When does embedding provide free atomicity that referencing cannot match without a transaction?
Quiz 5. What is the key maintenance cost of using the Subset Pattern?
Next up — Schema Design Best Practices: Validation rules, schema versioning, anti-patterns to avoid, and evolving your schema safely in production.