MongoDB
$lookup (Joins)
MongoDB is a document database — related data is often embedded rather than split across tables. But referencing is unavoidable for unbounded relationships, and when you reference you eventually need to join. The $lookup aggregation stage is MongoDB's join mechanism. It performs a left outer join between the current pipeline documents and a foreign collection, attaching matching documents as an array. This lesson covers every form of $lookup: the simple equality join, the expressive pipeline-style join with custom conditions, joining multiple collections in one pipeline, and the self-join pattern. All examples use the Dataplexa Store so you can see exactly what the output looks like at each step.
1. Simple $lookup — Equality Join
The simplest form of $lookup matches a field in the local collection against a field in the foreign collection using a straight equality check. This is identical to a SQL LEFT JOIN … ON localField = foreignField and covers the vast majority of real join requirements.
# Simple $lookup — join orders to users on user_id = _id
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
pipeline = [
# Stage 1: filter to the orders we care about
{"$match": {"status": {"$in": ["delivered", "shipped"]}}},
# Stage 2: simple equality $lookup
# For each order, find all users where users._id == orders.user_id
{"$lookup": {
"from": "users", # foreign collection
"localField": "user_id", # field in the current (orders) documents
"foreignField": "_id", # field in the foreign (users) collection
"as": "user_info" # name of the output array field
}},
# Stage 3: $unwind to flatten the single-element array into a sub-document
{"$unwind": "$user_info"},
# Stage 4: shape the output
{"$project": {
"_id": 0,
"order_id": "$_id",
"status": 1,
"total": 1,
"date": 1,
"customer": "$user_info.name",
"membership": "$user_info.membership",
"country": "$user_info.country"
}},
{"$sort": {"total": -1}}
]
results = list(db.orders.aggregate(pipeline))
print("Delivered/shipped orders with customer details ($lookup):\n")
print(f" {'Order':6} {'Customer':15} {'Tier':8} {'Country':8} {'Status':10} {'Total':>8}")
print(f" {'─'*6} {'─'*15} {'─'*8} {'─'*8} {'─'*10} {'─'*8}")
for r in results:
print(f" {r['order_id']:6} {r['customer']:15} "
f"{r['membership']:8} {r['country']:8} "
f"{r['status']:10} ${r['total']:>7.2f}")Order Customer Tier Country Status Total
────── ─────────────── ──────── ──────── ────────── ────────
o005 Eva Müller premium Germany delivered $ 329.98
o003 Alice Johnson premium UK delivered $ 99.98
o002 Bob Smith basic UK shipped $ 89.99
o001 Alice Johnson premium UK delivered $ 44.96
o007 Bob Smith basic UK delivered $ 11.97
$lookupalways produces an array — even when there is exactly one match. Use$unwindimmediately after to flatten it to a sub-document when you expect one result per document- When no matching document exists in the foreign collection, the output array is empty and the default
$unwinddrops that document from the pipeline — usepreserveNullAndEmptyArrays: Trueto keep it - Always index the foreign collection's join field —
users._idis already indexed automatically, but custom join fields likeorders.user_idmust be indexed manually
2. Pipeline-Style $lookup — Custom Join Conditions
The simple localField/foreignField form only supports equality joins. For more complex conditions — joining on multiple fields, applying filters to the foreign collection before joining, or using expressions in the join condition — MongoDB provides a pipeline-style $lookup using let and pipeline parameters. This is the equivalent of a SQL JOIN with a more complex ON clause.
# Pipeline-style $lookup — custom join conditions with let + pipeline
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Scenario: for each user, find only their DELIVERED orders
# Simple $lookup would return ALL orders for a user — we want only delivered ones
# Pipeline $lookup lets us filter the foreign collection before joining
pipeline = [
# Start from users collection
{"$match": {"membership": "premium"}},
{"$lookup": {
"from": "orders",
# let: expose local (users) fields as variables for use inside the pipeline
"let": {
"uid": "$_id", # $$uid = the current user's _id
},
# pipeline: run this against the foreign (orders) collection
# $$ references variables declared in let
# $ references fields in the foreign collection
"pipeline": [
{"$match": {
"$expr": {
"$and": [
{"$eq": ["$user_id", "$$uid"]}, # join condition
{"$eq": ["$status", "delivered"]} # additional filter
]
}
}},
{"$project": {"_id": 1, "total": 1, "date": 1, "status": 1}},
{"$sort": {"date": -1}}
],
"as": "delivered_orders"
}},
# Keep only users who have at least one delivered order
{"$match": {"delivered_orders.0": {"$exists": True}}},
{"$project": {
"_id": 0,
"customer": "$name",
"membership": 1,
"country": 1,
"delivered_count": {"$size": "$delivered_orders"},
"total_spent": {"$sum": "$delivered_orders.total"},
"orders": "$delivered_orders"
}},
{"$sort": {"total_spent": -1}}
]
results = list(db.users.aggregate(pipeline))
print("Premium users with their delivered orders (pipeline $lookup):\n")
for r in results:
print(f" {r['customer']} ({r['country']}, {r['membership']})")
print(f" Delivered orders: {r['delivered_count']} "
f"Total spent: ${r['total_spent']:.2f}")
for o in r["orders"]:
print(f" {o['_id']} ${o['total']:.2f} {o['date']}")
print()Alice Johnson (UK, premium)
Delivered orders: 2 Total spent: $144.94
o003 $99.98 2024-02-20
o001 $44.96 2024-01-10
Eva Müller (Germany, premium)
Delivered orders: 1 Total spent: $329.98
o005 $329.98 2024-03-10
- Variables declared in
letare accessed inside the sub-pipeline with a double-dollar prefix:$$uid. Single-dollar$user_idrefers to fields in the foreign collection - The sub-pipeline inside
pipelinecan use any aggregation stages —$match,$project,$sort,$limit,$group— giving you full control over what gets joined - Filtering inside the pipeline
$lookupis more efficient than joining everything and filtering afterwards — the foreign collection is pre-filtered before the join result is materialised
3. Joining Multiple Collections in One Pipeline
A real reporting query often needs data from three or more collections. You can chain multiple $lookup stages in a single pipeline — each one joining a new collection to the documents flowing through. The output of each $lookup enriches the document further before passing it to the next stage.
# Multiple $lookup stages — joining orders → users → products in one pipeline
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Goal: for each delivered order, show the customer name AND
# the name of every product in the order
pipeline = [
# Stage 1: filter delivered orders only
{"$match": {"status": "delivered"}},
# Stage 2: join order → user
{"$lookup": {
"from": "users",
"localField": "user_id",
"foreignField": "_id",
"as": "user_info"
}},
{"$unwind": "$user_info"},
# Stage 3: unwind the items array so we can join each item to products
{"$unwind": "$items"},
# Stage 4: join each line item → product
{"$lookup": {
"from": "products",
"localField": "items.product_id",
"foreignField": "_id",
"as": "product_info"
}},
{"$unwind": "$product_info"},
# Stage 5: project the final shape
{"$project": {
"_id": 0,
"order_id": "$_id",
"customer": "$user_info.name",
"product": "$product_info.name",
"category": "$product_info.category",
"qty": "$items.qty",
"line_total": {"$round": [
{"$multiply": ["$items.qty", "$items.price"]}, 2
]},
}},
{"$sort": {"order_id": 1, "product": 1}}
]
results = list(db.orders.aggregate(pipeline))
print("Delivered order line items with customer and product names:\n")
print(f" {'Order':6} {'Customer':15} {'Product':25} {'Cat':12} {'Qty':3} {'Line'}")
print(f" {'─'*6} {'─'*15} {'─'*25} {'─'*12} {'─'*3} {'─'*8}")
for r in results:
print(f" {r['order_id']:6} {r['customer']:15} "
f"{r['product']:25} {r['category']:12} "
f"{r['qty']:3} ${r['line_total']:.2f}")Order Customer Product Cat Qty Line
────── ─────────────── ───────────────────────── ──────────── ─── ────────
o001 Alice Johnson Notebook A5 Stationery 3 $14.97
o001 Alice Johnson Wireless Mouse Electronics 1 $29.99
o003 Alice Johnson USB-C Hub Electronics 1 $49.99
o003 Alice Johnson Wireless Mouse Electronics 2 $59.98
o005 Eva Müller Monitor 27-inch Electronics 1 $299.99
o005 Eva Müller Standing Desk Furniture 1 $349.99
o007 Bob Smith Ballpoint Pens 10-pack Stationery 2 $6.98
o007 Bob Smith Notebook A5 Stationery 1 $4.99
- Each
$lookupstage adds a new array field to the document — after$unwindeach one becomes a flat sub-document you can reference with dot notation - The order of
$lookupstages matters when later joins depend on fields produced by earlier ones — here the items$unwindmust happen before the products$lookup - Multi-join pipelines can become expensive — always place
$matchas early as possible to reduce the document set before the first join, and ensure all join fields are indexed
4. $lookup with $unwind Optimisation
When a $lookup is immediately followed by an $unwind on the joined array, MongoDB's query planner recognises this as a single optimised operation — it streams joined documents one at a time rather than building the entire joined array in memory first. This significantly reduces memory usage for large joins.
# $lookup + $unwind optimisation — and handling missing joins gracefully
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Demonstrate preserveNullAndEmptyArrays — keep orders even if no matching user
# (simulates orphaned documents — good defensive practice)
pipeline = [
{"$lookup": {
"from": "users",
"localField": "user_id",
"foreignField": "_id",
"as": "user_info"
}},
# preserveNullAndEmptyArrays keeps orders with no matching user in the output
{"$unwind": {
"path": "$user_info",
"preserveNullAndEmptyArrays": True
}},
{"$project": {
"_id": 0,
"order_id": "$_id",
"status": 1,
"total": 1,
# $ifNull handles the case where user_info is missing
"customer": {"$ifNull": ["$user_info.name", "Unknown Customer"]},
"tier": {"$ifNull": ["$user_info.membership", "unknown"]},
}},
{"$sort": {"order_id": 1}}
]
results = list(db.orders.aggregate(pipeline))
print("All orders with customer name (preserveNullAndEmptyArrays):\n")
print(f" {'Order':6} {'Status':12} {'Total':>8} {'Customer':15} Tier")
print(f" {'─'*6} {'─'*12} {'─'*8} {'─'*15} {'─'*8}")
for r in results:
print(f" {r['order_id']:6} {r['status']:12} "
f"${r['total']:>7.2f} {r['customer']:15} {r['tier']}")
# Explain the optimisation
print("\nQuery planner note:")
print(" When $lookup is immediately followed by $unwind on the same array,")
print(" MongoDB coalesces them into one streaming operation — lower memory,")
print(" no full array materialisation needed before unwinding.")Order Status Total Customer Tier
────── ──────────── ──────── ─────────────── ────────
o001 delivered $ 44.96 Alice Johnson premium
o002 shipped $ 89.99 Bob Smith basic
o003 delivered $ 99.98 Alice Johnson premium
o004 cancelled $ 0.00 Clara Diaz premium
o005 delivered $ 329.98 Eva Müller premium
o006 processing $ 349.99 Eva Müller premium
o007 delivered $ 11.97 Bob Smith basic
Query planner note:
When $lookup is immediately followed by $unwind on the same array,
MongoDB coalesces them into one streaming operation — lower memory,
no full array materialisation needed before unwinding.
- Place
$unwindimmediately after$lookup— not only is this the correct pattern for flattening, it also triggers MongoDB's internal coalescing optimisation preserveNullAndEmptyArrays: Trueis essential for defensive pipelines — it prevents silently losing documents that have no match in the foreign collection- Use
$ifNullafterpreserveNullAndEmptyArraysto replace missing joined fields with readable fallback values in the output
5. Self-Join — Joining a Collection to Itself
A self-join uses $lookup with from pointing to the same collection as the source. This is useful for hierarchical data — finding related documents within the same collection, such as products in the same category as a given product, or users who joined in the same month as another user.
# Self-join — find other products in the same category using $lookup
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# For each Electronics product, find the other Electronics products
# (self-join on category, excluding the document itself)
pipeline = [
# Only process Electronics products
{"$match": {"category": "Electronics"}},
# Self-join: from the same products collection
{"$lookup": {
"from": "products",
"let": {"current_id": "$_id", "current_cat": "$category"},
"pipeline": [
{"$match": {
"$expr": {
"$and": [
# Same category
{"$eq": ["$category", "$$current_cat"]},
# But NOT the document itself
{"$ne": ["$_id", "$$current_id"]}
]
}
}},
{"$project": {"name": 1, "price": 1, "rating": 1, "_id": 0}},
{"$sort": {"rating": -1}}
],
"as": "related_products"
}},
{"$project": {
"_id": 0,
"product": "$name",
"price": 1,
"related_count": {"$size": "$related_products"},
"top_related": {"$slice": ["$related_products", 2]} # top 2 only
}},
{"$sort": {"price": -1}}
]
results = list(db.products.aggregate(pipeline))
print("Electronics products with related items (self-join):\n")
for r in results:
print(f" {r['product']} (${r['price']:.2f}) — "
f"{r['related_count']} related products")
for rel in r["top_related"]:
print(f" → {rel['name']:25} ${rel['price']:.2f} ★{rel['rating']}")Monitor 27-inch ($299.99) — 3 related products
→ Mechanical Keyboard $89.99 ★4.7
→ USB-C Hub $49.99 ★4.3
Mechanical Keyboard ($89.99) — 3 related products
→ Monitor 27-inch $299.99 ★4.6
→ USB-C Hub $49.99 ★4.3
USB-C Hub ($49.99) — 3 related products
→ Mechanical Keyboard $89.99 ★4.7
→ Monitor 27-inch $299.99 ★4.6
Wireless Mouse ($29.99) — 3 related products
→ Mechanical Keyboard $89.99 ★4.7
→ Monitor 27-inch $299.99 ★4.6
- A self-join requires the pipeline-style
$lookupwithlet— because you need to pass the current document's fields into the sub-pipeline to exclude it from its own results $sliceinside$projecttrims an array to the first N elements — use it to avoid returning the full related products list when you only need a preview- Self-joins are the foundation of "related items", "customers also bought", and "similar profiles" features — patterns that power recommendation sections in real e-commerce applications
Summary Table
| $lookup Form | When to Use | Key Parameters | SQL Equivalent |
|---|---|---|---|
| Simple equality | One field matches one foreign field | from, localField, foreignField, as |
LEFT JOIN ON a.id = b.id |
| Pipeline-style | Multiple conditions, filter foreign docs | from, let, pipeline, as |
LEFT JOIN ON … AND … |
| Multiple joins | Data from 3+ collections in one query | Chain multiple $lookup stages |
Multiple JOIN clauses |
| + $unwind optimisation | Always — reduces memory usage | preserveNullAndEmptyArrays |
Streaming join |
| Self-join | Related documents in same collection | from = same collection, $ne self exclusion |
JOIN table AS t2 ON … |
$$variable in let |
Pass local fields into sub-pipeline | Double-dollar prefix references let variables |
Correlated subquery parameter |
Practice Questions
Practice 1. What is the key difference between a simple $lookup and a pipeline-style $lookup?
Practice 2. In a pipeline-style $lookup, what is the difference between $$uid and $user_id?
Practice 3. What happens by default when a $lookup finds no matching document in the foreign collection, and how do you change this behaviour?
Practice 4. Why must $unwind be applied before a second $lookup when joining through an embedded array field like items.product_id?
Practice 5. What performance optimisation does MongoDB apply when $lookup is immediately followed by $unwind on the same array field?
Quiz
Quiz 1. What type of SQL join does $lookup perform by default?
Quiz 2. In a pipeline-style $lookup, where do you declare the local collection's fields to make them accessible inside the sub-pipeline?
Quiz 3. Which $project operator trims an array to its first N elements — useful after $lookup to limit a related items list?
Quiz 4. Why does a self-join require the pipeline-style $lookup rather than the simple form?
Quiz 5. What is the correct way to check whether a $lookup returned at least one result — to filter out documents with an empty joined array?
Next up — Transactions: How multi-document ACID transactions work in MongoDB, when to use them, and how to implement them safely with sessions in PyMongo.