Mango DBLesson 30 – Lookup (Joins) | Dataplexa

$lookup (Joins)

MongoDB is a document database — related data is often embedded rather than split across tables. But referencing is unavoidable for unbounded relationships, and when you reference you eventually need to join. The $lookup aggregation stage is MongoDB's join mechanism. It performs a left outer join between the current pipeline documents and a foreign collection, attaching matching documents as an array. This lesson covers every form of $lookup: the simple equality join, the expressive pipeline-style join with custom conditions, joining multiple collections in one pipeline, and the self-join pattern. All examples use the Dataplexa Store so you can see exactly what the output looks like at each step.

1. Simple $lookup — Equality Join

The simplest form of $lookup matches a field in the local collection against a field in the foreign collection using a straight equality check. This is identical to a SQL LEFT JOIN … ON localField = foreignField and covers the vast majority of real join requirements.

# Simple $lookup — join orders to users on user_id = _id

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

pipeline = [
    # Stage 1: filter to the orders we care about
    {"$match": {"status": {"$in": ["delivered", "shipped"]}}},

    # Stage 2: simple equality $lookup
    # For each order, find all users where users._id == orders.user_id
    {"$lookup": {
        "from":         "users",      # foreign collection
        "localField":   "user_id",    # field in the current (orders) documents
        "foreignField": "_id",        # field in the foreign (users) collection
        "as":           "user_info"   # name of the output array field
    }},

    # Stage 3: $unwind to flatten the single-element array into a sub-document
    {"$unwind": "$user_info"},

    # Stage 4: shape the output
    {"$project": {
        "_id":        0,
        "order_id":   "$_id",
        "status":     1,
        "total":      1,
        "date":       1,
        "customer":   "$user_info.name",
        "membership": "$user_info.membership",
        "country":    "$user_info.country"
    }},

    {"$sort": {"total": -1}}
]

results = list(db.orders.aggregate(pipeline))
print("Delivered/shipped orders with customer details ($lookup):\n")
print(f"  {'Order':6}  {'Customer':15}  {'Tier':8}  {'Country':8}  {'Status':10}  {'Total':>8}")
print(f"  {'─'*6}  {'─'*15}  {'─'*8}  {'─'*8}  {'─'*10}  {'─'*8}")
for r in results:
    print(f"  {r['order_id']:6}  {r['customer']:15}  "
          f"{r['membership']:8}  {r['country']:8}  "
          f"{r['status']:10}  ${r['total']:>7.2f}")

Delivered/shipped orders with customer details ($lookup):

Order Customer Tier Country Status Total
────── ─────────────── ──────── ──────── ────────── ────────
o005 Eva Müller premium Germany delivered $ 329.98
o003 Alice Johnson premium UK delivered $ 99.98
o002 Bob Smith basic UK shipped $ 89.99
o001 Alice Johnson premium UK delivered $ 44.96
o007 Bob Smith basic UK delivered $ 11.97

$lookup always produces an array — even when there is exactly one match. Use $unwind immediately after to flatten it to a sub-document when you expect one result per document
When no matching document exists in the foreign collection, the output array is empty and the default $unwind drops that document from the pipeline — use preserveNullAndEmptyArrays: True to keep it
Always index the foreign collection's join field — users._id is already indexed automatically, but custom join fields like orders.user_id must be indexed manually

2. Pipeline-Style $lookup — Custom Join Conditions

The simple localField/foreignField form only supports equality joins. For more complex conditions — joining on multiple fields, applying filters to the foreign collection before joining, or using expressions in the join condition — MongoDB provides a pipeline-style $lookup using let and pipeline parameters. This is the equivalent of a SQL JOIN with a more complex ON clause.

# Pipeline-style $lookup — custom join conditions with let + pipeline

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Scenario: for each user, find only their DELIVERED orders
# Simple $lookup would return ALL orders for a user — we want only delivered ones
# Pipeline $lookup lets us filter the foreign collection before joining

pipeline = [
    # Start from users collection
    {"$match": {"membership": "premium"}},

    {"$lookup": {
        "from": "orders",

        # let: expose local (users) fields as variables for use inside the pipeline
        "let": {
            "uid":  "$_id",          # $$uid = the current user's _id
        },

        # pipeline: run this against the foreign (orders) collection
        # $$ references variables declared in let
        # $ references fields in the foreign collection
        "pipeline": [
            {"$match": {
                "$expr": {
                    "$and": [
                        {"$eq":  ["$user_id", "$$uid"]},         # join condition
                        {"$eq":  ["$status",  "delivered"]}      # additional filter
                    ]
                }
            }},
            {"$project": {"_id": 1, "total": 1, "date": 1, "status": 1}},
            {"$sort": {"date": -1}}
        ],

        "as": "delivered_orders"
    }},

    # Keep only users who have at least one delivered order
    {"$match": {"delivered_orders.0": {"$exists": True}}},

    {"$project": {
        "_id":             0,
        "customer":        "$name",
        "membership":      1,
        "country":         1,
        "delivered_count": {"$size": "$delivered_orders"},
        "total_spent":     {"$sum":  "$delivered_orders.total"},
        "orders":          "$delivered_orders"
    }},

    {"$sort": {"total_spent": -1}}
]

results = list(db.users.aggregate(pipeline))
print("Premium users with their delivered orders (pipeline $lookup):\n")
for r in results:
    print(f"  {r['customer']} ({r['country']}, {r['membership']})")
    print(f"  Delivered orders: {r['delivered_count']}  "
          f"Total spent: ${r['total_spent']:.2f}")
    for o in r["orders"]:
        print(f"    {o['_id']}  ${o['total']:.2f}  {o['date']}")
    print()

Premium users with their delivered orders (pipeline $lookup):

Alice Johnson (UK, premium)
Delivered orders: 2 Total spent: $144.94
o003 $99.98 2024-02-20
o001 $44.96 2024-01-10

Eva Müller (Germany, premium)
Delivered orders: 1 Total spent: $329.98
o005 $329.98 2024-03-10

Variables declared in let are accessed inside the sub-pipeline with a double-dollar prefix: $$uid. Single-dollar $user_id refers to fields in the foreign collection
The sub-pipeline inside pipeline can use any aggregation stages — $match, $project, $sort, $limit, $group — giving you full control over what gets joined
Filtering inside the pipeline $lookup is more efficient than joining everything and filtering afterwards — the foreign collection is pre-filtered before the join result is materialised

3. Joining Multiple Collections in One Pipeline

A real reporting query often needs data from three or more collections. You can chain multiple $lookup stages in a single pipeline — each one joining a new collection to the documents flowing through. The output of each $lookup enriches the document further before passing it to the next stage.

# Multiple $lookup stages — joining orders → users → products in one pipeline

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Goal: for each delivered order, show the customer name AND
# the name of every product in the order

pipeline = [
    # Stage 1: filter delivered orders only
    {"$match": {"status": "delivered"}},

    # Stage 2: join order → user
    {"$lookup": {
        "from":         "users",
        "localField":   "user_id",
        "foreignField": "_id",
        "as":           "user_info"
    }},
    {"$unwind": "$user_info"},

    # Stage 3: unwind the items array so we can join each item to products
    {"$unwind": "$items"},

    # Stage 4: join each line item → product
    {"$lookup": {
        "from":         "products",
        "localField":   "items.product_id",
        "foreignField": "_id",
        "as":           "product_info"
    }},
    {"$unwind": "$product_info"},

    # Stage 5: project the final shape
    {"$project": {
        "_id":          0,
        "order_id":     "$_id",
        "customer":     "$user_info.name",
        "product":      "$product_info.name",
        "category":     "$product_info.category",
        "qty":          "$items.qty",
        "line_total":   {"$round": [
                            {"$multiply": ["$items.qty", "$items.price"]}, 2
                        ]},
    }},

    {"$sort": {"order_id": 1, "product": 1}}
]

results = list(db.orders.aggregate(pipeline))
print("Delivered order line items with customer and product names:\n")
print(f"  {'Order':6}  {'Customer':15}  {'Product':25}  {'Cat':12}  {'Qty':3}  {'Line'}")
print(f"  {'─'*6}  {'─'*15}  {'─'*25}  {'─'*12}  {'─'*3}  {'─'*8}")
for r in results:
    print(f"  {r['order_id']:6}  {r['customer']:15}  "
          f"{r['product']:25}  {r['category']:12}  "
          f"{r['qty']:3}  ${r['line_total']:.2f}")

Delivered order line items with customer and product names:

Order Customer Product Cat Qty Line
────── ─────────────── ───────────────────────── ──────────── ─── ────────
o001 Alice Johnson Notebook A5 Stationery 3 $14.97
o001 Alice Johnson Wireless Mouse Electronics 1 $29.99
o003 Alice Johnson USB-C Hub Electronics 1 $49.99
o003 Alice Johnson Wireless Mouse Electronics 2 $59.98
o005 Eva Müller Monitor 27-inch Electronics 1 $299.99
o005 Eva Müller Standing Desk Furniture 1 $349.99
o007 Bob Smith Ballpoint Pens 10-pack Stationery 2 $6.98
o007 Bob Smith Notebook A5 Stationery 1 $4.99

Each $lookup stage adds a new array field to the document — after $unwind each one becomes a flat sub-document you can reference with dot notation
The order of $lookup stages matters when later joins depend on fields produced by earlier ones — here the items $unwind must happen before the products $lookup
Multi-join pipelines can become expensive — always place $match as early as possible to reduce the document set before the first join, and ensure all join fields are indexed

4. $lookup with $unwind Optimisation

When a $lookup is immediately followed by an $unwind on the joined array, MongoDB's query planner recognises this as a single optimised operation — it streams joined documents one at a time rather than building the entire joined array in memory first. This significantly reduces memory usage for large joins.

# $lookup + $unwind optimisation — and handling missing joins gracefully

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Demonstrate preserveNullAndEmptyArrays — keep orders even if no matching user
# (simulates orphaned documents — good defensive practice)

pipeline = [
    {"$lookup": {
        "from":         "users",
        "localField":   "user_id",
        "foreignField": "_id",
        "as":           "user_info"
    }},

    # preserveNullAndEmptyArrays keeps orders with no matching user in the output
    {"$unwind": {
        "path":                       "$user_info",
        "preserveNullAndEmptyArrays": True
    }},

    {"$project": {
        "_id":       0,
        "order_id":  "$_id",
        "status":    1,
        "total":     1,
        # $ifNull handles the case where user_info is missing
        "customer":  {"$ifNull": ["$user_info.name",       "Unknown Customer"]},
        "tier":      {"$ifNull": ["$user_info.membership", "unknown"]},
    }},

    {"$sort": {"order_id": 1}}
]

results = list(db.orders.aggregate(pipeline))
print("All orders with customer name (preserveNullAndEmptyArrays):\n")
print(f"  {'Order':6}  {'Status':12}  {'Total':>8}  {'Customer':15}  Tier")
print(f"  {'─'*6}  {'─'*12}  {'─'*8}  {'─'*15}  {'─'*8}")
for r in results:
    print(f"  {r['order_id']:6}  {r['status']:12}  "
          f"${r['total']:>7.2f}  {r['customer']:15}  {r['tier']}")

# Explain the optimisation
print("\nQuery planner note:")
print("  When $lookup is immediately followed by $unwind on the same array,")
print("  MongoDB coalesces them into one streaming operation — lower memory,")
print("  no full array materialisation needed before unwinding.")

All orders with customer name (preserveNullAndEmptyArrays):

Order Status Total Customer Tier
────── ──────────── ──────── ─────────────── ────────
o001 delivered $ 44.96 Alice Johnson premium
o002 shipped $ 89.99 Bob Smith basic
o003 delivered $ 99.98 Alice Johnson premium
o004 cancelled $ 0.00 Clara Diaz premium
o005 delivered $ 329.98 Eva Müller premium
o006 processing $ 349.99 Eva Müller premium
o007 delivered $ 11.97 Bob Smith basic

Query planner note:
When $lookup is immediately followed by $unwind on the same array,
MongoDB coalesces them into one streaming operation — lower memory,
no full array materialisation needed before unwinding.

Place $unwind immediately after $lookup — not only is this the correct pattern for flattening, it also triggers MongoDB's internal coalescing optimisation
preserveNullAndEmptyArrays: True is essential for defensive pipelines — it prevents silently losing documents that have no match in the foreign collection
Use $ifNull after preserveNullAndEmptyArrays to replace missing joined fields with readable fallback values in the output

5. Self-Join — Joining a Collection to Itself

A self-join uses $lookup with from pointing to the same collection as the source. This is useful for hierarchical data — finding related documents within the same collection, such as products in the same category as a given product, or users who joined in the same month as another user.

# Self-join — find other products in the same category using $lookup

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# For each Electronics product, find the other Electronics products
# (self-join on category, excluding the document itself)

pipeline = [
    # Only process Electronics products
    {"$match": {"category": "Electronics"}},

    # Self-join: from the same products collection
    {"$lookup": {
        "from": "products",
        "let":  {"current_id": "$_id", "current_cat": "$category"},
        "pipeline": [
            {"$match": {
                "$expr": {
                    "$and": [
                        # Same category
                        {"$eq":  ["$category", "$$current_cat"]},
                        # But NOT the document itself
                        {"$ne":  ["$_id",      "$$current_id"]}
                    ]
                }
            }},
            {"$project": {"name": 1, "price": 1, "rating": 1, "_id": 0}},
            {"$sort": {"rating": -1}}
        ],
        "as": "related_products"
    }},

    {"$project": {
        "_id":              0,
        "product":          "$name",
        "price":            1,
        "related_count":    {"$size": "$related_products"},
        "top_related":      {"$slice": ["$related_products", 2]}  # top 2 only
    }},

    {"$sort": {"price": -1}}
]

results = list(db.products.aggregate(pipeline))
print("Electronics products with related items (self-join):\n")
for r in results:
    print(f"  {r['product']} (${r['price']:.2f}) — "
          f"{r['related_count']} related products")
    for rel in r["top_related"]:
        print(f"    → {rel['name']:25} ${rel['price']:.2f}  ★{rel['rating']}")

Electronics products with related items (self-join):

Monitor 27-inch ($299.99) — 3 related products
→ Mechanical Keyboard $89.99 ★4.7
→ USB-C Hub $49.99 ★4.3

Mechanical Keyboard ($89.99) — 3 related products
→ Monitor 27-inch $299.99 ★4.6
→ USB-C Hub $49.99 ★4.3

USB-C Hub ($49.99) — 3 related products
→ Mechanical Keyboard $89.99 ★4.7
→ Monitor 27-inch $299.99 ★4.6

Wireless Mouse ($29.99) — 3 related products
→ Mechanical Keyboard $89.99 ★4.7
→ Monitor 27-inch $299.99 ★4.6

A self-join requires the pipeline-style $lookup with let — because you need to pass the current document's fields into the sub-pipeline to exclude it from its own results
$slice inside $project trims an array to the first N elements — use it to avoid returning the full related products list when you only need a preview
Self-joins are the foundation of "related items", "customers also bought", and "similar profiles" features — patterns that power recommendation sections in real e-commerce applications

Summary Table

$lookup Form	When to Use	Key Parameters	SQL Equivalent
Simple equality	One field matches one foreign field	`from`, `localField`, `foreignField`, `as`	`LEFT JOIN ON a.id = b.id`
Pipeline-style	Multiple conditions, filter foreign docs	`from`, `let`, `pipeline`, `as`	`LEFT JOIN ON … AND …`
Multiple joins	Data from 3+ collections in one query	Chain multiple `$lookup` stages	Multiple `JOIN` clauses
+ $unwind optimisation	Always — reduces memory usage	`preserveNullAndEmptyArrays`	Streaming join
Self-join	Related documents in same collection	`from` = same collection, `$ne` self exclusion	`JOIN table AS t2 ON …`
`$$variable` in let	Pass local fields into sub-pipeline	Double-dollar prefix references `let` variables	Correlated subquery parameter

Practice Questions

Practice 1. What is the key difference between a simple $lookup and a pipeline-style $lookup?

Practice 2. In a pipeline-style $lookup, what is the difference between $$uid and $user_id?

Practice 3. What happens by default when a $lookup finds no matching document in the foreign collection, and how do you change this behaviour?

Practice 4. Why must $unwind be applied before a second $lookup when joining through an embedded array field like items.product_id?

Practice 5. What performance optimisation does MongoDB apply when $lookup is immediately followed by $unwind on the same array field?

Quiz

Quiz 1. What type of SQL join does $lookup perform by default?

Left outer join — all local documents included, empty array if no foreign match
Inner join — only documents with a match in both collections
Right outer join — all foreign documents are included
Cross join — every combination from both collections

Quiz 2. In a pipeline-style $lookup, where do you declare the local collection's fields to make them accessible inside the sub-pipeline?

In the let parameter — accessed with $$variableName inside the sub-pipeline
In the localField parameter — same as the simple form
In the vars parameter at the top of the pipeline
Local fields are automatically available with $$ without declaration

Quiz 3. Which $project operator trims an array to its first N elements — useful after $lookup to limit a related items list?

$slice — returns the first N elements of an array
$limit — caps the number of documents in the pipeline
$first — returns only the first element
$trim — removes leading and trailing elements

Quiz 4. Why does a self-join require the pipeline-style $lookup rather than the simple form?

You need $ne on _id to exclude self — only possible with let + pipeline $lookup
Simple $lookup cannot use the same collection as both local and foreign
Self-joins are not supported in MongoDB
Simple $lookup always returns all documents including the source

Quiz 5. What is the correct way to check whether a $lookup returned at least one result — to filter out documents with an empty joined array?

{"joinedField.0": {"$exists": True}} — checks first element exists
{"joinedField": {"$ne": []}} — compare to empty array
{"joinedField": {"$exists": True}} — field exists check
{"joinedField.length": {"$gt": 0}} — length property check

Next up — Transactions: How multi-document ACID transactions work in MongoDB, when to use them, and how to implement them safely with sessions in PyMongo.

← Previous Course Index Next →