Mango DBLesson 29 – Group, Match & Project | Dataplexa

Group, Match & Project

Lessons 27 and 28 introduced these three stages as part of broader pipeline discussions. This lesson goes deep on each one individually — covering every accumulator operator in $group, every expression operator in $project, and the full range of query patterns available in $match. Mastering these three stages means mastering the core of the aggregation framework — every other stage exists to feed data into them or shape their output. All examples use the Dataplexa Store dataset so you can see exactly how each operator behaves against real data.

1. $group — Every Accumulator Operator

$group collapses multiple documents into one per unique grouping key and computes accumulated values using accumulator operators. Each operator defines what to do with the field values from all documents in the group — sum them, average them, find the extremes, collect them into an array, or pick the first and last.

# $group — all accumulator operators demonstrated on Dataplexa data

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Group orders by status — demonstrate every accumulator at once
pipeline = [
    {"$group": {
        "_id":           "$status",

        # $sum with 1 — counts documents in the group
        "order_count":   {"$sum": 1},

        # $sum with a field — totals numeric values
        "total_revenue": {"$sum": "$total"},

        # $avg — arithmetic mean
        "avg_order":     {"$avg": "$total"},

        # $min / $max — smallest and largest values
        "min_order":     {"$min": "$total"},
        "max_order":     {"$max": "$total"},

        # $push — collects ALL values into an array (duplicates allowed)
        "order_ids":     {"$push": "$_id"},

        # $addToSet — collects UNIQUE values only
        "unique_users":  {"$addToSet": "$user_id"},

        # $first / $last — value from the first / last document in the group
        # (meaningful only when pipeline is sorted before $group)
        "first_order_date": {"$first": "$date"},
        "last_order_date":  {"$last":  "$date"},
    }},
    {"$sort": {"total_revenue": -1}},
    {"$project": {
        "status":        "$_id",
        "order_count":   1,
        "total_revenue": {"$round": ["$total_revenue", 2]},
        "avg_order":     {"$round": ["$avg_order",     2]},
        "min_order":     1,
        "max_order":     1,
        "order_ids":     1,
        "unique_users":  1,
        "_id":           0
    }}
]

results = list(db.orders.aggregate(pipeline))
print("Orders grouped by status — all accumulators:\n")
for r in results:
    print(f"  Status:       {r['status']}")
    print(f"  Count:        {r['order_count']}")
    print(f"  Revenue:      ${r['total_revenue']:.2f}")
    print(f"  Avg:          ${r['avg_order']:.2f}")
    print(f"  Min/Max:      ${r['min_order']:.2f} / ${r['max_order']:.2f}")
    print(f"  Order IDs:    {r['order_ids']}")
    print(f"  Unique users: {r['unique_users']}")
    print()

Orders grouped by status — all accumulators:

Status: delivered
Count: 4
Revenue: $486.89
Avg: $121.72
Min/Max: $11.97 / $329.98
Order IDs: ['o001', 'o003', 'o005', 'o007']
Unique users: ['u001', 'u004', 'u002']

Status: processing
Count: 1
Revenue: $349.99
Avg: $349.99
Min/Max: $349.99 / $349.99
Order IDs: ['o006']
Unique users: ['u005']

Status: shipped
Count: 1
Revenue: $89.99
Avg: $89.99
Min/Max: $89.99 / $89.99
Order IDs: ['o002']
Unique users: ['u002']

Status: cancelled
Count: 1
Revenue: $0.00
Avg: $0.00
Min/Max: $0.00 / $0.00
Order IDs: ['o004']
Unique users: ['u003']

$push collects every value including duplicates — use it when you need the full list. $addToSet deduplicates automatically — use it when you need unique values only
$first and $last are only meaningful when the pipeline is sorted before $group — without a prior $sort, the first and last document in a group are arbitrary
$sum: 1 is the idiomatic count pattern — it adds 1 for every document in the group, giving the total document count

2. $group — Compound Grouping Keys

The _id grouping key in $group does not have to be a single field — it can be a sub-document combining multiple fields. This lets you group by combinations like category + membership tier, or year + month, producing one output document per unique combination.

# $group with compound _id — grouping by multiple fields simultaneously

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Group orders by user_id AND status — one row per user+status combination
pipeline = [
    {"$group": {
        "_id": {
            "user":   "$user_id",   # compound key — field 1
            "status": "$status"     # compound key — field 2
        },
        "count":   {"$sum": 1},
        "revenue": {"$sum": "$total"}
    }},
    {"$sort": {"_id.user": 1, "revenue": -1}},
    {"$project": {
        "user":    "$_id.user",
        "status":  "$_id.status",
        "count":   1,
        "revenue": {"$round": ["$revenue", 2]},
        "_id":     0
    }}
]

results = list(db.orders.aggregate(pipeline))
print("Orders grouped by user + status (compound key):\n")
print(f"  {'User':6}  {'Status':12}  {'Count':5}  {'Revenue':>9}")
print(f"  {'─'*6}  {'─'*12}  {'─'*5}  {'─'*9}")
for r in results:
    print(f"  {r['user']:6}  {r['status']:12}  {r['count']:5}  ${r['revenue']:>8.2f}")

# Group by extracted date parts — orders per month
print("\nOrders per month (date part grouping):")
pipeline2 = [
    {"$group": {
        "_id": {
            "year":  {"$year":  "$date"},   # extract year from date field
            "month": {"$month": "$date"}    # extract month from date field
        },
        "order_count":   {"$sum": 1},
        "month_revenue": {"$sum": "$total"}
    }},
    {"$sort": {"_id.year": 1, "_id.month": 1}},
    {"$project": {
        "period":        {"$concat": [
            {"$toString": "$_id.year"}, "-",
            {"$toString": "$_id.month"}
        ]},
        "order_count":   1,
        "month_revenue": {"$round": ["$month_revenue", 2]},
        "_id":           0
    }}
]
for r in db.orders.aggregate(pipeline2):
    print(f"  {r['period']:8}  orders: {r['order_count']}  revenue: ${r['month_revenue']:.2f}")

Orders grouped by user + status (compound key):

User Status Count Revenue
────── ──────────── ───── ─────────
u001 delivered 2 $ 144.94
u002 delivered 1 $ 11.97
u002 shipped 1 $ 89.99
u003 cancelled 1 $ 0.00
u004 delivered 1 $ 329.98
u005 processing 1 $ 349.99

Orders per month (date part grouping):
2024-1 orders: 1 revenue: $44.96
2024-2 orders: 3 revenue: $189.97
2024-3 orders: 2 revenue: $679.97
2024-4 orders: 1 revenue: $11.97

A compound _id is written as a sub-document — access its fields in later stages using dot notation: "$_id.user", "$_id.status"
Date extraction operators — $year, $month, $dayOfMonth, $dayOfWeek, $hour — make time-series grouping straightforward without any string manipulation
Every unique combination of compound key values produces a separate output document — five users × four statuses could theoretically produce up to 20 output documents

3. $match — Advanced Filter Patterns

$match accepts every query operator available in find() — but it has additional power inside a pipeline because it can filter on fields computed by earlier stages. This section demonstrates the most useful patterns: filtering on computed fields, using $expr to compare two fields in the same document, and using $match as a HAVING clause after $group.

# $match — advanced patterns inside a pipeline

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Pattern 1 — $match on a computed $addFields value
print("Pattern 1: filter on computed field (stock_status):")
pipeline1 = [
    {"$addFields": {
        "stock_status": {
            "$cond": {
                "if":   {"$lte": ["$stock", 10]},
                "then": "low",
                "else": "ok"
            }
        }
    }},
    {"$match": {"stock_status": "low"}},   # filter on the computed field
    {"$project": {"name": 1, "stock": 1, "stock_status": 1, "_id": 0}}
]
for p in db.products.aggregate(pipeline1):
    print(f"  {p['name']:25}  stock: {p['stock']:3}  status: {p['stock_status']}")

# Pattern 2 — $expr: compare two fields within the same document
# Find orders where the total is greater than the average product price ($117)
# More practically: find reviews where rating equals the product's own rating
print("\nPattern 2: $expr — filter products where stock < rating * 10:")
pipeline2 = [
    {"$match": {
        "$expr": {"$lt": ["$stock", {"$multiply": ["$rating", 10]}]}
    }},
    {"$project": {"name": 1, "stock": 1, "rating": 1, "_id": 0}}
]
for p in db.products.aggregate(pipeline2):
    print(f"  {p['name']:25}  stock: {p['stock']:3}  rating: {p['rating']}")

# Pattern 3 — $match after $group (HAVING equivalent)
print("\nPattern 3: HAVING — users with total delivered spend > $50:")
pipeline3 = [
    {"$match":  {"status": "delivered"}},
    {"$group":  {"_id": "$user_id", "total_spent": {"$sum": "$total"}}},
    {"$match":  {"total_spent": {"$gt": 50}}},         # HAVING
    {"$sort":   {"total_spent": -1}},
    {"$project":{"user": "$_id", "total_spent": {"$round": ["$total_spent", 2]}, "_id": 0}}
]
for r in db.orders.aggregate(pipeline3):
    print(f"  {r['user']}  spent: ${r['total_spent']:.2f}")

Pattern 1: filter on computed field (stock_status):
Notebook A5 stock: 5 status: low
Standing Desk stock: 12 status: ok

(only Notebook A5 has stock ≤ 10)

Pattern 2: $expr — filter products where stock < rating * 10:
Notebook A5 stock: 5 rating: 4.2
Monitor 27-inch stock: 8 rating: 4.6

Pattern 3: HAVING — users with total delivered spend > $50:
u004 spent: $329.98
u001 spent: $144.94

$expr is the only way to compare two fields within the same document inside a $match — standard query operators can only compare a field against a literal value
$cond is the inline ternary operator — {"if": condition, "then": valueIfTrue, "else": valueIfFalse} — equivalent to a CASE WHEN with one branch
A $match stage that follows a $group cannot use a collection index — it filters the in-memory group results. Only the first $match in a pipeline can benefit from an index

4. $project — Expression Operators Deep Dive

$project supports a rich library of expression operators for arithmetic, string manipulation, conditional logic, array operations, and type conversion. This section covers the most useful ones with Dataplexa examples so you have a practical reference for real pipeline work.

# $project — expression operators reference with Dataplexa examples

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

pipeline = [
    {"$project": {
        "name": 1,

        # ── ARITHMETIC ────────────────────────────────────────────────
        # Add a fixed amount
        "price_plus_vat": {"$round": [{"$multiply": ["$price", 1.20]}, 2]},

        # Subtract — savings from a 10% discount
        "saving_10pct":   {"$round": [{"$multiply": ["$price", 0.10]}, 2]},

        # Divide — price per unit if sold in packs (assume 10 per pack)
        "price_per_unit": {"$round": [{"$divide": ["$price", 10]}, 3]},

        # ── STRING ────────────────────────────────────────────────────
        # Concatenate fields into a display label
        "display_label":  {"$concat": ["$name", " — ", "$category"]},

        # Uppercase the category
        "cat_upper":      {"$toUpper": "$category"},

        # String length
        "name_length":    {"$strLenCP": "$name"},

        # ── CONDITIONAL ───────────────────────────────────────────────
        # $cond — ternary: is this a premium product (price > $50)?
        "tier": {
            "$cond": {
                "if":   {"$gt": ["$price", 50]},
                "then": "premium",
                "else": "budget"
            }
        },

        # ── TYPE CONVERSION ───────────────────────────────────────────
        # Convert numeric rating to string for display
        "rating_label":   {"$concat": [{"$toString": "$rating"}, " ★"]},

        "_id": 0
    }},
    {"$sort": {"price": -1}},
    {"$limit": 4}
]

results = list(db.products.aggregate(pipeline))
print("$project expression operators — top 4 products by price:\n")
for p in results:
    print(f"  {p['name']:25}")
    print(f"    price+VAT:   ${p['price_plus_vat']:.2f}")
    print(f"    saving 10%:  ${p['saving_10pct']:.2f}")
    print(f"    label:       {p['display_label']}")
    print(f"    tier:        {p['tier']}")
    print(f"    rating:      {p['rating_label']}")
    print(f"    name length: {p['name_length']} chars")
    print()

$project expression operators — top 4 products by price:

Standing Desk
price+VAT: $419.99
saving 10%: $35.00
label: Standing Desk — Furniture
tier: premium
rating: 4.8 ★
name length: 13 chars

Monitor 27-inch
price+VAT: $359.99
saving 10%: $30.00
label: Monitor 27-inch — Electronics
tier: premium
rating: 4.6 ★
name length: 15 chars

Mechanical Keyboard
price+VAT: $107.99
saving 10%: $9.00
label: Mechanical Keyboard — Electronics
tier: premium
rating: 4.7 ★
name length: 19 chars

USB-C Hub
price+VAT: $59.99
saving 10%: $5.00
label: USB-C Hub — Electronics
tier: premium
rating: 4.3 ★
name length: 9 chars

Arithmetic operators — $multiply, $add, $subtract, $divide, $mod, $round, $abs, $ceil, $floor — all work on numeric fields or literal numbers
String operators — $concat, $toUpper, $toLower, $trim, $substr, $strLenCP, $split — all work on string fields
Type conversion operators — $toString, $toInt, $toDouble, $toDate, $toBool — are essential when you need to mix field types inside expressions like $concat

5. $project — Conditional Expressions

Beyond the simple ternary $cond, $project supports $switch for multi-branch logic and $ifNull for null-safe field access. These replace what would otherwise be complex application-side if/else chains with clean, server-side expressions.

# $project — $switch, $ifNull, and nested $cond

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

pipeline = [
    {"$project": {
        "name":     1,
        "price":    1,
        "stock":    1,
        "rating":   1,

        # $switch — multi-branch CASE expression
        "price_band": {
            "$switch": {
                "branches": [
                    {"case": {"$lte": ["$price",  10]}, "then": "Under $10"},
                    {"case": {"$lte": ["$price",  50]}, "then": "$10–$50"},
                    {"case": {"$lte": ["$price", 150]}, "then": "$50–$150"},
                    {"case": {"$lte": ["$price", 500]}, "then": "$150–$500"},
                ],
                "default": "Over $500"
            }
        },

        # $switch — stock level label
        "stock_label": {
            "$switch": {
                "branches": [
                    {"case": {"$eq":  ["$stock", 0]},  "then": "Out of Stock"},
                    {"case": {"$lte": ["$stock", 5]},  "then": "Critical"},
                    {"case": {"$lte": ["$stock", 20]}, "then": "Low"},
                    {"case": {"$lte": ["$stock", 50]}, "then": "Medium"},
                ],
                "default": "High"
            }
        },

        # $ifNull — use a fallback value if the field is null or missing
        "brand_display": {"$ifNull": ["$brand", "Unknown Brand"]},

        "_id": 0
    }},
    {"$sort": {"price": 1}}
]

results = list(db.products.aggregate(pipeline))
print("Products with conditional labels:\n")
print(f"  {'Name':25}  {'Price':>7}  {'Band':12}  {'Stock':3}  {'Level':12}  Brand")
print(f"  {'─'*25}  {'─'*7}  {'─'*12}  {'─'*3}  {'─'*12}  {'─'*10}")
for p in results:
    print(f"  {p['name']:25}  ${p['price']:>6.2f}  "
          f"{p['price_band']:12}  {p['stock']:3}  "
          f"{p['stock_label']:12}  {p['brand_display']}")

Products with conditional labels:

Name Price Band Stock Level Brand
───────────────────────── ─────── ──────────── ─── ──────────── ──────────
Ballpoint Pens 10-pack $ 3.49 Under $10 120 High Bic
Notebook A5 $ 4.99 Under $10 5 Critical Moleskine
Wireless Mouse $ 29.99 $10–$50 42 Medium Logitech
USB-C Hub $ 49.99 $10–$50 50 Medium Anker
Mechanical Keyboard $ 89.99 $50–$150 35 Medium Corsair
Monitor 27-inch $299.99 $150–$500 8 Low Samsung
Standing Desk $349.99 $150–$500 12 Low FlexiSpot

$switch evaluates branches in order and returns the value of the first matching case — always include a default to handle values that match no branch
$ifNull takes two arguments — the field to check and the fallback value. If the field is null or missing, the fallback is returned instead
Conditional expressions can be nested — a $cond can contain another $cond as its then or else value for complex multi-level logic

Summary Table

Stage / Operator	What It Does	Key Note
`$sum: 1`	Count documents in group	Idiomatic count pattern
`$push` / `$addToSet`	Collect values into array	`$push` keeps duplicates, `$addToSet` deduplicates
`$first` / `$last`	Value from first/last doc in group	Only meaningful after a prior `$sort`
Compound `_id`	Group by multiple fields	Access with `"$_id.field"` in later stages
`$expr` in `$match`	Compare two fields in same document	Only way to use field references in `$match`
`$match` after `$group`	Filter on computed group results	SQL `HAVING` equivalent — cannot use indexes
`$cond`	Ternary if/then/else expression	SQL `CASE WHEN … ELSE` with one branch
`$switch`	Multi-branch conditional	Always include a `default`
`$ifNull`	Fallback for null or missing fields	Essential for optional fields

Practice Questions

Practice 1. What is the difference between $push and $addToSet in a $group stage?

Practice 2. Why are $first and $last only meaningful when the pipeline is sorted before $group?

Practice 3. What does $expr allow you to do inside $match that standard query operators cannot?

Practice 4. Write a $project expression that labels a product as "expensive" if price > $100 and "affordable" otherwise.

Practice 5. How do you access individual fields of a compound $group _id in a later pipeline stage?

Quiz

Quiz 1. Which accumulator operator collects all values including duplicates into an array?

$push — appends every value, preserving duplicates
$addToSet — collects only unique values
$sum — totals the values numerically
$first — takes only the first value

Quiz 2. What is the correct syntax to group documents by both user_id and status in a single $group stage?

_id as a sub-document: {"user": "$user_id", "status": "$status"}
Use two separate $group stages
Comma-separate field names as a string
Use an array as the _id

Quiz 3. Which $project operator should you use when a field might be null or missing and you want to display a fallback value?

$ifNull — returns fallback if field is null or missing
$cond — the ternary operator handles all conditional cases
$exists — checks whether a field is present
$default — sets a fallback at the collection level

Quiz 4. Can a $match stage placed after $group use a collection index to improve performance?

No — only the first $match can use a collection index
Yes — MongoDB applies indexes at every $match stage
Yes — if the grouped _id field is indexed
It depends on the MongoDB version

Quiz 5. What date extraction operator would you use in a $group _id to group orders by month?

$month — extracts month number (1–12) from a Date field
$dateToString — converts the full date to a string
$substr — takes a substring of the date string
$split — splits the date by the dash character

Next up — $lookup (Joins): Mastering cross-collection joins with pipeline-style $lookup, multiple joins in one pipeline, and self-joins.

← Previous Course Index Next →