MongoDB
Group, Match & Project
Lessons 27 and 28 introduced these three stages as part of broader pipeline discussions. This lesson goes deep on each one individually — covering every accumulator operator in $group, every expression operator in $project, and the full range of query patterns available in $match. Mastering these three stages means mastering the core of the aggregation framework — every other stage exists to feed data into them or shape their output. All examples use the Dataplexa Store dataset so you can see exactly how each operator behaves against real data.
1. $group — Every Accumulator Operator
$group collapses multiple documents into one per unique grouping key and computes accumulated values using accumulator operators. Each operator defines what to do with the field values from all documents in the group — sum them, average them, find the extremes, collect them into an array, or pick the first and last.
# $group — all accumulator operators demonstrated on Dataplexa data
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Group orders by status — demonstrate every accumulator at once
pipeline = [
{"$group": {
"_id": "$status",
# $sum with 1 — counts documents in the group
"order_count": {"$sum": 1},
# $sum with a field — totals numeric values
"total_revenue": {"$sum": "$total"},
# $avg — arithmetic mean
"avg_order": {"$avg": "$total"},
# $min / $max — smallest and largest values
"min_order": {"$min": "$total"},
"max_order": {"$max": "$total"},
# $push — collects ALL values into an array (duplicates allowed)
"order_ids": {"$push": "$_id"},
# $addToSet — collects UNIQUE values only
"unique_users": {"$addToSet": "$user_id"},
# $first / $last — value from the first / last document in the group
# (meaningful only when pipeline is sorted before $group)
"first_order_date": {"$first": "$date"},
"last_order_date": {"$last": "$date"},
}},
{"$sort": {"total_revenue": -1}},
{"$project": {
"status": "$_id",
"order_count": 1,
"total_revenue": {"$round": ["$total_revenue", 2]},
"avg_order": {"$round": ["$avg_order", 2]},
"min_order": 1,
"max_order": 1,
"order_ids": 1,
"unique_users": 1,
"_id": 0
}}
]
results = list(db.orders.aggregate(pipeline))
print("Orders grouped by status — all accumulators:\n")
for r in results:
print(f" Status: {r['status']}")
print(f" Count: {r['order_count']}")
print(f" Revenue: ${r['total_revenue']:.2f}")
print(f" Avg: ${r['avg_order']:.2f}")
print(f" Min/Max: ${r['min_order']:.2f} / ${r['max_order']:.2f}")
print(f" Order IDs: {r['order_ids']}")
print(f" Unique users: {r['unique_users']}")
print()Status: delivered
Count: 4
Revenue: $486.89
Avg: $121.72
Min/Max: $11.97 / $329.98
Order IDs: ['o001', 'o003', 'o005', 'o007']
Unique users: ['u001', 'u004', 'u002']
Status: processing
Count: 1
Revenue: $349.99
Avg: $349.99
Min/Max: $349.99 / $349.99
Order IDs: ['o006']
Unique users: ['u005']
Status: shipped
Count: 1
Revenue: $89.99
Avg: $89.99
Min/Max: $89.99 / $89.99
Order IDs: ['o002']
Unique users: ['u002']
Status: cancelled
Count: 1
Revenue: $0.00
Avg: $0.00
Min/Max: $0.00 / $0.00
Order IDs: ['o004']
Unique users: ['u003']
$pushcollects every value including duplicates — use it when you need the full list.$addToSetdeduplicates automatically — use it when you need unique values only$firstand$lastare only meaningful when the pipeline is sorted before$group— without a prior$sort, the first and last document in a group are arbitrary$sum: 1is the idiomatic count pattern — it adds 1 for every document in the group, giving the total document count
2. $group — Compound Grouping Keys
The _id grouping key in $group does not have to be a single field — it can be a sub-document combining multiple fields. This lets you group by combinations like category + membership tier, or year + month, producing one output document per unique combination.
# $group with compound _id — grouping by multiple fields simultaneously
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Group orders by user_id AND status — one row per user+status combination
pipeline = [
{"$group": {
"_id": {
"user": "$user_id", # compound key — field 1
"status": "$status" # compound key — field 2
},
"count": {"$sum": 1},
"revenue": {"$sum": "$total"}
}},
{"$sort": {"_id.user": 1, "revenue": -1}},
{"$project": {
"user": "$_id.user",
"status": "$_id.status",
"count": 1,
"revenue": {"$round": ["$revenue", 2]},
"_id": 0
}}
]
results = list(db.orders.aggregate(pipeline))
print("Orders grouped by user + status (compound key):\n")
print(f" {'User':6} {'Status':12} {'Count':5} {'Revenue':>9}")
print(f" {'─'*6} {'─'*12} {'─'*5} {'─'*9}")
for r in results:
print(f" {r['user']:6} {r['status']:12} {r['count']:5} ${r['revenue']:>8.2f}")
# Group by extracted date parts — orders per month
print("\nOrders per month (date part grouping):")
pipeline2 = [
{"$group": {
"_id": {
"year": {"$year": "$date"}, # extract year from date field
"month": {"$month": "$date"} # extract month from date field
},
"order_count": {"$sum": 1},
"month_revenue": {"$sum": "$total"}
}},
{"$sort": {"_id.year": 1, "_id.month": 1}},
{"$project": {
"period": {"$concat": [
{"$toString": "$_id.year"}, "-",
{"$toString": "$_id.month"}
]},
"order_count": 1,
"month_revenue": {"$round": ["$month_revenue", 2]},
"_id": 0
}}
]
for r in db.orders.aggregate(pipeline2):
print(f" {r['period']:8} orders: {r['order_count']} revenue: ${r['month_revenue']:.2f}")User Status Count Revenue
────── ──────────── ───── ─────────
u001 delivered 2 $ 144.94
u002 delivered 1 $ 11.97
u002 shipped 1 $ 89.99
u003 cancelled 1 $ 0.00
u004 delivered 1 $ 329.98
u005 processing 1 $ 349.99
Orders per month (date part grouping):
2024-1 orders: 1 revenue: $44.96
2024-2 orders: 3 revenue: $189.97
2024-3 orders: 2 revenue: $679.97
2024-4 orders: 1 revenue: $11.97
- A compound
_idis written as a sub-document — access its fields in later stages using dot notation:"$_id.user","$_id.status" - Date extraction operators —
$year,$month,$dayOfMonth,$dayOfWeek,$hour— make time-series grouping straightforward without any string manipulation - Every unique combination of compound key values produces a separate output document — five users × four statuses could theoretically produce up to 20 output documents
3. $match — Advanced Filter Patterns
$match accepts every query operator available in find() — but it has additional power inside a pipeline because it can filter on fields computed by earlier stages. This section demonstrates the most useful patterns: filtering on computed fields, using $expr to compare two fields in the same document, and using $match as a HAVING clause after $group.
# $match — advanced patterns inside a pipeline
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Pattern 1 — $match on a computed $addFields value
print("Pattern 1: filter on computed field (stock_status):")
pipeline1 = [
{"$addFields": {
"stock_status": {
"$cond": {
"if": {"$lte": ["$stock", 10]},
"then": "low",
"else": "ok"
}
}
}},
{"$match": {"stock_status": "low"}}, # filter on the computed field
{"$project": {"name": 1, "stock": 1, "stock_status": 1, "_id": 0}}
]
for p in db.products.aggregate(pipeline1):
print(f" {p['name']:25} stock: {p['stock']:3} status: {p['stock_status']}")
# Pattern 2 — $expr: compare two fields within the same document
# Find orders where the total is greater than the average product price ($117)
# More practically: find reviews where rating equals the product's own rating
print("\nPattern 2: $expr — filter products where stock < rating * 10:")
pipeline2 = [
{"$match": {
"$expr": {"$lt": ["$stock", {"$multiply": ["$rating", 10]}]}
}},
{"$project": {"name": 1, "stock": 1, "rating": 1, "_id": 0}}
]
for p in db.products.aggregate(pipeline2):
print(f" {p['name']:25} stock: {p['stock']:3} rating: {p['rating']}")
# Pattern 3 — $match after $group (HAVING equivalent)
print("\nPattern 3: HAVING — users with total delivered spend > $50:")
pipeline3 = [
{"$match": {"status": "delivered"}},
{"$group": {"_id": "$user_id", "total_spent": {"$sum": "$total"}}},
{"$match": {"total_spent": {"$gt": 50}}}, # HAVING
{"$sort": {"total_spent": -1}},
{"$project":{"user": "$_id", "total_spent": {"$round": ["$total_spent", 2]}, "_id": 0}}
]
for r in db.orders.aggregate(pipeline3):
print(f" {r['user']} spent: ${r['total_spent']:.2f}")Notebook A5 stock: 5 status: low
Standing Desk stock: 12 status: ok
(only Notebook A5 has stock ≤ 10)
Pattern 2: $expr — filter products where stock < rating * 10:
Notebook A5 stock: 5 rating: 4.2
Monitor 27-inch stock: 8 rating: 4.6
Pattern 3: HAVING — users with total delivered spend > $50:
u004 spent: $329.98
u001 spent: $144.94
$expris the only way to compare two fields within the same document inside a$match— standard query operators can only compare a field against a literal value$condis the inline ternary operator —{"if": condition, "then": valueIfTrue, "else": valueIfFalse}— equivalent to a CASE WHEN with one branch- A
$matchstage that follows a$groupcannot use a collection index — it filters the in-memory group results. Only the first$matchin a pipeline can benefit from an index
4. $project — Expression Operators Deep Dive
$project supports a rich library of expression operators for arithmetic, string manipulation, conditional logic, array operations, and type conversion. This section covers the most useful ones with Dataplexa examples so you have a practical reference for real pipeline work.
# $project — expression operators reference with Dataplexa examples
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
pipeline = [
{"$project": {
"name": 1,
# ── ARITHMETIC ────────────────────────────────────────────────
# Add a fixed amount
"price_plus_vat": {"$round": [{"$multiply": ["$price", 1.20]}, 2]},
# Subtract — savings from a 10% discount
"saving_10pct": {"$round": [{"$multiply": ["$price", 0.10]}, 2]},
# Divide — price per unit if sold in packs (assume 10 per pack)
"price_per_unit": {"$round": [{"$divide": ["$price", 10]}, 3]},
# ── STRING ────────────────────────────────────────────────────
# Concatenate fields into a display label
"display_label": {"$concat": ["$name", " — ", "$category"]},
# Uppercase the category
"cat_upper": {"$toUpper": "$category"},
# String length
"name_length": {"$strLenCP": "$name"},
# ── CONDITIONAL ───────────────────────────────────────────────
# $cond — ternary: is this a premium product (price > $50)?
"tier": {
"$cond": {
"if": {"$gt": ["$price", 50]},
"then": "premium",
"else": "budget"
}
},
# ── TYPE CONVERSION ───────────────────────────────────────────
# Convert numeric rating to string for display
"rating_label": {"$concat": [{"$toString": "$rating"}, " ★"]},
"_id": 0
}},
{"$sort": {"price": -1}},
{"$limit": 4}
]
results = list(db.products.aggregate(pipeline))
print("$project expression operators — top 4 products by price:\n")
for p in results:
print(f" {p['name']:25}")
print(f" price+VAT: ${p['price_plus_vat']:.2f}")
print(f" saving 10%: ${p['saving_10pct']:.2f}")
print(f" label: {p['display_label']}")
print(f" tier: {p['tier']}")
print(f" rating: {p['rating_label']}")
print(f" name length: {p['name_length']} chars")
print()Standing Desk
price+VAT: $419.99
saving 10%: $35.00
label: Standing Desk — Furniture
tier: premium
rating: 4.8 ★
name length: 13 chars
Monitor 27-inch
price+VAT: $359.99
saving 10%: $30.00
label: Monitor 27-inch — Electronics
tier: premium
rating: 4.6 ★
name length: 15 chars
Mechanical Keyboard
price+VAT: $107.99
saving 10%: $9.00
label: Mechanical Keyboard — Electronics
tier: premium
rating: 4.7 ★
name length: 19 chars
USB-C Hub
price+VAT: $59.99
saving 10%: $5.00
label: USB-C Hub — Electronics
tier: premium
rating: 4.3 ★
name length: 9 chars
- Arithmetic operators —
$multiply,$add,$subtract,$divide,$mod,$round,$abs,$ceil,$floor— all work on numeric fields or literal numbers - String operators —
$concat,$toUpper,$toLower,$trim,$substr,$strLenCP,$split— all work on string fields - Type conversion operators —
$toString,$toInt,$toDouble,$toDate,$toBool— are essential when you need to mix field types inside expressions like$concat
5. $project — Conditional Expressions
Beyond the simple ternary $cond, $project supports $switch for multi-branch logic and $ifNull for null-safe field access. These replace what would otherwise be complex application-side if/else chains with clean, server-side expressions.
# $project — $switch, $ifNull, and nested $cond
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
pipeline = [
{"$project": {
"name": 1,
"price": 1,
"stock": 1,
"rating": 1,
# $switch — multi-branch CASE expression
"price_band": {
"$switch": {
"branches": [
{"case": {"$lte": ["$price", 10]}, "then": "Under $10"},
{"case": {"$lte": ["$price", 50]}, "then": "$10–$50"},
{"case": {"$lte": ["$price", 150]}, "then": "$50–$150"},
{"case": {"$lte": ["$price", 500]}, "then": "$150–$500"},
],
"default": "Over $500"
}
},
# $switch — stock level label
"stock_label": {
"$switch": {
"branches": [
{"case": {"$eq": ["$stock", 0]}, "then": "Out of Stock"},
{"case": {"$lte": ["$stock", 5]}, "then": "Critical"},
{"case": {"$lte": ["$stock", 20]}, "then": "Low"},
{"case": {"$lte": ["$stock", 50]}, "then": "Medium"},
],
"default": "High"
}
},
# $ifNull — use a fallback value if the field is null or missing
"brand_display": {"$ifNull": ["$brand", "Unknown Brand"]},
"_id": 0
}},
{"$sort": {"price": 1}}
]
results = list(db.products.aggregate(pipeline))
print("Products with conditional labels:\n")
print(f" {'Name':25} {'Price':>7} {'Band':12} {'Stock':3} {'Level':12} Brand")
print(f" {'─'*25} {'─'*7} {'─'*12} {'─'*3} {'─'*12} {'─'*10}")
for p in results:
print(f" {p['name']:25} ${p['price']:>6.2f} "
f"{p['price_band']:12} {p['stock']:3} "
f"{p['stock_label']:12} {p['brand_display']}")Name Price Band Stock Level Brand
───────────────────────── ─────── ──────────── ─── ──────────── ──────────
Ballpoint Pens 10-pack $ 3.49 Under $10 120 High Bic
Notebook A5 $ 4.99 Under $10 5 Critical Moleskine
Wireless Mouse $ 29.99 $10–$50 42 Medium Logitech
USB-C Hub $ 49.99 $10–$50 50 Medium Anker
Mechanical Keyboard $ 89.99 $50–$150 35 Medium Corsair
Monitor 27-inch $299.99 $150–$500 8 Low Samsung
Standing Desk $349.99 $150–$500 12 Low FlexiSpot
$switchevaluates branches in order and returns the value of the first matching case — always include adefaultto handle values that match no branch$ifNulltakes two arguments — the field to check and the fallback value. If the field isnullor missing, the fallback is returned instead- Conditional expressions can be nested — a
$condcan contain another$condas itsthenorelsevalue for complex multi-level logic
Summary Table
| Stage / Operator | What It Does | Key Note |
|---|---|---|
$sum: 1 |
Count documents in group | Idiomatic count pattern |
$push / $addToSet |
Collect values into array | $push keeps duplicates, $addToSet deduplicates |
$first / $last |
Value from first/last doc in group | Only meaningful after a prior $sort |
Compound _id |
Group by multiple fields | Access with "$_id.field" in later stages |
$expr in $match |
Compare two fields in same document | Only way to use field references in $match |
$match after $group |
Filter on computed group results | SQL HAVING equivalent — cannot use indexes |
$cond |
Ternary if/then/else expression | SQL CASE WHEN … ELSE with one branch |
$switch |
Multi-branch conditional | Always include a default |
$ifNull |
Fallback for null or missing fields | Essential for optional fields |
Practice Questions
Practice 1. What is the difference between $push and $addToSet in a $group stage?
Practice 2. Why are $first and $last only meaningful when the pipeline is sorted before $group?
Practice 3. What does $expr allow you to do inside $match that standard query operators cannot?
Practice 4. Write a $project expression that labels a product as "expensive" if price > $100 and "affordable" otherwise.
Practice 5. How do you access individual fields of a compound $group _id in a later pipeline stage?
Quiz
Quiz 1. Which accumulator operator collects all values including duplicates into an array?
Quiz 2. What is the correct syntax to group documents by both user_id and status in a single $group stage?
Quiz 3. Which $project operator should you use when a field might be null or missing and you want to display a fallback value?
Quiz 4. Can a $match stage placed after $group use a collection index to improve performance?
Quiz 5. What date extraction operator would you use in a $group _id to group orders by month?
Next up — $lookup (Joins): Mastering cross-collection joins with pipeline-style $lookup, multiple joins in one pipeline, and self-joins.