MongoDB
Data Modelling
Data modelling in MongoDB is fundamentally different from relational database design. In SQL you normalise data into separate tables and join them at query time. In MongoDB you design documents around the questions your application asks — the schema follows the access pattern, not the other way around. Getting this right at the start pays enormous dividends in query simplicity, performance, and scalability. Getting it wrong means expensive migrations and slow queries that no amount of indexing can fully rescue.
This lesson covers the core principles of MongoDB data modelling: understanding your access patterns first, the document size limit, schema flexibility, the one-to-few / one-to-many / one-to-squillions spectrum, and the foundational rules that guide every modelling decision. The next lesson goes deep on the embedding vs referencing choice specifically.
The Golden Rule — Model for Your Access Patterns
In a relational database you model your data first and write queries later. MongoDB inverts this: you understand your queries first and model your data to answer them efficiently. Every modelling decision — what to embed, what to reference, which fields to include in a document — flows from the answer to one question: how will the application read and write this data?
# Access pattern analysis — the questions you must answer before modelling
access_pattern_questions = [
("What does the app read most often?",
"→ Those fields should be in the same document — avoid cross-collection lookups"),
("What is written together?",
"→ Data written in one operation belongs in one document"),
("What is the read:write ratio?",
"→ High reads: optimise for query speed (denormalise if needed)\n"
" High writes: avoid large arrays that require frequent updates"),
("How large can arrays grow?",
"→ Small bounded arrays (< 100 items): embed\n"
" Unbounded arrays: reference in a separate collection"),
("Are documents queried individually or as a set?",
"→ Individual: store all needed data in the document\n"
" Set: consider a parent document with child references"),
("Does the data change together or independently?",
"→ Changes together: embed\n"
" Changes independently: reference"),
]
print("Access pattern questions to answer before modelling:\n")
for i, (question, guidance) in enumerate(access_pattern_questions, 1):
    print(f" {i}. {question}")
    for line in guidance.split("\n"):
        print(f"    {line}")
    print()
Access pattern questions to answer before modelling:

1. What does the app read most often?
→ Those fields should be in the same document — avoid cross-collection lookups
2. What is written together?
→ Data written in one operation belongs in one document
3. What is the read:write ratio?
→ High reads: optimise for query speed (denormalise if needed)
→ High writes: avoid large arrays that require frequent updates
4. How large can arrays grow?
→ Small bounded arrays (< 100 items): embed
→ Unbounded arrays: reference in a separate collection
5. Are documents queried individually or as a set?
→ Individual: store all needed data in the document
→ Set: consider a parent document with child references
6. Does the data change together or independently?
→ Changes together: embed
→ Changes independently: reference
- Write down your top 5 most frequent queries before touching schema design — let those queries drive every decision
- Unlike SQL, there is no query planner that joins across collections efficiently — cross-collection lookups with $lookup are expensive and should be the exception, not the rule
- Access patterns change over time — document them and revisit the schema when the application's read/write profile changes significantly
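To make the $lookup caveat concrete, here is a minimal sketch of the pipeline a cross-collection lookup requires, built as plain dicts so its shape is visible without a running server. The "orders" collection and "user_id" field follow the Dataplexa naming used later in this lesson:

```python
# Sketch only: a $lookup stage joining one user to their orders —
# the MongoDB analogue of a SQL left outer join.
order_history_pipeline = [
    {"$match": {"_id": "u001"}},            # select the parent user
    {"$lookup": {
        "from": "orders",                   # child collection to join
        "localField": "_id",                # field on the user document
        "foreignField": "user_id",          # matching field on each order
        "as": "orders",                     # matches arrive as an embedded array
    }},
]

# With a live connection this would run as:
#   db.users.aggregate(order_history_pipeline)
# Every execution re-scans (or index-probes) the orders collection,
# which is why frequently co-read data is cheaper to embed up front.
print(order_history_pipeline[1]["$lookup"])
```

The pipeline effectively rebuilds, at query time, the embedded document you could have stored in the first place — a useful escape hatch, not a default design tool.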
The Document Size Limit and Its Implications
Every MongoDB document has a hard maximum size of 16 MB. This limit exists to prevent runaway document growth and to keep memory usage predictable. For most documents this limit is irrelevant — the average document is kilobytes, not megabytes. But it becomes a practical constraint when embedding large arrays or binary data, and it is the primary reason unbounded arrays should never be embedded directly.
# Document size limit — understanding when it matters
import sys
import json
# Estimate the size of a typical Dataplexa user document
user_doc = {
"_id": "u001",
"name": "Alice Johnson",
"email": "alice@example.com",
"age": 30,
"city": "London",
"country": "UK",
"membership": "premium",
"tags": ["early_adopter", "newsletter"],
"joined": "2022-03-15"
}
doc_size_bytes = len(json.dumps(user_doc).encode("utf-8"))
limit_bytes = 16 * 1024 * 1024 # 16 MB
print(f"Typical user document size: {doc_size_bytes} bytes")
print(f"MongoDB document size limit: {limit_bytes:,} bytes ({limit_bytes / 1024 / 1024:.0f} MB)")
print(f"Headroom remaining: {limit_bytes - doc_size_bytes:,} bytes")
print()
# Demonstrate the unbounded array problem
# If a user could have unlimited orders embedded:
simulated_order = {"order_id": "o001", "total": 99.99, "items": [{"product_id": "p001", "qty": 1}]}
order_size = len(json.dumps(simulated_order).encode("utf-8"))
max_orders = limit_bytes // order_size
print(f"Single embedded order size: {order_size} bytes")
print(f"Maximum orders embeddable: ~{max_orders:,} before hitting 16 MB limit")
print()
print("Problem: a power user with hundreds of thousands of orders would EXCEED the 16 MB limit")
print("Solution: store orders in a separate collection, reference by user_id")
MongoDB document size limit: 16,777,216 bytes (16 MB)
Headroom remaining: 16,776,929 bytes
Single embedded order size: 76 bytes
Maximum orders embeddable: ~220,752 before hitting 16 MB limit
Problem: a power user with hundreds of thousands of orders would EXCEED the 16 MB limit
Solution: store orders in a separate collection, reference by user_id
- The 16 MB limit is per document — a collection can hold unlimited documents of up to 16 MB each
- For binary data (images, PDFs) use GridFS — it splits files into chunks stored in a separate collection and is not subject to the 16 MB per-document limit
- Even before hitting 16 MB, large documents slow down reads, consume more WiredTiger cache, and increase network transfer — aim for the smallest document that satisfies your access patterns
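An application can apply these size concerns proactively by sanity-checking a document before embedding one more element. This is a minimal sketch using the same JSON-based estimate as the code above (real BSON sizes differ slightly; with pymongo installed, `len(bson.encode(doc))` gives the exact figure). The 2 MB budget is an illustrative app-level choice, not a MongoDB setting:

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024   # MongoDB's hard per-document limit
SAFE_DOC_BYTES = 2 * 1024 * 1024   # illustrative app-level budget, well below 16 MB

def estimated_size(doc: dict) -> int:
    """Rough document size via JSON; actual BSON sizes differ slightly."""
    return len(json.dumps(doc).encode("utf-8"))

def can_embed(parent: dict, child: dict, budget: int = SAFE_DOC_BYTES) -> bool:
    """True if embedding `child` keeps the parent inside the size budget."""
    return estimated_size(parent) + estimated_size(child) <= budget

user = {"_id": "u001", "name": "Alice Johnson", "orders": []}
order = {"order_id": "o001", "total": 99.99}
print(can_embed(user, order))  # a tiny order fits comfortably
```

Checks like this belong in the write path: if embedding would blow the budget, that is the signal to move the children into their own collection.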
Schema Flexibility — Polymorphic Documents
Unlike SQL tables where every row has identical columns, MongoDB documents in the same collection can have different shapes. This is called a polymorphic pattern — multiple document types coexist in one collection. It is powerful for inheritance hierarchies and product catalogues where different items have different attributes.
# Polymorphic pattern — different document shapes in one collection
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# The products collection already stores different categories
# Each category can have category-specific fields alongside common ones
polymorphic_products = [
{
"_id": "p_demo_01",
"type": "electronics",
"name": "Wireless Headphones",
"price": 79.99,
"brand": "SoundWave",
# Electronics-specific fields
"battery_hours": 30,
"connectivity": "Bluetooth 5.0",
"noise_cancelling": True,
},
{
"_id": "p_demo_02",
"type": "book",
"name": "MongoDB: The Definitive Guide",
"price": 39.99,
# Book-specific fields
"author": "Shannon Bradshaw",
"pages": 514,
"isbn": "978-1491954461",
"publisher": "O'Reilly",
},
{
"_id": "p_demo_03",
"type": "clothing",
"name": "Merino Wool Jumper",
"price": 89.99,
# Clothing-specific fields
"sizes": ["S", "M", "L", "XL"],
"material": "100% Merino Wool",
"colours": ["Navy", "Charcoal", "Cream"],
},
]
db.demo_products.drop()
db.demo_products.insert_many(polymorphic_products)
# Query all products regardless of type
print("All products (polymorphic — different shapes):")
for p in db.demo_products.find({}, {"name": 1, "type": 1, "price": 1, "_id": 0}):
    print(f" [{p['type']:12}] {p['name']:35} ${p['price']:.2f}")
# Query type-specific fields
print("\nElectronics-specific query:")
for p in db.demo_products.find(
    {"type": "electronics", "noise_cancelling": True},
    {"name": 1, "connectivity": 1, "_id": 0}
):
    print(f" {p['name']} — {p['connectivity']}")
db.demo_products.drop()
All products (polymorphic — different shapes):
[electronics ] Wireless Headphones $79.99
[book ] MongoDB: The Definitive Guide $39.99
[clothing ] Merino Wool Jumper $89.99
Electronics-specific query:
Wireless Headphones — Bluetooth 5.0
- Always include a type or kind discriminator field so the application knows which shape a document has when it fetches it
- Shared fields like name, price, and _id can be queried across all document types without knowing the type
- Polymorphic documents work best when the shared fields dominate and the type-specific fields are supplementary — if two types have almost nothing in common, separate collections are cleaner
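In application code, the discriminator field typically drives a dispatch table. Here is a minimal sketch — the helper names are hypothetical, not part of the Dataplexa schema:

```python
# Type-specific formatters keyed by the `type` discriminator field.
def describe_electronics(p: dict) -> str:
    return f"{p['name']} ({p['connectivity']}, {p['battery_hours']}h battery)"

def describe_book(p: dict) -> str:
    return f"{p['name']} by {p['author']}, {p['pages']} pages"

def describe_generic(p: dict) -> str:
    # Fallback that relies only on the fields every shape shares
    return f"{p['name']} — ${p['price']:.2f}"

DESCRIBERS = {"electronics": describe_electronics, "book": describe_book}

def describe(product: dict) -> str:
    handler = DESCRIBERS.get(product.get("type"), describe_generic)
    return handler(product)

book = {"type": "book", "name": "MongoDB: The Definitive Guide",
        "price": 39.99, "author": "Shannon Bradshaw", "pages": 514}
print(describe(book))
```

The fallback path is the important design detail: a document with an unknown (or missing) type still renders via the shared fields instead of crashing.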
The Relationship Spectrum — One-to-Few, One-to-Many, One-to-Squillions
Every relationship in a data model falls somewhere on a spectrum from "one parent has a handful of children" to "one parent has millions of children". Where a relationship sits on this spectrum determines whether you should embed the children in the parent document or store them in a separate collection.
# The relationship spectrum — three patterns with Dataplexa examples
relationship_spectrum = {
"One-to-Few (embed)": {
"example": "User → addresses (1–5 addresses per user)",
"max_children": "< 100",
"approach": "Embed children directly in parent document",
"reason": "Always retrieved together, small size, no independent queries",
"dataplexa": "User tags array — ['early_adopter', 'newsletter']",
"schema": {
"_id": "u001",
"name": "Alice Johnson",
"addresses": [
{"type": "home", "city": "London", "country": "UK"},
{"type": "billing", "city": "Edinburgh", "country": "UK"}
]
}
},
"One-to-Many (reference)": {
"example": "User → orders (dozens to thousands of orders)",
"max_children": "100 – 100,000",
"approach": "Store children in separate collection, reference parent _id",
"reason": "Children queried independently, unbounded growth risk",
"dataplexa": "Orders collection — each order has user_id: 'u001'",
"schema": {
"orders collection": {"_id": "o001", "user_id": "u001", "total": 44.96}
}
},
"One-to-Squillions (parent reference)": {
"example": "Server → log events (millions per server per day)",
"max_children": "> 100,000",
"approach": "Child documents each store a reference to the parent",
"reason": "An array of child IDs on the parent would hit 16 MB limit",
"dataplexa": "Event log — each event has server_id field pointing to server",
"schema": {
"log_events collection": {"_id": "evt_001", "server_id": "srv_01", "level": "ERROR"}
}
},
}
for pattern, details in relationship_spectrum.items():
    print(f"\n{'─'*60}")
    print(f" Pattern: {pattern}")
    print(f" Example: {details['example']}")
    print(f" Max child: {details['max_children']}")
    print(f" Approach: {details['approach']}")
    print(f" Reason: {details['reason']}")
    print(f" Dataplexa: {details['dataplexa']}")
────────────────────────────────────────────────────────────
Pattern: One-to-Few (embed)
Example: User → addresses (1–5 addresses per user)
Max child: < 100
Approach: Embed children directly in parent document
Reason: Always retrieved together, small size, no independent queries
Dataplexa: User tags array — ['early_adopter', 'newsletter']
────────────────────────────────────────────────────────────
Pattern: One-to-Many (reference)
Example: User → orders (dozens to thousands of orders)
Max child: 100 – 100,000
Approach: Store children in separate collection, reference parent _id
Reason: Children queried independently, unbounded growth risk
Dataplexa: Orders collection — each order has user_id: 'u001'
────────────────────────────────────────────────────────────
Pattern: One-to-Squillions (parent reference)
Example: Server → log events (millions per server per day)
Max child: > 100,000
Approach: Child documents each store a reference to the parent
Reason: An array of child IDs on the parent would hit 16 MB limit
Dataplexa: Event log — each event has server_id field pointing to server
- The boundary between "few" and "many" depends on your data — a product with 50 reviews may be fine to embed, but a popular product with 50,000 reviews must be referenced
- For one-to-squillions always store the parent reference on the child — never store a child ID array on the parent
- The Dataplexa orders collection is a classic one-to-many pattern — each order references its user by user_id rather than embedding all orders inside the user document
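The spectrum above condenses into a rule-of-thumb chooser. The thresholds are the illustrative ones from the table — rules of thumb, not hard MongoDB limits — and any answer should still be sanity-checked against document size:

```python
def relationship_pattern(expected_children: int) -> str:
    """Map an estimated child count to a modelling pattern.
    Thresholds mirror the spectrum table above; they are heuristics."""
    if expected_children < 100:
        return "embed"                        # one-to-few
    if expected_children <= 100_000:
        return "child references parent"      # one-to-many
    return "parent reference on child only"   # one-to-squillions

print(relationship_pattern(5))          # addresses per user
print(relationship_pattern(5_000))      # orders per user
print(relationship_pattern(2_000_000))  # log events per server
```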
Schema Design Patterns
MongoDB has established a library of named schema design patterns — reusable solutions to common modelling problems. The most important ones for application developers are the Attribute Pattern, the Bucket Pattern, and the Computed Pattern.
# Three essential schema design patterns
# ── 1. ATTRIBUTE PATTERN ────────────────────────────────────────────────────
# Problem: A product has many optional specification fields
# (different products have different specs — hard to index them all)
# Solution: Convert sparse fields into a key-value array
# Before — sparse fields, hard to index
before_attribute = {
"_id": "p_laptop",
"name": "Pro Laptop",
"ram_gb": 16,
"cpu_cores": 8,
# "screen_size": only for laptops
"screen_size_inches": 15.6,
# "battery_hours": only for portable devices
"battery_hours": 10,
}
# After — attribute pattern: normalised key-value pairs, easy to index
after_attribute = {
"_id": "p_laptop",
"name": "Pro Laptop",
"specs": [
{"k": "ram_gb", "v": 16, "unit": "GB"},
{"k": "cpu_cores", "v": 8, "unit": "cores"},
{"k": "screen_size_inches", "v": 15.6, "unit": "inches"},
{"k": "battery_hours", "v": 10, "unit": "hours"},
]
}
# One index on specs.k and specs.v covers all attribute queries
# ── 2. BUCKET PATTERN ───────────────────────────────────────────────────────
# Problem: Time-series data (IoT, metrics) — one document per reading
# creates millions of tiny documents, crushing index and storage performance
# Solution: Group N readings into a single bucket document
bucket_doc = {
"_id": "sensor_42_2024-03-01_00",
"sensor_id": "sensor_42",
"date": "2024-03-01",
"hour": 0,
"count": 60, # readings in this bucket
"readings": [ # up to 60 readings per bucket (one per minute)
{"minute": 0, "temp": 21.3, "humidity": 45},
{"minute": 1, "temp": 21.4, "humidity": 44},
# ... up to minute 59
],
"summary": {"min_temp": 21.1, "max_temp": 22.0, "avg_temp": 21.5}
}
# ── 3. COMPUTED PATTERN ─────────────────────────────────────────────────────
# Problem: Expensive aggregation (e.g. average rating) run on every page load
# Solution: Pre-compute and store the result, update on write
computed_product = {
"_id": "p001",
"name": "Wireless Mouse",
"price": 29.99,
# Pre-computed fields updated whenever a new review is inserted
"review_count": 42,
"avg_rating": 4.5,
"rating_dist": {"1": 1, "2": 2, "3": 3, "4": 12, "5": 24}
}
print("Attribute pattern — one index covers all specs:")
print(f" Index: db.products.create_index([('specs.k', 1), ('specs.v', 1)])")
print()
print("Bucket pattern — one document per hour of sensor readings:")
print(f" Bucket _id: {bucket_doc['_id']}")
print(f" Readings per bucket: up to {bucket_doc['count']}")
print()
print("Computed pattern — pre-computed aggregation stored on document:")
print(f" avg_rating: {computed_product['avg_rating']} review_count: {computed_product['review_count']}")
Attribute pattern — one index covers all specs:
 Index: db.products.create_index([('specs.k', 1), ('specs.v', 1)])
Bucket pattern — one document per hour of sensor readings:
Bucket _id: sensor_42_2024-03-01_00
Readings per bucket: up to 60
Computed pattern — pre-computed aggregation stored on document:
avg_rating: 4.5 review_count: 42
- The Attribute Pattern solves the sparse field and wildcard indexing problem — one compound index on specs.k and specs.v covers queries on any attribute
- The Bucket Pattern dramatically reduces document count and index size for time-series data — MongoDB even has a native Time Series collection type that implements this automatically
- The Computed Pattern trades storage for compute — you store a redundant value to avoid recalculating it on every read. Update the computed field in the same write operation that changes the underlying data
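The Computed Pattern's update-on-write step can be sketched with the standard incremental-average identity new_avg = (avg × count + rating) / (count + 1). This is a plain-Python sketch of the bookkeeping; in production the same mutation would go in the write (or transaction) that inserts the review:

```python
def apply_new_review(product: dict, rating: int) -> dict:
    """Incrementally refresh the pre-computed aggregates when a review arrives."""
    count = product["review_count"]
    new_count = count + 1
    # Recompute the average without re-reading every review document
    new_avg = (product["avg_rating"] * count + rating) / new_count
    product["review_count"] = new_count
    product["avg_rating"] = round(new_avg, 2)
    key = str(rating)
    product["rating_dist"][key] = product["rating_dist"].get(key, 0) + 1
    return product

product = {"_id": "p001", "review_count": 42, "avg_rating": 4.5,
           "rating_dist": {"1": 1, "2": 2, "3": 3, "4": 12, "5": 24}}
apply_new_review(product, 5)
print(product["review_count"], product["avg_rating"])
```

Note the trade-off made explicit: a few bytes of redundant storage on the product document replace an aggregation over potentially thousands of review documents on every page load.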
Applying the Dataplexa Schema Decisions
Every schema decision in the Dataplexa Store dataset reflects a deliberate modelling choice. Understanding why those choices were made is the best way to internalise the principles.
# Dataplexa schema decisions — rationale for each choice
schema_decisions = [
{
"entity": "User tags",
"decision": "Embedded array in user document",
"rationale": [
"Small bounded set — max 5-10 tags per user",
"Always retrieved with the user — no independent query needed",
"Written together with user document on profile update",
],
},
{
"entity": "Order line items (items array)",
"decision": "Embedded array of sub-documents in order",
"rationale": [
"Items are always fetched with their order — never queried alone",
"Bounded per order — a realistic max is 50-100 items",
"Writing an order and its items is one atomic operation",
],
},
{
"entity": "Orders themselves",
"decision": "Separate collection, referencing user_id",
"rationale": [
"Unbounded per user — a user can have thousands of orders",
"Orders are frequently queried independently (order history, status)",
"Embedding all orders in the user would risk exceeding 16 MB",
],
},
{
"entity": "Reviews",
"decision": "Separate collection, referencing product_id and user_id",
"rationale": [
"Potentially thousands of reviews per popular product",
"Queried independently — all reviews for a product, all reviews by a user",
"Embedding would create unbounded product document growth",
],
},
]
print("Dataplexa Store — schema decision rationale:\n")
for d in schema_decisions:
    print(f" {d['entity']}")
    print(f"   Decision: {d['decision']}")
    print("   Why:")
    for r in d["rationale"]:
        print(f"     • {r}")
    print()
Dataplexa Store — schema decision rationale:

User tags
Decision: Embedded array in user document
Why:
• Small bounded set — max 5-10 tags per user
• Always retrieved with the user — no independent query needed
• Written together with user document on profile update
Order line items (items array)
Decision: Embedded array of sub-documents in order
Why:
• Items are always fetched with their order — never queried alone
• Bounded per order — a realistic max is 50-100 items
• Writing an order and its items is one atomic operation
Orders themselves
Decision: Separate collection, referencing user_id
Why:
• Unbounded per user — a user can have thousands of orders
• Orders are frequently queried independently
• Embedding all orders in the user would risk exceeding 16 MB
Reviews
Decision: Separate collection, referencing product_id and user_id
Why:
• Potentially thousands of reviews per popular product
• Queried independently — all reviews for a product or by a user
• Embedding would create unbounded product document growth
- Notice how the same principle — embed when bounded and co-retrieved, reference when unbounded or independently queried — explains every decision in the Dataplexa schema
- The items array inside an order is embedded because it is bounded and atomic — the order line items and the order total are written and read as one unit
- Neither pattern is universally better — the right choice always depends on your specific access patterns and data cardinality
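The orders decision above can be sketched with plain dicts: the order carries the user_id reference, the bounded items array stays embedded, and order history becomes a single filter on the orders collection. Field values are illustrative, and the created_at sort field is hypothetical:

```python
# One-to-many by reference: the user document stays small no matter
# how many orders accumulate.
order = {
    "_id": "o001",
    "user_id": "u001",   # reference back to the parent user
    "total": 44.96,
    "items": [           # bounded and written atomically with the order: embed
        {"product_id": "p001", "qty": 1, "price": 29.99},
        {"product_id": "p002", "qty": 3, "price": 4.99},
    ],
}

# Order history is then one indexed query on the child collection:
order_history_filter = {"user_id": "u001"}
# With a live connection (and an index on user_id):
#   db.orders.find(order_history_filter).sort("created_at", -1)

print(len(order["items"]), order["user_id"])
```

For this to stay fast at scale, the orders collection needs an index on user_id — the reference pattern shifts the cost from document growth to index lookups.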
Summary Table
| Concept | Key Principle | Dataplexa Example |
|---|---|---|
| Access-pattern-first design | Model for queries, not for normalisation | Items embedded in order — always fetched together |
| 16 MB document limit | Never embed unbounded arrays | Orders in separate collection, not in user doc |
| Polymorphic pattern | Different shapes in one collection | Electronics vs Stationery vs Furniture products |
| One-to-few | Embed — bounded, co-retrieved | User tags array |
| One-to-many | Reference — separate collection | Orders referencing user_id |
| One-to-squillions | Child stores parent reference | Log events storing server_id |
| Attribute pattern | Sparse fields → k/v array | Product specs with one compound index |
| Bucket pattern | Group time-series into batches | Sensor readings bucketed by hour |
| Computed pattern | Pre-store expensive aggregations | avg_rating stored on product document |
Practice Questions
Practice 1. What is the single most important question to answer before designing a MongoDB schema?
Practice 2. Why should an unbounded array never be embedded directly in a parent document?
Practice 3. What is the Attribute Pattern and when should you use it?
Practice 4. Explain the one-to-squillions pattern and how it differs from one-to-many.
Practice 5. In the Dataplexa Store, why are order line items embedded inside the order document but orders themselves are in a separate collection from users?
Quiz
Quiz 1. What is the hard maximum size of a single MongoDB document?
Quiz 2. What distinguishes the one-to-squillions pattern from one-to-many?
Quiz 3. What is the primary purpose of the Computed Pattern?
Quiz 4. Which MongoDB tool should you use to store files larger than 16 MB?
Quiz 5. Why does the Bucket Pattern improve performance for time-series data compared to one document per reading?
Next up — Embedded vs Referenced: A deep dive into when to embed data and when to reference it, with decision frameworks and real-world trade-offs.