MongoDB
Data Modelling
Data modelling in MongoDB is fundamentally different from relational database design. In SQL you normalise data into separate tables and join them at query time. In MongoDB you design documents around the questions your application asks — the schema follows the access pattern, not the other way around. Getting this right at the start pays enormous dividends in query simplicity, performance, and scalability. Getting it wrong means expensive migrations and slow queries that no amount of indexing can fully rescue.
This lesson covers the core principles of MongoDB data modelling: understanding your access patterns first, the document size limit, schema flexibility, the one-to-few / one-to-many / one-to-squillions spectrum, and the foundational rules that guide every modelling decision. The next lesson goes deep on the embedding vs referencing choice specifically.
The Golden Rule — Model for Your Access Patterns
In a relational database you model your data first and write queries later. MongoDB inverts this: you understand your queries first and model your data to answer them efficiently. Every modelling decision — what to embed, what to reference, which fields to include in a document — flows from the answer to one question: how will the application read and write this data?
# Access pattern analysis — the questions you must answer before modelling
access_pattern_questions = [
("What does the app read most often?",
"→ Those fields should be in the same document — avoid cross-collection lookups"),
("What is written together?",
"→ Data written in one operation belongs in one document"),
("What is the read:write ratio?",
"→ High reads: optimise for query speed (denormalise if needed)\n"
" High writes: avoid large arrays that require frequent updates"),
("How large can arrays grow?",
"→ Small bounded arrays (< 100 items): embed\n"
" Unbounded arrays: reference in a separate collection"),
("Are documents queried individually or as a set?",
"→ Individual: store all needed data in the document\n"
" Set: consider a parent document with child references"),
("Does the data change together or independently?",
"→ Changes together: embed\n"
" Changes independently: reference"),
]
print("Access pattern questions to answer before modelling:\n")
for i, (question, guidance) in enumerate(access_pattern_questions, 1):
    print(f" {i}. {question}")
    for line in guidance.split("\n"):
        print(f"    {line}")
    print()
Access pattern questions to answer before modelling:

1. What does the app read most often?
→ Those fields should be in the same document — avoid cross-collection lookups
2. What is written together?
→ Data written in one operation belongs in one document
3. What is the read:write ratio?
→ High reads: optimise for query speed (denormalise if needed)
→ High writes: avoid large arrays that require frequent updates
4. How large can arrays grow?
→ Small bounded arrays (< 100 items): embed
→ Unbounded arrays: reference in a separate collection
5. Are documents queried individually or as a set?
→ Individual: store all needed data in the document
→ Set: consider a parent document with child references
6. Does the data change together or independently?
→ Changes together: embed
→ Changes independently: reference
- Write down your top 5 most frequent queries before touching schema design — let those queries drive every decision
- Unlike SQL, there is no query planner that joins across collections efficiently — cross-collection lookups with $lookup are expensive and should be the exception, not the rule
- Access patterns change over time — document them and revisit the schema when the application's read/write profile changes significantly
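To make the $lookup caveat concrete, here is a minimal sketch of the pipeline a cross-collection lookup requires, built as plain dicts so its shape is visible without a running server. The "orders" collection and "user_id" field follow the Dataplexa naming used later in this lesson:

```python
# Sketch only: a $lookup stage joining one user to their orders —
# the MongoDB analogue of a SQL left outer join.
order_history_pipeline = [
    {"$match": {"_id": "u001"}},            # select the parent user
    {"$lookup": {
        "from": "orders",                   # child collection to join
        "localField": "_id",                # field on the user document
        "foreignField": "user_id",          # matching field on each order
        "as": "orders",                     # matches arrive as an embedded array
    }},
]

# With a live connection this would run as:
#   db.users.aggregate(order_history_pipeline)
# Every execution re-scans (or index-probes) the orders collection,
# which is why frequently co-read data is cheaper to embed up front.
print(order_history_pipeline[1]["$lookup"])
```

The pipeline effectively rebuilds, at query time, the embedded document you could have stored in the first place — a useful escape hatch, not a default design tool.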
The Document Size Limit and Its Implications
Every MongoDB document has a hard maximum size of 16 MB. This limit exists to prevent runaway document growth and to keep memory usage predictable. For most documents this limit is irrelevant — the average document is kilobytes, not megabytes. But it becomes a practical constraint when embedding large arrays or binary data, and it is the primary reason unbounded arrays should never be embedded directly.
# Document size limit — understanding when it matters
import sys
import json
# Estimate the size of a typical Dataplexa user document
user_doc = {
"_id": "u001",
"name": "Alice Johnson",
"email": "alice@example.com",
"age": 30,
"city": "London",
"country": "UK",
"membership": "premium",
"tags": ["early_adopter", "newsletter"],
"joined": "2022-03-15"
}
doc_size_bytes = len(json.dumps(user_doc).encode("utf-8"))
limit_bytes = 16 * 1024 * 1024 # 16 MB
print(f"Typical user document size: {doc_size_bytes} bytes")
print(f"MongoDB document size limit: {limit_bytes:,} bytes ({limit_bytes / 1024 / 1024:.0f} MB)")
print(f"Headroom remaining: {limit_bytes - doc_size_bytes:,} bytes")
print()
# Demonstrate the unbounded array problem
# If a user could have unlimited orders embedded:
simulated_order = {"order_id": "o001", "total": 99.99, "items": [{"product_id": "p001", "qty": 1}]}
order_size = len(json.dumps(simulated_order).encode("utf-8"))
max_orders = limit_bytes // order_size
print(f"Single embedded order size: {order_size} bytes")
print(f"Maximum orders embeddable: ~{max_orders:,} before hitting 16 MB limit")
print()
print("Problem: a power user with hundreds of thousands of orders would EXCEED the 16 MB limit")
print("Solution: store orders in a separate collection, reference by user_id")
MongoDB document size limit: 16,777,216 bytes (16 MB)
Headroom remaining: 16,776,929 bytes
Single embedded order size: 76 bytes
Maximum orders embeddable: ~220,752 before hitting 16 MB limit
Problem: a power user with hundreds of thousands of orders would EXCEED the 16 MB limit
Solution: store orders in a separate collection, reference by user_id
- The 16 MB limit is per document — a collection can hold unlimited documents of up to 16 MB each
- For binary data (images, PDFs) use GridFS — it splits files into chunks stored in a separate collection and is not subject to the 16 MB per-document limit
- Even before hitting 16 MB, large documents slow down reads, consume more WiredTiger cache, and increase network transfer — aim for the smallest document that satisfies your access patterns
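An application can apply these size concerns proactively by sanity-checking a document before embedding one more element. This is a minimal sketch using the same JSON-based estimate as the code above (real BSON sizes differ slightly; with pymongo installed, `len(bson.encode(doc))` gives the exact figure). The 2 MB budget is an illustrative app-level choice, not a MongoDB setting:

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024   # MongoDB's hard per-document limit
SAFE_DOC_BYTES = 2 * 1024 * 1024   # illustrative app-level budget, well below 16 MB

def estimated_size(doc: dict) -> int:
    """Rough document size via JSON; actual BSON sizes differ slightly."""
    return len(json.dumps(doc).encode("utf-8"))

def can_embed(parent: dict, child: dict, budget: int = SAFE_DOC_BYTES) -> bool:
    """True if embedding `child` keeps the parent inside the size budget."""
    return estimated_size(parent) + estimated_size(child) <= budget

user = {"_id": "u001", "name": "Alice Johnson", "orders": []}
order = {"order_id": "o001", "total": 99.99}
print(can_embed(user, order))  # a tiny order fits comfortably
```

Checks like this belong in the write path: if embedding would blow the budget, that is the signal to move the children into their own collection.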
Schema Flexibility — Polymorphic Documents
Unlike SQL tables where every row has identical columns, MongoDB documents in the same collection can have different shapes. This is called a polymorphic pattern — multiple document types coexist in one collection. It is powerful for inheritance hierarchies and product catalogues where different items have different attributes.
# Polymorphic pattern — different document shapes in one collection
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# The products collection already stores different categories
# Each category can have category-specific fields alongside common ones
polymorphic_products = [
{
"_id": "p_demo_01",
"type": "electronics",
"name": "Wireless Headphones",
"price": 79.99,
"brand": "SoundWave",
# Electronics-specific fields
"battery_hours": 30,
"connectivity": "Bluetooth 5.0",
"noise_cancelling": True,
},
{
"_id": "p_demo_02",
"type": "book",
"name": "MongoDB: The Definitive Guide",
"price": 39.99,
# Book-specific fields
"author": "Shannon Bradshaw",
"pages": 514,
"isbn": "978-1491954461",
"publisher": "O'Reilly",
},
{
"_id": "p_demo_03",
"type": "clothing",
"name": "Merino Wool Jumper",
"price": 89.99,
# Clothing-specific fields
"sizes": ["S", "M", "L", "XL"],
"material": "100% Merino Wool",
"colours": ["Navy", "Charcoal", "Cream"],
},
]
db.demo_products.drop()
db.demo_products.insert_many(polymorphic_products)
# Query all products regardless of type
print("All products (polymorphic — different shapes):")
for p in db.demo_products.find({}, {"name": 1, "type": 1, "price": 1, "_id": 0}):
    print(f" [{p['type']:12}] {p['name']:35} ${p['price']:.2f}")
# Query type-specific fields
print("\nElectronics-specific query:")
for p in db.demo_products.find(
    {"type": "electronics", "noise_cancelling": True},
    {"name": 1, "connectivity": 1, "_id": 0}
):
    print(f" {p['name']} — {p['connectivity']}")
db.demo_products.drop()
All products (polymorphic — different shapes):
[electronics ] Wireless Headphones $79.99
[book ] MongoDB: The Definitive Guide $39.99
[clothing ] Merino Wool Jumper $89.99
Electronics-specific query:
Wireless Headphones — Bluetooth 5.0
- Always include a type or kind discriminator field so the application knows which shape a document has when it fetches it
- Shared fields like name, price, and _id can be queried across all document types without knowing the type
- Polymorphic documents work best when the shared fields dominate and the type-specific fields are supplementary — if two types have almost nothing in common, separate collections are cleaner
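In application code, the discriminator field typically drives a dispatch table. Here is a minimal sketch — the helper names are hypothetical, not part of the Dataplexa schema:

```python
# Type-specific formatters keyed by the `type` discriminator field.
def describe_electronics(p: dict) -> str:
    return f"{p['name']} ({p['connectivity']}, {p['battery_hours']}h battery)"

def describe_book(p: dict) -> str:
    return f"{p['name']} by {p['author']}, {p['pages']} pages"

def describe_generic(p: dict) -> str:
    # Fallback that relies only on the fields every shape shares
    return f"{p['name']} — ${p['price']:.2f}"

DESCRIBERS = {"electronics": describe_electronics, "book": describe_book}

def describe(product: dict) -> str:
    handler = DESCRIBERS.get(product.get("type"), describe_generic)
    return handler(product)

book = {"type": "book", "name": "MongoDB: The Definitive Guide",
        "price": 39.99, "author": "Shannon Bradshaw", "pages": 514}
print(describe(book))
```

The fallback path is the important design detail: a document with an unknown (or missing) type still renders via the shared fields instead of crashing.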
The Relationship Spectrum — One-to-Few, One-to-Many, One-to-Squillions
Every relationship in a data model falls somewhere on a spectrum from "one parent has a handful of children" to "one parent has millions of children". Where a relationship sits on this spectrum determines whether you should embed the children in the parent document or store them in a separate collection.
# The relationship spectrum — three patterns with Dataplexa examples
relationship_spectrum = {
"One-to-Few (embed)": {
"example": "User → addresses (1–5 addresses per user)",
"max_children": "< 100",
"approach": "Embed children directly in parent document",
"reason": "Always retrieved together, small size, no independent queries",
"dataplexa": "User tags array — ['early_adopter', 'newsletter']",
"schema": {
"_id": "u001",
"name": "Alice Johnson",
"addresses": [
{"type": "home", "city": "London", "country": "UK"},
{"type": "billing", "city": "Edinburgh", "country": "UK"}
]
}
},
"One-to-Many (reference)": {
"example": "User → orders (dozens to thousands of orders)",
"max_children": "100 – 100,000",
"approach": "Store children in separate collection, reference parent _id",
"reason": "Children queried independently, unbounded growth risk",
"dataplexa": "Orders collection — each order has user_id: 'u001'",
"schema": {
"orders collection": {"_id": "o001", "user_id": "u001", "total": 44.96}
}
},
"One-to-Squillions (parent reference)": {
"example": "Server → log events (millions per server per day)",
"max_children": "> 100,000",
"approach": "Child documents each store a reference to the parent",
"reason": "An array of child IDs on the parent would hit 16 MB limit",
"dataplexa": "Event log — each event has server_id field pointing to server",
"schema": {
"log_events collection": {"_id": "evt_001", "server_id": "srv_01", "level": "ERROR"}
}
},
}
for pattern, details in relationship_spectrum.items():
    print(f"\n{'─'*60}")
    print(f" Pattern: {pattern}")
    print(f" Example: {details['example']}")
    print(f" Max child: {details['max_children']}")
    print(f" Approach: {details['approach']}")
    print(f" Reason: {details['reason']}")
    print(f" Dataplexa: {details['dataplexa']}")
────────────────────────────────────────────────────────────
Pattern: One-to-Few (embed)
Example: User → addresses (1–5 addresses per user)
Max child: < 100
Approach: Embed children directly in parent document
Reason: Always retrieved together, small size, no independent queries
Dataplexa: User tags array — ['early_adopter', 'newsletter']
────────────────────────────────────────────────────────────
Pattern: One-to-Many (reference)
Example: User → orders (dozens to thousands of orders)
Max child: 100 – 100,000
Approach: Store children in separate collection, reference parent _id
Reason: Children queried independently, unbounded growth risk
Dataplexa: Orders collection — each order has user_id: 'u001'
────────────────────────────────────────────────────────────
Pattern: One-to-Squillions (parent reference)
Example: Server → log events (millions per server per day)
Max child: > 100,000
Approach: Child documents each store a reference to the parent
Reason: An array of child IDs on the parent would hit 16 MB limit
Dataplexa: Event log — each event has server_id field pointing to server
- The boundary between "few" and "many" depends on your data — a product with 50 reviews may be fine to embed, but a popular product with 50,000 reviews must be referenced
- For one-to-squillions always store the parent reference on the child — never store a child ID array on the parent
- The Dataplexa orders collection is a classic one-to-many pattern — each order references its user by user_id rather than embedding all orders inside the user document
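The spectrum above condenses into a rule-of-thumb chooser. The thresholds are the illustrative ones from the table — rules of thumb, not hard MongoDB limits — and any answer should still be sanity-checked against document size:

```python
def relationship_pattern(expected_children: int) -> str:
    """Map an estimated child count to a modelling pattern.
    Thresholds mirror the spectrum table above; they are heuristics."""
    if expected_children < 100:
        return "embed"                        # one-to-few
    if expected_children <= 100_000:
        return "child references parent"      # one-to-many
    return "parent reference on child only"   # one-to-squillions

print(relationship_pattern(5))          # addresses per user
print(relationship_pattern(5_000))      # orders per user
print(relationship_pattern(2_000_000))  # log events per server
```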
Schema Design Patterns
MongoDB has established a library of named schema design patterns — reusable solutions to common modelling problems. The most important ones for application developers are the Attribute Pattern, the Bucket Pattern, and the Computed Pattern.
# Three essential schema design patterns
# ── 1. ATTRIBUTE PATTERN ────────────────────────────────────────────────────
# Problem: A product has many optional specification fields
# (different products have different specs — hard to index them all)
# Solution: Convert sparse fields into a key-value array
# Before — sparse fields, hard to index
before_attribute = {
"_id": "p_laptop",
"name": "Pro Laptop",
"ram_gb": 16,
"cpu_cores": 8,
# "screen_size": only for laptops
"screen_size_inches": 15.6,
# "battery_hours": only for portable devices
"battery_hours": 10,
}
# After — attribute pattern: normalised key-value pairs, easy to index
after_attribute = {
"_id": "p_laptop",
"name": "Pro Laptop",
"specs": [
{"k": "ram_gb", "v": 16, "unit": "GB"},
{"k": "cpu_cores", "v": 8, "unit": "cores"},
{"k": "screen_size_inches", "v": 15.6, "unit": "inches"},
{"k": "battery_hours", "v": 10, "unit": "hours"},
]
}
# One index on specs.k and specs.v covers all attribute queries
# ── 2. BUCKET PATTERN ───────────────────────────────────────────────────────
# Problem: Time-series data (IoT, metrics) — one document per reading
# creates millions of tiny documents, crushing index and storage performance
# Solution: Group N readings into a single bucket document
bucket_doc = {
"_id": "sensor_42_2024-03-01_00",
"sensor_id": "sensor_42",
"date": "2024-03-01",
"hour": 0,
"count": 60, # readings in this bucket
"readings": [ # up to 60 readings per bucket (one per minute)
{"minute": 0, "temp": 21.3, "humidity": 45},
{"minute": 1, "temp": 21.4, "humidity": 44},
# ... up to minute 59
],
"summary": {"min_temp": 21.1, "max_temp": 22.0, "avg_temp": 21.5}
}
# ── 3. COMPUTED PATTERN ─────────────────────────────────────────────────────
# Problem: Expensive aggregation (e.g. average rating) run on every page load
# Solution: Pre-compute and store the result, update on write
computed_product = {
"_id": "p001",
"name": "Wireless Mouse",
"price": 29.99,
# Pre-computed fields updated whenever a new review is inserted
"review_count": 42,
"avg_rating": 4.5,
"rating_dist": {"1": 1, "2": 2, "3": 3, "4": 12, "5": 24}
}
print("Attribute pattern — one index covers all specs:")
print(f" Index: db.products.create_index([('specs.k', 1), ('specs.v', 1)])")
print()
print("Bucket pattern — one document per hour of sensor readings:")
print(f" Bucket _id: {bucket_doc['_id']}")
print(f" Readings per bucket: up to {bucket_doc['count']}")
print()
print("Computed pattern — pre-computed aggregation stored on document:")
print(f" avg_rating: {computed_product['avg_rating']} review_count: {computed_product['review_count']}")
Attribute pattern — one index covers all specs:
 Index: db.products.create_index([('specs.k', 1), ('specs.v', 1)])
Bucket pattern — one document per hour of sensor readings:
Bucket _id: sensor_42_2024-03-01_00
Readings per bucket: up to 60
Computed pattern — pre-computed aggregation stored on document:
avg_rating: 4.5 review_count: 42
- The Attribute Pattern solves the sparse field and wildcard indexing problem — one compound index on specs.k and specs.v covers queries on any attribute
- The Bucket Pattern dramatically reduces document count and index size for time-series data — MongoDB even has a native Time Series collection type that implements this automatically
- The Computed Pattern trades storage for compute — you store a redundant value to avoid recalculating it on every read. Update the computed field in the same write operation that changes the underlying data
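The Computed Pattern's update-on-write step can be sketched with the standard incremental-average identity new_avg = (avg × count + rating) / (count + 1). This is a plain-Python sketch of the bookkeeping; in production the same mutation would go in the write (or transaction) that inserts the review:

```python
def apply_new_review(product: dict, rating: int) -> dict:
    """Incrementally refresh the pre-computed aggregates when a review arrives."""
    count = product["review_count"]
    new_count = count + 1
    # Recompute the average without re-reading every review document
    new_avg = (product["avg_rating"] * count + rating) / new_count
    product["review_count"] = new_count
    product["avg_rating"] = round(new_avg, 2)
    key = str(rating)
    product["rating_dist"][key] = product["rating_dist"].get(key, 0) + 1
    return product

product = {"_id": "p001", "review_count": 42, "avg_rating": 4.5,
           "rating_dist": {"1": 1, "2": 2, "3": 3, "4": 12, "5": 24}}
apply_new_review(product, 5)
print(product["review_count"], product["avg_rating"])
```

Note the trade-off made explicit: a few bytes of redundant storage on the product document replace an aggregation over potentially thousands of review documents on every page load.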
Applying the Dataplexa Schema Decisions
Every schema decision in the Dataplexa Store dataset reflects a deliberate modelling choice. Understanding why those choices were made is the best way to internalise the principles.
# Dataplexa schema decisions — rationale for each choice
schema_decisions = [
{
"entity": "User tags",
"decision": "Embedded array in user document",
"rationale": [
"Small bounded set — max 5-10 tags per user",
"Always retrieved with the user — no independent query needed",
"Written together with user document on profile update",
],
},
{
"entity": "Order line items (items array)",
"decision": "Embedded array of sub-documents in order",
"rationale": [
"Items are always fetched with their order — never queried alone",
"Bounded per order — a realistic max is 50-100 items",
"Writing an order and its items is one atomic operation",
],
},
{
"entity": "Orders themselves",
"decision": "Separate collection, referencing user_id",
"rationale": [
"Unbounded per user — a user can have thousands of orders",
"Orders are frequently queried independently (order history, status)",
"Embedding all orders in the user would risk exceeding 16 MB",
],
},
{
"entity": "Reviews",
"decision": "Separate collection, referencing product_id and user_id",
"rationale": [
"Potentially thousands of reviews per popular product",
"Queried independently — all reviews for a product, all reviews by a user",
"Embedding would create unbounded product document growth",
],
},
]
print("Dataplexa Store — schema decision rationale:\n")
for d in schema_decisions:
    print(f" {d['entity']}")
    print(f"   Decision: {d['decision']}")
    print("   Why:")
    for r in d["rationale"]:
        print(f"     • {r}")
    print()
Dataplexa Store — schema decision rationale:

User tags
Decision: Embedded array in user document
Why:
• Small bounded set — max 5-10 tags per user
• Always retrieved with the user — no independent query needed
• Written together with user document on profile update
Order line items (items array)
Decision: Embedded array of sub-documents in order
Why:
• Items are always fetched with their order — never queried alone
• Bounded per order — a realistic max is 50-100 items
• Writing an order and its items is one atomic operation
Orders themselves
Decision: Separate collection, referencing user_id
Why:
• Unbounded per user — a user can have thousands of orders
• Orders are frequently queried independently
• Embedding all orders in the user would risk exceeding 16 MB
Reviews
Decision: Separate collection, referencing product_id and user_id
Why:
• Potentially thousands of reviews per popular product
• Queried independently — all reviews for a product or by a user
• Embedding would create unbounded product document growth
- Notice how the same principle — embed when bounded and co-retrieved, reference when unbounded or independently queried — explains every decision in the Dataplexa schema
- The items array inside an order is embedded because it is bounded and atomic — the order line items and the order total are written and read as one unit
- Neither pattern is universally better — the right choice always depends on your specific access patterns and data cardinality
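The orders decision above can be sketched with plain dicts: the order carries the user_id reference, the bounded items array stays embedded, and order history becomes a single filter on the orders collection. Field values are illustrative, and the created_at sort field is hypothetical:

```python
# One-to-many by reference: the user document stays small no matter
# how many orders accumulate.
order = {
    "_id": "o001",
    "user_id": "u001",   # reference back to the parent user
    "total": 44.96,
    "items": [           # bounded and written atomically with the order: embed
        {"product_id": "p001", "qty": 1, "price": 29.99},
        {"product_id": "p002", "qty": 3, "price": 4.99},
    ],
}

# Order history is then one indexed query on the child collection:
order_history_filter = {"user_id": "u001"}
# With a live connection (and an index on user_id):
#   db.orders.find(order_history_filter).sort("created_at", -1)

print(len(order["items"]), order["user_id"])
```

For this to stay fast at scale, the orders collection needs an index on user_id — the reference pattern shifts the cost from document growth to index lookups.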
Summary Table
| Concept | Key Principle | Dataplexa Example |
|---|---|---|
| Access-pattern-first design | Model for queries, not for normalisation | Items embedded in order — always fetched together |
| 16 MB document limit | Never embed unbounded arrays | Orders in separate collection, not in user doc |
| Polymorphic pattern | Different shapes in one collection | Electronics vs Stationery vs Furniture products |
| One-to-few | Embed — bounded, co-retrieved | User tags array |
| One-to-many | Reference — separate collection | Orders referencing user_id |
| One-to-squillions | Child stores parent reference | Log events storing server_id |
| Attribute pattern | Sparse fields → k/v array | Product specs with one compound index |
| Bucket pattern | Group time-series into batches | Sensor readings bucketed by hour |
| Computed pattern | Pre-store expensive aggregations | avg_rating stored on product document |
Practice Questions
Practice 1. What is the single most important question to answer before designing a MongoDB schema?
Practice 2. Why should an unbounded array never be embedded directly in a parent document?
Practice 3. What is the Attribute Pattern and when should you use it?
Practice 4. Explain the one-to-squillions pattern and how it differs from one-to-many.
Practice 5. In the Dataplexa Store, why are order line items embedded inside the order document but orders themselves are in a separate collection from users?
Quiz
Quiz 1. What is the hard maximum size of a single MongoDB document?
Quiz 2. What distinguishes the one-to-squillions pattern from one-to-many?
Quiz 3. What is the primary purpose of the Computed Pattern?
Quiz 4. Which MongoDB tool should you use to store files larger than 16 MB?
Quiz 5. Why does the Bucket Pattern improve performance for time-series data compared to one document per reading?
Next up — Embedded vs Referenced: A deep dive into when to embed data and when to reference it, with decision frameworks and real-world trade-offs.