MongoDB Lesson 25 – Schema Design Best Practices | Dataplexa

Schema Design Best Practices

A MongoDB schema that works perfectly today can become a maintenance burden or a performance bottleneck tomorrow if it was not designed with evolution in mind. Schema design best practices cover the habits and techniques that keep your data model healthy as the application grows: enforcing structure with JSON Schema validation, naming fields consistently, avoiding the classic anti-patterns that cause problems at scale, versioning your schema so migrations are safe, and evolving the schema in production without downtime. This lesson codifies those practices using the Dataplexa Store as a running reference.

JSON Schema Validation

MongoDB is schemaless by default — any document can be inserted into any collection with any shape. In production this freedom becomes a liability: a bug in application code can silently insert malformed documents that corrupt your data. JSON Schema validation lets you define rules at the collection level that MongoDB enforces on every insert and update, catching bad data at the database layer before it can cause harm.

# JSON Schema validation — enforcing structure at the collection level

from pymongo import MongoClient
from pymongo.errors import WriteError

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Define a validator for the users collection
user_validator = {
    "$jsonSchema": {
        "bsonType":   "object",
        "required":   ["name", "email", "membership"],
        "properties": {
            "name": {
                "bsonType":    "string",
                "minLength":   2,
                "description": "must be a string of at least 2 characters"
            },
            "email": {
                "bsonType": "string",
                "pattern":  "^[^@]+@[^@]+\\.[^@]+$",
                "description": "must be a valid email address"
            },
            "age": {
                "bsonType": "int",
                "minimum":  0,
                "maximum":  150,
                "description": "must be an integer between 0 and 150"
            },
            "membership": {
                "enum":        ["basic", "premium"],
                "description": "must be 'basic' or 'premium'"
            },
            "tags": {
                "bsonType": "array",
                "items":    {"bsonType": "string"},
                "description": "must be an array of strings"
            }
        }
    }
}

# Apply the validator — validationAction "error" rejects bad writes
# (use "warn" to log violations without rejecting).
# collMod modifies an existing collection; pass the same options to
# create_collection() when creating a new one.
db.command("collMod", "users",
           validator=user_validator,
           validationAction="error",
           validationLevel="moderate")   # "moderate" skips updates to documents that already violate the schema

print("Validator applied to users collection")

# Test 1 — valid document
try:
    db.users.insert_one({
        "_id":        "u_valid",
        "name":       "Test User",
        "email":      "test@example.com",
        "membership": "basic",
        "age":        25
    })
    print("Valid document inserted: OK")
    db.users.delete_one({"_id": "u_valid"})
except WriteError as e:
    print(f"Unexpected error: {e}")

# Test 2 — missing required field (email)
try:
    db.users.insert_one({
        "_id":        "u_bad",
        "name":       "No Email User",
        "membership": "basic"
    })
    print("Should not reach here")
except WriteError as e:
    print(f"Missing email rejected: {type(e).__name__}")

# Test 3 — invalid enum value
try:
    db.users.insert_one({
        "_id":        "u_bad2",
        "name":       "Bad Tier",
        "email":      "tier@example.com",
        "membership": "gold"   # not in enum
    })
    print("Should not reach here")
except WriteError as e:
    print(f"Invalid membership rejected: {type(e).__name__}")
Validator applied to users collection
Valid document inserted: OK
Missing email rejected: WriteError
Invalid membership rejected: WriteError
  • validationAction: "error" rejects the write — "warn" logs the violation but allows it, useful during migrations when you want to discover bad data without breaking the application
  • validationLevel: "moderate" only validates new inserts and updates to existing documents that already pass the schema — "strict" validates all writes including updates to documents that previously violated the schema
  • Start with validationAction: "warn" on an existing collection to discover what percentage of documents would fail before switching to "error"
  • Validation rules are stored in the collection metadata — they persist across server restarts and apply regardless of which driver or tool inserts the document
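Even with server-side validation in place, it can be useful to mirror a few of the same rules client-side so the application can give a friendlier error before the write is attempted. The sketch below is a minimal pure-Python pre-check for the required, enum, and pattern rules from the validator above — it is an illustration, not MongoDB's validation engine, and the server remains the source of truth.

```python
import re

# Hypothetical client-side pre-check mirroring a few $jsonSchema rules.
# The rule table below restates the users validator from this lesson.
USER_RULES = {
    "required": ["name", "email", "membership"],
    "enum":     {"membership": ["basic", "premium"]},
    "pattern":  {"email": r"^[^@]+@[^@]+\.[^@]+$"},
}

def precheck_user(doc: dict) -> list[str]:
    """Return a list of violations; an empty list means the doc looks valid."""
    errors = []
    for field in USER_RULES["required"]:
        if field not in doc:
            errors.append(f"missing required field: {field}")
    for field, allowed in USER_RULES["enum"].items():
        if field in doc and doc[field] not in allowed:
            errors.append(f"{field} must be one of {allowed}")
    for field, pattern in USER_RULES["pattern"].items():
        if field in doc and not re.match(pattern, str(doc[field])):
            errors.append(f"{field} does not match {pattern}")
    return errors

print(precheck_user({"name": "Alice", "email": "a@example.com", "membership": "basic"}))  # []
print(precheck_user({"name": "Bob", "membership": "gold"}))
```

A pre-check like this trims round-trips for obviously bad input, but never replaces the collection validator — other writers (scripts, other services) bypass application code.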

Consistent Field Naming Conventions

MongoDB field names are case-sensitive strings stored in every document. Inconsistent naming — mixing camelCase with snake_case, abbreviating some fields but not others, using different names for the same concept across collections — creates confusion, bugs, and schema debt that compounds over time.

# Field naming conventions — consistency matters

naming_rules = {
    "Use camelCase throughout (MongoDB convention)": [
        "userId, productId, createdAt, updatedAt",
        "firstName, lastName, emailAddress",
        "orderStatus, paymentMethod, shippingAddress",
    ],
    "Use snake_case if your team prefers it — but never mix": [
        "user_id, product_id, created_at, updated_at",
        "first_name, last_name, email_address",
        "order_status, payment_method, shipping_address",
    ],
    "Timestamp field names — be explicit and consistent": [
        "createdAt  — when the document was first created",
        "updatedAt  — when the document was last modified",
        "deletedAt  — soft delete timestamp",
        "expiresAt  — TTL expiry timestamp",
    ],
    "ID field names — always suffix with Id or _id": [
        "_id        — the document's own primary key (always this name)",
        "userId     — reference to users collection",
        "productId  — reference to products collection",
        "orderId    — reference to orders collection",
    ],
    "Boolean fields — use positive names": [
        "isActive   not  inactive",
        "isVerified not  unverified",
        "hasShipped not  notShipped",
    ],
    "Anti-patterns to avoid": [
        "Mixing: userId in one collection, user_id in another",
        "Abbreviating: usr, prod, qty (use full names — space is cheap)",
        "Generic names: data, info, metadata (be specific about what it contains)",
        "Numeric suffixes: field1, field2 (use an array or the Attribute Pattern)",
    ]
}

for rule, examples in naming_rules.items():
    print(f"\n  {rule}:")
    for ex in examples:
        print(f"    {ex}")
Use camelCase throughout (MongoDB convention):
userId, productId, createdAt, updatedAt
firstName, lastName, emailAddress

Timestamp field names — be explicit and consistent:
createdAt — when the document was first created
updatedAt — when the document was last modified
deletedAt — soft delete timestamp
expiresAt — TTL expiry timestamp

ID field names — always suffix with Id or _id:
_id — the document's own primary key
userId — reference to users collection
productId — reference to products collection

Anti-patterns to avoid:
Mixing: userId in one collection, user_id in another
Abbreviating: usr, prod, qty (use full names — space is cheap)
Generic names: data, info, metadata
Numeric suffixes: field1, field2 (use an array instead)
  • Field names are stored in every document — a 30-character field name on 10 million documents costs 300 MB of storage. Keep names descriptive but not verbose
  • The Dataplexa Store uses snake_case for readability in this course — in a real application, pick one convention and enforce it in code review
  • Add createdAt and updatedAt to every collection from day one — retroactively adding timestamps requires a migration across every existing document
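Naming conventions are easiest to enforce mechanically. As a sketch of how a code-review or CI check might detect mixing, the function below classifies field names as camelCase or snake_case with two illustrative regexes (the patterns and thresholds are assumptions, not a standard MongoDB tool):

```python
import re

# Illustrative patterns — adjust to your team's exact convention.
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")
SNAKE = re.compile(r"^[a-z]+(?:_[a-z0-9]+)+$")

def naming_styles(field_names):
    """Classify field names; more than one style in the result means mixing."""
    styles = set()
    for name in field_names:
        if name == "_id":          # the primary key is always _id — exempt
            continue
        if SNAKE.match(name):
            styles.add("snake_case")
        elif CAMEL.match(name):
            styles.add("camelCase")
        # single lowercase words (e.g. "name") fit both styles — ignored
    return styles

fields = ["_id", "userId", "created_at", "name"]
styles = naming_styles(fields)
if len(styles) > 1:
    print(f"Mixed naming detected: {sorted(styles)}")
```

Running a check like this over a sample of documents from each collection turns the "never mix" rule from a review comment into an automated gate.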

Schema Anti-Patterns

MongoDB's schema flexibility can be misused in ways that look harmless at small scale but cause serious performance and maintenance problems in production. These are the most common anti-patterns and how to fix them.

# Schema anti-patterns — what to avoid and why

anti_patterns = [
    {
        "name":    "Massive Arrays",
        "problem": "An array field that grows without bound",
        "example_bad": {
            "_id":     "product_001",
            "name":    "Wireless Mouse",
            # BAD — this array will eventually hit 16 MB
            "all_reviews": ["r001", "r002", "r003", "...10,000 more IDs"]
        },
        "fix":     "Store reviews in a separate collection, reference by product_id",
        "why":     "Unbounded arrays hit the 16 MB document limit and cause document churn on every insert",
    },
    {
        "name":    "Bloated Documents",
        "problem": "Fetching huge documents when only a few fields are needed",
        "example_bad": {
            "_id":           "p001",
            "name":          "Wireless Mouse",
            "full_spec_pdf": "<<< 2 MB base64 encoded PDF >>>",
            "raw_html_desc": "<<< 500 KB of HTML >>>",
        },
        "fix":     "Store binary/large content in GridFS or S3, keep document lean",
        "why":     "Every query that touches this collection loads 2 MB+ per document into the WiredTiger cache",
    },
    {
        "name":    "Separating Data That Is Always Accessed Together",
        "problem": "Splitting a logical entity across two collections unnecessarily",
        "example_bad": {
            "profiles collection": {"_id": "u001", "name": "Alice"},
            "settings collection": {"_id": "u001", "theme": "dark", "language": "en"},
            "note": "Every user page load requires two queries and a join"
        },
        "fix":     "Embed settings inside the user document — they are always needed together",
        "why":     "Unnecessary cross-collection joins add latency and code complexity",
    },
    {
        "name":    "Unnecessary Indexes",
        "problem": "Indexing every field 'just in case'",
        "example_bad": "db.products.create_index([('name',1),('brand',1),('category',1),"
                       "('price',1),('rating',1),('stock',1),('tags',1)])",
        "fix":     "Index only fields that appear in real query filters and sorts",
        "why":     "Every index slows writes and consumes RAM — unused indexes are pure overhead",
    },
    {
        "name":    "Case-Insensitive String Queries Without a Collation Index",
        "problem": "Using $regex with 'i' flag for case-insensitive search at scale",
        "example_bad": 'db.users.find({"name": {"$regex": "alice", "$options": "i"}})',
        "fix":     "Use a collation index (strength: 2) or Atlas Search for production text search",
        "why":     "$regex with 'i' cannot use a standard index efficiently — at best it scans every index key, at worst the whole collection",
    },
]

for ap in anti_patterns:
    print(f"\n  ✗ Anti-pattern: {ap['name']}")
    print(f"    Problem: {ap['problem']}")
    print(f"    Fix:     {ap['fix']}")
    print(f"    Why:     {ap['why']}")
✗ Anti-pattern: Massive Arrays
Problem: An array field that grows without bound
Fix: Store reviews in a separate collection, reference by product_id
Why: Unbounded arrays hit the 16 MB document limit and cause document churn

✗ Anti-pattern: Bloated Documents
Problem: Fetching huge documents when only a few fields are needed
Fix: Store binary/large content in GridFS or S3, keep document lean
Why: Every query loads 2 MB+ per document into the WiredTiger cache

✗ Anti-pattern: Separating Data That Is Always Accessed Together
Problem: Splitting a logical entity across two collections unnecessarily
Fix: Embed settings inside the user document
Why: Unnecessary cross-collection joins add latency and code complexity

✗ Anti-pattern: Unnecessary Indexes
Problem: Indexing every field just in case
Fix: Index only fields that appear in real query filters and sorts
Why: Every index slows writes and consumes RAM

✗ Anti-pattern: Case-Insensitive Regex Without Collation Index
Problem: $regex with i flag for case-insensitive search at scale
Fix: Use a collation index or Atlas Search
Why: $regex with i cannot use an index efficiently — scans every key or the whole collection
  • The massive array and bloated document anti-patterns are the most common causes of MongoDB performance problems in production — they are easy to introduce and expensive to fix later
  • The "separating data that belongs together" anti-pattern is the mirror image of the embedding mistake — too much referencing is just as harmful as too much embedding
  • Audit your indexes quarterly using db.collection.aggregate([{"$indexStats": {}}]) — remove any index with zero accesses
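The fix for the massive-array anti-pattern is mechanical: move each child out of the embedded array into its own document that references the parent. The sketch below shows the refactor in memory on the product/reviews example from above (field names are illustrative); in practice the resulting review documents would go into their own collection via insert_many.

```python
# Sketch: refactoring the "massive array" anti-pattern in memory.
product_bad = {
    "_id": "product_001",
    "name": "Wireless Mouse",
    "all_reviews": ["r001", "r002", "r003"],  # grows without bound
}

def split_reviews(product: dict) -> tuple[dict, list[dict]]:
    """Return a lean product plus one referencing document per review."""
    lean = {k: v for k, v in product.items() if k != "all_reviews"}
    reviews = [
        {"_id": rid, "product_id": product["_id"]}
        for rid in product.get("all_reviews", [])
    ]
    return lean, reviews

lean_product, review_docs = split_reviews(product_bad)
print(lean_product)       # the product no longer carries an unbounded array
print(len(review_docs))   # one referencing document per review → prints 3
```

After the split, reviews are queried by `product_id` (ideally with an index on that field), and inserting a new review no longer rewrites the product document.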

Schema Versioning

When your application evolves, the schema evolves with it. Without a versioning strategy, old documents with the old shape and new documents with the new shape coexist silently — and the application code must handle both. Schema versioning makes this explicit and manageable.

# Schema versioning — tracking document shape across migrations

from pymongo import MongoClient
from datetime import datetime, timezone

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Add a schema_version field to every new document
# Old documents without schema_version are implicitly version 1

# Version 1 — original user shape (what we have now)
v1_user = {
    "_id":        "u_v1",
    "schema_version": 1,
    "name":       "Version One User",
    "email":      "v1@example.com",
    "membership": "basic",
    # No createdAt, no phone, flat address fields
    "city":       "London",
    "country":    "UK",
}

# Version 2 — address moved to sub-document, createdAt added
v2_user = {
    "_id":        "u_v2",
    "schema_version": 2,
    "name":       "Version Two User",
    "email":      "v2@example.com",
    "membership": "premium",
    "address": {
        "city":    "Paris",
        "country": "France",
        "postcode": "75001"
    },
    "createdAt": datetime.now(timezone.utc),
}

db.users.insert_many([v1_user, v2_user])

# Application layer — upgrade on read (lazy migration)
def get_user_normalised(user_id: str) -> dict | None:
    """
    Fetch a user and normalise to the latest schema version.
    Old documents are upgraded in memory — optionally written back.
    """
    user = db.users.find_one({"_id": user_id})
    if not user:
        return None

    version = user.get("schema_version", 1)

    if version == 1:
        # Upgrade v1 → v2: migrate flat address to sub-document
        user["address"] = {
            "city":    user.pop("city",    None),
            "country": user.pop("country", None),
        }
        user["schema_version"] = 2
        # Optionally write back the upgraded document
        # db.users.replace_one({"_id": user_id}, user)

    return user

u1 = get_user_normalised("u_v1")
u2 = get_user_normalised("u_v2")

print(f"v1 user normalised address: {u1['address']}")
print(f"v2 user address:            {u2['address']}")

# Clean up
db.users.delete_many({"_id": {"$in": ["u_v1", "u_v2"]}})
v1 user normalised address: {'city': 'London', 'country': 'UK'}
v2 user address: {'city': 'Paris', 'country': 'France', 'postcode': '75001'}
  • Always add a schema_version field (or schemaVersion) to documents from day one — retrofitting it requires a full collection migration
  • The upgrade-on-read (lazy migration) pattern avoids a big-bang migration — old documents are upgraded when they are next accessed
  • Pair lazy migration with a background job that upgrades all remaining old documents during off-peak hours — ensures 100% of documents eventually reach the latest version
  • Never delete the migration logic for old versions until you are certain no documents of that version remain in the collection
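Once you are past one migration, upgrade-on-read generalises to a registry of step functions chained in order, so any document reaches the latest version in a single pass. The sketch below extends the lesson's v1→v2 example with a hypothetical v2→v3 step (adding a default loyaltyPoints field — an assumed example, not part of the lesson's live schema):

```python
# Sketch: a migration registry that chains upgrade steps.
LATEST_VERSION = 3

def v1_to_v2(doc):
    # v1 → v2: flat address fields become a sub-document
    doc["address"] = {"city": doc.pop("city", None),
                      "country": doc.pop("country", None)}
    return doc

def v2_to_v3(doc):
    # v2 → v3 (hypothetical): add a defaulted loyaltyPoints field
    doc.setdefault("loyaltyPoints", 0)
    return doc

MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}

def upgrade(doc: dict) -> dict:
    """Apply each migration step in order until the doc reaches LATEST_VERSION."""
    version = doc.get("schema_version", 1)   # missing field means v1
    while version < LATEST_VERSION:
        doc = MIGRATIONS[version](doc)
        version += 1
        doc["schema_version"] = version
    return doc

old = {"_id": "u1", "name": "Alice", "city": "London", "country": "UK"}
print(upgrade(old))
```

The registry keeps every historical step in one place, which also makes the "never delete old migration logic prematurely" rule auditable: a step can only be removed once no documents of its source version remain.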

Evolving a Schema Safely in Production

Changing a live schema without downtime requires a disciplined approach. Adding fields is safe. Renaming or removing fields requires a multi-step process. Following the expand-migrate-contract pattern ensures no version of the application ever sees a document shape it cannot handle.

# Expand-Migrate-Contract — safe schema evolution in production

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

migration_steps = {
    "Renaming 'city' to 'address.city' — safe three-phase process": [

        "PHASE 1 — EXPAND (deploy new code that writes both old and new):",
        "  - New code writes BOTH city (old) AND address.city (new)",
        "  - Old code still reads city — no breakage",
        "  - New code reads address.city, falls back to city if missing",
        "",
        "PHASE 2 — MIGRATE (backfill all old documents):",
        "  db.users.update_many(",
        "      {'city': {'$exists': True}, 'address': {'$exists': False}},",
        "      [{'$set': {'address': {'city': '$city', 'country': '$country'}}}]",
        "  )",
        "",
        "PHASE 3 — CONTRACT (deploy code that only uses new field, remove old):",
        "  - All documents now have address.city",
        "  - Deploy code that only reads address.city",
        "  - Run: db.users.update_many({}, {'$unset': {'city': '', 'country': ''}})",
        "  - Remove the fallback read logic from application code",
    ]
}

for scenario, steps in migration_steps.items():
    print(f"{scenario}\n")
    for step in steps:
        print(f"  {step}")

# Adding a new field — always safe, no migration needed
print("\n\nAdding a new field (always safe):")
print("  db.users.update_many(")
print("      {'loyaltyPoints': {'$exists': False}},")
print("      {'$set': {'loyaltyPoints': 0}}")
print("  )")
print("  → Old documents get the default, new documents include it on insert")

# Removing a field — safe if application already ignores it
print("\nRemoving an unused field:")
print("  Step 1: Deploy code that no longer reads the field")
print("  Step 2: Remove the field from all documents:")
print("  db.users.update_many({}, {'$unset': {'legacyField': ''}})")
print("  Step 3: Remove from schema validator if present")
Renaming 'city' to 'address.city' — safe three-phase process:

PHASE 1 — EXPAND (deploy new code that writes both old and new):
- New code writes BOTH city (old) AND address.city (new)
- Old code still reads city — no breakage
- New code reads address.city, falls back to city if missing

PHASE 2 — MIGRATE (backfill all old documents):
db.users.update_many(...)

PHASE 3 — CONTRACT (deploy code that only uses new field, remove old):
- All documents now have address.city
- Deploy code that only reads address.city
- Run: db.users.update_many({}, {'$unset': {'city': '', 'country': ''}})

Adding a new field (always safe):
db.users.update_many(
{'loyaltyPoints': {'$exists': False}},
{'$set': {'loyaltyPoints': 0}}
)

Removing an unused field:
Step 1: Deploy code that no longer reads the field
Step 2: Remove the field from all documents
Step 3: Remove from schema validator if present
  • Adding a field is always safe — old code ignores it, new code uses it, no downtime required
  • Renaming a field requires the expand-migrate-contract pattern — skipping any phase causes a version of your application to see a shape it cannot handle
  • Never run Phase 3 until Phase 2 is fully complete and verified — a partially-migrated collection with Phase 3 code deployed causes silent data loss
  • Test migrations on a copy of production data before running them live — mongodump and mongorestore make this easy
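The three phases can be simulated on in-memory documents to make the core invariant visible: at every phase, every reader sees a shape it can handle. The fallback in the read helper below mirrors the Phase 1 rule ("new code reads address.city, falls back to city if missing"); this is an illustrative sketch, not the pymongo migration itself.

```python
# Sketch: Expand-Migrate-Contract simulated on in-memory documents.
docs = [
    {"_id": "u1", "city": "London", "country": "UK"},   # old shape
]

def read_city(doc):
    """New-code read path: prefer address.city, fall back to flat city."""
    return doc.get("address", {}).get("city", doc.get("city"))

# PHASE 1 — EXPAND: new writes carry BOTH the old and new shape
docs.append({"_id": "u2", "city": "Paris", "country": "France",
             "address": {"city": "Paris", "country": "France"}})

# PHASE 2 — MIGRATE: backfill old documents that lack the new shape
for doc in docs:
    if "address" not in doc:
        doc["address"] = {"city": doc["city"], "country": doc["country"]}

# PHASE 3 — CONTRACT: only after verifying the backfill, drop old fields
assert all("address" in d for d in docs)
for doc in docs:
    doc.pop("city", None)
    doc.pop("country", None)

print([read_city(d) for d in docs])   # ['London', 'Paris']
```

Note the assert guarding Phase 3 — it is the in-memory analogue of verifying the Phase 2 backfill with a count query before running the $unset in production.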

Summary Table

Best Practice            | What It Prevents                         | Key Rule
JSON Schema validation   | Silent insertion of malformed documents  | Start with "warn", switch to "error" when stable
Consistent naming        | Confusion, bugs, schema debt             | Pick one convention — never mix camelCase and snake_case
Avoid massive arrays     | 16 MB limit, document churn              | Reference unbounded child data in a separate collection
Avoid bloated documents  | Cache thrashing, slow reads              | Store binary/large content in GridFS or object storage
Schema versioning        | Silent shape mismatch across migrations  | Add schema_version from day one, use upgrade-on-read
Expand-Migrate-Contract  | Downtime and data loss during renames    | Never skip a phase — verify Phase 2 before Phase 3

Practice Questions

Practice 1. What is the difference between validationAction "error" and "warn" in a MongoDB schema validator?



Practice 2. Why should you add createdAt and updatedAt fields to every collection from day one?



Practice 3. What is the upgrade-on-read (lazy migration) pattern and what is its trade-off?



Practice 4. List the three phases of the Expand-Migrate-Contract pattern for renaming a field.



Practice 5. Why does using $regex with the "i" flag for case-insensitive search cause performance problems at scale?



Quiz

Quiz 1. Which validationLevel setting validates only new inserts and updates to documents that already comply with the schema?






Quiz 2. What MongoDB aggregation stage lets you audit which indexes are actually being used in production?






Quiz 3. In the Expand-Migrate-Contract pattern, what happens if you deploy Phase 3 before Phase 2 is complete?






Quiz 4. Which of the following field naming practices should be avoided?






Quiz 5. What is the primary risk of storing large binary data (e.g. a 2 MB PDF) directly inside a MongoDB document?






Next up — Indexes: Understanding how indexes work, the types available, and how to build an indexing strategy that makes every query fast.