MongoDB
Schema Design Best Practices
A MongoDB schema that works perfectly today can become a maintenance burden or a performance bottleneck tomorrow if it was not designed with long-term evolution in mind. Schema design best practices cover the habits and techniques that keep your data model healthy as the application evolves: enforcing structure with JSON Schema validation, naming fields consistently, avoiding the classic anti-patterns that cause problems at scale, versioning your schema so migrations are safe, and evolving the schema in production without downtime. This lesson codifies those practices using the Dataplexa Store as a running reference.
JSON Schema Validation
MongoDB is schemaless by default — any document can be inserted into any collection with any shape. In production this freedom becomes a liability: a bug in application code can silently insert malformed documents that corrupt your data. JSON Schema validation lets you define rules at the collection level that MongoDB enforces on every insert and update, catching bad data at the database layer before it can cause harm.
# JSON Schema validation — enforcing structure at the collection level
from pymongo import MongoClient
from pymongo.errors import WriteError
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Define a validator for the users collection
user_validator = {
"$jsonSchema": {
"bsonType": "object",
"required": ["name", "email", "membership"],
"properties": {
"name": {
"bsonType": "string",
"minLength": 2,
"description": "must be a string of at least 2 characters"
},
"email": {
"bsonType": "string",
"pattern": "^[^@]+@[^@]+\\.[^@]+$",
"description": "must be a valid email address"
},
"age": {
"bsonType": "int",
"minimum": 0,
"maximum": 150,
"description": "must be an integer between 0 and 150"
},
"membership": {
"enum": ["basic", "premium"],
"description": "must be 'basic' or 'premium'"
},
"tags": {
"bsonType": "array",
"items": {"bsonType": "string"},
"description": "must be an array of strings"
}
}
}
}
# Apply the validator — validationAction "error" rejects bad writes
# (use "warn" to log violations without rejecting)
db.command("collMod", "users",
validator=user_validator,
validationAction="error",
validationLevel="moderate")  # "moderate" skips validation for updates to existing documents that already violate the schema
print("Validator applied to users collection")
# Test 1 — valid document
try:
db.users.insert_one({
"_id": "u_valid",
"name": "Test User",
"email": "test@example.com",
"membership": "basic",
"age": 25
})
print("Valid document inserted: OK")
db.users.delete_one({"_id": "u_valid"})
except WriteError as e:
print(f"Unexpected error: {e}")
# Test 2 — missing required field (email)
try:
db.users.insert_one({
"_id": "u_bad",
"name": "No Email User",
"membership": "basic"
})
print("Should not reach here")
except WriteError as e:
print(f"Missing email rejected: {type(e).__name__}")
# Test 3 — invalid enum value
try:
db.users.insert_one({
"_id": "u_bad2",
"name": "Bad Tier",
"email": "tier@example.com",
"membership": "gold" # not in enum
})
print("Should not reach here")
except WriteError as e:
print(f"Invalid membership rejected: {type(e).__name__}")

# Output:
# Valid document inserted: OK
# Missing email rejected: WriteError
# Invalid membership rejected: WriteError
- validationAction: "error" rejects the write — "warn" logs the violation but allows it, useful during migrations when you want to discover bad data without breaking the application
- validationLevel: "moderate" only validates new inserts and updates to existing documents that already pass the schema — "strict" validates all writes, including updates to documents that previously violated the schema
- Start with validationAction: "warn" on an existing collection to discover what percentage of documents would fail before switching to "error"
- Validation rules are stored in the collection metadata — they persist across server restarts and apply regardless of which driver or tool inserts the document
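Before switching an existing collection from "warn" to "error", it helps to know how many documents would fail. Since $jsonSchema is also a query operator, the validator itself can be used as a filter. A minimal sketch below builds the audit query; the actual database calls are shown commented because they need a live connection, and the collection and field names follow the example above:

```python
# $jsonSchema doubles as a query operator, so the same document passed to
# collMod can be wrapped in $nor to match documents that FAIL validation.
audit_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "email", "membership"],
    }
}

# Documents matching this query violate the schema:
non_conforming = {"$nor": [audit_validator]}

# With a live connection you would run:
# bad_count = db.users.count_documents(non_conforming)
# total = db.users.estimated_document_count()
# print(f"{bad_count}/{total} documents would fail under validationAction='error'")
print(non_conforming["$nor"][0]["$jsonSchema"]["required"])  # ['name', 'email', 'membership']
```

Running this audit first tells you how much cleanup Phase-style migration work is needed before "error" can be enabled safely.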
Consistent Field Naming Conventions
MongoDB field names are case-sensitive strings stored in every document. Inconsistent naming — mixing camelCase with snake_case, abbreviating some fields but not others, using different names for the same concept across collections — creates confusion, bugs, and schema debt that compounds over time.
# Field naming conventions — consistency matters
naming_rules = {
"Use camelCase throughout (MongoDB convention)": [
"userId, productId, createdAt, updatedAt",
"firstName, lastName, emailAddress",
"orderStatus, paymentMethod, shippingAddress",
],
"Use snake_case if your team prefers it — but never mix": [
"user_id, product_id, created_at, updated_at",
"first_name, last_name, email_address",
"order_status, payment_method, shipping_address",
],
"Timestamp field names — be explicit and consistent": [
"createdAt — when the document was first created",
"updatedAt — when the document was last modified",
"deletedAt — soft delete timestamp",
"expiresAt — TTL expiry timestamp",
],
"ID field names — always suffix with Id or _id": [
"_id — the document's own primary key (always this name)",
"userId — reference to users collection",
"productId — reference to products collection",
"orderId — reference to orders collection",
],
"Boolean fields — use positive names": [
"isActive not inactive",
"isVerified not unverified",
"hasShipped not notShipped",
],
"Anti-patterns to avoid": [
"Mixing: userId in one collection, user_id in another",
"Abbreviating: usr, prod, qty (use full names — space is cheap)",
"Generic names: data, info, metadata (be specific about what it contains)",
"Numeric suffixes: field1, field2 (use an array or the Attribute Pattern)",
]
}
for rule, examples in naming_rules.items():
print(f"\n {rule}:")
for ex in examples:
print(f"   {ex}")
- Field names are stored in every document — a 30-character field name on 10 million documents costs 300 MB of storage. Keep names descriptive but not verbose
- The Dataplexa Store uses snake_case for readability in this course — in a real application, pick one convention and enforce it in code review
- Add createdAt and updatedAt to every collection from day one — retroactively adding timestamps requires a migration across every existing document
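One way to make the createdAt/updatedAt advice stick is to stamp timestamps in a single helper rather than at every call site. A minimal sketch follows; with_timestamps is a hypothetical helper, not a pymongo API:

```python
from datetime import datetime, timezone

def with_timestamps(doc: dict) -> dict:
    """Return a copy of doc with createdAt (set once) and updatedAt (always refreshed)."""
    now = datetime.now(timezone.utc)
    stamped = dict(doc)
    stamped.setdefault("createdAt", now)  # preserved if the document already has one
    stamped["updatedAt"] = now            # refreshed on every write
    return stamped

user = with_timestamps({"name": "Alice", "membership": "basic"})
print(sorted(user))  # ['createdAt', 'membership', 'name', 'updatedAt']

# Usage with a collection would then be:
# db.users.insert_one(with_timestamps(new_user))
```

Funnelling every insert through one helper means the convention cannot drift between call sites.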
Schema Anti-Patterns
MongoDB's schema flexibility can be misused in ways that look harmless at small scale but cause serious performance and maintenance problems in production. These are the most common anti-patterns and how to fix them.
# Schema anti-patterns — what to avoid and why
anti_patterns = [
{
"name": "Massive Arrays",
"problem": "An array field that grows without bound",
"example_bad": {
"_id": "product_001",
"name": "Wireless Mouse",
# BAD — this array will eventually hit 16 MB
"all_reviews": ["r001", "r002", "r003", "...10,000 more IDs"]
},
"fix": "Store reviews in a separate collection, reference by product_id",
"why": "Unbounded arrays hit the 16 MB document limit and cause document churn on every insert",
},
{
"name": "Bloated Documents",
"problem": "Fetching huge documents when only a few fields are needed",
"example_bad": {
"_id": "p001",
"name": "Wireless Mouse",
"full_spec_pdf": "<<< 2 MB base64 encoded PDF >>>",
"raw_html_desc": "<<< 500 KB of HTML >>>",
},
"fix": "Store binary/large content in GridFS or S3, keep document lean",
"why": "Every query that touches this collection loads 2 MB+ per document into the WiredTiger cache",
},
{
"name": "Separating Data That Is Always Accessed Together",
"problem": "Splitting a logical entity across two collections unnecessarily",
"example_bad": {
"profiles collection": {"_id": "u001", "name": "Alice"},
"settings collection": {"_id": "u001", "theme": "dark", "language": "en"},
"note": "Every user page load requires two queries and a join"
},
"fix": "Embed settings inside the user document — they are always needed together",
"why": "Unnecessary cross-collection joins add latency and code complexity",
},
{
"name": "Unnecessary Indexes",
"problem": "Indexing every field 'just in case'",
"example_bad": "db.products.create_index([('name',1),('brand',1),('category',1),"
"('price',1),('rating',1),('stock',1),('tags',1)])",
"fix": "Index only fields that appear in real query filters and sorts",
"why": "Every index slows writes and consumes RAM — unused indexes are pure overhead",
},
{
"name": "Case-Insensitive String Queries Without a Collation Index",
"problem": "Using $regex with 'i' flag for case-insensitive search at scale",
"example_bad": 'db.users.find({"name": {"$regex": "alice", "$options": "i"}})',
"fix": "Use a collation index (strength: 2) or Atlas Search for production text search",
"why": "$regex with 'i' cannot use index prefixes — at best it scans every index key, at worst every document",
},
]
for ap in anti_patterns:
print(f"\n ✗ Anti-pattern: {ap['name']}")
print(f" Problem: {ap['problem']}")
print(f" Fix: {ap['fix']}")
print(f"   Why: {ap['why']}")
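The collation fix for the last anti-pattern can be made concrete. A case-insensitive index uses collation strength 2 (ignores case, keeps accents), and a query only uses that index when it passes the identical collation. A minimal sketch; the create_index and find calls are commented because they need a live connection, and pymongo also accepts a Collation object in place of the plain dict:

```python
# strength=2 compares base characters and accents but ignores case
ci_collation = {"locale": "en", "strength": 2}

# Build the index once:
# db.users.create_index([("name", 1)], collation=ci_collation)

# Query with the SAME collation so the planner can use the index:
# db.users.find({"name": "alice"}).collation(ci_collation)
print(ci_collation)
```

Unlike the $regex approach, an equality match under this collation is a normal index seek.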
- The massive array and bloated document anti-patterns are the most common causes of MongoDB performance problems in production — they are easy to introduce and expensive to fix later
- The "separating data that belongs together" anti-pattern is the mirror image of the embedding mistake — too much referencing is just as harmful as too much embedding
- Audit your indexes quarterly using db.collection.aggregate([{"$indexStats": {}}]) — remove any index with zero accesses
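The quarterly index audit can be mechanised. $indexStats emits one document per index with an accesses.ops counter, so a small filter over that output flags removal candidates. A sketch: unused_indexes is a hypothetical helper, and sample_stats mimics the shape of real $indexStats documents rather than coming from a live server:

```python
def unused_indexes(index_stats, keep=("_id_",)):
    """Return names of indexes with zero recorded accesses since server start."""
    return [
        s["name"]
        for s in index_stats
        if s["name"] not in keep and s.get("accesses", {}).get("ops", 0) == 0
    ]

# With a live connection:
# stats = list(db.products.aggregate([{"$indexStats": {}}]))
sample_stats = [
    {"name": "_id_", "accesses": {"ops": 120_000}},
    {"name": "category_1_price_1", "accesses": {"ops": 4_521}},
    {"name": "tags_1", "accesses": {"ops": 0}},  # never used: drop candidate
]
print(unused_indexes(sample_stats))  # ['tags_1']
```

Note that the ops counter resets when the server restarts, so audit over a long uptime window before dropping anything.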
Schema Versioning
When your application evolves, the schema evolves with it. Without a versioning strategy, old documents with the old shape and new documents with the new shape coexist silently — and the application code must handle both. Schema versioning makes this explicit and manageable.
# Schema versioning — tracking document shape across migrations
from pymongo import MongoClient
from datetime import datetime, timezone
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Add a schema_version field to every new document
# Old documents without schema_version are implicitly version 1
# Version 1 — original user shape (what we have now)
v1_user = {
"_id": "u_v1",
"schema_version": 1,
"name": "Version One User",
"email": "v1@example.com",
"membership": "basic",
# No createdAt, no phone, flat address fields
"city": "London",
"country": "UK",
}
# Version 2 — address moved to sub-document, createdAt added
v2_user = {
"_id": "u_v2",
"schema_version": 2,
"name": "Version Two User",
"email": "v2@example.com",
"membership": "premium",
"address": {
"city": "Paris",
"country": "France",
"postcode": "75001"
},
"createdAt": datetime.now(timezone.utc),
}
db.users.insert_many([v1_user, v2_user])
# Application layer — upgrade on read (lazy migration)
def get_user_normalised(user_id: str) -> dict | None:
"""
Fetch a user and normalise to the latest schema version.
Old documents are upgraded in memory — optionally written back.
"""
user = db.users.find_one({"_id": user_id})
if not user:
return None
version = user.get("schema_version", 1)
if version == 1:
# Upgrade v1 → v2: migrate flat address to sub-document
user["address"] = {
"city": user.pop("city", None),
"country": user.pop("country", None),
}
user["schema_version"] = 2
# Optionally write back the upgraded document
# db.users.replace_one({"_id": user_id}, user)
return user
u1 = get_user_normalised("u_v1")
u2 = get_user_normalised("u_v2")
print(f"v1 user normalised address: {u1['address']}")
print(f"v2 user address: {u2['address']}")
# Clean up
db.users.delete_many({"_id": {"$in": ["u_v1", "u_v2"]}})

# Output:
# v1 user normalised address: {'city': 'London', 'country': 'UK'}
# v2 user address: {'city': 'Paris', 'country': 'France', 'postcode': '75001'}
- Always add a schema_version field (or schemaVersion) to documents from day one — retrofitting it requires a full collection migration
- The upgrade-on-read (lazy migration) pattern avoids a big-bang migration — old documents are upgraded when they are next accessed
- Pair lazy migration with a background job that upgrades all remaining old documents during off-peak hours — this ensures 100% of documents eventually reach the latest version
- Never delete the migration logic for old versions until you are certain no documents of that version remain in the collection
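The off-peak background job mentioned above can be a single update_many using an aggregation-pipeline update (available since MongoDB 4.2). A sketch reusing the v1-to-v2 shape from the example; only the filter and pipeline are built here, and the actual db call is commented out:

```python
# Select documents still on the v1 shape (schema_version missing means implicit v1)
v1_filter = {
    "$or": [
        {"schema_version": {"$exists": False}},
        {"schema_version": 1},
    ]
}

# Pipeline-style update: build the sub-document, bump the version, drop the flat fields
v1_to_v2 = [
    {"$set": {
        "address": {"city": "$city", "country": "$country"},
        "schema_version": 2,
    }},
    {"$unset": ["city", "country"]},
]

# Run during off-peak hours:
# result = db.users.update_many(v1_filter, v1_to_v2)
# print(f"Upgraded {result.modified_count} documents to v2")
```

Because the filter excludes already-upgraded documents, the job is safe to re-run until it reports zero modifications.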
Evolving a Schema Safely in Production
Changing a live schema without downtime requires a disciplined approach. Adding fields is safe. Renaming or removing fields requires a multi-step process. Following the expand-migrate-contract pattern ensures no version of the application ever sees a document shape it cannot handle.
# Expand-Migrate-Contract — safe schema evolution in production
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
migration_steps = {
"Renaming 'city' to 'address.city' — safe three-phase process": [
"PHASE 1 — EXPAND (deploy new code that writes both old and new):",
" - New code writes BOTH city (old) AND address.city (new)",
" - Old code still reads city — no breakage",
" - New code reads address.city, falls back to city if missing",
"",
"PHASE 2 — MIGRATE (backfill all old documents):",
" db.users.update_many(",
" {'city': {'$exists': True}, 'address': {'$exists': False}},",
" [{'$set': {'address': {'city': '$city', 'country': '$country'}}}]",
" )",
"",
"PHASE 3 — CONTRACT (deploy code that only uses new field, remove old):",
" - All documents now have address.city",
" - Deploy code that only reads address.city",
" - Run: db.users.update_many({}, {'$unset': {'city': '', 'country': ''}})",
" - Remove the fallback read logic from application code",
]
}
for scenario, steps in migration_steps.items():
print(f"{scenario}\n")
for step in steps:
print(f" {step}")
# Adding a new field — always safe, no migration needed
print("\n\nAdding a new field (always safe):")
print(" db.users.update_many(")
print(" {'loyaltyPoints': {'$exists': False}},")
print(" {'$set': {'loyaltyPoints': 0}}")
print(" )")
print(" → Old documents get the default, new documents include it on insert")
# Removing a field — safe if application already ignores it
print("\nRemoving an unused field:")
print(" Step 1: Deploy code that no longer reads the field")
print(" Step 2: Remove the field from all documents:")
print(" db.users.update_many({}, {'$unset': {'legacyField': ''}})")
print("  Step 3: Remove from schema validator if present")
- Adding a field is always safe — old code ignores it, new code uses it, no downtime required
- Renaming a field requires the expand-migrate-contract pattern — skipping any phase causes a version of your application to see a shape it cannot handle
- Never run Phase 3 until Phase 2 is fully complete and verified — a partially-migrated collection with Phase 3 code deployed causes silent data loss
- Test migrations on a copy of production data before running them live — mongodump and mongorestore make this easy
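The Phase 1 read fallback described above can live in a single helper, which makes the Phase 3 cleanup trivial: delete the function. A sketch with plain dicts standing in for fetched documents; read_city is a hypothetical helper, not a driver API:

```python
def read_city(user: dict):
    """Expand-phase read: prefer the new nested field, fall back to the old flat one."""
    return (user.get("address") or {}).get("city", user.get("city"))

old_shape = {"_id": "u1", "city": "London"}              # pre-migration document
new_shape = {"_id": "u2", "address": {"city": "Paris"}}  # post-migration document

print(read_city(old_shape), read_city(new_shape))  # London Paris
```

During Phase 2 both shapes coexist in the collection, and this one function is the only place in the codebase that has to know that.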
Summary Table
| Best Practice | What It Prevents | Key Rule |
|---|---|---|
| JSON Schema validation | Silent insertion of malformed documents | Start with warn, switch to error when stable |
| Consistent naming | Confusion, bugs, schema debt | Pick one convention — never mix camelCase and snake_case |
| Avoid massive arrays | 16 MB limit, document churn | Reference unbounded child data in a separate collection |
| Avoid bloated documents | Cache thrashing, slow reads | Store binary/large content in GridFS or object storage |
| Schema versioning | Silent shape mismatch across migrations | Add schema_version from day one, use upgrade-on-read |
| Expand-Migrate-Contract | Downtime and data loss during renames | Never skip a phase — verify Phase 2 before Phase 3 |
Practice Questions
Practice 1. What is the difference between validationAction "error" and "warn" in a MongoDB schema validator?
Practice 2. Why should you add createdAt and updatedAt fields to every collection from day one?
Practice 3. What is the upgrade-on-read (lazy migration) pattern and what is its trade-off?
Practice 4. List the three phases of the Expand-Migrate-Contract pattern for renaming a field.
Practice 5. Why does using $regex with the "i" flag for case-insensitive search cause performance problems at scale?
Quiz
Quiz 1. Which validationLevel setting validates only new inserts and updates to documents that already comply with the schema?
Quiz 2. What MongoDB aggregation stage lets you audit which indexes are actually being used in production?
Quiz 3. In the Expand-Migrate-Contract pattern, what happens if you deploy Phase 3 before Phase 2 is complete?
Quiz 4. Which of the following field naming practices should be avoided?
Quiz 5. What is the primary risk of storing large binary data (e.g. a 2 MB PDF) directly inside a MongoDB document?
Next up — Indexes: Understanding how indexes work, the types available, and how to build an indexing strategy that makes every query fast.