MongoDB
Databases, Collections & Documents
MongoDB organises all data in a three-level hierarchy: databases at the top, collections in the middle, and documents at the bottom. Every piece of data you store, query, update, or delete lives inside a document, inside a collection, inside a database. Understanding this hierarchy — and the rules and behaviours at each level — is the foundation for everything else in this course.
This lesson uses the Dataplexa Store dataset to make every concept concrete and immediately applicable.
The Three-Level Hierarchy
Think of the hierarchy this way: a database is like a filing cabinet, a collection is a drawer inside that cabinet, and a document is an individual folder inside the drawer. Each level has its own rules, commands, and behaviours.
# The MongoDB data hierarchy — visualised
hierarchy = {
    "MongoDB Server (mongod)": {
        "dataplexa (database)": {
            "users (collection)": [
                '{ "_id": "u001", "name": "Alice Johnson", "membership": "premium" }',
                '{ "_id": "u002", "name": "Bob Smith", "membership": "basic" }',
            ],
            "products (collection)": [
                '{ "_id": "p001", "name": "Wireless Mouse", "price": 29.99 }',
                '{ "_id": "p002", "name": "Standing Desk", "price": 349.99 }',
            ],
            "orders (collection)": [
                '{ "_id": "o001", "user_id": "u001", "total": 44.96 }',
            ],
        },
        "admin (database)": "internal MongoDB system database",
        "local (database)": "internal: stores oplog for replication",
    }
}
print("One server → many databases → many collections → many documents")
print()
print("dataplexa database contains:")
for collection in ["users", "products", "orders", "reviews"]:
    print(f" └── {collection} (collection)")
Output:
One server → many databases → many collections → many documents

dataplexa database contains:
 └── users (collection)
 └── products (collection)
 └── orders (collection)
 └── reviews (collection)
- One MongoDB server can host many databases; each is isolated, with its own collections and access controls
- admin and local are reserved system databases; never use them for application data
- The config database is also reserved; it is used internally by sharded clusters
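As a small illustration of the point about reserved databases, the sketch below separates application databases from the three system ones. The helper name application_databases and the sample list are made up for this example; in a real script the list would come from client.list_database_names().

```python
# Names MongoDB reserves for internal use (see the notes above).
SYSTEM_DATABASES = {"admin", "local", "config"}


def application_databases(all_names):
    """Return only the database names that are safe for application data."""
    return [name for name in all_names if name not in SYSTEM_DATABASES]


# Hypothetical sample; a live server would supply this list via
# client.list_database_names().
sample_names = ["admin", "config", "dataplexa", "local"]
print(application_databases(sample_names))
```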
Databases
A database is the top-level namespace in MongoDB. It groups related collections together and provides isolation between applications. A single MongoDB server typically hosts one database per application — for example, one database for your e-commerce app and a separate one for your analytics pipeline.
Why it matters: databases are the primary boundary for access control. Roles are granted per database, and custom roles can narrow privileges further to individual collections. Databases also let you run multiple applications on the same MongoDB server without data bleeding between them.
# Working with databases in mongosh and PyMongo
# ── mongosh ─────────────────────────────────────────────
# show dbs # list all databases (only shows non-empty ones)
# use dataplexa # switch to the dataplexa database
# db # print current database name
# db.dropDatabase() # delete the current database entirely
# ── PyMongo ─────────────────────────────────────────────
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
# List all databases
all_dbs = client.list_database_names()
print("All databases:", all_dbs)
# Access a specific database — creates it on first write
db = client["dataplexa"]
print("Current database:", db.name)
# List collections in the database
collections = db.list_collection_names()
print("Collections:", collections)
Output:
Current database: dataplexa
Collections: ['users', 'products', 'orders', 'reviews']
- In MongoDB a database is created lazily: it does not appear until you insert the first document into it
- show dbs in mongosh only lists databases that contain at least one document
- In PyMongo, client["dbname"] and client.dbname are equivalent; both access the same database object
- Dropping a database removes all its collections and documents, so use it with extreme care in production
Collections
A collection is a group of MongoDB documents — the equivalent of a table in a relational database, but without a fixed schema. Documents inside a collection can have completely different fields. Collections are schema-free by default, though you can enforce structure using JSON Schema validation when needed.
Why it matters: collections are the main unit of organisation for your data. Choosing how to group documents into collections — and how many collections to create — is a core data modelling decision that affects query performance, access control, and maintainability.
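To make the remark about JSON Schema validation concrete, here is one way a validator for the users collection might look. The required fields and the allowed membership values are assumptions drawn from the sample documents in this lesson, not rules the Dataplexa dataset actually defines. The dictionary is plain Python; it only takes effect once it is handed to the server, for example through the validator argument of PyMongo's create_collection.

```python
# A $jsonSchema validator expressed as an ordinary Python dictionary.
# Field names and allowed values are assumptions based on the sample
# user documents shown earlier in this lesson.
users_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "membership"],
        "properties": {
            "name": {"bsonType": "string"},
            "membership": {"enum": ["basic", "premium"]},
        },
    }
}

# With a live connection the validator could be attached at creation time:
#   db.create_collection("users", validator=users_validator)
print(users_validator["$jsonSchema"]["required"])
```

Documents that break the schema would then be rejected on insert, while collections without a validator stay schema-free.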
# Working with collections in PyMongo
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Access a collection — created automatically on first write
users = db["users"]
products = db["products"]
# Or using attribute-style access
orders = db.orders
reviews = db.reviews
# Collection information
print("users collection — document count:", db.users.count_documents({}))
print("products collection — document count:", db.products.count_documents({}))
# Create a collection explicitly (optional — rarely needed)
# db.create_collection("logs")
# Create a capped collection: fixed size, oldest docs auto-deleted
# size = max bytes, max = max number of documents
# create_collection raises CollectionInvalid if the name already exists,
# so check first to keep the script re-runnable
if "activity_log" not in db.list_collection_names():
    db.create_collection("activity_log", capped=True, size=1048576, max=1000)
print("\nCollections in dataplexa:", db.list_collection_names())
Output:
products collection — document count: 7

Collections in dataplexa: ['users', 'products', 'orders', 'reviews', 'activity_log']
- Collections are created implicitly on the first insert — you rarely need to create them explicitly
- Capped collections maintain insertion order and automatically overwrite oldest documents when full — ideal for logs and activity feeds
- Collection names are case-sensitive: Users and users are different collections
- Avoid dots and dollar signs in collection names; they have special meaning in MongoDB
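The overwrite-oldest behaviour of a capped collection can be pictured with Python's collections.deque. This is only an analogy that runs without a server: a real capped collection is also bounded by its size in bytes, which the sketch ignores, and the event names are invented.

```python
from collections import deque

# A deque with maxlen behaves like a capped collection limited to
# max=3 documents: appending to a full deque discards the oldest entry.
activity_log = deque(maxlen=3)

for event in ["login", "view_product", "add_to_cart", "checkout"]:
    activity_log.append({"event": event})

# Insertion order is preserved and the first event has been pushed out.
remaining = [entry["event"] for entry in activity_log]
print(remaining)
```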
Documents
A document is the fundamental unit of data in MongoDB. It is stored internally as BSON but represented as JSON-like objects in your code. A document can hold any combination of fields — strings, numbers, booleans, arrays, nested documents, dates, and more — with no requirement to match other documents in the same collection.
Why it matters: the document model is what makes MongoDB developer-friendly. Objects in your code map directly to documents — there is no translation layer, no ORM required, no painful schema migrations when you add a new field.
# A real document from the Dataplexa Store — anatomy breakdown
order_document = {
"_id": "o001", # unique identifier — required, auto-generated if omitted
"user_id": "u001", # reference to a user document (by convention)
"status": "delivered", # string field
"date": "2024-01-15", # date stored as string here (BSON Date recommended)
"total": 44.96, # number field (float)
"items": [ # array field — holds sub-documents
{
"product_id": "p001",
"qty": 1,
"price": 29.99
},
{
"product_id": "p003",
"qty": 3,
"price": 4.99
}
]
}
# Accessing fields
print("Order ID: ", order_document["_id"])
print("Status: ", order_document["status"])
print("Total: ", order_document["total"])
print("Item count: ", len(order_document["items"]))
print("First item: ", order_document["items"][0]["product_id"])
Output:
Order ID: o001
Status: delivered
Total: 44.96
Item count: 2
First item: p001
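The code above reaches nested values with Python indexing. In a query, MongoDB addresses the same values with a dotted path such as items.product_id. A rough way to picture this is to walk the document one key at a time, fanning out whenever an array is reached. The resolve_path function below is a hypothetical pure-Python imitation written for this lesson; the real matching happens inside the MongoDB server and covers more cases, such as numeric array indexes.

```python
def resolve_path(document, path):
    """Collect every value reachable at a dotted path, descending into arrays."""
    current = [document]
    for key in path.split("."):
        next_level = []
        for value in current:
            # An array fans out: each element is examined in turn.
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    return current


order = {
    "_id": "o001",
    "items": [
        {"product_id": "p001", "qty": 1, "price": 29.99},
        {"product_id": "p003", "qty": 3, "price": 4.99},
    ],
}

print(resolve_path(order, "items.product_id"))
```

With a live connection the same path would go straight into a filter, for example db.orders.find({"items.product_id": "p001"}).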
- Every document must have an _id field; MongoDB generates a unique ObjectId automatically if you do not provide one
- The maximum document size is 16 MB; for larger data (files, images) use GridFS
- Arrays inside documents can hold any mix of types: strings, numbers, sub-documents, even other arrays
- Nested sub-documents can be queried directly using dot notation: "items.product_id"
The _id Field and ObjectId
The _id field is the primary key of every MongoDB document. If you do not provide one, MongoDB automatically generates a 12-byte ObjectId: a value designed to be unique across servers without coordination, embedded with a creation timestamp, and roughly sortable by creation time (the timestamp has one-second resolution).
# _id field and ObjectId — how they work
from bson import ObjectId
from datetime import datetime
# MongoDB auto-generates an ObjectId when _id is omitted
auto_id = ObjectId()
print("Generated ObjectId: ", auto_id)
print("String representation:", str(auto_id))
print("Creation timestamp: ", auto_id.generation_time)
print("Is it unique? Yes — globally, without coordination")
# ObjectId structure — 12 bytes total
print("\nObjectId breakdown:")
print(" Bytes 0-3: Unix timestamp (seconds since epoch)")
print(" Bytes 4-8: Random value (unique per machine/process)")
print(" Bytes 9-11: Incrementing counter (unique per second)")
# You can use any value as _id — not just ObjectId
custom_ids = [
{"_id": "u001", "name": "Alice"}, # string _id
{"_id": 42, "name": "Bob"}, # integer _id
{"_id": ObjectId(), "name": "Clara"}, # auto ObjectId
{"_id": "prod-electronics-0001", "name": "Mouse"}, # meaningful string
]
for doc in custom_ids:
    print(f" _id: {str(doc['_id']):35} name: {doc['name']}")
Output:
Generated ObjectId: 64a1f2e3b4c5d6e7f8a9b0c1
String representation: 64a1f2e3b4c5d6e7f8a9b0c1
Creation timestamp: 2023-07-02 21:57:55+00:00
Is it unique? Yes — globally, without coordination
ObjectId breakdown:
Bytes 0-3: Unix timestamp (seconds since epoch)
Bytes 4-8: Random value (unique per machine/process)
Bytes 9-11: Incrementing counter (unique per second)
_id: u001 name: Alice
_id: 42 name: Bob
_id: 64a1f2e3b4c5d6e7f8a9b0c1 name: Clara
_id: prod-electronics-0001 name: Mouse
- You can use any unique value as _id: string, integer, ObjectId, or even a sub-document
- In the Dataplexa dataset we use simple string IDs like "u001", "p001", which are easy to read and reference
- ObjectIds are sortable by creation time: db.collection.find().sort({"_id": 1}) returns documents oldest-first
- Duplicate _id values in the same collection throw a DuplicateKeyError, because _id always has a unique index
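The 12-byte layout described in this section can be reproduced with the standard library alone. The sketch assembles a value in the same 4 + 5 + 3 arrangement and then reads the timestamp back out of the first four bytes. make_object_id_like and creation_time are hypothetical helpers for illustration, not the driver's implementation; in real code use bson.ObjectId and its generation_time attribute.

```python
import os
import struct
from datetime import datetime, timezone


def make_object_id_like(timestamp, counter):
    """Build 12 bytes: 4-byte timestamp + 5 random bytes + 3-byte counter."""
    time_part = struct.pack(">I", timestamp)    # big-endian seconds since epoch
    random_part = os.urandom(5)                 # stand-in for the per-process value
    counter_part = counter.to_bytes(3, "big")   # 3-byte counter
    return time_part + random_part + counter_part


def creation_time(raw_id):
    """Decode the leading 4 bytes back into a UTC datetime."""
    (seconds,) = struct.unpack(">I", raw_id[:4])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


moment = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
raw = make_object_id_like(int(moment.timestamp()), counter=1)

print(raw.hex())            # 24 hex characters, like an ObjectId string
print(creation_time(raw))
```

Because the timestamp occupies the most significant bytes, comparing two such values orders them by creation second first, which is why sorting on _id approximates creation order.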
Document Flexibility in Practice
The schema-free nature of collections means different documents can have different shapes — a feature that makes MongoDB ideal for evolving applications and varied data types.
# Schema flexibility — same collection, different document shapes
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# Products in the same collection with completely different fields
electronics_product = {
"_id": "p_new_001",
"name": "Noise-Cancelling Headphones",
"category": "Electronics",
"brand": "SoundMax",
"price": 149.99,
"stock": 60,
"rating": 4.6,
"specs": {
"driver_size_mm": 40,
"frequency_hz": "20-20000",
"wireless": True,
"battery_hours": 30
}
}
stationery_product = {
"_id": "p_new_002",
"name": "Sticky Notes Pack",
"category": "Stationery",
"brand": "WriteCo",
"price": 2.99,
"stock": 800,
"rating": 4.1,
"colours": ["yellow", "pink", "blue", "green"],
"sheets_per_pad": 100
# No 'specs' field — that's fine
}
# Both inserted into the same 'products' collection
db.products.insert_one(electronics_product)
db.products.insert_one(stationery_product)
print("Both documents inserted — different shapes, same collection")
print("Electronics has 'specs' sub-doc:", "specs" in electronics_product)
print("Stationery has 'colours' array: ", "colours" in stationery_product)
Output:
Both documents inserted — different shapes, same collection
Electronics has 'specs' sub-doc: True
Stationery has 'colours' array: True
- Fields that don't exist in a document are simply absent — no NULL columns wasting space as in SQL
- Adding a new field to future documents requires zero schema migration — just start writing it
- You can query for documents that have or don't have a field using {"field": {"$exists": true}}
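In PyMongo the $exists filter is just a nested dictionary, with Python's True and False in place of the shell's true and false. The sketch below builds such filters and, so that it runs without a server, applies them with a tiny stand-in matcher. exists_filter and matches_exists are hypothetical helpers that handle only this one operator on top-level fields; the product documents are trimmed versions of the two shown earlier.

```python
def exists_filter(field, present=True):
    """Build an MQL filter such as {"specs": {"$exists": True}}."""
    return {field: {"$exists": present}}


def matches_exists(document, query):
    """Toy matcher: supports only top-level {"field": {"$exists": bool}}."""
    return all(
        (field in document) == condition["$exists"]
        for field, condition in query.items()
    )


products = [
    {"_id": "p_new_001", "name": "Noise-Cancelling Headphones",
     "specs": {"wireless": True}},
    {"_id": "p_new_002", "name": "Sticky Notes Pack",
     "colours": ["yellow", "pink", "blue", "green"]},
]

with_specs = [p["_id"] for p in products
              if matches_exists(p, exists_filter("specs"))]
without_specs = [p["_id"] for p in products
                 if matches_exists(p, exists_filter("specs", present=False))]

print(with_specs, without_specs)
```

Against a live database the same dictionary would be passed straight to find, for example db.products.find(exists_filter("specs")).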
Summary Table
| Level | SQL Equivalent | Key Behaviour | Create Command |
|---|---|---|---|
| Database | Database / Schema | Created lazily on first insert | use dbname |
| Collection | Table | Schema-free, created on first insert | db.createCollection() |
| Document | Row | Max 16 MB, must have _id | insertOne() / insertMany() |
| Field | Column | Optional per document, any BSON type | Part of document on insert |
| _id | Primary Key | Unique, auto-generated ObjectId if omitted | Auto or user-supplied |
Practice Questions
Practice 1. What are the three levels of the MongoDB data hierarchy from top to bottom?
Practice 2. What happens in MongoDB when you try to use a database or collection that does not yet exist?
Practice 3. What is the maximum size of a single MongoDB document?
Practice 4. What makes a capped collection different from a regular collection?
Practice 5. What MQL filter finds all documents in a collection that have a specific field present?
Quiz
Quiz 1. In MongoDB, when does a database actually get created on disk?
Quiz 2. What is an ObjectId composed of?
Quiz 3. What error does MongoDB throw if you try to insert a document with a duplicate _id?
Quiz 4. Which of the following is a valid value for the _id field in MongoDB?
Quiz 5. What is the SQL equivalent of a MongoDB collection?
Next up — BSON Data Format: how MongoDB stores data internally, the types it supports beyond JSON, and why it matters for your queries.