MongoDB Lesson 9 – Databases, Collections & Documents | Dataplexa

Databases, Collections & Documents

MongoDB organises all data in a three-level hierarchy: databases at the top, collections in the middle, and documents at the bottom. Every piece of data you store, query, update, or delete lives inside a document, inside a collection, inside a database. Understanding this hierarchy — and the rules and behaviours at each level — is the foundation for everything else in this course.

This lesson uses the Dataplexa Store dataset to make every concept concrete and immediately applicable.

The Three-Level Hierarchy

Think of the hierarchy this way: a database is like a filing cabinet, a collection is a drawer inside that cabinet, and a document is an individual folder inside the drawer. Each level has its own rules, commands, and behaviours.

# The MongoDB data hierarchy — visualised

hierarchy = {
    "MongoDB Server (mongod)": {
        "dataplexa (database)": {
            "users (collection)": [
                '{ "_id": "u001", "name": "Alice Johnson", "membership": "premium" }',
                '{ "_id": "u002", "name": "Bob Smith",    "membership": "basic"   }',
            ],
            "products (collection)": [
                '{ "_id": "p001", "name": "Wireless Mouse",  "price": 29.99 }',
                '{ "_id": "p002", "name": "Standing Desk",   "price": 349.99 }',
            ],
            "orders (collection)": [
                '{ "_id": "o001", "user_id": "u001", "total": 44.96 }',
            ],
        },
        "admin (database)": "internal MongoDB system database",
        "local (database)": "internal — stores oplog for replication",
    }
}

print("One server → many databases → many collections → many documents")
print()
print("dataplexa database contains:")
for collection in ["users", "products", "orders", "reviews"]:
    print(f"  └── {collection} (collection)")

Output:
One server → many databases → many collections → many documents

dataplexa database contains:
  └── users (collection)
  └── products (collection)
  └── orders (collection)
  └── reviews (collection)

  • One MongoDB server can host many databases — each is isolated with its own collections and access controls
  • admin and local are reserved system databases — never use them for application data
  • The config database is also reserved — used internally by sharded clusters
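
The reserved names in the bullets above can be kept out of application code with a small filter. This is a minimal sketch in plain Python: the `application_databases` helper and the sample list are hypothetical, and with a running server the input would come from `client.list_database_names()`.

```python
# Reserved system databases named in this lesson.
RESERVED_DATABASES = {"admin", "local", "config"}


def application_databases(all_names):
    """Return only the database names that are safe for application data."""
    return [name for name in all_names if name not in RESERVED_DATABASES]


# Sample input; with a live connection this would be client.list_database_names().
sample = ["admin", "config", "dataplexa", "local"]
app_dbs = application_databases(sample)
print(app_dbs)
```

Keeping the reserved names in one set means a later check (for example, before a drop) can reuse the same list.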

Databases

A database is the top-level namespace in MongoDB. It groups related collections together and provides isolation between applications. A single MongoDB server typically hosts one database per application — for example, one database for your e-commerce app and a separate one for your analytics pipeline.

Why it matters: databases are the usual boundary for access control. Roles are typically granted per database, although finer-grained, collection-level privileges are also possible. Databases also let you run multiple applications on the same MongoDB server without data bleeding between them.

# Working with databases in mongosh and PyMongo

# ── mongosh ─────────────────────────────────────────────
# show dbs                    # list all databases (only shows non-empty ones)
# use dataplexa               # switch to the dataplexa database
# db                          # print current database name
# db.dropDatabase()           # delete the current database entirely

# ── PyMongo ─────────────────────────────────────────────
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

# List all databases
all_dbs = client.list_database_names()
print("All databases:", all_dbs)

# Access a specific database — creates it on first write
db = client["dataplexa"]
print("Current database:", db.name)

# List collections in the database
collections = db.list_collection_names()
print("Collections:", collections)

Output:
All databases: ['admin', 'config', 'dataplexa', 'local']
Current database: dataplexa
Collections: ['users', 'products', 'orders', 'reviews']

  • In MongoDB a database is created lazily: it does not appear until it holds data, which normally means the first document you insert into it
  • show dbs in mongosh only lists databases that already hold data, so a database you have merely switched to with use will not show up yet
  • In PyMongo, client["dbname"] and client.dbname are equivalent — both access the same database object
  • Dropping a database removes all its collections and documents — use with extreme care in production
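
The lazy-creation behaviour in the bullets above can be pictured with a toy in-memory model. This is a sketch in plain Python, not PyMongo: `ToyServer` and `ToyDatabase` are hypothetical classes, and only their method names are borrowed from the real driver for familiarity.

```python
class ToyServer:
    """In-memory stand-in for a server (hypothetical, not part of PyMongo)."""

    def __init__(self):
        # database name -> collection name -> list of documents
        self._databases = {}

    def get_database(self, name):
        # Handing out a handle registers nothing on the server.
        return ToyDatabase(self, name)

    def list_database_names(self):
        return sorted(self._databases)


class ToyDatabase:
    def __init__(self, server, name):
        self._server = server
        self.name = name

    def insert_one(self, collection_name, document):
        # The first write is what brings the database and collection into being.
        collections = self._server._databases.setdefault(self.name, {})
        collections.setdefault(collection_name, []).append(document)

    def list_collection_names(self):
        return sorted(self._server._databases.get(self.name, {}))


server = ToyServer()
db = server.get_database("dataplexa")
before = server.list_database_names()
db.insert_one("users", {"_id": "u001", "name": "Alice Johnson"})
after = server.list_database_names()
print(before, after)
```

The point of the model is the ordering: asking for a handle is free, and only the insert changes what the listing returns.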

Collections

A collection is a group of MongoDB documents — the equivalent of a table in a relational database, but without a fixed schema. Documents inside a collection can have completely different fields. Collections are schema-free by default, though you can enforce structure using JSON Schema validation when needed.

Why it matters: collections are the main unit of organisation for your data. Choosing how to group documents into collections — and how many collections to create — is a core data modelling decision that affects query performance, access control, and maintainability.
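
To make the JSON Schema validation mentioned above more concrete, here is a minimal sketch of what a validator document might look like. The specific rules (required `name` and `membership`, two allowed membership values) are assumptions chosen to match the sample users in this lesson, not rules the Dataplexa dataset is known to enforce.

```python
# Hypothetical validation rules for the users collection.
users_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "membership"],
        "properties": {
            "name": {"bsonType": "string"},
            "membership": {"enum": ["basic", "premium"]},
        },
    }
}

# With a running server and an open `db` handle, the rules would be attached
# when the collection is created:
# db.create_collection("users", validator=users_validator)

print(users_validator["$jsonSchema"]["required"])
```

Documents that lack a listed field would then be rejected at insert time, while any extra fields are still accepted. For a collection that already exists, the `collMod` database command serves the same purpose.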

# Working with collections in PyMongo

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Access a collection — created automatically on first write
users    = db["users"]
products = db["products"]

# Or using attribute-style access
orders  = db.orders
reviews = db.reviews

# Collection information
print("users collection — document count:", db.users.count_documents({}))
print("products collection — document count:", db.products.count_documents({}))

# Create a collection explicitly (optional — rarely needed)
# db.create_collection("logs")

# Create a capped collection: fixed size, oldest docs auto-deleted
# (create_collection raises CollectionInvalid if activity_log already exists)
db.create_collection("activity_log", capped=True, size=1048576, max=1000)
# size = max bytes (1 MiB here), max = max number of documents

print("\nCollections in dataplexa:", db.list_collection_names())

Output:
users collection — document count: 5
products collection — document count: 7

Collections in dataplexa: ['users', 'products', 'orders', 'reviews', 'activity_log']

  • Collections are created implicitly on the first insert — you rarely need to create them explicitly
  • Capped collections maintain insertion order and automatically overwrite oldest documents when full — ideal for logs and activity feeds
  • Collection names are case-sensitive — Users and users are different collections
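  • Keep dollar signs out of collection names (MongoDB rejects them) and do not start a name with system., which is reserved for internal collections; dots are legal but easy to misread, so use them sparingly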
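
The "overwrite oldest when full" behaviour of a capped collection resembles a fixed-length queue. The sketch below is an analogy using Python's standard `collections.deque`, not MongoDB itself: it models only the `max` document limit, whereas a real capped collection is also bounded by its `size` in bytes.

```python
from collections import deque

# A deque with maxlen drops its oldest entry when a new one arrives at capacity.
activity_log = deque(maxlen=3)

for event_number in range(1, 6):
    activity_log.append({"event": event_number})

# Only the three most recent events survive, still in insertion order.
remaining = [entry["event"] for entry in activity_log]
print(remaining)
```

As with the real thing, nothing has to be deleted by hand: the oldest entries simply fall off as new ones arrive.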

Documents

A document is the fundamental unit of data in MongoDB. It is stored internally as BSON but represented as JSON-like objects in your code. A document can hold any combination of fields — strings, numbers, booleans, arrays, nested documents, dates, and more — with no requirement to match other documents in the same collection.

Why it matters: the document model is what makes MongoDB developer-friendly. Objects in your code map directly to documents — there is no translation layer, no ORM required, no painful schema migrations when you add a new field.

# A real document from the Dataplexa Store — anatomy breakdown

order_document = {
    "_id":     "o001",           # unique identifier — required, auto-generated if omitted
    "user_id": "u001",           # reference to a user document (by convention)
    "status":  "delivered",      # string field
    "date":    "2024-01-15",     # date stored as string here (BSON Date recommended)
    "total":   44.96,            # number field (float)
    "items": [                   # array field — holds sub-documents
        {
            "product_id": "p001",
            "qty":        1,
            "price":      29.99
        },
        {
            "product_id": "p003",
            "qty":        3,
            "price":      4.99
        }
    ]
}

# Accessing fields
print("Order ID:   ", order_document["_id"])
print("Status:     ", order_document["status"])
print("Total:      ", order_document["total"])
print("Item count: ", len(order_document["items"]))
print("First item: ", order_document["items"][0]["product_id"])

Output:
Order ID:    o001
Status:      delivered
Total:       44.96
Item count:  2
First item:  p001

  • Every document must have an _id field — MongoDB generates a unique ObjectId automatically if you do not provide one
  • The maximum document size is 16 MB — for larger data (files, images) use GridFS
  • Arrays inside documents can hold any mix of types — strings, numbers, sub-documents, even other arrays
  • Nested sub-documents can be queried directly using dot notation: "items.product_id"
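
Dot notation is easier to reason about once you see how a path such as "items.product_id" walks through a document. The helper below is a simplified, hypothetical stand-in written in plain Python; real MongoDB path matching has further rules (numeric array indexes, for instance) that it does not attempt to cover.

```python
def resolve_path(document, path):
    """Follow a dotted path, fanning out across arrays of sub-documents."""
    values = [document]
    for key in path.split("."):
        next_values = []
        for value in values:
            # An array contributes each of its elements to the search.
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_values.append(candidate[key])
        values = next_values
    return values


order = {
    "_id": "o001",
    "total": 44.96,
    "items": [
        {"product_id": "p001", "qty": 1, "price": 29.99},
        {"product_id": "p003", "qty": 3, "price": 4.99},
    ],
}

product_ids = resolve_path(order, "items.product_id")
# Cross-check the stored total against the line items.
line_total = round(sum(item["qty"] * item["price"] for item in order["items"]), 2)
print(product_ids, line_total)
```

Against a live collection, the equivalent filter would be `db.orders.find({"items.product_id": "p001"})`, which matches any order whose items array contains that product.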

The _id Field and ObjectId

The _id field is the primary key of every MongoDB document. If you do not provide one, a 12-byte ObjectId is generated automatically: a value designed to be unique across servers without coordination, embedded with a creation timestamp, and sortable by time.
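
To see where the embedded timestamp lives, the 12 bytes can be taken apart with nothing but the standard library. This is a minimal sketch: `split_object_id` is a hypothetical helper, and the hex string is just an illustrative value. In real code, `ObjectId(...).generation_time` from the `bson` package gives the timestamp directly.

```python
from datetime import datetime, timezone


def split_object_id(hex_string):
    """Split a 24-character ObjectId hex string into its three parts."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),  # bytes 0-3
        "random": raw[4:9].hex(),                                       # bytes 4-8
        "counter": int.from_bytes(raw[9:12], "big"),                    # bytes 9-11
    }


parts = split_object_id("64a1f2e3b4c5d6e7f8a9b0c1")
print(parts["timestamp"].isoformat(), parts["random"], parts["counter"])
```

Because the timestamp occupies the leading bytes, comparing two ObjectIds byte by byte compares their creation seconds first, which is why they sort by time.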

# _id field and ObjectId — how they work

from bson import ObjectId
from datetime import datetime

# MongoDB auto-generates an ObjectId when _id is omitted
auto_id = ObjectId()
print("Generated ObjectId:   ", auto_id)
print("String representation:", str(auto_id))
print("Creation timestamp:   ", auto_id.generation_time)
print("Is it unique?          Yes — globally, without coordination")

# ObjectId structure — 12 bytes total
print("\nObjectId breakdown:")
print("  Bytes 0-3:  Unix timestamp (seconds since epoch)")
print("  Bytes 4-8:  Random value (unique per machine/process)")
print("  Bytes 9-11: Incrementing counter (unique per second)")

# You can use any value as _id — not just ObjectId
custom_ids = [
    {"_id": "u001",                   "name": "Alice"},   # string _id
    {"_id": 42,                       "name": "Bob"},     # integer _id
    {"_id": ObjectId(),               "name": "Clara"},   # auto ObjectId
    {"_id": "prod-electronics-0001",  "name": "Mouse"},   # meaningful string
]

for doc in custom_ids:
    print(f"  _id: {str(doc['_id']):35} name: {doc['name']}")

Output (ObjectId values and the timestamp differ on every run):
Generated ObjectId:    64a1f2e3b4c5d6e7f8a9b0c1
String representation: 64a1f2e3b4c5d6e7f8a9b0c1
Creation timestamp:    2023-07-02 21:57:55+00:00
Is it unique?          Yes — globally, without coordination

ObjectId breakdown:
  Bytes 0-3:  Unix timestamp (seconds since epoch)
  Bytes 4-8:  Random value (unique per machine/process)
  Bytes 9-11: Incrementing counter (unique per second)
  _id: u001                                name: Alice
  _id: 42                                  name: Bob
  _id: 64a1f2e3b4c5d6e7f8a9b0c2            name: Clara
  _id: prod-electronics-0001               name: Mouse

  • You can use any unique value as _id — string, integer, ObjectId, or even a sub-document
  • In the Dataplexa dataset we use simple string IDs like "u001", "p001" — easy to read and reference
  • ObjectIds sort by creation time (to one-second precision): db.collection.find().sort({"_id": 1}) in mongosh, or find().sort("_id", 1) in PyMongo, returns documents oldest-first
  • Duplicate _id values in the same collection throw a DuplicateKeyError, because _id always has a unique index

Document Flexibility in Practice

The schema-free nature of collections means different documents can have different shapes — a feature that makes MongoDB ideal for evolving applications and varied data types.

# Schema flexibility — same collection, different document shapes

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db     = client["dataplexa"]

# Products in the same collection with completely different fields
electronics_product = {
    "_id":      "p_new_001",
    "name":     "Noise-Cancelling Headphones",
    "category": "Electronics",
    "brand":    "SoundMax",
    "price":    149.99,
    "stock":    60,
    "rating":   4.6,
    "specs": {
        "driver_size_mm": 40,
        "frequency_hz":   "20-20000",
        "wireless":       True,
        "battery_hours":  30
    }
}

stationery_product = {
    "_id":      "p_new_002",
    "name":     "Sticky Notes Pack",
    "category": "Stationery",
    "brand":    "WriteCo",
    "price":    2.99,
    "stock":    800,
    "rating":   4.1,
    "colours":  ["yellow", "pink", "blue", "green"],
    "sheets_per_pad": 100
    # No 'specs' field — that's fine
}

# Both inserted into the same 'products' collection
# (re-running this script raises DuplicateKeyError because the _id values are fixed)
db.products.insert_one(electronics_product)
db.products.insert_one(stationery_product)

print("Both documents inserted — different shapes, same collection")
print("Electronics has 'specs' sub-doc:", "specs" in electronics_product)
print("Stationery has 'colours' array: ", "colours" in stationery_product)

Output:
Both documents inserted — different shapes, same collection
Electronics has 'specs' sub-doc: True
Stationery has 'colours' array:  True

  • Fields that don't exist in a document are simply absent — no NULL columns wasting space as in SQL
  • Adding a new field to future documents requires zero schema migration — just start writing it
  • You can query for documents that have a field using {"field": {"$exists": true}}, or that lack it using {"$exists": false}; in Python code the booleans are written True and False
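
The effect of an $exists filter can be previewed on plain dictionaries before touching a database. The `with_field` helper below is a hypothetical pure-Python stand-in that checks top-level fields only; the two sample documents are trimmed versions of the products shown above.

```python
def with_field(documents, field, exists=True):
    """Keep documents whose top-level keys include (or exclude) `field`."""
    return [doc for doc in documents if (field in doc) == exists]


products = [
    {"_id": "p_new_001", "name": "Noise-Cancelling Headphones", "specs": {"wireless": True}},
    {"_id": "p_new_002", "name": "Sticky Notes Pack", "colours": ["yellow", "pink"]},
]

has_specs = [doc["name"] for doc in with_field(products, "specs")]
no_specs = [doc["name"] for doc in with_field(products, "specs", exists=False)]
print(has_specs, no_specs)
```

With a live connection the same question would be asked as `db.products.find({"specs": {"$exists": True}})`. Like the helper, that filter matches on the presence of the field, regardless of the value it holds.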

Summary Table

Level       SQL Equivalent      Key Behaviour                                Create Command
Database    Database / Schema   Created lazily on first insert               use dbname
Collection  Table               Schema-free, created on first insert         db.createCollection()
Document    Row                 Max 16 MB, must have _id                     insertOne() / insertMany()
Field       Column              Optional per document, any BSON type         Part of document on insert
_id         Primary Key         Unique, auto-generated ObjectId if omitted   Auto or user-supplied

Practice Questions

Practice 1. What are the three levels of the MongoDB data hierarchy from top to bottom?

Practice 2. What happens in MongoDB when you try to use a database or collection that does not yet exist?

Practice 3. What is the maximum size of a single MongoDB document?

Practice 4. What makes a capped collection different from a regular collection?

Practice 5. What MQL filter finds all documents in a collection that have a specific field present?


Quiz

Quiz 1. In MongoDB, when does a database actually get created on disk?

Quiz 2. What is an ObjectId composed of?

Quiz 3. What error does MongoDB throw if you try to insert a document with a duplicate _id?

Quiz 4. Which of the following is a valid value for the _id field in MongoDB?

Quiz 5. What is the SQL equivalent of a MongoDB collection?


Next up — BSON Data Format: how MongoDB stores data internally, the types it supports beyond JSON, and why it matters for your queries.