MongoDB
BSON Data Format
When you write a MongoDB query you work with JSON-like documents: human-readable key-value pairs surrounded by curly braces. But MongoDB never actually stores JSON on disk. It stores BSON (Binary JSON), a binary-encoded format that extends JSON with additional data types, is faster to encode and decode, and is often more compact for the numerical and date-heavy data that real applications produce. Understanding BSON tells you exactly what types you can store, how MongoDB represents them in memory and on disk, and why certain operations behave the way they do.
JSON vs BSON — The Key Differences
JSON was designed for human readability and data interchange over HTTP. BSON was designed for machine efficiency — fast traversal, fast encoding, and support for types that JSON simply cannot express.
# JSON vs BSON — side-by-side comparison
json_limitations = {
"number types": "JSON has one number type — no distinction between int and float",
"dates": "JSON has no Date type — dates must be stored as strings",
"binary data": "JSON cannot store raw binary — must base64-encode it",
"undefined": "JSON has no undefined type",
"regex": "JSON cannot natively represent regular expressions",
"ObjectId": "JSON has no ObjectId type",
}
bson_additions = {
"Int32 / Int64": "Separate integer types for 32-bit and 64-bit integers",
"Double": "64-bit floating point number",
"Decimal128": "128-bit high-precision decimal — essential for financial data",
"Date": "64-bit integer — milliseconds since Unix epoch",
"ObjectId": "12-byte unique identifier — the default _id type",
"Binary": "Raw binary data (images, files, hashed passwords)",
"Boolean": "True / False (same as JSON)",
"Null": "Null value (same as JSON)",
"Array": "Ordered list (same as JSON)",
"Embedded doc": "Nested document (same as JSON object)",
"Regex": "Native regular expression with flags",
"Timestamp": "Internal MongoDB type for replication ordering",
"MinKey/MaxKey": "Special types that compare lower/higher than all other values",
}
print("JSON lacks:")
for k, v in json_limitations.items():
    print(f" {k:15} — {v}")
print("\nBSON adds:")
for k, v in list(bson_additions.items())[:6]:
    print(f" {k:15} — {v}")

JSON lacks:
number types — JSON has one number type — no distinction between int and float
dates — JSON has no Date type — dates must be stored as strings
binary data — JSON cannot store raw binary — must base64-encode it
undefined — JSON has no undefined type
regex — JSON cannot natively represent regular expressions
ObjectId — JSON has no ObjectId type
BSON adds:
Int32 / Int64 — Separate integer types for 32-bit and 64-bit integers
Double — 64-bit floating point number
Decimal128 — 128-bit high-precision decimal — essential for financial data
Date — 64-bit integer — milliseconds since Unix epoch
ObjectId — 12-byte unique identifier — the default _id type
Binary — Raw binary data (images, files, hashed passwords)
- BSON is a superset of JSON — everything valid in JSON is valid in BSON, plus more
- MongoDB drivers automatically convert between your language's native types and BSON when reading and writing
- BSON includes length prefixes for strings and documents — the database can skip over fields it does not need without parsing them
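The length prefixes mentioned above can be seen by encoding a tiny document by hand. This is a minimal sketch using only the standard library and assuming the BSON layout for string (type 0x02) and 32-bit integer (type 0x10) elements; the helper names are hypothetical and it is not how PyMongo is implemented.

```python
import struct
from typing import Optional


def encode_string_field(name: str, value: str) -> bytes:
    """One string element: type byte 0x02, NUL-terminated key, int32 length, bytes, NUL."""
    data = value.encode("utf-8") + b"\x00"
    return b"\x02" + name.encode("utf-8") + b"\x00" + struct.pack("<i", len(data)) + data


def encode_int32_field(name: str, value: int) -> bytes:
    """One int32 element: type byte 0x10, NUL-terminated key, 4-byte little-endian value."""
    return b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)


def encode_document(elements: bytes) -> bytes:
    """A document: int32 total length (including itself), elements, trailing NUL."""
    total = 4 + len(elements) + 1
    return struct.pack("<i", total) + elements + b"\x00"


def find_int32(doc: bytes, wanted: str) -> Optional[int]:
    """Scan a document holding only string and int32 fields, skipping strings by length."""
    pos = 4                  # skip the document length prefix
    end = len(doc) - 1       # index of the trailing NUL
    while pos < end:
        type_byte = doc[pos]
        key_end = doc.index(b"\x00", pos + 1)
        key = doc[pos + 1:key_end].decode("utf-8")
        pos = key_end + 1
        if type_byte == 0x02:
            (length,) = struct.unpack_from("<i", doc, pos)
            pos += 4 + length    # jump past the string without decoding it
        elif type_byte == 0x10:
            (value,) = struct.unpack_from("<i", doc, pos)
            if key == wanted:
                return value
            pos += 4
        else:
            raise ValueError(f"unsupported type byte {type_byte:#x} in this sketch")
    return None


doc = encode_document(
    encode_string_field("name", "Alice Johnson") + encode_int32_field("age", 30)
)
print("declared length:", struct.unpack_from("<i", doc, 0)[0], "actual:", len(doc))
print("age:", find_int32(doc, "age"))
```

The reader reaches the age field by adding the string's declared length to the cursor, which is the kind of skip a text format like JSON cannot offer without scanning every character.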
BSON Types in Python with PyMongo
The bson module (installed with PyMongo) provides Python classes for all BSON-specific types. Knowing which Python type maps to which BSON type prevents subtle bugs — especially with numbers and dates.
# BSON types in Python — using the bson module
from bson import ObjectId, Decimal128, Regex, Binary
from datetime import datetime, timezone
import hashlib
# A document demonstrating every major BSON type
full_bson_document = {
    # ── Identifiers ─────────────────────────────────────
    "_id": ObjectId(),  # BSON ObjectId (12 bytes)
    # ── Strings ─────────────────────────────────────────
    "name": "Alice Johnson",  # BSON String (UTF-8)
    # ── Numbers ─────────────────────────────────────────
    "age": 30,  # BSON Int32
    "total_orders": 150,  # BSON Int32 as written; wrap in bson.int64.Int64(150) to force Int64
    "balance": 9999.99,  # BSON Double (float)
    "price_exact": Decimal128("19.99"),  # BSON Decimal128 (financial)
    # ── Boolean ─────────────────────────────────────────
    "verified": True,  # BSON Boolean
    # ── Dates ───────────────────────────────────────────
    "joined_at": datetime(2022, 3, 10, 9, 30),  # BSON Date (never use strings)
    "updated_at": datetime.now(timezone.utc),  # Current UTC time (utcnow() is deprecated since Python 3.12)
    # ── Binary ──────────────────────────────────────────
    "avatar_hash": Binary(hashlib.md5(b"img").digest()),  # BSON Binary
    # ── Arrays and embedded documents ───────────────────
    "tags": ["premium", "early_adopter"],  # BSON Array
    "address": {  # Embedded BSON Document
        "city": "London",
        "country": "UK"
    },
    # ── Regex ───────────────────────────────────────────
    "email_pattern": Regex(r"^.+@.+\..+$"),  # BSON Regex
    # ── Null ────────────────────────────────────────────
    "phone": None,  # BSON Null
}
print("BSON document field types:")
for key, value in full_bson_document.items():
    print(f" {key:15} → {type(value).__name__}")

BSON document field types:
_id → ObjectId
name → str
age → int
total_orders → int
balance → float
price_exact → Decimal128
verified → bool
joined_at → datetime
updated_at → datetime
avatar_hash → Binary
tags → list
address → dict
email_pattern → Regex
phone → NoneType
- Python int maps to BSON Int32 or Int64 automatically depending on size
- Python float maps to BSON Double, a 64-bit IEEE 754 floating point number
- Python datetime maps to BSON Date; always use datetime, never store dates as strings
- Python dict maps to an embedded BSON document; Python list maps to a BSON array
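The size-based choice between Int32 and Int64 can be illustrated with a small standard-library sketch. The function name bson_int_type is hypothetical and only mirrors the rule described above (it returns the $type aliases "int" and "long"); it is not PyMongo's own code.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def bson_int_type(value: int) -> str:
    """Return the BSON integer alias a Python int of this size would need."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, but it maps to BSON Boolean
        raise TypeError("bool maps to BSON Boolean, not an integer type")
    if INT32_MIN <= value <= INT32_MAX:
        return "int"     # fits in 32 bits
    if INT64_MIN <= value <= INT64_MAX:
        return "long"    # needs 64 bits
    raise OverflowError("BSON integers are limited to 8 bytes")


for sample in (30, 150, 2**31 - 1, 2**31, 5_000_000_000):
    print(f"{sample:>13} -> {bson_int_type(sample)}")
```

A field whose values straddle the 32-bit boundary can therefore end up with mixed "int" and "long" types across documents, which is worth knowing before filtering on a single integer type.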
Why Decimal128 Matters for Financial Data
This is one of the most important BSON subtleties for real applications. Floating-point numbers (Double) cannot precisely represent many decimal values — a well-known problem in computer science. For money, prices, and financial calculations, always use Decimal128.
# Why Double fails for financial data — and how Decimal128 fixes it
from bson import Decimal128
from decimal import Decimal
# The classic floating-point precision problem
price_a = 0.1
price_b = 0.2
total = price_a + price_b
print("Double arithmetic:")
print(f" 0.1 + 0.2 = {total}") # Not exactly 0.3!
print(f" 0.1 + 0.2 == 0.3: {total == 0.3}")
# Decimal128 solves this — exact decimal arithmetic
d_price_a = Decimal128("0.10")
d_price_b = Decimal128("0.20")
# Decimal128 is a storage type with no arithmetic; convert to Python Decimal first
result = d_price_a.to_decimal() + d_price_b.to_decimal()
print("\nDecimal128 arithmetic:")
print(f" 0.10 + 0.20 = {result}")
print(f" Exact: {result == Decimal('0.30')}")
# Practical rule for the Dataplexa Store
print("\nDataplexa Store rule:")
print(" order totals → Decimal128 (financial accuracy required)")
print(" product ratings → float / Double (approximate is fine)")

Double arithmetic:
0.1 + 0.2 = 0.30000000000000004
0.1 + 0.2 == 0.3: False
Decimal128 arithmetic:
0.10 + 0.20 = 0.30
Exact: True
Dataplexa Store rule:
order totals → Decimal128 (financial accuracy required)
product ratings → float / Double (approximate is fine)
- Use Decimal128 for any value where exact decimal precision is required: prices, totals, tax, interest rates
- Use Double (Python float) for scientific measurements, ratings, and percentages, where small rounding is acceptable
- Construct Decimal128 values from strings, as in Decimal128("19.99"), never from a float literal
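In application code the arithmetic usually happens on Python Decimal values, with conversion to Decimal128 only at the storage boundary. The sketch below uses the standard library alone; the function order_total, the 8.25% tax rate, and the half-up rounding rule are illustrative assumptions rather than anything prescribed by MongoDB.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def order_total(unit_price: Decimal, quantity: int, tax_rate: Decimal) -> Decimal:
    """Compute subtotal plus tax, rounding the tax to whole cents."""
    subtotal = unit_price * quantity
    tax = (subtotal * tax_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return subtotal + tax


total = order_total(Decimal("19.99"), 3, Decimal("0.0825"))
print("order total:", total)

# With PyMongo installed, the result could be wrapped for storage:
#   from bson import Decimal128
#   document = {"total": Decimal128(total)}
```

Rounding once, at a clearly defined step, keeps every stored total reproducible from its inputs.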
BSON Dates — Always Use datetime, Never Strings
Storing dates as strings is one of the most common MongoDB mistakes. BSON has a native Date type that stores time as a 64-bit integer (milliseconds since epoch), enabling fast date comparisons, date arithmetic, and range queries. String dates compare and sort correctly only if every value uses the same zero-padded ISO 8601 format, and they cannot be used for date arithmetic without additional parsing.
# BSON Date — correct and incorrect approaches
from pymongo import MongoClient
from datetime import datetime, timezone
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# WRONG — storing dates as strings
bad_document = {
"_id": "bad_001",
"event": "user_signup",
"date_str": "2024-03-15" # string — cannot do date arithmetic or range queries
}
# CORRECT — storing dates as BSON Date via Python datetime
good_document = {
"_id": "good_001",
"event": "user_signup",
"date": datetime(2024, 3, 15, 9, 30, 0), # BSON Date
"created": datetime.now(timezone.utc), # UTC recommended for production
}
# With BSON Date you can run date range queries naturally
# db.events.find({ "date": { "$gte": datetime(2024, 1, 1), "$lt": datetime(2025, 1, 1) } })
# With string dates you would need ugly regex
# db.events.find({ "date_str": { "$regex": "^2024" } }) ← avoid this
print("BSON Date query — clean and fast:")
print(' db.events.find({ "date": { "$gte": datetime(2024,1,1) } })')
print("\nString date query — messy and slow:")
print(' db.events.find({ "date_str": { "$regex": "^2024" } })')

BSON Date query — clean and fast:
db.events.find({ "date": { "$gte": datetime(2024,1,1) } })
String date query — messy and slow:
db.events.find({ "date_str": { "$regex": "^2024" } })
- Always store dates as Python datetime objects; PyMongo converts them to BSON Date automatically
- Use datetime.now(timezone.utc) for production; always store UTC, convert to local time in the application layer
- BSON Date range queries use standard comparison operators: $gte, $lte, $gt, $lt
- MongoDB's aggregation pipeline has date operators ($year, $month, $dayOfWeek etc.) that work on BSON Date fields, not strings
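Because a BSON Date is a count of milliseconds since the Unix epoch, a range query reduces to integer comparison. The following standard-library sketch shows that conversion; the helper names are hypothetical, and treating naive datetimes as UTC is an assumption made here to match the UTC advice above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_bson_millis(dt: datetime) -> int:
    """Convert a datetime to whole milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)   # assumption: naive values are already UTC
    # Integer division drops anything finer than a millisecond
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_bson_millis(ms: int) -> datetime:
    """Convert epoch milliseconds back to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


signup = datetime(2024, 3, 15, 9, 30, tzinfo=timezone.utc)
millis = to_bson_millis(signup)

# The equivalent of {"$gte": 2024-01-01, "$lt": 2025-01-01} as plain integer comparison
start = to_bson_millis(datetime(2024, 1, 1, tzinfo=timezone.utc))
end = to_bson_millis(datetime(2025, 1, 1, tzinfo=timezone.utc))
in_2024 = start <= millis < end

print("epoch milliseconds:", millis)
print("round trip:", from_bson_millis(millis))
print("falls in 2024:", in_2024)
```

Values with microsecond precision lose their sub-millisecond part in this representation, so a datetime read back from the database may not compare equal to the one that was written.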
Checking BSON Types in Queries
MongoDB provides the $type operator to filter documents based on the BSON type of a field — useful for data quality checks and debugging mixed-type collections.
# $type operator — filter by BSON type
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["dataplexa"]
# BSON type names (string aliases — easier to read than numeric codes)
bson_type_aliases = {
"double": "64-bit float",
"string": "UTF-8 string",
"object": "embedded document",
"array": "array",
"binData": "binary data",
"objectId": "ObjectId",
"bool": "boolean",
"date": "BSON Date",
"null": "null",
"regex": "regular expression",
"int": "32-bit integer",
"long": "64-bit integer",
"decimal": "Decimal128",
}
# Find products where price is stored as a double (correct)
# db.products.find({ "price": { "$type": "double" } })
# Find any documents where a field is null
# db.users.find({ "phone": { "$type": "null" } })
# Find documents where _id is an ObjectId (vs string)
# db.products.find({ "_id": { "$type": "objectId" } })
# In the Dataplexa Store — check all products have numeric prices
price_check = list(db.products.find(
    {"price": {"$type": ["double", "int", "decimal"]}},
    {"name": 1, "price": 1, "_id": 0}
))
for doc in price_check:
    print(f" {doc['name']:30} price: {doc['price']}")

Mechanical Keyboard price: 89.99
Notebook A5 price: 4.99
Standing Desk price: 349.99
USB-C Hub price: 49.99
Ballpoint Pens 10-pack price: 3.49
Monitor 27-inch price: 299.99
- $type accepts both string aliases ("double") and numeric type codes (1); string aliases are preferred for readability
- Pass an array to $type to match multiple types at once: {"$type": ["double", "int"]}
- Use $type during data migrations to find documents that store a field in the wrong type
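For the migration use case, the useful query is often the inverse: documents where a field exists but is not one of the expected types. A minimal sketch of building such a filter follows; the helper wrong_type_filter and the products collection in the usage comment are hypothetical, and the filter relies only on the standard $exists, $not, and $type operators.

```python
from typing import Any, Dict, Sequence

# Aliases for the numeric BSON types listed in the table above
NUMERIC_ALIASES = ("double", "int", "long", "decimal")


def wrong_type_filter(field: str, allowed: Sequence[str] = NUMERIC_ALIASES) -> Dict[str, Any]:
    """Build a filter matching documents where `field` exists but has none of the allowed types."""
    return {field: {"$exists": True, "$not": {"$type": list(allowed)}}}


query = wrong_type_filter("price")
print(query)

# With a live connection, the filter would be passed to find(), for example:
#   for doc in db.products.find(query, {"name": 1, "price": 1}):
#       print(doc)
```

Including $exists keeps documents that simply lack the field out of the result, so the report lists only values stored with an unexpected type, such as a price saved as a string.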
Summary Table
| BSON Type | Python Type | Type Alias | Use For |
|---|---|---|---|
| Double | float | "double" | Ratings, measurements, percentages |
| String | str | "string" | Names, descriptions, statuses |
| ObjectId | ObjectId | "objectId" | Default _id, document references |
| Date | datetime | "date" | All timestamps, created_at, updated_at |
| Decimal128 | Decimal128 | "decimal" | Prices, financial totals, currency |
| Boolean | bool | "bool" | Flags, feature toggles, verified status |
| Array | list | "array" | Tags, items, multiple values per field |
| Binary | Binary | "binData" | Images, files, hashed passwords |
Practice Questions
Practice 1. What are two important data types that BSON supports that plain JSON does not?
Practice 2. Why should you use Decimal128 instead of Double for storing product prices?
Practice 3. What Python type does PyMongo automatically convert to a BSON Date?
Practice 4. What MQL operator lets you filter documents based on the BSON type of a field?
Practice 5. How many bytes is a BSON ObjectId and what information is embedded in it?
Quiz
Quiz 1. What format does MongoDB actually store data in on disk?
Quiz 2. Which BSON type stores time as a 64-bit integer representing milliseconds since the Unix epoch?
Quiz 3. What is the correct way to create a Decimal128 value of 19.99 in PyMongo?
Quiz 4. What timezone should you always use when storing BSON Date values in production?
Quiz 5. What advantage does BSON's length-prefix encoding give the database engine?
Next up — Insert Documents: adding single and multiple documents to collections with insertOne() and insertMany().