MongoDB
Replication
A single MongoDB server is a single point of failure — if it crashes, goes offline for maintenance, or loses its storage, your application loses access to its data. Replication solves this by maintaining multiple copies of the same data on different servers simultaneously. MongoDB's replication mechanism is called a replica set — a group of MongoDB instances that hold the same data, elect a leader automatically when one member fails, and recover without any manual intervention. Replication provides three things at once: high availability through automatic failover, data redundancy through multiple copies, and read scalability by routing read queries to secondary members. This lesson covers how replica sets work internally, how elections and failover happen, how to configure read preferences, and how to monitor replica set health from PyMongo.
1. Replica Set Architecture
A replica set consists of one primary and one or more secondaries. All writes go to the primary, which records every change in an operations log called the oplog. Secondaries continuously tail the primary's oplog and replay each operation on their own copy of the data; when a secondary falls behind the primary, the delay is called replication lag. A typical production replica set has three members: one primary and two secondaries.
# Replica set architecture — connecting and inspecting member roles
from pymongo import MongoClient

# Connect to a replica set — provide all members in the connection string
# MongoDB automatically discovers all members from any one of them
client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0"  # replica set name must match the configured name
)
db = client["dataplexa"]

# Check which member is currently the primary
hello = db.command("hello")  # modern replacement for isMaster
print("Replica set status from hello command:")
print(f"  isWritablePrimary: {hello.get('isWritablePrimary', False)}")
print(f"  primary: {hello.get('primary', 'unknown')}")
print(f"  hosts: {hello.get('hosts', [])}")
print(f"  setName: {hello.get('setName', 'unknown')}")
print(f"  me: {hello.get('me', 'unknown')}")

# Full replica set status — all member details
rs_status = client.admin.command("replSetGetStatus")
print(f"\nReplica set: {rs_status['set']}")
print(f"Members ({len(rs_status['members'])}):")
for member in rs_status["members"]:
    state_str = member.get("stateStr", "UNKNOWN")
    name = member.get("name", "")
    health = member.get("health", 0)
    print(f"  {name:30} state: {state_str:10} health: {health}")

# Replica set roles
print("\nReplica set member roles:")
roles = [
    ("Primary", "Accepts all writes. One at a time — elected by majority vote"),
    ("Secondary", "Replicates from primary oplog. Can serve reads if configured"),
    ("Arbiter", "Votes in elections but holds no data — lightweight tie-breaker"),
    ("Hidden", "Replicates data but invisible to clients — for backups/reporting"),
    ("Delayed", "Replicates with a time delay — protection against accidental deletes"),
]
for role, desc in roles:
    print(f"  {role:10} {desc}")
Replica set status from hello command:
isWritablePrimary: True
primary: localhost:27017
hosts: ['localhost:27017', 'localhost:27018', 'localhost:27019']
setName: rs0
me: localhost:27017
Replica set: rs0
Members (3):
localhost:27017 state: PRIMARY health: 1
localhost:27018 state: SECONDARY health: 1
localhost:27019 state: SECONDARY health: 1
Replica set member roles:
Primary Accepts all writes. One at a time — elected by majority vote
Secondary Replicates from primary oplog. Can serve reads if configured
Arbiter Votes in elections but holds no data — lightweight tie-breaker
Hidden Replicates data but invisible to clients — for backups/reporting
Delayed Replicates with a time delay — protection against accidental deletes
- Always include all replica set members in the connection string — if you connect to only one member and it goes down, the driver cannot discover the new primary
- A replica set requires an odd number of voting members (typically 3 or 5) so elections always produce a clear majority winner — with an even number, a tie is possible and no primary can be elected
- The `hello` command is the modern replacement for the deprecated `isMaster` command — both return the same information but `hello` is the current standard
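The odd-member rule comes down to simple arithmetic. A minimal sketch (plain Python, no MongoDB connection required) of why a four-member set tolerates no more failures than a three-member one:

```python
# Majority and fault tolerance for a replica set's voting members.
# A primary can only be elected while a strict majority of voting
# members is reachable.

def majority(voting_members: int) -> int:
    """Smallest vote count that forms a strict majority."""
    return voting_members // 2 + 1

def fault_tolerance(voting_members: int) -> int:
    """How many members can fail while an election remains possible."""
    return voting_members - majority(voting_members)

for n in (2, 3, 4, 5):
    print(f"{n} voting members: majority = {majority(n)}, "
          f"tolerates {fault_tolerance(n)} failure(s)")
```

A four-member set needs 3 votes for a majority, so it tolerates only one failure — the same as a three-member set, but with one extra machine to run. That is why 3 or 5 voting members is the standard recommendation.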
2. The Oplog — How Replication Works Internally
The oplog (operations log) is a special capped collection, stored in the `local` database on every replica set member. Every write to the primary — insert, update, delete — is recorded as an idempotent operation in the oplog. Secondaries tail this oplog continuously and replay each entry on their own data copy. Because it is a capped collection, older entries are overwritten when it fills up — if a secondary falls too far behind and its last replicated entry has been overwritten, it needs a full resync.
# Reading the oplog — understanding what replication tracks
from pymongo import MongoClient, DESCENDING

client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0"
)
local_db = client["local"]  # oplog lives in the local database

# Check oplog size and current usage
oplog_stats = local_db.command("collStats", "oplog.rs")
oplog_size_mb = oplog_stats.get("maxSize", 0) / (1024 * 1024)
oplog_used_mb = oplog_stats.get("size", 0) / (1024 * 1024)
oplog_count = oplog_stats.get("count", 0)
print("Oplog statistics:")
print(f"  Max size: {oplog_size_mb:,.0f} MB")
print(f"  Used: {oplog_used_mb:.1f} MB")
print(f"  Operations: {oplog_count:,}")

# Read the most recent oplog entries
print("\nMost recent oplog entries:")
entries = local_db["oplog.rs"].find(
    {},
    {"op": 1, "ns": 1, "o": 1, "ts": 1, "wall": 1}
).sort("$natural", DESCENDING).limit(5)
op_names = {
    "i": "INSERT",
    "u": "UPDATE",
    "d": "DELETE",
    "c": "COMMAND",
    "n": "NOOP"
}
for entry in entries:
    op = op_names.get(entry.get("op", "?"), entry.get("op", "?"))
    ns = entry.get("ns", "")
    wall = entry.get("wall", "")
    print(f"  {op:8} {ns:35} {str(wall)[:19]}")

# Replication lag — how far behind each secondary is
print("\nReplication lag per member:")
rs_status = client.admin.command("replSetGetStatus")
primary_optime = None
for m in rs_status["members"]:
    if m.get("stateStr") == "PRIMARY":
        primary_optime = m.get("optimeDate")
        break
for member in rs_status["members"]:
    state = member.get("stateStr", "")
    name = member.get("name", "")
    optime = member.get("optimeDate")
    if state == "SECONDARY" and primary_optime and optime:
        lag_secs = (primary_optime - optime).total_seconds()
        print(f"  {name:30} lag: {lag_secs:.1f} seconds")
    else:
        print(f"  {name:30} {state}")
Oplog statistics:
Max size: 1,024 MB
Used: 2.3 MB
Operations: 1,847
Most recent oplog entries:
INSERT dataplexa.orders 2024-04-01 09:14:22
UPDATE dataplexa.products 2024-04-01 09:14:21
INSERT dataplexa.orders 2024-04-01 09:14:20
UPDATE dataplexa.products 2024-04-01 09:14:19
INSERT dataplexa.reviews 2024-04-01 09:14:18
Replication lag per member:
localhost:27017 PRIMARY
localhost:27018 lag: 0.1 seconds
localhost:27019 lag: 0.3 seconds
- Oplog entries are idempotent — replaying the same entry multiple times produces the same result. This is what makes replication safe across network interruptions
- A replication lag of under one second is healthy — lag above 10 seconds indicates the secondary is struggling to keep up, possibly due to high write load or insufficient hardware
- Size the oplog generously — a larger oplog gives secondaries more time to catch up after a network interruption before needing a full resync. By default, MongoDB sizes the oplog at 5% of free disk space (WiredTiger caps it between 990 MB and 50 GB)
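To make "size it generously" concrete, here is a rough back-of-envelope sizing sketch. The retention formula is a simplification assumed here (oplog churn is rarely constant), and the `replSetResizeOplog` command shown in the comment requires MongoDB 3.6+ and a live replica set:

```python
# Estimate the oplog size needed to retain a target window of write
# history: retention_hours ≈ oplog_size_mb / churn_mb_per_hour, so:

def required_oplog_mb(churn_mb_per_hour: float, target_hours: float) -> float:
    """Oplog size (MB) needed to keep target_hours of write history."""
    return churn_mb_per_hour * target_hours

# Example: 120 MB/hour of oplog churn, 24 hours of desired retention
needed_mb = required_oplog_mb(churn_mb_per_hour=120.0, target_hours=24.0)
print(f"Need roughly {needed_mb:,.0f} MB of oplog for 24h of retention")

# To resize online, run against each member in turn (secondaries first):
# client.admin.command("replSetResizeOplog", 1, size=float(needed_mb))
```

Measure the actual churn rate from the oplog's `collStats` over a known time window before trusting any estimate like this.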
3. Elections and Automatic Failover
When the primary becomes unavailable — server crash, network partition, maintenance shutdown — the remaining members hold an election to choose a new primary. The member with the most up-to-date oplog and the highest priority wins. The entire election process typically completes in 10–30 seconds. During this window, writes are rejected — the driver buffers or fails them depending on configuration.
# Elections and failover — understanding the process and handling it in code
from pymongo import MongoClient
from pymongo.errors import (
    NotPrimaryError,
    ConnectionFailure,
    ServerSelectionTimeoutError
)
import time

client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0",
    # How long the driver waits for a primary before raising an error
    serverSelectionTimeoutMS=5000,
    # Retry writes automatically once after a network error or failover
    retryWrites=True
)
db = client["dataplexa"]

def write_with_failover_handling(document: dict, max_attempts: int = 3) -> bool:
    """
    Write a document with proper failover handling.
    retryWrites=True handles most cases automatically —
    this shows explicit handling for remaining edge cases.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            db.orders.insert_one(document)
            return True
        except NotPrimaryError:
            # Current connection target is no longer the primary
            # Driver will automatically discover the new primary
            print(f"  Attempt {attempt}: not primary — waiting for election...")
            time.sleep(2)
        except ServerSelectionTimeoutError:
            # No primary available within serverSelectionTimeoutMS
            print(f"  Attempt {attempt}: no primary available yet — retrying...")
            time.sleep(3)
        except ConnectionFailure as e:
            print(f"  Attempt {attempt}: connection failure — {e}")
            time.sleep(2)
    return False

# Election timeline explanation
print("Automatic failover timeline:\n")
timeline = [
    ("0s", "Primary stops responding (crash / network / maintenance)"),
    ("0–2s", "Secondaries detect heartbeat timeout (default: 10s)"),
    ("2–10s", "Secondaries declare primary unreachable after missed heartbeats"),
    ("10s", "Election triggered — candidates request votes from all members"),
    ("10–30s", "New primary elected by majority vote — begins accepting writes"),
    ("30s+", "Old primary (if it recovers) steps down and rejoins as secondary"),
]
for t, event in timeline:
    print(f"  {t:8} {event}")

# retryWrites behaviour
print("\nretryWrites=True (default in Atlas connection strings):")
print("  ✓ Automatically retries once after a network error or primary failover")
print("  ✓ Safe for idempotent operations: insert, update, replace, delete")
print("  ✗ Does NOT retry multi-document transactions automatically")
print("  ✗ Does NOT retry non-idempotent operations like $inc mid-transaction")
Automatic failover timeline:
0s Primary stops responding (crash / network / maintenance)
0–2s Secondaries detect heartbeat timeout (default: 10s)
2–10s Secondaries declare primary unreachable after missed heartbeats
10s Election triggered — candidates request votes from all members
10–30s New primary elected by majority vote — begins accepting writes
30s+ Old primary (if it recovers) steps down and rejoins as secondary
retryWrites=True (default in Atlas connection strings):
✓ Automatically retries once after a network error or primary failover
✓ Safe for idempotent operations: insert, update, replace, delete
✗ Does NOT retry multi-document transactions automatically
✗ Does NOT retry non-idempotent operations like $inc mid-transaction
- Always set `retryWrites=True` — it is the default in MongoDB Atlas connection strings but must be set explicitly for self-hosted connections to get automatic single-operation retry on failover
- Elections require a majority of voting members — a three-member set can survive one member failing (2 of 3 = majority). A two-member set cannot hold an election if one fails
- The `serverSelectionTimeoutMS` setting controls how long a PyMongo operation waits before raising `ServerSelectionTimeoutError` — set it to a value your application can tolerate, not too short
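The fixed `time.sleep(2)` delays in the retry handler above can be replaced with exponential backoff, so a burst of retries naturally stretches across a full 10–30 second election window. A small sketch — the base, factor, and attempt count are illustrative choices, not a MongoDB-prescribed policy:

```python
# Exponential backoff schedule for failover retries: each attempt waits
# twice as long as the previous one, so five attempts span ~15 seconds
# without hammering the cluster during an election.

def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   attempts: int = 5) -> list[float]:
    """Delay (in seconds) to sleep before each retry attempt."""
    return [base * factor ** i for i in range(attempts)]

delays = backoff_delays()
print("Retry delays:", delays)
print(f"Total wait if every retry is needed: {sum(delays):.1f}s")
```

In production you would usually add random jitter to each delay so that many clients retrying after the same failover do not all reconnect at the same instant.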
4. Read Preferences — Distributing Reads to Secondaries
By default all reads go to the primary. Read preferences allow you to route reads to secondary members instead — reducing load on the primary and using the hardware of secondary servers that would otherwise be idle. The trade-off is that secondaries may serve slightly stale data due to replication lag.
# Read preferences — routing reads to secondaries
from pymongo import MongoClient, ReadPreference
from pymongo.read_preferences import Secondary

client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0"
)
db = client["dataplexa"]

# READ PREFERENCE MODES
print("Read preference modes:\n")
modes = {
    "primary": "Always read from primary (default). Strongly consistent.",
    "primaryPreferred": "Read from primary if available, fall back to secondary.",
    "secondary": "Always read from a secondary. May return stale data.",
    "secondaryPreferred": "Read from secondary if available, fall back to primary.",
    "nearest": "Read from the member with lowest network latency.",
}
for mode, desc in modes.items():
    print(f"  {mode:22} {desc}")

# Apply read preference at the collection level

# Primary — for financial data that must be current
orders_primary = db.get_collection(
    "orders",
    read_preference=ReadPreference.PRIMARY
)

# Secondary — for reporting queries (can tolerate slight staleness)
orders_secondary = db.get_collection(
    "orders",
    read_preference=ReadPreference.SECONDARY
)

# secondaryPreferred — for product catalogue (high read volume, tolerate lag)
products_sec_pref = db.get_collection(
    "products",
    read_preference=ReadPreference.SECONDARY_PREFERRED
)

# Nearest — for globally distributed apps (lowest latency regardless of role)
reviews_nearest = db.get_collection(
    "reviews",
    read_preference=ReadPreference.NEAREST
)

# Reading from a secondary with bounded staleness —
# reject secondaries that are more than 90 seconds behind the primary
orders_bounded = db.get_collection(
    "orders",
    read_preference=Secondary(max_staleness=90)  # seconds
)

print("\nRead preference applied per collection:")
print("  orders (financial):   PRIMARY — always current")
print("  orders (reporting):   SECONDARY — tolerate slight lag")
print("  products (catalogue): SECONDARY_PREFERRED — high read volume")
print("  reviews (global):     NEAREST — lowest latency")
print("  orders (bounded):     SECONDARY ≤90s lag — staleness bounded")

# Run a reporting query on a secondary — total revenue by month
results = list(orders_secondary.aggregate([
    {"$group": {
        "_id": {"$month": "$date"},
        "revenue": {"$sum": "$total"},
        "count": {"$sum": 1}
    }},
    {"$sort": {"_id": 1}},
    {"$project": {"month": "$_id", "revenue": {"$round": ["$revenue", 2]},
                  "count": 1, "_id": 0}}
]))
print("\nMonthly revenue (read from secondary):")
for r in results:
    print(f"  Month {r['month']:2}  orders: {r['count']}  revenue: ${r['revenue']:.2f}")
Read preference modes:

primary Always read from primary (default). Strongly consistent.
primaryPreferred Read from primary if available, fall back to secondary.
secondary Always read from a secondary. May return stale data.
secondaryPreferred Read from secondary if available, fall back to primary.
nearest Read from the member with lowest network latency.
Read preference applied per collection:
orders (financial): PRIMARY — always current
orders (reporting): SECONDARY — tolerate slight lag
products (catalogue): SECONDARY_PREFERRED — high read volume
reviews (global): NEAREST — lowest latency
orders (bounded): SECONDARY ≤90s lag — staleness bounded
Monthly revenue (read from secondary):
Month 1 orders: 1 revenue: $44.96
Month 2 orders: 3 revenue: $189.97
Month 3 orders: 2 revenue: $679.97
Month 4 orders: 1 revenue: $11.97
- Never use `secondary` read preference for financial data, inventory counts, or any query where stale results cause incorrect application behaviour — always use `primary` for these
- Use `maxStalenessSeconds` to bound how stale secondary reads can be — MongoDB rejects secondaries that exceed the threshold and falls back to a fresher member
- `nearest` read preference is designed for geographically distributed replica sets — it routes each read to the closest data centre regardless of primary/secondary role
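One non-obvious constraint: drivers do not accept arbitrarily small staleness bounds. Per the driver max-staleness specification, the value must be at least 90 seconds and at least the heartbeat interval plus a 10-second idle-write period. The check below is a simplified sketch of that rule, not PyMongo's actual validation code:

```python
# Simplified model of the driver-side lower bound on maxStalenessSeconds.

def min_allowed_staleness(heartbeat_frequency_s: float = 10.0) -> float:
    """Smallest maxStalenessSeconds a driver will accept (simplified)."""
    idle_write_period_s = 10.0  # primary writes a no-op at least this often
    return max(90.0, heartbeat_frequency_s + idle_write_period_s)

def is_valid_staleness(max_staleness_s: float,
                       heartbeat_frequency_s: float = 10.0) -> bool:
    """Would a driver accept this staleness bound? (simplified)"""
    return max_staleness_s >= min_allowed_staleness(heartbeat_frequency_s)

print(is_valid_staleness(90))   # 90s is the floor, so this is accepted
print(is_valid_staleness(30))   # below the floor — a driver would reject it
```

This is why the bounded example above uses `max_staleness=90` rather than something tighter like 5 seconds — a tighter bound simply is not allowed.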
5. Write Concern and Replication Acknowledgement
Write concern controls how many replica set members must acknowledge a write before MongoDB reports success to the application. A higher write concern means stronger durability guarantees — the write has been committed on more members — at the cost of slightly higher latency.
# Write concern — controlling replication acknowledgement level
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0"
)
db = client["dataplexa"]

# Write concern levels
print("Write concern options:\n")
wc_levels = [
    ("w=0", "Fire and forget — no acknowledgement. Fastest, no durability."),
    ("w=1", "Primary acknowledges (default). Fast, survives primary restart."),
    ("w=2", "Primary + 1 secondary acknowledge. Survives one member failure."),
    ("w='majority'", "Majority of voting members acknowledge. Strongest guarantee."),
    ("j=True", "Journal flush required before acknowledgement. Crash-safe."),
    ("wtimeout", "Milliseconds to wait for acknowledgement before error."),
]
for level, desc in wc_levels:
    print(f"  {level:16} {desc}")

# Apply write concern at the collection level

# Critical financial operation — majority write concern with journal
orders_safe = db.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", j=True, wtimeout=5000)
)

# High-throughput logging — w=1, no journal (speed over durability)
events_fast = db.get_collection(
    "events",
    write_concern=WriteConcern(w=1, j=False)
)

# Insert an order with majority write concern
result = orders_safe.insert_one({
    "_id": "o_wc_demo",
    "user_id": "u001",
    "status": "processing",
    "total": 59.99,
    "date": "2024-04-05"
})
print(f"\nOrder inserted with w=majority acknowledged: {result.acknowledged}")

# Read concern — consistency level for reads
print("\nRead concern levels:")
rc_levels = [
    ("local", "Returns most recent data on the queried member (default)"),
    ("majority", "Returns data acknowledged by majority — no rollback risk"),
    ("snapshot", "Reads a consistent snapshot — used inside transactions"),
    ("available", "Fastest — may return data that has been rolled back (shards only)"),
]
for level, desc in rc_levels:
    print(f"  {level:12} {desc}")

# Clean up demo document
db.orders.delete_one({"_id": "o_wc_demo"})
Write concern options:

w=0 Fire and forget — no acknowledgement. Fastest, no durability.
w=1 Primary acknowledges (default). Fast, survives primary restart.
w=2 Primary + 1 secondary acknowledge. Survives one member failure.
w='majority' Majority of voting members acknowledge. Strongest guarantee.
j=True Journal flush required before acknowledgement. Crash-safe.
wtimeout Milliseconds to wait for acknowledgement before error.
Order inserted with w=majority acknowledged: True
Read concern levels:
local Returns most recent data on the queried member (default)
majority Returns data acknowledged by majority — no rollback risk
snapshot Reads a consistent snapshot — used inside transactions
available Fastest — may return data that has been rolled back (shards only)
- Use `w="majority"` for any write that must survive a primary failover — with `w=1`, a write acknowledged by the primary but not yet replicated to a secondary can be rolled back if the primary crashes before replicating
- Combining `w="majority"` and `j=True` gives the strongest durability guarantee — the write is on disk on a majority of members. This is the recommended setting for financial data
- Always set `wtimeout` — without it, a write concern that cannot be satisfied (e.g. a secondary is down and you need `w=2`) will block indefinitely
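As a quick mental model of the levels above, the acknowledgement count each `w` value demands can be sketched like this. It is a simplified model: real servers count only data-bearing members (arbiters vote but hold no data, so they can never acknowledge a write):

```python
# How many members must confirm a write before the driver reports
# success, for each write concern level (simplified model).

def acks_required(w, data_bearing_voters: int = 3) -> int:
    """Members that must acknowledge a write under write concern w."""
    if w == "majority":
        return data_bearing_voters // 2 + 1
    return int(w)  # w=0 means no acknowledgement is awaited at all

for w in (0, 1, 2, "majority"):
    print(f"w={w!r:12} -> {acks_required(w)} member(s) must acknowledge")
```

Note that in a three-member set, `w=2` and `w="majority"` demand the same number of acknowledgements — but `"majority"` scales correctly if the set later grows to five members, which is why it is preferred over a hard-coded number.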
Summary Table
| Concept | What It Does | Key Rule |
|---|---|---|
| Replica set | Group of MongoDB instances with the same data | Always use an odd number of voting members |
| Primary | Only member that accepts writes | Elected by majority vote — one at a time |
| Oplog | Capped collection of idempotent write operations | Size it generously — small oplog forces full resyncs |
| Election / Failover | Auto-selects new primary in 10–30 seconds | Set retryWrites=True to handle failover transparently |
| Read preference | Routes reads to primary or secondary | Use primary for consistent data, secondary for reports |
| w="majority" | Write acknowledged by majority of members | Strongest durability — combine with j=True |
| Replication lag | How far behind a secondary is from the primary | Under 1 second is healthy — above 10 seconds investigate |
| maxStalenessSeconds | Rejects secondaries lagging beyond a threshold | Always set when using secondary read preferences |
Practice Questions
Practice 1. Why must a replica set have an odd number of voting members?
Practice 2. What is the oplog and why does its size matter?
Practice 3. For which types of queries is it safe to use secondary read preference and for which is it not?
Practice 4. What is the risk of using w=1 write concern for a critical financial write?
Practice 5. How long does a typical MongoDB replica set election take, and what happens to write operations during that window?
Quiz
Quiz 1. In a three-member replica set, how many members must fail before the set loses the ability to elect a new primary?
Quiz 2. What read preference should you use for a monthly sales report that runs on a secondary to reduce load on the primary?
Quiz 3. What does retryWrites=True do and what does it NOT handle?
Quiz 4. What is the purpose of an arbiter in a replica set?
Quiz 5. What write concern combination provides the strongest durability guarantee for a financial transaction?
Next up — Sharding: How MongoDB distributes data across multiple servers, choosing a shard key, and scaling writes horizontally beyond a single replica set.