MongoDB Lesson 33 – Replication | Dataplexa

Replication

A single MongoDB server is a single point of failure — if it crashes, goes offline for maintenance, or loses its storage, your application loses access to its data. Replication solves this by maintaining multiple copies of the same data on different servers simultaneously. MongoDB's replication mechanism is called a replica set — a group of MongoDB instances that hold the same data, automatically elect a new primary when the current one fails, and recover without manual intervention. Replication provides three things at once: high availability through automatic failover, data redundancy through multiple copies, and read scalability by routing read queries to secondary members. This lesson covers how replica sets work internally, how elections and failover happen, how to configure read preferences, and how to monitor replica set health from PyMongo.

1. Replica Set Architecture

A replica set consists of one primary and one or more secondaries. All writes go to the primary, which records every change in an operations log called the oplog. Secondaries continuously tail the primary's oplog and replay each operation on their own copy of the data; the delay between a write landing on the primary and being replayed on a secondary is called replication lag. A typical production replica set has three members: one primary and two secondaries.

# Replica set architecture — connecting and inspecting member roles

from pymongo import MongoClient

# Connect to a replica set — provide all members in the connection string
# MongoDB automatically discovers all members from any one of them
client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0"    # replica set name must match the configured name
)

db = client["dataplexa"]

# Check which member is currently the primary
hello = db.command("hello")   # modern replacement for isMaster
print("Replica set status from hello command:")
print(f"  isWritablePrimary: {hello.get('isWritablePrimary', False)}")
print(f"  primary:           {hello.get('primary', 'unknown')}")
print(f"  hosts:             {hello.get('hosts', [])}")
print(f"  setName:           {hello.get('setName', 'unknown')}")
print(f"  me:                {hello.get('me', 'unknown')}")

# Full replica set status — all member details
rs_status = client.admin.command("replSetGetStatus")
print(f"\nReplica set: {rs_status['set']}")
print(f"Members ({len(rs_status['members'])}):")
for member in rs_status["members"]:
    state_str = member.get("stateStr", "UNKNOWN")
    name      = member.get("name", "")
    health    = member.get("health", 0)
    print(f"  {name:30}  state: {state_str:10}  health: {health:.0f}")

# Replica set roles
print("\nReplica set member roles:")
roles = [
    ("Primary",   "Accepts all writes. One at a time — elected by majority vote"),
    ("Secondary", "Replicates from primary oplog. Can serve reads if configured"),
    ("Arbiter",   "Votes in elections but holds no data — lightweight tie-breaker"),
    ("Hidden",    "Replicates data but invisible to clients — for backups/reporting"),
    ("Delayed",   "Replicates with a time delay — protection against accidental deletes"),
]
for role, desc in roles:
    print(f"  {role:10}  {desc}")
Replica set status from hello command:
isWritablePrimary: True
primary: localhost:27017
hosts: ['localhost:27017', 'localhost:27018', 'localhost:27019']
setName: rs0
me: localhost:27017

Replica set: rs0
Members (3):
localhost:27017 state: PRIMARY health: 1
localhost:27018 state: SECONDARY health: 1
localhost:27019 state: SECONDARY health: 1

Replica set member roles:
Primary Accepts all writes. One at a time — elected by majority vote
Secondary Replicates from primary oplog. Can serve reads if configured
Arbiter Votes in elections but holds no data — lightweight tie-breaker
Hidden Replicates data but invisible to clients — for backups/reporting
Delayed Replicates with a time delay — protection against accidental deletes
  • List several replica set members in the connection string: the driver discovers the full set from any reachable seed, but if your only listed host is down when the application starts, it has nothing to discover from
  • A replica set requires an odd number of voting members (typically 3 or 5) so elections always produce a clear majority winner — with an even number, a tie is possible and no primary can be elected
  • The hello command is the modern replacement for the deprecated isMaster command — both return the same information but hello is the current standard
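The odd-member rule above is simple arithmetic. A quick sketch (not part of the lesson code) makes it concrete:

```python
# Majority threshold and fault tolerance for n voting members.
# An even member count raises the threshold without letting you
# lose any more members, which is why odd sizes are recommended.

def election_math(voting_members: int) -> dict:
    """Votes needed to elect a primary, and how many members may fail."""
    majority = voting_members // 2 + 1
    return {
        "majority": majority,
        "fault_tolerance": voting_members - majority,
    }

for n in (3, 4, 5, 6, 7):
    m = election_math(n)
    print(f"  {n} voting members  majority: {m['majority']}  "
          f"can lose: {m['fault_tolerance']}")
```

Three and four members both tolerate exactly one failure, so a fourth member adds cost and replication traffic without adding resilience.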

2. The Oplog — How Replication Works Internally

The oplog (operations log) is a special capped collection on every replica set member stored in the local database. Every write to the primary — insert, update, delete — is recorded as an idempotent operation in the oplog. Secondaries tail this oplog continuously and replay each entry on their own data copy. Because it is a capped collection, older entries are overwritten when it fills up — if a secondary falls too far behind and its last replicated entry has been overwritten, it needs a full resync.

# Reading the oplog — understanding what replication tracks

from pymongo import MongoClient, DESCENDING

client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0"
)

local_db = client["local"]   # oplog lives in the local database

# Check oplog size and current usage
oplog_stats = local_db.command("collStats", "oplog.rs")
oplog_size_mb  = oplog_stats.get("maxSize",  0) / (1024 * 1024)
oplog_used_mb  = oplog_stats.get("size",     0) / (1024 * 1024)
oplog_count    = oplog_stats.get("count",    0)
print(f"Oplog statistics:")
print(f"  Max size:   {oplog_size_mb:,.0f} MB")
print(f"  Used:       {oplog_used_mb:.1f} MB")
print(f"  Operations: {oplog_count:,}")

# Read the most recent oplog entries
print("\nMost recent oplog entries:")
entries = local_db["oplog.rs"].find(
    {},
    {"op": 1, "ns": 1, "o": 1, "ts": 1, "wall": 1}
).sort("$natural", DESCENDING).limit(5)

op_names = {
    "i": "INSERT",
    "u": "UPDATE",
    "d": "DELETE",
    "c": "COMMAND",
    "n": "NOOP"
}
for entry in entries:
    op   = op_names.get(entry.get("op", "?"), entry.get("op", "?"))
    ns   = entry.get("ns", "")
    wall = entry.get("wall", "")
    print(f"  {op:8}  {ns:35}  {str(wall)[:19]}")

# Replication lag — how far behind each secondary is
print("\nReplication lag per member:")
rs_status = client.admin.command("replSetGetStatus")
primary_optime = None
for m in rs_status["members"]:
    if m.get("stateStr") == "PRIMARY":
        primary_optime = m.get("optimeDate")
        break

for member in rs_status["members"]:
    state     = member.get("stateStr", "")
    name      = member.get("name", "")
    optime    = member.get("optimeDate")
    if state == "SECONDARY" and primary_optime and optime:
        lag_secs = (primary_optime - optime).total_seconds()
        print(f"  {name:30}  lag: {lag_secs:.1f} seconds")
    else:
        print(f"  {name:30}  {state}")
Oplog statistics:
Max size: 1,024 MB
Used: 2.3 MB
Operations: 1,847

Most recent oplog entries:
INSERT dataplexa.orders 2024-04-01 09:14:22
UPDATE dataplexa.products 2024-04-01 09:14:21
INSERT dataplexa.orders 2024-04-01 09:14:20
UPDATE dataplexa.products 2024-04-01 09:14:19
INSERT dataplexa.reviews 2024-04-01 09:14:18

Replication lag per member:
localhost:27017 PRIMARY
localhost:27018 lag: 0.1 seconds
localhost:27019 lag: 0.3 seconds
  • Oplog entries are idempotent — replaying the same entry multiple times produces the same result. This is what makes replication safe across network interruptions
  • A replication lag of under one second is healthy — lag above 10 seconds indicates the secondary is struggling to keep up, possibly due to high write load or insufficient hardware
  • Size the oplog generously — a larger oplog gives secondaries more time to catch up after a network interruption before needing a full resync. The default is 5% of free disk space, with a minimum of 990 MB and a maximum of 50 GB
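A number worth tracking alongside lag is the oplog window: the time span between the oldest and newest oplog entries, which is how long a secondary can be offline before a full resync is needed. A small sketch, using hypothetical timestamps in place of the real first and last documents of local.oplog.rs:

```python
from datetime import datetime

def oplog_window_hours(first_entry: datetime, last_entry: datetime) -> float:
    """Span of time the oplog currently covers, in hours."""
    return (last_entry - first_entry).total_seconds() / 3600

# Hypothetical sample values — on a live set, read the "wall" field of the
# oldest and newest documents in local.oplog.rs instead
first = datetime(2024, 4, 1, 0, 0, 0)   # oldest entry still in the oplog
last  = datetime(2024, 4, 3, 12, 0, 0)  # newest entry
print(f"Oplog window: {oplog_window_hours(first, last):.1f} hours")
```

If the window shrinks toward your longest plausible maintenance or outage duration, grow the oplog before a secondary falls off the end of it.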

3. Elections and Automatic Failover

When the primary becomes unavailable (server crash, network partition, maintenance shutdown), the remaining members hold an election to choose a new primary. The member with the most up-to-date oplog and the highest priority wins. The entire election process typically completes in 10–30 seconds. During this window no member accepts writes; the driver holds new operations while it searches for a primary and raises an error if none is found within serverSelectionTimeoutMS.

# Elections and failover — understanding the process and handling it in code

from pymongo import MongoClient
from pymongo.errors import (
    NotPrimaryError,
    ConnectionFailure,
    ServerSelectionTimeoutError
)
import time

client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0",
    # How long the driver waits for a primary before raising an error
    serverSelectionTimeoutMS=5000,
    # Retry writes automatically once after a network error or failover
    retryWrites=True
)

db = client["dataplexa"]

def write_with_failover_handling(document: dict, max_attempts: int = 3) -> bool:
    """
    Write a document with proper failover handling.
    retryWrites=True handles most cases automatically —
    this shows explicit handling for remaining edge cases.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            db.orders.insert_one(document)
            return True

        except NotPrimaryError:
            # Current connection target is no longer the primary
            # Driver will automatically discover the new primary
            print(f"  Attempt {attempt}: not primary — waiting for election...")
            time.sleep(2)

        except ServerSelectionTimeoutError:
            # No primary available within serverSelectionTimeoutMS
            print(f"  Attempt {attempt}: no primary available yet — retrying...")
            time.sleep(3)

        except ConnectionFailure as e:
            print(f"  Attempt {attempt}: connection failure — {e}")
            time.sleep(2)

    return False

# Election timeline explanation
print("Automatic failover timeline:\n")
timeline = [
    ("0s",    "Primary stops responding (crash / network / maintenance)"),
    ("0–2s",  "Secondaries miss a heartbeat (heartbeats are sent every 2 seconds)"),
    ("2–10s", "electionTimeoutMillis (default 10s) elapses with no contact from primary"),
    ("10s",   "Election triggered — candidates request votes from all members"),
    ("10–30s","New primary elected by majority vote — begins accepting writes"),
    ("30s+",  "Old primary (if it recovers) steps down and rejoins as secondary"),
]
for t, event in timeline:
    print(f"  {t:8}  {event}")

# retryWrites behaviour
print("\nretryWrites=True (the default in modern drivers and Atlas URIs):")
print("  ✓ Automatically retries once after a network error or primary failover")
print("  ✓ Covers single-document writes: insert_one, update_one, replace_one, delete_one")
print("  ✗ Does NOT cover multi-document writes: update_many, delete_many")
print("  ✗ Does NOT retry a whole transaction (use with_transaction for that)")
Automatic failover timeline:

0s Primary stops responding (crash / network / maintenance)
0–2s Secondaries miss a heartbeat (heartbeats are sent every 2 seconds)
2–10s electionTimeoutMillis (default 10s) elapses with no contact from primary
10s Election triggered — candidates request votes from all members
10–30s New primary elected by majority vote — begins accepting writes
30s+ Old primary (if it recovers) steps down and rejoins as secondary

retryWrites=True (the default in modern drivers and Atlas URIs):
✓ Automatically retries once after a network error or primary failover
✓ Covers single-document writes: insert_one, update_one, replace_one, delete_one
✗ Does NOT cover multi-document writes: update_many, delete_many
✗ Does NOT retry a whole transaction (use with_transaction for that)
  • retryWrites=True is the default in modern PyMongo versions and in Atlas connection strings; leave it enabled so single-document writes are retried once automatically after a failover
  • Elections require a majority of voting members — a three-member set can survive one member failing (2 of 3 = majority). A two-member set cannot hold an election if one fails
  • The serverSelectionTimeoutMS setting controls how long a PyMongo operation waits for a suitable server before raising ServerSelectionTimeoutError (default: 30 seconds). Set it to the longest delay your application can tolerate
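The fixed two- and three-second sleeps in the retry loop above can be generalised to exponential backoff, which is gentler on a cluster mid-election. A minimal, library-agnostic sketch; flaky_write is a stand-in that simulates a write failing twice during an election window:

```python
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5):
    """Call operation(); on failure, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts — propagate
            time.sleep(base_delay * (2 ** attempt))

# Demo: an operation that fails twice, then succeeds
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("no primary yet")
    return "acknowledged"

print(retry_with_backoff(flaky_write, base_delay=0.01))
```

In real code, catch the specific PyMongo exceptions shown earlier rather than bare Exception, and cap the total retry budget below your request timeout.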

4. Read Preferences — Distributing Reads to Secondaries

By default all reads go to the primary. Read preferences allow you to route reads to secondary members instead — reducing load on the primary and using the hardware of secondary servers that would otherwise be idle. The trade-off is that secondaries may serve slightly stale data due to replication lag.

# Read preferences — routing reads to secondaries

from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0"
)

db = client["dataplexa"]

# READ PREFERENCE MODES
print("Read preference modes:\n")
modes = {
    "primary":            "Always read from primary (default). Strongly consistent.",
    "primaryPreferred":   "Read from primary if available, fall back to secondary.",
    "secondary":          "Always read from a secondary. May return stale data.",
    "secondaryPreferred": "Read from secondary if available, fall back to primary.",
    "nearest":            "Read from the member with lowest network latency.",
}
for mode, desc in modes.items():
    print(f"  {mode:22}  {desc}")

# Apply read preference at the collection level
# Primary — for financial data that must be current
orders_primary = db.get_collection(
    "orders",
    read_preference=ReadPreference.PRIMARY
)

# Secondary — for reporting queries (can tolerate slight staleness)
orders_secondary = db.get_collection(
    "orders",
    read_preference=ReadPreference.SECONDARY
)

# secondaryPreferred — for product catalogue (high read volume, tolerate lag)
products_sec_pref = db.get_collection(
    "products",
    read_preference=ReadPreference.SECONDARY_PREFERRED
)

# Nearest — for globally distributed apps (lowest latency regardless of role)
reviews_nearest = db.get_collection(
    "reviews",
    read_preference=ReadPreference.NEAREST
)

# Demonstrate reading from secondary with maxStalenessSeconds
# Reject secondaries that are more than 90 seconds behind the primary
from pymongo.read_preferences import Secondary

orders_bounded = db.get_collection(
    "orders",
    read_preference=Secondary(max_staleness=90)   # seconds
)

print("\nRead preference applied per collection:")
print("  orders (financial):   PRIMARY             — always current")
print("  orders (reporting):   SECONDARY           — tolerate slight lag")
print("  products (catalogue): SECONDARY_PREFERRED — high read volume")
print("  reviews (global):     NEAREST             — lowest latency")
print("  orders (bounded):     SECONDARY ≤90s lag  — staleness bounded")

# Run a reporting query on secondary — total revenue by month
results = list(orders_secondary.aggregate([
    {"$group": {
        "_id":     {"$month": "$date"},
        "revenue": {"$sum": "$total"},
        "count":   {"$sum": 1}
    }},
    {"$sort": {"_id": 1}},
    {"$project": {"month": "$_id", "revenue": {"$round": ["$revenue",2]},
                  "count": 1, "_id": 0}}
]))
print("\nMonthly revenue (read from secondary):")
for r in results:
    print(f"  Month {r['month']:2}  orders: {r['count']}  revenue: ${r['revenue']:.2f}")
Read preference modes:

primary Always read from primary (default). Strongly consistent.
primaryPreferred Read from primary if available, fall back to secondary.
secondary Always read from a secondary. May return stale data.
secondaryPreferred Read from secondary if available, fall back to primary.
nearest Read from the member with lowest network latency.

Read preference applied per collection:
orders (financial): PRIMARY — always current
orders (reporting): SECONDARY — tolerate slight lag
products (catalogue): SECONDARY_PREFERRED — high read volume
reviews (global): NEAREST — lowest latency
orders (bounded): SECONDARY ≤90s lag — staleness bounded

Monthly revenue (read from secondary):
Month 1 orders: 1 revenue: $44.96
Month 2 orders: 3 revenue: $189.97
Month 3 orders: 2 revenue: $679.97
Month 4 orders: 1 revenue: $11.97
  • Never use secondary read preference for financial data, inventory counts, or any query where stale results cause incorrect application behaviour — always use primary for these
  • Use maxStalenessSeconds to bound how stale secondary reads can be — MongoDB rejects secondaries that exceed the threshold and falls back to a fresher member
  • nearest read preference is designed for geographically distributed replica sets — it routes each read to the closest data centre regardless of primary/secondary role
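Read preferences can also be supplied as connection string options instead of per collection (readPreference and maxStalenessSeconds are standard MongoDB URI options). A small sketch that only builds the URI string, so it runs without a server; the hosts and set name are the lesson's local examples:

```python
def replica_uri(hosts, replica_set, read_pref, max_staleness=None):
    """Build a replica set connection URI with read preference options."""
    options = [f"replicaSet={replica_set}", f"readPreference={read_pref}"]
    if max_staleness is not None:
        options.append(f"maxStalenessSeconds={max_staleness}")
    return f"mongodb://{','.join(hosts)}/?{'&'.join(options)}"

uri = replica_uri(
    ["localhost:27017", "localhost:27018", "localhost:27019"],
    "rs0", "secondaryPreferred", max_staleness=120,
)
print(uri)
```

Setting the preference in the URI applies it to every operation on the client, so the per-collection approach shown above is usually the better fit when different collections need different consistency.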

5. Write Concern and Replication Acknowledgement

Write concern controls how many replica set members must acknowledge a write before MongoDB reports success to the application. A higher write concern means stronger durability guarantees — the write has been committed on more members — at the cost of slightly higher latency.

# Write concern — controlling replication acknowledgement level

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient(
    "mongodb://localhost:27017,localhost:27018,localhost:27019/",
    replicaSet="rs0"
)

db = client["dataplexa"]

# Write concern levels
print("Write concern options:\n")
wc_levels = [
    ("w=0",          "Fire and forget — no acknowledgement. Fastest, no durability."),
    ("w=1",          "Primary alone acknowledges. Fast, but may roll back on failover."),
    ("w=2",          "Primary + 1 secondary acknowledge. Survives one member failure."),
    ("w='majority'", "Majority of voting members acknowledge. Default since MongoDB 5.0."),
    ("j=True",       "Journal flush required before acknowledgement. Crash-safe."),
    ("wtimeout",     "Milliseconds to wait for acknowledgement before error."),
]
for level, desc in wc_levels:
    print(f"  {level:16}  {desc}")

# Apply write concern at the collection level
# Critical financial operation — majority write concern with journal
orders_safe = db.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", j=True, wtimeout=5000)
)

# High-throughput logging — w=1, no journal (speed over durability)
events_fast = db.get_collection(
    "events",
    write_concern=WriteConcern(w=1, j=False)
)

# Insert an order with majority write concern
result = orders_safe.insert_one({
    "_id":     "o_wc_demo",
    "user_id": "u001",
    "status":  "processing",
    "total":   59.99,
    "date":    "2024-04-05"
})
print(f"\nOrder inserted with w=majority acknowledged: {result.acknowledged}")

# Read concern — consistency level for reads
print("\nRead concern levels:")
rc_levels = [
    ("local",    "Returns most recent data on the queried member (default)"),
    ("majority", "Returns data acknowledged by majority — no rollback risk"),
    ("snapshot", "Reads a consistent snapshot — used inside transactions"),
    ("available","Fastest — may return data that has been rolled back (shards only)"),
]
for level, desc in rc_levels:
    print(f"  {level:12}  {desc}")

# Clean up demo document
db.orders.delete_one({"_id": "o_wc_demo"})
Write concern options:

w=0 Fire and forget — no acknowledgement. Fastest, no durability.
w=1 Primary alone acknowledges. Fast, but may roll back on failover.
w=2 Primary + 1 secondary acknowledge. Survives one member failure.
w='majority' Majority of voting members acknowledge. Default since MongoDB 5.0.
j=True Journal flush required before acknowledgement. Crash-safe.
wtimeout Milliseconds to wait for acknowledgement before error.

Order inserted with w=majority acknowledged: True

Read concern levels:
local Returns most recent data on the queried member (default)
majority Returns data acknowledged by majority — no rollback risk
snapshot Reads a consistent snapshot — used inside transactions
available Fastest — may return data that has been rolled back (shards only)
  • Use w="majority" for any write that must survive a primary failover — with w=1, a write acknowledged by the primary but not yet replicated to a secondary can be rolled back if the primary crashes before replicating
  • Combining w="majority" and j=True gives the strongest durability guarantee — the write is on disk on a majority of members. This is the recommended setting for financial data
  • Always set wtimeout — without it, a write concern that cannot be satisfied (e.g. a secondary is down and you need w=2) will block indefinitely
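The wtimeout point can be illustrated with plain arithmetic: a write concern that demands more acknowledgements than there are reachable data-bearing members can never complete. A sketch with hypothetical member counts:

```python
def acks_required(w, voting_members: int) -> int:
    """Acknowledgements a write concern needs in a set of n voting members."""
    return voting_members // 2 + 1 if w == "majority" else int(w)

def satisfiable(w, voting_members: int, data_bearing_up: int) -> bool:
    """Can this write concern ever be satisfied right now?
    Only data-bearing members (not arbiters) can acknowledge writes."""
    return acks_required(w, voting_members) <= data_bearing_up

# 3-member PSA set (primary + secondary + arbiter): 2 data-bearing members
print(satisfiable(2, 3, 2))           # both data-bearing members up
print(satisfiable(2, 3, 1))           # secondary down — w=2 would block
print(satisfiable("majority", 3, 2))  # majority needs 2 acks
```

When satisfiable would be False, the write blocks until either a member comes back or wtimeout fires, which is exactly why the bullet above says to always set it.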

Summary Table

Concept              | What It Does                                      | Key Rule
Replica set          | Group of MongoDB instances with the same data     | Always use an odd number of voting members
Primary              | Only member that accepts writes                   | Elected by majority vote — one at a time
Oplog                | Capped collection of idempotent write operations  | Size it generously — small oplog forces full resyncs
Election / Failover  | Auto-selects new primary in 10–30 seconds         | Set retryWrites=True to handle failover transparently
Read preference      | Routes reads to primary or secondary              | Use primary for consistent data, secondary for reports
w="majority"         | Write acknowledged by majority of members         | Strongest durability — combine with j=True
Replication lag      | How far behind a secondary is from the primary    | Under 1 second is healthy — above 10 seconds investigate
maxStalenessSeconds  | Rejects secondaries lagging beyond a threshold    | Always set when using secondary read preferences

Practice Questions

Practice 1. Why must a replica set have an odd number of voting members?



Practice 2. What is the oplog and why does its size matter?



Practice 3. For which types of queries is it safe to use secondary read preference and for which is it not?



Practice 4. What is the risk of using w=1 write concern for a critical financial write?



Practice 5. How long does a typical MongoDB replica set election take, and what happens to write operations during that window?



Quiz

Quiz 1. In a three-member replica set, how many members must fail before the set loses the ability to elect a new primary?






Quiz 2. What read preference should you use for a monthly sales report that runs on a secondary to reduce load on the primary?






Quiz 3. What does retryWrites=True do and what does it NOT handle?






Quiz 4. What is the purpose of an arbiter in a replica set?






Quiz 5. What write concern combination provides the strongest durability guarantee for a financial transaction?






Next up — Sharding: How MongoDB distributes data across multiple servers, choosing a shard key, and scaling writes horizontally beyond a single replica set.