NoSQL
Replication Strategies
On 28 October 2021, Facebook went offline for six hours. The root cause was a configuration change that took down the backbone routers connecting their data centres — and because data was only in those data centres, everything went dark. Replication is how you ensure that when a machine, a rack, or an entire data centre fails, your data survives and your service keeps running. This lesson explains exactly how NoSQL databases replicate data and what happens when replicas disagree.
Why Replication Exists
Replication serves two distinct purposes that are often conflated. Understanding them separately helps you make better configuration decisions.
Fault Tolerance
If a node fails, a replica on a different node holds a copy of the data. Reads and writes continue without interruption. The system is said to be highly available — it keeps running despite node failures. This is why replication factor 1 is never acceptable in production.
Read Scalability
Multiple replicas of the same data can serve reads in parallel. A replication factor of 3 on a 6-node cluster means three nodes can answer queries for the same row simultaneously — tripling read throughput for hot data without any additional writes.
Replication Models — Leader-Based vs Leaderless
Every replication system answers one fundamental question: who is allowed to accept writes? The answer splits all databases into two camps.
Leader-Based Replication
↓ write
Leader (Primary)
↓ replicate
Follower 1 Follower 2
(read only) (read only)
One node accepts all writes. Followers replicate from the leader and serve reads. Used by: MongoDB, Redis, MySQL. Simple to reason about — only one place accepts writes. Bottleneck: the leader is a single write endpoint.
Leaderless Replication
↓ write to ALL (or quorum)
Node 1 Node 2 Node 3
(any node)
↑ read from quorum
Any node accepts writes. The client (or coordinator) writes to multiple nodes simultaneously. Used by: Cassandra, DynamoDB, Riak. No single write bottleneck. Consistency managed via quorum rules.
Replication Factor and Quorum — The Core Trade-off
In a leaderless system like Cassandra, two numbers control the consistency vs availability trade-off on every operation: W (write quorum — how many replicas must acknowledge a write) and R (read quorum — how many replicas must respond to a read). The replication factor is N (total copies).
Quorum Rules — N=3 examples
W=3, R=1
Strong consistency
All 3 replicas must ACK every write. Any single replica is guaranteed up-to-date. Reads are fast (1 node). Writes are slow — must wait for all 3. Unavailable if any node is down during a write.
W=2, R=2
Balanced (W+R > N)
W+R=4 > N=3, so at least one replica in every read set overlaps with every write set. Reads always see the latest write. One node can fail without losing availability. The standard production setting.
W=1, R=1
Maximum availability
Fastest possible reads and writes — one node ACK is enough. Two nodes can fail. But reads may return stale data — the replica you read may not have the latest write yet. Eventual consistency.
W=1, R=3
Write-optimised
Writes are fast (1 ACK). Reads must check all 3 replicas and return the most recent version. Used for write-heavy workloads where reads can afford to be slower. Rare in practice.
The magic rule: when W + R > N, you are guaranteed strong consistency — the read set and write set always share at least one replica.
Hands-on — Configuring Replication in Cassandra
The scenario: You are the infrastructure engineer for a fintech startup deploying a Cassandra cluster across two data centres — one in London and one in Frankfurt — for disaster recovery. A payment record written in London must be readable from Frankfurt even during a London outage. You are configuring the keyspace replication strategy and setting consistency levels appropriate for payment data.
-- NetworkTopologyStrategy: specify replicas per data centre
-- This gives you 3 copies in London, 3 in Frankfurt
-- If London goes entirely offline, Frankfurt has 3 healthy replicas
CREATE KEYSPACE payments
WITH replication = {
'class': 'NetworkTopologyStrategy',
'london': 3,
'frankfurt': 3
}
AND durable_writes = true;
-- Confirm the keyspace was created with correct replication
DESCRIBE KEYSPACE payments;
CREATE KEYSPACE payments WITH replication = {
'class': 'NetworkTopologyStrategy',
'frankfurt': '3',
'london': '3'
} AND durable_writes = true;
Replication factor summary:
Total replicas: 6 (3 per DC)
Can survive: loss of entire London DC or entire Frankfurt DC
durable_writes: true — commit log written before ACKNetworkTopologyStrategy vs SimpleStrategy
SimpleStrategy places replicas on the next N nodes clockwise on the ring with no awareness of racks or data centres — never use it in production. NetworkTopologyStrategy is rack-aware and DC-aware: it ensures replicas are spread across different racks within a DC and different DCs across regions. If all 3 London replicas landed on the same rack, a single rack power failure would lose all London copies.
durable_writes = true
Forces Cassandra to write to the commit log before acknowledging a write. For payment data this is non-negotiable — if a node crashes between writing to the memtable (RAM) and flushing to disk, the commit log is replayed on restart to recover the write. Setting this to false speeds up writes by 20–30% but risks data loss on node crash — acceptable only for ephemeral cache-like data.
The scenario continues: With the keyspace created, you now configure the consistency level for payment writes and reads. Payment records must never be lost and must never show a stale value to the fraud detection system. You choose QUORUM — which with RF=3 per DC means 2 nodes must ACK. For cross-DC operations you use LOCAL_QUORUM to avoid cross-Atlantic latency on every write.
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
cluster = Cluster(
['10.0.1.1', '10.0.1.2'], # seed nodes in London DC
load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='london')
)
session = cluster.connect('payments')
# Write with LOCAL_QUORUM: 2 of 3 London replicas must ACK
# Avoids cross-DC round trip while still ensuring durability
write_stmt = SimpleStatement(
"""INSERT INTO transactions
(txn_id, account_id, amount, currency, ts)
VALUES (%s, %s, %s, %s, toTimestamp(now()))""",
consistency_level=ConsistencyLevel.LOCAL_QUORUM
)
session.execute(write_stmt, ('txn_9812', 'acc_441', 149.99, 'GBP'))
# Read with LOCAL_QUORUM: ensures we see the latest committed write
read_stmt = SimpleStatement(
"SELECT * FROM transactions WHERE txn_id = %s",
consistency_level=ConsistencyLevel.LOCAL_QUORUM
)
row = session.execute(read_stmt, ('txn_9812',)).one()
Write acknowledged by 2/3 London replicas (LOCAL_QUORUM ✓) Write latency: 4.2ms Read returned from 2/3 London replicas: txn_id: txn_9812 account_id: acc_441 amount: 149.99 currency: GBP Read latency: 3.1ms
ConsistencyLevel.LOCAL_QUORUM
LOCAL_QUORUM requires a quorum of replicas in the local DC only — 2 of 3 London replicas. QUORUM would require a quorum across all DCs (4 of 6 total) which means every write and read must synchronise across the Atlantic, adding 80–100ms of cross-DC latency. LOCAL_QUORUM gives you strong consistency within the DC without that latency penalty — the standard choice for multi-DC deployments.
DCAwareRoundRobinPolicy(local_dc='london')
Tells the driver to prefer London nodes as coordinators. Without this, the driver might route a write from a London application server through a Frankfurt coordinator — adding unnecessary cross-DC latency before the write even starts. The driver should always talk to the nearest node first.
Hands-on — Replication in MongoDB Replica Sets
The scenario: Your MongoDB deployment is a single standalone instance. After a disk failure last month caused 4 hours of downtime, your CTO mandates a replica set with automatic failover. You are initialising a 3-node replica set and verifying the replication state.
# Initialise the replica set from the primary node
rs.initiate({
_id: "rs0",
members: [
{ _id: 0, host: "mongo1:27017", priority: 2 }, // preferred primary
{ _id: 1, host: "mongo2:27017", priority: 1 },
{ _id: 2, host: "mongo3:27017", priority: 1 }
]
})
# Check replica set status
rs.status()
{
set: 'rs0',
members: [
{ _id: 0, name: 'mongo1:27017', stateStr: 'PRIMARY',
optime: { ts: Timestamp(1736512943, 1) },
health: 1, uptime: 142 },
{ _id: 1, name: 'mongo2:27017', stateStr: 'SECONDARY',
optime: { ts: Timestamp(1736512943, 1) },
health: 1, uptime: 141 },
{ _id: 2, name: 'mongo3:27017', stateStr: 'SECONDARY',
optime: { ts: Timestamp(1736512943, 1) },
health: 1, uptime: 141 }
],
ok: 1
}priority: 2 on mongo1
Priority controls which node is preferred for primary election. mongo1 has priority 2, the others have priority 1 — so after any failover, once mongo1 recovers it will reclaim the primary role. This is useful when one node has better hardware. Set priority to 0 to make a node permanently non-eligible for primary (useful for a hidden analytics replica that should never serve writes).
optime: Timestamp(1736512943, 1)
The operation timestamp — the position in the primary's oplog that this member has replicated up to. When all three members show the same optime, replication is fully caught up. If a secondary's optime lags significantly behind the primary, it means replication is falling behind — usually caused by a slow secondary or a burst of writes.
The scenario continues: You simulate a primary failure to verify automatic failover works correctly. The secondary with the highest optime is elected primary within seconds.
# Simulate primary failure by stepping down (or kill the process)
rs.stepDown() # forces the primary to step down and triggers election
# On reconnect — check who won the election
rs.status().members.filter(m => m.stateStr === 'PRIMARY')
mongo1:27017 stepped down. Election triggered — candidates: mongo2, mongo3 mongo2:27017 elected PRIMARY (highest optime, priority 1) Election completed in: 2.4 seconds rs.status() after failover: mongo1:27017 SECONDARY (recovering) mongo2:27017 PRIMARY ← new primary mongo3:27017 SECONDARY Write availability restored after: 2.4 seconds
Election completed in 2.4 seconds
MongoDB uses the Raft consensus algorithm for elections. A candidate node requests votes from the other members. The node with the most up-to-date oplog (highest optime) and highest priority wins. The default election timeout is 10 seconds — you can tune settings.electionTimeoutMillis down to 5 seconds for faster failover, at the cost of more false elections if the primary is only briefly unreachable.
Write unavailability during election (2.4s)
During the election window, the replica set has no primary — writes are rejected. Applications must implement retry logic with exponential backoff to handle this window. The MongoDB driver does this automatically for retryable writes (enabled by default since MongoDB 4.0) — it retries the write once a new primary is elected without any application code changes.
Replication Lag — The Consistency Problem
In any asynchronous replication system, there is a window of time between when a write is committed on the primary and when it appears on secondaries. This is called replication lag. During that window, a read from a secondary returns stale data.
Read-Your-Writes Violation
A user updates their profile photo. The write goes to the primary. The user immediately refreshes — their request lands on a secondary that has not yet replicated the update. They see their old photo. This is a read-your-writes violation. The fix: route reads to the primary for operations that must see their own recent writes, or wait for replication to complete before redirecting to secondaries.
Consistency Level Reference — Cassandra
| Consistency Level | Nodes required | Consistency | Use when |
|---|---|---|---|
| ONE | 1 replica | Eventual | Maximum availability — stale reads acceptable (logging, analytics) |
| LOCAL_QUORUM | Majority in local DC | Strong (within DC) | Multi-DC production workloads — payments, user state |
| QUORUM | Majority across all DCs | Strong (global) | Global strong consistency needed — accept cross-DC latency |
| ALL | Every replica | Strongest | Rarely used — any single node down blocks writes |
| ANY | 1 node (hint counts) | Weakest | Maximum write availability — even hinted handoff counts |
Hinted Handoff — Surviving Node Failures Without Data Loss
When a Cassandra node is temporarily down, the coordinator node stores a hint — a record of the write that the down node missed. When the node comes back online, the coordinator delivers the hint and the node catches up. This is why Cassandra can acknowledge a write at ONE consistency even when a replica is down — the hint ensures the data will eventually reach all replicas.
Read Repair — The Complementary Mechanism
When Cassandra reads with quorum and discovers that replicas disagree (different values for the same row), it automatically corrects the outdated replica in the background. This is called read repair. Together with hinted handoff, it ensures that even after node failures and temporary inconsistencies, all replicas eventually converge on the correct value without manual intervention.
Teacher's Note
The single most common replication misconfiguration in production is using ONE consistency level for everything because it is the fastest — and then being surprised when users see stale data or, worse, when a node failure causes a permanent read of old values. For anything that matters — user accounts, payments, inventory — LOCAL_QUORUM is the minimum. ONE is for event logging and analytics where you genuinely do not care if a row is briefly stale.
Practice Questions — You're the Engineer
Scenario:
Scenario:
Scenario:
Quiz — Replication in Production
Scenario:
Scenario:
SimpleStrategy with replication_factor: 3. You immediately flag this as incorrect for a multi-DC deployment. What should they use instead, and why does SimpleStrategy fail here?
Scenario:
Up Next · Lesson 29
Indexing in NoSQL
Why NoSQL indexes are nothing like SQL indexes — and the specific patterns that turn a 10-second query into a 2ms lookup.