NoSQL Lesson 28 – Replication Strategies | Dataplexa

Data Modeling & Design · Lesson 28

Replication Strategies

On 4 October 2021, Facebook went offline for about six hours. The root cause was a configuration change that took down the backbone routers connecting their data centres, and because every service depended on those data centres being reachable, everything went dark. Replication is how you ensure that when a machine, a rack, or an entire data centre fails, your data survives and your service keeps running. This lesson explains exactly how NoSQL databases replicate data and what happens when replicas disagree.

Why Replication Exists

Replication serves two distinct purposes that are often conflated. Understanding them separately helps you make better configuration decisions.

Fault Tolerance

If a node fails, a replica on a different node holds a copy of the data. Reads and writes continue without interruption. The system is said to be highly available — it keeps running despite node failures. This is why replication factor 1 is never acceptable in production.

Read Scalability

Multiple replicas of the same data can serve reads in parallel. With a replication factor of 3 on a 6-node cluster, three nodes hold each row and can answer queries for it simultaneously, so read throughput for hot data can grow up to threefold without adding nodes.

Replication Models — Leader-Based vs Leaderless

Every replication system answers one fundamental question: who is allowed to accept writes? The answer splits all databases into two camps.

Leader-Based Replication

          Client

            ↓ write

          Leader (Primary)

            ↓ replicate

          Follower 1  Follower 2

          (read only)  (read only)

One node accepts all writes. Followers replicate from the leader and serve reads. Used by: MongoDB, Redis, MySQL. Simple to reason about — only one place accepts writes. Bottleneck: the leader is a single write endpoint.

Leaderless Replication

          Client

            ↓ write to ALL (or quorum)

          Node 1   Node 2   Node 3

                          (any node)

          ↑ read from quorum

Any node accepts writes. The client (or coordinator) writes to multiple nodes simultaneously. Used by: Cassandra, Riak, and the original Amazon Dynamo design. No single write bottleneck. Consistency managed via quorum rules.

Replication Factor and Quorum — The Core Trade-off

In a leaderless system like Cassandra, two numbers control the consistency vs availability trade-off on every operation: W (write quorum — how many replicas must acknowledge a write) and R (read quorum — how many replicas must respond to a read). The replication factor is N (total copies).

Quorum Rules — N=3 examples

W=3, R=1

Strong consistency

All 3 replicas must ACK every write. Any single replica is guaranteed up-to-date. Reads are fast (1 node). Writes are slow — must wait for all 3. Unavailable if any node is down during a write.

W=2, R=2

Balanced (W+R > N)

W+R=4 > N=3, so at least one replica in every read set overlaps with every write set. Reads always see the latest write. One node can fail without losing availability. The standard production setting.

W=1, R=1

Maximum availability

Fastest possible reads and writes — one node ACK is enough. Two nodes can fail. But reads may return stale data — the replica you read may not have the latest write yet. Eventual consistency.

W=1, R=3

Write-optimised

Writes are fast (1 ACK). Reads must check all 3 replicas and return the most recent version. Used for write-heavy workloads where reads can afford to be slower. Rare in practice.

The magic rule: when W + R > N, you are guaranteed strong consistency — the read set and write set always share at least one replica.
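The rule is simple enough to check mechanically. The sketch below is a minimal illustration, not part of any database API: the function names `quorum_overlaps` and `tolerated_failures` are made up for this lesson, and the loop walks through the four N=3 settings listed above.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """Return True when every read set must share a replica with every write set."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return w + r > n


def tolerated_failures(n: int, w: int, r: int) -> dict:
    """How many replicas can be down while writes and reads still succeed."""
    return {"writes": n - w, "reads": n - r}


# The four N=3 configurations discussed in this lesson.
for w, r in [(3, 1), (2, 2), (1, 1), (1, 3)]:
    print(f"N=3 W={w} R={r} "
          f"overlap={quorum_overlaps(3, w, r)} "
          f"tolerates={tolerated_failures(3, w, r)}")
```

Only W=1, R=1 fails the overlap check, which is why it is the one setting in the list that can return stale data.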

Hands-on — Configuring Replication in Cassandra

The scenario: You are the infrastructure engineer for a fintech startup deploying a Cassandra cluster across two data centres — one in London and one in Frankfurt — for disaster recovery. A payment record written in London must be readable from Frankfurt even during a London outage. You are configuring the keyspace replication strategy and setting consistency levels appropriate for payment data.

-- NetworkTopologyStrategy: specify replicas per data centre
-- This gives you 3 copies in London, 3 in Frankfurt
-- If London goes entirely offline, Frankfurt has 3 healthy replicas
CREATE KEYSPACE payments
  WITH replication = {
    'class':           'NetworkTopologyStrategy',
    'london':          3,
    'frankfurt':       3
  }
  AND durable_writes = true;

-- Confirm the keyspace was created with correct replication
DESCRIBE KEYSPACE payments;

CREATE KEYSPACE payments WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'frankfurt': '3',
  'london': '3'
} AND durable_writes = true;

Replication factor summary:
  Total replicas: 6  (3 per DC)
  Can survive: loss of entire London DC or entire Frankfurt DC
  durable_writes: true — commit log written before ACK

NetworkTopologyStrategy vs SimpleStrategy

SimpleStrategy places replicas on the next N nodes clockwise on the ring with no awareness of racks or data centres — never use it in production. NetworkTopologyStrategy is rack-aware and DC-aware: it ensures replicas are spread across different racks within a DC and different DCs across regions. If all 3 London replicas landed on the same rack, a single rack power failure would lose all London copies.
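To make the rack problem concrete, here is a deliberately simplified toy model of the two placement ideas. It is not Cassandra's actual placement code: the six-node ring, the rack names, and both function names are hypothetical, and real placement also involves tokens and snitches that are left out here.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    rack: str


# Hypothetical London ring, listed in clockwise order.
RING = [
    Node("n1", "rack-a"), Node("n2", "rack-a"), Node("n3", "rack-a"),
    Node("n4", "rack-b"), Node("n5", "rack-b"), Node("n6", "rack-c"),
]


def simple_placement(ring, start, rf):
    """Take rf consecutive nodes clockwise from start, ignoring racks."""
    return [ring[(start + i) % len(ring)] for i in range(rf)]


def rack_aware_placement(ring, start, rf):
    """Walk clockwise, preferring nodes on racks that hold no replica yet."""
    chosen, used_racks, skipped = [], set(), []
    for i in range(len(ring)):
        if len(chosen) == rf:
            break
        node = ring[(start + i) % len(ring)]
        if node.rack in used_racks:
            skipped.append(node)
        else:
            chosen.append(node)
            used_racks.add(node.rack)
    # Fewer racks than replicas: fall back to nodes skipped earlier.
    chosen.extend(skipped[: rf - len(chosen)])
    return chosen


print([n.rack for n in simple_placement(RING, 0, 3)])
print([n.rack for n in rack_aware_placement(RING, 0, 3)])
```

In this toy ring the rack-unaware walk puts all three copies on rack-a, while the rack-aware walk spreads them over three racks.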

durable_writes = true

Forces Cassandra to write to the commit log before acknowledging a write. For payment data this is non-negotiable — if a node crashes between writing to the memtable (RAM) and flushing to disk, the commit log is replayed on restart to recover the write. Setting this to false speeds up writes by 20–30% but risks data loss on node crash — acceptable only for ephemeral cache-like data.

The scenario continues: With the keyspace created, you now configure the consistency level for payment writes and reads. Payment records must never be lost and must never show a stale value to the fraud detection system. You choose LOCAL_QUORUM, which with RF=3 per DC means 2 of the 3 replicas in the local DC must ACK. Plain QUORUM would count replicas across both DCs (4 of 6) and force a cross-DC round trip on every write.

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(
    ['10.0.1.1', '10.0.1.2'],         # contact points in London DC
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='london')
)
session = cluster.connect('payments')

# Write with LOCAL_QUORUM: 2 of 3 London replicas must ACK
# Avoids cross-DC round trip while still ensuring durability
write_stmt = SimpleStatement(
    """INSERT INTO transactions
       (txn_id, account_id, amount, currency, ts)
       VALUES (%s, %s, %s, %s, toTimestamp(now()))""",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM
)
session.execute(write_stmt, ('txn_9812', 'acc_441', 149.99, 'GBP'))

# Read with LOCAL_QUORUM: ensures we see the latest committed write
read_stmt = SimpleStatement(
    "SELECT * FROM transactions WHERE txn_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM
)
row = session.execute(read_stmt, ('txn_9812',)).one()

Write acknowledged by 2/3 London replicas  (LOCAL_QUORUM ✓)
Write latency: 4.2ms

Read returned from 2/3 London replicas:
  txn_id:     txn_9812
  account_id: acc_441
  amount:     149.99
  currency:   GBP
Read latency: 3.1ms

ConsistencyLevel.LOCAL_QUORUM

LOCAL_QUORUM requires a quorum of replicas in the local DC only: 2 of 3 London replicas. QUORUM would require a quorum across all DCs (4 of 6 total), so every write and read would have to wait for at least one replica in the other DC, adding a network round trip between London and Frankfurt to each operation. LOCAL_QUORUM gives you strong consistency within the DC without that latency penalty, which makes it the standard choice for multi-DC deployments.
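The "2 of 3" and "4 of 6" figures come from the usual majority formula, replicas // 2 + 1. A small sketch of that arithmetic for this keyspace follows; the helper name `quorum` and the variable names are illustrative only.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a replica count."""
    return replicas // 2 + 1


# Replication factors from the payments keyspace in this lesson.
rf_per_dc = {"london": 3, "frankfurt": 3}

local_quorum = quorum(rf_per_dc["london"])        # majority of the local DC
global_quorum = quorum(sum(rf_per_dc.values()))   # majority of all replicas

# QUORUM needs more ACKs than London alone can supply, so at least this
# many must come from Frankfurt on every operation.
min_remote_acks = max(0, global_quorum - rf_per_dc["london"])

print(local_quorum, global_quorum, min_remote_acks)
```

Because the global quorum exceeds the number of London replicas, a QUORUM operation can never complete without hearing from Frankfurt.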

DCAwareRoundRobinPolicy(local_dc='london')

Tells the driver to prefer London nodes as coordinators. Without this, the driver might route a write from a London application server through a Frankfurt coordinator — adding unnecessary cross-DC latency before the write even starts. The driver should always talk to the nearest node first.

Hands-on — Replication in MongoDB Replica Sets

The scenario: Your MongoDB deployment is a single standalone instance. After a disk failure last month caused 4 hours of downtime, your CTO mandates a replica set with automatic failover. You are initialising a 3-node replica set and verifying the replication state.

// Initialise the replica set (run once, on one member: here mongo1)
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017", priority: 2 },  // preferred primary
    { _id: 1, host: "mongo2:27017", priority: 1 },
    { _id: 2, host: "mongo3:27017", priority: 1 }
  ]
})

// Check replica set status
rs.status()

{
  set: 'rs0',
  members: [
    { _id: 0, name: 'mongo1:27017', stateStr: 'PRIMARY',
      optime: { ts: Timestamp(1736512943, 1) },
      health: 1, uptime: 142 },
    { _id: 1, name: 'mongo2:27017', stateStr: 'SECONDARY',
      optime: { ts: Timestamp(1736512943, 1) },
      health: 1, uptime: 141 },
    { _id: 2, name: 'mongo3:27017', stateStr: 'SECONDARY',
      optime: { ts: Timestamp(1736512943, 1) },
      health: 1, uptime: 141 }
  ],
  ok: 1
}

priority: 2 on mongo1

Priority controls which node is preferred for primary election. mongo1 has priority 2, the others have priority 1 — so after any failover, once mongo1 recovers it will reclaim the primary role. This is useful when one node has better hardware. Set priority to 0 to make a node permanently non-eligible for primary (useful for a hidden analytics replica that should never serve writes).

optime: Timestamp(1736512943, 1)

The operation timestamp — the position in the primary's oplog that this member has replicated up to. When all three members show the same optime, replication is fully caught up. If a secondary's optime lags significantly behind the primary, it means replication is falling behind — usually caused by a slow secondary or a burst of writes.
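Comparing optimes by eye gets tedious on a busy cluster. The sketch below computes per-secondary lag from a structure shaped loosely like the rs.status() output. The `optime_secs` field, the function name, and the 12-second gap on mongo3 are all hypothetical values chosen for illustration, not real output.

```python
# Simplified, hypothetical snapshot: only the seconds part of each optime.
members = [
    {"name": "mongo1:27017", "stateStr": "PRIMARY", "optime_secs": 1736512943},
    {"name": "mongo2:27017", "stateStr": "SECONDARY", "optime_secs": 1736512943},
    {"name": "mongo3:27017", "stateStr": "SECONDARY", "optime_secs": 1736512931},
]


def replication_lag(members):
    """Seconds each secondary trails the primary's optime."""
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optime_secs"] - m["optime_secs"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


for name, lag in replication_lag(members).items():
    print(f"{name}: {lag}s behind primary")
```

A monitoring job could alert when any value stays above a threshold you choose for your workload.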

The scenario continues: You simulate a primary failure to verify automatic failover works correctly. The secondary with the highest optime is elected primary within seconds.

// Simulate primary failure by stepping down (or kill the process)
rs.stepDown()   // forces the primary to step down and triggers an election

// On reconnect, check who won the election
rs.status().members.filter(m => m.stateStr === 'PRIMARY')

mongo1:27017 stepped down.
Election triggered — candidates: mongo2, mongo3
mongo2:27017 elected PRIMARY (highest optime, priority 1)
Election completed in: 2.4 seconds

rs.status() after failover:
  mongo1:27017  SECONDARY  (recovering)
  mongo2:27017  PRIMARY    ← new primary
  mongo3:27017  SECONDARY

Write availability restored after: 2.4 seconds

Election completed in 2.4 seconds

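MongoDB elections use a protocol based on the Raft consensus algorithm. A candidate node requests votes from the other members, and a member only votes for a candidate whose oplog is at least as up to date as its own; priority then influences which eligible node ends up as primary. Because rs.stepDown() announces the step-down, the election here starts immediately. When a primary crashes instead, the secondaries first have to notice that it is unreachable: the default election timeout is 10 seconds, and you can tune settings.electionTimeoutMillis down (for example to 5 seconds) for faster failover, at the cost of more false elections if the primary is only briefly unreachable.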

Write unavailability during election (2.4s)

During the election window, the replica set has no primary, so writes are rejected. Applications need retry logic to ride out this window. MongoDB drivers cover the common case through retryable writes (available since MongoDB 3.6 and enabled by default in drivers compatible with MongoDB 4.2 and later): the driver retries a failed write one time after selecting the new primary, with no application code changes. Longer outages still call for application-level retries with exponential backoff.
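For operations the driver does not retry on its own, a generic backoff wrapper is one option. The following is a minimal sketch under stated assumptions: `NoPrimaryError` is a hypothetical stand-in for whatever exception your driver raises when no primary is available, `retry_with_backoff` is not a driver function, and the wrapped operation must be safe to run more than once.

```python
import random
import time


class NoPrimaryError(Exception):
    """Hypothetical stand-in for a driver's 'no primary available' error."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Run operation, retrying on NoPrimaryError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except NoPrimaryError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the ceiling each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so tests can pass a stub instead of actually waiting. With the defaults, the delay ceilings are 0.1, 0.2, 0.4, and 0.8 seconds, which comfortably spans an election of a few seconds only if you raise `max_attempts` or `base_delay` to suit your failover time.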

Replication Lag — The Consistency Problem

In any asynchronous replication system, there is a window of time between when a write is committed on the primary and when it appears on secondaries. This is called replication lag. During that window, a read from a secondary returns stale data.

Read-Your-Writes Violation

A user updates their profile photo. The write goes to the primary. The user immediately refreshes — their request lands on a secondary that has not yet replicated the update. They see their old photo. This is a read-your-writes violation. The fix: route reads to the primary for operations that must see their own recent writes, or wait for replication to complete before redirecting to secondaries.
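The first fix, routing a user's reads to the primary shortly after they write, can be sketched as a small router. Everything here is hypothetical: the `ReadRouter` class, its method names, and the 5-second pin window, which in practice you would set above your observed replication lag.

```python
import time


class ReadRouter:
    """Send a user's reads to the primary for a short time after they write."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write = {}  # user_id -> time of that user's latest write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"    # recent writer: must see their own update
        return "secondary"      # everyone else can tolerate slight staleness


router = ReadRouter()
router.record_write("user_42")
print(router.target_for_read("user_42"))  # recent writer
print(router.target_for_read("user_99"))  # has not written
```

Only the user who just wrote pays the cost of a primary read; other users keep reading from secondaries.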

Consistency Level Reference — Cassandra

Consistency Level	Nodes required	Consistency	Use when
ONE	1 replica	Eventual	Maximum availability — stale reads acceptable (logging, analytics)
LOCAL_QUORUM	Majority in local DC	Strong (within DC)	Multi-DC production workloads — payments, user state
QUORUM	Majority across all DCs	Strong (global)	Global strong consistency needed — accept cross-DC latency
ALL	Every replica	Strongest	Rarely used: any single replica down fails the operation
ANY	1 node (hint counts)	Weakest	Writes only. Maximum write availability, since even a stored hint counts as an ACK

Hinted Handoff — Surviving Node Failures Without Data Loss

When a Cassandra node is temporarily down, the coordinator node stores a hint: a record of the write that the down node missed. When the node comes back online, the coordinator delivers the hint and the node catches up. A write at ONE consistency can therefore be acknowledged by a single live replica while another replica is down, and the hint carries the data to the missing replica later. Hints are kept only for a limited window (3 hours by default), so a node that stays down longer needs a repair to catch up.
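The mechanism can be illustrated with a toy in-memory model. This is not how Cassandra stores or ships hints internally: the `Replica` and `Coordinator` classes and their methods are invented for this sketch, and hint expiry is left out.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (replica, key, value) for writes a down replica missed

    def write(self, key, value, required_acks=1):
        live = [r for r in self.replicas if r.up]
        if len(live) < required_acks:
            raise RuntimeError("not enough live replicas for this consistency level")
        for replica in self.replicas:
            if replica.up:
                replica.data[key] = value
            else:
                self.hints.append((replica, key, value))  # remember for later
        return len(live)

    def replay_hints(self):
        still_pending = []
        for replica, key, value in self.hints:
            if replica.up:
                replica.data[key] = value  # deliver the missed write
            else:
                still_pending.append((replica, key, value))
        self.hints = still_pending


nodes = [Replica("n1"), Replica("n2"), Replica("n3")]
coordinator = Coordinator(nodes)
nodes[2].up = False                                   # n3 goes down
acks = coordinator.write("txn_9812", 149.99, required_acks=1)
nodes[2].up = True                                    # n3 comes back
coordinator.replay_hints()
print(acks, nodes[2].data)
```

The write succeeds with two live replicas, and n3 receives the value only when the hint is replayed.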

Read Repair — The Complementary Mechanism

When Cassandra reads with quorum and discovers that replicas disagree (different values for the same row), it automatically corrects the outdated replica in the background. This is called read repair. Together with hinted handoff, it ensures that even after node failures and temporary inconsistencies, all replicas eventually converge on the correct value without manual intervention.
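A toy version of the idea, assuming last-write-wins by timestamp: each replica is modelled as a plain dict mapping a key to a (timestamp, value) pair. The function name `read_with_repair` and the sample data are made up for illustration, and the real implementation compares digests rather than full rows.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale or missing copies."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    versions = [cell for _, cell in responses if cell is not None]
    if not versions:
        return None
    newest = max(versions, key=lambda cell: cell[0])  # highest timestamp wins
    for replica, cell in responses:
        if cell != newest:
            replica[key] = newest  # repair the out-of-date replica
    return newest[1]


r1 = {"acc_441": (100, "bob_k")}       # stale copy
r2 = {"acc_441": (200, "Robert K.")}   # latest write
r3 = {}                                # missed the write entirely
value = read_with_repair([r1, r2, r3], "acc_441")
print(value, r1 == r2 == r3)
```

After the read, all three toy replicas hold the newest version, which is the convergence the paragraph above describes.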

Teacher's Note

The single most common replication misconfiguration in production is using ONE consistency level for everything because it is the fastest, and then being surprised when users see stale data or, worse, when a node failure leaves clients reading old values indefinitely. For anything that matters (user accounts, payments, inventory) LOCAL_QUORUM is the minimum. ONE is for event logging and analytics where you genuinely do not care if a row is briefly stale.

Practice Questions — You're the Engineer

Scenario:

Your team runs a multi-DC Cassandra cluster for a banking app, with nodes in London and Frankfurt and RF=3 per DC. The payment service needs strong consistency within each DC but cannot afford the cross-DC latency of global quorum on every transaction. Your tech lead says to use the consistency level that requires a majority of replicas in the local DC only. Which consistency level is that?

Scenario:

One of your Cassandra nodes goes offline for scheduled maintenance. During the 20-minute window, writes keep succeeding at ONE consistency. When the node comes back online, it receives all the writes it missed — automatically, without any manual intervention or replay script. A colleague asks what Cassandra mechanism delivered those missed writes to the recovered node. What is it called?

Scenario:

A user on your social platform updates their display name from "bob_k" to "Robert K." The write goes to the MongoDB primary. The user immediately refreshes their profile page — the load balancer routes the read to a secondary that has not yet replicated the update, so they see their old name "bob_k." The user files a bug report: "my name change didn't save." Which consistency anomaly are they experiencing?
