NoSQL Lesson 27 – Partitioning & Sharding | Dataplexa
Data Modeling & Design · Lesson 27

Partitioning & Sharding

When your dataset outgrows a single machine, you have two options: buy a bigger machine (vertical scaling, with hard limits and eye-watering costs) or split the data across many machines (horizontal scaling, theoretically infinite). Partitioning and sharding are how you do the second one — and the key you choose to split on determines whether your cluster scales linearly or melts down under load.

Partitioning vs Sharding — The Distinction

Partitioning is the general concept of dividing a dataset into smaller, independent subsets called partitions. A partition is a slice of the data — it can live on the same machine as other partitions or on a different one.

Sharding is partitioning across multiple physical machines (nodes). When people say a database is sharded, they mean the partitions are distributed — each node owns a subset of the data, and together the nodes form one logical database. Every NoSQL database covered in this course uses sharding. What differs is how each one decides which node owns which data.

Why sharding is the foundation of NoSQL scalability

A single PostgreSQL server typically starts to strain once the dataset reaches a few terabytes and write throughput degrades. A sharded cluster raises that ceiling by adding machines: in the ideal case a 10-node Cassandra cluster handles roughly 10× the write throughput of a single node, and a 100-node cluster roughly 100×. The relationship is nearly linear because each node owns an independent slice of the data and serves reads and writes for that slice without coordinating with the nodes that own other slices.

The Three Partitioning Strategies

Every NoSQL database uses one of three strategies to decide which shard owns which row. Each has different strengths, weaknesses, and failure modes.

Hash Partitioning

Apply a hash function to the partition key; the hash value determines the shard. Keys are spread pseudo-randomly but uniformly, so a high-cardinality key produces no hotspots and no sequential clustering. This is the dominant strategy in Cassandra and DynamoDB.

shard = hash(user_id) % num_shards
"usr_882" → hash → shard 4
"usr_441" → hash → shard 7

Weakness: range queries are expensive — "all users who signed up in January" spans every shard.
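
The modulo rule above can be tried out in a few lines of Python. This is a minimal sketch, not how Cassandra or DynamoDB implement it: the `shard_for` helper, the use of MD5 as a stable hash, the 8 shards, and the synthetic `usr_<n>` keys are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # assumed cluster size for the demo


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a digest is used to get the same answer on every run.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Count how many of 80,000 synthetic user IDs land on each shard.
counts = Counter(shard_for(f"usr_{i}", NUM_SHARDS) for i in range(80_000))
for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {counts[shard]} keys")
```

Each shard should end up with close to 10,000 keys. Note that consecutive IDs such as `usr_881` and `usr_882` land on unrelated shards, which is exactly why a range query has to visit all of them.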

Range Partitioning

Divide the key space into contiguous ranges. Shard 1 owns A–F, shard 2 owns G–M, shard 3 owns N–Z. Range queries are fast because the data for a narrow range sits on one shard, or on a few adjacent shards for a wider range. Used in HBase and MongoDB.

Shard 1: rows where key < "G"
Shard 2: rows where "G" ≤ key < "N"
Shard 3: rows where key ≥ "N"

Weakness: sequential keys create hotspots — all new writes land on the shard that owns the current range end.
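
A small Python sketch can make both the strength and the weakness concrete. It assumes sorted split points and uses binary search to find the owning shard; the function names, the date-based split points, and the sample keys are hypothetical, and shards are numbered from 0 here.

```python
from bisect import bisect_right
from collections import Counter


def range_shard(key, split_points):
    """Index of the shard whose range contains key (split_points must be sorted)."""
    return bisect_right(split_points, key)


def shards_for_range(low, high, split_points):
    """Shards that a scan over [low, high] has to touch."""
    return list(range(range_shard(low, split_points),
                      range_shard(high, split_points) + 1))


# Shard 0: key < "G", shard 1: "G" <= key < "N", shard 2: key >= "N"
NAME_SPLITS = ["G", "N"]
print(range_shard("Alice", NAME_SPLITS))             # 0
print(range_shard("G", NAME_SPLITS))                 # 1
print(shards_for_range("Ha", "Hz", NAME_SPLITS))     # [1]

# Sequential keys: every new December write lands on the last shard.
DATE_SPLITS = ["2024-05-01", "2024-09-01"]
new_writes = [f"2024-12-{day:02d}" for day in range(1, 29)]
print(Counter(range_shard(key, DATE_SPLITS) for key in new_writes))
```

The range scan touches a single shard, while the 28 sequential writes all pile onto shard 2, the shard that owns the end of the key space.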

Directory Partitioning

A lookup table (directory) maps each key to a specific shard. Maximum flexibility — any key can be assigned to any shard manually. Used when you need fine-grained control over data placement. Less common, higher operational overhead.

directory["tenant_acme"] → shard 3
directory["tenant_beta"] → shard 1
Manual control over placement

Weakness: the directory itself becomes a bottleneck and single point of failure.
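
As a rough illustration, a directory is just a mapping that every request has to consult. The `ShardDirectory` class below is a hypothetical in-memory stand-in; a real deployment would keep this mapping in a replicated store, which is where the bottleneck and single-point-of-failure concerns come from.

```python
class ShardDirectory:
    """Toy lookup table mapping each key to an explicitly assigned shard."""

    def __init__(self):
        self._placement = {}

    def assign(self, key, shard):
        # Placement is a manual decision; nothing is derived from the key itself.
        self._placement[key] = shard

    def lookup(self, key):
        if key not in self._placement:
            raise KeyError(f"no shard assigned for {key!r}")
        return self._placement[key]


directory = ShardDirectory()
directory.assign("tenant_acme", 3)
directory.assign("tenant_beta", 1)
print(directory.lookup("tenant_acme"))  # 3

# Moving one tenant is a single directory update; no other key is affected.
directory.assign("tenant_acme", 5)
print(directory.lookup("tenant_acme"), directory.lookup("tenant_beta"))  # 5 1
```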

Consistent Hashing — How Cassandra Avoids Reshuffling

Simple hash partitioning has a fatal problem: add or remove a node and you must recalculate hash(key) % num_shards for every key, meaning almost all data moves to a different shard. On a 100-node cluster storing 10TB, that is a massive reshuffling operation that brings the cluster to its knees.
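
The scale of that reshuffle is easy to measure. The sketch below is an illustration under assumed inputs (synthetic `usr_<n>` keys, MD5 as a stable hash, hypothetical helper names): it counts how many keys change shard when one node is added under plain modulo placement.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash of a string key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if stable_hash(key) % old_shards != stable_hash(key) % new_shards
    )
    return moved / len(keys)


keys = [f"usr_{i}" for i in range(20_000)]
print(f"6 -> 7 shards:     {moved_fraction(keys, 6, 7):.1%} of keys move")
print(f"100 -> 101 shards: {moved_fraction(keys, 100, 101):.1%} of keys move")
```

Going from N to N+1 shards, a key stays put only when both remainders happen to agree, which is about 1 key in N+1. Expect roughly 86% of keys to move in the first case and about 99% in the second.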

Cassandra solves this with consistent hashing. The hash space is arranged as a ring (with the default Murmur3Partitioner, tokens run from −2⁶³ to 2⁶³−1). Each node is assigned one or more positions on the ring called tokens. A row's partition key is hashed, and the row is owned by the first node whose token is greater than or equal to that hash value, that is, the next node clockwise on the ring.

Consistent Hashing Ring — 4 Nodes

[Figure: token ring]
Node A: token 0
Node B: token 25
Node C: token 50
Node D: token 75
key → hash 8 → Node B
key → hash 38 → Node C
key → hash 88 → Node A (wraps around the ring)

Adding a new node only displaces the data between its new token and its predecessor's token. The rest of the ring is untouched: on average only about 1/(N+1) of the data moves when a node joins an N-node cluster.
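
The ownership rule and the effect of adding a node can be sketched with a toy ring. This is not Cassandra's implementation: the `TokenRing` class, the 0 to 99 hash space (chosen to match the four-node example in this lesson), and the fifth node at token 60 are assumptions for illustration.

```python
from bisect import bisect_left


class TokenRing:
    """Toy token ring: a key belongs to the first node whose token >= the key's hash."""

    def __init__(self, tokens):
        # tokens: mapping of node name -> token position on the ring
        self._ring = sorted((token, node) for node, token in tokens.items())

    def add_node(self, node, token):
        self._ring.append((token, node))
        self._ring.sort()

    def owner(self, key_hash):
        positions = [token for token, _ in self._ring]
        index = bisect_left(positions, key_hash)       # first token >= key_hash
        return self._ring[index % len(self._ring)][1]  # past the last token, wrap to the first


RING_SIZE = 100  # toy hash space 0..99
ring = TokenRing({"Node A": 0, "Node B": 25, "Node C": 50, "Node D": 75})
before = {h: ring.owner(h) for h in range(RING_SIZE)}
print(before[8], before[38], before[88])  # Node B Node C Node A

ring.add_node("Node E", 60)  # hypothetical fifth node
after = {h: ring.owner(h) for h in range(RING_SIZE)}
moved = [h for h in range(RING_SIZE) if before[h] != after[h]]
print(len(moved), moved[0], moved[-1])  # 10 51 60
```

Only hashes 51 through 60 change owner (from Node D to Node E); every other key stays where it was. With a single token per node the moved share depends on where the new token lands, which is one reason real clusters give each node many tokens so the share averages out.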

Hands-on — Observing Shard Distribution in Cassandra

The scenario: You are a platform engineer running a 6-node Cassandra cluster for a logistics company. The ops team complains that two nodes consistently show 3× higher CPU than the others. You suspect a partition key distribution problem. You need to inspect how data is distributed across nodes and identify the hot partition.

# Check token ranges and data ownership per node
nodetool ring

# Check actual data size per node
nodetool status

# Inspect estimated partition counts and mean partition sizes per token range
# (run from cqlsh; these are estimates per range, not a list of individual partition keys)
SELECT * FROM system.size_estimates
WHERE keyspace_name = 'logistics'
  AND table_name    = 'shipments'
LIMIT 20;

$ nodetool status
Datacenter: dc1
===============
Status=Up/Down | State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns    Host ID
UN  10.0.0.1       48 GB       256     16.4%   node-1
UN  10.0.0.2       49 GB       256     16.8%   node-2
UN  10.0.0.3       147 GB      256     16.6%   node-3   ← HOT
UN  10.0.0.4       51 GB       256     16.7%   node-4
UN  10.0.0.5       49 GB       256     16.8%   node-5
UN  10.0.0.6       145 GB      256     16.7%   node-6   ← HOT

Note: Owns shows each node's share of the token ring, not the amount of data it stores.
Nodes 3 and 6 are carrying 3x the data of their peers.
nodetool status — the first diagnostic tool you reach for

nodetool status shows the load (data size) each node carries. In a well-distributed cluster all nodes should have roughly equal load. Nodes 3 and 6 at about 147GB versus roughly 49GB for the others is a 3× imbalance, a classic sign of a low-cardinality or heavily skewed partition key creating hot partitions that land disproportionately on certain token ranges.
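
The same check can be automated once the Load column has been collected. The sketch below hard-codes the figures from the sample output; the `hot_nodes` helper and the "more than twice the median" threshold are assumptions for illustration, not a Cassandra convention.

```python
from statistics import median

# Load per node in GB, copied from the sample nodetool status output.
loads_gb = {
    "node-1": 48, "node-2": 49, "node-3": 147,
    "node-4": 51, "node-5": 49, "node-6": 145,
}


def hot_nodes(loads, factor=2.0):
    """Nodes whose load exceeds `factor` times the median load."""
    typical = median(loads.values())
    return sorted(node for node, load in loads.items() if load > factor * typical)


print(median(loads_gb.values()))  # 50.0
print(hot_nodes(loads_gb))        # ['node-3', 'node-6']
```

The median is used rather than the mean so that the hot nodes themselves do not drag the baseline upward.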

Tokens: 256 (vnodes)

Each node owns 256 virtual nodes (vnodes) — 256 positions on the consistent hash ring. Virtual nodes are Cassandra's mechanism for achieving even data distribution even when nodes have different hardware specs. The fact that all nodes have equal token ownership (≈16.7%) but unequal data load confirms the problem is in the partition key, not the ring configuration.

The scenario continues: You investigate the shipments table schema and discover the partition key is status — values are 'pending', 'in_transit', 'delivered'. 70% of all shipments are 'delivered', and their token range happens to fall on nodes 3 and 6. You redesign the partition key to distribute load evenly.

-- BEFORE: low-cardinality partition key causes hotspot
-- 70% of rows have status='delivered' → nodes 3 & 6 overwhelmed
CREATE TABLE shipments_bad (
  status      TEXT,           -- BAD: only 3 values → 3 partitions max
  shipped_at  TIMESTAMP,
  shipment_id UUID,
  origin      TEXT,
  destination TEXT,
  PRIMARY KEY (status, shipped_at, shipment_id)
);

-- AFTER: high-cardinality composite partition key
-- shipper_id has 50,000 unique values → up to 50,000 partitions per day
-- date bounds partition size to one day of shipments per shipper
CREATE TABLE shipments (
  shipper_id  TEXT,
  ship_date   DATE,
  shipped_at  TIMESTAMP,
  shipment_id UUID,
  status      TEXT,
  origin      TEXT,
  destination TEXT,
  PRIMARY KEY ((shipper_id, ship_date), shipped_at, shipment_id)
) WITH CLUSTERING ORDER BY (shipped_at DESC, shipment_id ASC);
Created table shipments
Took 0.744 seconds

After migration and nodetool status:
UN  10.0.0.1  82 GB  256  16.7%  node-1  ✓
UN  10.0.0.2  81 GB  256  16.6%  node-2  ✓
UN  10.0.0.3  83 GB  256  16.7%  node-3  ✓
UN  10.0.0.4  82 GB  256  16.7%  node-4  ✓
UN  10.0.0.5  81 GB  256  16.6%  node-5  ✓
UN  10.0.0.6  82 GB  256  16.7%  node-6  ✓
Even distribution across all 6 nodes.
PRIMARY KEY ((shipper_id, ship_date), shipped_at, shipment_id)

The composite partition key (shipper_id, ship_date) creates one partition per shipper per day. With 50,000 shippers and 365 days per year, that is 50,000 × 365 = 18.25 million partitions per year distributed across 6 nodes, or roughly 3 million partitions per node. Consistent hashing spreads them uniformly. No single value dominates any node.
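
A quick simulation shows why cardinality matters more than the hash function. This is a simplified sketch: it places partitions with `hash % 6` rather than a real token ring, and the 10/20/70 status split, the 60,000 synthetic rows, and the helper names are assumptions (the lesson only states that 70% of shipments are 'delivered').

```python
import hashlib
import random
from collections import Counter

NODES = 6


def node_for(partition_key) -> int:
    """Place a partition on a node by hashing its key (modulo stands in for the ring)."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


rng = random.Random(27)  # fixed seed so the run is repeatable
rows = [
    {
        "status": rng.choices(["pending", "in_transit", "delivered"],
                              weights=[10, 20, 70])[0],
        "shipper_id": f"shipper_{rng.randrange(50_000)}",
        "ship_date": f"2024-01-{rng.randrange(1, 29):02d}",
    }
    for _ in range(60_000)
]

# Rows per node when partitioning by status vs. by (shipper_id, ship_date).
by_status = Counter(node_for(row["status"]) for row in rows)
by_composite = Counter(node_for((row["shipper_id"], row["ship_date"])) for row in rows)

print("by status:   ", sorted(by_status.items()))
print("by composite:", sorted(by_composite.items()))
```

Three distinct status values can occupy at most three nodes, and the node holding 'delivered' carries about 70% of all rows. The composite key produces tens of thousands of distinct partitions, so each node ends up with close to one sixth of the rows.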

status moved to a regular column

status is now a regular column. It no longer determines data placement, and filtering on it with WHERE status = 'pending' after specifying the partition key requires ALLOW FILTERING unless the column is indexed. If you need to query all pending shipments across all shippers, that requires either a secondary index (with its own performance caveats) or a separate table shaped for that query.

Sharding in MongoDB — How It Works

MongoDB sharding is managed through three components: mongos (the query router), config servers (store the shard map), and shard replica sets (the actual data). When a query arrives at mongos, it consults the config servers to find which shard holds the requested data, then routes the query directly there.

The scenario: Your MongoDB collection of 800 million order documents has outgrown a single server — write latency has climbed to 300ms and disk is 94% full. You are enabling sharding on the orders collection using a hashed shard key on customer_id to distribute writes evenly.

// Enable sharding on the database
sh.enableSharding("ecommerce")

// Shard the orders collection on a hashed customer_id
// Hashed sharding → uniform distribution, good for high write throughput
sh.shardCollection(
  "ecommerce.orders",
  { customer_id: "hashed" }   // "hashed" = hash partitioning strategy
)

// Check shard distribution after data migrates
db.orders.getShardDistribution()

Shard shard0000 at shard0000/10.0.1.1:27018
  data: 198.4 GiB  docs: 200,112,841  chunks: 42
Shard shard0001 at shard0001/10.0.1.2:27018
  data: 201.1 GiB  docs: 200,891,204  chunks: 41
Shard shard0002 at shard0002/10.0.1.3:27018
  data: 199.8 GiB  docs: 200,443,118  chunks: 42
Shard shard0003 at shard0003/10.0.1.4:27018
  data: 200.7 GiB  docs: 198,552,837  chunks: 43

Totals: data 800 GiB  docs 800,000,000  chunks 168
Shard shard0000 contains 25.01% data  25.01% docs
Shard shard0001 contains 25.11% data  25.11% docs
Even distribution across 4 shards ✓
{ customer_id: "hashed" }

The string "hashed" tells MongoDB to apply a hash function to customer_id before determining the shard. This is the same idea as hash partitioning in Cassandra: customer IDs are distributed uniformly regardless of their actual values. If you used a ranged shard key instead and IDs are assigned sequentially, newly registered customers (with higher IDs) would all land on the same shard.

chunks: 42

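MongoDB breaks each shard's data into chunks, which are contiguous ranges of the hashed key space. The default chunk size is 128MB (64MB before MongoDB 6.0). MongoDB's balancer automatically migrates chunks between shards to keep data evenly distributed. In versions before 6.0, when a chunk grows beyond the chunk size MongoDB splits it and the balancer may move one half to a less-loaded shard; from 6.0 onward the balancer compares data size per shard and chunks are split only when a migration requires it.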

Shard Key Selection — The Rules

Property | Why it matters | Bad example | Good example
High cardinality | More unique values = more partitions = better distribution | status (3 values) | customer_id (millions)
Non-monotonic | Sequential keys (timestamps, auto-increment) route all writes to the current range-end shard | created_at (timestamp) | hashed(order_id)
Matches query pattern | Queries that include the shard key hit one shard. Queries without it scatter to all shards. | Shard by region, query by user_id | Shard by user_id, query by user_id
Bounded partition size | A single key value must not generate a partition that grows without limit over time | user_id with no TTL (power users) | (user_id, month) composite

Scatter-Gather — The Cost of a Bad Shard Key

When a query does not include the shard key, the query router must send it to every shard and merge the results — a scatter-gather operation. On a well-sharded cluster this is unavoidable for some queries, but if your most common queries are scatter-gather, you have chosen the wrong shard key.

The Scatter-Gather Trap

A 10-shard MongoDB cluster where the app's most frequent query does not include the shard key is not 10× faster than a single node for that query, and it can end up slower. Every shard runs the full query and returns results, then the mongos router merges 10 result sets. You have multiplied your work, not divided it. Always verify that your most frequent queries include the shard key before committing to a sharding strategy.
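
The difference between a targeted query and a scatter-gather query can be sketched with a toy router. The `Router` class below is a hypothetical in-memory model, not the mongos API: it hash-shards documents on customer_id and reports how many shards each query had to contact.

```python
import hashlib


class Router:
    """Toy query router over in-memory shards, hash-sharded on customer_id."""

    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]

    def _shard_index(self, customer_id):
        digest = hashlib.md5(str(customer_id).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def insert(self, doc):
        self.shards[self._shard_index(doc["customer_id"])].append(doc)

    def find(self, **filters):
        """Return (matching docs, number of shards contacted)."""
        if "customer_id" in filters:
            # Shard key present: route to exactly one shard.
            targets = [self._shard_index(filters["customer_id"])]
        else:
            # Shard key absent: every shard must run the query.
            targets = list(range(len(self.shards)))
        matches = []
        for index in targets:
            matches.extend(
                doc for doc in self.shards[index]
                if all(doc.get(field) == value for field, value in filters.items())
            )
        return matches, len(targets)


router = Router(num_shards=10)
for i in range(1_000):
    router.insert({"customer_id": i % 100,
                   "category": "books" if i % 4 == 0 else "toys"})

targeted, shards_hit = router.find(customer_id=42)
print(len(targeted), shards_hit)   # 10 1
scattered, shards_hit = router.find(category="books")
print(len(scattered), shards_hit)  # 250 10
```

The query on customer_id touches one shard; the query on category touches all ten, and the router has to merge ten partial result sets.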

Teacher's Note

In MongoDB, the shard key was long immutable: before version 5.0 you could not change it without dumping the data and re-creating the collection. Newer versions offer refineCollectionShardKey (4.4+) and reshardCollection (5.0+), but resharding rewrites the whole collection and is expensive. In Cassandra, the partition key is baked into the table schema and cannot be altered. In both systems, choosing the wrong distribution key locks you into a painful migration. Spend disproportionate time on this decision before you write a single byte of production data.

Practice Questions — You're the Engineer

Scenario:

Your Cassandra cluster is growing. You need to add a 7th node to a 6-node cluster. Your tech lead explains that with a simple modulo hash (hash % num_nodes), adding a node would require moving almost all data to new nodes. Cassandra avoids this by arranging nodes on a ring where each node owns a token range — adding a node only displaces the data between the new node's token and its predecessor. What is this partitioning algorithm called?


Scenario:

Your MongoDB cluster has 8 shards. The collection is sharded by customer_id. The analytics team's most common query filters by product_category — a field that is not the shard key. You run explain() and see the query is sent to all 8 shards simultaneously and the results are merged. Query time is 8× slower than expected. What is this query execution pattern called?


Scenario:

An HBase table stores web crawl data partitioned by URL. The data engineer explains that all URLs starting with 'a' through 'f' are on Region Server 1, 'g' through 'm' on Region Server 2, and so on — the key space is divided into contiguous alphabetical ranges. Range scans like "get all pages under example.com" hit exactly one region server. What partitioning strategy is this?


Quiz — Partitioning & Sharding in Production

Scenario:

A logistics company has a 10-node Cassandra cluster with a shipments table partitioned by status. Status has three values: 'pending', 'in_transit', 'delivered'. nodetool status shows 2 nodes carrying 60% of the total data while the other 8 carry 40% combined. Adding more nodes has not helped. What is the root cause?

Scenario:

Your MongoDB orders collection is sharded by created_at (a timestamp) using range sharding. On Black Friday, all new orders written in the same minute land on the same shard because they share the same timestamp range. That shard hits 100% CPU while the others are idle. What sharding strategy should you switch to and why?

Scenario:

You are presenting Cassandra's scaling properties to your CTO. She asks: "If we have 20 nodes and add a 21st, how much data has to move?" You explain that Cassandra uses consistent hashing on a token ring, not a simple modulo hash. What is the correct answer about how much data moves, and why is it so much less than with simple modulo hashing?
