NoSQL
Cassandra Introduction
In 2008, Facebook had a problem: their inbox search needed to handle 50 billion messages. Each search query had to scan millions of rows across hundreds of servers in milliseconds. Their MySQL cluster was drowning. Two engineers — Avinash Lakshman and Prashant Malik — designed a new database that borrowed ideas from both Amazon's Dynamo paper and Google's Bigtable paper. They called it Cassandra. Facebook open-sourced it in 2008. Apache adopted it in 2010. Today it handles millions of writes per second for Netflix, Instagram, Apple, Twitter, and Uber — and it's about to make complete sense to you.
What Makes Cassandra Fundamentally Different
Cassandra was engineered around one core belief: writes must never block. Every architectural decision flows from this. The result is a database that can ingest a million writes per second across a 10-node cluster — sustained, without degradation.
No single point of failure
Every node is equal. No master, no primary, no leader. Any node can accept any read or write.
Linear scalability
Double the nodes, double the throughput. Cassandra scales horizontally with near-perfect linearity.
Write-optimised
Writes go to RAM first (memtable), confirmed immediately, flushed to disk in the background.
The Ring Architecture — No Masters, No Hierarchy
Cassandra organises its nodes in a ring. Each node owns a range of the token space (-2^63 to 2^63-1 with the default Murmur3 partitioner). When you write a row, Cassandra hashes the partition key to a token, finds which node owns that token range, and writes there. Every node knows the ring topology — any node can route any request.
Cassandra Ring — 4 nodes, replication factor 2
With Replication Factor 2, every write goes to the primary node AND the next node clockwise. Node A failure? Node B still has the data. No data loss. No manual failover.
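The routing described above can be sketched in a few lines of Python. This is an illustrative model, not Cassandra's implementation: real clusters use Murmur3 hashing and virtual nodes, while here we use MD5 over a fixed 4-node ring with RF=2.

```python
import bisect
import hashlib

# Toy token ring: 4 nodes, each owning a quarter of a 2^64 token space.
TOKEN_SPACE = 2**64
NODES = {0: "A", TOKEN_SPACE // 4: "B",
         TOKEN_SPACE // 2: "C", 3 * TOKEN_SPACE // 4: "D"}
TOKENS = sorted(NODES)

def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 here, Murmur3 in Cassandra)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def replicas_for(partition_key: str, rf: int = 2) -> list:
    """Primary owner is the node whose range covers the token;
    extra replicas are the next nodes clockwise around the ring."""
    token = token_for(partition_key)
    idx = bisect.bisect_right(TOKENS, token) - 1  # owner of this token range
    return [NODES[TOKENS[(idx + i) % len(TOKENS)]] for i in range(rf)]

print(replicas_for("driver_d7821"))  # primary node plus the next one clockwise
```

The same key always hashes to the same token, so every node (and every token-aware driver) computes the same replica set without asking a master.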
CQL — Cassandra Query Language
CQL looks like SQL — familiar syntax, familiar keywords. But it behaves completely differently under the hood. The most important difference: an efficient CQL query must filter on the partition key. Anything else requires a secondary index or an explicit ALLOW FILTERING full scan — which Cassandra was not designed for.
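The restriction makes sense once you picture the storage layout: a partition key is a hash lookup, any other column is a scan of everything. A hypothetical Python model (the dict stands in for partitions spread across nodes):

```python
# Model: Cassandra stores rows grouped by partition key, like a hash map.
partitions = {
    "driver_d7821": [{"recorded_at": "15:59:55", "speed_kmh": 42.5}],
    "driver_a1004": [{"recorded_at": "15:59:52", "speed_kmh": 38.0}],
}

# WHERE driver_id = 'driver_d7821'  -> one hash lookup, one node
rows = partitions["driver_d7821"]     # O(1) to locate the partition

# WHERE speed_kmh > 40  -> no partition key: must touch every partition
fast = [r for p in partitions.values() for r in p if r["speed_kmh"] > 40]
# This is what ALLOW FILTERING does: a scan across the whole cluster.
```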
Table Design — The Most Critical Cassandra Skill
In Cassandra, your table design is your query. You design tables around the queries you need to run — not around the data's natural relationships. This is the opposite of relational design.
The scenario: You're building a ride-hailing platform. Drivers send GPS coordinates every 5 seconds. You need two query patterns: (1) get all locations for a specific driver in the last hour, (2) get the most recent location for a driver. Here's how to design for both:
-- Connect to Cassandra
cqlsh
-- Create a keyspace (equivalent to a database)
CREATE KEYSPACE ridehailing
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3 -- 3 replicas in us-east datacenter
};
USE ridehailing;
NetworkTopologyStrategy
The production-grade replication strategy. Distributes replicas across datacenters and racks. 'us-east': 3 means 3 replicas in the us-east datacenter. For multi-region: 'us-east': 3, 'eu-west': 3 gives 6 total replicas across two regions.
SimpleStrategy
The alternative for single-datacenter or development clusters: WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}. Never use SimpleStrategy in production multi-DC deployments — it doesn't understand datacenter topology.
-- Design the table around the query: "all locations for driver X in last hour"
CREATE TABLE driver_locations (
driver_id TEXT, -- partition key: all rows for one driver on one node
recorded_at TIMESTAMP, -- clustering column: rows sorted by time within partition
latitude DOUBLE,
longitude DOUBLE,
speed_kmh FLOAT,
PRIMARY KEY (driver_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);
-- DESC = most recent first — "latest location" query reads the first row only
PRIMARY KEY (driver_id, recorded_at)
The first element in PRIMARY KEY is always the partition key. Everything after is a clustering column. This table design means: all GPS pings from driver_id live on the same node (partition key), sorted by recorded_at (clustering column). Range queries on recorded_at are extremely fast — sequential reads on pre-sorted data.
CLUSTERING ORDER BY (recorded_at DESC)
Data is physically stored on disk in this order. DESC means newest first. Getting the latest location for a driver = reading just the first row of the partition. No sort at query time — it's already sorted. Massively efficient for "get latest" queries.
Writing Data — The Fast Path
The scenario: 80,000 drivers sending GPS pings every 5 seconds — that's 16,000 writes per second. Here's how Cassandra handles them:
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
from datetime import datetime
cluster = Cluster(['node1', 'node2', 'node3'])
session = cluster.connect('ridehailing')
Cluster(['node1','node2','node3']) — the driver uses these nodes as initial contact points. It downloads the ring topology from any one of them and learns exactly which node owns which token range. Future requests are sent directly to the right node — no proxy, no routing layer.
# Prepare the statement once — reuse for every write
# Prepared statements are compiled once and sent by ID on reuse — faster
insert_location = session.prepare("""
INSERT INTO driver_locations (driver_id, recorded_at, latitude, longitude, speed_kmh)
VALUES (?, ?, ?, ?, ?)
USING TTL 86400
""")
# TTL 86400 = auto-delete after 24 hours (location data is temporary)
session.prepare()
Sends the CQL statement to the cluster once for compilation. Cassandra returns a prepared statement ID. On every subsequent execution, only the ID + bound values are sent — not the full CQL text. At 16,000 inserts/sec, this eliminates enormous parsing overhead.
USING TTL 86400
Every row expires and is automatically deleted after 86,400 seconds (24 hours). GPS history older than 24 hours is irrelevant — no cleanup jobs needed. Cassandra's TTL applies per-row, not per-table. You can even set different TTLs on different columns.
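TTL behaviour can be modeled as an expiry timestamp checked at read time, which is roughly what Cassandra does: expired cells read as absent and are physically purged later during compaction. A toy sketch, not driver code:

```python
import time

class TTLStore:
    """Toy per-row TTL: rows carry an expiry time, reads skip expired rows."""
    def __init__(self):
        self._rows = {}

    def insert(self, key, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._rows[key] = (value, now + ttl_seconds)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._rows.get(key)
        if entry is None or now >= entry[1]:
            return None   # expired rows read as absent; purged at compaction
        return entry[0]

store = TTLStore()
store.insert("driver_d7821", (51.5074, -0.1278), ttl_seconds=86400, now=0)
print(store.get("driver_d7821", now=1000))    # still live inside the 24h window
print(store.get("driver_d7821", now=90000))   # None: past 86,400s, auto-expired
```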
# Execute a write with QUORUM consistency
# QUORUM = majority of replicas must confirm (2 out of 3 with RF=3)
bound = insert_location.bind((
'driver_d7821',
datetime.now(),
51.5074, # latitude
-0.1278, # longitude
42.5 # speed in km/h
))
bound.consistency_level = ConsistencyLevel.QUORUM
session.execute(bound)
-- Write path internally:
-- 1. Driver receives the write request
-- 2. Hashes 'driver_d7821' to a token → finds the primary node
-- 3. Writes to the primary node's commit log (disk) — for durability
-- 4. Writes to the primary node's memtable (RAM) — for fast reads
-- 5. Forwards to the RF-1 replica nodes
-- 6. QUORUM: waits for 2 of 3 replicas to confirm → returns OK

Write acknowledged: ~1.8ms
No locking. No transaction log coordination across nodes.
The write path — why it's so fast:
Commit log write
A sequential append to the commit log file — the fastest possible disk write. Sequential I/O is 100x faster than random I/O. Even if the node crashes before the memtable is flushed, the commit log is replayed on restart to recover the write.
Memtable → SSTable flush
The memtable fills up in RAM, then Cassandra flushes it to an immutable SSTable file on disk in the background. SSTables are never modified — new writes always create new SSTables. No in-place updates. No random writes. This is why Cassandra sustains millions of writes/sec.
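The memtable-to-SSTable flow can be sketched as a tiny log-structured store in Python (illustrative only; no commit log or compaction): writes land in an in-memory dict, a full memtable is frozen into an immutable "SSTable", and reads check newest data first.

```python
class ToyLSM:
    """Minimal log-structured store: memtable in RAM, immutable flushed tables."""
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []          # each entry is an immutable flushed snapshot
        self.limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value  # always an in-memory insert, never touches old files
        if len(self.memtable) >= self.limit:
            self.sstables.append(dict(self.memtable))  # flush: one sequential write
            self.memtable = {}      # flushed SSTables are never modified again

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest SSTable wins
            if key in table:
                return table[key]
        return None

db = ToyLSM()
db.write("a", 1); db.write("b", 2); db.write("c", 3)   # third write triggers a flush
db.write("a", 99)   # an "update" is just a newer write; the old file is untouched
print(db.read("a"))   # 99: the memtable shadows the flushed value
print(db.read("b"))   # 2: served from the flushed SSTable
```

This is why updates are as cheap as inserts: nothing is modified in place, the newest version simply shadows older ones until compaction merges them.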
Reading Data — Partition Key First, Always
The scenario: Fetch the last hour of GPS data for a driver, and separately fetch just the most recent location:
from datetime import datetime, timedelta
one_hour_ago = datetime.now() - timedelta(hours=1)
# Query 1: last hour of locations for driver_d7821
rows = session.execute("""
SELECT recorded_at, latitude, longitude, speed_kmh
FROM driver_locations
WHERE driver_id = 'driver_d7821'
AND recorded_at >= %s
""", (one_hour_ago,))
for row in rows:
print(f"{row.recorded_at}: ({row.latitude}, {row.longitude}) @ {row.speed_kmh}km/h")
2024-01-15 15:59:55: (51.5074, -0.1278) @ 42.5km/h
2024-01-15 15:59:50: (51.5071, -0.1275) @ 41.2km/h
2024-01-15 15:59:45: (51.5068, -0.1272) @ 40.8km/h
...720 rows (one per 5 seconds for 1 hour)

Query time: 6ms
Rows scanned: 720 (sequential read of one partition — no cross-node lookups)
Why this query is fast: driver_id = 'driver_d7821' directs Cassandra to exactly one partition (one node). All 720 rows are co-located on that node, pre-sorted by recorded_at DESC. The range filter is a sequential read from the top of the partition. No cross-node coordination, no sorting at query time.
# Query 2: most recent location only — LIMIT 1 on DESC-ordered partition
row = session.execute("""
SELECT recorded_at, latitude, longitude
FROM driver_locations
WHERE driver_id = 'driver_d7821'
LIMIT 1
""").one()
print(f"Last seen: {row.recorded_at} at ({row.latitude}, {row.longitude})")
Last seen: 2024-01-15 15:59:55 at (51.5074, -0.1278)

Query time: 0.9ms
Rows scanned: 1 (reads first row of partition — already sorted DESC)
LIMIT 1 on a DESC-ordered partition = instant — because data is stored newest-first, the first row IS the latest entry. Cassandra reads one row and stops. 0.9ms for what would be an expensive MAX(recorded_at) aggregate in SQL.
Consistency Levels — The Tunable Dial
Cassandra's consistency level is set per-query — not per-table or per-database. This is unique: you can run the same table with strong consistency for critical reads and eventual consistency for bulk analytics, all in the same application.
| Level | Nodes Required | Latency | Use When |
|---|---|---|---|
| ONE | 1 of N | Fastest | Stale data acceptable. Analytics, logs, metrics. |
| QUORUM | ⌊N/2⌋+1 of N | Medium | Strong consistency without ALL overhead. Most production reads/writes. |
| LOCAL_QUORUM | Majority in local DC | Low (local DC only) | Multi-DC: strong locally, eventual across regions. |
| ALL | All N nodes | Slowest | Financial writes where every replica must confirm. Rarely used. |
The QUORUM guarantee — a worked example with RF=3:
Write QUORUM + Read QUORUM = Strong Consistency
RF=3 → QUORUM = 2 nodes. Write confirmed on 2 nodes. Read returns from 2 nodes and picks the newest. Since write touched 2 and read touches 2, at least 1 node overlap is guaranteed → you always read the latest write.
Write ONE + Read ONE = Eventual Consistency
Write confirmed on 1 node. Read from 1 node (could be a different node that hasn't received the write yet). Brief stale window. For analytics and counters — perfectly acceptable and 3x faster.
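The overlap guarantee is just arithmetic: with W write replicas and R read replicas out of N, W + R > N forces at least one node to appear in both sets. A quick check in Python (function names are ours, not the driver's):

```python
def quorum(rf: int) -> int:
    """Quorum for replication factor rf: floor(N/2) + 1."""
    return rf // 2 + 1

def is_strongly_consistent(rf: int, write_cl: int, read_cl: int) -> bool:
    """A read is guaranteed to see the latest write iff W + R > N,
    because the write set and read set must then share a replica."""
    return write_cl + read_cl > rf

rf = 3
print(quorum(rf))                                          # 2
print(is_strongly_consistent(rf, quorum(rf), quorum(rf)))  # True: QUORUM + QUORUM
print(is_strongly_consistent(rf, 1, 1))                    # False: ONE + ONE can be stale
```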
The Partition Key Problem — Hot Partitions
The most common Cassandra mistake is a poor partition key. If your partition key has low cardinality — few unique values — all the traffic goes to a small number of nodes. This is called a hot partition and it destroys performance.
❌ Hot partition — BAD design
CREATE TABLE events_by_country (
    country TEXT,    -- BAD partition key
    event_id UUID,
    PRIMARY KEY (country, event_id)
);
-- "US" partition receives 300M writes
-- "LI" (Liechtenstein) gets 100
-- One node is melting, others idle
Country has low cardinality — only ~200 values. All US traffic to one partition. That one node becomes a bottleneck while others are idle. This is the #1 Cassandra anti-pattern.
✅ Even distribution — GOOD design
CREATE TABLE events_by_user (
    user_id UUID,    -- GOOD: millions of unique values
    event_id UUID,
    PRIMARY KEY (user_id, event_id)
);
-- Millions of unique user_ids
-- Traffic spread evenly across all nodes
-- Linear scalability achieved
High-cardinality partition key = even distribution. Each user's events live on one node (fast reads) but different users are spread across all nodes (no bottleneck).
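The difference shows up immediately if you simulate it. This hypothetical sketch hashes 100,000 events onto 4 nodes twice: once keyed by country (low cardinality, 70% of traffic from one value) and once keyed by a unique per-event user id.

```python
import hashlib
from collections import Counter

def node_for(partition_key: str, num_nodes: int = 4) -> int:
    """Stand-in for token ownership: hash the key, pick a node."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# Low cardinality, skewed: 70% of events come from one country.
countries = ["US"] * 70_000 + ["GB"] * 20_000 + ["LI"] * 10_000
by_country = Counter(node_for(c) for c in countries)

# High cardinality: every event carries a distinct user id.
by_user = Counter(node_for(f"user_{i}") for i in range(100_000))

print(by_country)  # one node absorbs all 70,000 "US" writes: a hot partition
print(by_user)     # roughly 25,000 writes per node: even distribution
```

All 70,000 "US" events hash to the same token, so the same node, no matter how many nodes you add; the high-cardinality key spreads load within a fraction of a percent of perfectly even.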
Teacher's Note
The single biggest mindset shift when learning Cassandra: stop thinking about your data model and start thinking about your query patterns. In SQL you design a normalised schema and let the query planner figure out how to answer any query. In Cassandra, you design tables for specific queries — one table per query pattern is a common approach. If you have 5 different query patterns, you might have 5 tables, each storing the same data in a different shape. Duplication is intentional. Cassandra trades storage space for query performance, and at its scale, that trade-off is absolutely worth it.
Practice Questions — You're the Engineer
Scenario: A team designed an events table with product_category as the partition key. They have 20 categories but one category ("electronics") receives 70% of all writes. One Cassandra node is consistently at 95% CPU while the others sit at 10%. Query latency on electronics is degrading. What is this anti-pattern called?
Up Next · Lesson 19
HBase Overview
Google's Bigtable paper implemented in open source — how HBase integrates with the Hadoop ecosystem, its column family architecture, and why it's the database of choice for petabyte-scale analytics workloads.