NoSQL
Column-Family Databases
Netflix sends you a personalised recommendation every time you open the app. Behind that recommendation is a system that has processed billions of watch history events — every play, pause, skip, and rating — across 238 million subscribers. That data lives in a column-family database. Processing it row by row, the way a traditional SQL database lays out data, would take hours. Column-family storage reads only the columns it needs, skipping terabytes of irrelevant data. The result: recommendations computed in seconds, not hours. This lesson explains exactly how that works.
Row Storage vs Column Storage — The Physical Difference
This is the most important concept in this lesson. It's not just a software difference — it's a physical difference in how bytes are laid out on disk. That physical layout determines everything: which queries are fast, which are slow, and why column-family databases exist at all.
Imagine 4 users in a database — same data, two different physical layouts on disk:
❌ Row storage (SQL) — data on disk:
-- All columns for Row 1 together:
[Ali | ali@x.com | 28 | London]
-- All columns for Row 2 together:
[Sara | s@x.com | 34 | Paris]
-- All columns for Row 3 together:
[James | j@x.com | 22 | Tokyo]
-- All columns for Row 4 together:
[Lin | l@x.com | 41 | Berlin]
Query: average age of all users.
Must read: ALL 4 rows × ALL 4 columns = 16 values read to get 4 ages.
✅ Column storage — data on disk:
-- name column stored together:
[Ali | Sara | James | Lin]
-- email column stored together:
[ali@x.com | s@x.com | j@x.com | l@x.com]
-- age column stored together:
[28 | 34 | 22 | 41]
-- city column stored together:
[London | Paris | Tokyo | Berlin]
Query: average age of all users.
Must read: ONLY the age column = 4 values read. Skip 12 values entirely.
At 1 billion rows with 100 columns: column storage reads 1 column (10GB) to compute an average. Row storage reads all 100 columns (1TB) to get the same result. That is the difference between a 2-second query and a 3-minute query.
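The two layouts above can be simulated in a few lines of Python. This is a toy sketch of the physical difference, not any database's actual storage engine: the same four users, once as a list of rows and once as a dict of columns, and a count of how many values each layout must touch to answer "average age".

```python
# Same four users, two physical layouts.
row_store = [
    ["Ali",   "ali@x.com", 28, "London"],
    ["Sara",  "s@x.com",   34, "Paris"],
    ["James", "j@x.com",   22, "Tokyo"],
    ["Lin",   "l@x.com",   41, "Berlin"],
]
column_store = {
    "name":  ["Ali", "Sara", "James", "Lin"],
    "email": ["ali@x.com", "s@x.com", "j@x.com", "l@x.com"],
    "age":   [28, 34, 22, 41],
    "city":  ["London", "Paris", "Tokyo", "Berlin"],
}

# Average age, row layout: every value of every row passes through the scan,
# because a whole row comes off disk together.
values_read = 0
total = 0
for row in row_store:
    values_read += len(row)
    total += row[2]          # age is the third field
print(values_read, total / len(row_store))   # 16 31.25

# Average age, column layout: only the age column is touched.
ages = column_store["age"]
print(len(ages), sum(ages) / len(ages))      # 4 31.25
```

Same answer either way — the difference is that the column layout read 4 values instead of 16, and that ratio only widens as rows get wider.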
What "Column-Family" Actually Means
The term "column-family database" is slightly misleading. These databases do not just store data by column — they store data in groups of related columns called column families. Each column family is stored separately on disk. Columns within a family are always read together.
Column Family 1 — "profile"
user_002: name=Sara, email=s@x.com
user_003: name=James
-- user_003 has no email — no NULL stored
Stored together on disk. All profile reads are a single sequential I/O operation.
Column Family 2 — "activity"
user_002: last_login=2024-01-14
user_003: last_login=2024-01-15, clicks=89
Stored separately from profile. Activity reports never need to touch the profile column family.
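As a rough sketch, the two families above behave like two separate dictionaries keyed by row key — an illustration of the layout, not Cassandra's on-disk format. Note that sparse rows simply omit columns; there is no NULL placeholder to store or skip.

```python
# Each column family is a separate on-disk structure keyed by row key.
profile = {
    "user_002": {"name": "Sara", "email": "s@x.com"},
    "user_003": {"name": "James"},               # no email column stored at all
}
activity = {
    "user_002": {"last_login": "2024-01-14"},
    "user_003": {"last_login": "2024-01-15", "clicks": 89},
}

# An activity report touches only the activity family;
# the profile family is never read.
total_clicks = sum(cols.get("clicks", 0) for cols in activity.values())
print(total_clicks)   # 89
```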
The Write Path — LSM-Tree Architecture
Column-family databases use an LSM-tree (Log-Structured Merge-tree) write path. This is the architectural reason they achieve millions of writes per second. Here is how it works step by step:
Write arrives — Commit Log (disk)
Every write is first appended to the commit log — a sequential file on disk. Sequential append is the fastest possible disk operation. This provides durability: if the server crashes, the commit log is replayed on restart and no data is lost.
Write goes to Memtable (RAM)
Simultaneously, the write is stored in an in-memory sorted tree (the memtable). The client receives a success response immediately — no waiting for disk. This is why writes feel instant: they are confirmed the moment RAM is updated.
Memtable fills — Flush to SSTable (disk)
When the memtable reaches its size threshold (typically 64MB to 256MB), it is flushed to an immutable SSTable (Sorted String Table) file on disk. SSTables are never modified — new writes always create new SSTables. This eliminates random I/O entirely.
Background compaction merges SSTables
Over time, multiple small SSTables accumulate. A background compaction process merges them into fewer larger SSTables — removing deleted rows and duplicates. Reads become faster as fewer files need to be checked per query.
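The four steps above can be sketched as a miniature LSM engine. This is an illustrative toy under simplified assumptions (a plain file as the commit log, a dict as the memtable, in-memory dicts standing in for SSTable files) — not Cassandra's implementation.

```python
import os
import tempfile

class MiniLSM:
    """Toy LSM write path: commit log append, memtable, flush, compaction."""

    def __init__(self, log_path, flush_threshold=3):
        self.log = open(log_path, "a")       # sequential append-only commit log
        self.memtable = {}                   # recent writes, held in RAM
        self.sstables = []                   # immutable sorted tables, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.log.write(f"{key}={value}\n")   # 1. durability: append to commit log
        self.log.flush()
        self.memtable[key] = value           # 2. fast ack: update RAM, return
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                    # 3. spill to a new immutable SSTable

    def _flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}                   # SSTables are never modified in place

    def get(self, key):
        if key in self.memtable:             # newest data lives in the memtable
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # then SSTables, newest first
            if key in sstable:
                return sstable[key]
        return None

    def compact(self):
        # 4. merge SSTables into one; newer values win, duplicates vanish
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        self.sstables = [dict(sorted(merged.items()))]

log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
db = MiniLSM(log_path, flush_threshold=2)
db.put("user_001", "Ali")
db.put("user_002", "Sara")      # threshold reached -> flushed to SSTable 1
db.put("user_001", "Alison")    # newer value sits in the memtable
print(db.get("user_001"))       # Alison -- memtable shadows the older SSTable
db.compact()
```

Notice that the update to `user_001` never touched SSTable 1 — new writes always land in new places, which is exactly how the real write path avoids random I/O.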
The Read Path — Memtable + Bloom Filters + SSTables
Reads in column-family databases are more complex than writes. The database must check multiple places before returning a result:
Check Memtable
Most recent writes live here. If found — return immediately. RAM access is around 100ns.
Check Bloom Filter
Probabilistic structure per SSTable. "Definitely not here" skips that file entirely. Eliminates most SSTable reads.
Read SSTables
Check remaining SSTable files newest-first. Row cache can short-circuit this for hot data.
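The Bloom filter step is worth a closer look, since it is what keeps reads fast despite multiple SSTables. Below is a minimal sketch of one (using SHA-256 for hashing purely for convenience; real implementations use faster non-cryptographic hashes). Its one guarantee: it never produces a false negative, so "not here" means the SSTable can be skipped without touching disk.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: may say "maybe present", never misses a real member."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key from independent hashes.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False here is definitive; True means "check the SSTable".
        return all(self.bits[pos] for pos in self._positions(key))

sstable_keys = {"user_001", "user_002"}
bf = BloomFilter()
for key in sstable_keys:
    bf.add(key)

print(bf.might_contain("user_001"))   # True -- worth reading the SSTable
print(bf.might_contain("user_999"))   # almost certainly False -- skip the file
```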
Hands-on — Building a SaaS Analytics Table
The scenario: You are building a real-time analytics dashboard for a SaaS platform. Every user action generates an event. You need to query events by user for the last 30 days. Here is the full table definition and working queries:
-- Create keyspace and event table
CREATE KEYSPACE saas_analytics
WITH replication = {'class':'SimpleStrategy','replication_factor':3};
-- SimpleStrategy suits a single-datacentre demo; production multi-datacentre
-- clusters use NetworkTopologyStrategy.
USE saas_analytics;
CREATE TABLE user_events (
user_id TEXT,
event_time TIMESTAMP,
event_type TEXT,
properties MAP<TEXT, TEXT>, -- flexible key-value for event metadata
PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND default_time_to_live = 2592000; -- auto-delete after 30 days
MAP<TEXT, TEXT> properties
A map column type stores key-value pairs inside a single column. A page_view event stores {"url": "/pricing", "time_on_page": "42s"}. A purchase event stores {"product_id": "p_44", "amount": "49.99"}. Different event types carry different properties with zero schema changes.
default_time_to_live = 2592000
Every row in this table auto-expires after 30 days (2,592,000 seconds). Analytics events older than 30 days are automatically garbage-collected by Cassandra. No application-level cleanup, no scheduled jobs, no unbounded table growth.
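A quick sanity check on that number, and on what expiry means for a given write (the timestamps here are illustrative, not from the table above):

```python
from datetime import datetime, timedelta

TTL_SECONDS = 30 * 24 * 60 * 60   # 2,592,000 -- matches default_time_to_live
assert TTL_SECONDS == 2592000

# Conceptually, a row written at time T becomes invisible at T + TTL.
write_time = datetime(2024, 1, 15, 12, 0, 0)
expires_at = write_time + timedelta(seconds=TTL_SECONDS)
print(expires_at)   # 2024-02-14 12:00:00
```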
from cassandra.cluster import Cluster
from datetime import datetime, timedelta
session = Cluster(['node1', 'node2']).connect('saas_analytics')
# Insert a page view event
session.execute("""
INSERT INTO user_events (user_id, event_time, event_type, properties)
VALUES (%s, %s, %s, %s)
""", ('user_4421', datetime.now(), 'page_view',
{'url': '/pricing', 'time_on_page': '42s'}))
The properties value {'url': '/pricing', 'time_on_page': '42s'} is passed as a Python dict and stored as a Cassandra MAP. Each event type carries different key-value pairs inside this one column. The table schema never changes — the MAP absorbs any new event properties naturally.
# Query: all events for user_4421 in the last 7 days
seven_days_ago = datetime.now() - timedelta(days=7)
rows = session.execute("""
SELECT event_time, event_type, properties
FROM user_events
WHERE user_id = 'user_4421'
AND event_time >= %s
""", (seven_days_ago,))
event_time | event_type | properties
---------------------------+-------------+------------------------------------------
2024-01-15 16:04:22.000 | page_view | {'url': '/pricing', 'time_on_page': '42s'}
2024-01-15 15:58:11.000 | page_view | {'url': '/features', 'time_on_page': '18s'}
2024-01-14 09:22:03.000 | purchase | {'product_id': 'p_44', 'amount': '49.99'}
2024-01-13 14:01:55.000 | signup | {'plan': 'pro', 'referrer': 'google'}
(4 rows — sequential read of one partition, pre-sorted DESC)
Query time: 3.2ms
WHERE user_id = 'user_4421' AND event_time >= %s
The partition key (user_id) directs Cassandra to exactly one node. The clustering column range filter (event_time) is a sequential read of pre-sorted data on that node. No cross-node coordination. No sort at query time. The result already arrives in DESC order — no ORDER BY needed.
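The "exactly one node" routing can be sketched with a small hash ring. This is a simplified illustration: Cassandra actually uses the Murmur3 partitioner and virtual nodes, while this sketch uses SHA-256 and four hypothetical node names to keep it dependency-free.

```python
import hashlib
from bisect import bisect_right

def token_of(value):
    """Map any string to a position on a 32-bit ring (sketch; not Murmur3)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16) % 2**32

# Four hypothetical nodes, each owning the ring segment after its token.
ring = sorted((token_of(name), name)
              for name in ["node1", "node2", "node3", "node4"])
tokens = [t for t, _ in ring]

def node_for(partition_key):
    # Walk clockwise from the key's token to the next node token.
    idx = bisect_right(tokens, token_of(partition_key)) % len(ring)
    return ring[idx][1]

# Every event for user_4421 hashes to the same token, hence the same node --
# the whole partition is one sequential read on one machine.
print(node_for("user_4421") == node_for("user_4421"))   # True
```

Because routing is a pure function of the partition key, the coordinator never needs to ask other nodes where the data lives.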
Column family read advantage
Cassandra reads only the three columns requested (event_time, event_type, properties) from disk. Any other columns in the row are not touched. At millions of events per user, this selective column reading keeps queries consistently fast.
Where Column-Family Databases Shine
Instagram — Activity Feeds
400M daily active users each generating dozens of events per session. Cassandra stores every like, comment, and follow — partitioned by user. Each user's feed is one partition. Fast retrieval, massive write throughput.
Netflix — Watch History
Every play, pause, skip, and rating from 238M subscribers. Watch history partitioned by user. The recommendation engine reads one partition per user — never needs to cross nodes.
Uber — Trip Events
Location pings, fare calculations, driver ratings — all time-series. Cassandra handles 1M+ writes/sec at peak. Trip data partitioned by trip_id — all events for one trip on one node.
Apple — iCloud
Device sync for hundreds of millions of devices. Reportedly 75,000+ Cassandra nodes — one of the largest deployments in the world. Contacts, calendars, notes all stored in column families.
When NOT to Use Column-Family Databases
| Scenario | Why Column-Family Struggles | Better Choice |
|---|---|---|
| Complex ad-hoc queries | No partition key in WHERE = full table scan. Must know queries upfront. | PostgreSQL or MongoDB |
| Multi-entity transactions | No ACID across multiple partition keys. Each partition is atomic independently. | PostgreSQL |
| Rich document queries | CQL is limited. No complex filtering on arbitrary nested fields. | MongoDB |
| Small datasets under 1M rows | Cluster operational overhead is not justified at small scale. | PostgreSQL or SQLite |
Teacher's Note
The biggest revelation when understanding column-family storage is that the name is almost backwards from what you expect. "Column-family" sounds like it is about organising columns — but what it is really about is co-locating related rows via the partition key, and pre-sorting those rows via the clustering column. The column storage format is the implementation detail that makes analytics fast. The partition and clustering key design is the skill that makes everything else work. Get the key design right and a column-family database will outperform anything else at scale. Get it wrong and you will be writing support tickets wondering why your queries take 30 seconds.
Practice Questions — You're the Engineer
Scenario: Your data warehouse holds 500M order rows with 40 fields each, and an analyst needs the sum of one column, total_amount. PostgreSQL reads all 40 fields for all 500M rows — about 20TB. Your Cassandra-based warehouse reads only 500MB and returns the result 100x faster. What storage format makes this possible?
Up Next · Lesson 18
Cassandra Introduction
From Facebook's inbox search to Netflix's watch history — how Cassandra's ring architecture, tunable consistency, and partition key design make it the database of choice for million-writes-per-second workloads.