NoSQL
Column-Family Databases
Netflix sends you a personalised recommendation every time you open the app. Behind that recommendation is a system that has processed billions of watch history events — every play, pause, skip, and rating — across 238 million subscribers. That data lives in a column-family database. Processing it row by row, the way a traditional SQL database lays out data, would take hours. Column-family storage reads only the columns it needs, skipping terabytes of irrelevant data. The result: recommendations computed in seconds, not hours. This lesson explains exactly how that works.
Row Storage vs Column Storage — The Physical Difference
This is the most important concept in this lesson. It's not just a software difference — it's a physical difference in how bytes are laid out on disk. That physical layout determines everything: which queries are fast, which are slow, and why column-family databases exist at all.
Imagine 4 users in a database — same data, two different physical layouts on disk:
❌ Row storage (SQL) — data on disk:
-- All columns for Row 1 together:
[Ali | ali@x.com | 28 | London]
-- All columns for Row 2 together:
[Sara | s@x.com | 34 | Paris]
-- All columns for Row 3 together:
[James | j@x.com | 22 | Tokyo]
-- All columns for Row 4 together:
[Lin | l@x.com | 41 | Berlin]
Query: average age of all users.
Must read: ALL 4 rows × ALL 4 columns = 16 values read to get 4 ages.
✅ Column storage — data on disk:
-- name column stored together:
[Ali | Sara | James | Lin]
-- email column stored together:
[ali@x.com | s@x.com | j@x.com | l@x.com]
-- age column stored together:
[28 | 34 | 22 | 41]
-- city column stored together:
[London | Paris | Tokyo | Berlin]
Query: average age of all users.
Must read: ONLY the age column = 4 values read. Skip 12 values entirely.
At 1 billion rows with 100 columns: column storage reads 1 column (10GB) to compute an average. Row storage reads all 100 columns (1TB) to get the same result. That is the difference between a 2-second query and a 3-minute query.
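The two layouts above can be simulated in a few lines of Python. This is a toy sketch of the physical difference, not any database's actual storage engine: the same four users, once as a list of rows and once as a dict of columns, and a count of how many values each layout must touch to answer "average age".

```python
# Same four users, two physical layouts.
row_store = [
    ["Ali",   "ali@x.com", 28, "London"],
    ["Sara",  "s@x.com",   34, "Paris"],
    ["James", "j@x.com",   22, "Tokyo"],
    ["Lin",   "l@x.com",   41, "Berlin"],
]
column_store = {
    "name":  ["Ali", "Sara", "James", "Lin"],
    "email": ["ali@x.com", "s@x.com", "j@x.com", "l@x.com"],
    "age":   [28, 34, 22, 41],
    "city":  ["London", "Paris", "Tokyo", "Berlin"],
}

# Average age, row layout: every value of every row passes through the scan,
# because a whole row comes off disk together.
values_read = 0
total = 0
for row in row_store:
    values_read += len(row)
    total += row[2]          # age is the third field
print(values_read, total / len(row_store))   # 16 31.25

# Average age, column layout: only the age column is touched.
ages = column_store["age"]
print(len(ages), sum(ages) / len(ages))      # 4 31.25
```

Same answer either way — the difference is that the column layout read 4 values instead of 16, and that ratio only widens as rows get wider.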
What "Column-Family" Actually Means
The term "column-family database" is slightly misleading. These databases do not just store data by column — they store data in groups of related columns called column families. Each column family is stored separately on disk. Columns within a family are always read together.
Column Family 1 — "profile"
user_002: name=Sara, email=s@x.com
user_003: name=James
-- user_003 has no email — no NULL stored
Stored together on disk. All profile reads are a single sequential I/O operation.
Column Family 2 — "activity"
user_002: last_login=2024-01-14
user_003: last_login=2024-01-15, clicks=89
Stored separately from profile. Activity reports never need to touch the profile column family.
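As a rough sketch, the two families above behave like two separate dictionaries keyed by row key — an illustration of the layout, not Cassandra's on-disk format. Note that sparse rows simply omit columns; there is no NULL placeholder to store or skip.

```python
# Each column family is a separate on-disk structure keyed by row key.
profile = {
    "user_002": {"name": "Sara", "email": "s@x.com"},
    "user_003": {"name": "James"},               # no email column stored at all
}
activity = {
    "user_002": {"last_login": "2024-01-14"},
    "user_003": {"last_login": "2024-01-15", "clicks": 89},
}

# An activity report touches only the activity family;
# the profile family is never read.
total_clicks = sum(cols.get("clicks", 0) for cols in activity.values())
print(total_clicks)   # 89
```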
The Write Path — LSM-Tree Architecture
Column-family databases use an LSM-tree (Log-Structured Merge-tree) write path. This is the architectural reason they achieve millions of writes per second. Here is how it works step by step:
Write arrives — Commit Log (disk)
Every write is first appended to the commit log — a sequential file on disk. Sequential append is the fastest possible disk operation. This provides durability: if the server crashes, the commit log is replayed on restart and no data is lost.
Write goes to Memtable (RAM)
Simultaneously, the write is stored in an in-memory sorted tree (the memtable). The client receives a success response immediately — no waiting for disk. This is why writes feel instant: they are confirmed the moment RAM is updated.
Memtable fills — Flush to SSTable (disk)
When the memtable reaches its size threshold (typically 64MB to 256MB), it is flushed to an immutable SSTable (Sorted String Table) file on disk. SSTables are never modified — new writes always create new SSTables. This eliminates random I/O entirely.
Background compaction merges SSTables
Over time, multiple small SSTables accumulate. A background compaction process merges them into fewer larger SSTables — removing deleted rows and duplicates. Reads become faster as fewer files need to be checked per query.
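The four steps above can be sketched as a miniature LSM engine. This is an illustrative toy under simplified assumptions (a plain file as the commit log, a dict as the memtable, in-memory dicts standing in for SSTable files) — not Cassandra's implementation.

```python
import os
import tempfile

class MiniLSM:
    """Toy LSM write path: commit log append, memtable, flush, compaction."""

    def __init__(self, log_path, flush_threshold=3):
        self.log = open(log_path, "a")       # sequential append-only commit log
        self.memtable = {}                   # recent writes, held in RAM
        self.sstables = []                   # immutable sorted tables, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.log.write(f"{key}={value}\n")   # 1. durability: append to commit log
        self.log.flush()
        self.memtable[key] = value           # 2. fast ack: update RAM, return
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                    # 3. spill to a new immutable SSTable

    def _flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}                   # SSTables are never modified in place

    def get(self, key):
        if key in self.memtable:             # newest data lives in the memtable
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # then SSTables, newest first
            if key in sstable:
                return sstable[key]
        return None

    def compact(self):
        # 4. merge SSTables into one; newer values win, duplicates vanish
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        self.sstables = [dict(sorted(merged.items()))]

log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
db = MiniLSM(log_path, flush_threshold=2)
db.put("user_001", "Ali")
db.put("user_002", "Sara")      # threshold reached -> flushed to SSTable 1
db.put("user_001", "Alison")    # newer value sits in the memtable
print(db.get("user_001"))       # Alison -- memtable shadows the older SSTable
db.compact()
```

Notice that the update to `user_001` never touched SSTable 1 — new writes always land in new places, which is exactly how the real write path avoids random I/O.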
The Read Path — Memtable + Bloom Filters + SSTables
Reads in column-family databases are more complex than writes. The database must check multiple places before returning a result:
Check Memtable
Most recent writes live here. If found — return immediately. RAM access is around 100ns.
Check Bloom Filter
Probabilistic structure per SSTable. "Definitely not here" skips that file entirely. Eliminates most SSTable reads.
Read SSTables
Check remaining SSTable files newest-first. Row cache can short-circuit this for hot data.
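The Bloom filter step is worth a closer look, since it is what keeps reads fast despite multiple SSTables. Below is a minimal sketch of one (using SHA-256 for hashing purely for convenience; real implementations use faster non-cryptographic hashes). Its one guarantee: it never produces a false negative, so "not here" means the SSTable can be skipped without touching disk.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: may say "maybe present", never misses a real member."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key from independent hashes.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False here is definitive; True means "check the SSTable".
        return all(self.bits[pos] for pos in self._positions(key))

sstable_keys = {"user_001", "user_002"}
bf = BloomFilter()
for key in sstable_keys:
    bf.add(key)

print(bf.might_contain("user_001"))   # True -- worth reading the SSTable
print(bf.might_contain("user_999"))   # almost certainly False -- skip the file
```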
Hands-on — Building a SaaS Analytics Table
The scenario: You are building a real-time analytics dashboard for a SaaS platform. Every user action generates an event. You need to query events by user for the last 30 days. Here is the full table definition and working queries:
-- Create keyspace and event table
CREATE KEYSPACE saas_analytics
WITH replication = {'class':'SimpleStrategy','replication_factor':3};
-- SimpleStrategy suits a single-datacentre demo; production multi-datacentre
-- clusters use NetworkTopologyStrategy.
USE saas_analytics;
CREATE TABLE user_events (
user_id TEXT,
event_time TIMESTAMP,
event_type TEXT,
properties MAP<TEXT, TEXT>, -- flexible key-value for event metadata
PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND default_time_to_live = 2592000; -- auto-delete after 30 days
MAP<TEXT, TEXT> properties
A map column type stores key-value pairs inside a single column. A page_view event stores {"url": "/pricing", "time_on_page": "42s"}. A purchase event stores {"product_id": "p_44", "amount": "49.99"}. Different event types carry different properties with zero schema changes.
default_time_to_live = 2592000
Every row in this table auto-expires after 30 days (2,592,000 seconds). Analytics events older than 30 days are automatically garbage-collected by Cassandra. No application-level cleanup, no scheduled jobs, no unbounded table growth.
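A quick sanity check on that number, and on what expiry means for a given write (the timestamps here are illustrative, not from the table above):

```python
from datetime import datetime, timedelta

TTL_SECONDS = 30 * 24 * 60 * 60   # 2,592,000 -- matches default_time_to_live
assert TTL_SECONDS == 2592000

# Conceptually, a row written at time T becomes invisible at T + TTL.
write_time = datetime(2024, 1, 15, 12, 0, 0)
expires_at = write_time + timedelta(seconds=TTL_SECONDS)
print(expires_at)   # 2024-02-14 12:00:00
```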
from cassandra.cluster import Cluster
from datetime import datetime, timedelta
session = Cluster(['node1', 'node2']).connect('saas_analytics')
# Insert a page view event
session.execute("""
INSERT INTO user_events (user_id, event_time, event_type, properties)
VALUES (%s, %s, %s, %s)
""", ('user_4421', datetime.now(), 'page_view',
{'url': '/pricing', 'time_on_page': '42s'}))
The properties value {'url': '/pricing', 'time_on_page': '42s'} is passed as a Python dict and stored as a Cassandra MAP. Each event type carries different key-value pairs inside this one column. The table schema never changes — the MAP absorbs any new event properties naturally.
# Query: all events for user_4421 in the last 7 days
seven_days_ago = datetime.now() - timedelta(days=7)
rows = session.execute("""
SELECT event_time, event_type, properties
FROM user_events
WHERE user_id = 'user_4421'
AND event_time >= %s
""", (seven_days_ago,))
event_time | event_type | properties
---------------------------+-------------+------------------------------------------
2024-01-15 16:04:22.000 | page_view | {'url': '/pricing', 'time_on_page': '42s'}
2024-01-15 15:58:11.000 | page_view | {'url': '/features', 'time_on_page': '18s'}
2024-01-14 09:22:03.000 | purchase | {'product_id': 'p_44', 'amount': '49.99'}
2024-01-13 14:01:55.000 | signup | {'plan': 'pro', 'referrer': 'google'}
(4 rows — sequential read of one partition, pre-sorted DESC)
Query time: 3.2ms
WHERE user_id = 'user_4421' AND event_time >= %s
The partition key (user_id) directs Cassandra to exactly one node. The clustering column range filter (event_time) is a sequential read of pre-sorted data on that node. No cross-node coordination. No sort at query time. The result already arrives in DESC order — no ORDER BY needed.
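The "exactly one node" routing can be sketched with a small hash ring. This is a simplified illustration: Cassandra actually uses the Murmur3 partitioner and virtual nodes, while this sketch uses SHA-256 and four hypothetical node names to keep it dependency-free.

```python
import hashlib
from bisect import bisect_right

def token_of(value):
    """Map any string to a position on a 32-bit ring (sketch; not Murmur3)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16) % 2**32

# Four hypothetical nodes, each owning the ring segment after its token.
ring = sorted((token_of(name), name)
              for name in ["node1", "node2", "node3", "node4"])
tokens = [t for t, _ in ring]

def node_for(partition_key):
    # Walk clockwise from the key's token to the next node token.
    idx = bisect_right(tokens, token_of(partition_key)) % len(ring)
    return ring[idx][1]

# Every event for user_4421 hashes to the same token, hence the same node --
# the whole partition is one sequential read on one machine.
print(node_for("user_4421") == node_for("user_4421"))   # True
```

Because routing is a pure function of the partition key, the coordinator never needs to ask other nodes where the data lives.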
Column family read advantage
Cassandra reads only the three columns requested (event_time, event_type, properties) from disk. Any other columns in the row are not touched. At millions of events per user, this selective column reading keeps queries consistently fast.
Where Column-Family Databases Shine
Instagram — Activity Feeds
400M daily active users each generating dozens of events per session. Cassandra stores every like, comment, and follow — partitioned by user. Each user's feed is one partition. Fast retrieval, massive write throughput.
Netflix — Watch History
Every play, pause, skip, and rating from 238M subscribers. Watch history partitioned by user. The recommendation engine reads one partition per user — never needs to cross nodes.
Uber — Trip Events
Location pings, fare calculations, driver ratings — all time-series. Cassandra handles 1M+ writes/sec at peak. Trip data partitioned by trip_id — all events for one trip on one node.
Apple — iCloud
Device sync for hundreds of millions of devices. Reportedly 75,000+ Cassandra nodes — one of the largest deployments in the world. Contacts, calendars, notes all stored in column families.
When NOT to Use Column-Family Databases
| Scenario | Why Column-Family Struggles | Better Choice |
|---|---|---|
| Complex ad-hoc queries | No partition key in WHERE = full table scan. Must know queries upfront. | PostgreSQL or MongoDB |
| Multi-entity transactions | No ACID across multiple partition keys. Each partition is atomic independently. | PostgreSQL |
| Rich document queries | CQL is limited. No complex filtering on arbitrary nested fields. | MongoDB |
| Small datasets under 1M rows | Cluster operational overhead is not justified at small scale. | PostgreSQL or SQLite |
Teacher's Note
The biggest revelation when understanding column-family storage is that the name is almost backwards from what you expect. "Column-family" sounds like it is about organising columns — but what it is really about is co-locating related rows via the partition key, and pre-sorting those rows via the clustering column. The column storage format is the implementation detail that makes analytics fast. The partition and clustering key design is the skill that makes everything else work. Get the key design right and a column-family database will outperform anything else at scale. Get it wrong and you will be writing support tickets wondering why your queries take 30 seconds.
Practice Questions — You're the Engineer
Scenario: Your data warehouse holds 500M order rows with 40 fields each, and an analyst needs the sum of one column, total_amount. PostgreSQL reads all 40 fields for all 500M rows — about 20TB. Your Cassandra-based warehouse reads only 500MB and returns the result 100x faster. What storage format makes this possible?
Up Next · Lesson 18
Cassandra Introduction
From Facebook's inbox search to Netflix's watch history — how Cassandra's ring architecture, tunable consistency, and partition key design make it the database of choice for million-writes-per-second workloads.