NoSQL
NoSQL Data Modeling
In SQL, you model the data first and figure out the queries later. In NoSQL, you do the opposite — you model around the queries, and the data shape follows. This is the single biggest mental shift when moving from relational to NoSQL, and the engineers who get it wrong spend the next six months rewriting their schema.
The Fundamental Difference — Data-First vs Query-First
In a relational database you normalise data into clean, logical tables — customers, orders, products — and trust the query planner to JOIN them efficiently at runtime. The schema reflects the real world. Queries are flexible because JOINs are cheap.
In NoSQL — especially Cassandra and DynamoDB — there are no JOINs. If you model your data the SQL way, you will write queries that require reading multiple tables and assembling results in application code. At 10 million rows that is acceptable. At 10 billion rows it is a disaster. NoSQL data modeling forces you to model around access patterns — the exact queries your application will run — before you write a single table definition.
Same domain, two completely different schemas:
❌ SQL mindset applied to Cassandra
customers (customer_id, name, email)
orders (order_id, customer_id, date)
order_items (order_id, product_id, qty)
-- To get a customer's orders you need:
-- SELECT + JOIN + JOIN = full table scans
-- Cassandra has no JOIN. This query fails.
✅ Query-first NoSQL modeling
-- "Get all orders for customer X"
orders_by_customer (
customer_id TEXT, -- partition key
order_date TIMESTAMP, -- clustering
order_id TEXT,
items LIST<TEXT>
)
-- One partition read. No JOIN needed.
Step 1 — Define Your Access Patterns First
Before you touch a table definition, write down every query your application will run. Be specific — not "get user data" but "get a user's profile by user_id" and "get all users who signed up in January ordered by sign-up date." Each distinct access pattern may need its own table in Cassandra or its own index in MongoDB.
The Access Pattern Checklist
For every query your application needs, document: (1) what you filter by, (2) what you sort by, (3) how many rows are returned, and (4) how often it runs. The most frequent, most latency-sensitive queries define your primary table structure. Everything else is secondary.
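The checklist above can be captured as a lightweight artifact in code. A sketch, with hypothetical query names, of how a team might document access patterns before writing any DDL:

```python
from dataclasses import dataclass

@dataclass
class AccessPattern:
    """One documented query: the four checklist items."""
    name: str
    filter_by: list     # (1) what you filter by
    sort_by: str        # (2) what you sort by ("" if none)
    rows_returned: str  # (3) how many rows come back
    frequency: str      # (4) how often it runs

# Hypothetical patterns for an ordering system
patterns = [
    AccessPattern("orders_for_customer", ["customer_id"], "order_date DESC",
                  "~50", "every page load"),
    AccessPattern("order_by_id", ["order_id"], "", "1", "occasional"),
]

# The hottest, most latency-sensitive pattern drives the primary table design
primary = max(patterns, key=lambda p: p.frequency == "every page load")
print(primary.name)  # orders_for_customer
```

Keeping this list in the repository next to the schema makes Step 1 reviewable: a new query means a new entry, and a new entry may mean a new table.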
Step 2 — Choose the Right Partition Key
In Cassandra and DynamoDB, the partition key is the most critical decision you make. It determines which node stores the data and how you can query it. A good partition key distributes data evenly across nodes and matches your most frequent query pattern. A bad partition key creates hotspots — one node drowning while the rest sit idle.
| Scenario | Bad partition key | Why it fails | Good partition key |
|---|---|---|---|
| IoT sensor readings | timestamp | All current writes go to one partition — hotspot | sensor_id |
| E-commerce orders | country | Low cardinality — all US orders on one node | customer_id |
| Social media posts | post_id (UUID) | Can't retrieve "all posts by user" — need user_id | user_id |
| Chat messages | sender_id | Popular users create huge partitions — unbounded growth | conversation_id |
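The hotspot failure mode in the table above is easy to demonstrate. A rough sketch, using a hash of the partition key to stand in for the cluster's token-based placement: hash each key to one of 100 nodes and count how many nodes actually receive data.

```python
import hashlib

def node_for(partition_key: str, num_nodes: int = 100) -> int:
    # Stand-in for consistent hashing / token ranges
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Low cardinality: 10,000 orders partitioned by country
countries = ["US", "UK", "DE", "FR", "IN"]
nodes_by_country = {node_for(countries[i % 5]) for i in range(10_000)}

# High cardinality: the same 10,000 orders partitioned by customer_id
nodes_by_customer = {node_for(f"cust_{i}") for i in range(10_000)}

print(len(nodes_by_country))   # at most 5 nodes carry all the data
print(len(nodes_by_customer))  # close to all 100 nodes share the load
```

With country as the key, 95 of your 100 nodes never see a write; with customer_id, the load spreads across essentially the whole cluster.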
Hands-on — Modeling a Ride-Sharing App
The scenario: You are the lead data engineer at a ride-sharing startup. The product team gives you three access patterns the app needs to support: (1) get a driver's full trip history ordered by most recent first, (2) get all trips for a given city on a given date, and (3) get the details of a single trip by trip ID. You are modeling these in Cassandra. Each access pattern needs its own table — there are no JOINs.
-- Table 1: shaped for query "get all trips for a driver, newest first"
CREATE TABLE trips_by_driver (
driver_id TEXT,
trip_date TIMESTAMP,
trip_id UUID,
city TEXT,
distance_km DECIMAL,
fare_usd DECIMAL,
status TEXT,
PRIMARY KEY (driver_id, trip_date, trip_id)
) WITH CLUSTERING ORDER BY (trip_date DESC, trip_id ASC);
Created table trips_by_driver (took 0.842 seconds)

PRIMARY KEY (driver_id, trip_date, trip_id)
driver_id is the partition key — all trips for one driver live on the same node, making the "get all trips for driver X" query a single partition read. trip_date and trip_id are clustering columns — they sort data within the partition and allow efficient range queries. trip_id is appended to guarantee uniqueness when two trips share the same timestamp.
CLUSTERING ORDER BY (trip_date DESC, trip_id ASC)
Data is physically stored on disk in this order — most recent trips first. When the app queries the last 10 trips, Cassandra reads the first 10 records in the partition sequentially. No sort at query time, no ORDER BY overhead. If you forget this and sort ASC, reading "newest first" means scanning the whole partition backwards.
The scenario continues: The ops team needs a city-level dashboard showing all trips in London on a specific date — for surge pricing calculations. This is a completely different access pattern, so it needs its own table. You cannot use trips_by_driver for this — querying it by city would require scanning every driver's partition.
-- Table 2: shaped for query "get all trips in a city on a date"
CREATE TABLE trips_by_city_date (
city TEXT,
trip_date DATE,
trip_time TIMESTAMP,
trip_id UUID,
driver_id TEXT,
distance_km DECIMAL,
fare_usd DECIMAL,
PRIMARY KEY ((city, trip_date), trip_time, trip_id)
) WITH CLUSTERING ORDER BY (trip_time DESC, trip_id ASC);
Created table trips_by_city_date (took 0.614 seconds)
PRIMARY KEY ((city, trip_date), trip_time, trip_id)
The double parentheses around (city, trip_date) create a composite partition key — both fields together determine which node stores this row. Using city alone would put all London trips ever on one node — a massive, ever-growing partition. Adding trip_date bounds the partition to one day of trips per city, keeping partition sizes manageable and predictable.
Two tables for two access patterns
The same trip data now lives in two tables — trips_by_driver and trips_by_city_date. Every write goes to both tables. This is intentional and expected in NoSQL modeling. You are trading write duplication for read efficiency. One fast read per access pattern is worth two writes per event.
The scenario continues: Now you write the actual data and run both queries to validate the model works as designed.
from cassandra.cluster import Cluster
from datetime import datetime
import uuid
session = Cluster(['localhost']).connect('rideshare')
trip_id = uuid.uuid4()
driver_id = 'drv_882'
now = datetime.utcnow()
# Write to BOTH tables on every trip completion
insert_driver = """
INSERT INTO trips_by_driver
(driver_id, trip_date, trip_id, city, distance_km, fare_usd, status)
VALUES (%s, %s, %s, %s, %s, %s, %s)
"""
insert_city = """
INSERT INTO trips_by_city_date
(city, trip_date, trip_time, trip_id, driver_id, distance_km, fare_usd)
VALUES (%s, %s, %s, %s, %s, %s, %s)
"""
session.execute(insert_driver,
(driver_id, now, trip_id, 'London', 8.4, 14.50, 'completed'))
session.execute(insert_city,
('London', now.date(), now, trip_id, driver_id, 8.4, 14.50))
Insert to trips_by_driver: completed in 1.2ms
Insert to trips_by_city_date: completed in 0.9ms
Two inserts per trip completion
Every time a trip completes, the application writes to both tables. This is the NoSQL trade-off in action — you accept slightly higher write complexity so that every read is a single, fast partition lookup. In production, these two inserts are often sent as a Cassandra batch to ensure both succeed or both fail together.
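The dual-write discipline is easy to get wrong if inserts are scattered through the codebase. One common safeguard is a single helper that owns both statements. Sketched here against a minimal stub session so the wiring is visible; in real code `session` would be a cassandra-driver Session, and the two inserts would typically be wrapped in a logged BatchStatement as noted above.

```python
import uuid
from datetime import datetime

INSERT_DRIVER = ("INSERT INTO trips_by_driver "
                 "(driver_id, trip_date, trip_id, city, distance_km, fare_usd, status) "
                 "VALUES (%s, %s, %s, %s, %s, %s, %s)")
INSERT_CITY = ("INSERT INTO trips_by_city_date "
               "(city, trip_date, trip_time, trip_id, driver_id, distance_km, fare_usd) "
               "VALUES (%s, %s, %s, %s, %s, %s, %s)")

def record_trip(session, driver_id, city, distance_km, fare_usd, when=None):
    """Single choke point: every completed trip hits BOTH tables."""
    when = when or datetime.utcnow()
    trip_id = uuid.uuid4()
    session.execute(INSERT_DRIVER,
                    (driver_id, when, trip_id, city, distance_km, fare_usd, "completed"))
    session.execute(INSERT_CITY,
                    (city, when.date(), when, trip_id, driver_id, distance_km, fare_usd))
    return trip_id

# Stub session that records which table each statement targets, for illustration
class StubSession:
    def __init__(self):
        self.calls = []
    def execute(self, stmt, params):
        self.calls.append(stmt.split()[2])  # third token is the table name

s = StubSession()
record_trip(s, "drv_882", "London", 8.4, 14.50)
print(s.calls)  # ['trips_by_driver', 'trips_by_city_date']
```

If a trip can reach one table but not the other, the driver's history and the city dashboard silently disagree; funneling all writes through one function (or one batch) makes that failure mode much harder to introduce.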
# Query 1: driver's recent trips (uses trips_by_driver)
rows = session.execute("""
SELECT trip_date, city, distance_km, fare_usd
FROM trips_by_driver
WHERE driver_id = 'drv_882'
LIMIT 10
""")
for r in rows:
print(f" {r.trip_date:%Y-%m-%d} {r.city:<12} {r.distance_km}km ${r.fare_usd}")
# Query 2: all London trips today (uses trips_by_city_date)
from datetime import date
rows2 = session.execute("""
SELECT trip_time, driver_id, fare_usd
FROM trips_by_city_date
WHERE city = 'London' AND trip_date = %s
LIMIT 100
""", (date.today(),))
-- Query 1: driver drv_882 recent trips
2025-01-15  London       8.4km  $14.50
2025-01-15  London       5.1km  $9.20
2025-01-14  Manchester  12.3km  $21.80
2025-01-14  London       3.8km  $7.40
Query time: 2.1ms (single partition read)

-- Query 2: London trips today
16:04:22  drv_882  $14.50
15:58:11  drv_441  $22.00
15:51:03  drv_109  $8.75
Query time: 3.4ms (single partition read)
Both queries: single partition read, under 4ms
Neither query touches more than one partition. Query 1 reads the drv_882 partition in trips_by_driver. Query 2 reads the ('London', today) partition in trips_by_city_date. This is the payoff of query-first modeling — at 100 million trips, these queries still take under 5ms because they never do a full table scan.
The Three Core NoSQL Modeling Patterns
Most NoSQL schemas are built from three recurring patterns. Recognise which one fits your access pattern and you have your table structure.
One Table Per Query
Each distinct access pattern gets its own table, shaped so the query hits one partition. Data is duplicated across tables. Used heavily in Cassandra and DynamoDB. Write amplification is the trade-off — accept it, it is worth it.
trips_by_driver, trips_by_city_date, trips_by_id
Embed Related Data
In MongoDB, nest related data inside the parent document instead of referencing it from a separate collection. An order document contains its line items array directly. One document read returns everything — no second query, no JOIN equivalent.
{ order_id, customer, items: [ {...}, {...} ] }
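A sketch of what the embedded shape buys you in application code: one fetch returns the order and its items together. Plain dicts are used here for illustration; with pymongo this would be a single find_one.

```python
# Embedded order document: line items live inside the parent
order = {
    "order_id": "ord_1001",
    "customer": {"id": "cust_42", "name": "Ada"},
    "items": [
        {"product_id": "p_1", "qty": 2, "price": 9.99},
        {"product_id": "p_7", "qty": 1, "price": 24.00},
    ],
}

# One read, no second query, no JOIN equivalent
total = sum(i["qty"] * i["price"] for i in order["items"])
print(round(total, 2))  # 43.98
```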
Bucket Pattern
Group time-series events into fixed-size buckets — one document or row per hour, day, or week per entity. Reduces the total number of documents and keeps partition sizes bounded. Common for IoT and analytics in MongoDB and Cassandra.
{ sensor_id, hour: "2025-01-15T14", readings: [ {t, v}, ... ] }
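The bucket pattern can be sketched as a small write path: each reading is appended to a bucket keyed by (sensor_id, hour), so a sensor emitting every second produces 24 documents a day instead of 86,400. The field names are hypothetical; in MongoDB each bucket would be one document updated with $push.

```python
from collections import defaultdict
from datetime import datetime

buckets = defaultdict(list)  # one bucket per (sensor, hour)

def record_reading(sensor_id: str, ts: datetime, value: float):
    hour_key = (sensor_id, ts.strftime("%Y-%m-%dT%H"))  # e.g. ("s1", "2025-01-15T14")
    buckets[hour_key].append({"t": ts.isoformat(), "v": value})

record_reading("s1", datetime(2025, 1, 15, 14, 0, 5), 21.4)
record_reading("s1", datetime(2025, 1, 15, 14, 30, 0), 21.9)
record_reading("s1", datetime(2025, 1, 15, 15, 1, 0), 22.3)

print(len(buckets))                           # 2 hourly buckets
print(len(buckets[("s1", "2025-01-15T14")]))  # 2 readings in the 14:00 bucket
```

The bucket size also doubles as the partition-size bound from Step 2: an hour of readings per sensor is a predictable ceiling, not an unbounded stream.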
What Happens When You Get It Wrong
Unbounded Partition Growth
Partitioning a chat app by user_id means a power user with 10 million messages creates a 10GB partition. Reads slow down. Compaction stalls. The node falls behind. Always think about partition size ceiling.
Low-Cardinality Partition Keys
Partitioning by status (active/inactive) or country concentrates all data on a few nodes. A 100-node cluster effectively becomes a 3-node cluster. You paid for 100 nodes and are using 3.
ALLOW FILTERING
Using ALLOW FILTERING in Cassandra means you are filtering without a partition key — scanning the entire table. Cassandra forces you to write it explicitly because it knows this is catastrophically slow at scale.
Missing Access Patterns
Modeling before talking to every team that touches the data. Three months in, the analytics team asks for a query you never modeled for. Now you need to backfill a new table with historical data — a painful, expensive process.
Teacher's Note
The hardest part of NoSQL modeling is accepting data duplication. SQL training builds a strong instinct that duplication is wrong. In NoSQL it is a tool — you duplicate deliberately so every read is a single partition lookup. The question is not "is this data duplicated?" but "does every access pattern I care about hit exactly one partition?" If the answer is yes, your schema is correct.
Practice Questions — You're the Engineer
Scenario:
You are designing a table to serve the query "get all trips in a city on a date." Partitioning by city alone puts all London trips — past, present, and future — on one node, which will eventually hold terabytes of data. Your tech lead suggests including trip_date in the partition key as well, using the double-parentheses syntax ((city, trip_date)). What is this called?
Scenario:
A teammate writes a query that filters on status = 'completed' without including the partition key in the WHERE clause. Cassandra rejects the query with an error message suggesting they add a specific keyword to force it to execute. You tell them never to add that keyword in production. What is the keyword Cassandra is suggesting?
Quiz — NoSQL Data Modeling in Production
Scenario:
Your trips_by_driver table is partitioned by driver_id. The ops team now needs a dashboard showing all trips in a given city on a given date — a query your current schema cannot answer without scanning every partition. You cannot add a JOIN. What is the correct NoSQL approach?
Scenario:
You ship a schema whose partition key is status (values: 'active' or 'inactive'). Your 100-node cluster launches and monitoring immediately shows 2 nodes at 95% CPU while the other 98 are nearly idle. You recognise the anti-pattern immediately. What is the root cause?
Scenario:
A teammate coming from SQL models orders in MongoDB as separate orders and line_items collections and writes application code to fetch both and merge them. You have a better approach that requires one round trip. What is it?
Up Next · Lesson 24
Query-Driven Design
How to take a list of access patterns and systematically translate them into a production-ready NoSQL schema — with real worked examples.