NoSQL Lesson 7 – Types of NoSQL Databases | Dataplexa
NoSQL Fundamentals · Lesson 7

Types of NoSQL Databases

A hammer and a scalpel are both tools. You wouldn't use one where you need the other. NoSQL is the same — it's not one database, it's four completely different families, each engineered to solve a specific class of problem brilliantly. Pick the wrong family and you'll fight your database every day. Pick the right one and your queries feel effortless. This lesson gives you the map.

The Four Families — A Quick Map

🔑

1. Key-Value Stores

The simplest model. One key maps to one value. Like a giant dictionary. Blindingly fast. No structure enforced.

Redis · DynamoDB · Memcached

📄

2. Document Stores

Stores data as JSON-like documents. Each document is self-describing. Nested objects and arrays welcome. Natural fit for most web apps.

MongoDB · CouchDB · Firestore

📊

3. Column-Family Stores

Organises data into column groups instead of rows. Built for massive write throughput and time-series data. The choice at true planet scale.

Cassandra · HBase · ScyllaDB

🕸️

4. Graph Databases

Models data as nodes connected by edges. Relationships are first-class citizens. When your data IS the connections, this is your tool.

Neo4j · Amazon Neptune · ArangoDB

Family 1 — Key-Value Stores

The key-value store is the purest NoSQL model. Think of it as a massive hash map — you give it a key, it gives you back a value instantly. No schema. No columns. No joins. Just keys and values.

How it stores data — visualised:

Key                         | Value (can be anything)
----------------------------+--------------------------------------------------------------
session:user_4421           | {"token":"abc123","expires":1705000000}
cache:homepage_v3           | "<html>...full rendered page HTML...</html>"
rate_limit:ip_203.0.113.5   | "47"  (requests this minute)
leaderboard:game_99         | [{"player":"Ali","score":9820},{"player":"Sam","score":9100}]

The value can be a string, number, JSON blob, list, or binary data. The database doesn't care — it just stores and retrieves by key.
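The whole contract is small enough to sketch in plain Python: a dict mapping keys to values, plus an optional expiry timestamp per key. This is an illustration of the key-value model with TTL, not how Redis is actually implemented:

```python
import time

class TinyKV:
    """A toy key-value store with per-key TTL (illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ex=None):
        # ex mirrors Redis's EX option: expire after `ex` seconds
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiry on read
            return None
        return value

kv = TinyKV()
kv.set('session:user_4421', '{"token":"abc123"}', ex=1800)
print(kv.get('session:user_4421'))  # the JSON string, until the TTL passes
```

The store never inspects the value: strings, JSON blobs, and lists all pass through unchanged, which is exactly the "database doesn't care" property described above.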

The scenario: You're a backend engineer at a SaaS platform. Your API gets 8,000 requests per second. Every request checks if the user is authenticated by looking up their session token. That's 8,000 database reads per second just for auth. Your PostgreSQL is already at 70% CPU. Your tech lead says: "Move sessions to Redis." Here's exactly what that looks like:

import json
import redis

# Connect to Redis
r = redis.Redis(host='localhost', port=6379, db=0)

# Store a session when user logs in
# EX=1800 means auto-delete after 30 minutes
r.set('session:user_4421',
      '{"token":"abc123","user_id":"u_4421","role":"admin"}',
      ex=1800)

r.set stores a key-value pair. The key is session:user_4421 — the colon is just a naming convention to group related keys. The value is a JSON string. ex=1800 sets a 30-minute TTL — Redis automatically deletes the key when it expires. No cron job. No cleanup script. Zero maintenance.

# Look up session on every API request — O(1) speed
session_data = r.get('session:user_4421')

if session_data:
    session = json.loads(session_data)
    print(f"User: {session['user_id']}, Role: {session['role']}")
else:
    print("Session expired — please log in again")
# Output: User: u_4421, Role: admin

-- Redis lookup time: ~0.3ms
-- PostgreSQL equivalent: 4-12ms
-- At 8,000 req/sec, that's roughly 30-90 seconds of cumulative
--   query latency avoided every second
-- Redis serves this entirely from RAM — no disk read ever

Why key-value wins here:

O(1) lookup: No matter how many sessions exist — 100 or 100 million — finding one by key takes the same time. Redis hashes the key to a memory address and reads directly. No scan, no index traversal.

RAM-only: Redis stores everything in memory. A memory read is 100x faster than a disk read. That 0.3ms latency is real — it's not marketing. Your PostgreSQL query hits disk for cold data. Redis never does.
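The rate_limit key in the earlier table hints at another classic key-value pattern: a fixed-window request counter. Below is a minimal in-memory sketch of that pattern (with Redis you would typically use INCR plus EXPIRE so the counter resets itself); the key name and limits are illustrative:

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` hits per key per `window` seconds (toy sketch)."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self._counters = {}  # key -> (window_start, count)

    def allow(self, key):
        now = time.monotonic()
        start, count = self._counters.get(key, (now, 0))
        if now - start >= self.window:   # window elapsed: reset the counter
            start, count = now, 0
        if count >= self.limit:
            return False                 # over the limit for this window
        self._counters[key] = (start, count + 1)
        return True

limiter = FixedWindowLimiter(limit=3, window=60.0)
results = [limiter.allow('rate_limit:ip_203.0.113.5') for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Like the session lookup, every check is a single O(1) key access, which is why this pattern holds up at thousands of requests per second.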

Family 2 — Document Stores

Document stores let you store, retrieve, and query JSON-like documents. Each document is self-contained — it carries its own structure. No fixed schema, no required columns, no NULLs for missing fields. This is the most popular NoSQL family for web applications.

How a document store organises data:

Collection: users

Document 1:
"_id": "u_001",
"name": "Priya",
"email": "p@x.com"

Document 2:
"_id": "u_002",
"name": "Carlos",
"email": "c@x.com",
"phone": "+1-555-0101",
"langs": ["en", "es"]

Collection: orders

Document 1:
"_id": "ord_881",
"user_id": "u_001",
"items": [
  {"sku":"TEE-M","qty":2},
  {"sku":"HAT-L","qty":1}
],
"total": 74.97

Notice: Priya has no phone or langs field — and that's perfectly fine. Carlos has both. The order document embeds items directly — no separate items table needed.

The scenario: You're building a product catalogue for a multi-category e-commerce app. Each product type has totally different attributes. Here's how you query and update documents in MongoDB:

// Insert a product — rich nested structure, no schema needed
db.products.insertOne({
  sku:      "LAPTOP-PRO-14",
  name:     "ProBook X14",
  category: "electronics",
  price:    1199.00,
  specs: {                        // nested object — no extra table
    ram_gb:   16,
    storage:  "512GB SSD",
    display:  "14-inch Retina"
  },
  tags: ["laptop", "work", "sale"] // array — directly in the document
})
specs: { ram_gb: 16, ... }

A nested object stored directly inside the document. In SQL you'd need a product_specs table with a foreign key join. Here it's just a field. You can query it with dot notation: {"specs.ram_gb": {"$gte": 16}}

tags: ["laptop", "work", "sale"]

Arrays are a native type. Find all products on sale: db.products.find({"tags": "sale"}). MongoDB indexes array elements individually — each tag is searchable without any extra table.

// Query: find all electronics under £500 with 16GB+ RAM
db.products.find({
  category:        "electronics",
  price:           { $lte: 500 },
  "specs.ram_gb":  { $gte: 16 }
})
[
  {
    sku: "ULTRABOOK-Z13",
    name: "ZenBook 13",
    category: "electronics",
    price: 449,
    specs: { ram_gb: 16, storage: "256GB SSD", display: "13-inch OLED" },
    tags: ["laptop", "ultralight"]
  }
]

What just happened:

"specs.ram_gb": { $gte: 16 } — dot notation lets you query nested fields as if they were top-level. MongoDB traverses into the specs object and filters on ram_gb. No JOIN to a specs table. No subquery.

$lte, $gte — MongoDB's comparison operators. $lte = less than or equal, $gte = greater than or equal. All operators start with $ — that's MongoDB's query language convention.
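To make those query semantics concrete, here is a toy matcher in Python that mimics three behaviours used above: dot-notation paths into nested objects, $gte/$lte comparisons, and matching a scalar against an array field (as in tags: "sale"). It sketches the semantics only, not MongoDB's implementation:

```python
def get_path(doc, path):
    """Walk 'specs.ram_gb'-style dot paths into nested dicts."""
    for part in path.split('.'):
        if not isinstance(doc, dict) or part not in doc:
            return None
        doc = doc[part]
    return doc

def matches(doc, query):
    for path, cond in query.items():
        value = get_path(doc, path)
        if isinstance(cond, dict):              # operator form: {"$gte": 16}
            for op, bound in cond.items():
                if op == '$gte' and not (value is not None and value >= bound):
                    return False
                elif op == '$lte' and not (value is not None and value <= bound):
                    return False
        elif isinstance(value, list):           # array field: match any element
            if cond not in value:
                return False
        elif value != cond:                     # plain equality
            return False
    return True

product = {
    "sku": "ULTRABOOK-Z13", "category": "electronics", "price": 449,
    "specs": {"ram_gb": 16}, "tags": ["laptop", "ultralight"],
}
query = {"category": "electronics", "price": {"$lte": 500},
         "specs.ram_gb": {"$gte": 16}}
print(matches(product, query))               # True
print(matches(product, {"tags": "laptop"}))  # True: scalar matches an array element
```

The dot path walks into the nested specs object the same way MongoDB does, which is why no join or subquery is needed.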

Family 3 — Column-Family Stores

Column-family stores look like tables at first glance — but they work completely differently under the hood. Instead of storing data row by row, they store it column by column, in groups called column families. This makes them extraordinarily fast for write-heavy workloads and time-series data.

SQL rows vs Column-Family storage — the key difference:

❌ SQL stores data by ROW

-- On disk, one row together:
[user_1 | London | 28 | active]
[user_2 | Paris | 34 | inactive]
[user_3 | Tokyo | 22 | active]

-- To get all cities:
-- Must read ENTIRE rows first

✅ Cassandra stores data by COLUMN

-- On disk, one column together:
city: [London, Paris, Tokyo]
age: [28, 34, 22]
status: [active, inactive, active]

-- To get all cities:
-- Read ONLY the city column

For analytics queries ("average age of all users") column storage is 10–100x faster — you only read the one column you need, not every field of every row.
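The storage difference can be simulated in a few lines of Python: the same three users held row-wise versus column-wise, with each list access standing in (very loosely) for a disk read:

```python
# Row-oriented: each record stored together, like a SQL heap table
rows = [
    ("user_1", "London", 28, "active"),
    ("user_2", "Paris", 34, "inactive"),
    ("user_3", "Tokyo", 22, "active"),
]

# Column-oriented: each column stored together, as in column-family engines
columns = {
    "user_id": ["user_1", "user_2", "user_3"],
    "city":    ["London", "Paris", "Tokyo"],
    "age":     [28, 34, 22],
    "status":  ["active", "inactive", "active"],
}

# "Average age" from rows: must touch every field of every row
avg_from_rows = sum(row[2] for row in rows) / len(rows)

# Same query from columns: touches ONLY the age column
avg_from_columns = sum(columns["age"]) / len(columns["age"])

print(avg_from_rows, avg_from_columns)  # 28.0 28.0
```

Both give the same answer; the difference is how much data had to be read to get it, which is where the 10-100x gap on wide tables comes from.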

The scenario: You're building a real-time analytics pipeline. Every page view on your platform is an event. At peak you receive 120,000 events per second. Here's how you write and read them in Cassandra:

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['node1']).connect()

# Create a table — partition key is crucial in Cassandra
session.execute("""
  CREATE TABLE IF NOT EXISTS analytics.page_views (
    site_id   TEXT,
    ts        TIMESTAMP,
    user_id   TEXT,
    page      TEXT,
    duration  INT,
    PRIMARY KEY (site_id, ts)  -- site_id = partition, ts = sort
  )
""")
PRIMARY KEY (site_id, ts)

This is the most important design decision in Cassandra. site_id is the partition key — it determines which node stores this row. All rows for the same site live on the same node — fast range queries. ts is the clustering column — rows within a partition are sorted by timestamp automatically. Newest events always at the end.

# Insert a page view event
session.execute("""
  INSERT INTO analytics.page_views
    (site_id, ts, user_id, page, duration)
  VALUES (%s, %s, %s, %s, %s)
""", ('site_acme', datetime.now(), 'u_8821', '/pricing', 42))
-- Write acknowledged in 1.1ms
-- No locking, no transaction log coordination
-- Data written to memtable (RAM) first
-- Flushed to SSTable on disk in background

-- Read last 1 hour of events for site_acme:
site_id    | ts                      | user_id | page      | duration
-----------+-------------------------+---------+-----------+---------
site_acme  | 2024-01-15 14:00:01.221 | u_8821  | /pricing  | 42
site_acme  | 2024-01-15 14:00:01.887 | u_9002  | /home     | 8
site_acme  | 2024-01-15 14:00:02.103 | u_4421  | /checkout | 120
... (thousands more rows, all sorted by ts)

Why column-family wins for this workload:

memtable first: Cassandra writes to an in-memory buffer first, confirms immediately, then flushes to disk in the background. This is why it achieves 1 million writes/sec on a 10-node cluster. The write path never waits for disk.

Range queries on clustering column: Because events are sorted by ts within a partition, querying "last 1 hour" is a sequential scan of pre-sorted data — no index needed, no sorting at query time.
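Why a pre-sorted clustering column makes time-range reads cheap can be sketched with Python's bisect module: if events within a partition are kept in timestamp order, "everything after T" is one binary search plus a sequential slice. This mimics the idea, not Cassandra's on-disk SSTable format:

```python
import bisect

# One partition (site_acme): events kept sorted by timestamp on insert
partition = []  # list of (ts, page) tuples, always in ts order

def insert_event(ts, page):
    # insort keeps the partition sorted, like a clustering column
    bisect.insort(partition, (ts, page))

def events_since(ts):
    i = bisect.bisect_left(partition, (ts, ""))  # binary search to the start
    return partition[i:]                         # sequential scan, no sort at query time

insert_event(1000, "/home")
insert_event(3000, "/checkout")
insert_event(2000, "/pricing")

print(events_since(1500))  # [(2000, '/pricing'), (3000, '/checkout')]
```

The "last 1 hour" query from the scenario is exactly this shape: locate the window start once, then read forward through already-ordered data.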

Family 4 — Graph Databases

Graph databases store data as nodes (things) and edges (relationships between things). In a graph database, relationships are not computed at query time via JOINs — they are physically stored as connections. Traversing a relationship is as fast as a single pointer lookup.

Why SQL breaks on relationship queries:

-- SQL: "Find friends of friends of friends of User A"
SELECT DISTINCT u3.name FROM users u1
JOIN follows f1 ON f1.follower_id = u1.id
JOIN users u2 ON u2.id = f1.following_id
JOIN follows f2 ON f2.follower_id = u2.id
JOIN users u3 ON u3.id = f2.following_id
JOIN follows f3 ON f3.follower_id = u3.id
WHERE u1.id = 'user_A';
-- 3 levels deep = 3 JOINs. 6 levels = 6 JOINs. Each level multiplies the cost.

In Neo4j the same query is: MATCH (a:User {id: 'user_A'})-[:FOLLOWS*3]->(c) RETURN c.name — and it traverses millions of relationships in milliseconds regardless of depth.

The scenario: You're building a fraud detection system. You need to find all accounts connected to a suspicious account within 3 hops — shared devices, shared email domains, shared IP addresses. In SQL this is 6+ JOINs. In Neo4j it's natural:

// Neo4j Cypher — create nodes and relationships
// First: create the suspicious account node
CREATE (a:Account {id: 'acc_441', name: 'Suspicious Corp', risk: 'high'})

// Create a connected account
CREATE (b:Account {id: 'acc_882', name: 'Shell Ltd'})

// Create the relationship between them — SHARED_DEVICE
CREATE (a)-[:SHARED_DEVICE {device_id: 'dev_99x'}]->(b)
CREATE (a:Account {id: 'acc_441', ...})

(a) is a node variable. :Account is the label — like a type or category. The properties in curly braces are the node's data. Nodes can have multiple labels: (a:Account:HighRisk).

(a)-[:SHARED_DEVICE]->(b)

This creates a directed relationship from node a to node b of type SHARED_DEVICE. The relationship itself can have properties — here device_id. This relationship is stored as a physical pointer — traversing it is O(1), not a JOIN computation.

// Find all accounts within 3 hops of the suspicious account
MATCH (start:Account {id: 'acc_441'})-[*1..3]-(connected:Account)
RETURN connected.id, connected.name, connected.risk
connected.id  | connected.name   | connected.risk
--------------+------------------+---------------
acc_882       | Shell Ltd        | medium
acc_103       | FastPay Inc      | null
acc_771       | Offshore Brokers | high

Query time: 4ms
(Traversed 2.1M relationships to find these 3 accounts)

What just happened:

[*1..3] means traverse between 1 and 3 relationship hops in any direction. This single pattern replaces 6 JOIN operations in SQL. Adding more hops costs almost nothing — Neo4j follows pointers, it doesn't recompute joins.

4ms to traverse 2.1M relationships: This is the graph database's superpower. The same query in SQL on a well-indexed table with 2M rows would take several seconds — because every JOIN multiplies the row count being processed.
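The traversal itself is just pointer-chasing, which a breadth-first search over an adjacency list makes concrete. This toy sketch reuses the account IDs from the example output (the edge data itself is invented); note that the cost depends only on the edges actually visited, not on any join arithmetic:

```python
from collections import deque

# Adjacency list: each account maps to its directly connected accounts (toy data)
edges = {
    "acc_441": ["acc_882"],
    "acc_882": ["acc_441", "acc_103"],
    "acc_103": ["acc_882", "acc_771"],
    "acc_771": ["acc_103", "acc_999"],
    "acc_999": ["acc_771"],
}

def within_hops(start, max_hops):
    seen = {start: 0}            # node -> hop distance from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue             # don't expand beyond the hop limit
        for neighbour in edges.get(node, []):
            if neighbour not in seen:   # visit each node at most once
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    del seen[start]
    return seen

print(within_hops("acc_441", 3))  # {'acc_882': 1, 'acc_103': 2, 'acc_771': 3}
```

Raising max_hops just lets the frontier expand one more ring of pointers, which is why deeper queries in a graph database cost so little compared with stacking more JOINs.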

All Four Families — Complete Comparison

Family        | Data Model                    | Superpower                          | Weakness                                        | Best For
--------------+-------------------------------+-------------------------------------+-------------------------------------------------+------------------------------------------------
Key-Value     | key → value                   | Sub-ms reads, TTL, atomic ops       | No rich queries — must know the key             | Cache, sessions, rate limiting, leaderboards
Document      | JSON documents in collections | Flexible schema, nested queries     | No multi-doc transactions (before MongoDB 4.0)  | User profiles, catalogues, CMS, APIs
Column-Family | Rows + column groups          | Massive write throughput, time-series | Schema design is critical, no ad-hoc queries  | IoT, analytics, logs, time-series events
Graph         | Nodes + edges + properties    | Deep relationship traversal in ms   | Not suited for non-relationship data            | Social graphs, fraud detection, recommendations

Teacher's Note

The most common mistake I see in architecture reviews: engineers pick MongoDB for everything because it's the most familiar NoSQL database. MongoDB is excellent — but using it to store time-series sensor data (Cassandra's job) or social graph traversals (Neo4j's job) is like using a Swiss army knife to cut down a tree. It works, technically. But you'll feel it every day. Each family in this lesson has a natural habitat. Know the habitat before you pick the tool.

Practice Questions — You're the Engineer

Scenario:

Your API receives 12,000 requests per second. Every request must verify a user's authentication token. The token lookup needs to complete in under 1ms. Tokens expire after 1 hour automatically. You need zero complex queries — just "does this token exist and who does it belong to?" Which NoSQL family is the right choice?


Scenario:

Your bank's fraud team needs to identify money laundering rings. They need to find all accounts that share a device, email domain, or IP address within 4 hops of a flagged account — in real time as transactions happen. The relationships between accounts are the data. Which NoSQL family handles this natively?


Scenario:

You are building a fleet management platform. 80,000 delivery trucks each send GPS coordinates, speed, and fuel level every 5 seconds — that's 240,000 writes per second, sustained. Queries are always time-range based: "give me all readings for truck_441 between 9am and 11am today." Which NoSQL family was specifically designed for this pattern?


Quiz — Pick the Right Family

Scenario:

You're building a property listing platform. Each listing is different: apartments have floor numbers and lift access, houses have garden sizes and parking, commercial units have planning permissions and zoning codes. New property types are added regularly. Queries search by location, price range, and type-specific attributes. Which NoSQL family fits best?

Scenario:

A smart energy company monitors 500,000 electricity meters. Each meter sends a reading every 10 seconds — that's 50,000 writes per second. Billing queries need all readings for a specific meter between two dates. Older readings are rarely accessed. Which family handles this best?

Scenario:

A music streaming platform wants to build "Users like you also listened to..." recommendations. The algorithm works by finding users with similar listening patterns, then surfacing songs those users liked that the current user hasn't heard. This requires traversing user-song-user relationships across millions of nodes in real time. Which family?

Up Next · Lesson 8

Schema-less Design

What "no schema" actually means in production, why it's both a superpower and a trap, and how to design schema-less data that doesn't become unmaintainable chaos six months later.