NoSQL
NoSQL Use Cases
In 2012, a small team at LinkedIn was drowning. Their MySQL cluster was handling member activity feeds — who connected with whom, who viewed whose profile, who liked what. With 200 million members, every feed load was a multi-table JOIN across billions of rows. Page loads hit 8 seconds. They rewrote the feed system in Espresso, LinkedIn's document store. Load time dropped to 200ms. This lesson is about recognising the exact moment a use case belongs to NoSQL — before you hit that wall, not after.
How to Think About Use Cases
Before looking at specific scenarios, there's a mental model that makes every use case decision clear. Ask these three questions in order:
What does the data look like?
Rows and columns with fixed structure → SQL. JSON-like documents, graphs, key-value pairs, time-series events → NoSQL. The shape of your data almost always points to the right database family.
How will it be read?
Always by a single key → key-value. By content and filters → document. By relationship traversal → graph. By time range on a specific entity → column-family. Design your data model around your most frequent query pattern.
What are the consequences of stale or inconsistent data?
Financial loss, medical harm, inventory mismatch → strong consistency, probably SQL. Slightly stale view count, delayed feed update → eventual consistency, NoSQL handles it perfectly.
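The three questions can be sketched as a toy triage function. This is purely illustrative — the category labels are invented for this sketch, and real decisions weigh far more than three inputs:

```python
def suggest_family(shape, read_pattern, staleness_cost):
    """Toy triage mirroring the lesson's three questions (illustrative only)."""
    # Q3: if stale data causes real harm, favour strong consistency
    if staleness_cost == "high":
        return "SQL"
    # Q1: fixed rows and columns point at the relational model
    if shape == "tabular":
        return "SQL"
    # Q2: otherwise, the dominant read pattern picks the NoSQL family
    return {
        "by-key": "key-value (Redis)",
        "by-content": "document (MongoDB)",
        "by-relationship": "graph (Neo4j)",
        "by-time-range": "column-family (Cassandra)",
    }.get(read_pattern, "document (MongoDB)")
```

Note the order: consistency requirements trump everything, then data shape, then read pattern — the same precedence the questions above imply.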
Use Case 1 — Real-Time Session and Cache Management
Every web application manages sessions. Every high-traffic app needs caching. These two requirements share the same profile: fast key lookup, optional expiry, no complex queries. This is key-value's natural home.
Why NoSQL wins here
- Sub-millisecond reads from RAM
- Built-in TTL — sessions auto-expire
- Atomic operations (INCR, SETNX)
- No JOIN ever needed — just fetch by key
Why SQL struggles here
- Disk reads on cache miss: 4–12ms
- Expiry requires scheduled cleanup jobs
- High-frequency writes stress the WAL
- No native in-memory storage tier
The scenario: You're building a rate limiter for a public API. Each IP address gets 100 requests per minute. You need to count requests per IP, enforce the limit, and auto-reset the counter every 60 seconds. Redis handles this with two commands:
```python
import redis

r = redis.Redis(host='localhost', port=6379)

def check_rate_limit(ip_address):
    key = f"rate:{ip_address}"   # key namespaced by IP
    count = r.incr(key)          # atomically increment counter
    if count == 1:               # first request — set 60s window
        r.expire(key, 60)
    return count <= 100          # True = allowed, False = blocked
```
r.incr(key)
Atomically increments the counter by 1 and returns the new value. If the key doesn't exist yet, Redis creates it with value 0 first, then increments to 1. The atomic nature means two simultaneous requests can never both read 99 and both write 100 — there's no race condition.
r.expire(key, 60)
Sets a 60-second TTL on this key. After 60 seconds Redis automatically deletes it. The next request for this IP starts fresh at 0. No cron job, no cleanup script, no scheduled task. Redis handles it natively.
```python
# Simulate 3 requests from same IP
for i in range(3):
    allowed = check_rate_limit("203.0.113.5")
    print(f"Request {i+1}: {'✓ allowed' if allowed else '✗ blocked'}")
```

```
Request 1: ✓ allowed   (count=1, key expires in 60s)
Request 2: ✓ allowed   (count=2)
Request 3: ✓ allowed   (count=3)

-- After 101 requests from same IP in 60 seconds:
Request 101: ✗ blocked (count=101, limit exceeded)

-- After 60s window resets:
Request 1: ✓ allowed   (key deleted, fresh start)
```
Why this can't be replicated cleanly in SQL: You'd need a request_counts table, a scheduled job to DELETE expired rows, and a transaction to prevent race conditions on the count. That's 3 pieces of infrastructure. Redis does it in 4 lines with no maintenance overhead.
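To make concrete what Redis's TTL and atomic INCR replace, here is a minimal single-process sketch of the same fixed-window counter in plain Python. The class and its names are invented for illustration — unlike Redis, it is not safe across processes, and the window reset that TTL gives you for free has to be tracked by hand:

```python
import time

class FixedWindowLimiter:
    """Single-process sketch of the Redis pattern (illustrative only)."""

    def __init__(self, limit=100, window=60):
        self.limit = limit
        self.window = window
        self.counters = {}  # ip -> (count, window_start)

    def check(self, ip, now=None):
        now = time.time() if now is None else now
        count, start = self.counters.get(ip, (0, now))
        if now - start >= self.window:
            count, start = 0, now   # window elapsed: reset (Redis does this via TTL)
        count += 1
        self.counters[ip] = (count, start)
        return count <= self.limit
```

With `limit=3`, the fourth request inside one window is rejected and the counter resets once the window elapses — the behaviour the two Redis commands provide natively, across every app server at once.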
Use Case 2 — Content Management and User Profiles
User profiles, product listings, blog posts, job listings — all of these share the same characteristic: every record has a different set of attributes, the schema evolves constantly, and the data is read and written as a whole unit. Document stores are the natural fit.
The scenario: You're building a job board. A software engineering job has tech stack, remote options, and salary range. A design role has portfolio requirements and tool preferences. A sales role has territory and quota details. Each job type has completely different attributes. Here's how MongoDB handles all of them in one collection without a single NULL:
```javascript
// Insert two completely different job types — same collection
db.jobs.insertMany([
  {
    title: "Senior Backend Engineer",
    company: "FinTech Corp",
    type: "engineering",
    salary: { min: 90000, max: 130000, currency: "GBP" },
    stack: ["Python", "Kafka", "PostgreSQL"],
    remote: "hybrid",
    visa_sponsored: true
  },
  {
    title: "Head of Growth",
    company: "FinTech Corp",
    type: "sales",
    salary: { base: 75000, ote: 120000, currency: "GBP" },
    territory: ["UK", "Ireland"],
    quota_annual: 2400000,
    crm_tools: ["Salesforce", "HubSpot"]
  }
])
```
salary: { min: 90000, max: 130000 } vs salary: { base: 75000, ote: 120000 }
The salary field has a completely different structure for each job type. In SQL, you'd need separate columns for every possible salary component — salary_min, salary_max, salary_base, salary_ote — most of which would be NULL for any given job. MongoDB stores exactly what each document needs.
stack: ["Python", "Kafka", "PostgreSQL"]
Arrays stored natively. In SQL this would be a separate job_skills table. Here it's one field. And it's queryable: db.jobs.find({"{"}"stack": "Python"{"}"}) finds all Python jobs instantly.
```javascript
// Query: find all hybrid engineering roles paying over £100k
db.jobs.find({
  type: "engineering",
  remote: "hybrid",
  "salary.max": { $gte: 100000 }
})
```

```
[
  {
    title: "Senior Backend Engineer",
    company: "FinTech Corp",
    salary: { min: 90000, max: 130000, currency: "GBP" },
    stack: [ "Python", "Kafka", "PostgreSQL" ],
    remote: "hybrid"
  }
]

-- Query time: 2ms
-- Dot notation queries into nested objects natively
```

"salary.max": { $gte: 100000 } — dot notation lets you filter on nested fields as if they were top-level. MongoDB traverses into the salary object and compares max against the threshold. No subquery. No JOIN to a salary table. One natural query.
Use Case 3 — Time-Series and Event Streaming
IoT sensor data, application logs, user activity events, financial tick data — all of these are time-series: a continuous stream of timestamped events. They share three traits: enormous write volume, always read by time range, and old data rarely updates. Column-family databases were built for exactly this pattern.
The scenario: You're building a monitoring platform for a cloud provider. 200,000 servers each emit a CPU metric every 10 seconds — that's 20,000 writes per second, sustained. Queries are always: "give me all metrics for server X between time A and time B." Here's the Cassandra schema and query:
```python
# Cassandra table — designed around the query pattern
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.server_cpu (
        server_id   TEXT,       -- partition key: all data for one server together
        recorded_at TIMESTAMP,  -- clustering column: sorted by time within partition
        cpu_pct     FLOAT,      -- the metric value
        region      TEXT,       -- metadata
        PRIMARY KEY (server_id, recorded_at)
    ) WITH CLUSTERING ORDER BY (recorded_at DESC)
""")
```
PRIMARY KEY (server_id, recorded_at)
server_id is the partition key — all metrics for one server land on the same node. This is intentional: the most common query is "all metrics for server X" — and now it's a single-node read, no cross-cluster coordination needed.
CLUSTERING ORDER BY (recorded_at DESC)
Data is physically stored in descending timestamp order on disk. The most recent metrics are at the top of the partition. "Latest 100 readings" is a sequential read from the start of the file — no sorting at query time, no index scan needed.
```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
one_hour_ago = now - timedelta(hours=1)

# Query: last 1 hour of CPU metrics for a specific server
rows = session.execute("""
    SELECT recorded_at, cpu_pct FROM metrics.server_cpu
    WHERE server_id = 'srv-eu-042'
      AND recorded_at >= %s
      AND recorded_at <= %s
""", (one_hour_ago, now))
```
```
recorded_at              | cpu_pct
-------------------------+---------
2024-01-15 15:59:50.000  | 78.4
2024-01-15 15:59:40.000  | 76.1
2024-01-15 15:59:30.000  | 74.8
... (357 more rows — one per 10 seconds)

Query time: 4ms
Rows scanned: 360 (sequential, pre-sorted — no index needed)
Writes/sec sustained: 20,000+ with no degradation
```
4ms for 360 rows: Because the data is pre-sorted by time and co-located by server, this is a sequential read of contiguous disk blocks. The most efficient possible query pattern.
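The "pre-sorted, contiguous" property can be sketched in plain Python: model a partition as a timestamp-sorted list, and a time-range query becomes two binary searches plus one slice. (Ascending order here for simplicity; Cassandra's DESC clustering is the same idea reversed. This is a toy, not how an LSM-tree actually stores data.)

```python
from bisect import bisect_left, bisect_right

# A toy "partition": one hour of readings for a single server,
# kept sorted by timestamp — (seconds, cpu_pct), 360 rows.
rows = [(t, 70.0) for t in range(0, 3600, 10)]

def time_range(rows, start, end):
    """Range read = two binary searches + one contiguous slice.

    No sorting at query time, no per-row filtering — which is why
    the 360-row query above is a sequential read.
    """
    lo = bisect_left(rows, (start, float("-inf")))
    hi = bisect_right(rows, (end, float("inf")))
    return rows[lo:hi]
```

Every row between the two boundaries is physically adjacent, so the database streams one contiguous run of disk blocks rather than chasing an index entry per row.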
SQL equivalent at this scale: PostgreSQL would need a composite index on (server_id, recorded_at), and at 20,000 inserts/sec the index maintenance overhead would start degrading write performance within hours. Cassandra's LSM-tree storage handles this write pattern natively.
Use Case 4 — Social Graphs and Recommendation Engines
When your data is fundamentally about relationships — who knows whom, what connects to what, which items are similar — graph databases stop being a nice-to-have and become the only sensible choice.
The scenario: You're building a "People You May Know" feature for a professional network. The algorithm: find all people who are connected to your connections but not yet connected to you — second-degree connections. In SQL this is a self-join nightmare. In Neo4j it's one pattern match:
```cypher
// Neo4j Cypher — "People You May Know"
MATCH (me:Person {id: 'u_441'})-[:CONNECTED_TO]->(friend)
      -[:CONNECTED_TO]->(suggestion)
WHERE NOT (me)-[:CONNECTED_TO]-(suggestion)   // not already connected
  AND suggestion.id <> 'u_441'                // not yourself
RETURN suggestion.name, suggestion.title,
       COUNT(friend) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 10
```
(me)-[:CONNECTED_TO]->(friend)-[:CONNECTED_TO]->(suggestion)
This is a path pattern — traverse from me to my friends, then from those friends to their friends. Each arrow is a pointer dereference, not a JOIN computation. At 10 million users, this still completes in under 10ms because graph traversal cost scales with the depth of traversal, not the total number of nodes.
COUNT(friend) AS mutual_connections
For each suggested person, count how many mutual connections you share. Results are ranked by this — people with more mutuals appear first. LinkedIn, Facebook, and Twitter all use exactly this pattern on their graph databases.
```
suggestion.name     | suggestion.title          | mutual_connections
--------------------+---------------------------+-------------------
Sara Osei           | CTO at CloudBase          | 12
James Park          | Senior Engineer at AWS    | 9
Lin Wei             | Product Lead at Stripe    | 7
... (7 more suggestions)

Query time: 8ms (traversed 340,000 relationships)
SQL equivalent with JOINs: 4,200ms on same dataset
```
8ms vs 4,200ms: That's a 525x performance difference on the same data. The SQL version requires a self-join on the connections table, filtered by exclusion, grouped, and sorted. Each step multiplies the row count being processed. Neo4j follows pointers — it doesn't compute.
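What the traversal computes can be sketched over a plain adjacency dict — an in-memory toy, not how Neo4j stores data, but the same two-hop shape:

```python
from collections import Counter

def people_you_may_know(graph, me, limit=10):
    """Second-degree suggestions ranked by mutual connections (toy sketch)."""
    friends = graph.get(me, set())
    mutuals = Counter()
    for friend in friends:                    # hop 1: me -> friend
        for fof in graph.get(friend, set()):  # hop 2: friend -> friend-of-friend
            if fof != me and fof not in friends:
                mutuals[fof] += 1             # one shared connection found
    return mutuals.most_common(limit)
```

The cost is proportional to the edges actually touched in those two hops — not to the total size of the graph, which is the scaling property the Cypher query exploits.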
Use Case 5 — Real-Time Leaderboards and Counters
The scenario: You're running a live coding competition. 50,000 participants. Scores update constantly. Users check the leaderboard every few seconds. The leaderboard must show the top 100 in real time. Redis Sorted Sets make this trivially simple:
```python
import redis

r = redis.Redis(host='localhost', port=6379)

# Update a player's score — ZADD upserts score in sorted set
r.zadd('leaderboard:contest_42', {'player:u_8821': 9420})
r.zadd('leaderboard:contest_42', {'player:u_4401': 8875})
r.zadd('leaderboard:contest_42', {'player:u_1192': 9910})
```
ZADD — adds a member to a Sorted Set with a score. Redis keeps the set sorted by score automatically using a skip list data structure. Updating a score (upsert) is O(log N) — logarithmic, not linear. At 50,000 players, an update takes a few microseconds.
```python
# Get top 3 with scores — ZREVRANGE returns highest scores first
top3 = r.zrevrange('leaderboard:contest_42', 0, 2, withscores=True)
for rank, (player, score) in enumerate(top3, 1):
    print(f"#{rank} {player.decode()} — {int(score)} pts")
```

```
#1 player:u_1192 — 9910 pts
#2 player:u_8821 — 9420 pts
#3 player:u_4401 — 8875 pts

-- ZREVRANGE: O(log N + M) where M = number of results
-- 50,000 player leaderboard, top 100 query: 0.4ms
-- SQL ORDER BY score DESC LIMIT 100: requires full index scan = 12ms+
```
ZREVRANGE 0 2 — returns indexes 0 to 2 (top 3) in reverse order (highest first). The sorted set is already ordered — there's no sorting happening at query time. The answer is pre-computed by ZADD.
SQL alternative: You'd run SELECT player_id, score FROM scores ORDER BY score DESC LIMIT 100. Without an index on score, every refresh sorts all 50,000 rows; with one, the constant stream of score updates pays B-tree maintenance on every write, and hot-row contention grows with traffic. Redis reads 100 pre-sorted values straight from memory.
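The Sorted Set semantics can be approximated in plain Python. Note the trade-off the sketch makes visible: the dict gives ZADD-style upserts, but the top-N query here scans all members (roughly O(N log M) via heapq), whereas Redis's skip list answers it in O(log N + M):

```python
import heapq

scores = {}  # member -> score, mimicking ZADD's upsert semantics

def zadd_sketch(member, score):
    """Insert or overwrite a member's score, like ZADD (toy sketch)."""
    scores[member] = score

def top_n(n):
    """Top-N by score, highest first — O(N log n) here, unlike Redis."""
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])
```

Upserting a member simply moves it to its new rank on the next query — the same observable behaviour as ZADD followed by ZREVRANGE, minus the pre-sorted storage that makes Redis fast.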
The Cases Where NoSQL Is the Wrong Choice
Knowing when NOT to use NoSQL is just as important. These scenarios reliably cause pain when engineers choose NoSQL against the grain of their data:
| Scenario | Why NoSQL Struggles | Right Tool |
|---|---|---|
| Bank transfers and payroll | Multi-entity ACID transactions across accounts. Eventual consistency = money disappears. | PostgreSQL |
| Complex reporting and BI | Multi-collection aggregations, window functions, GROUP BY across millions of documents — no native JOIN. | PostgreSQL / Redshift |
| ERP and inventory management | Deep referential integrity — orders reference products, products reference warehouses. Foreign key constraints essential. | MySQL / PostgreSQL |
| Healthcare records (HIPAA) | Regulatory compliance requires strict schema, full audit trails, and ACID guarantees on every write. | PostgreSQL / Oracle |
The Complete Use Case Map
🔑 Key-Value (Redis) — sessions, caching, rate limiting, real-time leaderboards and counters
📄 Document (MongoDB) — user profiles, content management, product and job listings
📊 Column-Family (Cassandra) — time-series metrics, logs, high-volume event streams
🕸️ Graph (Neo4j) — social graphs, "People You May Know", recommendation engines
Teacher's Note
Every use case in this lesson follows the same pattern: the data had a natural shape, and the right database was the one that stored and retrieved that shape most efficiently. The engineers who struggle with database choices are the ones who start with a database and then try to fit their data into it. Start with your data — its shape, its volume, its query pattern — and the database chooses itself. That's the skill this entire section of the course has been building toward.
Up Next · Lesson 11
Key-Value Databases
A deep dive into how key-value stores work under the hood — hash tables, in-memory architecture, persistence strategies, and the Redis data structures that power half the internet's infrastructure.