NoSQL
NoSQL Use Cases
In 2012, a small team at LinkedIn was drowning. Their MySQL cluster was handling member activity feeds — who connected with whom, who viewed whose profile, who liked what. With 200 million members, every feed load was a multi-table JOIN across billions of rows. Page loads hit 8 seconds. They rewrote the feed system in Espresso, LinkedIn's document store. Load time dropped to 200ms. This lesson is about recognising the exact moment a use case belongs to NoSQL — before you hit that wall, not after.
How to Think About Use Cases
Before looking at specific scenarios, there's a mental model that makes every use case decision clear. Ask these three questions in order:
What does the data look like?
Rows and columns with fixed structure → SQL. JSON-like documents, graphs, key-value pairs, time-series events → NoSQL. The shape of your data almost always points to the right database family.
How will it be read?
Always by a single key → key-value. By content and filters → document. By relationship traversal → graph. By time range on a specific entity → column-family. Design your data model around your most frequent query pattern.
What are the consequences of stale or inconsistent data?
Financial loss, medical harm, inventory mismatch → strong consistency, probably SQL. Slightly stale view count, delayed feed update → eventual consistency, NoSQL handles it perfectly.
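The three questions can be sketched as a toy triage function. This is purely illustrative — the category labels are invented for this sketch, and real decisions weigh far more than three inputs:

```python
def suggest_family(shape, read_pattern, staleness_cost):
    """Toy triage mirroring the lesson's three questions (illustrative only)."""
    # Q3: if stale data causes real harm, favour strong consistency
    if staleness_cost == "high":
        return "SQL"
    # Q1: fixed rows and columns point at the relational model
    if shape == "tabular":
        return "SQL"
    # Q2: otherwise, the dominant read pattern picks the NoSQL family
    return {
        "by-key": "key-value (Redis)",
        "by-content": "document (MongoDB)",
        "by-relationship": "graph (Neo4j)",
        "by-time-range": "column-family (Cassandra)",
    }.get(read_pattern, "document (MongoDB)")
```

Note the order: consistency requirements trump everything, then data shape, then read pattern — the same precedence the questions above imply.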
Use Case 1 — Real-Time Session and Cache Management
Every web application manages sessions. Every high-traffic app needs caching. These two requirements share the same profile: fast key lookup, optional expiry, no complex queries. This is key-value's natural home.
Why NoSQL wins here
- Sub-millisecond reads from RAM
- Built-in TTL — sessions auto-expire
- Atomic operations (INCR, SETNX)
- No JOIN ever needed — just fetch by key
Why SQL struggles here
- Disk reads on cache miss: 4–12ms
- Expiry requires scheduled cleanup jobs
- High-frequency writes stress the WAL
- No native in-memory storage tier
The scenario: You're building a rate limiter for a public API. Each IP address gets 100 requests per minute. You need to count requests per IP, enforce the limit, and auto-reset the counter every 60 seconds. Redis handles this with two commands:
```python
import redis

r = redis.Redis(host='localhost', port=6379)

def check_rate_limit(ip_address):
    key = f"rate:{ip_address}"   # key namespaced by IP
    count = r.incr(key)          # atomically increment counter
    if count == 1:               # first request — set 60s window
        r.expire(key, 60)
    return count <= 100          # True = allowed, False = blocked
```
r.incr(key)
Atomically increments the counter by 1 and returns the new value. If the key doesn't exist yet, Redis creates it with value 0 first, then increments to 1. The atomic nature means two simultaneous requests can never both read 99 and both write 100 — there's no race condition.
r.expire(key, 60)
Sets a 60-second TTL on this key. After 60 seconds Redis automatically deletes it. The next request for this IP starts fresh at 0. No cron job, no cleanup script, no scheduled task. Redis handles it natively.
```python
# Simulate 3 requests from same IP
for i in range(3):
    allowed = check_rate_limit("203.0.113.5")
    print(f"Request {i+1}: {'✓ allowed' if allowed else '✗ blocked'}")
```

```
Request 1: ✓ allowed   (count=1, key expires in 60s)
Request 2: ✓ allowed   (count=2)
Request 3: ✓ allowed   (count=3)

-- After 101 requests from same IP in 60 seconds:
Request 101: ✗ blocked (count=101, limit exceeded)

-- After 60s window resets:
Request 1: ✓ allowed   (key deleted, fresh start)
```
Why this can't be replicated cleanly in SQL: You'd need a request_counts table, a scheduled job to DELETE expired rows, and a transaction to prevent race conditions on the count. That's 3 pieces of infrastructure. Redis does it in 4 lines with no maintenance overhead.
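To make concrete what Redis's TTL and atomic INCR replace, here is a minimal single-process sketch of the same fixed-window counter in plain Python. The class and its names are invented for illustration — unlike Redis, it is not safe across processes, and the window reset that TTL gives you for free has to be tracked by hand:

```python
import time

class FixedWindowLimiter:
    """Single-process sketch of the Redis pattern (illustrative only)."""

    def __init__(self, limit=100, window=60):
        self.limit = limit
        self.window = window
        self.counters = {}  # ip -> (count, window_start)

    def check(self, ip, now=None):
        now = time.time() if now is None else now
        count, start = self.counters.get(ip, (0, now))
        if now - start >= self.window:
            count, start = 0, now   # window elapsed: reset (Redis does this via TTL)
        count += 1
        self.counters[ip] = (count, start)
        return count <= self.limit
```

With `limit=3`, the fourth request inside one window is rejected and the counter resets once the window elapses — the behaviour the two Redis commands provide natively, across every app server at once.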
Use Case 2 — Content Management and User Profiles
User profiles, product listings, blog posts, job listings — all of these share the same characteristic: every record has a different set of attributes, the schema evolves constantly, and the data is read and written as a whole unit. Document stores are the natural fit.
The scenario: You're building a job board. A software engineering job has tech stack, remote options, and salary range. A design role has portfolio requirements and tool preferences. A sales role has territory and quota details. Each job type has completely different attributes. Here's how MongoDB handles all of them in one collection without a single NULL:
```javascript
// Insert two completely different job types — same collection
db.jobs.insertMany([
  {
    title: "Senior Backend Engineer",
    company: "FinTech Corp",
    type: "engineering",
    salary: { min: 90000, max: 130000, currency: "GBP" },
    stack: ["Python", "Kafka", "PostgreSQL"],
    remote: "hybrid",
    visa_sponsored: true
  },
  {
    title: "Head of Growth",
    company: "FinTech Corp",
    type: "sales",
    salary: { base: 75000, ote: 120000, currency: "GBP" },
    territory: ["UK", "Ireland"],
    quota_annual: 2400000,
    crm_tools: ["Salesforce", "HubSpot"]
  }
])
```
salary: { min: 90000, max: 130000 } vs salary: { base: 75000, ote: 120000 }
The salary field has a completely different structure for each job type. In SQL, you'd need separate columns for every possible salary component — salary_min, salary_max, salary_base, salary_ote — most of which would be NULL for any given job. MongoDB stores exactly what each document needs.
stack: ["Python", "Kafka", "PostgreSQL"]
Arrays stored natively. In SQL this would be a separate job_skills table. Here it's one field. And it's queryable: db.jobs.find({"{"}"stack": "Python"{"}"}) finds all Python jobs instantly.
```javascript
// Query: find all hybrid engineering roles paying over £100k
db.jobs.find({
  type: "engineering",
  remote: "hybrid",
  "salary.max": { $gte: 100000 }
})
```

```
[
  {
    title: "Senior Backend Engineer",
    company: "FinTech Corp",
    salary: { min: 90000, max: 130000, currency: "GBP" },
    stack: [ "Python", "Kafka", "PostgreSQL" ],
    remote: "hybrid"
  }
]

-- Query time: 2ms
-- Dot notation queries into nested objects natively
```

"salary.max": { $gte: 100000 } — dot notation lets you filter on nested fields as if they were top-level. MongoDB traverses into the salary object and compares max against the threshold. No subquery. No JOIN to a salary table. One natural query.
Use Case 3 — Time-Series and Event Streaming
IoT sensor data, application logs, user activity events, financial tick data — all of these are time-series: a continuous stream of timestamped events. They share three traits: enormous write volume, always read by time range, and old data rarely updates. Column-family databases were built for exactly this pattern.
The scenario: You're building a monitoring platform for a cloud provider. 200,000 servers each emit a CPU metric every 10 seconds — that's 20,000 writes per second, sustained. Queries are always: "give me all metrics for server X between time A and time B." Here's the Cassandra schema and query:
```python
# Cassandra table — designed around the query pattern
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.server_cpu (
        server_id   TEXT,       -- partition key: all data for one server together
        recorded_at TIMESTAMP,  -- clustering column: sorted by time within partition
        cpu_pct     FLOAT,      -- the metric value
        region      TEXT,       -- metadata
        PRIMARY KEY (server_id, recorded_at)
    ) WITH CLUSTERING ORDER BY (recorded_at DESC)
""")
```
PRIMARY KEY (server_id, recorded_at)
server_id is the partition key — all metrics for one server land on the same node. This is intentional: the most common query is "all metrics for server X" — and now it's a single-node read, no cross-cluster coordination needed.
CLUSTERING ORDER BY (recorded_at DESC)
Data is physically stored in descending timestamp order on disk. The most recent metrics are at the top of the partition. "Latest 100 readings" is a sequential read from the start of the file — no sorting at query time, no index scan needed.
```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
one_hour_ago = now - timedelta(hours=1)

# Query: last 1 hour of CPU metrics for a specific server
rows = session.execute("""
    SELECT recorded_at, cpu_pct FROM metrics.server_cpu
    WHERE server_id = 'srv-eu-042'
      AND recorded_at >= %s
      AND recorded_at <= %s
""", (one_hour_ago, now))
```
```
recorded_at              | cpu_pct
-------------------------+---------
2024-01-15 15:59:50.000  | 78.4
2024-01-15 15:59:40.000  | 76.1
2024-01-15 15:59:30.000  | 74.8
... (357 more rows — one per 10 seconds)

Query time: 4ms
Rows scanned: 360 (sequential, pre-sorted — no index needed)
Writes/sec sustained: 20,000+ with no degradation
```
4ms for 360 rows: Because the data is pre-sorted by time and co-located by server, this is a sequential read of contiguous disk blocks. The most efficient possible query pattern.
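The "pre-sorted, contiguous" property can be sketched in plain Python: model a partition as a timestamp-sorted list, and a time-range query becomes two binary searches plus one slice. (Ascending order here for simplicity; Cassandra's DESC clustering is the same idea reversed. This is a toy, not how an LSM-tree actually stores data.)

```python
from bisect import bisect_left, bisect_right

# A toy "partition": one hour of readings for a single server,
# kept sorted by timestamp — (seconds, cpu_pct), 360 rows.
rows = [(t, 70.0) for t in range(0, 3600, 10)]

def time_range(rows, start, end):
    """Range read = two binary searches + one contiguous slice.

    No sorting at query time, no per-row filtering — which is why
    the 360-row query above is a sequential read.
    """
    lo = bisect_left(rows, (start, float("-inf")))
    hi = bisect_right(rows, (end, float("inf")))
    return rows[lo:hi]
```

Every row between the two boundaries is physically adjacent, so the database streams one contiguous run of disk blocks rather than chasing an index entry per row.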
SQL equivalent at this scale: PostgreSQL would need a composite index on (server_id, recorded_at), and at 20,000 inserts/sec the index maintenance overhead would start degrading write performance within hours. Cassandra's LSM-tree storage handles this write pattern natively.
Use Case 4 — Social Graphs and Recommendation Engines
When your data is fundamentally about relationships — who knows whom, what connects to what, which items are similar — graph databases stop being a nice-to-have and become the only sensible choice.
The scenario: You're building a "People You May Know" feature for a professional network. The algorithm: find all people who are connected to your connections but not yet connected to you — second-degree connections. In SQL this is a self-join nightmare. In Neo4j it's one pattern match:
```cypher
// Neo4j Cypher — "People You May Know"
MATCH (me:Person {id: 'u_441'})-[:CONNECTED_TO]->(friend)
      -[:CONNECTED_TO]->(suggestion)
WHERE NOT (me)-[:CONNECTED_TO]-(suggestion)   // not already connected
  AND suggestion.id <> 'u_441'                // not yourself
RETURN suggestion.name, suggestion.title,
       COUNT(friend) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 10
```
(me)-[:CONNECTED_TO]->(friend)-[:CONNECTED_TO]->(suggestion)
This is a path pattern — traverse from me to my friends, then from those friends to their friends. Each arrow is a pointer dereference, not a JOIN computation. At 10 million users, this still completes in under 10ms because graph traversal cost scales with the depth of traversal, not the total number of nodes.
COUNT(friend) AS mutual_connections
For each suggested person, count how many mutual connections you share. Results are ranked by this — people with more mutuals appear first. LinkedIn, Facebook, and Twitter all use exactly this pattern on their graph databases.
```
suggestion.name     | suggestion.title          | mutual_connections
--------------------+---------------------------+-------------------
Sara Osei           | CTO at CloudBase          | 12
James Park          | Senior Engineer at AWS    | 9
Lin Wei             | Product Lead at Stripe    | 7
... (7 more suggestions)

Query time: 8ms (traversed 340,000 relationships)
SQL equivalent with JOINs: 4,200ms on same dataset
```
8ms vs 4,200ms: That's a 525x performance difference on the same data. The SQL version requires a self-join on the connections table, filtered by exclusion, grouped, and sorted. Each step multiplies the row count being processed. Neo4j follows pointers — it doesn't compute.
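What the traversal computes can be sketched over a plain adjacency dict — an in-memory toy, not how Neo4j stores data, but the same two-hop shape:

```python
from collections import Counter

def people_you_may_know(graph, me, limit=10):
    """Second-degree suggestions ranked by mutual connections (toy sketch)."""
    friends = graph.get(me, set())
    mutuals = Counter()
    for friend in friends:                    # hop 1: me -> friend
        for fof in graph.get(friend, set()):  # hop 2: friend -> friend-of-friend
            if fof != me and fof not in friends:
                mutuals[fof] += 1             # one shared connection found
    return mutuals.most_common(limit)
```

The cost is proportional to the edges actually touched in those two hops — not to the total size of the graph, which is the scaling property the Cypher query exploits.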
Use Case 5 — Real-Time Leaderboards and Counters
The scenario: You're running a live coding competition. 50,000 participants. Scores update constantly. Users check the leaderboard every few seconds. The leaderboard must show the top 100 in real time. Redis Sorted Sets make this trivially simple:
```python
import redis

r = redis.Redis(host='localhost', port=6379)

# Update a player's score — ZADD upserts score in sorted set
r.zadd('leaderboard:contest_42', {'player:u_8821': 9420})
r.zadd('leaderboard:contest_42', {'player:u_4401': 8875})
r.zadd('leaderboard:contest_42', {'player:u_1192': 9910})
```
ZADD — adds a member to a Sorted Set with a score. Redis keeps the set sorted by score automatically using a skip list data structure. Updating a score (upsert) is O(log N) — logarithmic, not linear. At 50,000 players, an update takes a few microseconds.
```python
# Get top 3 with scores — ZREVRANGE returns highest scores first
top3 = r.zrevrange('leaderboard:contest_42', 0, 2, withscores=True)
for rank, (player, score) in enumerate(top3, 1):
    print(f"#{rank} {player.decode()} — {int(score)} pts")
```

```
#1 player:u_1192 — 9910 pts
#2 player:u_8821 — 9420 pts
#3 player:u_4401 — 8875 pts

-- ZREVRANGE: O(log N + M) where M = number of results
-- 50,000 player leaderboard, top 100 query: 0.4ms
-- SQL ORDER BY score DESC LIMIT 100: requires full index scan = 12ms+
```
ZREVRANGE 0 2 — returns indexes 0 to 2 (top 3) in reverse order (highest first). The sorted set is already ordered — there's no sorting happening at query time. The answer is pre-computed by ZADD.
SQL alternative: You'd run SELECT player_id, score FROM scores ORDER BY score DESC LIMIT 100. Without an index on score, every refresh sorts all 50,000 rows; with one, the constant stream of score updates pays B-tree maintenance on every write, and hot-row contention grows with traffic. Redis reads 100 pre-sorted values straight from memory.
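The Sorted Set semantics can be approximated in plain Python. Note the trade-off the sketch makes visible: the dict gives ZADD-style upserts, but the top-N query here scans all members (roughly O(N log M) via heapq), whereas Redis's skip list answers it in O(log N + M):

```python
import heapq

scores = {}  # member -> score, mimicking ZADD's upsert semantics

def zadd_sketch(member, score):
    """Insert or overwrite a member's score, like ZADD (toy sketch)."""
    scores[member] = score

def top_n(n):
    """Top-N by score, highest first — O(N log n) here, unlike Redis."""
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])
```

Upserting a member simply moves it to its new rank on the next query — the same observable behaviour as ZADD followed by ZREVRANGE, minus the pre-sorted storage that makes Redis fast.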
The Cases Where NoSQL Is the Wrong Choice
Knowing when NOT to use NoSQL is just as important. These scenarios reliably cause pain when engineers choose NoSQL against the grain of their data:
| Scenario | Why NoSQL Struggles | Right Tool |
|---|---|---|
| Bank transfers and payroll | Multi-entity ACID transactions across accounts. Eventual consistency = money disappears. | PostgreSQL |
| Complex reporting and BI | Multi-collection aggregations, window functions, GROUP BY across millions of documents — no native JOIN. | PostgreSQL / Redshift |
| ERP and inventory management | Deep referential integrity — orders reference products, products reference warehouses. Foreign key constraints essential. | MySQL / PostgreSQL |
| Healthcare records (HIPAA) | Regulatory compliance requires strict schema, full audit trails, and ACID guarantees on every write. | PostgreSQL / Oracle |
The Complete Use Case Map
🔑 Key-Value (Redis) — sessions, caching, rate limiting, real-time leaderboards and counters
📄 Document (MongoDB) — user profiles, content management, product and job listings
📊 Column-Family (Cassandra) — time-series metrics, logs, high-volume event streams
🕸️ Graph (Neo4j) — social graphs, "People You May Know", recommendation engines
Teacher's Note
Every use case in this lesson follows the same pattern: the data had a natural shape, and the right database was the one that stored and retrieved that shape most efficiently. The engineers who struggle with database choices are the ones who start with a database and then try to fit their data into it. Start with your data — its shape, its volume, its query pattern — and the database chooses itself. That's the skill this entire section of the course has been building toward.
Up Next · Lesson 11
Key-Value Databases
A deep dive into how key-value stores work under the hood — hash tables, in-memory architecture, persistence strategies, and the Redis data structures that power half the internet's infrastructure.