NoSQL Lesson 22 – Choosing the Right NoSQL | Dataplexa
NoSQL Database Types · Lesson 22

Choosing the Right NoSQL Database

The most expensive database mistake is not picking the wrong engine — it is picking it for the wrong reasons. Teams choose MongoDB because it is popular, Redis because it is fast, Cassandra because Netflix uses it. Six months later they are fighting the data model, rewriting queries, or migrating to something else entirely. This lesson gives you a systematic framework so you choose the right database the first time.

The Decision Framework — Four Questions First

Before you look at any database, answer these four questions about your workload. The answers will eliminate most of the options immediately.

Question 1 — What does a read look like?

A single key lookup? A range scan? A multi-hop traversal? A full-text search? Aggregation across millions of rows? The read pattern is the single most important constraint.

Question 2 — What is the write volume and latency requirement?

100 writes/sec or 1 million? Does every write need sub-millisecond acknowledgement, or can you tolerate 10ms? This determines whether you need an LSM-tree engine or a B-tree.

Question 3 — How flexible is your data structure?

Does every record look the same, or do different entities have wildly different shapes? Do you need schema enforcement or schema freedom? Is the structure evolving rapidly?

Question 4 — How critical is consistency?

Can users briefly see stale data? Or does your domain — banking, healthcare, inventory — require that every read sees the latest committed write? This determines your CAP theorem position.

The Full Comparison — All Four NoSQL Families

Key-Value — Redis, DynamoDB, Memcached
    Best read pattern: single key lookup by exact ID
    Write strength: sub-millisecond, extremely high throughput
    Consistency: strong on a single node (Redis); eventual by default, strong on request (DynamoDB)
    Avoid when: you need to query by anything other than the key

Document — MongoDB, CouchDB, Firestore
    Best read pattern: rich queries on nested fields, flexible filters
    Write strength: good — indexed writes, flexible schema
    Consistency: tunable (MongoDB) or eventual (CouchDB)
    Avoid when: you need heavy cross-document transactions or time-series at scale

Column-Family — Cassandra, HBase, ScyllaDB
    Best read pattern: range scans on a partition key + clustering column
    Write strength: exceptional — LSM-tree storage, millions of writes/sec
    Consistency: tunable (Cassandra) or strong (HBase)
    Avoid when: you need ad-hoc queries or multi-partition transactions

Graph — Neo4j, Amazon Neptune, TigerGraph
    Best read pattern: multi-hop traversals, relationship pattern matching
    Write strength: moderate — strong for connected writes
    Consistency: strong (Neo4j)
    Avoid when: you need heavy aggregations or very high write throughput

Decision Tree — From Workload to Database

Follow the path that matches your workload

Is your primary access pattern a single key lookup?
├── YES: Is the data ephemeral / needs expiry or pub-sub?
│ ├── YES → Redis (in-memory, TTL, pub-sub, data structures)
│ └── NO → DynamoDB (durable, serverless, global tables)
└── NO: continue ↓
Does your data have rich, varying structure with nested fields?
├── YES: Do you need offline sync / multi-master replication?
│ ├── YES → CouchDB (MVCC, offline-first, HTTP API)
│ └── NO → MongoDB (rich queries, aggregation pipeline, indexes)
└── NO: continue ↓
Is the workload time-series or event-stream with very high write volume?
├── YES: Are you already running Hadoop / HDFS?
│ ├── YES → HBase (HDFS-native, batch + random access)
│ └── NO → Cassandra (peer-to-peer, tunable consistency, no HDFS needed)
└── NO: continue ↓
Is the primary challenge multi-hop relationship traversal?
├── YES → Neo4j / Graph DB (fraud, recommendations, knowledge graphs)
└── NO → PostgreSQL — you probably do not need NoSQL at all
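
The tree above can be sketched as a plain function. This is a minimal illustration, not a standard API; the workload keys are invented here purely to make the branching explicit:

```python
def choose_database(workload: dict) -> str:
    """Walk the decision tree for a workload described as boolean flags.

    The flag names (single_key_lookup, ephemeral_or_pubsub, ...) are
    illustrative inventions mirroring the questions in the tree above.
    """
    if workload.get("single_key_lookup"):
        # Ephemeral data, expiry, or pub-sub points at Redis
        return "Redis" if workload.get("ephemeral_or_pubsub") else "DynamoDB"
    if workload.get("rich_varying_structure"):
        # Offline sync / multi-master replication points at CouchDB
        return "CouchDB" if workload.get("offline_sync") else "MongoDB"
    if workload.get("high_volume_time_series"):
        # Existing Hadoop/HDFS infrastructure points at HBase
        return "HBase" if workload.get("hadoop_in_place") else "Cassandra"
    if workload.get("multi_hop_traversal"):
        return "Neo4j"
    return "PostgreSQL"  # you probably do not need NoSQL at all

# Time-series workload with no Hadoop in place:
print(choose_database({"high_volume_time_series": True}))  # Cassandra
```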

Real Architecture Decisions — Five Case Studies

The scenario: You have just joined a fast-growing startup as the lead engineer. Five different product requirements land on your desk the same week. Each one needs a database decision justified to your CTO. Work through each one.

Case 1 — Live Football Score Updates

5 million concurrent users. Each needs the current score for one match — accessed by match ID. Scores change every few minutes. Data up to 5 seconds stale is acceptable. The backend already pushes updates via WebSocket.

Decision: Redis — single key lookup by match ID, sub-millisecond reads, TTL keeps stale scores from accumulating, pub-sub pushes updates to WebSocket handlers the moment the score changes.
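
The TTL pattern behind this decision can be sketched with a toy in-memory class. This is a stand-in, not a Redis client: in production you would call redis-py's `setex` and `get`, and `ScoreCache` and its method names are invented here:

```python
import time

class ScoreCache:
    """Toy stand-in for the Case 1 Redis pattern: SET with a TTL,
    GET by match ID. Real code would use redis-py's setex/get."""

    def __init__(self):
        self._store = {}  # match_id -> (score, expires_at)

    def set_score(self, match_id: str, score: str, ttl_seconds: float = 5):
        # Equivalent of SETEX: store the value with an expiry deadline
        self._store[match_id] = (score, time.monotonic() + ttl_seconds)

    def get_score(self, match_id: str):
        entry = self._store.get(match_id)
        if entry is None:
            return None
        score, expires_at = entry
        if time.monotonic() >= expires_at:
            # Past its TTL: evict the stale score and report a miss
            del self._store[match_id]
            return None
        return score

cache = ScoreCache()
cache.set_score("match:4421", "2-1", ttl_seconds=5)
print(cache.get_score("match:4421"))  # 2-1 (until the TTL expires)
```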

Case 2 — E-commerce Product Catalogue

200,000 products, each with a completely different attribute structure — electronics have voltage and warranty, clothing has size and material, food has allergens and calories. Marketing wants to filter by any combination of attributes. New product types are added monthly.

Decision: MongoDB — flexible document schema absorbs any product shape without migrations, rich query operators filter on any field, indexes cover the specific attribute combinations marketing uses most often.
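
A sketch of how such a filter could be assembled as a standard MongoDB query document. The `attributes` sub-document, the field names, and the helper itself are assumptions for illustration; real code would pass the resulting dict to `collection.find(...)` via PyMongo:

```python
def build_catalogue_filter(**attrs):
    """Turn marketing's attribute choices into a MongoDB filter dict.

    Assumes an invented document shape where category is a top-level
    field and everything else lives under an `attributes` sub-document,
    e.g. {"category": "clothing", "attributes": {"size": "M", ...}}.
    """
    filt = {}
    for key, value in attrs.items():
        # Dot notation reaches into the nested sub-document
        field = key if key == "category" else f"attributes.{key}"
        filt[field] = value
    return filt

print(build_catalogue_filter(category="electronics", voltage="230V"))
# {'category': 'electronics', 'attributes.voltage': '230V'}
```

Because the filter is built per request, marketing can combine any attributes without schema changes; indexes then cover the combinations used most often.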

Case 3 — IoT Temperature Monitoring

80,000 sensors, each sending a reading every 5 seconds — 16,000 writes per second sustained. Queries are always "get all readings for sensor X between time A and time B." Data older than 90 days is irrelevant and should be automatically deleted.

Decision: Cassandra — partition key on sensor_id, clustering column on timestamp, TTL handles 90-day expiry automatically. LSM-tree write path sustains 16,000 writes/sec without breaking a sweat. Range scans per sensor are a single partition read.
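
One plausible data model for this decision, sketched as a pure function. The table definition in the comment and all names are illustrative, not a prescribed schema; bucketing the partition by day keeps any single partition bounded (at one reading every 5 seconds, that is 17,280 rows per sensor per day):

```python
from datetime import datetime, timezone

# In CQL terms the sketch corresponds to something like (names invented):
#   CREATE TABLE readings (
#       sensor_id text, day text, ts timestamp, temp_c double,
#       PRIMARY KEY ((sensor_id, day), ts)
#   ) WITH default_time_to_live = 7776000;  -- 90 days, in seconds
# Reads for "sensor X between A and B" then touch only the day
# partitions in that range, each a sequential clustering-column scan.

def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Composite partition key for one sensor-day of readings."""
    return (sensor_id, ts.strftime("%Y-%m-%d"))

ts = datetime(2024, 3, 1, 12, 30, tzinfo=timezone.utc)
print(partition_key("sensor-042", ts))  # ('sensor-042', '2024-03-01')
```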

Case 4 — LinkedIn-Style "People You May Know"

10 million users. Each user has connections to other users, and those connections have connections. The feature needs second- and third-degree connections in under 50ms. The prototype in Postgres using recursive CTEs takes 8 seconds and degrades further as the network grows.

Decision: Neo4j — variable-length traversal follows direct relationship pointers, so second- and third-degree lookups are native operations. Traversal cost depends on the size of the neighbourhood visited, not on the total size of the network, whereas each hop of a Postgres recursive CTE re-joins the entire connections table and slows down as the graph grows.
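
To see why the traversal pattern suits a graph engine, here is a pure-Python breadth-first sketch over a toy adjacency dict. This is not Neo4j code; in Cypher the equivalent is roughly a variable-length match such as `-[:KNOWS*2..3]-`:

```python
from collections import deque

def connections_at_depth(graph, start, min_depth=2, max_depth=3):
    """Return people exactly 2-3 hops away (the 'you may know' set).

    `graph` is a plain adjacency dict: node -> list of neighbours.
    """
    seen = {start}                 # each node visited once, at its shortest depth
    queue = deque([(start, 0)])
    result = set()
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:     # never expand past the maximum hop count
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                if depth + 1 >= min_depth:
                    result.add(neighbour)
                queue.append((neighbour, depth + 1))
    return result

graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "erin"],
    "dave": ["bob", "frank"],
}
print(sorted(connections_at_depth(graph, "alice")))  # ['dave', 'erin', 'frank']
```

Note that the work done is proportional to the edges actually followed from `start`, which is the property a native graph store exploits at scale.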

Case 5 — Hospital Patient Records

Complex nested data (diagnoses, medications, allergies, test results). Relationships between patients, doctors, and departments matter. Every write must be ACID. Regulators require that data is never lost or inconsistent. 50,000 records total — no massive scale.

Decision: PostgreSQL + JSONB — full ACID, JSONB columns handle nested data flexibly, foreign keys enforce referential integrity, 50,000 records is not a NoSQL-scale problem. Do not reach for NoSQL when Postgres already solves the problem cleanly.

The Polyglot Pattern — When You Need More Than One

Most production systems at scale do not use a single database. They use the right database for each concern. This is called polyglot persistence — different data stores for different parts of the same application.

Example: E-commerce Platform — Polyglot Architecture

Orders & Payments → PostgreSQL
    ACID transactions, financial integrity, referential constraints

Product Catalogue → MongoDB
    Flexible schema per category, rich attribute queries, fast iteration

Sessions & Cart → Redis
    Sub-ms reads, TTL for abandoned carts, pub-sub for real-time updates

Recommendations → Neo4j
    Purchase graph traversal, collaborative filtering, real-time recommendations

Clickstream Events → Cassandra
    Millions of events/sec, partitioned by user, TTL for 90-day retention

The Warning Signs — You Chose the Wrong Database

You picked Cassandra and now…

Every feature request requires a new table because your queries keep changing. You are doing ALLOW FILTERING on every query. Developers are asking "how do I JOIN these two tables?" You needed a document database.

You picked MongoDB and now…

Writes are piling up at 500,000/sec and the cluster is gasping. The collection has 20 billion documents. Every read scans a huge index. You needed Cassandra for the write throughput and partition-based access.

You picked Redis and now…

The dataset grew to 400GB and you are paying a fortune for RAM. You have complex filtering requirements beyond key lookup. You needed DynamoDB or MongoDB — Redis was the right tool for a much smaller hot dataset.

You picked Neo4j and now…

You are trying to aggregate revenue across 2 billion transactions using graph traversal. Query times are minutes. The data is fundamentally tabular — relationships are not the interesting part. You needed a columnar data warehouse.

Teacher's Note

The right answer is almost always "it depends" — but that is not an excuse to avoid deciding. Use the four questions at the top of this lesson every single time, and when in doubt, choose the simplest option that solves the problem. PostgreSQL solves more problems than most engineers give it credit for. NoSQL is not an upgrade from SQL — it is a specialisation. Reach for it only when SQL's constraints are genuinely causing you pain, not before.

Practice Questions — You're the Engineer

Scenario:

A fitness app logs a GPS coordinate every 3 seconds for each active user. At peak, 200,000 users are active simultaneously — that is 66,000 writes per second. The only query the app ever makes is "get all coordinates for user X between time A and time B." Coordinates older than 30 days are never needed. You have no existing Hadoop infrastructure. Which database family and specific system should you choose?


Scenario:

Your CTO reviews the architecture diagram for your new platform and notices it uses PostgreSQL for orders, MongoDB for the product catalogue, Redis for session data, and Cassandra for event streams. She asks what architectural pattern this represents — using multiple specialised databases rather than a single general-purpose one.


Scenario:

A startup is building an invoicing tool. It has customers, invoices, line items, and payments — all tightly related with referential integrity requirements. The dataset is expected to reach 500,000 records. A junior developer suggests MongoDB because "NoSQL is more modern." What database should you actually choose, and why is the junior developer's reasoning flawed?


Quiz — Database Selection in Production

Scenario:

A retail company is rebuilding its product catalogue. Electronics have voltage, warranty period, and connector type. Clothing has size, material, and wash instructions. Furniture has dimensions and weight capacity. Marketing filters products by any combination of these attributes. New product categories are added quarterly. Write volume is modest — 500 updates per day. Which database should you choose?

Scenario:

Your authentication service handles 50,000 login sessions simultaneously. Every API request includes a session token. The service must look up that token in under 1ms to validate the user. Sessions expire after 30 minutes of inactivity. The session object contains only a user ID, permissions list, and last-seen timestamp — no complex structure. Which database should store the session data?

Scenario:

A data analyst joins your company and immediately complains that every query they write against your Cassandra-backed analytics store requires ALLOW FILTERING and takes 40 seconds. They need to filter events by region, event_type, user_plan, and date — in any combination, with no fixed access pattern. What is the root cause of the problem?

Up Next · Lesson 23

NoSQL Data Modeling

The principles that separate a schema that scales from one that falls apart at 10 million records — and how to apply them from day one.