NoSQL
History of NoSQL
In 2004, a small team at Google built a system for storing data across thousands of cheap machines. They weren't trying to start a movement — they were just trying to keep Google Search alive. The paper they eventually published about that system quietly broke a 30-year-old assumption about how databases had to work.
First — Why Did SQL Rule for 30 Years?
SQL was invented at IBM in 1974. For three decades, it was genuinely the right tool for almost everything. Businesses ran on structured data — invoices, employee records, transactions. The data fit neatly into tables. Servers were big and expensive but fast enough. Life was good.
Then the internet happened. Suddenly data didn't look like neat tables anymore. It looked like user profiles with 40 different fields, photo uploads, friend graphs, real-time messages, clickstream logs. And the traffic wasn't hundreds of users — it was millions, then billions. SQL wasn't broken. The world just outgrew it in specific ways.
The key shift in one sentence:
The web moved from a few users reading structured data to billions of users writing messy, unpredictable data constantly — and SQL databases weren't designed for that second world.
The NoSQL Timeline — How We Got Here
Here's the actual history, year by year. Each dot on this timeline is a moment that changed how the industry stored data:
1970 — The Relational Model Is Born at IBM
Edgar Codd publishes his relational model paper. SQL, built on that model at IBM a few years later, becomes the universal language for databases. Every serious business runs on it for the next 30 years.
2004 — Google's Bigtable (Internal)
Google engineers build Bigtable internally to handle the web crawl data behind Google Search. It stores data across thousands of machines in a completely new way — rows keyed by a single string, no joins, no fixed schema. It works brilliantly.
2007 — Amazon's Dynamo Paper
Amazon publishes how they built Dynamo — a key-value store designed around one specific problem: the shopping cart must NEVER go down, even if servers fail. They introduce "eventual consistency" — the idea that it's okay for data to be slightly out of sync for a moment, as long as it catches up. Revolutionary thinking.
2008–2009 — The Open Source Explosion
Inspired by those papers, engineers start building open source alternatives. CouchDB had already appeared in 2005, and now Cassandra (Facebook, 2008), HBase (2008), and MongoDB (2009) all launch within a two-year window. Suddenly everyone could use these ideas, not just Google and Amazon.
June 2009 — The Name "NoSQL" Is Coined
Johan Oskarsson, a developer in London, wants to organise a meetup about these new non-relational databases. He needs a short hashtag for Twitter. He asks on IRC, someone suggests #nosql, and it sticks. The entire movement gets its name from a tweet. Not a paper. Not a conference. A tweet.
2010–Today — Mainstream & Cloud-Native
AWS launches DynamoDB (2012), MongoDB goes public (2017), Redis becomes the most-loved database on the Stack Overflow Developer Survey for years running. NoSQL is no longer a niche — it's a standard part of every production architecture.
Story 1 — Google's Problem in 2004
To understand Bigtable, you have to understand what Google was doing. They were crawling the entire web — billions of web pages, storing the content of each page, tracking when each page was last crawled, and updating it constantly.
In a SQL database, you'd have a table with a column for the URL, a column for the content, a column for the last-crawled date. Clean, simple. But at Google's scale — billions of rows, terabytes of data — that SQL table needed to fit on one very powerful server. And that server had limits.
Google's solution was radical: spread the data across thousands of cheap servers, use a simple row key to find any piece of data instantly, and throw away joins entirely.
The scenario: Imagine you're storing web crawl data just like Google did. Each row is one web page. The key is the URL. Here's what that data structure looks like — tiny, focused, no confusion:
```
# Bigtable-style data — think of it as a giant map
# Row Key           Column             Value
# --------------    ----------------   -------
com.google.www      content:html       "<html>Google Search...</html>"
com.google.www      metadata:crawled   "2004-03-15 09:22:11"
com.bbc.news        content:html       "<html>Breaking News...</html>"
com.bbc.news        metadata:crawled   "2004-03-15 11:05:44"
```
What just happened? — Line by line
com.google.www
This is the row key — the unique identifier for this piece of data. Notice it's the URL written backwards (com.google.www not www.google.com). That's intentional — it groups all Google pages together when sorted alphabetically. Smart design trick.
content:html
This is the column name. The part before the colon (content) is called a column family — a group. The part after (html) is the specific column. You can add new columns to any row without touching any other row. No ALTER TABLE. No migrations.
metadata:crawled
A second column family called metadata, storing when this page was last crawled. Notice: com.google.www and com.bbc.news both have this column. But if a new page had never been crawled, it simply wouldn't have this column at all — no NULL needed.
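The reversed-key trick can be sketched in a few lines of Python. This is a hypothetical helper for illustration, not code from the Bigtable paper:

```python
# Hypothetical helper showing why reversed hostnames group a site's
# pages together when the row keys are kept in sorted order.

def reverse_domain(hostname: str) -> str:
    """Turn 'www.google.com' into 'com.google.www'."""
    return '.'.join(reversed(hostname.split('.')))

keys = [reverse_domain(h) for h in [
    'www.google.com', 'news.bbc.com', 'mail.google.com', 'www.bbc.com',
]]
print(sorted(keys))
# -> ['com.bbc.news', 'com.bbc.www', 'com.google.mail', 'com.google.www']
```

With normal hostnames, mail.google.com and www.google.com sort far apart; reversed, every Google row is adjacent, so a range scan over one site touches one contiguous slice of the table.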
The big idea: This table can be split across 1,000 servers. Server 1 stores rows starting with com.a to com.m. Server 2 stores com.n to com.z. Any query goes directly to the right server. Google can add more servers any time. No single server ever becomes a bottleneck.
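The splitting described above is range partitioning. Here is a toy sketch of just the routing step, with made-up server names and key ranges (Bigtable's real mechanism, tablet assignment, is more involved):

```python
# Toy range partitioning: each server owns a contiguous range of
# sorted row keys, and a lookup routes directly to the right server.
import bisect

# Each tuple: (first row key the server is responsible for, server name)
ranges = [('com.a', 'server-1'), ('com.n', 'server-2')]

def route(row_key: str) -> str:
    """Find the server whose key range contains this row key."""
    starts = [start for start, _ in ranges]
    i = bisect.bisect_right(starts, row_key) - 1
    return ranges[max(i, 0)][1]

print(route('com.bbc.news'))    # in com.a..com.m -> server-1
print(route('com.wikipedia'))   # in com.n..com.z -> server-2
```

Adding capacity just means splitting a range and handing half of it to a new server; no other server's data moves.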
Story 2 — Amazon's Shopping Cart Problem in 2007
Amazon had a very different problem from Google. Google needed to store and read massive amounts of data fast. Amazon needed something much simpler but much more critical: the shopping cart must never fail.
Think about what happens when you add an item to your Amazon cart. That one click touches a database. If the database is down — even for 200 milliseconds — you see an error. Amazon calculated that every 100ms of latency cost them 1% in sales. At their scale, a few minutes of downtime was millions of dollars lost.
Their SQL database could not guarantee that. Any server could fail. Any network partition could separate the database from the app. So they asked: what if we design a database where availability is the #1 priority, and we accept that data might be slightly stale for a second?
Amazon's "Eventual Consistency" — the key insight
Imagine two people updating the same shopping cart from different devices at the same moment. Instead of locking the database and making one wait (which can cause failures), Dynamo lets both writes go through and resolves the conflict a moment later. The cart might show slightly different items for half a second — but it never crashes. That trade-off was worth billions.
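The Dynamo paper's shopping-cart example resolves divergent versions by merging them so that an "add to cart" is never lost. Here is a minimal sketch of just the merge step, with assumed data shapes (real Dynamo detects the divergence using vector clocks):

```python
# Minimal sketch of Dynamo-style cart conflict resolution: when two
# replicas diverge, merge by set union so no added item is ever lost.

def merge_carts(version_a: list, version_b: list) -> list:
    return sorted(set(version_a) | set(version_b))

# Phone added a hat while the laptop added a scarf, concurrently:
phone = ['shoe-xl', 'hat-m']
laptop = ['shoe-xl', 'scarf-s']
print(merge_carts(phone, laptop))
# -> ['hat-m', 'scarf-s', 'shoe-xl']
```

The paper also notes the cost of this choice: a deleted item can occasionally resurface after a merge, which Amazon judged an acceptable price for a cart that never errors out.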
The scenario: You're building the cart service at an e-commerce company. Your lead says "I don't care if the cart takes 50ms to sync between devices — I care that it NEVER throws an error." Here's what that DynamoDB-style key-value storage looks like:
```python
# DynamoDB (Python) — storing a shopping cart
# Each cart is ONE item in the table, keyed by user ID
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('shopping_carts')
```
What these 3 lines do:
boto3.resource('dynamodb') — connects to DynamoDB using AWS SDK. boto3 is Amazon's Python library for all AWS services.
dynamodb.Table('shopping_carts') — points to a specific table. Think of it like selecting which database to use. No SQL connection string, no username/password config shown here — AWS handles authentication separately.
```python
# Save the cart — one put_item call stores the whole cart
table.put_item(Item={
    'user_id': 'user_4492',          # the primary key — like a row key
    'items': ['shoe-xl', 'hat-m'],   # the cart contents — just a list
    'last_updated': '2024-01-15'     # when it was last changed
})
```
Line by line — no skipping:
table.put_item(Item={...})
Creates or replaces one item (one row) in the table. If a cart for user_4492 already exists, it gets completely overwritten. Simple and deliberate.
'user_id': 'user_4492'
This is the partition key — DynamoDB's version of a primary key. It's how DynamoDB decides which server stores this item. All data for user_4492 lives on the same server. Fast lookups, every time.
'items': ['shoe-xl', 'hat-m']
A list stored directly inside the item. In SQL you'd need a separate cart_items table and a JOIN to get this. Here it's just one field. One read to get the whole cart.
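To make the replace-on-write and single-key-read semantics concrete without an AWS account, here is a toy in-memory stand-in. The function names mirror the DynamoDB calls, but this is an illustration, not the real client:

```python
# Toy in-memory stand-in showing put_item's replace-whole-item and
# get_item's direct single-key lookup semantics.

store = {}

def put_item(item):
    store[item['user_id']] = item   # replaces any existing item whole

def get_item(user_id):
    return store.get(user_id)       # one direct lookup, no scan

put_item({'user_id': 'user_4492', 'items': ['shoe-xl']})
put_item({'user_id': 'user_4492', 'items': ['hat-m']})  # overwrites
print(get_item('user_4492'))
# -> {'user_id': 'user_4492', 'items': ['hat-m']}
```

Note the second put_item did not merge the lists: the first cart is gone. If you want append-rather-than-replace behavior, you have to ask for it explicitly.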
```python
# Read the cart back — one line, instant
response = table.get_item(Key={'user_id': 'user_4492'})
cart = response['Item']
print(cart)
# {
#     'user_id': 'user_4492',
#     'items': ['shoe-xl', 'hat-m'],
#     'last_updated': '2024-01-15'
# }
```
What just happened?
get_item(Key={'user_id': 'user_4492'}) — DynamoDB takes the key, hashes it, finds exactly which server holds this data, and returns it. No scanning. No searching. Direct lookup every single time. This is why it's under 10ms even at Amazon's scale.
response['Item'] — The result comes back as a plain Python dictionary. No ORM, no model class, no row mapping. Just a dict you can use directly. The whole cart — items and all — in one object. That simplicity is the point.
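A toy sketch of the routing idea: hash the partition key and map it deterministically to a server. The scheme and names here are assumptions for illustration, not DynamoDB internals:

```python
# Toy hash routing: same partition key -> same server, every time.
import hashlib

SERVERS = ['server-0', 'server-1', 'server-2']

def server_for(partition_key):
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# Deterministic: the same user always routes to the same server.
print(server_for('user_4492') == server_for('user_4492'))  # -> True
```

Real systems refine this with consistent hashing, as described in the Dynamo paper, so that adding a server moves only a small fraction of the keys instead of reshuffling everything.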
The Database Explosion — 2005 to 2012
Once Google and Amazon published their papers, engineers everywhere started building. Here's what launched and why each one was born:
| Year | Database | Type | Born From This Problem |
|---|---|---|---|
| 2005 | CouchDB | Document | Offline-first apps that sync when reconnected. REST-based API over HTTP. |
| 2006 | Google Bigtable | Column-Family | Storing the entire web crawl. Petabytes on thousands of machines. |
| 2007 | Amazon Dynamo | Key-Value | Shopping cart availability. Must never fail even when servers go down. |
| 2008 | Cassandra (Facebook) | Column-Family | Facebook Inbox search. 50 billion messages. Needed massive write throughput. |
| 2008 | HBase | Column-Family | Open source Bigtable clone. Used by Twitter, LinkedIn for analytics workloads. |
| 2009 | MongoDB | Document | Developer speed. Store JSON directly without designing a schema first. |
| 2009 | Redis | Key-Value | In-memory speed. Caching, leaderboards, sessions. Sub-millisecond reads. |
| 2010 | Neo4j | Graph | Social graphs, fraud rings. Relationships as first-class data, not JOIN hacks. |
| 2012 | AWS DynamoDB | Key-Value | Fully managed NoSQL for everyone. Amazon productises their internal Dynamo. |
How These Databases Relate to Each Other
It helps to see the family tree. Each database was inspired by what came before it:
- Column-family: Bigtable (internal at Google, 2004) inspired HBase (open source Bigtable, 2008) and Cassandra (Facebook, 2008), which also borrows from Dynamo.
- Key-value: Dynamo (paper, 2007) became DynamoDB (public product, 2012); Redis (2009) applies the same model in-memory.
- Documents: CouchDB (2005) and MongoDB (2009).
- Graph: Neo4j (2010).
Why This History Matters to You Right Now
This isn't just trivia. Knowing where each database came from tells you exactly what it's good at:
If you need massive write throughput
→ Use Cassandra. It was literally built for Facebook's inbox problem.
If you need zero-downtime key lookups
→ Use DynamoDB. It was built for Amazon's "never fail" shopping cart requirement.
If you need fast prototyping with flexible data
→ Use MongoDB. It was built for developer speed, not web-scale crawling.
If you need sub-millisecond cache reads
→ Use Redis. It stores everything in RAM, just like it was designed to do from day one.
Teacher's Note
Every database has a birth story — a specific problem it was designed to solve. When you know that story, tool selection becomes obvious instead of overwhelming. The engineers who struggle are the ones who pick databases based on hype. The ones who choose well are the ones who ask: "What problem was this database actually built for?"
Practice Questions
1. Amazon's Dynamo paper introduced a concept where data might be slightly out of sync for a short moment but catches up. What is this called? (two words)
2. In what year was the term "NoSQL" coined as a hashtag by Johan Oskarsson?
3. Which NoSQL database was built by Facebook in 2008 to handle inbox search across 50 billion messages?
Quiz
1. What specific problem was Google's Bigtable designed to solve?
2. What was Amazon's core requirement when designing Dynamo?
3. How did the term "NoSQL" actually get its name?
Up Next · Lesson 3
Problems with RDBMS
The exact walls SQL hits at scale — schema rigidity, join performance, vertical limits — and why they matter in your production systems today.