NoSQL
History of NoSQL
In 2004, a small team at Google built a system for storing data across thousands of cheap machines. They weren't trying to start a movement — they were just trying to keep Google Search alive. The paper they eventually published about that system quietly broke a 30-year-old assumption about how databases had to work.
First — Why Did SQL Rule for 30 Years?
SQL was invented at IBM in 1974. For three decades, it was genuinely the right tool for almost everything. Businesses ran on structured data — invoices, employee records, transactions. The data fit neatly into tables. Servers were big and expensive but fast enough. Life was good.
Then the internet happened. Suddenly data didn't look like neat tables anymore. It looked like user profiles with 40 different fields, photo uploads, friend graphs, real-time messages, clickstream logs. And the traffic wasn't hundreds of users — it was millions, then billions. SQL wasn't broken. The world just outgrew it in specific ways.
The key shift in one sentence:
The web moved from a few users reading structured data to billions of users writing messy, unpredictable data constantly — and SQL databases weren't designed for that second world.
The NoSQL Timeline — How We Got Here
Here's the actual history, year by year. Each dot on this timeline is a moment that changed how the industry stored data:
1970 — The Relational Model Is Born at IBM
Edgar Codd publishes his relational model paper. SQL, built on that model at IBM a few years later, becomes the universal language for databases. Every serious business runs on it for the next 30 years.
2004 — Google's Bigtable (Internal)
Google engineers build Bigtable internally to handle the web crawl data behind Google Search. It stores data across thousands of machines in a completely new way — rows keyed by a single string, no joins, no fixed schema. It works brilliantly.
2007 — Amazon's Dynamo Paper
Amazon publishes how they built Dynamo — a key-value store designed around one specific problem: the shopping cart must NEVER go down, even if servers fail. They introduce "eventual consistency" — the idea that it's okay for data to be slightly out of sync for a moment, as long as it catches up. Revolutionary thinking.
2008–2009 — The Open Source Explosion
Inspired by those papers, engineers start building open source alternatives. CouchDB had already appeared in 2005, and now Cassandra (Facebook, 2008), HBase (2008), and MongoDB (2009) all launch within a two-year window. Suddenly everyone could use these ideas, not just Google and Amazon.
June 2009 — The Name "NoSQL" Is Coined
Johan Oskarsson, a developer in London, wants to organise a meetup about these new non-relational databases. He needs a short hashtag for Twitter. He asks on IRC, someone suggests #nosql, and it sticks. The entire movement gets its name from a tweet. Not a paper. Not a conference. A tweet.
2010–Today — Mainstream & Cloud-Native
AWS launches DynamoDB (2012), MongoDB goes public (2017), Redis becomes the most-loved database on the Stack Overflow Developer Survey for years running. NoSQL is no longer a niche — it's a standard part of every production architecture.
Story 1 — Google's Problem in 2004
To understand Bigtable, you have to understand what Google was doing. They were crawling the entire web — billions of web pages, storing the content of each page, tracking when each page was last crawled, and updating it constantly.
In a SQL database, you'd have a table with a column for the URL, a column for the content, a column for the last-crawled date. Clean, simple. But at Google's scale — billions of rows, terabytes of data — that SQL table needed to fit on one very powerful server. And that server had limits.
Google's solution was radical: spread the data across thousands of cheap servers, use a simple row key to find any piece of data instantly, and throw away joins entirely.
The scenario: Imagine you're storing web crawl data just like Google did. Each row is one web page. The key is the URL. Here's what that data structure looks like — tiny, focused, no confusion:
```
# Bigtable-style data — think of it as a giant map
# Row Key           Column             Value
# --------------    ----------------   -------
com.google.www      content:html       "<html>Google Search...</html>"
com.google.www      metadata:crawled   "2004-03-15 09:22:11"
com.bbc.news        content:html       "<html>Breaking News...</html>"
com.bbc.news        metadata:crawled   "2004-03-15 11:05:44"
```
What just happened? — Line by line
com.google.www
This is the row key — the unique identifier for this piece of data. Notice it's the URL written backwards (com.google.www not www.google.com). That's intentional — it groups all Google pages together when sorted alphabetically. Smart design trick.
content:html
This is the column name. The part before the colon (content) is called a column family — a group. The part after (html) is the specific column. You can add new columns to any row without touching any other row. No ALTER TABLE. No migrations.
metadata:crawled
A second column family called metadata, storing when this page was last crawled. Notice: com.google.www and com.bbc.news both have this column. But if a new page had never been crawled, it simply wouldn't have this column at all — no NULL needed.
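The reversed-key trick can be sketched in a few lines of Python. This is a hypothetical helper for illustration, not code from the Bigtable paper:

```python
# Hypothetical helper showing why reversed hostnames group a site's
# pages together when the row keys are kept in sorted order.

def reverse_domain(hostname: str) -> str:
    """Turn 'www.google.com' into 'com.google.www'."""
    return '.'.join(reversed(hostname.split('.')))

keys = [reverse_domain(h) for h in [
    'www.google.com', 'news.bbc.com', 'mail.google.com', 'www.bbc.com',
]]
print(sorted(keys))
# -> ['com.bbc.news', 'com.bbc.www', 'com.google.mail', 'com.google.www']
```

With normal hostnames, mail.google.com and www.google.com sort far apart; reversed, every Google row is adjacent, so a range scan over one site touches one contiguous slice of the table.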
The big idea: This table can be split across 1,000 servers. Server 1 stores rows starting with com.a to com.m. Server 2 stores com.n to com.z. Any query goes directly to the right server. Google can add more servers any time. No single server ever becomes a bottleneck.
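The splitting described above is range partitioning. Here is a toy sketch of just the routing step, with made-up server names and key ranges (Bigtable's real mechanism, tablet assignment, is more involved):

```python
# Toy range partitioning: each server owns a contiguous range of
# sorted row keys, and a lookup routes directly to the right server.
import bisect

# Each tuple: (first row key the server is responsible for, server name)
ranges = [('com.a', 'server-1'), ('com.n', 'server-2')]

def route(row_key: str) -> str:
    """Find the server whose key range contains this row key."""
    starts = [start for start, _ in ranges]
    i = bisect.bisect_right(starts, row_key) - 1
    return ranges[max(i, 0)][1]

print(route('com.bbc.news'))    # in com.a..com.m -> server-1
print(route('com.wikipedia'))   # in com.n..com.z -> server-2
```

Adding capacity just means splitting a range and handing half of it to a new server; no other server's data moves.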
Story 2 — Amazon's Shopping Cart Problem in 2007
Amazon had a very different problem from Google. Google needed to store and read massive amounts of data fast. Amazon needed something much simpler but much more critical: the shopping cart must never fail.
Think about what happens when you add an item to your Amazon cart. That one click touches a database. If the database is down — even for 200 milliseconds — you see an error. Amazon calculated that every 100ms of latency cost them 1% in sales. At their scale, a few minutes of downtime was millions of dollars lost.
Their SQL database could not guarantee that. Any server could fail. Any network partition could separate the database from the app. So they asked: what if we design a database where availability is the #1 priority, and we accept that data might be slightly stale for a second?
Amazon's "Eventual Consistency" — the key insight
Imagine two people updating the same shopping cart from different devices at the same moment. Instead of locking the database and making one wait (which can cause failures), Dynamo lets both writes go through and resolves the conflict a moment later. The cart might show slightly different items for half a second — but it never crashes. That trade-off was worth billions.
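The Dynamo paper's shopping-cart example resolves divergent versions by merging them so that an "add to cart" is never lost. Here is a minimal sketch of just the merge step, with assumed data shapes (real Dynamo detects the divergence using vector clocks):

```python
# Minimal sketch of Dynamo-style cart conflict resolution: when two
# replicas diverge, merge by set union so no added item is ever lost.

def merge_carts(version_a: list, version_b: list) -> list:
    return sorted(set(version_a) | set(version_b))

# Phone added a hat while the laptop added a scarf, concurrently:
phone = ['shoe-xl', 'hat-m']
laptop = ['shoe-xl', 'scarf-s']
print(merge_carts(phone, laptop))
# -> ['hat-m', 'scarf-s', 'shoe-xl']
```

The paper also notes the cost of this choice: a deleted item can occasionally resurface after a merge, which Amazon judged an acceptable price for a cart that never errors out.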
The scenario: You're building the cart service at an e-commerce company. Your lead says "I don't care if the cart takes 50ms to sync between devices — I care that it NEVER throws an error." Here's what that DynamoDB-style key-value storage looks like:
```python
# DynamoDB (Python) — storing a shopping cart
# Each cart is ONE item in the table, keyed by user ID
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('shopping_carts')
```
What these 3 lines do:
boto3.resource('dynamodb') — connects to DynamoDB using AWS SDK. boto3 is Amazon's Python library for all AWS services.
dynamodb.Table('shopping_carts') — points to a specific table. Think of it like selecting which database to use. No SQL connection string, no username/password config shown here — AWS handles authentication separately.
```python
# Save the cart — one put_item call stores the whole cart
table.put_item(Item={
    'user_id': 'user_4492',          # the primary key — like a row key
    'items': ['shoe-xl', 'hat-m'],   # the cart contents — just a list
    'last_updated': '2024-01-15'     # when it was last changed
})
```
Line by line — no skipping:
table.put_item(Item={...})
Creates or replaces one item (one row) in the table. If a cart for user_4492 already exists, it gets completely overwritten. Simple and deliberate.
'user_id': 'user_4492'
This is the partition key — DynamoDB's version of a primary key. It's how DynamoDB decides which server stores this item. All data for user_4492 lives on the same server. Fast lookups, every time.
'items': ['shoe-xl', 'hat-m']
A list stored directly inside the item. In SQL you'd need a separate cart_items table and a JOIN to get this. Here it's just one field. One read to get the whole cart.
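To make the replace-on-write and single-key-read semantics concrete without an AWS account, here is a toy in-memory stand-in. The function names mirror the DynamoDB calls, but this is an illustration, not the real client:

```python
# Toy in-memory stand-in showing put_item's replace-whole-item and
# get_item's direct single-key lookup semantics.

store = {}

def put_item(item):
    store[item['user_id']] = item   # replaces any existing item whole

def get_item(user_id):
    return store.get(user_id)       # one direct lookup, no scan

put_item({'user_id': 'user_4492', 'items': ['shoe-xl']})
put_item({'user_id': 'user_4492', 'items': ['hat-m']})  # overwrites
print(get_item('user_4492'))
# -> {'user_id': 'user_4492', 'items': ['hat-m']}
```

Note the second put_item did not merge the lists: the first cart is gone. If you want append-rather-than-replace behavior, you have to ask for it explicitly.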
```python
# Read the cart back — one line, instant
response = table.get_item(Key={'user_id': 'user_4492'})
cart = response['Item']
print(cart)
# {
#     'user_id': 'user_4492',
#     'items': ['shoe-xl', 'hat-m'],
#     'last_updated': '2024-01-15'
# }
```
What just happened?
get_item(Key={'user_id': 'user_4492'}) — DynamoDB takes the key, hashes it, finds exactly which server holds this data, and returns it. No scanning. No searching. Direct lookup every single time. This is why it's under 10ms even at Amazon's scale.
response['Item'] — The result comes back as a plain Python dictionary. No ORM, no model class, no row mapping. Just a dict you can use directly. The whole cart — items and all — in one object. That simplicity is the point.
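A toy sketch of the routing idea: hash the partition key and map it deterministically to a server. The scheme and names here are assumptions for illustration, not DynamoDB internals:

```python
# Toy hash routing: same partition key -> same server, every time.
import hashlib

SERVERS = ['server-0', 'server-1', 'server-2']

def server_for(partition_key):
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# Deterministic: the same user always routes to the same server.
print(server_for('user_4492') == server_for('user_4492'))  # -> True
```

Real systems refine this with consistent hashing, as described in the Dynamo paper, so that adding a server moves only a small fraction of the keys instead of reshuffling everything.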
The Database Explosion — 2005 to 2012
Once Google and Amazon published their papers, engineers everywhere started building. Here's what launched and why each one was born:
| Year | Database | Type | Born From This Problem |
|---|---|---|---|
| 2005 | CouchDB | Document | Offline-first apps that sync when reconnected. REST-based API over HTTP. |
| 2006 | Google Bigtable | Column-Family | Storing the entire web crawl. Petabytes on thousands of machines. |
| 2007 | Amazon Dynamo | Key-Value | Shopping cart availability. Must never fail even when servers go down. |
| 2008 | Cassandra (Facebook) | Column-Family | Facebook Inbox search. 50 billion messages. Needed massive write throughput. |
| 2008 | HBase | Column-Family | Open source Bigtable clone. Used by Twitter, LinkedIn for analytics workloads. |
| 2009 | MongoDB | Document | Developer speed. Store JSON directly without designing a schema first. |
| 2009 | Redis | Key-Value | In-memory speed. Caching, leaderboards, sessions. Sub-millisecond reads. |
| 2010 | Neo4j | Graph | Social graphs, fraud rings. Relationships as first-class data, not JOIN hacks. |
| 2012 | AWS DynamoDB | Key-Value | Fully managed NoSQL for everyone. Amazon productises their internal Dynamo. |
How These Databases Relate to Each Other
It helps to see the family tree. Each database was inspired by what came before it:
- Column-family: Bigtable (internal at Google, 2004) inspired HBase (open source Bigtable, 2008) and Cassandra (Facebook, 2008), which also borrows from Dynamo.
- Key-value: Dynamo (paper, 2007) became DynamoDB (public product, 2012); Redis (2009) applies the same model in-memory.
- Documents: CouchDB (2005) and MongoDB (2009).
- Graph: Neo4j (2010).
Why This History Matters to You Right Now
This isn't just trivia. Knowing where each database came from tells you exactly what it's good at:
If you need massive write throughput
→ Use Cassandra. It was literally built for Facebook's inbox problem.
If you need zero-downtime key lookups
→ Use DynamoDB. It was built for Amazon's "never fail" shopping cart requirement.
If you need fast prototyping with flexible data
→ Use MongoDB. It was built for developer speed, not web-scale crawling.
If you need sub-millisecond cache reads
→ Use Redis. It stores everything in RAM, just like it was designed to do from day one.
Teacher's Note
Every database has a birth story — a specific problem it was designed to solve. When you know that story, tool selection becomes obvious instead of overwhelming. The engineers who struggle are the ones who pick databases based on hype. The ones who choose well are the ones who ask: "What problem was this database actually built for?"
Practice Questions
1. Amazon's Dynamo paper introduced a concept where data might be slightly out of sync for a short moment but catches up. What is this called? (two words)
2. In what year was the term "NoSQL" coined as a hashtag by Johan Oskarsson?
3. Which NoSQL database was built by Facebook in 2008 to handle inbox search across 50 billion messages?
Quiz
1. What specific problem was Google's Bigtable designed to solve?
2. What was Amazon's core requirement when designing Dynamo?
3. How did the term "NoSQL" actually get its name?
Up Next · Lesson 3
Problems with RDBMS
The exact walls SQL hits at scale — schema rigidity, join performance, vertical limits — and why they matter in your production systems today.