Data Science Lesson 38 – NoSQL | Dataplexa

Data Storage · Lesson 38

NoSQL

Master document databases, key-value stores, and graph systems that power modern e-commerce platforms like Flipkart and Amazon.

Coming from SQL? You'll find NoSQL surprisingly liberating. No rigid schemas. No complex joins. Just flexible data storage that scales horizontally across thousands of servers.

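The name "NoSQL" is misleading: it doesn't mean "no SQL at all." It is usually read as "Not Only SQL". Some NoSQL databases offer SQL-like query languages (Cassandra's CQL, for example). The real difference? Many of them relax the strict ACID guarantees of relational databases in exchange for easier horizontal scaling, although several, including MongoDB, now support ACID transactions.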

Why NoSQL Exists

Picture Swiggy during dinner rush. 50,000 orders per minute. Customer profiles, restaurant menus, delivery locations, real-time tracking. A single relational database server can become a bottleneck at this kind of load, because scaling it mostly means buying a bigger machine. SQL works fine for the large majority of applications; it is the high-scale minority that trips everyone up.

Scale Horizontally

Handle Unstructured Data

Real-time Performance

Developer Flexibility

Four NoSQL Types

Document

MongoDB, CouchDB. JSON-like documents. Perfect for catalogs, user profiles.

Key-Value

Redis, DynamoDB. Simple pairs. Caching, session storage, real-time data.

Column

Cassandra, HBase. Wide columns. Analytics, time-series data.

Graph

Neo4j, Amazon Neptune. Relationships. Social networks, recommendations.
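
To make the four types more concrete, here is a rough sketch of one order shaped four ways using plain Python structures. The order, customer, and product identifiers are made up for illustration, and real databases add storage, indexing, and query layers that this does not show.

```python
# One hypothetical order, shaped four ways (plain Python stand-ins, no database needed).

# Document: one self-contained nested record.
document = {
    "order_id": "ORD1",
    "customer": {"id": "C7", "name": "Asha"},
    "items": [{"sku": "RICE-1KG", "qty": 2}],
}

# Key-value: opaque values looked up by a single key.
key_value = {
    "order:ORD1:customer": "C7",
    "order:ORD1:item_count": "1",
}

# Wide-column: row key -> column family -> columns.
wide_column = {
    "ORD1": {
        "customer": {"id": "C7", "name": "Asha"},
        "items": {"RICE-1KG:qty": 2},
    }
}

# Graph: nodes plus explicit, typed relationships.
nodes = {"C7": "Customer", "ORD1": "Order", "RICE-1KG": "Product"}
edges = [("C7", "PLACED", "ORD1"), ("ORD1", "CONTAINS", "RICE-1KG")]

# The same question ("who placed ORD1?") is answered differently in each model.
from_document = document["customer"]["id"]
from_key_value = key_value["order:ORD1:customer"]
from_wide_column = wide_column["ORD1"]["customer"]["id"]
from_graph = next(src for src, rel, dst in edges if rel == "PLACED" and dst == "ORD1")

print(from_document, from_key_value, from_wide_column, from_graph)
```

Notice that only the graph version stores the relationship itself as data, which is why graph systems suit relationship-heavy queries.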

MongoDB Essentials

MongoDB dominates the document database space. Think of it as SQL tables but each row can have completely different columns. No predefined schema. Store nested objects, arrays, any JSON structure.

The scenario: You're the lead analyst at BigBasket. Product catalog has thousands of variations — electronics have specifications, groceries have nutritional info, books have authors. One flexible collection handles everything.

# Import pymongo (install first with: pip install pymongo)
from pymongo import MongoClient

# Connect to MongoDB (local instance)
client = MongoClient('mongodb://localhost:27017/')
# Create or access database
db = client['bigbasket_catalog']

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'bigbasket_catalog')

What just happened?

We connected to MongoDB running locally on port 27017. The database bigbasket_catalog gets created automatically when we first insert data. Try this: On Mac, check whether MongoDB is running with brew services list, and start it with brew services start mongodb-community if it is not.

# Create collection (like a SQL table)
products = db['products']

# Insert a complex product document
smartphone = {
    "product_id": "PROD001",
    "name": "iPhone 15 Pro",
    "category": "Electronics",
    "price": 129900,
    "specifications": {
        "storage": "256GB",
        "ram": "8GB",
        "camera": "48MP Triple Camera"
    }
}

{'product_id': 'PROD001', 'name': 'iPhone 15 Pro', 'category': 'Electronics', 'price': 129900, 'specifications': {'storage': '256GB', 'ram': '8GB', 'camera': '48MP Triple Camera'}}

What just happened?

We created a products collection and defined a document with nested objects. Notice the specifications field contains multiple sub-fields. A normalized SQL schema would typically need a separate table (or a JSON column) for this. Try this: Add more nested levels like specifications.camera.features.

# Insert the document
result = products.insert_one(smartphone)
print(f"Inserted document ID: {result.inserted_id}")

# Define a completely different product structure
grocery_item = {
    "product_id": "PROD002",
    "name": "Organic Basmati Rice",
    "category": "Food",
    "price": 299,
    "nutrition": {
        "calories_per_100g": 130,
        "protein": "2.7g",
        "carbs": "28g"
    },
    "certifications": ["Organic", "Non-GMO"]
}

# Insert it into the same collection so later queries can find it
products.insert_one(grocery_item)

Inserted document ID: 507f1f77bcf86cd799439011
{'product_id': 'PROD002', 'name': 'Organic Basmati Rice', 'category': 'Food', 'price': 299, 'nutrition': {'calories_per_100g': 130, 'protein': '2.7g', 'carbs': '28g'}, 'certifications': ['Organic', 'Non-GMO']}

What just happened?

MongoDB auto-generated a unique _id field. The grocery item has completely different fields — nutrition instead of specifications, plus an array certifications. Same collection, totally different structure. Try this: Insert a book with author, ISBN, and page count.
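
As a sketch of the "Try this" above, the snippet below adds a book document and checks which fields all three documents share. It uses plain dictionaries so it runs without a MongoDB server; the book's title, author, ISBN, and page count are placeholder values, not real catalog data.

```python
# Three documents that could share one collection; the book is a made-up example.
smartphone = {
    "product_id": "PROD001", "name": "iPhone 15 Pro", "category": "Electronics",
    "price": 129900, "specifications": {"storage": "256GB"},
}
grocery_item = {
    "product_id": "PROD002", "name": "Organic Basmati Rice", "category": "Food",
    "price": 299, "nutrition": {"calories_per_100g": 130},
    "certifications": ["Organic", "Non-GMO"],
}
book = {
    "product_id": "PROD003", "name": "Sample Paperback", "category": "Books",
    "price": 399, "author": "A. Writer", "isbn": "000-0-00-000000-0", "pages": 320,
}

catalog = [smartphone, grocery_item, book]

# Fields every document has: safe to rely on in queries and reports.
shared_fields = set.intersection(*(set(doc) for doc in catalog))

# Fields only some documents have: readers must handle their absence.
optional_fields = set.union(*(set(doc) for doc in catalog)) - shared_fields

print(sorted(shared_fields))
print(sorted(optional_fields))
```

With a running server, products.insert_one(book) would store the book alongside the other two. A flexible schema still benefits from a small core of fields that every document carries.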

Querying Documents

# Find all products
all_products = products.find()
for product in all_products:
    print(f"Product: {product['name']}")
    
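# Find specific category
electronics = products.find({"category": "Electronics"})
# Cursor.count() was removed in PyMongo 4; count on the collection instead
electronics_count = products.count_documents({"category": "Electronics"})
print(f"\nElectronics found: {electronics_count}")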

# Query nested fields using dot notation
high_storage = products.find({"specifications.storage": "256GB"})
for item in high_storage:
    print(f"High storage device: {item['name']}")

Product: iPhone 15 Pro
Product: Organic Basmati Rice

Electronics found: 1
High storage device: iPhone 15 Pro

What just happened?

The dot notation specifications.storage queries nested objects. find() returns a cursor, not the actual data — you iterate through it. The grocery item was skipped in the storage query because it doesn't have a specifications field. Try this: Query array elements with certifications: "Organic".
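
The matching rules described above can be approximated in a few lines of plain Python. This is a simplified illustration, not MongoDB's implementation: get_path and matches are hypothetical helpers, only equality filters are handled, and MongoDB's special handling of null is ignored.

```python
def get_path(doc, path):
    """Follow a dotted path through nested dicts; return (found, value)."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False, None
        current = current[part]
    return True, current


def matches(doc, query):
    """Equality-only matching: every dotted path in `query` must match."""
    for path, expected in query.items():
        found, value = get_path(doc, path)
        if not found:
            return False  # a missing field never equals a concrete value
        # An equality filter on an array field matches if any element is equal.
        if isinstance(value, list) and not isinstance(expected, list):
            if expected not in value:
                return False
        elif value != expected:
            return False
    return True


smartphone = {"name": "iPhone 15 Pro", "specifications": {"storage": "256GB"}}
grocery_item = {"name": "Organic Basmati Rice", "certifications": ["Organic", "Non-GMO"]}
docs = [smartphone, grocery_item]

high_storage = [d["name"] for d in docs if matches(d, {"specifications.storage": "256GB"})]
organic = [d["name"] for d in docs if matches(d, {"certifications": "Organic"})]
print(high_storage)
print(organic)
```

The second query mirrors the "Try this": an equality filter on certifications matches any document whose array contains "Organic".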

MongoDB dominates with 58% market share, followed by Redis for caching and real-time applications

Document databases lead because they match how developers think. JSON objects are everywhere: APIs, frontend state, configuration files. Why transform data between different formats when you can store it natively?

Key-value stores like Redis shine for specific use cases such as session storage, caching, and real-time leaderboards. Simple but very fast. You wouldn't build a complex application on Redis alone, but it's perfect as a supporting actor.

Redis for Speed

Redis keeps its working dataset in memory. That gives sub-millisecond response times, but the dataset size is limited by available RAM. Perfect for caching frequently accessed data, session management, and real-time analytics.
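
Caching frequently accessed data usually follows a "check the cache, fall back to the database, then store the result" pattern. Here is a minimal sketch of that idea, assuming a plain dictionary as a stand-in for Redis and a hypothetical load_menu_from_database function as the slow source; the restaurant ID and menu items are invented.

```python
import time

cache = {}                  # stand-in for Redis: key -> (value, expiry timestamp)
stats = {"db_reads": 0}     # counts how often the slow store is hit


def load_menu_from_database(restaurant_id):
    """Hypothetical slow lookup; a real one would query the primary database."""
    stats["db_reads"] += 1
    return {"restaurant_id": restaurant_id, "items": ["Paneer Tikka", "Dal Makhani"]}


def get_menu(restaurant_id, ttl_seconds=60, now=time.monotonic):
    """Return the menu from the cache if still fresh, otherwise reload and cache it."""
    key = f"menu:{restaurant_id}"
    entry = cache.get(key)
    if entry is not None and entry[1] > now():
        return entry[0]                                # cache hit
    value = load_menu_from_database(restaurant_id)     # cache miss
    cache[key] = (value, now() + ttl_seconds)
    return value


get_menu("R42")   # miss: reads the database
get_menu("R42")   # hit: served from the cache
print(stats["db_reads"])
```

The time-to-live keeps stale entries from living forever, which is the same job Redis key expiry does.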

The scenario: Zomato's recommendation engine needs to track user preferences in real-time. Every click, every search, every order updates the preference score. Traditional disk-backed databases struggle to absorb 100,000 small updates per second.

# Import redis (install first with: pip install redis)
import redis

# Connect to Redis (default localhost:6379)
r = redis.Redis(host='localhost', port=6379, db=0)

# Test connection: ping() returns True when the server responds
print(r.ping())
print("Connected to Redis!")

True
Connected to Redis!

# Store user preference scores
r.set("user:12345:cuisine:italian", 8.5)
r.set("user:12345:cuisine:chinese", 7.2)
r.set("user:12345:cuisine:indian", 9.1)

# Retrieve preference
italian_score = r.get("user:12345:cuisine:italian")
print(f"Italian cuisine score: {float(italian_score)}")

# Increment score atomically (thread-safe)
r.incrbyfloat("user:12345:cuisine:italian", 0.3)
new_score = r.get("user:12345:cuisine:italian")
print(f"Updated Italian score: {float(new_score)}")

Italian cuisine score: 8.5
Updated Italian score: 8.8

What just happened?

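Values written with set are stored as Redis strings, and redis-py returns them as bytes, so we convert to float for math. The key structure user:12345:cuisine:italian is a naming convention that acts like a namespace. incrbyfloat is atomic because Redis executes each command in full before starting the next, so concurrent clients cannot lose updates. Try this: Use expire to auto-delete keys after 24 hours.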
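
To see why an atomic increment matters, here is a small in-memory sketch. TinyStore is a hypothetical stand-in written for this example; it is not Redis and not the redis-py API. Its lock plays the role of Redis running one command at a time.

```python
import threading


class TinyStore:
    """In-memory stand-in for a key-value server; NOT Redis and not its API."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = str(value)   # values are kept as strings

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def incrbyfloat(self, key, amount):
        # The whole read-modify-write runs under one lock, so no update is lost.
        with self._lock:
            new_value = float(self._data.get(key, "0")) + amount
            self._data[key] = str(new_value)
            return new_value


store = TinyStore()
key = "user:12345:cuisine:italian"
store.set(key, 0)


def record_clicks(count):
    for _ in range(count):
        store.incrbyfloat(key, 0.5)


# Eight threads each record 1000 clicks worth 0.5 points.
threads = [threading.Thread(target=record_clicks, args=(1000,)) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(store.get(key))
```

If each thread instead did a separate get followed by a set, two threads could read the same old score and one update would overwrite the other.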

📊 Data Insight

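In-memory stores like Redis can serve hundreds of thousands of simple operations per second on a single node, well beyond what a disk-backed document store typically sustains for writes. Exact figures depend heavily on hardware, pipelining, and durability settings, so benchmark your own workload. That gap is what makes Redis a common choice for real-time features like live chat, gaming leaderboards, and recommendation engines.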

SQL vs NoSQL Trade-offs

Aspect	SQL	NoSQL
Schema	Rigid, predefined	Flexible, evolving
Scaling	Vertical (bigger servers)	Horizontal (more servers)
Consistency	ACID guaranteed	Often eventual (tunable in many systems)
Query Language	Standardized SQL	Database-specific
Best For	Complex relationships	Rapid development, scale

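The CAP theorem explains the fundamental trade-off. A distributed system cannot guarantee Consistency, Availability, and Partition tolerance all at once; since network partitions cannot be ruled out, the practical choice during a partition is between consistency and availability. Traditional SQL deployments favor consistency. Many NoSQL systems, such as Cassandra, favor availability, while others, such as MongoDB and HBase, lean toward consistency.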
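
The availability side of this trade-off can be illustrated with a toy two-replica cluster. This is a deliberately simplified sketch: AvailableCluster and Replica are hypothetical classes, and real systems replicate over a network with far more machinery.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class AvailableCluster:
    """Two-replica toy: writes are acknowledged by one replica and copied later."""

    def __init__(self):
        self.primary = Replica("a")
        self.secondary = Replica("b")
        self.pending = []   # writes not yet copied to the secondary

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))   # replication is deferred

    def read(self, key, from_secondary=False):
        replica = self.secondary if from_secondary else self.primary
        return replica.data.get(key)

    def replicate(self):
        """Copy all pending writes to the secondary (the 'eventual' part)."""
        for key, value in self.pending:
            self.secondary.data[key] = value
        self.pending.clear()


cluster = AvailableCluster()
cluster.write("profile:42:photo", "new.jpg")

stale = cluster.read("profile:42:photo", from_secondary=True)   # not replicated yet
cluster.replicate()
fresh = cluster.read("profile:42:photo", from_secondary=True)   # now up to date

print(stale, fresh)
```

Both replicas answered every read, so the system stayed available, but one answer was briefly out of date. A consistency-first design would instead make the write wait, or refuse the stale read.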

Common Mistake

Thinking NoSQL means "no relationships." Many NoSQL databases support references between records, and some offer join-like operations (MongoDB's $lookup aggregation stage, for example), but referential integrity is usually not enforced at the database level. The fix: design your data model to minimize relationships, but don't eliminate them entirely.
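
A reference that the database does not enforce has to be resolved, and checked, by the application. The sketch below shows that idea with plain dictionaries; the brand_id field, the identifiers, and the with_brand helper are assumptions made for this example.

```python
# Two hypothetical collections, modeled as plain Python data.
brands = {"B1": {"_id": "B1", "name": "Apple"}}

products = [
    {"_id": "P1", "name": "iPhone 15 Pro", "brand_id": "B1"},
    {"_id": "P2", "name": "Unbranded Cable", "brand_id": "B9"},   # dangling reference
]


def with_brand(product, brands_by_id):
    """Resolve the brand reference in application code; tolerate missing targets."""
    brand = brands_by_id.get(product["brand_id"])
    return {**product, "brand": brand["name"] if brand else None}


joined = [with_brand(product, brands) for product in products]
for row in joined:
    print(row["name"], "->", row["brand"])
```

Nothing stopped P2 from pointing at a brand that does not exist, which is exactly what "not enforced at the database level" means in practice.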

SQL excels at consistency and complex queries, while NoSQL dominates performance and scalability

The radar chart reveals why both technologies coexist. SQL databases shine for financial systems, inventory management, anything requiring perfect consistency. Banking transactions must never go missing or duplicate.

NoSQL databases excel at user-facing features such as social media feeds, product catalogs, and real-time messaging. Instagram can survive showing you an old photo, but can't survive being slow. The performance and scalability advantages outweigh occasional inconsistencies.

Choosing the Right Database

Choose SQL When

Financial transactions
Complex reporting
Established data structure
Team knows SQL well

Choose NoSQL When

Rapid prototyping
Massive scale required
Varying data structures
Real-time performance

NoSQL delivers 5x faster response times and handles 10x more concurrent users than traditional SQL

The gap in the chart is dramatic. A response time of 8ms versus 45ms might seem small, but multiply it by millions of requests. Those milliseconds translate to user engagement and revenue.

Development speed tells the real story. NoSQL lets you iterate faster: add new fields, change data structures, often deploy without a schema migration. Schema changes in SQL need more planning, migrations, and sometimes downtime. Both approaches work, but for different organizational rhythms.
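
The "multiply by millions of requests" point is easy to check with arithmetic. The two latencies are the ones quoted above; the request volume of one million is an assumption chosen only to make the numbers concrete.

```python
nosql_ms = 8             # per-request latency quoted above
sql_ms = 45              # per-request latency quoted above
requests = 1_000_000     # assumed request volume, for illustration only

# Total waiting time saved, summed over every request.
saved_ms = (sql_ms - nosql_ms) * requests
saved_seconds = saved_ms / 1000
saved_hours = saved_seconds / 3600

print(f"Saved per request: {sql_ms - nosql_ms} ms")
print(f"Saved in total: {saved_seconds:.0f} s ({saved_hours:.1f} hours)")
```

This is cumulative waiting time across all users, not wall-clock time, since requests are served in parallel.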

Up Next

Data Modeling

Learn how to design efficient database schemas and relationships that scale from startup to enterprise, building on the SQL and NoSQL foundations you've mastered.
