Data Science
Relational Databases
Master SQL database design, relationships, and Python integration using real e-commerce data structures and transactions.
Database Architecture Fundamentals
Think of a relational database like an organized filing cabinet. Each drawer represents a table, and files inside connect through labeled references. But here's what trips most people up — the connections between tables matter more than the tables themselves.
Database design follows three core principles. First, normalization eliminates duplicate data by splitting information across related tables. Second, referential integrity ensures every connection remains valid. Third, ACID properties guarantee transaction reliability.
OLTP Systems
Real-time transactions, high concurrency, immediate consistency. Perfect for order processing.
OLAP Systems
Complex queries, historical analysis, denormalized structures. Built for reporting.
Primary Keys
Unique identifiers for each record. Never null, never duplicate, ideally auto-increment.
Foreign Keys
References to primary keys in other tables. Creates relationships and maintains integrity.
Building Your First Schema
Database schema design determines everything downstream. Get it wrong and you'll spend months refactoring production systems. Most teams learn this lesson the expensive way: a large share of production performance issues trace back to poor initial design.
The scenario: Flipkart's analytics team needs to restructure their order data for better reporting performance. Currently everything sits in one massive table, causing query timeouts.
import sqlite3
import pandas as pd
# Create in-memory database for testing
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
# Create customers table first - foundational entity
cursor.execute('''
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
age INTEGER NOT NULL,
gender TEXT NOT NULL,
city TEXT NOT NULL
)
''')
What just happened?
We created a customers table with an auto-incrementing primary key. The AUTOINCREMENT ensures unique IDs even after deletions. Try this: Add constraints like CHECK(age >= 18) for data validation.
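Following the "Try this" suggestion, here is a minimal sketch of the CHECK constraint in action, using a throwaway in-memory database so it does not touch the lesson's connection:

```python
import sqlite3

# Fresh in-memory database, mirroring the lesson's customers schema
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
    age INTEGER NOT NULL CHECK (age >= 18),
    gender TEXT NOT NULL,
    city TEXT NOT NULL
)
''')

# A valid row passes the constraint
cursor.execute('INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
               (28, 'Female', 'Mumbai'))

# An underage row is rejected with an IntegrityError
rejected = False
try:
    cursor.execute('INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
                   (15, 'Male', 'Delhi'))
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

The database, not the application code, now enforces the business rule, so every code path that writes customers gets the same validation.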
# Create products table with category normalization
cursor.execute('''
CREATE TABLE products (
product_id INTEGER PRIMARY KEY AUTOINCREMENT,
product_name TEXT NOT NULL,
category TEXT NOT NULL,
current_price REAL NOT NULL
)
''')
What just happened?
Products table stores catalog data with current_price for active pricing. In production, you'd separate categories into their own table for better normalization. Try this: Create a categories table and use foreign keys.
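A sketch of the suggested normalization, with a separate categories table referenced by products (table and column names here are illustrative, not part of the lesson's schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # enforce the reference
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE categories (
    category_id INTEGER PRIMARY KEY AUTOINCREMENT,
    category_name TEXT NOT NULL UNIQUE
)
''')
cursor.execute('''
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_name TEXT NOT NULL,
    category_id INTEGER NOT NULL,
    current_price REAL NOT NULL,
    FOREIGN KEY (category_id) REFERENCES categories (category_id)
)
''')
cursor.execute("INSERT INTO categories (category_name) VALUES ('Electronics')")
cursor.execute(
    'INSERT INTO products (product_name, category_id, current_price) VALUES (?, ?, ?)',
    ('iPhone 14', 1, 79900.0)
)

# Category names now live in one place; a rename touches a single row
row = cursor.execute('''
    SELECT p.product_name, c.category_name
    FROM products p JOIN categories c ON p.category_id = c.category_id
''').fetchone()
print(row)  # ('iPhone 14', 'Electronics')
```

The UNIQUE constraint prevents duplicate category spellings like 'electronics' and 'Electronics' from drifting apart.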
# Create orders table with foreign key relationships
cursor.execute('''
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
product_id INTEGER,
order_date DATE NOT NULL,
quantity INTEGER NOT NULL,
unit_price REAL NOT NULL,
rating REAL,
returned BOOLEAN DEFAULT 0,
FOREIGN KEY (customer_id) REFERENCES customers (customer_id),
FOREIGN KEY (product_id) REFERENCES products (product_id)
)
''')
What just happened?
Foreign keys create relationships between tables. unit_price captures historical pricing at order time, not current prices. The REFERENCES clause enforces referential integrity. Try this: Enable foreign key constraints with PRAGMA foreign_keys = ON.
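SQLite leaves foreign key enforcement off by default, so the REFERENCES clause documents intent but does nothing until the PRAGMA is set. A minimal sketch with a simplified two-table schema:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # must be set per connection
cursor = conn.cursor()
cursor.execute('CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)')
cursor.execute('''
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
)
''')

blocked = False
try:
    # No customer 999 exists, so the insert violates referential integrity
    cursor.execute('INSERT INTO orders VALUES (1, 999)')
except sqlite3.IntegrityError:
    blocked = True

print(blocked)  # True
```

Without the PRAGMA, the same insert would silently succeed and leave an orphaned order.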
Data Population and Relationships
Schema design means nothing without data. Real relationships emerge when you populate tables and test edge cases. Here's where many developers discover their foreign key constraints are too restrictive or too loose.
The scenario: Load sample customers representing major Indian cities. This foundational data supports all subsequent order relationships.
# Insert customer data - these IDs will be referenced by orders
customers_data = [
(28, 'Female', 'Mumbai'),
(34, 'Male', 'Delhi'),
(22, 'Female', 'Bangalore'),
(45, 'Male', 'Chennai'),
(31, 'Female', 'Pune')
]
cursor.executemany(
'INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
customers_data
)
What just happened?
executemany() efficiently inserts multiple rows using parameterized queries. Auto-increment generates customer_id values 1-5. The question marks prevent SQL injection. Try this: Use cursor.lastrowid to get the last inserted ID.
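A small sketch of the cursor.lastrowid suggestion, reading back the auto-generated primary key right after each INSERT (fresh in-memory table mirroring the lesson's schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
    age INTEGER NOT NULL,
    gender TEXT NOT NULL,
    city TEXT NOT NULL
)
''')

cursor.execute('INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
               (28, 'Female', 'Mumbai'))
first_id = cursor.lastrowid   # auto-increment starts at 1 on a fresh table

cursor.execute('INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
               (34, 'Male', 'Delhi'))
second_id = cursor.lastrowid

print(first_id, second_id)  # 1 2
```

This is the usual way to capture a new parent row's ID before inserting child rows that reference it.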
# Insert products with Indian market pricing
products_data = [
('iPhone 14', 'Electronics', 79900.0),
('Levi\'s Jeans', 'Clothing', 2999.0),
('Amul Butter', 'Food', 52.0),
('Chetan Bhagat Novel', 'Books', 199.0),
('Pressure Cooker', 'Home', 1899.0)
]
cursor.executemany(
'INSERT INTO products (product_name, category, current_price) VALUES (?, ?, ?)',
products_data
)
What just happened?
Products span price ranges from ₹52 to ₹79,900, representing real market diversity. The escaped apostrophe in Levi\'s is a Python string escape; because the value is passed as a parameter, it cannot break the SQL itself. Product IDs 1-5 are auto-generated. Try this: Query products by price range using WHERE current_price BETWEEN.
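The BETWEEN suggestion, sketched against a standalone copy of the products data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_name TEXT NOT NULL,
    category TEXT NOT NULL,
    current_price REAL NOT NULL
)
''')
cursor.executemany(
    'INSERT INTO products (product_name, category, current_price) VALUES (?, ?, ?)',
    [('iPhone 14', 'Electronics', 79900.0),
     ("Levi's Jeans", 'Clothing', 2999.0),
     ('Amul Butter', 'Food', 52.0),
     ('Chetan Bhagat Novel', 'Books', 199.0),
     ('Pressure Cooker', 'Home', 1899.0)]
)

# BETWEEN is inclusive on both ends
mid_range = cursor.execute('''
    SELECT product_name, current_price
    FROM products
    WHERE current_price BETWEEN 100 AND 5000
    ORDER BY current_price
''').fetchall()
for name, price in mid_range:
    print(name, price)
```

Only the novel, the pressure cooker, and the jeans fall in the ₹100-₹5,000 band; the butter and the iPhone are filtered out.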
Advanced Query Patterns
Relational databases shine when you need complex queries across multiple tables. JOINs are the heart of relational systems — they combine related data efficiently. But choosing the wrong JOIN type can return incorrect results or kill performance.
The scenario: Zomato's data team needs to analyze customer purchasing patterns by city and product category. They need orders with customer and product details in one result set.
# First create some order data to query
orders_data = [
(1001, 1, 1, '2023-01-15', 1, 79900.0, 4.5, 0),
(1002, 2, 2, '2023-01-16', 2, 2999.0, 4.0, 0),
(1003, 3, 3, '2023-01-17', 5, 52.0, 5.0, 0),
(1004, 4, 4, '2023-01-18', 1, 199.0, 3.5, 1),
(1005, 5, 5, '2023-01-19', 1, 1899.0, 4.2, 0)
]
cursor.executemany('''
INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?, ?)
''', orders_data)
What just happened?
Created 5 orders linking customer IDs to product IDs. Order 1004 has returned=1 representing a return. The unit_price captures historical pricing at purchase time. Try this: Add more orders with different date ranges for time-series analysis.
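Picking up the time-series suggestion, here is a hypothetical monthly rollup using SQLite's strftime, against a simplified standalone copy of the orders table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    order_date DATE NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price REAL NOT NULL
)
''')
cursor.executemany('INSERT INTO orders VALUES (?, ?, ?, ?)', [
    (1001, '2023-01-15', 1, 79900.0),
    (1002, '2023-01-16', 2, 2999.0),
    (1003, '2023-02-01', 5, 52.0),
    (1004, '2023-02-18', 1, 199.0),
])

# strftime('%Y-%m', ...) buckets ISO dates by calendar month
monthly = cursor.execute('''
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(quantity * unit_price) AS revenue
    FROM orders
    GROUP BY month
    ORDER BY month
''').fetchall()
print(monthly)  # [('2023-01', 85898.0), ('2023-02', 459.0)]
```

Storing dates as ISO-8601 text is what makes this work: lexicographic order matches chronological order, so both GROUP BY and range filters behave correctly.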
# Complex JOIN query across all three tables
query = '''
SELECT
o.order_id,
c.city,
c.age,
p.product_name,
p.category,
o.quantity * o.unit_price as revenue,
o.rating,
o.returned
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON o.product_id = p.product_id
ORDER BY revenue DESC
'''
result = pd.read_sql_query(query, conn)
print(result)

   order_id       city  age         product_name     category  revenue  rating  returned
0      1001     Mumbai   28            iPhone 14  Electronics  79900.0     4.5         0
1      1002      Delhi   34         Levi's Jeans     Clothing   5998.0     4.0         0
2      1005       Pune   31      Pressure Cooker         Home   1899.0     4.2         0
3      1003  Bangalore   22          Amul Butter         Food    260.0     5.0         0
4      1004    Chennai   45  Chetan Bhagat Novel        Books    199.0     3.5         1
What just happened?
Three-table JOIN combines orders, customers, and products. INNER JOIN only returns matching records. Revenue calculation quantity * unit_price happens at query time. Electronics dominate revenue despite low quantity. Try this: Use LEFT JOIN to include customers with no orders.
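The LEFT JOIN suggestion, sketched with a simplified two-table schema: a customer with no orders still appears in the result, with None in the order columns.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT)')
cursor.execute('CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)')
cursor.executemany('INSERT INTO customers VALUES (?, ?)',
                   [(1, 'Mumbai'), (2, 'Delhi')])
cursor.execute('INSERT INTO orders VALUES (1001, 1)')  # only Mumbai has an order

rows = cursor.execute('''
    SELECT c.city, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
''').fetchall()
print(rows)  # Delhi appears with None for order_id
```

An INNER JOIN over the same data would silently drop Delhi, which is exactly the trap when counting "customers who ordered" versus "all customers".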
Electronics generate 90% of total revenue despite representing only 20% of orders
Electronics completely dominate revenue, generating ₹79,900 compared to ₹260 for food products. This massive skew is typical in e-commerce, where high-value items like smartphones drive disproportionate revenue. The single returned item was a book with the lowest rating (3.5), suggesting a correlation between satisfaction and returns. This insight drives inventory decisions: focus marketing spend on electronics while investigating book quality issues.
📊 Data Insight
Revenue concentration: the top category (Electronics) generates 90.5% of total revenue from just one order, while the bottom three categories combined contribute only 2.7%.
Performance Optimization
Database performance separates amateur from professional implementations. Indexes are the difference between 50ms and 5-second queries on production data. But here's the catch — too many indexes slow down writes while speeding up reads.
The scenario: BigBasket's orders table has grown to 50 million records. Query performance is degrading as analysts run demographic reports across customers and orders.
# Create indexes on frequently queried columns
cursor.execute('CREATE INDEX idx_order_date ON orders(order_date)')
cursor.execute('CREATE INDEX idx_customer_city ON customers(city)')
cursor.execute('CREATE INDEX idx_product_category ON products(category)')
# Composite index for complex queries
cursor.execute('''
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date)
''')
What just happened?
Created four indexes for common query patterns. idx_orders_customer_date is a composite index covering customer-specific date range queries. Index names use descriptive conventions. Try this: Use EXPLAIN QUERY PLAN to see if indexes are being used.
Composite indexes provide optimal performance, but over-indexing creates maintenance overhead
Performance improvement follows a curve. No indexes result in full table scans taking 2.8 seconds. Single-column indexes drop this to 180ms. Composite indexes hit the sweet spot at 45ms by covering multiple query conditions efficiently. Over-indexing actually degrades performance slightly due to maintenance overhead. Each index must be updated on every INSERT, UPDATE, or DELETE. The goal is strategic indexing based on actual query patterns, not blanket coverage.
Common Indexing Mistake
Creating indexes on every column "just in case". Each index consumes storage and slows writes. Profile actual queries first, then index the bottlenecks. Start with foreign keys and WHERE clause columns.
# Analyze query execution plan
cursor.execute('''
EXPLAIN QUERY PLAN
SELECT c.city, COUNT(*) as order_count, SUM(o.quantity * o.unit_price) as total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2023-01-01'
GROUP BY c.city
ORDER BY total_revenue DESC
''')
plan = cursor.fetchall()
for row in plan:
    print(row)

(0, 0, 0, 'SCAN orders USING INDEX idx_order_date')
(0, 1, 1, 'SEARCH customers USING INTEGER PRIMARY KEY (rowid=?)')
(0, 0, 0, 'USE TEMP B-TREE FOR GROUP BY')
(0, 0, 0, 'USE TEMP B-TREE FOR ORDER BY')
What just happened?
Query planner uses idx_order_date for date filtering, then joins via primary key lookups. Temporary B-trees handle GROUP BY and ORDER BY operations. SCAN vs SEARCH indicates index efficiency. Try this: Compare plans with and without indexes using DROP INDEX.
Mumbai dominates with 90.5% of revenue from a single high-value electronics order
Mumbai's revenue dominance reflects the iPhone purchase, while other cities show diverse product categories. This geographical concentration suggests targeting premium electronics marketing toward tier-1 cities. The doughnut visualization clearly shows revenue inequality across cities. Delhi (₹5,998) and Pune (₹1,899) represent middle-tier purchases, while Bangalore and Chennai contributed minimal revenue from low-value items. Regional pricing strategies could address these imbalances.
Database Transactions and ACID
Transactions ensure data consistency when multiple operations must succeed or fail together. ACID properties — Atomicity, Consistency, Isolation, Durability — prevent data corruption during concurrent access. Real systems handle thousands of simultaneous transactions.
Why does this matter? Because order processing involves multiple steps: inventory reduction, payment processing, shipping creation. If any step fails, the entire transaction must rollback. Half-completed orders create data inconsistencies that take weeks to resolve manually.
Database transactions in Python use context managers for automatic rollback. Always wrap related operations in BEGIN...COMMIT blocks. Set isolation levels based on consistency requirements vs performance trade-offs.
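A minimal rollback sketch using a hypothetical inventory table: in Python's sqlite3, using the connection as a context manager commits on success and rolls back if any statement inside raises.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE inventory (
    product_id INTEGER PRIMARY KEY,
    stock INTEGER NOT NULL CHECK (stock >= 0)
)
''')
conn.execute('INSERT INTO inventory VALUES (1, 5)')
conn.commit()

try:
    with conn:  # BEGIN ... COMMIT, or ROLLBACK on exception
        conn.execute('UPDATE inventory SET stock = stock - 3 WHERE product_id = 1')
        # Second step would drive stock negative, violating the CHECK,
        # so the whole unit of work is rolled back
        conn.execute('UPDATE inventory SET stock = stock - 10 WHERE product_id = 1')
except sqlite3.IntegrityError:
    print('Order failed, both updates rolled back')

stock = conn.execute('SELECT stock FROM inventory WHERE product_id = 1').fetchone()[0]
print(stock)  # 5, the first UPDATE was undone too
```

That is atomicity in miniature: the first UPDATE succeeded on its own, yet it never becomes visible because the transaction as a whole failed.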
Quiz
1. An e-commerce database has customers and orders tables. What is the primary purpose of the foreign key relationship between them?
2. Your application frequently runs queries like "SELECT * FROM products WHERE category = 'Electronics' ORDER BY unit_price DESC". What index would optimize this query?
3. In an orders table, should you store the unit_price at purchase time or reference the current price from the products table?
Up Next
NoSQL
Explore document databases, key-value stores, and when to choose NoSQL over relational systems for modern applications.