Data Science
Relational Databases
Master SQL database design, relationships, and Python integration using real e-commerce data structures and transactions.
Database Architecture Fundamentals
Think of a relational database like an organized filing cabinet. Each drawer represents a table, and files inside connect through labeled references. But here's what trips most people up — the connections between tables matter more than the tables themselves.
Database design follows three core principles. First, normalization eliminates duplicate data by splitting information across related tables. Second, referential integrity ensures every connection remains valid. Third, ACID properties guarantee transaction reliability.
OLTP Systems
Real-time transactions, high concurrency, immediate consistency. Perfect for order processing.
OLAP Systems
Complex queries, historical analysis, denormalized structures. Built for reporting.
Primary Keys
Unique identifiers for each record. Never null, never duplicate, ideally auto-increment.
Foreign Keys
References to primary keys in other tables. Creates relationships and maintains integrity.
Building Your First Schema
Database schema design determines everything downstream. Get it wrong and you'll spend months refactoring production systems. Most teams learn this lesson the expensive way: a large share of production performance issues trace back to poor initial design.
The scenario: Flipkart's analytics team needs to restructure their order data for better reporting performance. Currently everything sits in one massive table, causing query timeouts.
import sqlite3
import pandas as pd
# Create in-memory database for testing
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
# Create customers table first - foundational entity
cursor.execute('''
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
age INTEGER NOT NULL,
gender TEXT NOT NULL,
city TEXT NOT NULL
)
''')
What just happened?
We created a customers table with an auto-incrementing primary key. The AUTOINCREMENT ensures unique IDs even after deletions. Try this: Add constraints like CHECK(age >= 18) for data validation.
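Following the "Try this" suggestion, here is a minimal sketch of the CHECK constraint in action, using a throwaway in-memory database so it does not touch the lesson's connection:

```python
import sqlite3

# Fresh in-memory database, mirroring the lesson's customers schema
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
    age INTEGER NOT NULL CHECK (age >= 18),
    gender TEXT NOT NULL,
    city TEXT NOT NULL
)
''')

# A valid row passes the constraint
cursor.execute('INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
               (28, 'Female', 'Mumbai'))

# An underage row is rejected with an IntegrityError
rejected = False
try:
    cursor.execute('INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
                   (15, 'Male', 'Delhi'))
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

The database, not the application code, now enforces the business rule, so every code path that writes customers gets the same validation.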
# Create products table with category normalization
cursor.execute('''
CREATE TABLE products (
product_id INTEGER PRIMARY KEY AUTOINCREMENT,
product_name TEXT NOT NULL,
category TEXT NOT NULL,
current_price REAL NOT NULL
)
''')
What just happened?
Products table stores catalog data with current_price for active pricing. In production, you'd separate categories into their own table for better normalization. Try this: Create a categories table and use foreign keys.
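A sketch of the suggested normalization, with a separate categories table referenced by products (table and column names here are illustrative, not part of the lesson's schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # enforce the reference
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE categories (
    category_id INTEGER PRIMARY KEY AUTOINCREMENT,
    category_name TEXT NOT NULL UNIQUE
)
''')
cursor.execute('''
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_name TEXT NOT NULL,
    category_id INTEGER NOT NULL,
    current_price REAL NOT NULL,
    FOREIGN KEY (category_id) REFERENCES categories (category_id)
)
''')
cursor.execute("INSERT INTO categories (category_name) VALUES ('Electronics')")
cursor.execute(
    'INSERT INTO products (product_name, category_id, current_price) VALUES (?, ?, ?)',
    ('iPhone 14', 1, 79900.0)
)

# Category names now live in one place; a rename touches a single row
row = cursor.execute('''
    SELECT p.product_name, c.category_name
    FROM products p JOIN categories c ON p.category_id = c.category_id
''').fetchone()
print(row)  # ('iPhone 14', 'Electronics')
```

The UNIQUE constraint prevents duplicate category spellings like 'electronics' and 'Electronics' from drifting apart.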
# Create orders table with foreign key relationships
cursor.execute('''
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
product_id INTEGER,
order_date DATE NOT NULL,
quantity INTEGER NOT NULL,
unit_price REAL NOT NULL,
rating REAL,
returned BOOLEAN DEFAULT 0,
FOREIGN KEY (customer_id) REFERENCES customers (customer_id),
FOREIGN KEY (product_id) REFERENCES products (product_id)
)
''')
What just happened?
Foreign keys create relationships between tables. unit_price captures historical pricing at order time, not current prices. The REFERENCES clause enforces referential integrity. Try this: Enable foreign key constraints with PRAGMA foreign_keys = ON.
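SQLite leaves foreign key enforcement off by default, so the REFERENCES clause documents intent but does nothing until the PRAGMA is set. A minimal sketch with a simplified two-table schema:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # must be set per connection
cursor = conn.cursor()
cursor.execute('CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)')
cursor.execute('''
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
)
''')

blocked = False
try:
    # No customer 999 exists, so the insert violates referential integrity
    cursor.execute('INSERT INTO orders VALUES (1, 999)')
except sqlite3.IntegrityError:
    blocked = True

print(blocked)  # True
```

Without the PRAGMA, the same insert would silently succeed and leave an orphaned order.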
Data Population and Relationships
Schema design means nothing without data. Real relationships emerge when you populate tables and test edge cases. Here's where many developers discover their foreign key constraints are too restrictive or too loose.
The scenario: Load sample customers representing major Indian cities. This foundational data supports all subsequent order relationships.
# Insert customer data - these IDs will be referenced by orders
customers_data = [
(28, 'Female', 'Mumbai'),
(34, 'Male', 'Delhi'),
(22, 'Female', 'Bangalore'),
(45, 'Male', 'Chennai'),
(31, 'Female', 'Pune')
]
cursor.executemany(
'INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
customers_data
)
What just happened?
executemany() efficiently inserts multiple rows using parameterized queries. Auto-increment generates customer_id values 1-5. The question marks prevent SQL injection. Try this: Use cursor.lastrowid to get the last inserted ID.
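A small sketch of the cursor.lastrowid suggestion, reading back the auto-generated primary key right after each INSERT (fresh in-memory table mirroring the lesson's schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
    age INTEGER NOT NULL,
    gender TEXT NOT NULL,
    city TEXT NOT NULL
)
''')

cursor.execute('INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
               (28, 'Female', 'Mumbai'))
first_id = cursor.lastrowid   # auto-increment starts at 1 on a fresh table

cursor.execute('INSERT INTO customers (age, gender, city) VALUES (?, ?, ?)',
               (34, 'Male', 'Delhi'))
second_id = cursor.lastrowid

print(first_id, second_id)  # 1 2
```

This is the usual way to capture a new parent row's ID before inserting child rows that reference it.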
# Insert products with Indian market pricing
products_data = [
('iPhone 14', 'Electronics', 79900.0),
('Levi\'s Jeans', 'Clothing', 2999.0),
('Amul Butter', 'Food', 52.0),
('Chetan Bhagat Novel', 'Books', 199.0),
('Pressure Cooker', 'Home', 1899.0)
]
cursor.executemany(
'INSERT INTO products (product_name, category, current_price) VALUES (?, ?, ?)',
products_data
)
What just happened?
Products span price ranges from ₹52 to ₹79,900, representing real market diversity. The escaped apostrophe in Levi\'s is a Python string escape; because the value is passed as a parameter, it cannot break the SQL itself. Product IDs 1-5 are auto-generated. Try this: Query products by price range using WHERE current_price BETWEEN.
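The BETWEEN suggestion, sketched against a standalone copy of the products data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_name TEXT NOT NULL,
    category TEXT NOT NULL,
    current_price REAL NOT NULL
)
''')
cursor.executemany(
    'INSERT INTO products (product_name, category, current_price) VALUES (?, ?, ?)',
    [('iPhone 14', 'Electronics', 79900.0),
     ("Levi's Jeans", 'Clothing', 2999.0),
     ('Amul Butter', 'Food', 52.0),
     ('Chetan Bhagat Novel', 'Books', 199.0),
     ('Pressure Cooker', 'Home', 1899.0)]
)

# BETWEEN is inclusive on both ends
mid_range = cursor.execute('''
    SELECT product_name, current_price
    FROM products
    WHERE current_price BETWEEN 100 AND 5000
    ORDER BY current_price
''').fetchall()
for name, price in mid_range:
    print(name, price)
```

Only the novel, the pressure cooker, and the jeans fall in the ₹100-₹5,000 band; the butter and the iPhone are filtered out.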
Advanced Query Patterns
Relational databases shine when you need complex queries across multiple tables. JOINs are the heart of relational systems — they combine related data efficiently. But choosing the wrong JOIN type can return incorrect results or kill performance.
The scenario: Zomato's data team needs to analyze customer purchasing patterns by city and product category. They need orders with customer and product details in one result set.
# First create some order data to query
orders_data = [
(1001, 1, 1, '2023-01-15', 1, 79900.0, 4.5, 0),
(1002, 2, 2, '2023-01-16', 2, 2999.0, 4.0, 0),
(1003, 3, 3, '2023-01-17', 5, 52.0, 5.0, 0),
(1004, 4, 4, '2023-01-18', 1, 199.0, 3.5, 1),
(1005, 5, 5, '2023-01-19', 1, 1899.0, 4.2, 0)
]
cursor.executemany('''
INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?, ?)
''', orders_data)
What just happened?
Created 5 orders linking customer IDs to product IDs. Order 1004 has returned=1 representing a return. The unit_price captures historical pricing at purchase time. Try this: Add more orders with different date ranges for time-series analysis.
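Picking up the time-series suggestion, here is a hypothetical monthly rollup using SQLite's strftime, against a simplified standalone copy of the orders table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    order_date DATE NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price REAL NOT NULL
)
''')
cursor.executemany('INSERT INTO orders VALUES (?, ?, ?, ?)', [
    (1001, '2023-01-15', 1, 79900.0),
    (1002, '2023-01-16', 2, 2999.0),
    (1003, '2023-02-01', 5, 52.0),
    (1004, '2023-02-18', 1, 199.0),
])

# strftime('%Y-%m', ...) buckets ISO dates by calendar month
monthly = cursor.execute('''
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(quantity * unit_price) AS revenue
    FROM orders
    GROUP BY month
    ORDER BY month
''').fetchall()
print(monthly)  # [('2023-01', 85898.0), ('2023-02', 459.0)]
```

Storing dates as ISO-8601 text is what makes this work: lexicographic order matches chronological order, so both GROUP BY and range filters behave correctly.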
# Complex JOIN query across all three tables
query = '''
SELECT
o.order_id,
c.city,
c.age,
p.product_name,
p.category,
o.quantity * o.unit_price as revenue,
o.rating,
o.returned
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON o.product_id = p.product_id
ORDER BY revenue DESC
'''
result = pd.read_sql_query(query, conn)
print(result)

   order_id       city  age         product_name     category  revenue  rating  returned
0      1001     Mumbai   28            iPhone 14  Electronics  79900.0     4.5         0
1      1002      Delhi   34         Levi's Jeans     Clothing   5998.0     4.0         0
2      1005       Pune   31      Pressure Cooker         Home   1899.0     4.2         0
3      1003  Bangalore   22          Amul Butter         Food    260.0     5.0         0
4      1004    Chennai   45  Chetan Bhagat Novel        Books    199.0     3.5         1
What just happened?
Three-table JOIN combines orders, customers, and products. INNER JOIN only returns matching records. Revenue calculation quantity * unit_price happens at query time. Electronics dominate revenue despite low quantity. Try this: Use LEFT JOIN to include customers with no orders.
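The LEFT JOIN suggestion, sketched with a simplified two-table schema: a customer with no orders still appears in the result, with None in the order columns.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT)')
cursor.execute('CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)')
cursor.executemany('INSERT INTO customers VALUES (?, ?)',
                   [(1, 'Mumbai'), (2, 'Delhi')])
cursor.execute('INSERT INTO orders VALUES (1001, 1)')  # only Mumbai has an order

rows = cursor.execute('''
    SELECT c.city, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
''').fetchall()
print(rows)  # Delhi appears with None for order_id
```

An INNER JOIN over the same data would silently drop Delhi, which is exactly the trap when counting "customers who ordered" versus "all customers".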
Electronics generate 90% of total revenue despite representing only 20% of orders
Electronics completely dominate revenue, generating ₹79,900 compared to ₹260 for food products. This massive skew is typical in e-commerce, where high-value items like smartphones drive disproportionate revenue. The single returned item was a book with the lowest rating (3.5), suggesting a correlation between satisfaction and returns. This insight drives inventory decisions: focus marketing spend on electronics while investigating book quality issues.
📊 Data Insight
Revenue concentration: the top category (Electronics) generates 90.5% of total revenue from just one order, while the bottom three categories combined contribute only 2.7%.
Performance Optimization
Database performance separates amateur from professional implementations. Indexes are the difference between 50ms and 5-second queries on production data. But here's the catch — too many indexes slow down writes while speeding up reads.
The scenario: BigBasket's orders table has grown to 50 million records. Query performance is degrading as analysts run demographic reports across customers and orders.
# Create indexes on frequently queried columns
cursor.execute('CREATE INDEX idx_order_date ON orders(order_date)')
cursor.execute('CREATE INDEX idx_customer_city ON customers(city)')
cursor.execute('CREATE INDEX idx_product_category ON products(category)')
# Composite index for complex queries
cursor.execute('''
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date)
''')
What just happened?
Created four indexes for common query patterns. idx_orders_customer_date is a composite index covering customer-specific date range queries. Index names use descriptive conventions. Try this: Use EXPLAIN QUERY PLAN to see if indexes are being used.
Composite indexes provide optimal performance, but over-indexing creates maintenance overhead
Performance improvement follows a curve. No indexes result in full table scans taking 2.8 seconds. Single-column indexes drop this to 180ms. Composite indexes hit the sweet spot at 45ms by covering multiple query conditions efficiently. Over-indexing actually degrades performance slightly due to maintenance overhead. Each index must be updated on every INSERT, UPDATE, or DELETE. The goal is strategic indexing based on actual query patterns, not blanket coverage.
Common Indexing Mistake
Creating indexes on every column "just in case". Each index consumes storage and slows writes. Profile actual queries first, then index the bottlenecks. Start with foreign keys and WHERE clause columns.
# Analyze query execution plan
cursor.execute('''
EXPLAIN QUERY PLAN
SELECT c.city, COUNT(*) as order_count, SUM(o.quantity * o.unit_price) as total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2023-01-01'
GROUP BY c.city
ORDER BY total_revenue DESC
''')
plan = cursor.fetchall()
for row in plan:
    print(row)

(0, 0, 0, 'SCAN orders USING INDEX idx_order_date')
(0, 1, 1, 'SEARCH customers USING INTEGER PRIMARY KEY (rowid=?)')
(0, 0, 0, 'USE TEMP B-TREE FOR GROUP BY')
(0, 0, 0, 'USE TEMP B-TREE FOR ORDER BY')
What just happened?
Query planner uses idx_order_date for date filtering, then joins via primary key lookups. Temporary B-trees handle GROUP BY and ORDER BY operations. SCAN vs SEARCH indicates index efficiency. Try this: Compare plans with and without indexes using DROP INDEX.
Mumbai dominates with 90.5% of revenue from a single high-value electronics order
Mumbai's revenue dominance reflects the iPhone purchase, while other cities show diverse product categories. This geographical concentration suggests targeting premium electronics marketing toward tier-1 cities. The doughnut visualization clearly shows revenue inequality across cities. Delhi (₹5,998) and Pune (₹1,899) represent middle-tier purchases, while Bangalore and Chennai contributed minimal revenue from low-value items. Regional pricing strategies could address these imbalances.
Database Transactions and ACID
Transactions ensure data consistency when multiple operations must succeed or fail together. ACID properties — Atomicity, Consistency, Isolation, Durability — prevent data corruption during concurrent access. Real systems handle thousands of simultaneous transactions.
Why does this matter? Because order processing involves multiple steps: inventory reduction, payment processing, shipping creation. If any step fails, the entire transaction must rollback. Half-completed orders create data inconsistencies that take weeks to resolve manually.
Database transactions in Python use context managers for automatic rollback. Always wrap related operations in BEGIN...COMMIT blocks. Set isolation levels based on consistency requirements vs performance trade-offs.
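A minimal rollback sketch using a hypothetical inventory table: in Python's sqlite3, using the connection as a context manager commits on success and rolls back if any statement inside raises.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE inventory (
    product_id INTEGER PRIMARY KEY,
    stock INTEGER NOT NULL CHECK (stock >= 0)
)
''')
conn.execute('INSERT INTO inventory VALUES (1, 5)')
conn.commit()

try:
    with conn:  # BEGIN ... COMMIT, or ROLLBACK on exception
        conn.execute('UPDATE inventory SET stock = stock - 3 WHERE product_id = 1')
        # Second step would drive stock negative, violating the CHECK,
        # so the whole unit of work is rolled back
        conn.execute('UPDATE inventory SET stock = stock - 10 WHERE product_id = 1')
except sqlite3.IntegrityError:
    print('Order failed, both updates rolled back')

stock = conn.execute('SELECT stock FROM inventory WHERE product_id = 1').fetchone()[0]
print(stock)  # 5, the first UPDATE was undone too
```

That is atomicity in miniature: the first UPDATE succeeded on its own, yet it never becomes visible because the transaction as a whole failed.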
Quiz
1. An e-commerce database has customers and orders tables. What is the primary purpose of the foreign key relationship between them?
2. Your application frequently runs queries like "SELECT * FROM products WHERE category = 'Electronics' ORDER BY unit_price DESC". What index would optimize this query?
3. In an orders table, should you store the unit_price at purchase time or reference the current price from the products table?
Up Next
NoSQL
Explore document databases, key-value stores, and when to choose NoSQL over relational systems for modern applications.