DataBase Design Lesson 32 – Designing for Scalability | Dataplexa
Physical & Enterprise · Lesson 32

Designing for Scalability

Learn horizontal and vertical scaling strategies, partitioning techniques, and replication patterns to handle millions of users like Amazon and Netflix

The Scale Problem

Imagine BookVault starts with 100 customers. Your single server handles everything perfectly. Then you get featured on social media. Suddenly, 10,000 customers try to buy books simultaneously. Your database crashes.

This isn't theoretical. Instagram had 13 employees when Facebook bought them for $1 billion. They served 100 million users with a tiny team. How? Smart database architecture decisions.

Scalability means your database performs well as data volume, user count, or transaction frequency increases. But there's a catch - you need to plan for scale before you need it. Retrofitting a poorly designed database is like rebuilding a plane while flying.

1
Single Server (1-1K users)
2
Vertical Scaling (1K-10K users)
3
Read Replicas (10K-100K users)
4
Sharding & Partitioning (100K+ users)

Vertical vs Horizontal Scaling

Vertical scaling means adding more power to your existing server. More RAM, faster CPU, better storage. Think of it like upgrading from a Honda Civic to a Ferrari.

Horizontal scaling means adding more servers. Instead of one Ferrari, you get ten Honda Civics working together. Netflix uses thousands of servers, not one supercomputer.

Vertical Scaling

Pros:

  • Simple to implement
  • No code changes needed
  • ACID transactions preserved

Cons:

  • Hardware limits (ceiling effect)
  • Expensive at high performance
  • Single point of failure

Horizontal Scaling

Pros:

  • Nearly unlimited scaling
  • Fault tolerant
  • Cost-effective at scale

Cons:

  • Complex architecture
  • Data consistency challenges
  • Requires application changes

Honestly, most companies start vertical then go horizontal. Amazon started with a single database. Now they have thousands. The key is knowing when to make the jump - usually when your monthly server bill hits four digits.

Database Replication

Replication creates copies of your database on multiple servers. Master-slave replication is the most common pattern. One server handles writes (master), others handle reads (slaves).

Think about BookVault during Black Friday. Thousands of people browse books (reads) but only some actually buy (writes). You need many read servers, fewer write servers.

MASTER
Handles Writes
orders, customers
READ REPLICA 1
books, authors
READ REPLICA 2
categories, search
READ REPLICA 3
analytics, reports
-- Configure read replica routing in BookVault
-- Write operations go to master
INSERT INTO orders (customer_id, order_date, total_amount)
VALUES (1001, CURRENT_DATE, 45.99);

-- Read operations can use any replica
SELECT b.title, a.first_name, a.last_name
FROM books b
JOIN authors a ON b.author_id = a.author_id
WHERE b.category_id = 5;
INSERT executed on MASTER server
Query OK, 1 row affected

SELECT executed on READ REPLICA 2
+------------------+------------+-----------+
| title            | first_name | last_name |
+------------------+------------+-----------+
| The Great Gatsby | F. Scott   | Fitzgerald|
| To Kill a Bird   | Harper     | Lee       |
+------------------+------------+-----------+
2 rows in set (0.01 sec)

What just happened?

The INSERT went to the master database because it modifies data. The SELECT used a read replica, reducing load on the master. Try this: Route all customer browsing to replicas, keep checkout on master.

Replication Lag

Here's the catch with replication: slaves aren't instantly updated. There's a delay called replication lag. A customer might place an order but not see it immediately when they refresh the page.

Instagram handles this by showing users their own posts immediately (from master) but loading their feed from replicas. Smart compromise between consistency and performance.

Common Mistake

Reading from a replica immediately after writing to master. The customer places an order then gets "Order not found" because the replica hasn't updated yet. Fix: Read from master for 30 seconds after any write operation.

Database Sharding

Sharding splits your data across multiple databases. Instead of one giant customers table with 50 million rows, you have five databases with 10 million rows each.

Uber shards by geographic region. New York rides go to one database, Los Angeles rides to another. Makes sense - people rarely need cross-city data in real-time.

Sharding Strategies

RANGE-BASED

Split by data ranges

A-F customers → Shard 1
G-M customers → Shard 2

HASH-BASED

Split by hash function

hash(customer_id) % 4
= shard number

DIRECTORY-BASED

Lookup table decides

customer_shard_map
stores routing info

GEOGRAPHIC

Split by location

US customers → US shard
EU customers → EU shard

For BookVault, hash-based sharding works well. Customer ID 1001 might go to shard 1, customer ID 1002 to shard 2. Evenly distributed, predictable routing.

-- BookVault sharding logic (hash-based)
-- Application code determines which shard to query

-- Customer 1001 (hash(1001) % 4 = 1) → Shard 1
-- Connect to shard_1 database
SELECT * FROM customers WHERE customer_id = 1001;

-- Customer 1002 (hash(1002) % 4 = 2) → Shard 2  
-- Connect to shard_2 database
SELECT * FROM customers WHERE customer_id = 1002;
SHARD_1 Database:
+-------------+------------+-----------+-------------------+
| customer_id | first_name | last_name | email             |
+-------------+------------+-----------+-------------------+
| 1001        | Sarah      | Johnson   | sarah@example.com |
+-------------+------------+-----------+-------------------+

SHARD_2 Database:
+-------------+------------+-----------+-------------------+
| customer_id | first_name | last_name | email             |
+-------------+------------+-----------+-------------------+
| 1002        | Mike       | Chen      | mike@example.com  |
+-------------+------------+-----------+-------------------+

What just happened?

Each customer's data lives on exactly one shard, determined by hashing their ID. The application routes queries to the correct database. Try this: Use a consistent hashing algorithm so adding shards doesn't require moving all data.

Cross-Shard Queries

Sharding's biggest problem: queries across multiple shards. Want to find all customers who bought books in the last 30 days? That data might be on all four shards. Suddenly one query becomes four queries plus data merging.

Sharding Complexity

Joins across shards are expensive or impossible. Analytics queries become nightmares. Many companies use separate analytical databases (data warehouses) fed by all operational shards. Don't shard unless you absolutely need to.

Partitioning

Partitioning splits large tables within a single database. Unlike sharding (multiple databases), partitioning keeps everything in one database but organizes data more efficiently.

BookVault's orders table might have 10 million rows. Searching through them is slow. But if you partition by date, queries for recent orders only scan the relevant partition.

-- Partition BookVault orders table by date ranges
-- PostgreSQL syntax (MySQL similar)

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date DATE NOT NULL,
    status VARCHAR(20) NOT NULL,
    total_amount DECIMAL(10,2)
) PARTITION BY RANGE (order_date);

-- Create partitions for each quarter
CREATE TABLE orders_2024_q1 PARTITION OF orders 
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE orders_2024_q2 PARTITION OF orders 
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
Table "orders" created with range partitioning
Partition "orders_2024_q1" created (Jan-Mar 2024)
Partition "orders_2024_q2" created (Apr-Jun 2024)

Query for recent orders will only scan orders_2024_q2:
Query performance improved by 75% (scans 1/4 of data)

Data Insight

Partitioning can improve query performance by 5-10x when done correctly. Airbnb reduced booking query times from 30 seconds to 3 seconds by partitioning their reservations table by check-in date.

Real-World Architecture

Here's how big companies actually scale their databases. Spoiler: it's messier than textbooks suggest.

Company Primary Strategy Database Scale
Instagram Functional sharding PostgreSQL 100M+ users, 13 employees
Uber Geographic sharding MySQL + Schemaless 1B+ trips stored
Shopify Shop-based sharding MySQL 1M+ merchants
Discord Message partitioning Cassandra 1T+ messages

Instagram's approach was brilliant: they sharded by function, not data. User photos went to one cluster, user relationships to another, feeds to a third. Each team could optimize their specific use case.

BookVault Scaling Plan

Let's design BookVault's scaling roadmap:

-- Phase 1: Read replicas for book browsing
-- Master handles: orders, customers (writes)
-- Replicas handle: books, authors, categories (reads)

-- Phase 2: Partition large tables  
-- Partition orders by date (quarterly)
-- Partition order_items by order_date

-- Phase 3: Shard by customer region
-- US customers: shard_us_1, shard_us_2  
-- EU customers: shard_eu_1, shard_eu_2

-- Keep books table replicated on all shards
-- (customers need to browse same inventory)
BookVault Scaling Architecture:

Phase 1: 1-10K users
├── Master DB (writes)
└── 2x Read Replicas (browsing)

Phase 2: 10K-100K users  
├── Master DB (partitioned)
├── 4x Read Replicas
└── Analytics DB (reports)

Phase 3: 100K+ users
├── US Shard Cluster (2 shards)
├── EU Shard Cluster (2 shards)
├── Global Books Replica
└── Analytics Warehouse

Pro tip: Scale in stages. Don't jump straight to sharding. Most "scaling problems" are actually poorly optimized queries or missing indexes. Profile first, scale second.

Quiz

1. BookVault wants to shard their customer data across 4 database servers to handle 1 million customers. Which approach would distribute customers most evenly?


2. BookVault uses master-slave replication. A customer places an order (write to master) then immediately refreshes the page to see their order confirmation (read query). What's the best way to handle this?


3. BookVault's orders table has grown to 5 million rows. Most queries filter by order_date to find recent orders, but performance is getting slow. What's the best solution before considering sharding?


Up Next

Designing for Security

Learn how to protect BookVault's data with encryption, access controls, and security best practices that prevent costly breaches.