DataBase Design Lesson 23 – Denormalization | Dataplexa
Normalization · Lesson 23

Denormalization

Learn when and how to strategically break normalization rules to optimize database performance for real-world applications.

What is Denormalization?

You spent the last few lessons learning normalization. Now we're going to break those rules on purpose.

Denormalization is the process of intentionally introducing redundancy back into normalized tables. You take perfectly normalized data and make it "messy" again. But there's a reason.

Normalized databases are academically perfect. They eliminate redundancy and keep data consistent. But they can be slow in the real world. Every query needs multiple joins. Amazon doesn't care about perfect normalization when customers are waiting for search results.

1
Perfect Normalization
2
Performance Problems
3
Identify Bottlenecks
4
Strategic Denormalization

Think of it like city planning. A perfectly organized grid is easy to navigate. But sometimes you build a highway straight through to speed up traffic. The grid gets messy, but people get where they need to go faster.

When to Denormalize

Don't denormalize because you're lazy. Do it because you have proof that your normalized database is too slow. Measure first, optimize second.

Read-Heavy Systems

E-commerce product catalogs, news sites, social media feeds. Users read 100x more than they write.

Complex Joins

Queries joining 5+ tables that run constantly. Dashboard reports, analytics, search results.

Calculated Fields

Values computed from other data. Order totals, customer ratings, inventory counts calculated repeatedly.

Historical Data

Data that never changes. Completed orders, archived records, year-end reports that get queried often.

BookVault faces this exact problem. Every time someone searches for books, the database joins books + authors + categories. With 10,000 searches per minute, those joins become expensive.

Common Denormalization Techniques

Technique 1: Add Calculated Columns

Store computed values directly in tables instead of calculating them every time.

Before — Always Calculate

order_id customer_id status
1001501completed
1002502pending

Must join order_items to get total

After — Store Total

order_id customer_id total_amount
100150189.97
100250245.99

Direct access to total

-- Add calculated column to BookVault orders
-- Now we store the total instead of calculating it
ALTER TABLE orders 
ADD COLUMN total_amount DECIMAL(10,2);

-- Update existing orders with calculated totals
UPDATE orders 
SET total_amount = (
    SELECT SUM(quantity * unit_price) 
    FROM order_items 
    WHERE order_items.order_id = orders.order_id
);
Query OK, 0 rows affected (0.08 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 1247 rows affected (0.23 sec)
Rows matched: 1247  Changed: 1247  Warnings: 0

What just happened?

We added a total_amount column and populated it with calculated values. Now order totals load instantly instead of requiring joins to order_items. Try this: Compare query performance before and after on a table with thousands of orders.

Technique 2: Flatten Relationships

Store foreign table data directly in the main table. Perfect for data that rarely changes.

-- Denormalized books table includes author name
-- Eliminates need to join authors table for search results
CREATE TABLE books_denorm AS
SELECT 
    b.book_id,
    b.title,
    b.author_id,
    -- Store author name directly in books table
    CONCAT(a.first_name, ' ', a.last_name) as author_name,
    b.category_id,
    c.name as category_name,
    b.price,
    b.stock_qty
FROM books b
JOIN authors a ON b.author_id = a.author_id  
JOIN categories c ON b.category_id = c.category_id;
Query OK, 8547 rows affected (0.45 sec)
Records: 8547  Duplicates: 0  Warnings: 0

book_id | title                    | author_name      | category_name
--------|--------------------------|------------------|---------------
1001    | The Great Gatsby        | F. Scott Fitzgerald | Fiction  
1002    | To Kill a Mockingbird   | Harper Lee       | Fiction
1003    | 1984                    | George Orwell    | Fiction

What just happened?

We flattened the books, authors, and categories relationships into one table. Book searches now access everything from a single table instead of joining three tables. The trade-off: if an author changes their name, we must update multiple book records. Try this: Measure the performance difference on a search query.

Technique 3: Create Summary Tables

Pre-calculate complex aggregations that dashboards and reports need constantly.

-- Create monthly sales summary table
-- Dashboard queries this instead of raw order data
CREATE TABLE monthly_sales_summary (
    year_month DATE,
    total_orders INT,
    total_revenue DECIMAL(12,2),
    avg_order_value DECIMAL(8,2),
    unique_customers INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Populate with aggregated data
INSERT INTO monthly_sales_summary 
SELECT 
    DATE_FORMAT(order_date, '%Y-%m-01') as year_month,
    COUNT(*) as total_orders,
    SUM(total_amount) as total_revenue,
    AVG(total_amount) as avg_order_value,
    COUNT(DISTINCT customer_id) as unique_customers,
    NOW()
FROM orders 
WHERE status = 'completed'
GROUP BY DATE_FORMAT(order_date, '%Y-%m-01');
Query OK, 0 rows affected (0.12 sec)

Query OK, 24 rows affected (0.34 sec)
Records: 24  Duplicates: 0  Warnings: 0

year_month | total_orders | total_revenue | avg_order_value
-----------|-------------|---------------|----------------
2024-01-01 | 1247        | 87,459.23     | 70.12
2024-02-01 | 1456        | 98,234.67     | 67.49
2024-03-01 | 1789        | 125,678.90    | 70.25

Data Insight

Dashboard queries that took 4.2 seconds scanning 50,000+ order records now complete in 0.03 seconds reading 24 summary rows. That's a 99.3% performance improvement for executive reports.

The Tradeoffs

Denormalization isn't magic. You gain read performance but pay other costs. Every database architect faces these decisions.

Benefit Cost Real Example
Faster Reads Slower Writes Update author name = update 50 book records
Simple Queries Data Inconsistency Risk Order total might not match item totals
Less Storage I/O More Storage Space Store author name in every book record
Predictable Performance Complex Maintenance Triggers, batch jobs to keep data synchronized

Critical Rule: Measure First

Never denormalize without proving you have a performance problem. Profile your queries, identify the actual bottlenecks, then denormalize strategically. Random denormalization creates more problems than it solves.

Maintaining Denormalized Data

Once you denormalize, you must keep redundant data synchronized. Here are the common strategies that companies like Netflix and Spotify use.

Database Triggers

Automatically update denormalized data when source data changes.

-- Trigger to update book denormalization when author changes
-- Keeps author_name synchronized across all book records
CREATE TRIGGER update_author_in_books
    AFTER UPDATE ON authors
    FOR EACH ROW
BEGIN
    -- Update all books by this author with new name
    UPDATE books_denorm 
    SET author_name = CONCAT(NEW.first_name, ' ', NEW.last_name)
    WHERE author_id = NEW.author_id;
END;
Query OK, 0 rows affected (0.05 sec)

Batch Processing

Update denormalized tables in scheduled batches. Works well for data that doesn't need real-time accuracy.

-- Daily batch job to refresh monthly sales summary
-- Runs at 2 AM when system usage is low
DELETE FROM monthly_sales_summary 
WHERE year_month >= DATE_FORMAT(CURDATE() - INTERVAL 3 MONTH, '%Y-%m-01');

INSERT INTO monthly_sales_summary 
SELECT 
    DATE_FORMAT(order_date, '%Y-%m-01') as year_month,
    COUNT(*) as total_orders,
    SUM(total_amount) as total_revenue,
    AVG(total_amount) as avg_order_value,
    COUNT(DISTINCT customer_id) as unique_customers,
    NOW()
FROM orders 
WHERE status = 'completed'
  AND order_date >= DATE_FORMAT(CURDATE() - INTERVAL 3 MONTH, '%Y-%m-01')
GROUP BY DATE_FORMAT(order_date, '%Y-%m-01');
Query OK, 3 rows affected (0.02 sec)

Query OK, 3 rows affected (0.18 sec)
Records: 3  Duplicates: 0  Warnings: 0

What just happened?

We refreshed the last 3 months of summary data to catch any late-arriving orders or corrections. The DELETE removes outdated summaries, then INSERT rebuilds them from current order data. Try this: Schedule this as a cron job for automated maintenance.

Real-World Patterns

Every major tech company uses denormalization strategically. Here's how they approach it.

Amazon Product Search

Stores product data, reviews, ratings, and availability in flattened search indices. Millions of searches can't wait for joins across normalized tables.

Pattern: Search-optimized denormalization

Netflix Recommendations

Pre-computes recommendation scores and stores them in denormalized tables. Can't calculate recommendations in real-time for 200M+ users.

Pattern: Pre-calculated ML results

BookVault could denormalize customer order history into a single table for instant "Recently Viewed" and "Recommended for You" features. The customer experience demands speed over perfect normalization.

The key insight? Denormalization is an optimization technique, not a design philosophy. Start normalized, measure performance, then denormalize specific bottlenecks. Companies that denormalize everything from the start end up with unmaintainable data chaos.

Quiz

1. BookVault's search feature queries books, authors, and categories together 50,000 times per day. The joins are causing 3-second delays. What is the primary reason to consider denormalizing this data?


2. You've denormalized author names into the books table. When an author changes their name, you need to update potentially hundreds of book records. What's the most reliable way to maintain this denormalized data?


3. Which scenario would benefit most from creating a denormalized summary table?


Up Next

Surrogate vs Natural Keys

Choose between artificial IDs and real-world identifiers to ensure your database scales reliably over time.