DataBase Design Lesson 29 – Indexing Strategy | Dataplexa

Physical & Enterprise · Lesson 29

Indexing Strategy

Design high-performance indexes for BookVault's database using real query patterns and performance analysis to create an optimal indexing strategy.

SLOW QUERY

        books → full table scan

        1,000,000 rows checked

        Response: 2.4 seconds

→

INDEX APPLIED

OPTIMIZED QUERY

        books → index seek

        50 rows examined

        Response: 0.003 seconds

Indexes transform database performance from unbearably slow to lightning fast. But wrong indexes can make things worse. They consume storage space, slow down writes, and confuse the query optimizer. Smart indexing strategy balances query performance against maintenance costs. Think of indexes like a library's card catalog system. Without it, finding one book means checking every shelf. With the right catalog, you walk directly to the exact location.

Understanding Index Types

Database indexes come in several flavors. Each solves specific performance problems. Clustered indexes physically reorder table data. Non-clustered indexes create separate lookup structures. Composite indexes cover multiple columns together.

Clustered Index

Physically sorts table data. Only one per table. Usually on primary key. Fast for range queries.

Non-Clustered

Separate lookup structure. Multiple allowed per table. Points to actual data rows. Best for specific searches.

Composite Index

Covers multiple columns in specific order. Perfect for complex WHERE clauses and sorting.

Partial Index

Indexes subset of rows meeting conditions. Smaller, faster, targets specific query patterns.

BookVault's books table illustrates these differences perfectly. The primary key book_id gets a clustered index automatically. But searches by title or author_id need separate non-clustered indexes.

Query Pattern Analysis

Before creating indexes, analyze your actual queries. Not what you think users will search for. What they actually search for. The difference kills performance budgets. BookVault's most common queries reveal clear indexing needs:

-- BookVault's top 5 query patterns from application logs
-- Query 1: Book search by title (40% of all queries)
SELECT book_id, title, price 
FROM books 
WHERE title LIKE 'Harry Potter%';

-- Query 2: Books by specific author (25% of queries)  
SELECT b.title, b.price, a.first_name, a.last_name
FROM books b
JOIN authors a ON b.author_id = a.author_id
WHERE a.last_name = 'Rowling';

-- Query 3: Category browsing with pagination (20% of queries)
SELECT book_id, title, price
FROM books
WHERE category_id = 5
ORDER BY title
LIMIT 20 OFFSET 100;

Query 1: Full table scan - 2.1 seconds
Query 2: Hash join with table scans - 1.8 seconds  
Query 3: Full table scan + sort - 3.2 seconds

Total response time: 7.1 seconds
User experience: Unacceptable

What just happened?

Without proper indexes, every query performs full table scans. The database examines every single row to find matches. Try this: Run EXPLAIN on your slowest queries to see their execution plans.

Pattern analysis reveals three critical index opportunities. Title searches need a prefix-optimized index. Author joins require indexes on foreign key columns. Category browsing benefits from a composite index covering both filtering and sorting.

Query Frequency vs Performance Impact

Not all slow queries deserve indexes. A query running once per day can stay slow. A query running 10,000 times per hour must be fast. Calculate the performance impact score: frequency × current response time × user frustration level.

Query Type	Daily Frequency	Avg Response	Impact Score	Priority
Book title search	45,000	2.1s	94,500	CRITICAL
Author page load	28,000	1.8s	50,400	HIGH
Category browsing	22,000	3.2s	70,400	HIGH
Sales reports	50	8.5s	425	LOW

The numbers don't lie. Book title searches create 200x more performance impact than sales reports, despite reports being individually slower.

Designing Composite Indexes

Single-column indexes solve simple problems. Real applications need composite indexes covering multiple columns. But column order matters enormously. Get it wrong, and your index becomes useless for most queries. The golden rule: most selective column first, then by query frequency. "Selective" means the column that eliminates the most rows fastest.

-- BookVault composite index design
-- Query: Find fantasy books under $20, sorted by publication year

SELECT title, price, published_year
FROM books  
WHERE category_id = 3        -- Fantasy category (selectivity: 8%)
  AND price < 20.00          -- Price filter (selectivity: 45%)
  AND stock_qty > 0          -- In stock (selectivity: 78%)
ORDER BY published_year DESC;

Without index: 1,000,000 rows scanned
Expected result: ~2,800 matching books

Selectivity analysis:
- category_id = 3: eliminates 92% of rows (best)
- price < 20.00: eliminates 55% of remaining rows  
- stock_qty > 0: eliminates 22% of remaining rows

What just happened?

Selectivity determines index effectiveness. The category_id filter eliminates 92% of rows immediately, making it the perfect leading column. Try this: Calculate selectivity with COUNT DISTINCT on your filter columns.

Based on this analysis, the optimal composite index puts the most selective column first:

-- Optimal composite index for BookVault fantasy search
CREATE INDEX idx_books_category_price_stock_year ON books (
    category_id,      -- Most selective (8% of rows match)
    price,           -- Second most selective (helps range queries) 
    stock_qty,       -- Least selective but commonly filtered
    published_year   -- Included for ORDER BY optimization
);

-- This index supports multiple query patterns:
-- Pattern 1: WHERE category_id = ? 
-- Pattern 2: WHERE category_id = ? AND price < ?
-- Pattern 3: WHERE category_id = ? AND price < ? AND stock_qty > ?
-- Pattern 4: All above + ORDER BY published_year

Index created successfully.
Estimated size: 45 MB
Expected performance improvement:
- Fantasy book queries: 2.1s → 0.08s (26x faster)
- Price range searches: 1.8s → 0.12s (15x faster)  
- Stock availability: 1.2s → 0.05s (24x faster)

Column Order Mistake

Never put low-selectivity columns first. An index starting with stock_qty would be nearly useless because 78% of books are in stock. The database still examines 780,000 rows instead of 80,000.

Index Maintenance Strategy

Indexes aren't "create and forget" structures. They require ongoing maintenance. Every INSERT, UPDATE, and DELETE modifies indexes too. Write-heavy tables with too many indexes become slower than tables with no indexes. BookVault's orders table demonstrates this challenge perfectly. Orders are created frequently but rarely updated. The order_items table follows the same pattern.

Data Insight

BookVault processes 50,000 orders daily but only 200 order modifications. This 250:1 read-to-write ratio justifies aggressive indexing on order lookup columns.

Monitor index usage regularly. Unused indexes waste storage and slow down writes for zero benefit:

-- Monitor index usage in BookVault database
-- PostgreSQL version (adjust for your database system)

SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan AS times_used,
    idx_tup_read AS rows_read,
    idx_tup_fetch AS rows_fetched
FROM pg_stat_user_indexes 
WHERE schemaname = 'bookvault'
ORDER BY idx_scan DESC;

indexname                    | times_used | rows_read
idx_books_title              | 450,234    | 2,100,445
idx_books_author_id          | 289,567    | 1,200,334  
idx_orders_customer_id       | 234,123    | 890,234
idx_books_experimental_date  | 0          | 0
idx_categories_old_system    | 12         | 48

What just happened?

Usage statistics reveal index effectiveness. The experimental date index was never used and should be dropped. The old system category index is barely used and wastes 15 MB. Try this: Schedule monthly index usage reviews.

Index Rebuilding and Statistics

Database indexes fragment over time, just like hard drives. Query performance gradually degrades as indexes become less efficient. Most databases provide tools to rebuild and optimize indexes. Index statistics become stale when data distribution changes significantly. A category that used to contain 5% of books now contains 40%. The query optimizer makes poor decisions based on outdated statistics.

-- BookVault index maintenance routine
-- Update statistics for better query plans

-- PostgreSQL example
ANALYZE books;
ANALYZE orders; 
ANALYZE order_items;

-- Reindex fragmented indexes (run during maintenance window)
REINDEX INDEX idx_books_title;
REINDEX INDEX idx_orders_customer_id;

-- Check index bloat and effectiveness
SELECT 
    tablename,
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes 
WHERE schemaname = 'bookvault';

Statistics updated for 3 tables.
Reindexing completed.

Index sizes after optimization:
idx_books_title         | 89 MB (was 156 MB - 43% reduction)
idx_orders_customer_id  | 34 MB (was 52 MB - 35% reduction)
Total space reclaimed: 85 MB

Covering Indexes and Include Columns

The ultimate index optimization: covering indexes. These indexes contain all columns needed by a query. The database never touches the actual table. Pure index lookups are 10-50x faster than index + table lookups. BookVault's book search demonstrates covering index power:

-- BookVault covering index for complete book search
-- Query needs: title, price, stock_qty, author name

-- Without covering index: 
-- 1. Use title index to find book_id values
-- 2. Look up each book_id in books table for price, stock_qty  
-- 3. Look up author_id in authors table for name
-- Result: Multiple table accesses per book

-- With covering index:
CREATE INDEX idx_books_search_covering ON books (
    title,                    -- Search key
    category_id              -- Filter key  
) INCLUDE (
    price,                   -- Include columns (non-key data)
    stock_qty,
    author_id,
    published_year
);

-- Now this query uses ONLY the index:
SELECT title, price, stock_qty, published_year
FROM books 
WHERE title LIKE 'Harry%' 
  AND category_id = 1;

Covering index created: 145 MB
Query execution plan:
-> Index Only Scan on idx_books_search_covering
   Filter: (title ~~ 'Harry%' AND category_id = 1)
   Rows: 47
   Time: 0.003 seconds (was 0.24 seconds)

Performance improvement: 80x faster

What just happened?

"Index Only Scan" means the query never accessed the books table. All required data came from the index. This eliminates random disk I/O and cache misses. Try this: Identify your most frequent SELECT columns for covering index opportunities.

Covering indexes have trade-offs. They consume significantly more storage space. They slow down writes more than regular indexes. Use them sparingly on your most critical, high-frequency queries.

Pro Tip: Create covering indexes for queries that run more than 1,000 times per hour and access predictable column sets. Monitor storage growth and write performance impact carefully.

Real-world indexing strategy combines all these techniques. Start with single-column indexes on frequently filtered columns. Add composite indexes for complex query patterns. Create covering indexes for your top 3-5 most critical queries. Monitor, measure, and adjust based on actual usage patterns. Remember: the best index is the one that solves real performance problems your users actually experience. Not theoretical optimizations that look good on paper.

Quiz

Up Next

Partitioning Concepts

Learn how to split massive tables across multiple physical storage units for better performance and easier maintenance.

← Previous Course Index Next →

Database Design

Indexing Strategy

Understanding Index Types

Clustered Index

Non-Clustered

Composite Index

Partial Index

Query Pattern Analysis

Query Frequency vs Performance Impact

Designing Composite Indexes

Index Maintenance Strategy

Index Rebuilding and Statistics

Covering Indexes and Include Columns

Quiz