MS SQL Lesson 28 – Indexes | Dataplexa

Indexes

An index is a sorted copy of selected column data with pointers back to the full rows in the table. Without indexes, SQL Server answers every query by reading every row in the table from first to last — a full table scan. For a table with ten rows that is instant. For a table with ten million rows that is seconds to minutes of I/O. An index lets SQL Server jump directly to the relevant rows the same way a book index lets you jump to the right page instead of reading every page until you find the word. Understanding which columns to index, how SQL Server uses indexes, and what the maintenance cost of each index is — these are the decisions that separate a database that performs well at scale from one that degrades as data grows. This lesson covers clustered and nonclustered indexes, how to create and manage them, how to read whether they are being used, and how to keep them healthy — all using the DataplexaStore schema.

Clustered Indexes

A clustered index determines the physical order in which rows are stored on disk. The leaf level of a clustered index is the actual table data — the rows themselves, sorted by the index key. Because the data can only be physically sorted one way, a table can have at most one clustered index. SQL Server creates a clustered index automatically when you define a PRIMARY KEY constraint, which is why primary key lookups are the fastest possible single-row retrievals — the row is right there in the index leaf. A table without a clustered index is called a heap and stores rows in insertion order with no guaranteed structure, making range scans and ordered retrievals significantly slower.

-- Inspect existing clustered indexes across DataplexaStore tables
USE DataplexaStore;

SELECT
    t.name                          AS table_name,
    i.name                          AS index_name,
    i.type_desc                     AS index_type,
    i.is_primary_key,
    i.is_unique,
    STRING_AGG(c.name, ', ')
        WITHIN GROUP (ORDER BY ic.key_ordinal)
                                    AS index_columns
FROM       sys.tables        t
JOIN       sys.indexes       i  ON t.object_id  = i.object_id
JOIN       sys.index_columns ic ON i.object_id  = ic.object_id
                                AND i.index_id  = ic.index_id
JOIN       sys.columns       c  ON ic.object_id = c.object_id
                                AND ic.column_id = c.column_id
WHERE      i.type_desc = 'CLUSTERED'
  AND      t.name IN ('Users','Products','Orders','OrderItems','Reviews')
GROUP BY   t.name, i.name, i.type_desc, i.is_primary_key, i.is_unique
ORDER BY   t.name;
table_name index_name index_type is_primary_key is_unique index_columns
─────────── ─────────────── ────────── ────────────── ───────── ─────────────
OrderItems PK_OrderItems CLUSTERED 1 1 OrderItemID
Orders PK_Orders CLUSTERED 1 1 OrderID
Products PK_Products CLUSTERED 1 1 ProductID
Reviews PK_Reviews CLUSTERED 1 1 ReviewID
Users PK_Users CLUSTERED 1 1 UserID
  • Every DataplexaStore table has a clustered index on its primary key — SQL Server created these automatically when the PRIMARY KEY constraints were defined; this is the correct default for most tables
  • The clustered index key (UserID, ProductID, etc.) determines physical row order — a query that retrieves rows by primary key performs a clustered index seek, navigating directly to the right page without scanning anything
  • A heap (table with no clustered index) would show no rows in this query for that table — for most OLTP workloads, adding a clustered index on the primary key is always the right first step

Nonclustered Indexes

A nonclustered index is a separate structure stored alongside the table, containing a sorted copy of the indexed column values and pointers (row locators) back to the full rows in the clustered index or heap. A table can have up to 999 nonclustered indexes, though in practice more than a handful is usually a sign of over-indexing. Nonclustered indexes are created on columns that appear frequently in WHERE clauses, JOIN conditions, and ORDER BY clauses — the columns SQL Server uses to filter and navigate the data. When SQL Server finds a useful nonclustered index it performs an index seek: it navigates the index tree directly to the matching key values and uses the row locators to retrieve the remaining columns from the clustered index. The additional trip from the nonclustered index to the clustered index to retrieve remaining columns is called a key lookup, and it is worth understanding because it has a cost that becomes relevant on large result sets.

-- Create nonclustered indexes on commonly filtered columns

-- Index on Orders.UserID — queries that filter or join by UserID will use this
CREATE NONCLUSTERED INDEX IX_Orders_UserID
ON Orders (UserID);

-- Index on Orders.Status — queries filtering by order status
CREATE NONCLUSTERED INDEX IX_Orders_Status
ON Orders (Status);

-- Composite index — Category + Price together for queries that filter by category and sort by price
CREATE NONCLUSTERED INDEX IX_Products_Category_Price
ON Products (Category ASC, Price DESC);

-- Index with INCLUDE — avoids key lookup for common query pattern
-- Covers: WHERE Category = ? ORDER BY Price and also returns ProductName
CREATE NONCLUSTERED INDEX IX_Products_Category_Covering
ON Products (Category ASC, Price DESC)
INCLUDE (ProductName, StockQty);    -- included columns are in leaf only, not the key

-- Verify all indexes on Products
SELECT
    i.name                          AS index_name,
    i.type_desc,
    i.is_unique,
    STRING_AGG(
        c.name + CASE WHEN ic.is_included_column = 1 THEN ' (INC)' ELSE '' END,
        ', '
    ) WITHIN GROUP (ORDER BY ic.is_included_column, ic.key_ordinal)
                                    AS columns
FROM   sys.indexes       i
JOIN   sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
JOIN   sys.columns       c  ON ic.object_id = c.object_id AND ic.column_id = c.column_id
WHERE  i.object_id = OBJECT_ID('Products')
GROUP BY i.name, i.type_desc, i.is_unique
ORDER BY i.type_desc DESC, i.name;
index_name type_desc is_unique columns
─────────────────────────────── ──────────── ───────── ──────────────────────────────────────
PK_Products CLUSTERED 1 ProductID
IX_Products_Category_Price NONCLUSTERED 0 Category, Price
IX_Products_Category_Covering NONCLUSTERED 0 Category, Price, ProductName (INC), StockQty (INC)
  • The covering index (IX_Products_Category_Covering) includes ProductName and StockQty as non-key columns in the index leaf — a query that filters by Category, sorts by Price, and returns ProductName and StockQty can be satisfied entirely from this index without a key lookup back to the clustered index
  • INCLUDE columns appear only at the leaf level of the index, not in the B-tree navigation structure — this keeps the index key narrow and efficient for navigation while still providing the extra columns at retrieval time
  • Two indexes on Category + Price is redundant here — in practice you would choose one; the covering version is shown separately to demonstrate the INCLUDE syntax

Index Seeks vs Index Scans

SQL Server's query optimiser chooses between two ways to use a nonclustered index: a seek or a scan. An index seek navigates the B-tree structure directly to the first matching key and reads only the qualifying rows — this is fast and scales well regardless of table size. An index scan reads the index from start to finish, which is faster than a table scan only because the index is narrower, but still reads every row. SQL Server chooses a scan when it cannot use the index key to narrow the starting position — typically because a leading wildcard in LIKE, a function applied to the column, or a type mismatch prevents the optimiser from performing a range seek. The execution plan in SSMS shows which operation was chosen and its estimated cost.

-- Queries that produce seeks vs scans — check execution plan in SSMS

-- INDEX SEEK — equality or range filter on the index leading column
-- SQL Server navigates directly to Category = 'Electronics' in the index
SELECT ProductName, Category, Price
FROM   Products
WHERE  Category = 'Electronics'
ORDER BY Price DESC;

-- INDEX SEEK — range scan, still a seek because the leading column has a known start
SELECT ProductName, Price
FROM   Products
WHERE  Category = 'Electronics'
  AND  Price BETWEEN 50 AND 200;

-- INDEX SCAN — function applied to the column prevents seek
-- SQL Server cannot use the index key when the column value is transformed
SELECT ProductName, Category
FROM   Products
WHERE  UPPER(Category) = 'ELECTRONICS';   -- wrapping in UPPER forces a scan

-- INDEX SCAN — leading wildcard prevents seek
SELECT ProductName, Category
FROM   Products
WHERE  Category LIKE '%tronics';           -- unknown prefix — must scan entire index

-- Correct form — trailing wildcard allows seek
SELECT ProductName, Category
FROM   Products
WHERE  Category LIKE 'Electro%';          -- known prefix — index seek possible
-- Seek result (Electronics, 50-200):
ProductName Category Price
─────────────────── ─────────── ─────
Mechanical Keyboard Electronics 80.99
Webcam HD Electronics 71.99
USB-C Hub Electronics 44.99

-- UPPER scan — same rows, found via scan not seek:
ProductName Category Price
──────────────────── ─────────── ─────
Monitor 27 inch Electronics 359.99
Mechanical Keyboard Electronics 80.99
Webcam HD Electronics 71.99
USB-C Hub Electronics 44.99
Wireless Mouse Electronics 29.69

-- Trailing wildcard seek:
ProductName Category
──────────────────── ───────────
Monitor 27 inch Electronics
Mechanical Keyboard Electronics
Webcam HD Electronics
USB-C Hub Electronics
Wireless Mouse Electronics
  • UPPER(Category) forces a scan because the index stores the raw column value — 'Electronics' not 'ELECTRONICS' — and applying UPPER to each row before comparing means SQL Server cannot navigate to a known position in the sorted index
  • The fix for UPPER is to store data consistently and use a case-insensitive collation, so WHERE Category = 'electronics' would find the row without a function wrapper
  • On a 10-row Products table the performance difference is imperceptible — on a Products table with 5 million rows the difference between a seek returning 5 rows and a scan reading 5 million is the difference between a millisecond query and a 30-second one

Index Fragmentation and Maintenance

As rows are inserted, updated, and deleted, index pages become fragmented — either internally (pages with large amounts of unused space) or externally (pages stored out of logical order on disk). Fragmentation degrades read performance because SQL Server must read more pages and do more I/O to retrieve the same data. Two operations address fragmentation: REORGANIZE reorders the leaf-level pages of an index online without blocking reads or writes, but does not reclaim pages with internal fragmentation. REBUILD drops and recreates the index entirely, producing a perfectly compact and ordered structure, but takes a full lock on the index during the operation (offline) unless the ENTERPRISE edition's online rebuild option is used. The standard guidance is to reorganise when fragmentation is between 10% and 30%, and rebuild when it exceeds 30%.

-- Check fragmentation on all DataplexaStore indexes
SELECT
    OBJECT_NAME(ips.object_id)      AS table_name,
    i.name                          AS index_name,
    i.type_desc,
    ips.index_level,
    ips.avg_fragmentation_in_percent,
    ips.page_count,
    ips.record_count
FROM   sys.dm_db_index_physical_stats(
           DB_ID(),         -- current database
           NULL,            -- all tables
           NULL,            -- all indexes
           NULL,            -- all partitions
           'LIMITED'        -- fast mode — sufficient for fragmentation check
       ) ips
JOIN   sys.indexes i
    ON ips.object_id = i.object_id
   AND ips.index_id  = i.index_id
WHERE  ips.index_level = 0          -- leaf level only
  AND  i.name IS NOT NULL           -- exclude heaps (index_id = 0)
  AND  OBJECT_NAME(ips.object_id) IN ('Users','Products','Orders','OrderItems','Reviews')
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Rebuild a specific index — recreates the structure completely
ALTER INDEX IX_Orders_UserID ON Orders REBUILD;

-- Reorganise a specific index — online operation, lower impact
ALTER INDEX IX_Products_Category_Price ON Products REORGANIZE;

-- Rebuild all indexes on a table
ALTER INDEX ALL ON Products REBUILD;
table_name index_name type_desc index_level avg_fragmentation page_count record_count
─────────── ─────────────────────────── ──────────── ─────────── ───────────────── ────────── ────────────
OrderItems PK_OrderItems CLUSTERED 0 0.00 1 8
Orders IX_Orders_Status NONCLUSTERED 0 0.00 1 4
Orders IX_Orders_UserID NONCLUSTERED 0 0.00 1 4
Orders PK_Orders CLUSTERED 0 0.00 1 4
Products IX_Products_Category_Price NONCLUSTERED 0 0.00 1 10
Products PK_Products CLUSTERED 0 0.00 1 10
Reviews PK_Reviews CLUSTERED 0 0.00 1 4
Users PK_Users CLUSTERED 0 0.00 1 9
  • 0% fragmentation across all indexes is expected on a small development dataset — fragmentation accumulates over thousands of inserts, updates, and deletes on large tables in production
  • REBUILD is always the correct choice after a bulk INSERT or DELETE that touches a large proportion of the table's rows — these operations can cause fragmentation to spike from 0% to 70%+ in a single operation
  • In production, index maintenance is typically run on a scheduled job during off-peak hours — either a fixed weekly rebuild schedule or an adaptive maintenance plan that checks fragmentation first and chooses REORGANIZE or REBUILD based on the threshold

When Not to Index

Every index has a maintenance cost. On every INSERT a new entry must be added to every nonclustered index on the table. On every UPDATE that changes an indexed column the index entry must be updated. On every DELETE the index entry must be removed. For a table with ten indexes, one INSERT triggers eleven write operations — one to the table and one to each index. Over-indexing a heavily written table can make writes slower than the reads it was intended to speed up. Columns with very few distinct values — a Status column with three possible values or a boolean flag — are poor index candidates because the optimiser often finds that a table scan is faster than fetching a large percentage of the table through an index. The right number of indexes is the fewest that cover the important query patterns without burdening write operations.

-- Identify potentially unused indexes — indexes that cost writes but are never used for reads
-- sys.dm_db_index_usage_stats resets when SQL Server restarts

SELECT
    OBJECT_NAME(i.object_id)        AS table_name,
    i.name                          AS index_name,
    i.type_desc,
    ISNULL(us.user_seeks,  0)       AS seeks,
    ISNULL(us.user_scans,  0)       AS scans,
    ISNULL(us.user_lookups,0)       AS lookups,
    ISNULL(us.user_seeks, 0)
        + ISNULL(us.user_scans, 0)
        + ISNULL(us.user_lookups, 0)
                                    AS total_reads,
    ISNULL(us.user_updates, 0)      AS write_cost    -- index updates from DML
FROM       sys.indexes i
LEFT JOIN  sys.dm_db_index_usage_stats us
    ON     i.object_id  = us.object_id
   AND     i.index_id   = us.index_id
   AND     us.database_id = DB_ID()
WHERE      OBJECT_NAME(i.object_id) IN ('Users','Products','Orders','OrderItems','Reviews')
  AND      i.type_desc = 'NONCLUSTERED'
  AND      i.name IS NOT NULL
ORDER BY   total_reads ASC, write_cost DESC;
table_name index_name type_desc seeks scans lookups total_reads write_cost
────────── ─────────────────────────────── ──────────── ───── ───── ─────── ─────────── ──────────
Products IX_Products_Category_Price NONCLUSTERED 0 0 0 0 0
Products IX_Products_Category_Covering NONCLUSTERED 0 0 0 0 0
Orders IX_Orders_Status NONCLUSTERED 0 0 0 0 0
Orders IX_Orders_UserID NONCLUSTERED 0 0 0 0 0
  • All four nonclustered indexes show zero reads — they have never been used for a seek, scan, or lookup since the server last restarted; on a development dataset this is expected but in production these would be candidates for review and potential removal
  • write_cost = 0 here because no DML has run since the indexes were created in this session; in production a high write_cost with zero total_reads is the classic signal of an index that is costing more than it contributes
  • sys.dm_db_index_usage_stats resets on every SQL Server restart — usage data should be collected over a representative period (at minimum one full business cycle) before making index removal decisions

Summary Table

Concept Detail Key Point
Clustered index Defines physical row order One per table — created by PRIMARY KEY
Nonclustered index Separate sorted structure with row pointers Up to 999 per table — index WHERE/JOIN/ORDER columns
INCLUDE columns Extra columns in index leaf only Covering index eliminates key lookups
Index seek Navigate B-tree to exact rows Fast — scales to any table size
Index scan Read entire index left to right Functions on columns, leading wildcards force scans
REBUILD Drop and recreate index Use when fragmentation > 30%
REORGANIZE Reorder leaf pages online Use when fragmentation 10–30%

Practice Questions

Practice 1. What is the difference between a clustered and a nonclustered index?



Practice 2. What is a covering index and why does it improve performance?



Practice 3. Why does WHERE UPPER(Category) = 'ELECTRONICS' prevent an index seek on the Category column?



Practice 4. When should you use REBUILD instead of REORGANIZE for an index?



Practice 5. A table has 12 nonclustered indexes. An INSERT runs on this table. How many write operations does SQL Server perform?



Quiz

Quiz 1. How many clustered indexes can a table have?






Quiz 2. What is a key lookup and when does it occur?






Quiz 3. sys.dm_db_index_usage_stats shows an index with user_seeks = 0 and user_updates = 50,000. What does this indicate?






Quiz 4. What fragmentation threshold triggers a REBUILD rather than a REORGANIZE?






Quiz 5. Why are columns with very few distinct values — such as a Status column with three possible values — poor index candidates?






Next up - Transactions & ACID - Understand how SQL Server groups statements into atomic units of work and guarantees data consistency even when things go wrong.