MS SQL Lesson 31 – Query Optimization | Dataplexa

Query Optimisation

A query that returns the right answer in 30 seconds is not a correct query in production — it is a problem waiting for a deadline. SQL Server's query optimiser does significant work automatically, but it can only work with what you give it: accurate statistics, usable indexes, and queries written in ways that allow the optimiser to reason about them. The difference between a query that runs in 2 milliseconds and one that runs in 20 seconds is almost never the business logic — it is the shape of the query, the presence or absence of the right index, a function wrapping a column in a WHERE clause, or a data type mismatch that silently disables index use. This lesson covers the practical techniques that reliably improve query performance: understanding what the optimiser needs, identifying the patterns that prevent it from doing its job, reading the signals that tell you where the problem is, and applying the fixes — all using the DataplexaStore schema.

How the Optimiser Chooses a Plan

Before executing any query, SQL Server's query optimiser generates one or more candidate execution plans and estimates the cost of each based on statistics — metadata about the distribution of values in each column and index. It picks the plan with the lowest estimated cost and caches it for reuse. The optimiser is cost-based, not rule-based: it does not have a fixed algorithm for every query shape; it explores the search space of possible join orders, index choices, and operator selections and tries to find a good enough plan within a time budget. The phrase "good enough" is important — the optimiser does not guarantee the optimal plan, only a plan that is estimated to be reasonably efficient based on the statistics it has. When statistics are stale, when data has unusual distributions, or when the query shape prevents the optimiser from reasoning correctly, the resulting plan can be dramatically suboptimal.

-- Inspect statistics on the Products table
USE DataplexaStore;

-- List all statistics objects on Products
SELECT
    s.name                          AS stats_name,
    s.auto_created,
    s.user_created,
    sp.last_updated,
    sp.rows,
    sp.rows_sampled,
    sp.modification_counter
FROM       sys.stats                s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE      s.object_id = OBJECT_ID('Products')
ORDER BY   sp.last_updated DESC;

-- Update statistics manually — forces the optimiser to re-examine data distribution
UPDATE STATISTICS Products;             -- updates all stats on the table
UPDATE STATISTICS Products PK_Products; -- updates one specific stats object

-- Check when statistics were last updated across all DataplexaStore tables
SELECT
    OBJECT_NAME(s.object_id)        AS table_name,
    s.name                          AS stats_name,
    sp.last_updated,
    sp.rows,
    sp.modification_counter         AS modifications_since_last_update
FROM       sys.stats                s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE      OBJECT_NAME(s.object_id) IN ('Users','Products','Orders','OrderItems','Reviews')
ORDER BY   sp.modification_counter DESC, OBJECT_NAME(s.object_id);

-- Statistics on Products:
stats_name auto_created user_created last_updated rows rows_sampled modification_counter
─────────────────────────────── ──────────── ──────────── ─────────────────────── ──── ──────────── ────────────────────
PK_Products 0 0 2024-03-28 15:00:00.000 10 10 0
IX_Products_Category_Price 0 1 2024-03-28 15:10:00.000 10 10 0
IX_Products_Category_Covering 0 1 2024-03-28 15:10:00.000 10 10 0
_WA_Sys_00000003_Products 1 0 2024-03-28 15:00:00.000 10 10 2

-- Cross-table stats freshness:
table_name stats_name last_updated rows modifications
─────────── ─────────────────────── ─────────────────────── ──── ─────────────
Products _WA_Sys_00000003_Price 2024-03-28 15:00:00.000 10 2
Orders IX_Orders_Status 2024-03-28 15:10:00.000 4 0
Orders PK_Orders 2024-03-28 15:10:00.000 4 0

modification_counter = 2 on the auto-created Price stats means two rows were modified since the last update — SQL Server auto-updates stats when roughly 20% of rows change, so on a 10-row table this threshold is hit very quickly; on a million-row table 20% means 200,000 modifications before auto-update triggers
Auto-created statistics (_WA_Sys_...) are created automatically by SQL Server when a column is used in a WHERE clause for the first time and no existing stats cover it — they are a sign the optimiser needed information it did not have
In production, stale statistics on large tables are one of the most common causes of sudden query regressions — a query that ran in 50ms for months suddenly takes 30 seconds after a bulk load because the statistics no longer reflect the actual data distribution

SARGability — Writing WHERE Clauses the Optimiser Can Use

A SARGable predicate (Search ARGument able) is a WHERE condition that SQL Server can use to perform an index seek — navigating directly to the matching rows without scanning the entire index. A predicate is SARGable when the indexed column appears alone on one side of the comparison and the value or expression it is compared to is on the other side. Wrapping the column in a function, performing arithmetic on it, or comparing it to a value of a different data type all make the predicate non-SARGable, forcing a scan. Non-SARGable predicates are one of the most common and most fixable sources of unnecessary full-index scans in production queries.

-- SARGable vs non-SARGable predicates — each pair produces the same result
-- but only the SARGable form allows an index seek

-- NON-SARGable: function wrapping the column
SELECT ProductName, Price
FROM   Products
WHERE  ROUND(Price, 0) = 30;        -- ROUND applied to column — forces scan

-- SARGable equivalent: move the logic to the value side
SELECT ProductName, Price
FROM   Products
WHERE  Price >= 29.50               -- column is bare — index seek possible
  AND  Price <  30.50;

-- NON-SARGable: implicit type conversion — column is INT, literal is VARCHAR
-- This converts every UserID to VARCHAR before comparing — forces scan
SELECT UserID, FullName
FROM   Users
WHERE  CAST(UserID AS VARCHAR) = '2';

-- SARGable equivalent: use the correct type
SELECT UserID, FullName
FROM   Users
WHERE  UserID = 2;                  -- INT compared to INT — seek on clustered index

-- NON-SARGable: arithmetic on the column
SELECT OrderID, TotalAmount
FROM   Orders
WHERE  TotalAmount / 2 > 50;        -- divides every row then compares — scan

-- SARGable equivalent: rearrange the expression to isolate the column
SELECT OrderID, TotalAmount
FROM   Orders
WHERE  TotalAmount > 100;           -- column is bare — index seek possible

-- NON-SARGable: leading wildcard
SELECT ProductName FROM Products WHERE ProductName LIKE '%Mouse';  -- scan

-- SARGable equivalent: trailing wildcard
SELECT ProductName FROM Products WHERE ProductName LIKE 'Wire%';   -- seek

-- ROUND (non-SARGable) — returns same row as range version:
ProductName Price
────────────── ─────
Wireless Mouse 29.99

-- Range equivalent (SARGable):
ProductName Price
────────────── ─────
Wireless Mouse 29.99

-- INT = INT (SARGable):
UserID FullName
────── ────────────
2 Bob Williams

-- TotalAmount > 100 (SARGable):
OrderID TotalAmount
─────── ───────────
2 512.92
3 159.91
7 104.37

Every non-SARGable form forces a full scan of the index or table — on a 10-row Products table this is imperceptible; on a Products table with 5 million rows each non-SARGable scan reads every row and every page, making the query orders of magnitude slower than the seek equivalent
The ROUND workaround (using a range instead) is slightly different in semantics — it returns any price that rounds to 30, not exactly 30; the query intent should confirm whether this is the right interpretation before replacing the non-SARGable form
Type mismatch is particularly dangerous because it is invisible — WHERE UserID = '2' looks correct, works correctly, and returns the right result, but forces SQL Server to convert every UserID in the table before comparing, disabling the clustered index seek silently

Query Rewriting Techniques

Beyond SARGability, several structural query patterns consistently produce worse plans than their equivalents. OR conditions on different columns are often better served by UNION ALL because OR prevents the optimiser from using separate indexes on each column independently. SELECT DISTINCT used to eliminate duplicates caused by incorrect joins is a symptom of a structural problem and often hides a missing GROUP BY or an incorrect relationship — fixing the root cause is better than masking it with DISTINCT. Scalar subqueries in SELECT that run once per row (correlated) can often be replaced by a single JOIN or window function that the optimiser can process in a single pass rather than row by row.

-- Query rewriting — structural improvements for better plan quality

-- OR on different indexed columns — may not use either index efficiently
SELECT UserID, FullName, Country
FROM   Users
WHERE  Country = 'UK'
   OR  MembershipTier = 'VIP';

-- UNION ALL equivalent — each branch can independently use its own index seek
SELECT UserID, FullName, Country, MembershipTier
FROM   Users
WHERE  Country = 'UK'
UNION ALL
SELECT UserID, FullName, Country, MembershipTier
FROM   Users
WHERE  MembershipTier = 'VIP'
  AND  Country <> 'UK';          -- exclude the overlap to avoid duplicates

-- Correlated scalar subquery — re-executed once per product row
SELECT
    p.ProductName,
    p.Price,
    (SELECT COUNT(*) FROM OrderItems oi WHERE oi.ProductID = p.ProductID) AS times_ordered
FROM   Products p
ORDER BY times_ordered DESC;

-- JOIN + GROUP BY equivalent — single pass, no per-row subquery execution
SELECT
    p.ProductName,
    p.Price,
    COUNT(oi.OrderItemID)           AS times_ordered
FROM       Products    p
LEFT JOIN  OrderItems  oi ON p.ProductID = oi.ProductID
GROUP BY   p.ProductName, p.Price
ORDER BY   times_ordered DESC;

-- OR result:
UserID FullName Country MembershipTier
────── ──────────── ─────── ──────────────
2 Bob Williams UK Premium
7 Grace Patel India VIP
3 Clara Davis Canada VIP

-- UNION ALL result (same rows, no duplicates):
UserID FullName Country MembershipTier
────── ──────────── ─────── ──────────────
2 Bob Williams UK Premium
7 Grace Patel India VIP
3 Clara Davis Canada VIP

-- JOIN + GROUP BY (same result as correlated subquery, single pass):
ProductName Price times_ordered
──────────────────── ────── ─────────────
Monitor 27 inch 359.99 2
Notebook A5 12.99 2
Mechanical Keyboard 81.80 2
Desk Lamp 36.74 2
USB-C Hub 45.44 2
Wireless Mouse 29.99 1
Webcam HD 71.99 0
Standing Desk 367.49 0
Ergonomic Chair 262.49 0
Ballpoint Pen Set 8.54 0

The OR and UNION ALL forms return identical results — on small tables the optimiser may transform the OR into a UNION internally anyway, but on large tables with separate indexes on Country and MembershipTier, UNION ALL gives the optimiser an explicit path to use both indexes independently
The JOIN + GROUP BY form replaces the correlated subquery — the optimiser processes all products and all order items in a single join pass rather than running a separate COUNT query for each of the 10 product rows; at 10 products the difference is negligible, at 100,000 products the correlated form would run 100,000 subqueries
Prices have drifted from earlier lesson values due to the accumulated price changes applied in L17, L19, and L29 — the DataplexaStore data reflects all cumulative modifications across the lesson series

Identifying Slow Queries — sys.dm_exec_query_stats

SQL Server caches every query plan that executes and tracks cumulative performance statistics for each one — total elapsed time, CPU time, logical reads, and execution count. This data is available in sys.dm_exec_query_stats and is one of the first places to look when investigating a performance problem. Sorting by total logical reads identifies the queries consuming the most I/O across all executions. Sorting by total elapsed time finds the queries responsible for the most wall-clock time. Dividing by execution count reveals the per-execution cost, distinguishing a query that runs once and takes 10 seconds from one that runs 10,000 times and takes 1ms each — both are performance concerns but they require different solutions.

-- Top queries by total logical reads — most I/O-intensive queries in the plan cache
SELECT TOP 10
    qs.execution_count,
    qs.total_logical_reads,
    qs.total_logical_reads / qs.execution_count
                                    AS avg_logical_reads,
    qs.total_elapsed_time / 1000    AS total_elapsed_ms,
    qs.total_elapsed_time / qs.execution_count / 1000
                                    AS avg_elapsed_ms,
    qs.total_worker_time / 1000     AS total_cpu_ms,
    SUBSTRING(qt.text, 1, 150)      AS query_text
FROM       sys.dm_exec_query_stats  qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
WHERE      qt.dbid = DB_ID()
ORDER BY   qs.total_logical_reads DESC;

-- Queries with the worst average elapsed time — slowest per execution
SELECT TOP 10
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count / 1000
                                    AS avg_elapsed_ms,
    qs.total_logical_reads / qs.execution_count
                                    AS avg_reads,
    SUBSTRING(qt.text, 1, 150)      AS query_text
FROM       sys.dm_exec_query_stats  qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
WHERE      qt.dbid = DB_ID()
  AND      qs.execution_count > 0
ORDER BY   avg_elapsed_ms DESC;

-- Top queries by logical reads (development dataset — low counts):
execution_count total_logical_reads avg_logical_reads total_elapsed_ms avg_elapsed_ms query_text
─────────────── ─────────────────── ───────────────── ──────────────── ────────────── ───────────────────────────────────────
4 48 12 112 28 SELECT p.ProductName, p.Category, p...
3 36 12 84 28 SELECT u.UserID, u.FullName, u.Email...
2 18 9 52 26 SELECT o.OrderID, o.Status, o.TotalA...
1 12 12 31 31 UPDATE Products SET Price = ROUND(Pri...
1 8 8 22 22 SELECT ProductName, Category, Price F...

avg_logical_reads is the key metric — logical reads measure pages read from the buffer pool, directly reflecting how much of the table was scanned; a query with 12 logical reads on a 10-row table is reading nearly every page, which is expected here but would be alarming on a million-row table
sys.dm_exec_query_stats resets when the plan is evicted from cache (due to memory pressure, statistics updates, or server restart) — in production the data represents recent history, not all-time history
The plan cache on a development dataset shows queries from the current session only; in production with hundreds of concurrent sessions this view reveals the true workload profile — which queries run most often and which cost the most per execution

The Missing Index DMV

SQL Server tracks situations where it estimated a query would have run faster if a particular index existed, and stores these recommendations in sys.dm_db_missing_index_details. When the optimiser builds an execution plan and chooses a scan where it could have used a seek — if only an index existed on the right columns — it records the columns that would have helped and an estimate of the improvement. These recommendations are a starting point for index analysis, not a prescription. The optimiser recommends indexes based on individual queries in isolation and does not account for the write overhead of adding indexes or the possibility that a single composite index could serve multiple recommendations simultaneously.

-- Missing index recommendations from the optimiser
SELECT
    mid.statement                   AS table_name,
    migs.avg_total_user_cost        AS avg_query_cost_reduction,
    migs.avg_user_impact            AS avg_pct_improvement,
    migs.user_seeks,
    migs.user_scans,
    mid.equality_columns,           -- columns used in = predicates
    mid.inequality_columns,         -- columns used in <, >, BETWEEN predicates
    mid.included_columns            -- columns to INCLUDE for covering
FROM       sys.dm_db_missing_index_groups   mig
JOIN       sys.dm_db_missing_index_group_stats migs
    ON     mig.index_group_handle = migs.group_handle
JOIN       sys.dm_db_missing_index_details  mid
    ON     mig.index_handle = mid.index_handle
WHERE      mid.database_id = DB_ID()
ORDER BY   migs.avg_total_user_cost * migs.avg_user_impact DESC;

-- Translate a recommendation into a CREATE INDEX statement
-- Example: if equality_columns = 'Status', included_columns = 'TotalAmount, OrderDate'
CREATE NONCLUSTERED INDEX IX_Orders_Status_Covering
ON Orders (Status)
INCLUDE (TotalAmount, OrderDate, UserID);  -- covering for common order status queries

-- Missing index recommendations (development dataset may show few or none):
table_name avg_query_cost_reduction avg_pct_improvement user_seeks equality_columns inequality_columns included_columns
──────────────────────────── ──────────────────────── ─────────────────── ────────── ──────────────── ────────────────── ─────────────────
[DataplexaStore].[dbo].[Users] 0.008 78.23 2 Country NULL FullName, Email

-- Translated index created:
Command(s) completed successfully.

avg_pct_improvement = 78.23 means the optimiser estimates a 78% reduction in query cost if an index on Users.Country were created — a high-impact recommendation worth investigating
The recommendation separates equality_columns (used in =) from inequality_columns (used in <, >, BETWEEN) because composite index key order matters: equality columns should come first in the index key, then inequality columns, then INCLUDE non-key columns
The missing index DMVs reset on SQL Server restart — in production they should be queried after a representative workload period, not immediately after startup when the optimiser has had little opportunity to gather recommendations

Summary Table

Technique	Problem It Solves	Key Rule
Update statistics	Stale row count estimates	Run after bulk loads or large modifications
SARGable predicates	Functions on columns disable index seeks	Column bare on one side — value or expression on the other
Match literal types to column types	Implicit conversion forces scan	INT column = INT literal, not VARCHAR
UNION ALL instead of OR	OR on different columns blocks individual index use	Each branch seeks independently
JOIN instead of correlated subquery	Per-row subquery re-execution	Single pass replaces N subquery executions
sys.dm_exec_query_stats	Identify high-cost queries	Sort by total_logical_reads or avg_elapsed_ms
sys.dm_db_missing_index_details	Find indexes the optimiser wanted	Equality cols first, inequality next, INCLUDE the rest

Practice Questions

Practice 1. What makes a WHERE clause predicate non-SARGable, and why does it hurt performance?

Practice 2. Statistics on a large Orders table have not been updated after a bulk load of 2 million new rows. What problem might occur?

Practice 3. A query has WHERE YEAR(OrderDate) = 2024. Why is this non-SARGable and how would you rewrite it?

Practice 4. What does avg_logical_reads in sys.dm_exec_query_stats measure and why is it useful?

Practice 5. The missing index DMV recommends an index with equality_columns = 'Category' and included_columns = 'ProductName, Price'. Write the CREATE INDEX statement.

Quiz

Quiz 1. WHERE CAST(UserID AS VARCHAR) = '2' prevents an index seek. What is the correct rewrite?

WHERE UserID = 2
WHERE CAST(UserID AS INT) = 2
WHERE UserID LIKE '2'
WHERE CONVERT(VARCHAR, UserID) = '2'

Quiz 2. In a missing index recommendation, equality_columns should come before inequality_columns in the index key. Why?

Equality columns narrow to a single value, maximising precision before the range begins
Inequality columns are larger
SQL Server requires alphabetical order
Equality columns change less

Quiz 3. sys.dm_exec_query_stats resets in which circumstance?

When the plan is evicted from cache or SQL Server restarts
Every midnight automatically
After every 1,000 executions
On each new connection

Quiz 4. Why is replacing a correlated scalar subquery with a LEFT JOIN + GROUP BY often faster?

JOIN processes both tables in one pass instead of N subquery executions
JOINs always use indexes
Subqueries cannot be parallelised
GROUP BY reduces locking

Quiz 5. SQL Server auto-updates statistics when approximately what percentage of rows in a table are modified?

Approximately 20%
1%
50%
5%

Next up - Execution Plans - Read the graphical and text execution plans SQL Server generates, understand the operators they contain, and use them to pinpoint exactly where a query is spending its time.

← Previous Course Index Next →

MS SQL Server