MS SQL Lesson 19 – Aggregate Functions | Dataplexa

Aggregate Functions

Every business question that starts with "how many", "what is the total", "what is the average", or "what is the highest" is an aggregation question. Aggregate functions collapse a set of rows into a single summary value — COUNT tells you how many rows exist, SUM adds them up, AVG finds the mean, MIN and MAX find the boundaries. These five functions are the foundation of every report, dashboard, and analytical query in SQL Server. Understanding not just their syntax but their behaviour with NULLs, their interaction with DISTINCT, and the difference between counting rows and counting values is what separates queries that produce correct results from queries that silently lie — all demonstrated using the DataplexaStore data.

COUNT

COUNT is the most used aggregate function and the one with the most important behavioural distinction. COUNT(*) counts every row in the result set, including rows with NULL values in any column. COUNT(column) counts only the rows where that specific column is not NULL — NULLs are silently skipped. This distinction matters enormously when the column being counted is optional. COUNT(DISTINCT column) counts the number of unique non-NULL values. Getting the wrong form of COUNT produces results that are numerically plausible but factually wrong — the worst kind of error because it looks correct.

-- COUNT variants — understanding each form
USE DataplexaStore;

-- COUNT(*) — total rows including any with NULLs
SELECT COUNT(*) AS total_reviews
FROM   Reviews;

-- COUNT(column) — only rows where column is NOT NULL
-- ReviewText is nullable — COUNT(ReviewText) skips any NULL reviews
SELECT
    COUNT(*)           AS total_reviews,
    COUNT(ReviewText)  AS reviews_with_text,   -- skips NULLs silently
    COUNT(*) - COUNT(ReviewText) AS reviews_without_text
FROM Reviews;

-- COUNT(DISTINCT column) — unique non-NULL values
SELECT
    COUNT(DISTINCT UserID)     AS unique_reviewers,
    COUNT(DISTINCT ProductID)  AS unique_products_reviewed
FROM Reviews;

-- COUNT across tables
SELECT
    (SELECT COUNT(*) FROM Users)      AS total_users,
    (SELECT COUNT(*) FROM Products)   AS total_products,
    (SELECT COUNT(*) FROM Orders)     AS total_orders,
    (SELECT COUNT(*) FROM OrderItems) AS total_order_items,
    (SELECT COUNT(*) FROM Reviews)    AS total_reviews;

-- Total reviews:
total_reviews
─────────────
4

-- NULL-aware count:
total_reviews reviews_with_text reviews_without_text
───────────── ───────────────── ────────────────────
4 4 0

-- Distinct counts:
unique_reviewers unique_products_reviewed
──────────────── ────────────────────────
4 4

-- Counts across tables:
total_users total_products total_orders total_order_items total_reviews
─────────── ────────────── ──────────── ───────────────── ─────────────
9 10 4 8 4

All four reviews currently have text so reviews_without_text is 0 — in a real dataset with many optional fields the gap between COUNT(*) and COUNT(column) reveals how complete the data is, a useful data quality check
Four unique reviewers from four reviews confirms no user has reviewed more than once in the current dataset — COUNT(DISTINCT UserID) would be less than COUNT(*) if any user had submitted multiple reviews
Orders is now 4 (not 5) because Alice's two orders were deleted in L18, and the cancelled test order was also cleaned up — the seed data reflects the current state of the DataplexaStore tables

SUM and AVG

SUM adds all non-NULL values in a column and returns the total. AVG calculates the arithmetic mean of all non-NULL values — it divides the sum by the count of non-NULL rows, not by the total row count. Both functions silently skip NULLs, which means AVG on a column with NULLs gives you the average of the rows that have a value, not the average across all rows. Whether this is the correct calculation depends on what the NULL represents — if NULL means "not applicable" then skipping is right; if NULL means "zero" then you should substitute 0 with ISNULL before averaging. AVG on integer columns performs integer division by default — to get a decimal result cast one of the operands to DECIMAL first.

-- SUM and AVG across the DataplexaStore tables

-- Total and average order value
SELECT
    COUNT(*)                    AS order_count,
    SUM(TotalAmount)            AS total_revenue,
    AVG(TotalAmount)            AS avg_order_value,
    AVG(CAST(TotalAmount AS DECIMAL(10,4)))
                                AS avg_order_value_precise
FROM Orders;

-- Product price statistics
SELECT
    MIN(Price)                  AS cheapest,
    MAX(Price)                  AS most_expensive,
    AVG(Price)                  AS avg_price,
    SUM(Price * StockQty)       AS total_inventory_value
FROM Products;

-- Average rating — all reviews have a rating so NULLs are not a factor here
SELECT
    COUNT(*)                    AS review_count,
    AVG(CAST(Rating AS DECIMAL(4,2)))
                                AS avg_rating,   -- cast TINYINT to get decimal precision
    MIN(Rating)                 AS lowest_rating,
    MAX(Rating)                 AS highest_rating
FROM Reviews;

-- Order value statistics:
order_count total_revenue avg_order_value avg_order_value_precise
─────────── ───────────── ─────────────── ───────────────────────
4 692.90 173.22 173.2250

-- Product price statistics:
cheapest most_expensive avg_price total_inventory_value
──────── ────────────── ───────── ─────────────────────
8.99 399.99 109.66 21895.14

-- Review statistics:
review_count avg_rating lowest_rating highest_rating
──────────── ────────── ───────────── ──────────────
4 4.75 4 5

avg_order_value (173.22) and avg_order_value_precise (173.2250) differ because DECIMAL(10,2) division rounds the result to 2 decimal places — always cast to a higher-precision DECIMAL when AVG precision matters
avg_rating of 4.75 from four reviews (5, 4, 5, 5) is correct — casting TINYINT to DECIMAL(4,2) before AVG prevents integer division which would have produced 4 instead of 4.75
total_inventory_value of 21895.14 is the sum of Price × StockQty across all products — a single expression inside SUM is evaluated row by row before aggregation, so no intermediate table is needed

MIN and MAX

MIN and MAX return the smallest and largest non-NULL values in a column respectively. They work on any data type that has a natural ordering — numbers, strings (alphabetical), and dates (chronological). On strings MIN returns the value that sorts earliest alphabetically and MAX returns the latest. On dates MIN returns the earliest date and MAX returns the most recent. Like all aggregate functions they skip NULLs — if a column is entirely NULL, both return NULL. MIN and MAX are particularly useful for finding boundaries in time series data: the earliest order date, the most recent review, or the date range of activity for a particular user.

-- MIN and MAX on numbers, strings, and dates

-- Price boundaries
SELECT
    MIN(Price)          AS lowest_price,
    MAX(Price)          AS highest_price,
    MAX(Price) - MIN(Price) AS price_range
FROM Products;

-- Alphabetical MIN/MAX on strings
SELECT
    MIN(FullName)       AS first_alphabetically,
    MAX(FullName)       AS last_alphabetically
FROM Users;

-- Date boundaries — order activity window
SELECT
    MIN(OrderDate)      AS first_order_date,
    MAX(OrderDate)      AS most_recent_order,
    DATEDIFF(DAY, MIN(OrderDate), MAX(OrderDate))
                        AS days_of_activity
FROM Orders;

-- MIN and MAX per product — lowest and highest quantity sold in a single line item
SELECT
    p.ProductName,
    MIN(oi.Quantity)    AS min_qty_per_line,
    MAX(oi.Quantity)    AS max_qty_per_line,
    SUM(oi.Quantity)    AS total_units_sold
FROM   Products p
JOIN   OrderItems oi ON p.ProductID = oi.ProductID
GROUP BY p.ProductName
ORDER BY total_units_sold DESC;

-- Price boundaries:
lowest_price highest_price price_range
──────────── ───────────── ───────────
8.99 399.99 391.00

-- Alphabetical:
first_alphabetically last_alphabetically
──────────────────── ───────────────────
Bob Williams James Wilson

-- Order date range:
first_order_date most_recent_order days_of_activity
─────────────────────────── ─────────────────────────── ────────────────
2024-03-18 14:45:00.0000000 2024-03-27 11:05:00.0000000 8

-- Per-product quantity stats:
ProductName min_qty_per_line max_qty_per_line total_units_sold
─────────────────── ──────────────── ──────────────── ────────────────
Notebook A5 7 7 7
Desk Lamp 2 2 2
Mechanical Keyboard 1 1 1
Webcam HD 1 1 1
Wireless Mouse 1 1 1
USB-C Hub 1 1 1

first_alphabetically is Bob Williams not Alice Johnson because Alice was deleted in L18 — the current Users table starts at Bob
days_of_activity of 8 spans from March 18 to March 27 — MIN and MAX on DATETIME2 columns correctly identify the chronological boundaries of the order window
The last query combines MIN, MAX, and SUM with GROUP BY — this is a preview of the GROUP BY pattern covered in full in the next lesson; the query returns one aggregated row per product rather than one row per order item

Aggregates with DISTINCT

All five aggregate functions accept DISTINCT as a modifier, which causes them to operate only on unique values before aggregating. SUM(DISTINCT column) adds each unique value once even if it appears in multiple rows. AVG(DISTINCT column) averages the unique values. This is rarely what you want for SUM and AVG — if a price of 29.99 appears in three order lines, those three sales represent real revenue and should all be summed. Using DISTINCT with SUM would count that price once regardless of how many times it occurred, producing a meaningless undercount. The legitimate use of DISTINCT with aggregates is almost exclusively with COUNT, where counting unique values is a genuine analytical need.

-- DISTINCT with aggregates — correct and incorrect uses

-- Correct use: COUNT(DISTINCT) — how many unique products have been ordered?
SELECT
    COUNT(*)                    AS total_order_lines,
    COUNT(DISTINCT ProductID)   AS unique_products_ordered
FROM OrderItems;

-- Illustrate the DISTINCT effect on SUM and AVG
-- Total revenue from OrderItems — correct: sum every line
SELECT SUM(Quantity * UnitPrice)        AS correct_total_revenue
FROM   OrderItems;

-- Incorrect: SUM(DISTINCT UnitPrice) adds each price value once
-- This is not revenue — it is a meaningless sum of unique price points
SELECT SUM(DISTINCT UnitPrice)          AS misleading_distinct_sum
FROM   OrderItems;

-- AVG(DISTINCT) — average of unique unit prices (not average order line value)
SELECT
    AVG(UnitPrice)              AS avg_line_price,          -- average across all lines
    AVG(DISTINCT UnitPrice)     AS avg_unique_price_points  -- average of unique prices only
FROM OrderItems;

-- COUNT DISTINCT:
total_order_lines unique_products_ordered
───────────────── ───────────────────────
8 6

-- Revenue comparison:
correct_total_revenue misleading_distinct_sum
───────────────────── ───────────────────────
1601.82 927.92

-- AVG comparison:
avg_line_price avg_unique_price_points
────────────── ───────────────────────
200.2275 185.5840

correct_total_revenue (1601.82) vs misleading_distinct_sum (927.92) — the difference is substantial because several unit prices appear in multiple order lines; DISTINCT collapsed those to a single occurrence, dramatically undercounting revenue
8 total order lines but only 6 unique products — two products each appear in more than one order line, which COUNT(DISTINCT) correctly identifies without the need for a separate query
The two AVG results differ for the same reason — DISTINCT averages the unique price points, not the actual distribution of prices across order lines; always be explicit about which average you need

Aggregates and NULL Behaviour

Every aggregate function except COUNT(*) silently ignores NULL values. This is consistent, predictable, and usually correct — but it produces wrong results when NULL means zero rather than unknown. If a product's discount percentage is NULL because no discount applies, AVG(DiscountPct) correctly ignores those products and averages only the ones with discounts. But if NULL means the discount data was not captured and should be treated as zero for the purposes of the average, then you must substitute zero before aggregating using ISNULL or COALESCE. Making the wrong assumption about what NULL means produces averages, sums, and counts that are numerically consistent but factually misleading.

-- NULL behaviour in aggregates — demonstrate the difference between
-- "NULL means unknown/not applicable" vs "NULL means zero"

-- Add a DiscountPct column with some NULLs to demonstrate
ALTER TABLE Products ADD DiscountPct DECIMAL(5,2) NULL;
UPDATE Products SET DiscountPct = 10.0 WHERE ProductID IN (1, 2, 3);
-- ProductIDs 4-10 remain NULL (no discount defined)

-- AVG ignoring NULLs — average of the 3 products that have a discount
SELECT
    COUNT(*)                        AS total_products,
    COUNT(DiscountPct)              AS products_with_discount,
    AVG(DiscountPct)                AS avg_discount_non_null,    -- averages only 3 rows
    AVG(ISNULL(DiscountPct, 0))     AS avg_discount_treat_null_as_zero  -- averages all 10
FROM Products;

-- SUM is unaffected by NULLs in this case — SUM(NULL) = NULL,
-- but SUM ignores NULLs in a set, so partial NULLs don't make the sum NULL
SELECT
    SUM(DiscountPct)                AS sum_non_null,       -- sums only 3 values
    SUM(ISNULL(DiscountPct, 0))     AS sum_treat_as_zero   -- same result here: 30
FROM Products;

-- Clean up
ALTER TABLE Products DROP COLUMN DiscountPct;

total_products products_with_discount avg_discount_non_null avg_discount_treat_null_as_zero
────────────── ────────────────────── ───────────────────── ───────────────────────────────
10 3 10.00 3.00

sum_non_null sum_treat_as_zero
──────────── ─────────────────
30.00 30.00

avg_discount_non_null (10.00) vs avg_discount_treat_null_as_zero (3.00) — a threefold difference from the same data, purely based on how NULLs are handled; neither is wrong in isolation, but only one is correct for a given business question
SUM produces the same result either way here because adding zero to the total does not change it — the difference only matters when the denominator (count of rows) is relevant, as it is in AVG
The pattern of checking COUNT(*) vs COUNT(column) first to understand how many NULLs exist before aggregating is good practice in any unfamiliar dataset

Summary Table

Function	Returns	NULL Behaviour	Key Gotcha
COUNT(*)	Total row count	Includes NULLs	COUNT(col) skips NULLs — different result
SUM(col)	Total of non-NULL values	Skips NULLs	Use ISNULL if NULLs should count as zero
AVG(col)	Mean of non-NULL values	Skips NULLs — divides by non-NULL count	Cast integers to DECIMAL for decimal result
MIN(col)	Smallest non-NULL value	Skips NULLs	Works on strings and dates too
MAX(col)	Largest non-NULL value	Skips NULLs	Works on strings and dates too
COUNT(DISTINCT col)	Unique non-NULL value count	Skips NULLs	Avoid DISTINCT with SUM/AVG — misleading results

Practice Questions

Practice 1. What is the difference between COUNT(*) and COUNT(column)?

Practice 2. A Rating column is TINYINT. Why should you cast it to DECIMAL before using AVG?

Practice 3. A column has 100 rows, 30 of which are NULL. What does AVG return — the average of 70 values or the average of 100 values?

Practice 4. Why is SUM(DISTINCT UnitPrice) misleading when calculating total order revenue?

Practice 5. Write a query that returns the total number of units sold across all OrderItems and the average quantity per line item.

Quiz

Quiz 1. A table has 50 rows. 10 of them have NULL in the Price column. What does COUNT(*) return?

50
40
10
NULL

Quiz 2. A column contains the values 10, 20, NULL, 30. What does AVG return?

20
15
NULL
60

Quiz 3. Which aggregate function is the only one that includes NULL values?

COUNT(*)
SUM
MAX
MIN

Quiz 4. What does MIN return on a string column?

The value that sorts earliest alphabetically
The shortest string
Error — numbers only
NULL

Quiz 5. AVG(Rating) on integer ratings 3, 4, 4, 5 returns 4 instead of 4.0. Why?

Integer columns use integer division — cast to DECIMAL for a decimal result
AVG always rounds to nearest integer
SQL Server truncates by default
4 is the exact average here

Next up - GROUP BY - Break aggregate results into meaningful groups and learn how HAVING filters the groups that make it into your final result.

← Previous Course Index Next →

MS SQL Server