MS SQL Server
Aggregate Functions
Every business question that starts with "how many", "what is the total", "what is the average", or "what is the highest" is an aggregation question. Aggregate functions collapse a set of rows into a single summary value — COUNT tells you how many rows exist, SUM adds them up, AVG finds the mean, MIN and MAX find the boundaries. These five functions are the foundation of every report, dashboard, and analytical query in SQL Server. Understanding not just their syntax but their behaviour with NULLs, their interaction with DISTINCT, and the difference between counting rows and counting values is what separates queries that produce correct results from queries that silently lie — all demonstrated using the DataplexaStore data.
COUNT
COUNT is the most used aggregate function and the one with the most important behavioural distinction. COUNT(*) counts every row in the result set, including rows with NULL values in any column. COUNT(column) counts only the rows where that specific column is not NULL — NULLs are silently skipped. This distinction matters enormously when the column being counted is optional. COUNT(DISTINCT column) counts the number of unique non-NULL values. Getting the wrong form of COUNT produces results that are numerically plausible but factually wrong — the worst kind of error because it looks correct.
-- COUNT variants — understanding each form
USE DataplexaStore;
-- COUNT(*) — total rows including any with NULLs
SELECT COUNT(*) AS total_reviews
FROM Reviews;
-- COUNT(column) — only rows where column is NOT NULL
-- ReviewText is nullable — COUNT(ReviewText) skips any NULL reviews
SELECT
COUNT(*) AS total_reviews,
COUNT(ReviewText) AS reviews_with_text, -- skips NULLs silently
COUNT(*) - COUNT(ReviewText) AS reviews_without_text
FROM Reviews;
-- COUNT(DISTINCT column) — unique non-NULL values
SELECT
COUNT(DISTINCT UserID) AS unique_reviewers,
COUNT(DISTINCT ProductID) AS unique_products_reviewed
FROM Reviews;
-- COUNT across tables
SELECT
(SELECT COUNT(*) FROM Users) AS total_users,
(SELECT COUNT(*) FROM Products) AS total_products,
(SELECT COUNT(*) FROM Orders) AS total_orders,
(SELECT COUNT(*) FROM OrderItems) AS total_order_items,
(SELECT COUNT(*) FROM Reviews) AS total_reviews;total_reviews
─────────────
4
-- NULL-aware count:
total_reviews reviews_with_text reviews_without_text
───────────── ───────────────── ────────────────────
4 4 0
-- Distinct counts:
unique_reviewers unique_products_reviewed
──────────────── ────────────────────────
4 4
-- Counts across tables:
total_users total_products total_orders total_order_items total_reviews
─────────── ────────────── ──────────── ───────────────── ─────────────
9 10 4 8 4
- All four reviews currently have text so reviews_without_text is 0 — in a real dataset with many optional fields the gap between COUNT(*) and COUNT(column) reveals how complete the data is, a useful data quality check
- Four unique reviewers from four reviews confirms no user has reviewed more than once in the current dataset — COUNT(DISTINCT UserID) would be less than COUNT(*) if any user had submitted multiple reviews
- Orders is now 4 (not 5) because Alice's two orders were deleted in L18, and the cancelled test order was also cleaned up — the seed data reflects the current state of the DataplexaStore tables
SUM and AVG
SUM adds all non-NULL values in a column and returns the total. AVG calculates the arithmetic mean of all non-NULL values — it divides the sum by the count of non-NULL rows, not by the total row count. Both functions silently skip NULLs, which means AVG on a column with NULLs gives you the average of the rows that have a value, not the average across all rows. Whether this is the correct calculation depends on what the NULL represents — if NULL means "not applicable" then skipping is right; if NULL means "zero" then you should substitute 0 with ISNULL before averaging. AVG on integer columns performs integer division by default — to get a decimal result cast one of the operands to DECIMAL first.
-- SUM and AVG across the DataplexaStore tables
-- Total and average order value
SELECT
COUNT(*) AS order_count,
SUM(TotalAmount) AS total_revenue,
AVG(TotalAmount) AS avg_order_value,
AVG(CAST(TotalAmount AS DECIMAL(10,4)))
AS avg_order_value_precise
FROM Orders;
-- Product price statistics
SELECT
MIN(Price) AS cheapest,
MAX(Price) AS most_expensive,
AVG(Price) AS avg_price,
SUM(Price * StockQty) AS total_inventory_value
FROM Products;
-- Average rating — all reviews have a rating so NULLs are not a factor here
SELECT
COUNT(*) AS review_count,
AVG(CAST(Rating AS DECIMAL(4,2)))
AS avg_rating, -- cast TINYINT to get decimal precision
MIN(Rating) AS lowest_rating,
MAX(Rating) AS highest_rating
FROM Reviews;order_count total_revenue avg_order_value avg_order_value_precise
─────────── ───────────── ─────────────── ───────────────────────
4 692.90 173.22 173.2250
-- Product price statistics:
cheapest most_expensive avg_price total_inventory_value
──────── ────────────── ───────── ─────────────────────
8.99 399.99 109.66 21895.14
-- Review statistics:
review_count avg_rating lowest_rating highest_rating
──────────── ────────── ───────────── ──────────────
4 4.75 4 5
- avg_order_value (173.22) and avg_order_value_precise (173.2250) differ because DECIMAL(10,2) division rounds the result to 2 decimal places — always cast to a higher-precision DECIMAL when AVG precision matters
- avg_rating of 4.75 from four reviews (5, 4, 5, 5) is correct — casting TINYINT to DECIMAL(4,2) before AVG prevents integer division which would have produced 4 instead of 4.75
- total_inventory_value of 21895.14 is the sum of Price × StockQty across all products — a single expression inside SUM is evaluated row by row before aggregation, so no intermediate table is needed
MIN and MAX
MIN and MAX return the smallest and largest non-NULL values in a column respectively. They work on any data type that has a natural ordering — numbers, strings (alphabetical), and dates (chronological). On strings MIN returns the value that sorts earliest alphabetically and MAX returns the latest. On dates MIN returns the earliest date and MAX returns the most recent. Like all aggregate functions they skip NULLs — if a column is entirely NULL, both return NULL. MIN and MAX are particularly useful for finding boundaries in time series data: the earliest order date, the most recent review, or the date range of activity for a particular user.
-- MIN and MAX on numbers, strings, and dates
-- Price boundaries
SELECT
MIN(Price) AS lowest_price,
MAX(Price) AS highest_price,
MAX(Price) - MIN(Price) AS price_range
FROM Products;
-- Alphabetical MIN/MAX on strings
SELECT
MIN(FullName) AS first_alphabetically,
MAX(FullName) AS last_alphabetically
FROM Users;
-- Date boundaries — order activity window
SELECT
MIN(OrderDate) AS first_order_date,
MAX(OrderDate) AS most_recent_order,
DATEDIFF(DAY, MIN(OrderDate), MAX(OrderDate))
AS days_of_activity
FROM Orders;
-- MIN and MAX per product — lowest and highest quantity sold in a single line item
SELECT
p.ProductName,
MIN(oi.Quantity) AS min_qty_per_line,
MAX(oi.Quantity) AS max_qty_per_line,
SUM(oi.Quantity) AS total_units_sold
FROM Products p
JOIN OrderItems oi ON p.ProductID = oi.ProductID
GROUP BY p.ProductName
ORDER BY total_units_sold DESC;lowest_price highest_price price_range
──────────── ───────────── ───────────
8.99 399.99 391.00
-- Alphabetical:
first_alphabetically last_alphabetically
──────────────────── ───────────────────
Bob Williams James Wilson
-- Order date range:
first_order_date most_recent_order days_of_activity
─────────────────────────── ─────────────────────────── ────────────────
2024-03-18 14:45:00.0000000 2024-03-27 11:05:00.0000000 8
-- Per-product quantity stats:
ProductName min_qty_per_line max_qty_per_line total_units_sold
─────────────────── ──────────────── ──────────────── ────────────────
Notebook A5 7 7 7
Desk Lamp 2 2 2
Mechanical Keyboard 1 1 1
Webcam HD 1 1 1
Wireless Mouse 1 1 1
USB-C Hub 1 1 1
- first_alphabetically is Bob Williams not Alice Johnson because Alice was deleted in L18 — the current Users table starts at Bob
- days_of_activity of 8 spans from March 18 to March 27 — MIN and MAX on DATETIME2 columns correctly identify the chronological boundaries of the order window
- The last query combines MIN, MAX, and SUM with GROUP BY — this is a preview of the GROUP BY pattern covered in full in the next lesson; the query returns one aggregated row per product rather than one row per order item
Aggregates with DISTINCT
All five aggregate functions accept DISTINCT as a modifier, which causes them to operate only on unique values before aggregating. SUM(DISTINCT column) adds each unique value once even if it appears in multiple rows. AVG(DISTINCT column) averages the unique values. This is rarely what you want for SUM and AVG — if a price of 29.99 appears in three order lines, those three sales represent real revenue and should all be summed. Using DISTINCT with SUM would count that price once regardless of how many times it occurred, producing a meaningless undercount. The legitimate use of DISTINCT with aggregates is almost exclusively with COUNT, where counting unique values is a genuine analytical need.
-- DISTINCT with aggregates — correct and incorrect uses
-- Correct use: COUNT(DISTINCT) — how many unique products have been ordered?
SELECT
COUNT(*) AS total_order_lines,
COUNT(DISTINCT ProductID) AS unique_products_ordered
FROM OrderItems;
-- Illustrate the DISTINCT effect on SUM and AVG
-- Total revenue from OrderItems — correct: sum every line
SELECT SUM(Quantity * UnitPrice) AS correct_total_revenue
FROM OrderItems;
-- Incorrect: SUM(DISTINCT UnitPrice) adds each price value once
-- This is not revenue — it is a meaningless sum of unique price points
SELECT SUM(DISTINCT UnitPrice) AS misleading_distinct_sum
FROM OrderItems;
-- AVG(DISTINCT) — average of unique unit prices (not average order line value)
SELECT
AVG(UnitPrice) AS avg_line_price, -- average across all lines
AVG(DISTINCT UnitPrice) AS avg_unique_price_points -- average of unique prices only
FROM OrderItems;total_order_lines unique_products_ordered
───────────────── ───────────────────────
8 6
-- Revenue comparison:
correct_total_revenue misleading_distinct_sum
───────────────────── ───────────────────────
1601.82 927.92
-- AVG comparison:
avg_line_price avg_unique_price_points
────────────── ───────────────────────
200.2275 185.5840
- correct_total_revenue (1601.82) vs misleading_distinct_sum (927.92) — the difference is substantial because several unit prices appear in multiple order lines; DISTINCT collapsed those to a single occurrence, dramatically undercounting revenue
- 8 total order lines but only 6 unique products — two products each appear in more than one order line, which COUNT(DISTINCT) correctly identifies without the need for a separate query
- The two AVG results differ for the same reason — DISTINCT averages the unique price points, not the actual distribution of prices across order lines; always be explicit about which average you need
Aggregates and NULL Behaviour
Every aggregate function except COUNT(*) silently ignores NULL values. This is consistent, predictable, and usually correct — but it produces wrong results when NULL means zero rather than unknown. If a product's discount percentage is NULL because no discount applies, AVG(DiscountPct) correctly ignores those products and averages only the ones with discounts. But if NULL means the discount data was not captured and should be treated as zero for the purposes of the average, then you must substitute zero before aggregating using ISNULL or COALESCE. Making the wrong assumption about what NULL means produces averages, sums, and counts that are numerically consistent but factually misleading.
-- NULL behaviour in aggregates — demonstrate the difference between
-- "NULL means unknown/not applicable" vs "NULL means zero"
-- Add a DiscountPct column with some NULLs to demonstrate
ALTER TABLE Products ADD DiscountPct DECIMAL(5,2) NULL;
UPDATE Products SET DiscountPct = 10.0 WHERE ProductID IN (1, 2, 3);
-- ProductIDs 4-10 remain NULL (no discount defined)
-- AVG ignoring NULLs — average of the 3 products that have a discount
SELECT
COUNT(*) AS total_products,
COUNT(DiscountPct) AS products_with_discount,
AVG(DiscountPct) AS avg_discount_non_null, -- averages only 3 rows
AVG(ISNULL(DiscountPct, 0)) AS avg_discount_treat_null_as_zero -- averages all 10
FROM Products;
-- SUM is unaffected by NULLs in this case — SUM(NULL) = NULL,
-- but SUM ignores NULLs in a set, so partial NULLs don't make the sum NULL
SELECT
SUM(DiscountPct) AS sum_non_null, -- sums only 3 values
SUM(ISNULL(DiscountPct, 0)) AS sum_treat_as_zero -- same result here: 30
FROM Products;
-- Clean up
ALTER TABLE Products DROP COLUMN DiscountPct;────────────── ────────────────────── ───────────────────── ───────────────────────────────
10 3 10.00 3.00
sum_non_null sum_treat_as_zero
──────────── ─────────────────
30.00 30.00
- avg_discount_non_null (10.00) vs avg_discount_treat_null_as_zero (3.00) — a threefold difference from the same data, purely based on how NULLs are handled; neither is wrong in isolation, but only one is correct for a given business question
- SUM produces the same result either way here because adding zero to the total does not change it — the difference only matters when the denominator (count of rows) is relevant, as it is in AVG
- The pattern of checking COUNT(*) vs COUNT(column) first to understand how many NULLs exist before aggregating is good practice in any unfamiliar dataset
Summary Table
| Function | Returns | NULL Behaviour | Key Gotcha |
|---|---|---|---|
| COUNT(*) | Total row count | Includes NULLs | COUNT(col) skips NULLs — different result |
| SUM(col) | Total of non-NULL values | Skips NULLs | Use ISNULL if NULLs should count as zero |
| AVG(col) | Mean of non-NULL values | Skips NULLs — divides by non-NULL count | Cast integers to DECIMAL for decimal result |
| MIN(col) | Smallest non-NULL value | Skips NULLs | Works on strings and dates too |
| MAX(col) | Largest non-NULL value | Skips NULLs | Works on strings and dates too |
| COUNT(DISTINCT col) | Unique non-NULL value count | Skips NULLs | Avoid DISTINCT with SUM/AVG — misleading results |
Practice Questions
Practice 1. What is the difference between COUNT(*) and COUNT(column)?
Practice 2. A Rating column is TINYINT. Why should you cast it to DECIMAL before using AVG?
Practice 3. A column has 100 rows, 30 of which are NULL. What does AVG return — the average of 70 values or the average of 100 values?
Practice 4. Why is SUM(DISTINCT UnitPrice) misleading when calculating total order revenue?
Practice 5. Write a query that returns the total number of units sold across all OrderItems and the average quantity per line item.
Quiz
Quiz 1. A table has 50 rows. 10 of them have NULL in the Price column. What does COUNT(*) return?
Quiz 2. A column contains the values 10, 20, NULL, 30. What does AVG return?
Quiz 3. Which aggregate function is the only one that includes NULL values?
Quiz 4. What does MIN return on a string column?
Quiz 5. AVG(Rating) on integer ratings 3, 4, 4, 5 returns 4 instead of 4.0. Why?
Next up - GROUP BY - Break aggregate results into meaningful groups and learn how HAVING filters the groups that make it into your final result.