Data Science Lesson 35 – Window Functions | Dataplexa

SQL · Lesson 35

Window Functions

Master SQL's most powerful analytical feature to calculate running totals, rankings, and moving averages across your ecommerce dataset without losing row-level detail.

What Makes Window Functions Special

Regular aggregations like SUM() collapse your data into groups. You lose individual rows. Window functions solve this brilliantly — they calculate aggregations while keeping every single row intact.

Think of it like looking through a sliding window across your data. For each row, the function peeks at nearby rows to calculate values like "running total up to this point" or "rank within this category". Your original data stays untouched.

Regular GROUP BY

Collapses 1000 orders into 5 category totals

Window Functions

Keeps all 1000 rows + adds analytical columns

The OVER Clause Architecture

Every window function needs an OVER() clause. This tells SQL exactly which rows to include in each calculation. The syntax breaks down into three parts that work like filters and sorters.

PARTITION BY

Groups data

ORDER BY

Sorts within groups

ROWS/RANGE

Defines window size

The scenario: Flipkart's analytics team needs to track each order's rank within its category and calculate running revenue totals. Regular queries would require multiple subqueries and joins.

# Basic window function structure
SELECT order_id, revenue, product_category,
       # ROW_NUMBER creates sequential numbering within each partition
       ROW_NUMBER() OVER (PARTITION BY product_category ORDER BY revenue DESC) as category_rank
FROM dataplexa_ecommerce 
# PARTITION BY splits data into separate windows by category
# ORDER BY sorts rows within each category window
LIMIT 8;

order_id | revenue  | product_category | category_rank
---------|----------|------------------|---------------
1047     | 87540.0  | Electronics      | 1
2891     | 76230.0  | Electronics      | 2
3456     | 72180.0  | Electronics      | 3
8923     | 43650.0  | Clothing         | 1
7234     | 39870.0  | Clothing         | 2
5678     | 28940.0  | Food             | 1
9012     | 25630.0  | Food             | 2
4321     | 19780.0  | Books            | 1

What just happened?

The PARTITION BY product_category created separate ranking systems — Electronics rank 1 is independent from Clothing rank 1. ORDER BY revenue DESC sorted each category by highest revenue first. Try this: Remove PARTITION BY to see global ranking across all categories.

Ranking Functions Comparison

SQL offers four ranking functions that handle ties differently. ROW_NUMBER always assigns unique ranks, even for identical values. RANK and DENSE_RANK both handle ties but skip numbers differently.

Function	Ties Handling	Example (values: 100,90,90,80)
ROW_NUMBER	Assigns arbitrary order	1, 2, 3, 4
RANK	Skips next rank	1, 2, 2, 4
DENSE_RANK	No gaps in sequence	1, 2, 2, 3
NTILE	Groups into buckets	1, 1, 2, 2 (for 2 buckets)

The scenario: Swiggy wants to identify top-performing restaurants within each city, but several have identical ratings. Different ranking methods will produce different business insights.

# Compare different ranking methods for identical ratings
SELECT order_id, city, rating,
       # ROW_NUMBER gives unique ranks even for ties
       ROW_NUMBER() OVER (PARTITION BY city ORDER BY rating DESC) as row_rank,
       # RANK skips numbers after ties
       RANK() OVER (PARTITION BY city ORDER BY rating DESC) as standard_rank
FROM dataplexa_ecommerce 
WHERE city = 'Mumbai' AND rating >= 4.5;

order_id | city   | rating | row_rank | standard_rank
---------|--------|--------|----------|---------------
3401     | Mumbai | 5.0    | 1        | 1
7829     | Mumbai | 5.0    | 2        | 1
9156     | Mumbai | 5.0    | 3        | 1
2847     | Mumbai | 4.8    | 4        | 4
6193     | Mumbai | 4.8    | 5        | 4
4025     | Mumbai | 4.7    | 6        | 6
8374     | Mumbai | 4.6    | 7        | 7

What just happened?

Three orders tied with rating 5.0. ROW_NUMBER arbitrarily assigned ranks 1,2,3 while RANK gave all three the same rank 1, then skipped to rank 4. Notice how standard_rank jumps from 1 directly to 4. Try this: Add DENSE_RANK to see continuous numbering 1,1,1,2,2,3.

Running Totals and Cumulative Analysis

Running totals show progressive accumulation — perfect for tracking daily revenue growth or customer acquisition. Frame specifications control exactly which rows get included in each calculation.

The default frame is ROWS UNBOUNDED PRECEDING — from the start of partition to current row. But you can define custom windows like "last 7 days" or "3 rows before and after".

# Calculate running total of daily revenue
SELECT date, 
       SUM(revenue) as daily_revenue,
       # Running total from start of data to current row
       SUM(SUM(revenue)) OVER (ORDER BY date) as running_total
FROM dataplexa_ecommerce 
# First aggregate by date to get daily totals
GROUP BY date 
ORDER BY date
LIMIT 7;

date       | daily_revenue | running_total
-----------|---------------|---------------
2023-01-01 | 245670.50    | 245670.50
2023-01-02 | 189340.25    | 435010.75
2023-01-03 | 312580.75    | 747591.50
2023-01-04 | 198760.40    | 946351.90
2023-01-05 | 267490.85    | 1213842.75
2023-01-06 | 234510.60    | 1448353.35
2023-01-07 | 298650.30    | 1747003.65

What just happened?

Each running_total includes all previous days plus current day. January 3rd shows ₹747,591.50 = sum of first three days (245,670 + 189,340 + 312,580). ORDER BY date ensures chronological accumulation. Try this: Add PARTITION BY product_category for separate running totals per category.

Running totals reveal consistent growth trajectory despite daily fluctuations

The blue area shows how cumulative revenue smooths out daily spikes and dips. Day 3 had the highest single-day revenue, but the running total reveals steady business growth. This visualization helps executives track progress toward quarterly targets.

Running totals also power percentage calculations. When you know each row's cumulative total and the grand total, you can calculate what percentage of total business each time period represents. This supports decision-making around resource allocation and seasonal planning.

Moving Averages and Custom Windows

Moving averages smooth out short-term fluctuations to reveal underlying trends. The ROWS BETWEEN clause defines exactly how many preceding and following rows to include in each calculation.

The scenario: Zomato's demand planning team needs 7-day moving averages to predict kitchen capacity needs. Daily orders fluctuate wildly, but weekly patterns are stable.

# 7-day moving average with custom window frame
SELECT date, 
       COUNT(*) as daily_orders,
       # Average of current row + 6 preceding rows = 7-day window
       AVG(COUNT(*)) OVER (
           ORDER BY date 
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) as moving_avg_7day
FROM dataplexa_ecommerce 
GROUP BY date;

date       | daily_orders | moving_avg_7day
-----------|--------------|----------------
2023-01-01 | 42          | 42.00
2023-01-02 | 38          | 40.00
2023-01-03 | 67          | 49.00
2023-01-04 | 31          | 44.50
2023-01-05 | 55          | 46.60
2023-01-06 | 43          | 46.00
2023-01-07 | 62          | 48.29
2023-01-08 | 39          | 47.86

What just happened?

January 7th had 62 orders, but the moving_avg_7day is 48.29 — the average of Jan 1-7. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW creates a sliding 7-day window. Early dates use shorter windows (Jan 1 uses only 1 day, Jan 2 uses 2 days). Try this: Change to 3 PRECEDING AND 3 FOLLOWING for centered 7-day average.

Window Frame Gotcha

Writing ROWS 7 PRECEDING actually creates an 8-row window (7 previous + current row). The fix: use ROWS BETWEEN 6 PRECEDING AND CURRENT ROW for exactly 7 rows.

Lead and Lag Functions

LAG and LEAD functions access values from other rows without self-joins. LAG looks backward, LEAD looks forward. Perfect for calculating period-over-period changes or comparing sequential events.

The scenario: OYO's revenue team needs month-over-month growth rates. Traditional approaches require complex self-joins and subqueries that are hard to maintain.

# Calculate month-over-month revenue growth
SELECT 
    DATE_TRUNC('month', CAST(date AS DATE)) as month,
    SUM(revenue) as monthly_revenue,
    # LAG gets the previous month's revenue
    LAG(SUM(revenue)) OVER (ORDER BY DATE_TRUNC('month', CAST(date AS DATE))) as prev_month_revenue
FROM dataplexa_ecommerce
GROUP BY DATE_TRUNC('month', CAST(date AS DATE));

month      | monthly_revenue | prev_month_revenue
-----------|-----------------|-------------------
2023-01-01 | 8947650.75     | null
2023-02-01 | 9234180.40     | 8947650.75
2023-03-01 | 9876540.20     | 9234180.40
2023-04-01 | 8654320.10     | 9876540.20
2023-05-01 | 10123450.80    | 8654320.10

# Add growth rate calculation
SELECT month, monthly_revenue, prev_month_revenue,
       # Calculate percentage change from previous month
       ROUND(
           (monthly_revenue - prev_month_revenue) / prev_month_revenue * 100, 2
       ) as growth_rate_pct
FROM previous_query_results;

month      | monthly_revenue | prev_month_revenue | growth_rate_pct
-----------|-----------------|--------------------|-----------------
2023-01-01 | 8947650.75     | null              | null
2023-02-01 | 9234180.40     | 8947650.75       | 3.20
2023-03-01 | 9876540.20     | 9234180.40       | 6.96
2023-04-01 | 8654320.10     | 9876540.20       | -12.38
2023-05-01 | 10123450.80    | 8654320.10       | 16.98

What just happened?

LAG(SUM(revenue)) retrieved the previous month's total revenue for each row. March shows 6.96% growth because (₹98.77L - ₹92.34L) / ₹92.34L = 6.96%. April's -12.38% reveals a significant decline. The first row shows null because there's no previous month. Try this: Use LAG(column, 2) to compare with two months ago.

April's decline signals need for immediate marketing intervention and Q2 strategy review

📊 Data Insight

May's 16.98% rebound suggests April was seasonal dip, not structural problem. LAG functions revealed this pattern in one query versus multiple complex joins.

Advanced Window Function Techniques

Combining multiple window functions reveals sophisticated insights. FIRST_VALUE and LAST_VALUE functions grab boundary values within each partition, while PERCENT_RANK shows relative position as percentage.

The scenario: Myntra's product team wants to identify each item's performance relative to category leaders and bottom performers. They need the highest and lowest revenue in each category alongside each product's percentile ranking.

# Advanced window analysis: percentiles and boundary values
SELECT product_name, product_category, revenue,
       # Get the highest revenue product in each category
       FIRST_VALUE(product_name) OVER (
           PARTITION BY product_category 
           ORDER BY revenue DESC
       ) as category_leader,
       # Percentile ranking within category (0.0 to 1.0)
       PERCENT_RANK() OVER (
           PARTITION BY product_category 
           ORDER BY revenue
       ) as percentile_rank
FROM dataplexa_ecommerce
WHERE product_category IN ('Electronics', 'Clothing');

product_name    | product_category | revenue   | category_leader | percentile_rank
----------------|------------------|-----------|-----------------|----------------
iPhone 14 Pro   | Electronics      | 87540.0   | iPhone 14 Pro   | 1.00
MacBook Air M2  | Electronics      | 76230.0   | iPhone 14 Pro   | 0.89
Samsung TV      | Electronics      | 45670.0   | iPhone 14 Pro   | 0.67
Sony Headphones | Electronics      | 23450.0   | iPhone 14 Pro   | 0.33
Charging Cable  | Electronics      | 890.0     | iPhone 14 Pro   | 0.00
Nike Sneakers   | Clothing         | 43650.0   | Nike Sneakers   | 1.00
Levi's Jeans    | Clothing         | 28940.0   | Nike Sneakers   | 0.75
Cotton T-Shirt  | Clothing         | 12340.0   | Nike Sneakers   | 0.50

What just happened?

FIRST_VALUE showed "iPhone 14 Pro" for all Electronics products because it's the revenue leader when sorted DESC. PERCENT_RANK of 0.89 means MacBook Air outperformed 89% of Electronics products. Percentile 0.0 = lowest performer, 1.0 = highest performer. Try this: Add LAST_VALUE to show the worst performer in each category.

Scatter plot reveals revenue distribution patterns - Electronics has wider spread with clear outliers

The scatter plot shows Electronics has much higher revenue variance than Clothing. Electronics products cluster at extremes (very high and very low performers) while Clothing shows more consistent mid-range performance. This suggests different pricing strategies may be needed for each category.

Business teams can use percentile rankings to identify underperformers for promotional campaigns or top performers for inventory prioritization. Products in the 0-25th percentile might need marketing boosts, while 75th-100th percentile products justify increased stock levels.

Pro Tip: Combine NTILE(4) with PERCENT_RANK to create quartile buckets. Products in NTILE 4 + PERCENT_RANK > 0.9 are your true champions worthy of premium shelf placement.

Quiz

Up Next

Subqueries

Build queries within queries to solve complex analytical problems that window functions can't handle alone.

← Previous Course Index Next →