Data Science
Subqueries
Master nested queries to solve complex business problems by combining multiple data lookups in single SQL statements.
What Are Subqueries?
A subquery is simply a query inside another query. Think of it like asking a question that depends on the answer to another question first. If you want to find "customers who spent more than average," you need to calculate the average first, then find customers above that number.
For a non-correlated subquery, the inner query runs completely before the outer query even starts. This makes subqueries incredibly powerful for business logic that a simple WHERE clause can't express on its own. But honestly, they can slow things down if not written carefully.
Scalar Subqueries
Return single value for comparisons
Row Subqueries
Return multiple rows for IN/EXISTS
Correlated
Reference outer query columns
Nested
Independent of outer query
Scalar Subqueries
Scalar subqueries return exactly one value. Perfect for comparisons against averages, maximums, or calculated thresholds. The catch? If your inner query accidentally returns multiple rows, the query fails with an error instead of a result.
# Find customers who spent more than average order value
SELECT customer_age, revenue
FROM dataplexa_ecommerce
WHERE revenue > (
SELECT AVG(revenue)
FROM dataplexa_ecommerce
)
LIMIT 5;

customer_age | revenue
-------------|--------
34           | 145000
28           | 167500
45           | 189000
31           | 156700
52           | 178300

What just happened?
The inner query SELECT AVG(revenue) calculated the average first (around ₹87,000), then the outer query found all customers above that threshold. Try this: Change AVG to MAX to find customers near the highest spenders.
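If you want to poke at this pattern outside the course environment, here's a minimal sketch using Python's built-in sqlite3 module. The table name and values are made up for illustration, not the real dataplexa_ecommerce data:

```python
import sqlite3

# Toy stand-in for the course table (hypothetical values)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_age INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(34, 145000), (28, 60000), (45, 189000), (31, 40000)],
)

# The inner query produces one value (the average, 108500 here);
# the outer query then compares every row against it.
rows = conn.execute(
    "SELECT customer_age, revenue FROM orders "
    "WHERE revenue > (SELECT AVG(revenue) FROM orders)"
).fetchall()
print(rows)  # only the 34- and 45-year-olds beat the average
```

Swapping AVG for MAX in the inner query would return no rows at all, since no revenue is strictly greater than the maximum.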
# Calculate what percentage above average each customer spent
SELECT customer_age,
revenue,
ROUND(((revenue - (SELECT AVG(revenue) FROM dataplexa_ecommerce))
/ (SELECT AVG(revenue) FROM dataplexa_ecommerce)) * 100, 1) as pct_above_avg
FROM dataplexa_ecommerce
WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce)
ORDER BY pct_above_avg DESC
LIMIT 5;

customer_age | revenue | pct_above_avg
-------------|---------|--------------
38           | 189500  | 117.8
42           | 178900  | 105.6
29           | 167300  | 92.1
55           | 156800  | 80.2
33           | 145600  | 67.3

What just happened?
We used the same subquery SELECT AVG(revenue) three times in one statement. The top customer spent 117% more than average! Try this: Add product_category to see which categories drive premium spending.
Most premium customers spend 100%+ above average, creating clear loyalty program tiers
The bar chart reveals a power law distribution where the highest spenders dramatically outnumber modest above-average customers. This suggests Flipkart should create multiple loyalty tiers rather than a single premium category. From a business perspective, the 156 customers spending double the average represent massive revenue potential. These ultra-premium customers likely respond to exclusive early access and personalized recommendations rather than simple discount offers.

Row Subqueries with IN and EXISTS
When you need to match against multiple values, IN and EXISTS become your best friends. The key difference? IN compares values directly while EXISTS just checks if any rows match the condition.
# Find customers in cities with high-rated products (rating > 4.5)
SELECT DISTINCT customer_age, city, product_category
FROM dataplexa_ecommerce e1
WHERE city IN (
SELECT city
FROM dataplexa_ecommerce
WHERE rating > 4.5
)
LIMIT 8;

customer_age | city      | product_category
-------------|-----------|-----------------
28           | Mumbai    | Electronics
34           | Delhi     | Clothing
45           | Bangalore | Food
29           | Mumbai    | Books
52           | Chennai   | Home
38           | Delhi     | Electronics
41           | Pune      | Clothing
33           | Mumbai    | Food

What just happened?
The subquery found all cities with ratings above 4.5, then IN matched customers from those same cities. Notice Mumbai and Delhi dominate the results. Try this: Change the rating threshold to 4.0 and see how the city mix changes.
# Using EXISTS to find customers who bought returned products
SELECT customer_age, product_name, revenue
FROM dataplexa_ecommerce e1
WHERE EXISTS (
SELECT 1
FROM dataplexa_ecommerce e2
WHERE e2.product_name = e1.product_name
AND e2.returned = TRUE
)
LIMIT 6;

customer_age | product_name       | revenue
-------------|--------------------|--------
34           | iPhone 14 Pro      | 125000
28           | Samsung Galaxy S23 | 89000
45           | MacBook Air M2     | 145000
31           | OnePlus 11         | 67000
52           | iPad Pro           | 98000
38           | Dell XPS 13        | 112000

What just happened?
EXISTS found products that have been returned by someone, then showed all customers who bought those same products. The SELECT 1 is just a placeholder - EXISTS only cares if rows match. Try this: Add NOT EXISTS to find products that were never returned.
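One subtle difference between the two is worth knowing before you reach for NOT IN: if the subquery returns even one NULL, NOT IN silently matches nothing, while NOT EXISTS keeps working. A small sqlite3 sketch with hypothetical purchases and returns tables shows it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (product TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?)", [("iPhone",), ("iPad",)])
conn.execute("CREATE TABLE returns (product TEXT)")
# Note the NULL slipped into the returns list
conn.executemany("INSERT INTO returns VALUES (?)", [("iPhone",), (None,)])

# NOT EXISTS only checks whether a matching row exists, so the NULL is harmless:
not_exists = conn.execute(
    "SELECT product FROM purchases p WHERE NOT EXISTS ("
    "  SELECT 1 FROM returns r WHERE r.product = p.product)"
).fetchall()

# NOT IN compares against every value; one NULL makes every comparison
# 'unknown', so no rows come back at all:
not_in = conn.execute(
    "SELECT product FROM purchases WHERE product NOT IN ("
    "  SELECT product FROM returns)"
).fetchall()
print(not_exists, not_in)  # iPad survives NOT EXISTS; NOT IN returns nothing
```

This NULL trap is a classic production bug, and one reason many teams standardize on EXISTS/NOT EXISTS for subquery matching.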
Mumbai and Delhi capture 58% of customers from high-performing product cities
Mumbai's dominance in high-rated cities makes sense - larger market size typically correlates with better product selection and customer service infrastructure. Delhi follows closely, suggesting these metro markets should get priority for new premium product launches. The smaller representation from Bangalore, Chennai, and Pune indicates untapped potential. Swiggy could focus expansion efforts on these cities, knowing that high ratings are achievable based on existing customer satisfaction patterns.

Correlated Subqueries
Here's where things get interesting. Correlated subqueries reference columns from the outer query, so the inner and outer queries can't run independently. The inner query runs once for every row in the outer query - this can be slow, but it enables powerful row-by-row comparisons.
The scenario: Zomato's analytics team wants to find customers who spent more than the average for their specific age group. Simple averages won't work - they need age-specific benchmarks for targeted marketing campaigns.

# Find customers who spent above average for their age group
SELECT customer_age, revenue, product_category
FROM dataplexa_ecommerce e1
WHERE revenue > (
SELECT AVG(revenue)
FROM dataplexa_ecommerce e2
WHERE e2.customer_age = e1.customer_age
)
ORDER BY customer_age
LIMIT 8;

customer_age | revenue | product_category
-------------|---------|-----------------
22           | 89500   | Electronics
24           | 67800   | Clothing
25           | 145000  | Electronics
27           | 78900   | Food
28           | 167500  | Home
29           | 134500  | Electronics
31           | 89700   | Books
32           | 156000  | Clothing

What just happened?
For each customer, the subquery calculated the average spending for that specific age (e.g., all 25-year-olds). Notice how a 25-year-old spending ₹145k makes the list, while an older customer might need an even higher amount to beat their own age group's average. Try this: Add the calculated average to see the exact thresholds per age.
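The correlation mechanics are easy to verify on a toy dataset. In this sqlite3 sketch (invented numbers, not the course data), the inner query re-runs for each outer row, averaging only the rows that share that row's age:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_age INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(25, 145000), (25, 33000), (40, 90000), (40, 110000)],
)

# e1 is the current outer row; e2.customer_age = e1.customer_age is the
# correlation, restricting the average to that row's own age group.
rows = conn.execute(
    "SELECT customer_age, revenue FROM orders e1 "
    "WHERE revenue > (SELECT AVG(revenue) FROM orders e2 "
    "                 WHERE e2.customer_age = e1.customer_age) "
    "ORDER BY customer_age"
).fetchall()
print(rows)  # 145000 beats the age-25 average; 110000 beats the age-40 average
```

The 40-year-old spending ₹110k makes the cut even though the 25-year-old's ₹145k is higher in absolute terms: each row is judged only against its own group.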
# Show the age group average alongside each customer
SELECT customer_age,
revenue,
(SELECT ROUND(AVG(revenue), 0)
FROM dataplexa_ecommerce e2
WHERE e2.customer_age = e1.customer_age) as age_group_avg
FROM dataplexa_ecommerce e1
WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce e2 WHERE e2.customer_age = e1.customer_age)
LIMIT 6;

customer_age | revenue | age_group_avg
-------------|---------|--------------
22           | 89500   | 67000
24           | 67800   | 54000
25           | 145000  | 89000
27           | 78900   | 65000
28           | 167500  | 98000
29           | 134500  | 87000

What just happened?
Now you can see the exact age-group averages each customer beat. The 22-year-old spending ₹89k beat their age average of ₹67k by 33%. Try this: Calculate the percentage above age-group average using the formula from earlier.
📊 Data Insight
Spending patterns vary dramatically by age: 22-year-olds average ₹67k while 28-year-olds average ₹98k. Age-segmented campaigns could boost conversion rates by 40-60% compared to generic targeting.
Performance Considerations
Subqueries can absolutely crush performance if you're not careful. The cases that trip people up usually involve correlated subqueries on large tables without proper indexing. A subquery that runs once is fine; one that runs for every row of a million-row table will ruin your day.
Common Performance Mistake
Using correlated subqueries without indexes - Always ensure the columns referenced in WHERE clauses have indexes. A correlated subquery on customer_age needs an index on customer_age, or it will scan the entire table for every outer row.
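You can watch the planner switch strategies yourself. This sqlite3 sketch (index and table names are illustrative) uses EXPLAIN QUERY PLAN to confirm that the correlated lookup becomes an index search once the index exists; exact plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_age INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(age, age * 1000) for age in range(18, 60)],
)

# Index the column the correlated WHERE clause filters on
conn.execute("CREATE INDEX idx_age ON orders (customer_age)")

# The inner query of the correlated subquery, in isolation:
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT AVG(revenue) FROM orders WHERE customer_age = 25"
).fetchall()
print(plan)  # plan text should mention idx_age rather than a full table scan
```

Without the index, the same plan reports a full scan of orders, and that scan would repeat once per outer row.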
✓ When to Use Subqueries
Complex business logic, one-time calculations, EXISTS checks
Consider JOINs Instead
Large datasets, repeated calculations, performance-critical queries
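The usual refactor is to pre-aggregate once in a derived table and JOIN against it, instead of recomputing the aggregate per row. A hedged sqlite3 sketch (toy data) shows both forms returning identical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_age INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(25, 145000), (25, 33000), (40, 90000), (40, 110000)],
)

# Correlated form: the average is recomputed for every outer row
sub = conn.execute(
    "SELECT customer_age, revenue FROM orders e1 "
    "WHERE revenue > (SELECT AVG(revenue) FROM orders e2 "
    "                 WHERE e2.customer_age = e1.customer_age)"
).fetchall()

# JOIN form: compute each age group's average once, then join
joined = conn.execute(
    "SELECT o.customer_age, o.revenue "
    "FROM orders o "
    "JOIN (SELECT customer_age, AVG(revenue) AS avg_rev "
    "      FROM orders GROUP BY customer_age) a "
    "  ON a.customer_age = o.customer_age "
    "WHERE o.revenue > a.avg_rev"
).fetchall()

print(sorted(sub) == sorted(joined))  # True: same rows, one aggregation pass
```

The JOIN version does the GROUP BY once, so its cost grows with table size rather than with (rows × group lookups).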
JOINs maintain performance advantage as data grows, especially beyond 100K rows
The performance gap becomes brutal at scale. Subqueries hit a performance wall around 100k rows where JOINs still perform reasonably. This logarithmic chart actually understates the difference - in real applications, poorly written subqueries can take 10x longer than optimized JOINs. Why does this matter? Because most data analysts start with subqueries (they're easier to reason about) but production systems demand JOIN performance. Learning when to refactor a working subquery into a JOIN becomes a crucial skill as you handle larger datasets.

Advanced Subquery Patterns
The most powerful subqueries solve business problems that would take multiple steps otherwise. Think of finding the second-highest value, customers with above-median purchases, or products that outsell their category average.
The scenario: HDFC Bank's credit card team wants to identify customers whose spending is in the top 20% for their city. They need city-specific percentiles for risk assessment and credit limit decisions.

# Find customers in top 20% spending for their city
SELECT city, customer_age, revenue
FROM dataplexa_ecommerce e1
WHERE revenue >= (
SELECT PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY revenue)
FROM dataplexa_ecommerce e2
WHERE e2.city = e1.city
)
ORDER BY city, revenue DESC
LIMIT 10;

city      | customer_age | revenue
----------|--------------|--------
Bangalore | 34           | 189500
Bangalore | 28           | 167800
Bangalore | 45           | 156000
Chennai   | 52           | 178300
Chennai   | 38           | 145600
Delhi     | 41           | 195000
Delhi     | 33           | 176500
Delhi     | 29           | 159000
Mumbai    | 47           | 198500
Mumbai    | 35           | 187200
What just happened?
PERCENTILE_CONT(0.8) found the 80th percentile spending for each city, then we selected customers above that threshold. Notice Delhi and Mumbai have the highest top-tier spenders. Try this: Change to 0.95 to find the true elite spenders per city.
# Find products that sell above their category average
SELECT product_category,
COUNT(*) as high_performers,
ROUND(AVG(revenue), 0) as avg_revenue
FROM dataplexa_ecommerce e1
WHERE revenue > (
SELECT AVG(revenue)
FROM dataplexa_ecommerce e2
WHERE e2.product_category = e1.product_category
)
GROUP BY product_category
ORDER BY avg_revenue DESC;

product_category | high_performers | avg_revenue
-----------------|-----------------|------------
Electronics      | 234             | 156780
Home             | 167             | 134500
Clothing         | 198             | 118900
Books            | 145             | 89600
Food             | 123             | 76800
What just happened?
We identified above-average performers within each category, then grouped to see category-level patterns. Electronics has both the most high-performers (234) and highest average revenue (₹1.57L). Try this: Add product_name to see specific high-performing items per category.
📊 Data Insight
Electronics dominates with 234 above-average products generating ₹1.57L average revenue - nearly 2x higher than Food category. This suggests premium electronics drive disproportionate platform value.
Pro tip: Combine subqueries with window functions for the ultimate analytical power. Use subqueries to filter your dataset, then apply ROW_NUMBER() or RANK() for sophisticated business rankings that would be impossible with simple GROUP BY queries.
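Here's a minimal sketch of that combination in sqlite3 (invented city data; requires SQLite 3.25+ for window functions): a scalar subquery filters to above-average rows, then RANK() orders the survivors within each city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Mumbai", 198500), ("Mumbai", 50000),
     ("Delhi", 195000), ("Delhi", 80000)],
)

# WHERE (with the subquery) filters first; the window function then
# ranks only the rows that survived the filter, per city.
rows = conn.execute(
    "SELECT city, revenue, "
    "       RANK() OVER (PARTITION BY city ORDER BY revenue DESC) AS rnk "
    "FROM orders "
    "WHERE revenue > (SELECT AVG(revenue) FROM orders) "
    "ORDER BY city"
).fetchall()
print(rows)  # one above-average order per city, each ranked 1 in its city
```

A plain GROUP BY couldn't produce this per-row ranking: grouping collapses rows, while window functions rank them without collapsing.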
Quiz
1. In this query: SELECT customer_age FROM dataplexa_ecommerce WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce) - what happens first?
2. What's the key difference between EXISTS and IN subqueries when checking for matching records?
3. This correlated subquery is running slowly on a 1M row table: WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce e2 WHERE e2.customer_age = e1.customer_age). What's the most likely performance fix?
Up Next
Relational Databases
Master table relationships, foreign keys, and database design principles that make complex subqueries possible in the first place.