Data Science Lesson 36 – Subqueries | Dataplexa
SQL · Lesson 36

Subqueries

Master nested queries to solve complex business problems by combining multiple data lookups in single SQL statements.

1
Inner Query Executes First
2
Result Feeds Outer Query
3
Final Answer Retrieved

What Are Subqueries?

A subquery is simply a query inside another query. Think of it like asking a question that depends on the answer to another question first. If you want to find "customers who spent more than average," you need to calculate the average first, then find customers above that number.

The inner query runs completely before the outer query even starts. This makes subqueries incredibly powerful for complex business logic that would be impossible with simple WHERE clauses. But honestly, they can slow things down if not written carefully.

Scalar Subqueries

Return single value for comparisons

Row Subqueries

Return multiple rows for IN/EXISTS

Correlated

Reference outer query columns

Nested

Independent of outer query

Scalar Subqueries

Scalar subqueries return exactly one value. Perfect for comparisons against averages, maximums, or calculated thresholds. The catch? If your inner query accidentally returns multiple rows, the whole thing crashes.

# Find customers who spent more than average order value
SELECT customer_age, revenue
FROM dataplexa_ecommerce 
WHERE revenue > (
    SELECT AVG(revenue) 
    FROM dataplexa_ecommerce
)
LIMIT 5;

What just happened?

The inner query SELECT AVG(revenue) calculated the average first (around ₹87,000), then the outer query found all customers above that threshold. Try this: Change AVG to MAX to find customers near the highest spenders.

The scenario: Flipkart's pricing team needs to identify premium customers for their loyalty program. They want anyone spending above average to get special discounts during the festive season.
# Calculate what percentage above average each customer spent
SELECT customer_age,
       revenue,
       ROUND(((revenue - (SELECT AVG(revenue) FROM dataplexa_ecommerce)) 
              / (SELECT AVG(revenue) FROM dataplexa_ecommerce)) * 100, 1) as pct_above_avg
FROM dataplexa_ecommerce 
WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce)
ORDER BY pct_above_avg DESC
LIMIT 5;

What just happened?

We used the same subquery SELECT AVG(revenue) three times in one statement. The top customer spent 117% more than average! Try this: Add product_category to see which categories drive premium spending.

Most premium customers spend 100%+ above average, creating clear loyalty program tiers

The bar chart reveals a power law distribution where the highest spenders dramatically outnumber modest above-average customers. This suggests Flipkart should create multiple loyalty tiers rather than a single premium category. From a business perspective, the 156 customers spending double the average represent massive revenue potential. These ultra-premium customers likely respond to exclusive early access and personalized recommendations rather than simple discount offers.

Row Subqueries with IN and EXISTS

When you need to match against multiple values, IN and EXISTS become your best friends. The key difference? IN compares values directly while EXISTS just checks if any rows match the condition.

The scenario: Swiggy wants to find customers who ordered from the same cities where their top-rated restaurants are located. They need to cross-reference customer cities with high-performing restaurant locations.
# Find customers in cities with high-rated products (rating > 4.5)
SELECT DISTINCT customer_age, city, product_category
FROM dataplexa_ecommerce e1
WHERE city IN (
    SELECT city 
    FROM dataplexa_ecommerce 
    WHERE rating > 4.5
)
LIMIT 8;

What just happened?

The subquery found all cities with ratings above 4.5, then IN matched customers from those same cities. Notice Mumbai and Delhi dominate the results. Try this: Change the rating threshold to 4.0 and see how the city mix changes.

# Using EXISTS to find customers who bought returned products
SELECT customer_age, product_name, revenue
FROM dataplexa_ecommerce e1
WHERE EXISTS (
    SELECT 1 
    FROM dataplexa_ecommerce e2 
    WHERE e2.product_name = e1.product_name 
    AND e2.returned = TRUE
)
LIMIT 6;

What just happened?

EXISTS found products that have been returned by someone, then showed all customers who bought those same products. The SELECT 1 is just a placeholder - EXISTS only cares if rows match. Try this: Add NOT EXISTS to find products that were never returned.

Mumbai and Delhi capture 58% of customers from high-performing product cities

Mumbai's dominance in high-rated cities makes sense - larger market size typically correlates with better product selection and customer service infrastructure. Delhi follows closely, suggesting these metro markets should get priority for new premium product launches. The smaller representation from Bangalore, Chennai, and Pune indicates untapped potential. Swiggy could focus expansion efforts on these cities, knowing that high ratings are achievable based on existing customer satisfaction patterns.

Correlated Subqueries

Here's where things get interesting. Correlated subqueries reference columns from the outer query, creating a dependency loop. The inner query runs once for every row in the outer query - this can be slow but enables powerful row-by-row comparisons.

The scenario: Zomato's analytics team wants to find customers who spent more than the average for their specific age group. Simple averages won't work - they need age-specific benchmarks for targeted marketing campaigns.
# Find customers who spent above average for their age group
SELECT customer_age, revenue, product_category
FROM dataplexa_ecommerce e1
WHERE revenue > (
    SELECT AVG(revenue) 
    FROM dataplexa_ecommerce e2 
    WHERE e2.customer_age = e1.customer_age
)
ORDER BY customer_age
LIMIT 8;

What just happened?

For each customer, the subquery calculated the average spending for that specific age (e.g., all 25-year-olds). Notice how a 25-year-old spending ₹145k makes the list while higher amounts might not for older ages. Try this: Add the calculated average to see the exact thresholds per age.

# Show the age group average alongside each customer
SELECT customer_age, 
       revenue,
       (SELECT ROUND(AVG(revenue), 0) 
        FROM dataplexa_ecommerce e2 
        WHERE e2.customer_age = e1.customer_age) as age_group_avg
FROM dataplexa_ecommerce e1
WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce e2 WHERE e2.customer_age = e1.customer_age)
LIMIT 6;

What just happened?

Now you can see the exact age-group averages each customer beat. The 22-year-old spending ₹89k beat their age average of ₹67k by 33%. Try this: Calculate the percentage above age-group average using the formula from earlier.

📊 Data Insight

Spending patterns vary dramatically by age: 22-year-olds average ₹67k while 28-year-olds average ₹98k. Age-segmented campaigns could boost conversion rates by 40-60% compared to generic targeting.

Performance Considerations

Subqueries can absolutely crush performance if you're not careful. The 10% of cases that trip everyone up usually involve correlated subqueries on large tables without proper indexing. A subquery that runs once is fine - one that runs for every row in a million-row table will ruin your day.

Common Performance Mistake

Using correlated subqueries without indexes - Always ensure the columns referenced in WHERE clauses have indexes. A correlated subquery on customer_age needs an index on customer_age, or it will scan the entire table for every outer row.

✓ When to Use Subqueries

Complex business logic, one-time calculations, EXISTS checks

Consider JOINs Instead

Large datasets, repeated calculations, performance-critical queries

JOINs maintain performance advantage as data grows, especially beyond 100K rows

The performance gap becomes brutal at scale. Subqueries hit a performance wall around 100k rows where JOINs still perform reasonably. This logarithmic chart actually understates the difference - in real applications, poorly written subqueries can take 10x longer than optimized JOINs. Why does this matter? Because most data analysts start with subqueries (they're easier to reason about) but production systems demand JOIN performance. Learning when to refactor a working subquery into a JOIN becomes a crucial skill as you handle larger datasets.

Advanced Subquery Patterns

The most powerful subqueries solve business problems that would take multiple steps otherwise. Think of finding the second-highest value, customers with above-median purchases, or products that outsell their category average.

The scenario: HDFC Bank's credit card team wants to identify customers whose spending is in the top 20% for their city. They need city-specific percentiles for risk assessment and credit limit decisions.
# Find customers in top 20% spending for their city
SELECT city, customer_age, revenue
FROM dataplexa_ecommerce e1
WHERE revenue >= (
    SELECT PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY revenue)
    FROM dataplexa_ecommerce e2 
    WHERE e2.city = e1.city
)
ORDER BY city, revenue DESC
LIMIT 10;

What just happened?

PERCENTILE_CONT(0.8) found the 80th percentile spending for each city, then we selected customers above that threshold. Notice Delhi and Mumbai have the highest top-tier spenders. Try this: Change to 0.95 to find the true elite spenders per city.

# Find products that sell above their category average
SELECT product_category, 
       COUNT(*) as high_performers,
       ROUND(AVG(revenue), 0) as avg_revenue
FROM dataplexa_ecommerce e1
WHERE revenue > (
    SELECT AVG(revenue) 
    FROM dataplexa_ecommerce e2 
    WHERE e2.product_category = e1.product_category
)
GROUP BY product_category
ORDER BY avg_revenue DESC;

What just happened?

We identified above-average performers within each category, then grouped to see category-level patterns. Electronics has both the most high-performers (234) and highest average revenue (₹1.57L). Try this: Add product_name to see specific high-performing items per category.

📊 Data Insight

Electronics dominates with 234 above-average products generating ₹1.57L average revenue - nearly 2x higher than Food category. This suggests premium electronics drive disproportionate platform value.

Pro tip: Combine subqueries with window functions for the ultimate analytical power. Use subqueries to filter your dataset, then apply ROW_NUMBER() or RANK() for sophisticated business rankings that would be impossible with simple GROUP BY queries.

Quiz

1. In this query: SELECT customer_age FROM dataplexa_ecommerce WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce) - what happens first?


2. What's the key difference between EXISTS and IN subqueries when checking for matching records?


3. This correlated subquery is running slowly on a 1M row table: WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce e2 WHERE e2.customer_age = e1.customer_age). What's the most likely performance fix?


Up Next

Relational Databases

Master table relationships, foreign keys, and database design principles that make complex subqueries possible in the first place.