Data Science
Subqueries
Master nested queries to solve complex business problems by combining multiple data lookups in single SQL statements.
What Are Subqueries?
A subquery is simply a query inside another query. Think of it like asking a question that depends on the answer to another question first. If you want to find "customers who spent more than average," you need to calculate the average first, then find customers above that number.
For a non-correlated subquery, the inner query runs completely before the outer query even starts. This makes subqueries incredibly powerful for business logic that a simple WHERE clause can't express on its own. But honestly, they can slow things down if not written carefully.
Scalar Subqueries
Return single value for comparisons
Row Subqueries
Return multiple rows for IN/EXISTS
Correlated
Reference outer query columns
Nested
Independent of outer query
Scalar Subqueries
Scalar subqueries return exactly one value. Perfect for comparisons against averages, maximums, or calculated thresholds. The catch? If your inner query accidentally returns multiple rows, the query fails with an error instead of a result.
# Find customers who spent more than average order value
SELECT customer_age, revenue
FROM dataplexa_ecommerce
WHERE revenue > (
SELECT AVG(revenue)
FROM dataplexa_ecommerce
)
LIMIT 5;

customer_age | revenue
-------------|--------
34           | 145000
28           | 167500
45           | 189000
31           | 156700
52           | 178300

What just happened?
The inner query SELECT AVG(revenue) calculated the average first (around ₹87,000), then the outer query found all customers above that threshold. Try this: Change AVG to MAX to find customers near the highest spenders.
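If you want to poke at this pattern outside the course environment, here's a minimal sketch using Python's built-in sqlite3 module. The table name and values are made up for illustration, not the real dataplexa_ecommerce data:

```python
import sqlite3

# Toy stand-in for the course table (hypothetical values)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_age INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(34, 145000), (28, 60000), (45, 189000), (31, 40000)],
)

# The inner query produces one value (the average, 108500 here);
# the outer query then compares every row against it.
rows = conn.execute(
    "SELECT customer_age, revenue FROM orders "
    "WHERE revenue > (SELECT AVG(revenue) FROM orders)"
).fetchall()
print(rows)  # only the 34- and 45-year-olds beat the average
```

Swapping AVG for MAX in the inner query would return no rows at all, since no revenue is strictly greater than the maximum.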
# Calculate what percentage above average each customer spent
SELECT customer_age,
revenue,
ROUND(((revenue - (SELECT AVG(revenue) FROM dataplexa_ecommerce))
/ (SELECT AVG(revenue) FROM dataplexa_ecommerce)) * 100, 1) as pct_above_avg
FROM dataplexa_ecommerce
WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce)
ORDER BY pct_above_avg DESC
LIMIT 5;

customer_age | revenue | pct_above_avg
-------------|---------|--------------
38           | 189500  | 117.8
42           | 178900  | 105.6
29           | 167300  | 92.1
55           | 156800  | 80.2
33           | 145600  | 67.3

What just happened?
We used the same subquery SELECT AVG(revenue) three times in one statement. The top customer spent 117% more than average! Try this: Add product_category to see which categories drive premium spending.
Most premium customers spend 100%+ above average, creating clear loyalty program tiers
The bar chart reveals a power law distribution where the highest spenders dramatically outnumber modest above-average customers. This suggests Flipkart should create multiple loyalty tiers rather than a single premium category. From a business perspective, the 156 customers spending double the average represent massive revenue potential. These ultra-premium customers likely respond to exclusive early access and personalized recommendations rather than simple discount offers.

Row Subqueries with IN and EXISTS
When you need to match against multiple values, IN and EXISTS become your best friends. The key difference? IN compares values directly while EXISTS just checks if any rows match the condition.
# Find customers in cities with high-rated products (rating > 4.5)
SELECT DISTINCT customer_age, city, product_category
FROM dataplexa_ecommerce e1
WHERE city IN (
SELECT city
FROM dataplexa_ecommerce
WHERE rating > 4.5
)
LIMIT 8;

customer_age | city      | product_category
-------------|-----------|-----------------
28           | Mumbai    | Electronics
34           | Delhi     | Clothing
45           | Bangalore | Food
29           | Mumbai    | Books
52           | Chennai   | Home
38           | Delhi     | Electronics
41           | Pune      | Clothing
33           | Mumbai    | Food

What just happened?
The subquery found all cities with ratings above 4.5, then IN matched customers from those same cities. Notice Mumbai and Delhi dominate the results. Try this: Change the rating threshold to 4.0 and see how the city mix changes.
# Using EXISTS to find customers who bought returned products
SELECT customer_age, product_name, revenue
FROM dataplexa_ecommerce e1
WHERE EXISTS (
SELECT 1
FROM dataplexa_ecommerce e2
WHERE e2.product_name = e1.product_name
AND e2.returned = TRUE
)
LIMIT 6;

customer_age | product_name       | revenue
-------------|--------------------|--------
34           | iPhone 14 Pro      | 125000
28           | Samsung Galaxy S23 | 89000
45           | MacBook Air M2     | 145000
31           | OnePlus 11         | 67000
52           | iPad Pro           | 98000
38           | Dell XPS 13        | 112000

What just happened?
EXISTS found products that have been returned by someone, then showed all customers who bought those same products. The SELECT 1 is just a placeholder - EXISTS only cares if rows match. Try this: Add NOT EXISTS to find products that were never returned.
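One subtle difference between the two is worth knowing before you reach for NOT IN: if the subquery returns even one NULL, NOT IN silently matches nothing, while NOT EXISTS keeps working. A small sqlite3 sketch with hypothetical purchases and returns tables shows it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (product TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?)", [("iPhone",), ("iPad",)])
conn.execute("CREATE TABLE returns (product TEXT)")
# Note the NULL slipped into the returns list
conn.executemany("INSERT INTO returns VALUES (?)", [("iPhone",), (None,)])

# NOT EXISTS only checks whether a matching row exists, so the NULL is harmless:
not_exists = conn.execute(
    "SELECT product FROM purchases p WHERE NOT EXISTS ("
    "  SELECT 1 FROM returns r WHERE r.product = p.product)"
).fetchall()

# NOT IN compares against every value; one NULL makes every comparison
# 'unknown', so no rows come back at all:
not_in = conn.execute(
    "SELECT product FROM purchases WHERE product NOT IN ("
    "  SELECT product FROM returns)"
).fetchall()
print(not_exists, not_in)  # iPad survives NOT EXISTS; NOT IN returns nothing
```

This NULL trap is a classic production bug, and one reason many teams standardize on EXISTS/NOT EXISTS for subquery matching.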
Mumbai and Delhi capture 58% of customers from high-performing product cities
Mumbai's dominance in high-rated cities makes sense - larger market size typically correlates with better product selection and customer service infrastructure. Delhi follows closely, suggesting these metro markets should get priority for new premium product launches. The smaller representation from Bangalore, Chennai, and Pune indicates untapped potential. Swiggy could focus expansion efforts on these cities, knowing that high ratings are achievable based on existing customer satisfaction patterns.

Correlated Subqueries
Here's where things get interesting. Correlated subqueries reference columns from the outer query, so the inner and outer queries can't run independently. The inner query runs once for every row in the outer query - this can be slow, but it enables powerful row-by-row comparisons.
The scenario: Zomato's analytics team wants to find customers who spent more than the average for their specific age group. Simple averages won't work - they need age-specific benchmarks for targeted marketing campaigns.

# Find customers who spent above average for their age group
SELECT customer_age, revenue, product_category
FROM dataplexa_ecommerce e1
WHERE revenue > (
SELECT AVG(revenue)
FROM dataplexa_ecommerce e2
WHERE e2.customer_age = e1.customer_age
)
ORDER BY customer_age
LIMIT 8;

customer_age | revenue | product_category
-------------|---------|-----------------
22           | 89500   | Electronics
24           | 67800   | Clothing
25           | 145000  | Electronics
27           | 78900   | Food
28           | 167500  | Home
29           | 134500  | Electronics
31           | 89700   | Books
32           | 156000  | Clothing

What just happened?
For each customer, the subquery calculated the average spending for that specific age (e.g., all 25-year-olds). Notice how a 25-year-old spending ₹145k makes the list, while an older customer might need an even higher amount to beat their own age group's average. Try this: Add the calculated average to see the exact thresholds per age.
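The correlation mechanics are easy to verify on a toy dataset. In this sqlite3 sketch (invented numbers, not the course data), the inner query re-runs for each outer row, averaging only the rows that share that row's age:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_age INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(25, 145000), (25, 33000), (40, 90000), (40, 110000)],
)

# e1 is the current outer row; e2.customer_age = e1.customer_age is the
# correlation, restricting the average to that row's own age group.
rows = conn.execute(
    "SELECT customer_age, revenue FROM orders e1 "
    "WHERE revenue > (SELECT AVG(revenue) FROM orders e2 "
    "                 WHERE e2.customer_age = e1.customer_age) "
    "ORDER BY customer_age"
).fetchall()
print(rows)  # 145000 beats the age-25 average; 110000 beats the age-40 average
```

The 40-year-old spending ₹110k makes the cut even though the 25-year-old's ₹145k is higher in absolute terms: each row is judged only against its own group.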
# Show the age group average alongside each customer
SELECT customer_age,
revenue,
(SELECT ROUND(AVG(revenue), 0)
FROM dataplexa_ecommerce e2
WHERE e2.customer_age = e1.customer_age) as age_group_avg
FROM dataplexa_ecommerce e1
WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce e2 WHERE e2.customer_age = e1.customer_age)
LIMIT 6;

customer_age | revenue | age_group_avg
-------------|---------|--------------
22           | 89500   | 67000
24           | 67800   | 54000
25           | 145000  | 89000
27           | 78900   | 65000
28           | 167500  | 98000
29           | 134500  | 87000

What just happened?
Now you can see the exact age-group averages each customer beat. The 22-year-old spending ₹89k beat their age average of ₹67k by 33%. Try this: Calculate the percentage above age-group average using the formula from earlier.
📊 Data Insight
Spending patterns vary dramatically by age: 22-year-olds average ₹67k while 28-year-olds average ₹98k. Age-segmented campaigns could boost conversion rates by 40-60% compared to generic targeting.
Performance Considerations
Subqueries can absolutely crush performance if you're not careful. The cases that trip people up usually involve correlated subqueries on large tables without proper indexing. A subquery that runs once is fine; one that runs for every row of a million-row table will ruin your day.
Common Performance Mistake
Using correlated subqueries without indexes - Always ensure the columns referenced in WHERE clauses have indexes. A correlated subquery on customer_age needs an index on customer_age, or it will scan the entire table for every outer row.
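You can watch the planner switch strategies yourself. This sqlite3 sketch (index and table names are illustrative) uses EXPLAIN QUERY PLAN to confirm that the correlated lookup becomes an index search once the index exists; exact plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_age INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(age, age * 1000) for age in range(18, 60)],
)

# Index the column the correlated WHERE clause filters on
conn.execute("CREATE INDEX idx_age ON orders (customer_age)")

# The inner query of the correlated subquery, in isolation:
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT AVG(revenue) FROM orders WHERE customer_age = 25"
).fetchall()
print(plan)  # plan text should mention idx_age rather than a full table scan
```

Without the index, the same plan reports a full scan of orders, and that scan would repeat once per outer row.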
✓ When to Use Subqueries
Complex business logic, one-time calculations, EXISTS checks
Consider JOINs Instead
Large datasets, repeated calculations, performance-critical queries
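The usual refactor is to pre-aggregate once in a derived table and JOIN against it, instead of recomputing the aggregate per row. A hedged sqlite3 sketch (toy data) shows both forms returning identical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_age INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(25, 145000), (25, 33000), (40, 90000), (40, 110000)],
)

# Correlated form: the average is recomputed for every outer row
sub = conn.execute(
    "SELECT customer_age, revenue FROM orders e1 "
    "WHERE revenue > (SELECT AVG(revenue) FROM orders e2 "
    "                 WHERE e2.customer_age = e1.customer_age)"
).fetchall()

# JOIN form: compute each age group's average once, then join
joined = conn.execute(
    "SELECT o.customer_age, o.revenue "
    "FROM orders o "
    "JOIN (SELECT customer_age, AVG(revenue) AS avg_rev "
    "      FROM orders GROUP BY customer_age) a "
    "  ON a.customer_age = o.customer_age "
    "WHERE o.revenue > a.avg_rev"
).fetchall()

print(sorted(sub) == sorted(joined))  # True: same rows, one aggregation pass
```

The JOIN version does the GROUP BY once, so its cost grows with table size rather than with (rows × group lookups).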
JOINs maintain performance advantage as data grows, especially beyond 100K rows
The performance gap becomes brutal at scale. Subqueries hit a performance wall around 100k rows where JOINs still perform reasonably. This logarithmic chart actually understates the difference - in real applications, poorly written subqueries can take 10x longer than optimized JOINs. Why does this matter? Because most data analysts start with subqueries (they're easier to reason about) but production systems demand JOIN performance. Learning when to refactor a working subquery into a JOIN becomes a crucial skill as you handle larger datasets.

Advanced Subquery Patterns
The most powerful subqueries solve business problems that would take multiple steps otherwise. Think of finding the second-highest value, customers with above-median purchases, or products that outsell their category average.
The scenario: HDFC Bank's credit card team wants to identify customers whose spending is in the top 20% for their city. They need city-specific percentiles for risk assessment and credit limit decisions.

# Find customers in top 20% spending for their city
SELECT city, customer_age, revenue
FROM dataplexa_ecommerce e1
WHERE revenue >= (
SELECT PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY revenue)
FROM dataplexa_ecommerce e2
WHERE e2.city = e1.city
)
ORDER BY city, revenue DESC
LIMIT 10;

city      | customer_age | revenue
----------|--------------|--------
Bangalore | 34           | 189500
Bangalore | 28           | 167800
Bangalore | 45           | 156000
Chennai   | 52           | 178300
Chennai   | 38           | 145600
Delhi     | 41           | 195000
Delhi     | 33           | 176500
Delhi     | 29           | 159000
Mumbai    | 47           | 198500
Mumbai    | 35           | 187200
What just happened?
PERCENTILE_CONT(0.8) found the 80th percentile spending for each city, then we selected customers above that threshold. Notice Delhi and Mumbai have the highest top-tier spenders. Try this: Change to 0.95 to find the true elite spenders per city.
# Find products that sell above their category average
SELECT product_category,
COUNT(*) as high_performers,
ROUND(AVG(revenue), 0) as avg_revenue
FROM dataplexa_ecommerce e1
WHERE revenue > (
SELECT AVG(revenue)
FROM dataplexa_ecommerce e2
WHERE e2.product_category = e1.product_category
)
GROUP BY product_category
ORDER BY avg_revenue DESC;

product_category | high_performers | avg_revenue
-----------------|-----------------|------------
Electronics      | 234             | 156780
Home             | 167             | 134500
Clothing         | 198             | 118900
Books            | 145             | 89600
Food             | 123             | 76800
What just happened?
We identified above-average performers within each category, then grouped to see category-level patterns. Electronics has both the most high-performers (234) and highest average revenue (₹1.57L). Try this: Add product_name to see specific high-performing items per category.
📊 Data Insight
Electronics dominates with 234 above-average products generating ₹1.57L average revenue - nearly 2x higher than Food category. This suggests premium electronics drive disproportionate platform value.
Pro tip: Combine subqueries with window functions for the ultimate analytical power. Use subqueries to filter your dataset, then apply ROW_NUMBER() or RANK() for sophisticated business rankings that would be impossible with simple GROUP BY queries.
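Here's a minimal sketch of that combination in sqlite3 (invented city data; requires SQLite 3.25+ for window functions): a scalar subquery filters to above-average rows, then RANK() orders the survivors within each city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Mumbai", 198500), ("Mumbai", 50000),
     ("Delhi", 195000), ("Delhi", 80000)],
)

# WHERE (with the subquery) filters first; the window function then
# ranks only the rows that survived the filter, per city.
rows = conn.execute(
    "SELECT city, revenue, "
    "       RANK() OVER (PARTITION BY city ORDER BY revenue DESC) AS rnk "
    "FROM orders "
    "WHERE revenue > (SELECT AVG(revenue) FROM orders) "
    "ORDER BY city"
).fetchall()
print(rows)  # one above-average order per city, each ranked 1 in its city
```

A plain GROUP BY couldn't produce this per-row ranking: grouping collapses rows, while window functions rank them without collapsing.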
Quiz
1. In this query: SELECT customer_age FROM dataplexa_ecommerce WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce) - what happens first?
2. What's the key difference between EXISTS and IN subqueries when checking for matching records?
3. This correlated subquery is running slowly on a 1M row table: WHERE revenue > (SELECT AVG(revenue) FROM dataplexa_ecommerce e2 WHERE e2.customer_age = e1.customer_age). What's the most likely performance fix?
Up Next
Relational Databases
Master table relationships, foreign keys, and database design principles that make complex subqueries possible in the first place.