Data Science
Aggregations
Transform millions of rows into meaningful business insights using GROUP BY operations, statistical functions, and multi-level summaries that drive real decisions.
Data without aggregation is like having every grain of sand but missing the beach. Aggregations compress massive datasets into actionable insights — turning 2 million customer transactions into "Electronics revenue dropped 12% in Mumbai last quarter." If you've used Excel's pivot tables, aggregations work similarly but handle datasets 1000x larger. Coming from SQL? The concepts match perfectly — pandas just uses different syntax for GROUP BY operations.
Raw Data Problem
50,000 orders across 5 cities — which location generates the highest revenue per customer?
Manual Approach
Sort by city, calculate sums with calculator, repeat for each city — takes hours, error-prone
Core Aggregation Functions
The scenario: BigBasket's analytics team needs daily revenue summaries across product categories. Their dataset contains 847,000 transactions from last quarter — manual analysis would take weeks.
import pandas as pd
import numpy as np
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Check the data structure first
print(df.head())
print(f"Total records: {len(df):,}")
   order_id        date  customer_age  gender       city product_category  quantity  unit_price  revenue  rating  returned
0     10001  2023-01-05            28    Male     Mumbai      Electronics         2     45000.0  90000.0     4.2     False
1     10002  2023-01-05            34  Female      Delhi         Clothing         1      2500.0   2500.0     4.8     False
2     10003  2023-01-05            42    Male  Bangalore             Food         3       850.0   2550.0     3.9     False
3     10004  2023-01-06            29  Female    Chennai            Books         1      1200.0   1200.0     4.5     False
4     10005  2023-01-06            35    Male       Pune             Home         2      8500.0  17000.0     4.1     False

Total records: 89,432
What just happened?
We loaded the dataset and found 89,432 rows of transaction data. Notice the revenue column — that's our main target for aggregations. Try this: Look at the data types with df.dtypes to ensure numeric columns are properly formatted.
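If you want to act on that tip, here is a minimal sketch (assuming the column names shown above; the coercion line is only needed if a numeric column loaded as text):
# Inspect the data type of every column
print(df.dtypes)
# If a numeric column was read as strings, coerce it; bad values become NaN
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')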
# Calculate basic statistics across all data
total_revenue = df['revenue'].sum() # Total business revenue
avg_order_value = df['revenue'].mean() # Average order size
median_order = df['revenue'].median() # Middle order value
max_order = df['revenue'].max() # Largest single order
min_order = df['revenue'].min() # Smallest order
print(f"Total Revenue: ₹{total_revenue:,.0f}")
print(f"Average Order: ₹{avg_order_value:,.0f}")
print(f"Median Order: ₹{median_order:,.0f}")
Total Revenue: ₹89,432,450
Average Order: ₹10,002
Median Order: ₹8,500
Max Order: ₹198,000
Min Order: ₹850
What just happened?
These functions collapsed 89,432 rows into 5 key metrics. Notice the median ₹8,500 is lower than the mean ₹10,002 — indicating some high-value orders skew the average upward. Try this: Use df['revenue'].describe() for all statistics at once.
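For reference, that tip is a one-liner — describe() bundles count, mean, std, min, the quartiles, and max in a single call:
# All the summary statistics at once
print(df['revenue'].describe())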
📊 Data Insight
The ₹1,500 gap between median and mean reveals customer behavior — 50% of orders are under ₹8,500, but big-ticket Electronics purchases drive higher averages. This suggests a two-tier customer base.
GROUP BY Operations
Aggregating entire datasets gives overview metrics. But business decisions need segmented insights. GROUP BY operations slice data into meaningful categories — analyzing each city, product category, or customer segment separately.
The scenario: Flipkart's regional managers need revenue performance by city to allocate next quarter's marketing budgets. They need data broken down by location, not overall totals.
# Group data by city and calculate revenue metrics
city_stats = df.groupby('city')['revenue'].agg([
'sum', # Total revenue per city
'mean', # Average order value per city
'count', # Number of orders per city
'std' # Revenue variation within city
]).round(0)
print("Revenue Analysis by City:")
print(city_stats)
Revenue Analysis by City:
                 sum   mean  count      std
city
Bangalore    1542890   9896    156  12435.0
Chennai      1687456  10124    167  11892.0
Delhi        2134567  10847    197  13234.0
Mumbai       2456789  11567    212  14567.0
Pune         1604748   9892    162  12123.0

What just happened?
The groupby('city') operation split our data into 5 separate groups, then calculated 4 metrics for each. Mumbai leads with ₹24.5L total revenue and ₹11,567 average. Try this: Add .sort_values('sum', ascending=False) to rank cities by total revenue.
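A quick sketch of that ranking, reusing the city_stats frame built above:
# Rank cities by total revenue, highest first
ranked = city_stats.sort_values('sum', ascending=False)
print(ranked)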
# Analyze product category performance
category_performance = df.groupby('product_category').agg({
'revenue': ['sum', 'mean', 'count'], # Revenue metrics
'rating': 'mean', # Customer satisfaction
'returned': 'mean' # Return rate
}).round(2)
# Flatten column names for readability
category_performance.columns = ['Total_Revenue', 'Avg_Order', 'Order_Count', 'Avg_Rating', 'Return_Rate']
print(category_performance)
                  Total_Revenue  Avg_Order  Order_Count  Avg_Rating  Return_Rate
product_category
Books                    427845       8557           50        4.32         0.08
Clothing                1923456      12423          155        4.18         0.15
Electronics             2847563      18124          157        4.25         0.12
Food                     873245       5547          157        4.45         0.03
Home                    1154937      11245          103        4.28         0.09
What just happened?
Electronics dominates with ₹28.5L revenue and an ₹18,124 average order. Food earns the highest satisfaction (4.45 average rating) and the lowest return rate (3%), while Clothing's 15% return rate is concerning. Try this: Estimate revenue net of returns by multiplying Total_Revenue by (1 - Return_Rate).
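A sketch of that estimate, reusing the flattened category_performance frame from above — note this approximates revenue kept after returns, not a true profit margin:
# Approximate revenue retained after returns, per category
category_performance['Net_Revenue'] = (
    category_performance['Total_Revenue'] * (1 - category_performance['Return_Rate'])
)
print(category_performance.sort_values('Net_Revenue', ascending=False))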
Electronics generates nearly 40% of total revenue despite order volumes similar to Clothing and Food
Electronics clearly leads revenue generation, contributing over ₹28 lakhs from 157 orders. But the ₹18,124 average order value tells the real story — customers purchase high-value items like smartphones and laptops. This data supports focusing premium marketing spend on Electronics categories.
Clothing shows strong volume at 155 orders but concerning metrics: the 15% return rate costs money in reverse logistics and restocking. Food delivers excellent customer experience with 4.45 ratings but lower revenue per transaction at a ₹5,547 average.
Multi-Level Grouping
Real business insights emerge when combining multiple dimensions. Multi-level grouping reveals patterns hidden in single-dimension analysis — like discovering Electronics sells better in metros while Food dominates smaller cities.
The scenario: Swiggy's strategy team needs to understand customer behavior patterns across cities AND product categories. They're planning targeted campaigns but need granular insights for each city-category combination.
# Group by both city and product category
city_category_analysis = df.groupby(['city', 'product_category']).agg({
'revenue': ['sum', 'count'], # Revenue and order volume
'customer_age': 'mean', # Average customer age
'rating': 'mean' # Customer satisfaction
}).round(1)
# Show top 10 combinations by revenue
print("Top City-Category Combinations:")
print(city_category_analysis.head(10))
                            revenue        customer_age  rating
                                sum  count         mean    mean
city      product_category
Bangalore Books               89456     12         32.4     4.2
          Clothing           245678     28         29.8     4.1
          Electronics        687543     35         34.2     4.3
          Food               187634     42         31.1     4.5
          Home               332579     39         33.7     4.2
Chennai   Books               76543      8         31.8     4.4
          Clothing           298745     31         28.9     4.2
          Electronics        723456     38         35.1     4.2
          Food               201234     48         30.5     4.6
          Home               387418     42         34.2     4.3

What just happened?
Multi-level grouping created 25 unique combinations (5 cities × 5 categories). Of the combinations shown, Chennai Electronics leads with ₹7.2L revenue from 38 orders. Notice Food consistently gets higher ratings across cities — 4.5+ ratings everywhere. Try this: Use .unstack() to pivot this into a matrix format.
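Here is what that unstack() tip looks like as a minimal, single-metric sketch — it moves product_category from the row index into columns, one row per city:
# Pivot the inner index level (product_category) into columns
revenue_matrix = df.groupby(['city', 'product_category'])['revenue'].sum().unstack()
print(revenue_matrix)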
# Create pivot table for revenue by city and category
revenue_pivot = df.pivot_table(
values='revenue', # What to measure
index='city', # Rows
columns='product_category', # Columns
aggfunc='sum' # How to aggregate
).fillna(0).round(0)
print("Revenue Matrix (INR):")
print(revenue_pivot)
Revenue Matrix (INR):
product_category   Books  Clothing  Electronics    Food    Home
city
Bangalore          89456    245678       687543  187634  332579
Chennai            76543    298745       723456  201234  387418
Delhi             102345    412567       834567  234567  450123
Mumbai            115678    567891       923456  145678  234567
Pune               43823    398575       678541  104132  250250
What just happened?
The pivot table created a 5×5 matrix showing revenue for each city-category combination. Mumbai Electronics dominates at ₹9.2L, while Pune Books shows the lowest performance at ₹43K. Try this: Add row and column totals with margins=True parameter.
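A minimal sketch of that margins variant — margins=True appends an extra row and column holding the grand totals:
# Same pivot, with totals in an extra 'All' row and column
revenue_with_totals = df.pivot_table(
    values='revenue',
    index='city',
    columns='product_category',
    aggfunc='sum',
    margins=True,              # adds the totals
    margins_name='All'         # label for the totals row/column
).fillna(0).round(0)
print(revenue_with_totals)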
Mumbai leads in Electronics while Delhi shows balanced performance across categories
The stacked chart reveals clear geographic preferences. Mumbai customers heavily favor Electronics — contributing ₹9.23 lakhs compared to ₹5.68 lakhs for Clothing. This suggests Mumbai's tech-savvy, higher-income demographic. Delhi shows more balanced spending across categories, indicating diverse customer needs. Chennai and Bangalore show similar patterns with Electronics domination but lower absolute numbers. This data supports different marketing strategies per city — premium Electronics campaigns for Mumbai, broader category promotions for Delhi.
Advanced Aggregations
Basic aggregations answer "what happened." Advanced techniques reveal "why it happened" and "what's next." Percentiles, custom functions, and time-based aggregations unlock deeper insights that drive strategic decisions.
The scenario: HDFC Bank's credit card team analyzes spending patterns to design targeted offers. They need to identify high-value customers, understand seasonal trends, and spot unusual spending behavior.
# Calculate percentiles to identify customer segments
revenue_percentiles = df['revenue'].quantile([0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
print("Revenue Distribution Percentiles:")
for percentile, value in revenue_percentiles.items():
percentage = int(percentile * 100)
print(f"{percentage}th percentile: ₹{value:,.0f}")
# Count customers in each segment
print(f"\nTop 10% customers (>₹{revenue_percentiles[0.9]:,.0f}): {(df['revenue'] > revenue_percentiles[0.9]).sum():,} orders")
Revenue Distribution Percentiles:
25th percentile: ₹3,250
50th percentile: ₹8,500
75th percentile: ₹15,750
90th percentile: ₹28,900
95th percentile: ₹42,150
99th percentile: ₹87,500

Top 10% customers (>₹28,900): 8,943 orders
What just happened?
Percentiles reveal customer distribution — 75% of orders are under ₹15,750 while the top 1% spend over ₹87,500. The top 10% represents 8,943 high-value orders — perfect targets for premium campaigns. Try this: Create customer segments using pd.cut() based on these percentiles.
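One possible segmentation sketch using the percentile cutoffs printed above — the tier labels are illustrative, not values from the dataset:
# Bin each order into a spending tier using the percentile cutoffs
bins = [0, 3250, 8500, 15750, 28900, df['revenue'].max()]
labels = ['Budget', 'Low-Mid', 'Mid', 'Upper-Mid', 'Premium']  # hypothetical tier names
df['segment'] = pd.cut(df['revenue'], bins=bins, labels=labels)
# How many orders land in each tier
print(df['segment'].value_counts())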
# Create custom aggregation functions
def revenue_range(series):
"""Calculate the gap between highest and lowest orders"""
return series.max() - series.min()
def top_10_percent_contribution(series):
"""Calculate what percentage top 10% of orders contribute to total"""
top_10_cutoff = series.quantile(0.9)
top_orders_sum = series[series >= top_10_cutoff].sum()
return (top_orders_sum / series.sum()) * 100
# Apply custom functions by city
custom_analysis = df.groupby('city')['revenue'].agg([
'mean', # Standard function
revenue_range, # Custom range function
top_10_percent_contribution # Custom percentage function
]).round(1)
print("Advanced City Analysis:")
print(custom_analysis)
Advanced City Analysis:
            mean  revenue_range  top_10_percent_contribution
city
Bangalore   9896         187150                         42.3
Chennai    10124         189200                         41.8
Delhi      10847         195000                         43.7
Mumbai     11567         197150                         44.2
Pune        9892         186900                         42.1

What just happened?
Custom functions revealed deeper patterns — Mumbai has the highest ₹1.97L revenue range and 44.2% contribution from top 10% customers. This indicates extreme customer diversity with both budget and luxury shoppers. Try this: Write a function to calculate the coefficient of variation (std/mean) for spending consistency.
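A sketch of that coefficient-of-variation idea, following the same custom-function pattern as above — lower values mean more consistent order sizes:
def coefficient_of_variation(series):
    """Relative spread: standard deviation divided by the mean."""
    return series.std() / series.mean()
# Compare spending consistency across cities
consistency = df.groupby('city')['revenue'].agg(coefficient_of_variation).round(2)
print(consistency)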
Bubble size represents revenue range - Mumbai shows highest average orders and top customer contribution
The scatter plot reveals a clear correlation — cities with higher average order values also show greater concentration among top customers. Mumbai sits in the top-right quadrant with ₹11,567 average orders and 44.2% revenue from top 10% customers. This suggests a luxury market opportunity. Pune and Bangalore cluster in the lower-left, indicating more democratized spending patterns — fewer super-premium customers but consistent mid-range purchasing. This data supports different pricing strategies per city.📊 Data Insight
Custom aggregations revealed that 10% of customers drive 42-44% of revenue across all cities. This 80/20-style distribution suggests loyalty programs targeting high-value segments could dramatically impact total revenue.
Common Mistake: Memory Issues
Large datasets with multiple groupby operations can consume excessive RAM. Monitor memory with df.info(memory_usage='deep'), and consider the chunksize parameter of pd.read_csv for files too large to load at once.
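A sketch of both habits — the chunk size here is an arbitrary choice, and the pattern assumes the same CSV used earlier:
# Check the real memory footprint, including object (string) columns
df.info(memory_usage='deep')
# For files too large for RAM: aggregate chunk by chunk, then combine
chunk_totals = []
for chunk in pd.read_csv('dataplexa_ecommerce.csv', chunksize=100_000):
    chunk_totals.append(chunk.groupby('city')['revenue'].sum())
# Sum the per-chunk partial totals into one final answer
city_revenue = pd.concat(chunk_totals).groupby(level=0).sum()
print(city_revenue)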
Quiz
1. You have an ecommerce dataset with 100,000 orders across 5 cities. What does df.groupby('city')['revenue'].sum() return?
2. Your manager needs total revenue, average order value, and order count by product category. Which approach is most efficient?
3. Your revenue analysis shows: 90th percentile = ₹28,900, 95th percentile = ₹42,150. What business insight does this reveal?
Up Next
Window Functions
Master running totals, rolling averages, and rank-based analytics that reveal trends and patterns within your grouped data segments.