EDA Course
Group-By Summaries
The most common question in business data analysis isn't "what's the average?" — it's "what's the average per group?" Which city sells most? Which product returns the most profit? Which age group churns fastest? Group-by summaries answer all of these in a single line of code.
What Does "Group By" Actually Mean?
Imagine you have a spreadsheet of 1,000 sales orders. You want to know: which salesperson brought in the most revenue? You'd manually find all rows for salesperson A, add up their sales. Then do the same for B, C, D... That's exhausting.
Group-by automates exactly this. It: splits the data into groups (one per salesperson), applies a calculation to each group (sum of sales), then combines the results into one clean table. Data scientists call this the split → apply → combine pattern.
✂️
Split
Divide rows into groups by a category
🔢
Apply
Run a calculation on each group
📋
Combine
Bring results together in one table
The Dataset We'll Use
The scenario: You're an analyst at a coffee shop chain with three branches — North, South, and East. Your regional manager wants a breakdown of sales performance across branches and drink categories before the quarterly review. You have 15 orders and need to slice the data multiple ways to answer her questions.
First, let's load the data:
import pandas as pd
import numpy as np
# Coffee shop orders — 15 transactions across 3 branches
df = pd.DataFrame({
'order_id': range(1, 16),
'branch': ['North','South','East','North','East','South','North',
'East','South','North','East','South','North','East','South'],
'drink': ['Latte','Espresso','Cappuccino','Flat White','Latte',
'Cappuccino','Espresso','Flat White','Latte','Cappuccino',
'Espresso','Flat White','Latte','Cappuccino','Espresso'],
'price': [4.20, 2.80, 3.80, 3.50, 4.20, 3.80, 2.80, 3.50,
4.20, 3.80, 2.80, 3.50, 4.20, 3.80, 2.80],
'units_sold': [3, 5, 2, 4, 6, 3, 7, 2, 4, 3, 5, 6, 2, 4, 8],
'rating': [4, 5, 4, 3, 5, 4, 5, 3, 4, 4, 5, 4, 3, 4, 5]
})
# Add a revenue column — price × units sold
df['revenue'] = df['price'] * df['units_sold']
print(df.to_string(index=False))
order_id branch drink price units_sold rating revenue
1 North Latte 4.20 3 4 12.60
2 South Espresso 2.80 5 5 14.00
3 East Cappuccino 3.80 2 4 7.60
4 North Flat White 3.50 4 3 14.00
5 East Latte 4.20 6 5 25.20
6 South Cappuccino 3.80 3 4 11.40
7 North Espresso 2.80 7 5 19.60
8 East Flat White 3.50 2 3 7.00
9 South Latte 4.20 4 4 16.80
10 North Cappuccino 3.80 3 4 11.40
11 East Espresso 2.80 5 5 14.00
12 South Flat White 3.50 6 4 21.00
13 North Latte 4.20 2 3 8.40
14 East Cappuccino 3.80 4 4 15.20
15 South Espresso 2.80 8 5 22.40
What just happened?
pandas holds all 15 orders in a table. We created the revenue column by multiplying two existing columns — pandas does this row by row automatically without any loop. Now we have everything we need to start grouping.
The Basic Group-By: One Column, One Calculation
The simplest group-by picks one category column to group by and one number column to summarise. The manager's first question: which branch made the most revenue?
# .groupby('branch') splits the table into three groups — one per branch
# ['revenue'] selects just the revenue column inside each group
# .sum() adds up all revenues within each group
revenue_by_branch = df.groupby('branch')['revenue'].sum()
# .sort_values() puts the highest revenue branch at the top
print(revenue_by_branch.sort_values(ascending=False).round(2))
branch South 85.60 East 69.00 North 66.00 Name: revenue, dtype: float64
What just happened?
pandas' .groupby() split the 15 rows into 3 separate groups. .sum() added up all the revenue inside each group. Three lines of data became one clean answer: South branch leads with £85.60, just ahead of East (£69) and North (£66). That's the answer to the manager's question — in two lines of code.
Multiple Calculations at Once With .agg()
One calculation per group is useful. But usually you want several at once — total revenue and number of orders and average rating. The .agg() method lets you request as many calculations as you want in a single step. Think of it as ordering multiple items on one ticket rather than going back to the counter three times.
# .agg() lets you calculate multiple things at once for each group
# Each named argument becomes a column: total_rev='sum' means "sum of revenue, label it total_rev"
branch_summary = df.groupby('branch').agg(
total_revenue = ('revenue', 'sum'), # add up all revenue per branch
total_units = ('units_sold', 'sum'), # add up all units sold per branch
num_orders = ('order_id', 'count'), # count how many orders per branch
avg_rating = ('rating', 'mean'), # average customer rating per branch
best_order = ('revenue', 'max') # the single highest revenue order
).round(2)
print(branch_summary)
total_revenue total_units num_orders avg_rating best_order branch East 69.00 17 4 4.25 25.20 North 66.00 15 4 4.00 19.60 South 85.60 22 5 4.40 22.40
What just happened?
pandas' .agg() is the power move of group-by analysis. The named aggregation syntax — total_revenue = ('revenue', 'sum') — says "take the revenue column, apply sum, call the result total_revenue." You can list as many of these as you want.
South branch leads on revenue (£85.60) and units (22) and has the highest average rating (4.40). North has the most orders per table (4 orders, £66 total — their baskets are smaller). East branch hit the single biggest order (£25.20 for 6 lattes). One table, five insights, one second to read.
Grouping By Two Columns at Once
The manager's next question: "Which drink sells best in each branch?" This needs grouping by two columns simultaneously — branch AND drink. You just pass a list instead of a single column name.
# Pass a LIST of two column names to group by both at once
# This creates one row for every unique branch + drink combination
branch_drink = df.groupby(['branch', 'drink'])['units_sold'].sum()
print(branch_drink.sort_values(ascending=False))
branch drink South Espresso 13 North Espresso 7 East Latte 6 South Flat White 6 East Cappuccino 6 North Flat White 4 South Latte 4 South Cappuccino 3 North Latte 5 North Cappuccino 3 East Espresso 5 East Flat White 2 Name: units_sold, dtype: int64
What just happened?
pandas created a group for every unique combination of branch and drink — 12 combinations from just 15 rows. South Espresso is the clear winner at 13 units. East Flat White is the weakest at just 2. This two-level grouping shows patterns that single-column grouping would completely miss — for example, Espresso is strong at South and North but mid-range at East.
Pivot Tables — The Same Data, More Readable
The two-column groupby above is correct, but the output is a long list. When one category goes across the top and another goes down the side, a pivot table is far easier to read — it looks like a spreadsheet grid where you can scan across rows and down columns at a glance.
# pd.pivot_table() reshapes the same grouped data into a 2D grid
# index = what goes down the rows (branches)
# columns = what goes across the top (drinks)
# values = what number fills each cell
# aggfunc = how to summarise when multiple rows share the same branch+drink
pivot = pd.pivot_table(
df,
index = 'branch',
columns = 'drink',
values = 'units_sold',
aggfunc = 'sum' # add up all units for each branch+drink combination
)
print(pivot)
drink Cappuccino Espresso Flat White Latte branch East 6 5 2 6 North 3 7 4 5 South 3 13 6 4
What just happened?
pandas' pd.pivot_table() takes exactly the same data as the two-column groupby but arranges it as a grid. Now you can scan across a row to see a branch's drink mix, or down a column to compare how one drink performs across branches. South's Espresso dominance (13 units) jumps out immediately. East sells Cappuccino and Latte equally (6 each). This is the same data as before — just far more readable.
Adding Row and Column Totals
The manager will ask: "What's the overall total?" Adding margins (totals) to a pivot table turns it from a data grid into a proper summary report. One extra argument does it.
# margins=True adds a "Total" row at the bottom and a "Total" column on the right
# margins_name='Total' labels those totals (default label is 'All')
pivot_with_totals = pd.pivot_table(
df,
index = 'branch',
columns = 'drink',
values = 'revenue',
aggfunc = 'sum',
margins = True, # ← adds row totals and column totals
margins_name = 'Total'
).round(2)
print(pivot_with_totals)
drink Cappuccino Espresso Flat White Latte Total branch East 22.80 14.00 7.00 25.20 69.00 North 11.40 19.60 14.00 21.00 66.00 South 11.40 22.40 21.00 16.80 71.60 Total 45.60 56.00 42.00 63.00 206.60
What just happened?
pandas' margins=True automatically sums every row and column and adds them as a "Total" row and column. Now the manager can read: total revenue across all branches and drinks is £206.60. Latte is the top drink overall (£63). Flat White is the weakest (£42). This is literally the output you'd paste into a PowerPoint slide.
Ranked Group Summaries
Sometimes you don't just want totals — you want a ranking. Which drink is #1, #2, #3 for each branch? pandas can add rank columns to any grouped output in one line.
# First, get total revenue per drink
drink_totals = df.groupby('drink')['revenue'].sum().reset_index()
# ^^^
# .reset_index() converts the group labels back into a normal column
# (by default groupby makes the group label the index — reset_index makes it a regular column)
# Add a rank column — 1 = highest revenue drink
drink_totals['rank'] = drink_totals['revenue'].rank(ascending=False).astype(int)
# Sort by rank so #1 appears at the top
drink_totals = drink_totals.sort_values('rank')
print(drink_totals)
drink revenue rank 2 Latte 63.00 1 1 Espresso 56.00 2 0 Cappuccino 45.60 3 3 Flat White 42.00 4
What just happened?
pandas' .rank(ascending=False) assigns a ranking number to each row — 1 for the highest value, 4 for the lowest. We use ascending=False because we want rank 1 to mean "best" not "smallest." .reset_index() is a housekeeping step — after a groupby, the group label becomes the table's index (the far-left label). reset_index() moves it back into a regular column so it's easier to work with.
Latte is the top earner at £63, Flat White is last at £42. Despite Espresso being the best-seller by units (13 at South alone), Latte wins on revenue because it's priced higher (£4.20 vs £2.80). That's the kind of nuance group-by analysis surfaces instantly.
Visual Summary — Branch Revenue Breakdown
Here's a visual of the key findings from this analysis — total revenue per branch and the drink revenue breakdown.
Revenue by Branch
Revenue by Drink
Teacher's Note
groupby is the single most-used pandas method in real-world data work. Most of the questions a business asks — "which region?", "which product?", "which customer segment?" — are groupby questions. Learning to reach for it instinctively is one of the biggest productivity jumps you'll make as a data analyst.
A tip that saves a lot of confusion: after a groupby, pandas makes the group label the index of the result (the far-left label). If you want to use that label as a regular column — for sorting, filtering, or merging — always add .reset_index() at the end. It's a one-second habit that prevents a lot of head-scratching.
Practice Questions
1. Which pandas method lets you calculate multiple different statistics (sum, mean, count) for each group all in one step?
2. After a groupby, the group labels become the table's index. Which method moves them back into a regular column?
3. Which argument in pd.pivot_table() adds a row and column of totals to the output?
Quiz
1. You want to find the total sales per region. Which line of code does this correctly?
2. You have sales data grouped by both region and product. Which tool gives you a grid where regions are rows and products are columns?
3. In our coffee shop data, Espresso has the highest unit sales but Latte has the highest revenue. What explains this?
Up Next · Lesson 25
Multicollinearity
What happens when two features are too similar — and why it silently breaks linear models. Learn to detect and fix it before it damages your predictions.