EDA Lesson 24 – Group-By Summaries | Dataplexa
Intermediate Level · Lesson 24

Group-By Summaries

The most common question in business data analysis isn't "what's the average?" — it's "what's the average per group?" Which city sells most? Which product returns the most profit? Which age group churns fastest? Group-by summaries answer all of these in a single line of code.

What Does "Group By" Actually Mean?

Imagine you have a spreadsheet of 1,000 sales orders. You want to know: which salesperson brought in the most revenue? You'd manually find all rows for salesperson A, add up their sales. Then do the same for B, C, D... That's exhausting.

Group-by automates exactly this. It: splits the data into groups (one per salesperson), applies a calculation to each group (sum of sales), then combines the results into one clean table. Data scientists call this the split → apply → combine pattern.

✂️

Split

Divide rows into groups by a category

🔢

Apply

Run a calculation on each group

📋

Combine

Bring results together in one table

The Dataset We'll Use

The scenario: You're an analyst at a coffee shop chain with three branches — North, South, and East. Your regional manager wants a breakdown of sales performance across branches and drink categories before the quarterly review. You have 15 orders and need to slice the data multiple ways to answer her questions.

First, let's load the data:

import pandas as pd
import numpy as np

# Coffee shop orders — 15 transactions across 3 branches
df = pd.DataFrame({
    'order_id':   range(1, 16),
    'branch':     ['North','South','East','North','East','South','North',
                   'East','South','North','East','South','North','East','South'],
    'drink':      ['Latte','Espresso','Cappuccino','Flat White','Latte',
                   'Cappuccino','Espresso','Flat White','Latte','Cappuccino',
                   'Espresso','Flat White','Latte','Cappuccino','Espresso'],
    'price':      [4.20, 2.80, 3.80, 3.50, 4.20, 3.80, 2.80, 3.50,
                   4.20, 3.80, 2.80, 3.50, 4.20, 3.80, 2.80],
    'units_sold': [3, 5, 2, 4, 6, 3, 7, 2, 4, 3, 5, 6, 2, 4, 8],
    'rating':     [4, 5, 4, 3, 5, 4, 5, 3, 4, 4, 5, 4, 3, 4, 5]
})

# Add a revenue column — price × units sold
df['revenue'] = df['price'] * df['units_sold']

print(df.to_string(index=False))

What just happened?

pandas holds all 15 orders in a table. We created the revenue column by multiplying two existing columns — pandas does this row by row automatically without any loop. Now we have everything we need to start grouping.

The Basic Group-By: One Column, One Calculation

The simplest group-by picks one category column to group by and one number column to summarise. The manager's first question: which branch made the most revenue?

# .groupby('branch') splits the table into three groups — one per branch
# ['revenue'] selects just the revenue column inside each group
# .sum() adds up all revenues within each group
revenue_by_branch = df.groupby('branch')['revenue'].sum()

# .sort_values() puts the highest revenue branch at the top
print(revenue_by_branch.sort_values(ascending=False).round(2))

What just happened?

pandas' .groupby() split the 15 rows into 3 separate groups. .sum() added up all the revenue inside each group. Three lines of data became one clean answer: South branch leads with £85.60, just ahead of East (£69) and North (£66). That's the answer to the manager's question — in two lines of code.

Multiple Calculations at Once With .agg()

One calculation per group is useful. But usually you want several at once — total revenue and number of orders and average rating. The .agg() method lets you request as many calculations as you want in a single step. Think of it as ordering multiple items on one ticket rather than going back to the counter three times.

# .agg() lets you calculate multiple things at once for each group
# Each named argument becomes a column: total_rev='sum' means "sum of revenue, label it total_rev"
branch_summary = df.groupby('branch').agg(
    total_revenue  = ('revenue',    'sum'),    # add up all revenue per branch
    total_units    = ('units_sold', 'sum'),    # add up all units sold per branch
    num_orders     = ('order_id',   'count'),  # count how many orders per branch
    avg_rating     = ('rating',     'mean'),   # average customer rating per branch
    best_order     = ('revenue',    'max')     # the single highest revenue order
).round(2)

print(branch_summary)

What just happened?

pandas' .agg() is the power move of group-by analysis. The named aggregation syntax — total_revenue = ('revenue', 'sum') — says "take the revenue column, apply sum, call the result total_revenue." You can list as many of these as you want.

South branch leads on revenue (£85.60) and units (22) and has the highest average rating (4.40). North has the most orders per table (4 orders, £66 total — their baskets are smaller). East branch hit the single biggest order (£25.20 for 6 lattes). One table, five insights, one second to read.

Grouping By Two Columns at Once

The manager's next question: "Which drink sells best in each branch?" This needs grouping by two columns simultaneously — branch AND drink. You just pass a list instead of a single column name.

# Pass a LIST of two column names to group by both at once
# This creates one row for every unique branch + drink combination
branch_drink = df.groupby(['branch', 'drink'])['units_sold'].sum()

print(branch_drink.sort_values(ascending=False))

What just happened?

pandas created a group for every unique combination of branch and drink — 12 combinations from just 15 rows. South Espresso is the clear winner at 13 units. East Flat White is the weakest at just 2. This two-level grouping shows patterns that single-column grouping would completely miss — for example, Espresso is strong at South and North but mid-range at East.

Pivot Tables — The Same Data, More Readable

The two-column groupby above is correct, but the output is a long list. When one category goes across the top and another goes down the side, a pivot table is far easier to read — it looks like a spreadsheet grid where you can scan across rows and down columns at a glance.

# pd.pivot_table() reshapes the same grouped data into a 2D grid
# index  = what goes down the rows (branches)
# columns = what goes across the top (drinks)
# values  = what number fills each cell
# aggfunc = how to summarise when multiple rows share the same branch+drink
pivot = pd.pivot_table(
    df,
    index   = 'branch',
    columns = 'drink',
    values  = 'units_sold',
    aggfunc = 'sum'         # add up all units for each branch+drink combination
)

print(pivot)

What just happened?

pandas' pd.pivot_table() takes exactly the same data as the two-column groupby but arranges it as a grid. Now you can scan across a row to see a branch's drink mix, or down a column to compare how one drink performs across branches. South's Espresso dominance (13 units) jumps out immediately. East sells Cappuccino and Latte equally (6 each). This is the same data as before — just far more readable.

Adding Row and Column Totals

The manager will ask: "What's the overall total?" Adding margins (totals) to a pivot table turns it from a data grid into a proper summary report. One extra argument does it.

# margins=True adds a "Total" row at the bottom and a "Total" column on the right
# margins_name='Total' labels those totals (default label is 'All')
pivot_with_totals = pd.pivot_table(
    df,
    index    = 'branch',
    columns  = 'drink',
    values   = 'revenue',
    aggfunc  = 'sum',
    margins  = True,          # ← adds row totals and column totals
    margins_name = 'Total'
).round(2)

print(pivot_with_totals)

What just happened?

pandas' margins=True automatically sums every row and column and adds them as a "Total" row and column. Now the manager can read: total revenue across all branches and drinks is £206.60. Latte is the top drink overall (£63). Flat White is the weakest (£42). This is literally the output you'd paste into a PowerPoint slide.

Ranked Group Summaries

Sometimes you don't just want totals — you want a ranking. Which drink is #1, #2, #3 for each branch? pandas can add rank columns to any grouped output in one line.

# First, get total revenue per drink
drink_totals = df.groupby('drink')['revenue'].sum().reset_index()
#                                                   ^^^
# .reset_index() converts the group labels back into a normal column
# (by default groupby makes the group label the index — reset_index makes it a regular column)

# Add a rank column — 1 = highest revenue drink
drink_totals['rank'] = drink_totals['revenue'].rank(ascending=False).astype(int)

# Sort by rank so #1 appears at the top
drink_totals = drink_totals.sort_values('rank')

print(drink_totals)

What just happened?

pandas' .rank(ascending=False) assigns a ranking number to each row — 1 for the highest value, 4 for the lowest. We use ascending=False because we want rank 1 to mean "best" not "smallest." .reset_index() is a housekeeping step — after a groupby, the group label becomes the table's index (the far-left label). reset_index() moves it back into a regular column so it's easier to work with.

Latte is the top earner at £63, Flat White is last at £42. Despite Espresso being the best-seller by units (13 at South alone), Latte wins on revenue because it's priced higher (£4.20 vs £2.80). That's the kind of nuance group-by analysis surfaces instantly.

Visual Summary — Branch Revenue Breakdown

Here's a visual of the key findings from this analysis — total revenue per branch and the drink revenue breakdown.

Revenue by Branch

South
£85.60
East
£69.00
North
£66.00

Revenue by Drink

Latte
£63.00 #1
Espresso
£56.00 #2
Cappuccino
£45.60 #3
Flat White
£42.00 #4

Teacher's Note

groupby is the single most-used pandas method in real-world data work. Most of the questions a business asks — "which region?", "which product?", "which customer segment?" — are groupby questions. Learning to reach for it instinctively is one of the biggest productivity jumps you'll make as a data analyst.

A tip that saves a lot of confusion: after a groupby, pandas makes the group label the index of the result (the far-left label). If you want to use that label as a regular column — for sorting, filtering, or merging — always add .reset_index() at the end. It's a one-second habit that prevents a lot of head-scratching.

Practice Questions

1. Which pandas method lets you calculate multiple different statistics (sum, mean, count) for each group all in one step?



2. After a groupby, the group labels become the table's index. Which method moves them back into a regular column?



3. Which argument in pd.pivot_table() adds a row and column of totals to the output?



Quiz

1. You want to find the total sales per region. Which line of code does this correctly?


2. You have sales data grouped by both region and product. Which tool gives you a grid where regions are rows and products are columns?


3. In our coffee shop data, Espresso has the highest unit sales but Latte has the highest revenue. What explains this?


Up Next · Lesson 25

Multicollinearity

What happens when two features are too similar — and why it silently breaks linear models. Learn to detect and fix it before it damages your predictions.