EDA Lesson 24 – Group-By Summaries | Dataplexa

Intermediate Level · Lesson 24

Group-By Summaries

The most common question in business data analysis isn't "what's the average?" — it's "what's the average per group?" Which city sells most? Which product returns the most profit? Which age group churns fastest? Group-by summaries answer all of these in a single line of code.

What Does "Group By" Actually Mean?

Imagine you have a spreadsheet of 1,000 sales orders. You want to know: which salesperson brought in the most revenue? You'd manually find all rows for salesperson A, add up their sales. Then do the same for B, C, D... That's exhausting.

Group-by automates exactly this. It: splits the data into groups (one per salesperson), applies a calculation to each group (sum of sales), then combines the results into one clean table. Data scientists call this the split → apply → combine pattern.

✂️

Split

Divide rows into groups by a category

🔢

Apply

Run a calculation on each group

📋

Combine

Bring results together in one table

The Dataset We'll Use

The scenario: You're an analyst at a coffee shop chain with three branches — North, South, and East. Your regional manager wants a breakdown of sales performance across branches and drink categories before the quarterly review. You have 15 orders and need to slice the data multiple ways to answer her questions.

First, let's load the data:

import pandas as pd
import numpy as np

# Coffee shop orders — 15 transactions across 3 branches
df = pd.DataFrame({
    'order_id':   range(1, 16),
    'branch':     ['North','South','East','North','East','South','North',
                   'East','South','North','East','South','North','East','South'],
    'drink':      ['Latte','Espresso','Cappuccino','Flat White','Latte',
                   'Cappuccino','Espresso','Flat White','Latte','Cappuccino',
                   'Espresso','Flat White','Latte','Cappuccino','Espresso'],
    'price':      [4.20, 2.80, 3.80, 3.50, 4.20, 3.80, 2.80, 3.50,
                   4.20, 3.80, 2.80, 3.50, 4.20, 3.80, 2.80],
    'units_sold': [3, 5, 2, 4, 6, 3, 7, 2, 4, 3, 5, 6, 2, 4, 8],
    'rating':     [4, 5, 4, 3, 5, 4, 5, 3, 4, 4, 5, 4, 3, 4, 5]
})

# Add a revenue column — price × units sold
df['revenue'] = df['price'] * df['units_sold']

print(df.to_string(index=False))

 order_id branch       drink  price  units_sold  rating  revenue
        1  North       Latte   4.20           3       4    12.60
        2  South   Espresso   2.80           5       5    14.00
        3   East  Cappuccino   3.80           2       4     7.60
        4  North  Flat White   3.50           4       3    14.00
        5   East       Latte   4.20           6       5    25.20
        6  South  Cappuccino   3.80           3       4    11.40
        7  North   Espresso   2.80           7       5    19.60
        8   East  Flat White   3.50           2       3     7.00
        9  South       Latte   4.20           4       4    16.80
       10  North  Cappuccino   3.80           3       4    11.40
       11   East   Espresso   2.80           5       5    14.00
       12  South  Flat White   3.50           6       4    21.00
       13  North       Latte   4.20           2       3     8.40
       14   East  Cappuccino   3.80           4       4    15.20
       15  South   Espresso   2.80           8       5    22.40

What just happened?

pandas holds all 15 orders in a table. We created the revenue column by multiplying two existing columns — pandas does this row by row automatically without any loop. Now we have everything we need to start grouping.

The Basic Group-By: One Column, One Calculation

The simplest group-by picks one category column to group by and one number column to summarise. The manager's first question: which branch made the most revenue?

# .groupby('branch') splits the table into three groups — one per branch
# ['revenue'] selects just the revenue column inside each group
# .sum() adds up all revenues within each group
revenue_by_branch = df.groupby('branch')['revenue'].sum()

# .sort_values() puts the highest revenue branch at the top
print(revenue_by_branch.sort_values(ascending=False).round(2))

branch
South    85.60
East     69.00
North    66.00
Name: revenue, dtype: float64

What just happened?

pandas' .groupby() split the 15 rows into 3 separate groups. .sum() added up all the revenue inside each group. Three lines of data became one clean answer: South branch leads with £85.60, just ahead of East (£69) and North (£66). That's the answer to the manager's question — in two lines of code.

Multiple Calculations at Once With .agg()

One calculation per group is useful. But usually you want several at once — total revenue and number of orders and average rating. The .agg() method lets you request as many calculations as you want in a single step. Think of it as ordering multiple items on one ticket rather than going back to the counter three times.

# .agg() lets you calculate multiple things at once for each group
# Each named argument becomes a column: total_rev='sum' means "sum of revenue, label it total_rev"
branch_summary = df.groupby('branch').agg(
    total_revenue  = ('revenue',    'sum'),    # add up all revenue per branch
    total_units    = ('units_sold', 'sum'),    # add up all units sold per branch
    num_orders     = ('order_id',   'count'),  # count how many orders per branch
    avg_rating     = ('rating',     'mean'),   # average customer rating per branch
    best_order     = ('revenue',    'max')     # the single highest revenue order
).round(2)

print(branch_summary)

        total_revenue  total_units  num_orders  avg_rating  best_order
branch
East            69.00           17           4        4.25       25.20
North           66.00           15           4        4.00       19.60
South           85.60           22           5        4.40       22.40

What just happened?

pandas' .agg() is the power move of group-by analysis. The named aggregation syntax — total_revenue = ('revenue', 'sum') — says "take the revenue column, apply sum, call the result total_revenue." You can list as many of these as you want.

South branch leads on revenue (£85.60) and units (22) and has the highest average rating (4.40). North has the most orders per table (4 orders, £66 total — their baskets are smaller). East branch hit the single biggest order (£25.20 for 6 lattes). One table, five insights, one second to read.

Grouping By Two Columns at Once

The manager's next question: "Which drink sells best in each branch?" This needs grouping by two columns simultaneously — branch AND drink. You just pass a list instead of a single column name.

# Pass a LIST of two column names to group by both at once
# This creates one row for every unique branch + drink combination
branch_drink = df.groupby(['branch', 'drink'])['units_sold'].sum()

print(branch_drink.sort_values(ascending=False))

branch  drink
South   Espresso      13
North   Espresso       7
East    Latte          6
South   Flat White     6
East    Cappuccino     6
North   Flat White     4
South   Latte          4
South   Cappuccino     3
North   Latte          5
North   Cappuccino     3
East    Espresso       5
East    Flat White     2
Name: units_sold, dtype: int64

What just happened?

pandas created a group for every unique combination of branch and drink — 12 combinations from just 15 rows. South Espresso is the clear winner at 13 units. East Flat White is the weakest at just 2. This two-level grouping shows patterns that single-column grouping would completely miss — for example, Espresso is strong at South and North but mid-range at East.

Pivot Tables — The Same Data, More Readable

The two-column groupby above is correct, but the output is a long list. When one category goes across the top and another goes down the side, a pivot table is far easier to read — it looks like a spreadsheet grid where you can scan across rows and down columns at a glance.

# pd.pivot_table() reshapes the same grouped data into a 2D grid
# index  = what goes down the rows (branches)
# columns = what goes across the top (drinks)
# values  = what number fills each cell
# aggfunc = how to summarise when multiple rows share the same branch+drink
pivot = pd.pivot_table(
    df,
    index   = 'branch',
    columns = 'drink',
    values  = 'units_sold',
    aggfunc = 'sum'         # add up all units for each branch+drink combination
)

print(pivot)

drink   Cappuccino  Espresso  Flat White  Latte
branch
East             6         5           2      6
North            3         7           4      5
South            3        13           6      4

What just happened?

pandas' pd.pivot_table() takes exactly the same data as the two-column groupby but arranges it as a grid. Now you can scan across a row to see a branch's drink mix, or down a column to compare how one drink performs across branches. South's Espresso dominance (13 units) jumps out immediately. East sells Cappuccino and Latte equally (6 each). This is the same data as before — just far more readable.

Adding Row and Column Totals

The manager will ask: "What's the overall total?" Adding margins (totals) to a pivot table turns it from a data grid into a proper summary report. One extra argument does it.

# margins=True adds a "Total" row at the bottom and a "Total" column on the right
# margins_name='Total' labels those totals (default label is 'All')
pivot_with_totals = pd.pivot_table(
    df,
    index    = 'branch',
    columns  = 'drink',
    values   = 'revenue',
    aggfunc  = 'sum',
    margins  = True,          # ← adds row totals and column totals
    margins_name = 'Total'
).round(2)

print(pivot_with_totals)

drink   Cappuccino  Espresso  Flat White  Latte   Total
branch
East         22.80     14.00        7.00  25.20   69.00
North        11.40     19.60       14.00  21.00   66.00
South        11.40     22.40       21.00  16.80   71.60
Total        45.60     56.00       42.00  63.00  206.60

What just happened?

pandas' margins=True automatically sums every row and column and adds them as a "Total" row and column. Now the manager can read: total revenue across all branches and drinks is £206.60. Latte is the top drink overall (£63). Flat White is the weakest (£42). This is literally the output you'd paste into a PowerPoint slide.

Ranked Group Summaries

Sometimes you don't just want totals — you want a ranking. Which drink is #1, #2, #3 for each branch? pandas can add rank columns to any grouped output in one line.

# First, get total revenue per drink
drink_totals = df.groupby('drink')['revenue'].sum().reset_index()
#                                                   ^^^
# .reset_index() converts the group labels back into a normal column
# (by default groupby makes the group label the index — reset_index makes it a regular column)

# Add a rank column — 1 = highest revenue drink
drink_totals['rank'] = drink_totals['revenue'].rank(ascending=False).astype(int)

# Sort by rank so #1 appears at the top
drink_totals = drink_totals.sort_values('rank')

print(drink_totals)

        drink  revenue  rank
2       Latte    63.00     1
1    Espresso    56.00     2
0  Cappuccino    45.60     3
3  Flat White    42.00     4

What just happened?

pandas' .rank(ascending=False) assigns a ranking number to each row — 1 for the highest value, 4 for the lowest. We use ascending=False because we want rank 1 to mean "best" not "smallest." .reset_index() is a housekeeping step — after a groupby, the group label becomes the table's index (the far-left label). reset_index() moves it back into a regular column so it's easier to work with.

Latte is the top earner at £63, Flat White is last at £42. Despite Espresso being the best-seller by units (13 at South alone), Latte wins on revenue because it's priced higher (£4.20 vs £2.80). That's the kind of nuance group-by analysis surfaces instantly.

Visual Summary — Branch Revenue Breakdown

Here's a visual of the key findings from this analysis — total revenue per branch and the drink revenue breakdown.

Revenue by Branch

South

£85.60

East

£69.00

North

£66.00

Revenue by Drink

Latte

£63.00 #1

Espresso

£56.00 #2

Cappuccino

£45.60 #3

Flat White

£42.00 #4

Teacher's Note

groupby is the single most-used pandas method in real-world data work. Most of the questions a business asks — "which region?", "which product?", "which customer segment?" — are groupby questions. Learning to reach for it instinctively is one of the biggest productivity jumps you'll make as a data analyst.

A tip that saves a lot of confusion: after a groupby, pandas makes the group label the index of the result (the far-left label). If you want to use that label as a regular column — for sorting, filtering, or merging — always add .reset_index() at the end. It's a one-second habit that prevents a lot of head-scratching.

Practice Questions

1. Which pandas method lets you calculate multiple different statistics (sum, mean, count) for each group all in one step?

2. After a groupby, the group labels become the table's index. Which method moves them back into a regular column?

3. Which argument in pd.pivot_table() adds a row and column of totals to the output?

Quiz

Up Next · Lesson 25

Multicollinearity

What happens when two features are too similar — and why it silently breaks linear models. Learn to detect and fix it before it damages your predictions.

← Previous Course Index Next →

EDA Course

Group-By Summaries

What Does "Group By" Actually Mean?

The Dataset We'll Use

The Basic Group-By: One Column, One Calculation

Multiple Calculations at Once With .agg()

Grouping By Two Columns at Once

Pivot Tables — The Same Data, More Readable

Adding Row and Column Totals

Ranked Group Summaries

Visual Summary — Branch Revenue Breakdown

Practice Questions

Quiz