EDA Course
Categorical Visuals
When your data has categories — product types, regions, customer segments — the key analytical question is always: how do these groups compare? This lesson is about getting to that answer fast, clearly, and in a way that drives decisions.
The Core Questions for Categorical Data
Every time you analyse a categorical column, you're trying to answer one of four questions:
How many of each?
Which categories appear most? Are any rare? → Frequency count
How do groups compare on a number?
Does revenue differ by region? → Grouped mean/sum
How do two categories relate?
Are returns more common for certain products? → Cross-tab
What's the composition?
What % of each region's sales come from each product? → Proportions
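Each of these questions maps to a single pandas idiom. As a rough sketch (using a tiny made-up table, not the lesson dataset), the four answers look like this:

```python
import pandas as pd

# Tiny made-up table — two categorical columns and one numeric column
toy = pd.DataFrame({
    'region':  ['North', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'A'],
    'sales':   [100, 200, 150, 50],
})

# 1. How many of each? → frequency count
counts = toy['region'].value_counts()

# 2. How do groups compare on a number? → grouped aggregate
means = toy.groupby('region')['sales'].mean()

# 3. How do two categories relate? → cross-tab of counts
xtab = pd.crosstab(toy['region'], toy['product'])

# 4. What's the composition? → row proportions of the cross-tab
props = pd.crosstab(toy['region'], toy['product'], normalize='index')

print(counts, means, xtab, props, sep='\n\n')
```

The rest of the lesson works through each of these on a realistic dataset.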
The Dataset We'll Use
The scenario: You're a data analyst at a retail company with three UK regions — North, South, and Midlands. Your manager has asked you to prepare a categorical breakdown of sales performance before the quarterly board meeting. You need to answer her four questions: how much did each region sell, which product category performs best, are there return rate differences across regions, and what's the revenue mix by category?
import pandas as pd
import numpy as np
# Retail sales data — 18 transactions across 3 regions and 3 product categories
df = pd.DataFrame({
'region': ['North','South','Midlands','North','South','Midlands',
'North','South','Midlands','North','South','Midlands',
'North','South','Midlands','North','South','Midlands'],
'category': ['Electronics','Clothing','Furniture','Clothing','Electronics','Clothing',
'Furniture','Clothing','Electronics','Electronics','Furniture','Electronics',
'Clothing','Furniture','Clothing','Electronics','Clothing','Furniture'],
'revenue': [420, 85, 310, 95, 380, 75, 280, 110, 460,
390, 260, 430, 120, 245, 90, 410, 100, 275],
'returned': [0, 0, 1, 0, 0, 1, 0, 1, 0,
0, 1, 0, 1, 0, 1, 0, 0, 0]
})
print(f"Rows: {len(df)} | Regions: {df['region'].nunique()} | Categories: {df['category'].nunique()}")
print(df.head(6))
Rows: 18 | Regions: 3 | Categories: 3
region category revenue returned
0 North Electronics 420 0
1 South Clothing 85 0
2 Midlands Furniture 310 1
3 North Clothing 95 0
4 South Electronics 380 0
5 Midlands Clothing 75 1
Question 1 — How Much Did Each Region Sell?
The most fundamental categorical analysis: group by one category and sum a number. This tells you where the business lives. Let's go beyond the total and also show the transaction count, average order value, and percentage share — several numbers together paint a much clearer picture than one.
# Group by region and calculate total revenue, transaction count, and average order value
region_summary = df.groupby('region').agg(
total_revenue = ('revenue', 'sum'),
transactions = ('revenue', 'count'),
avg_order = ('revenue', 'mean')
).round(1)
# Add a revenue share column — what % of total revenue does each region contribute?
total = region_summary['total_revenue'].sum()
region_summary['revenue_share'] = (region_summary['total_revenue'] / total * 100).round(1)
# Sort by revenue descending so the top region is first
region_summary = region_summary.sort_values('total_revenue', ascending=False)
print("=== REVENUE BY REGION ===\n")
print(region_summary)
print(f"\nTotal revenue: £{total:,.0f}")
=== REVENUE BY REGION ===
          total_revenue  transactions  avg_order  revenue_share
region
North              1715             6      285.8           37.8
Midlands           1640             6      273.3           36.2
South              1180             6      196.7           26.0

Total revenue: £4,535
What just happened?
pandas' .agg() computes three metrics per region in one step. The revenue share column is a simple division — each region's total divided by the grand total, multiplied by 100.
North leads with £1,715 in total revenue and the highest average order value (£285.8). Midlands is close behind on total revenue (£1,640) with a similar average order (£273.3). South lags on both metrics despite the same transaction count: its average order is £196.7, nearly £90 below North's. That gap is the board question: why are South's orders smaller?
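The same summary can be built as one method chain, deriving the share column with .assign() instead of a separate step. A minimal sketch on a small made-up table:

```python
import pandas as pd

# Small made-up table, just to show the chain shape
tx = pd.DataFrame({
    'region':  ['North', 'South', 'South', 'Midlands'],
    'revenue': [100, 300, 100, 500],
})

# group → aggregate → derive share → sort, all in one expression
summary = (
    tx.groupby('region')
      .agg(total_revenue=('revenue', 'sum'),
           transactions=('revenue', 'count'))
      .assign(revenue_share=lambda t: (t['total_revenue']
                                       / t['total_revenue'].sum() * 100).round(1))
      .sort_values('total_revenue', ascending=False)
)
print(summary)
```

Same numbers as the step-by-step version; the chain just keeps the intermediate variables out of your namespace.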
Question 2 — Which Category Performs Best?
Now we break it down by product category — and crucially, we look at it within each region. A category that performs well overall might underperform in a specific region, which is where the real insight lives.
# Overall performance by category
cat_overall = df.groupby('category')['revenue'].agg(['sum','mean','count']).round(1)
cat_overall.columns = ['total_rev', 'avg_rev', 'transactions']
cat_overall = cat_overall.sort_values('total_rev', ascending=False)
print("=== OVERALL: REVENUE BY CATEGORY ===\n")
print(cat_overall)
print()
# Category performance WITHIN each region — this is where the story gets interesting
# pd.pivot_table gives us a region × category revenue grid in one line
cat_by_region = pd.pivot_table(
df,
values = 'revenue',
index = 'region',
columns = 'category',
aggfunc = 'sum',
fill_value = 0 # if a region has no sales in a category, show 0 not NaN
)
print("=== REVENUE BY REGION × CATEGORY ===\n")
print(cat_by_region)
=== OVERALL: REVENUE BY CATEGORY ===
             total_rev  avg_rev  transactions
category
Electronics       2490    415.0             6
Furniture         1370    274.0             5
Clothing           675     96.4             7
=== REVENUE BY REGION × CATEGORY ===
category  Clothing  Electronics  Furniture
region
Midlands       165          890        585   ← Electronics leads Midlands
North          215         1220        280   ← Electronics dominates North
South          295          380        505   ← Furniture is South's big earner!
What just happened?
pandas' pd.pivot_table() reshapes the grouped data into a grid — regions as rows, categories as columns — in one line. The overall table tells one story (Electronics wins overall), but the pivot adds a twist: the winner changes depending on which region you're in.
This is the insight that only appears when you cross two categories: Electronics dominates North (£1,220 of its £1,715 total) and leads Midlands (£890), yet earns just £380 in South — less than a third of North's figure. South's biggest earner is instead Furniture (£505), a category that contributes only £280 in North. These are regional product strategy differences that a single overall total would completely hide.
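One optional check worth knowing: pivot_table accepts margins=True, which appends an 'All' row and column holding the totals, so you can verify the grid against the overall figures. A small sketch with made-up numbers:

```python
import pandas as pd

demo = pd.DataFrame({
    'region':   ['North', 'North', 'South', 'South'],
    'category': ['A', 'B', 'A', 'B'],
    'revenue':  [10, 20, 30, 40],
})

# margins=True adds an 'All' row and column containing the totals
grid = pd.pivot_table(demo, values='revenue', index='region',
                      columns='category', aggfunc='sum',
                      fill_value=0, margins=True, margins_name='All')
print(grid)
```

The 'All'/'All' cell should equal the grand total — a quick consistency check between the pivot and your Question 1 summary.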
Question 3 — Are Return Rates Different Across Regions?
Returns cost money. If one region has a significantly higher return rate, the business needs to know — and the answer might point to product quality issues, customer expectations, or fulfilment problems. Here we analyse the rate (%), not just the raw count.
# Return rate by region: what % of orders were returned?
return_by_region = df.groupby('region').agg(
total_orders = ('returned', 'count'),
total_returned = ('returned', 'sum') # 'returned' is 0 or 1, so sum = count of returns
)
return_by_region['return_rate_pct'] = (
return_by_region['total_returned'] / return_by_region['total_orders'] * 100
).round(1)
print("=== RETURN RATE BY REGION ===\n")
print(return_by_region)
print()
# Cross-tab: return rate by CATEGORY — which product type gets returned most?
return_by_cat = df.groupby('category').agg(
orders = ('returned', 'count'),
returned = ('returned', 'sum')
)
return_by_cat['return_rate_pct'] = (return_by_cat['returned'] / return_by_cat['orders'] * 100).round(1)
print("=== RETURN RATE BY CATEGORY ===\n")
print(return_by_cat)
=== RETURN RATE BY REGION ===
          total_orders  total_returned  return_rate_pct
region
Midlands             6               3             50.0
North                6               1             16.7
South                6               2             33.3
=== RETURN RATE BY CATEGORY ===
             orders  returned  return_rate_pct
category
Clothing          7         4             57.1
Electronics       6         0              0.0
Furniture         5         2             40.0
What just happened?
pandas' .agg() computes both the total orders and total returns per group in one step. Because returned is coded as 0 or 1, summing it gives us the count of returns automatically — no extra filter needed.
Two very different findings here. Midlands has a 50% return rate — one in two orders sent back. That's alarming. Electronics has a 0% return rate: nobody returned a single electronic item. Clothing returns 57% of its orders and Furniture 40%. So Midlands' high return rate is driven entirely by its Clothing and Furniture orders, not Electronics. This is exactly the kind of finding that turns into a board-level action item.
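The same rates can come straight from pd.crosstab, crossing the grouping column against the returned flag and normalising each row. A minimal sketch, assuming the same 0/1 coding for returns:

```python
import pandas as pd

orders = pd.DataFrame({
    'region':   ['North', 'North', 'South', 'South'],
    'returned': [0, 1, 0, 0],
})

# Rows: region. Columns: the returned flag (0/1).
# normalize='index' converts each row to proportions of that row's total.
rates = pd.crosstab(orders['region'], orders['returned'], normalize='index') * 100
print(rates.round(1))
```

The column labelled 1 is the return rate; the column labelled 0 is the keep rate. This is the cross-tab idiom promised in the lesson's opening list of questions.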
Question 4 — What Is the Revenue Mix?
The final question is about composition. Not just how much each region earns, but what percentage of each region's revenue comes from each category. This reveals whether regions are balanced or dependent on one product type.
# Build the pivot table of revenue by region × category (same as Question 2)
mix = pd.pivot_table(df, values='revenue', index='region',
columns='category', aggfunc='sum', fill_value=0)
# Convert each row to percentages — divide each cell by that row's total
# mix.sum(axis=1) is the total revenue per region (one value per row)
# axis=0 aligns that Series with the row index, so each row is divided by its own total
mix_pct = mix.div(mix.sum(axis=1), axis=0).mul(100).round(1)
print("=== REVENUE MIX BY REGION (% of each region's total) ===\n")
print(mix_pct)
print()
# Plain-English summary — what's the dominant category in each region?
print("Key finding:")
for region in mix_pct.index:
top_cat = mix_pct.loc[region].idxmax() # .idxmax() returns the column name with the highest value
top_pct = mix_pct.loc[region].max()
print(f" {region}: {top_cat} dominates at {top_pct:.0f}% of revenue")
=== REVENUE MIX BY REGION (% of each region's total) ===

category  Clothing  Electronics  Furniture
region
Midlands      10.1         54.3       35.7
North         12.5         71.1       16.3
South         25.0         32.2       42.8

Key finding:
  Midlands: Electronics dominates at 54% of revenue
  North: Electronics dominates at 71% of revenue
  South: Furniture dominates at 43% of revenue
What just happened?
pandas' .div() divides the entire DataFrame by a Series row-by-row when you use axis=0. mix.sum(axis=1) gives the total per row (total revenue per region), and dividing each cell by its row total converts everything to a proportion. Multiplying by 100 converts to percentages.
.idxmax() returns the column name with the highest value in each row — a clean way to find the dominant category without writing a loop.
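pd.crosstab can produce the same mix table in a single call: pass the revenue column as values, sum it per cell, and let normalize='index' do the row division. A small sketch on made-up numbers:

```python
import pandas as pd

sales_demo = pd.DataFrame({
    'region':   ['North', 'North', 'South', 'South'],
    'category': ['A', 'B', 'A', 'B'],
    'revenue':  [10, 30, 50, 50],
})

# values + aggfunc='sum' fills each cell with summed revenue;
# normalize='index' then divides each row by its row total
mix = pd.crosstab(sales_demo['region'], sales_demo['category'],
                  values=sales_demo['revenue'], aggfunc='sum',
                  normalize='index') * 100
print(mix.round(1))
```

Functionally equivalent to the pivot-then-.div() approach; the two-step version is easier to debug because you can inspect the raw totals first.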
The Board Deck — What the Data Actually Says
Four analyses, four findings. Here's what you walk into the board meeting with:
Midlands' return rate is 50% — needs urgent investigation
One in two Midlands orders is returned. Electronics returns zero — so the problem is Clothing or Furniture fulfilment in the Midlands region specifically.
North is dangerously concentrated in Electronics (71%)
If Electronics demand drops or a competitor undercuts prices, North has almost nothing else to fall back on. Revenue diversification needed.
North has the lowest return rate (17%) — learn from what they're doing
North combines the highest average order value (£285.8) with the fewest returns. Worth studying as a best-practice model for other regions.
Electronics never gets returned — consider expanding its range
A 0% return rate with the highest average order value (£415) makes Electronics the most profitable and least risky category in the business.
Teacher's Note
The hardest part of categorical analysis isn't the code — it's knowing what to compare. The mistake most beginners make is stopping at the first level: "North has the most revenue." That's one number. The real insight comes from the second level: "71% of North's revenue rides on a single category, Electronics, while Midlands sends back half of everything it sells." Two lines of code apart. Completely different business implication.
Always ask: does the overall pattern hold within each subgroup? If not — the subgroup difference is the story.
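That subgroup check can be automated: compute the overall winner, then the winner inside each group, and see where they disagree. A minimal sketch on a made-up table:

```python
import pandas as pd

shop = pd.DataFrame({
    'region':   ['North', 'North', 'South', 'South', 'South'],
    'category': ['A', 'B', 'A', 'B', 'B'],
    'revenue':  [100, 40, 30, 60, 60],
})

# Winner across the whole dataset
overall_top = shop.groupby('category')['revenue'].sum().idxmax()

# Winner within each region — idxmax(axis=1) picks the best column per row
per_region_top = (
    shop.pivot_table(values='revenue', index='region',
                     columns='category', aggfunc='sum', fill_value=0)
        .idxmax(axis=1)
)
print(overall_top)
print(per_region_top)
```

Here the overall winner is B, but North's winner is A — exactly the kind of subgroup difference the teacher's note says to look for.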
Practice Questions
1. Which pandas method returns the column name (or index label) that has the highest value in a row or Series?
2. To convert a pivot table of raw totals into row percentages, you divide each row by its row total. Which pandas method does this division efficiently across the whole DataFrame?
3. The returned column contains 0s and 1s. Which aggregation function gives you the count of returned orders when used in a groupby?
Quiz
1. Midlands has a 50% return rate. Electronics has a 0% return rate. What should be your next analytical step?
2. North earns 71% of its revenue from Electronics. Why is this a strategic risk?
3. Which pandas call produces a grid showing total revenue for every region × category combination?
Up Next · Lesson 29
Boxplots & Violin Plots
The two charts that show spread, outliers, and group comparisons all at once — and when to use each one.