EDA Course
Categorical Visuals
When your data has categories — product types, regions, customer segments — the key analytical question is always: how do these groups compare? This lesson is about getting to that answer fast, clearly, and in a way that drives decisions.
The Core Questions for Categorical Data
Every time you analyse a categorical column, you're trying to answer one of four questions:
How many of each?
Which categories appear most? Are any rare? → Frequency count
How do groups compare on a number?
Does revenue differ by region? → Grouped mean/sum
How do two categories relate?
Are returns more common for certain products? → Cross-tab
What's the composition?
What % of each region's sales come from each product? → Proportions
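Each of these questions maps to a single pandas idiom. As a rough sketch (using a tiny made-up table, not the lesson dataset), the four answers look like this:

```python
import pandas as pd

# Tiny made-up table — two categorical columns and one numeric column
toy = pd.DataFrame({
    'region':  ['North', 'South', 'North', 'South'],
    'product': ['A', 'B', 'A', 'A'],
    'sales':   [100, 200, 150, 50],
})

# 1. How many of each? → frequency count
counts = toy['region'].value_counts()

# 2. How do groups compare on a number? → grouped aggregate
means = toy.groupby('region')['sales'].mean()

# 3. How do two categories relate? → cross-tab of counts
xtab = pd.crosstab(toy['region'], toy['product'])

# 4. What's the composition? → row proportions of the cross-tab
props = pd.crosstab(toy['region'], toy['product'], normalize='index')

print(counts, means, xtab, props, sep='\n\n')
```

The rest of the lesson works through each of these on a realistic dataset.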
The Dataset We'll Use
The scenario: You're a data analyst at a retail company with three UK regions — North, South, and Midlands. Your manager has asked you to prepare a categorical breakdown of sales performance before the quarterly board meeting. You need to answer her four questions: how much did each region sell, which product category performs best, are there return rate differences across regions, and what's the revenue mix by category?
import pandas as pd
import numpy as np
# Retail sales data — 18 transactions across 3 regions and 3 product categories
df = pd.DataFrame({
'region': ['North','South','Midlands','North','South','Midlands',
'North','South','Midlands','North','South','Midlands',
'North','South','Midlands','North','South','Midlands'],
'category': ['Electronics','Clothing','Furniture','Clothing','Electronics','Clothing',
'Furniture','Clothing','Electronics','Electronics','Furniture','Electronics',
'Clothing','Furniture','Clothing','Electronics','Clothing','Furniture'],
'revenue': [420, 85, 310, 95, 380, 75, 280, 110, 460,
390, 260, 430, 120, 245, 90, 410, 100, 275],
'returned': [0, 0, 1, 0, 0, 1, 0, 1, 0,
0, 1, 0, 1, 0, 1, 0, 0, 0]
})
print(f"Rows: {len(df)} | Regions: {df['region'].nunique()} | Categories: {df['category'].nunique()}")
print(df.head(6))
Rows: 18 | Regions: 3 | Categories: 3
region category revenue returned
0 North Electronics 420 0
1 South Clothing 85 0
2 Midlands Furniture 310 1
3 North Clothing 95 0
4 South Electronics 380 0
5 Midlands Clothing 75 1
Question 1 — How Much Did Each Region Sell?
The most fundamental categorical analysis: group by one category and sum a number. This tells you where the business lives. Let's go beyond the total and also show the transaction count, average order value, and percentage share — several numbers together paint a much clearer picture than one.
# Group by region and calculate total revenue, transaction count, and average order value
region_summary = df.groupby('region').agg(
total_revenue = ('revenue', 'sum'),
transactions = ('revenue', 'count'),
avg_order = ('revenue', 'mean')
).round(1)
# Add a revenue share column — what % of total revenue does each region contribute?
total = region_summary['total_revenue'].sum()
region_summary['revenue_share'] = (region_summary['total_revenue'] / total * 100).round(1)
# Sort by revenue descending so the top region is first
region_summary = region_summary.sort_values('total_revenue', ascending=False)
print("=== REVENUE BY REGION ===\n")
print(region_summary)
print(f"\nTotal revenue: £{total:,.0f}")
=== REVENUE BY REGION ===
          total_revenue  transactions  avg_order  revenue_share
region
North              1715             6      285.8           37.8
Midlands           1640             6      273.3           36.2
South              1180             6      196.7           26.0

Total revenue: £4,535
What just happened?
pandas' .agg() computes three metrics per region in one step. The revenue share column is a simple division — each region's total divided by the grand total, multiplied by 100.
North leads with £1,715 in total revenue and the highest average order value (£285.8). Midlands is close behind on total revenue (£1,640) with a similar average order (£273.3). South lags on both metrics despite the same transaction count: its average order is £196.7, nearly £90 below North's. That gap is the board question: why are South's orders smaller?
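The same summary can be built as one method chain, deriving the share column with .assign() instead of a separate step. A minimal sketch on a small made-up table:

```python
import pandas as pd

# Small made-up table, just to show the chain shape
tx = pd.DataFrame({
    'region':  ['North', 'South', 'South', 'Midlands'],
    'revenue': [100, 300, 100, 500],
})

# group → aggregate → derive share → sort, all in one expression
summary = (
    tx.groupby('region')
      .agg(total_revenue=('revenue', 'sum'),
           transactions=('revenue', 'count'))
      .assign(revenue_share=lambda t: (t['total_revenue']
                                       / t['total_revenue'].sum() * 100).round(1))
      .sort_values('total_revenue', ascending=False)
)
print(summary)
```

Same numbers as the step-by-step version; the chain just keeps the intermediate variables out of your namespace.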
Question 2 — Which Category Performs Best?
Now we break it down by product category — and crucially, we look at it within each region. A category that performs well overall might underperform in a specific region, which is where the real insight lives.
# Overall performance by category
cat_overall = df.groupby('category')['revenue'].agg(['sum','mean','count']).round(1)
cat_overall.columns = ['total_rev', 'avg_rev', 'transactions']
cat_overall = cat_overall.sort_values('total_rev', ascending=False)
print("=== OVERALL: REVENUE BY CATEGORY ===\n")
print(cat_overall)
print()
# Category performance WITHIN each region — this is where the story gets interesting
# pd.pivot_table gives us a region × category revenue grid in one line
cat_by_region = pd.pivot_table(
df,
values = 'revenue',
index = 'region',
columns = 'category',
aggfunc = 'sum',
fill_value = 0 # if a region has no sales in a category, show 0 not NaN
)
print("=== REVENUE BY REGION × CATEGORY ===\n")
print(cat_by_region)
=== OVERALL: REVENUE BY CATEGORY ===
             total_rev  avg_rev  transactions
category
Electronics       2490    415.0             6
Furniture         1370    274.0             5
Clothing           675     96.4             7
=== REVENUE BY REGION × CATEGORY ===
category  Clothing  Electronics  Furniture
region
Midlands       165          890        585   ← Electronics leads Midlands
North          215         1220        280   ← Electronics dominates North
South          295          380        505   ← Furniture is South's big earner!
What just happened?
pandas' pd.pivot_table() reshapes the grouped data into a grid — regions as rows, categories as columns — in one line. The overall table tells one story (Electronics wins overall), but the pivot adds a twist: the winner changes depending on which region you're in.
This is the insight that only appears when you cross two categories: Electronics dominates North (£1,220 of its £1,715 total) and leads Midlands (£890), yet earns just £380 in South — less than a third of North's figure. South's biggest earner is instead Furniture (£505), a category that contributes only £280 in North. These are regional product strategy differences that a single overall total would completely hide.
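One optional check worth knowing: pivot_table accepts margins=True, which appends an 'All' row and column holding the totals, so you can verify the grid against the overall figures. A small sketch with made-up numbers:

```python
import pandas as pd

demo = pd.DataFrame({
    'region':   ['North', 'North', 'South', 'South'],
    'category': ['A', 'B', 'A', 'B'],
    'revenue':  [10, 20, 30, 40],
})

# margins=True adds an 'All' row and column containing the totals
grid = pd.pivot_table(demo, values='revenue', index='region',
                      columns='category', aggfunc='sum',
                      fill_value=0, margins=True, margins_name='All')
print(grid)
```

The 'All'/'All' cell should equal the grand total — a quick consistency check between the pivot and your Question 1 summary.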
Question 3 — Are Return Rates Different Across Regions?
Returns cost money. If one region has a significantly higher return rate, the business needs to know — and the answer might point to product quality issues, customer expectations, or fulfilment problems. Here we analyse the rate (%), not just the raw count.
# Return rate by region: what % of orders were returned?
return_by_region = df.groupby('region').agg(
total_orders = ('returned', 'count'),
total_returned = ('returned', 'sum') # 'returned' is 0 or 1, so sum = count of returns
)
return_by_region['return_rate_pct'] = (
return_by_region['total_returned'] / return_by_region['total_orders'] * 100
).round(1)
print("=== RETURN RATE BY REGION ===\n")
print(return_by_region)
print()
# Cross-tab: return rate by CATEGORY — which product type gets returned most?
return_by_cat = df.groupby('category').agg(
orders = ('returned', 'count'),
returned = ('returned', 'sum')
)
return_by_cat['return_rate_pct'] = (return_by_cat['returned'] / return_by_cat['orders'] * 100).round(1)
print("=== RETURN RATE BY CATEGORY ===\n")
print(return_by_cat)
=== RETURN RATE BY REGION ===
          total_orders  total_returned  return_rate_pct
region
Midlands             6               3             50.0
North                6               1             16.7
South                6               2             33.3
=== RETURN RATE BY CATEGORY ===
             orders  returned  return_rate_pct
category
Clothing          7         4             57.1
Electronics       6         0              0.0
Furniture         5         2             40.0
What just happened?
pandas' .agg() computes both the total orders and total returns per group in one step. Because returned is coded as 0 or 1, summing it gives us the count of returns automatically — no extra filter needed.
Two very different findings here. Midlands has a 50% return rate — one in two orders sent back. That's alarming. Electronics has a 0% return rate: nobody returned a single electronic item. Clothing returns 57% of its orders and Furniture 40%. So Midlands' high return rate is driven entirely by its Clothing and Furniture orders, not Electronics. This is exactly the kind of finding that turns into a board-level action item.
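The same rates can come straight from pd.crosstab, crossing the grouping column against the returned flag and normalising each row. A minimal sketch, assuming the same 0/1 coding for returns:

```python
import pandas as pd

orders = pd.DataFrame({
    'region':   ['North', 'North', 'South', 'South'],
    'returned': [0, 1, 0, 0],
})

# Rows: region. Columns: the returned flag (0/1).
# normalize='index' converts each row to proportions of that row's total.
rates = pd.crosstab(orders['region'], orders['returned'], normalize='index') * 100
print(rates.round(1))
```

The column labelled 1 is the return rate; the column labelled 0 is the keep rate. This is the cross-tab idiom promised in the lesson's opening list of questions.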
Question 4 — What Is the Revenue Mix?
The final question is about composition. Not just how much each region earns, but what percentage of each region's revenue comes from each category. This reveals whether regions are balanced or dependent on one product type.
# Build the pivot table of revenue by region × category (same as Question 2)
mix = pd.pivot_table(df, values='revenue', index='region',
columns='category', aggfunc='sum', fill_value=0)
# Convert each row to percentages — divide each cell by that row's total
# mix.sum(axis=1) is the total revenue per region (one value per row)
# axis=0 aligns that Series with the row index, so each row is divided by its own total
mix_pct = mix.div(mix.sum(axis=1), axis=0).mul(100).round(1)
print("=== REVENUE MIX BY REGION (% of each region's total) ===\n")
print(mix_pct)
print()
# Plain-English summary — what's the dominant category in each region?
print("Key finding:")
for region in mix_pct.index:
top_cat = mix_pct.loc[region].idxmax() # .idxmax() returns the column name with the highest value
top_pct = mix_pct.loc[region].max()
print(f" {region}: {top_cat} dominates at {top_pct:.0f}% of revenue")
=== REVENUE MIX BY REGION (% of each region's total) ===

category  Clothing  Electronics  Furniture
region
Midlands      10.1         54.3       35.7
North         12.5         71.1       16.3
South         25.0         32.2       42.8

Key finding:
  Midlands: Electronics dominates at 54% of revenue
  North: Electronics dominates at 71% of revenue
  South: Furniture dominates at 43% of revenue
What just happened?
pandas' .div() divides the entire DataFrame by a Series row-by-row when you use axis=0. mix.sum(axis=1) gives the total per row (total revenue per region), and dividing each cell by its row total converts everything to a proportion. Multiplying by 100 converts to percentages.
.idxmax() returns the column name with the highest value in each row — a clean way to find the dominant category without writing a loop.
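pd.crosstab can produce the same mix table in a single call: pass the revenue column as values, sum it per cell, and let normalize='index' do the row division. A small sketch on made-up numbers:

```python
import pandas as pd

sales_demo = pd.DataFrame({
    'region':   ['North', 'North', 'South', 'South'],
    'category': ['A', 'B', 'A', 'B'],
    'revenue':  [10, 30, 50, 50],
})

# values + aggfunc='sum' fills each cell with summed revenue;
# normalize='index' then divides each row by its row total
mix = pd.crosstab(sales_demo['region'], sales_demo['category'],
                  values=sales_demo['revenue'], aggfunc='sum',
                  normalize='index') * 100
print(mix.round(1))
```

Functionally equivalent to the pivot-then-.div() approach; the two-step version is easier to debug because you can inspect the raw totals first.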
The Board Deck — What the Data Actually Says
Four analyses, four findings. Here's what you walk into the board meeting with:
Midlands' return rate is 50% — needs urgent investigation
One in two Midlands orders is returned. Electronics returns zero — so the problem is Clothing or Furniture fulfilment in the Midlands region specifically.
North is dangerously concentrated in Electronics (71%)
If Electronics demand drops or a competitor undercuts prices, North has almost nothing else to fall back on. Revenue diversification needed.
North has the lowest return rate (17%) — learn from what they're doing
North combines the highest average order value (£285.8) with the fewest returns. Worth studying as a best-practice model for other regions.
Electronics never gets returned — consider expanding its range
A 0% return rate with the highest average order value (£415) makes Electronics the most profitable and least risky category in the business.
Teacher's Note
The hardest part of categorical analysis isn't the code — it's knowing what to compare. The mistake most beginners make is stopping at the first level: "North has the most revenue." That's one number. The real insight comes from the second level: "71% of North's revenue rides on a single category, Electronics, while Midlands sends back half of everything it sells." Two lines of code apart. Completely different business implication.
Always ask: does the overall pattern hold within each subgroup? If not — the subgroup difference is the story.
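That subgroup check can be automated: compute the overall winner, then the winner inside each group, and see where they disagree. A minimal sketch on a made-up table:

```python
import pandas as pd

shop = pd.DataFrame({
    'region':   ['North', 'North', 'South', 'South', 'South'],
    'category': ['A', 'B', 'A', 'B', 'B'],
    'revenue':  [100, 40, 30, 60, 60],
})

# Winner across the whole dataset
overall_top = shop.groupby('category')['revenue'].sum().idxmax()

# Winner within each region — idxmax(axis=1) picks the best column per row
per_region_top = (
    shop.pivot_table(values='revenue', index='region',
                     columns='category', aggfunc='sum', fill_value=0)
        .idxmax(axis=1)
)
print(overall_top)
print(per_region_top)
```

Here the overall winner is B, but North's winner is A — exactly the kind of subgroup difference the teacher's note says to look for.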
Practice Questions
1. Which pandas method returns the column name (or index label) that has the highest value in a row or Series?
2. To convert a pivot table of raw totals into row percentages, you divide each row by its row total. Which pandas method does this division efficiently across the whole DataFrame?
3. The returned column contains 0s and 1s. Which aggregation function gives you the count of returned orders when used in a groupby?
Quiz
1. Midlands has a 50% return rate. Electronics has a 0% return rate. What should be your next analytical step?
2. North earns 71% of its revenue from Electronics. Why is this a strategic risk?
3. Which pandas call produces a grid showing total revenue for every region × category combination?
Up Next · Lesson 29
Boxplots & Violin Plots
The two charts that show spread, outliers, and group comparisons all at once — and when to use each one.