EDA Lesson 22 – Categorical Exploration | Dataplexa
Intermediate Level · Lesson 22

Categorical Exploration

Not all columns are numbers. A column like "product category" or "country" holds text labels — and those labels carry a lot of information if you know how to read them. This lesson shows you exactly how to explore categorical columns before you touch your model.

What Is a Categorical Column?

A categorical column holds values that belong to a fixed set of groups — like colours, country names, customer tiers, or satisfaction ratings. Instead of measuring "how much" of something (which is what numbers do), categorical columns answer "which kind."

There are two flavours you'll meet constantly:

Nominal — No Order

The categories have no natural ranking. "Red", "Blue", "Green" — none is bigger or better than another. Country names, product categories, payment methods.

→ Use one-hot encoding

Ordinal — Has Order

The categories have a natural ranking. "Low", "Medium", "High" — Medium is clearly bigger than Low. Star ratings, satisfaction scores, education levels.

→ Use label encoding (keep the order)

The Dataset We'll Use

The scenario: You work in the analytics team at an online fashion retailer. Your manager has handed you a sample of 14 recent orders and wants you to explore the categorical columns before the data science team builds a returns prediction model. You have four categorical columns to investigate: product category, customer tier, return reason, and satisfaction rating.

First, let's load the data and take a first look:

import pandas as pd

# Our fashion retailer orders dataset
df = pd.DataFrame({
    'order_id':       range(1, 15),
    'category':       ['Tops','Shoes','Tops','Dresses','Shoes','Tops',
                       'Accessories','Dresses','Shoes','Tops','Dresses',
                       'Accessories','Shoes','Tops'],
    'customer_tier':  ['Gold','Silver','Bronze','Gold','Silver','Bronze',
                       'Gold','Bronze','Silver','Gold','Bronze','Silver',
                       'Gold','Silver'],
    'return_reason':  ['Wrong size','No reason','Wrong size','Damaged',
                       'Wrong size','No reason','No reason','Damaged',
                       'Wrong size','Wrong size','No reason','Damaged',
                       'Wrong size','No reason'],
    'satisfaction':   ['Happy','Neutral','Unhappy','Neutral','Happy',
                       'Happy','Happy','Unhappy','Neutral','Happy',
                       'Neutral','Unhappy','Happy','Neutral'],
    'order_value':    [45, 89, 32, 120, 67, 28, 55, 98, 74, 41, 110, 63, 82, 37]
})

print(df.to_string(index=False))

What just happened?

pandas is our data table library — it stores the data in rows and columns, just like a spreadsheet. We used pd.DataFrame() to create the table, and .to_string(index=False) to print it cleanly without the row numbers on the left. You can see four text columns and one number column — a typical mix in real retail data.

Step 1 — Count How Many Unique Values Each Column Has

Before anything else, you want to know: how many different categories does each column have? This is called cardinality. A column with 3 unique values is very different to one with 3,000 — they need completely different treatment.

Here's how to check cardinality in one line per column:

cat_cols = ['category', 'customer_tier', 'return_reason', 'satisfaction']

for col in cat_cols:
    # .nunique() counts how many DIFFERENT values exist in this column
    n_unique = df[col].nunique()

    # .unique() shows you exactly what those values are
    unique_vals = list(df[col].unique())

    print(f"{col}: {n_unique} unique values → {unique_vals}")

What just happened?

pandas' .nunique() counts the number of distinct values in a column. .unique() returns the actual list of those values. Together they give you a complete picture of what's in each column. All four columns have low cardinality (3–4 values) — easy to encode. A column with 500 unique values would need a very different strategy.
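What would that different strategy look like? One common option for high-cardinality columns (not covered further in this lesson) is frequency encoding: replacing each label with how often it appears, so 500 categories become one numeric column. A minimal sketch with a hypothetical city column:

```python
import pandas as pd

# Hypothetical high-cardinality column: imagine hundreds of city names,
# far too many to one-hot encode sensibly
cities = pd.Series(['London', 'Paris', 'London', 'Leeds', 'London', 'Paris'])

# Frequency encoding: replace each label with its share of the rows
freq = cities.value_counts(normalize=True)
cities_encoded = cities.map(freq)

print(cities_encoded.tolist())
```

Common labels get high values and rare ones get low values, which often carries useful signal on its own.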

Step 2 — Which Category Appears Most Often?

Cardinality tells you how many categories there are. Frequency tells you how often each one appears. This matters because if one category appears in 95% of rows, your model will be heavily biased toward it — that's a class imbalance problem you need to know about early.

Let's count the frequency of every category in every column:

for col in cat_cols:
    print(f"--- {col} ---")

    # .value_counts() counts how many times each category appears
    # normalize=True converts counts to percentages (0.0 to 1.0)
    counts = df[col].value_counts()
    pcts   = df[col].value_counts(normalize=True) * 100  # multiply by 100 for %

    for category in counts.index:
        bar = '█' * counts[category]   # simple visual bar — one block per order
        print(f"  {category:<15} {counts[category]:>2} orders  ({pcts[category]:.0f}%)  {bar}")
    print()

What just happened?

pandas' .value_counts() is the go-to method for categorical columns — it counts how many times each value appears, sorted from most to least common. Adding normalize=True converts those counts to proportions so you can see percentages.

Good news: the distribution looks fairly balanced here — no category dominates. "Wrong size" is the most common return reason at 43%, which makes immediate business sense. "Unhappy" is the least common satisfaction rating at 21% — worth flagging but not alarming.

Step 3 — Spotting Rare Categories

A rare category is one that appears so infrequently it becomes a problem. If "Accessories" only appears in 2 out of 1,000 orders, a model trained on that data will barely ever see it — and won't know what to do when it does. These rare categories need to be flagged and handled before encoding.

Here's a simple check that flags any category appearing in less than 15% of rows:

RARE_THRESHOLD = 0.15   # any category below 15% of rows is "rare"

print("=== RARE CATEGORY CHECK ===\n")

for col in cat_cols:
    # Get proportions for each category (0.0 to 1.0)
    proportions = df[col].value_counts(normalize=True)

    # Find the ones below our threshold
    rare = proportions[proportions < RARE_THRESHOLD]

    if len(rare) > 0:
        print(f"  ⚠  {col}: rare categories found:")
        for cat, pct in rare.items():
            print(f"       '{cat}' appears in only {pct*100:.1f}% of rows")
    else:
        print(f"  ✓  {col}: no rare categories")

Not quite all clear — the check flags Accessories

Run the check and you'll see Accessories sits just under the threshold: it appears in 2 of 14 rows (14.3%), below the 15% rule. With a 14-row sample that's no emergency, but in a real dataset with thousands of rows this check often catches genuine problems. You might find a product category that appears just twice; when that category gets one-hot encoded into a column, you'd have a column that's 99.8% zeros and 0.2% ones — almost no signal for the model. The fix: merge rare categories into an "Other" bucket before encoding.
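That merge is easy to implement. A minimal sketch, using a hypothetical ten-order dataset where "Vintage Hats" falls below the threshold:

```python
import pandas as pd

# Hypothetical mini-dataset: "Vintage Hats" appears in only 1 of 10 orders
df = pd.DataFrame({'category': ['Tops'] * 6 + ['Shoes'] * 3 + ['Vintage Hats']})

# Proportion of rows for each category (0.0 to 1.0)
props = df['category'].value_counts(normalize=True)

# Any label below the threshold gets merged into a single "Other" bucket
rare_labels = props[props < 0.15].index
df['category_merged'] = df['category'].where(~df['category'].isin(rare_labels), 'Other')

print(df['category_merged'].value_counts())
```

.where() keeps a value when the condition is True and substitutes 'Other' when it is False, so only the rare labels are replaced.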

Step 4 — Nominal vs Ordinal: Does the Order Matter?

Look at our four columns. Two are clearly nominal: category and return reason have no natural ranking. Customer tier is a judgment call, because Bronze < Silver < Gold is a real ordering, so you could reasonably treat it as ordinal too. Satisfaction is unambiguously ordinal: Unhappy < Neutral < Happy.

If you label-encode satisfaction alphabetically, you'd get Happy=0, Neutral=1, Unhappy=2 — which is backwards and wrong. You need to define the order yourself:

# Define the correct order manually — lowest to highest
satisfaction_order = {'Unhappy': 0, 'Neutral': 1, 'Happy': 2}

# .map() applies this dictionary to every row in the column
# 'Unhappy' becomes 0, 'Neutral' becomes 1, 'Happy' becomes 2
df['satisfaction_encoded'] = df['satisfaction'].map(satisfaction_order)

# Show the result side by side
print(df[['satisfaction', 'satisfaction_encoded']].drop_duplicates().sort_values('satisfaction_encoded'))

What just happened?

pandas' .map() takes a dictionary and replaces every value in the column using it as a lookup table. We defined the order ourselves — Unhappy=0, Neutral=1, Happy=2 — so the numbers respect the real-world meaning. The model can now understand that 2 (Happy) is better than 0 (Unhappy), which is exactly the relationship we want it to learn.
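One caveat worth knowing: .map() silently turns any label missing from the dictionary into NaN, so it pays to check for gaps straight after encoding. A quick sketch with a hypothetical stray label 'Mixed':

```python
import pandas as pd

satisfaction_order = {'Unhappy': 0, 'Neutral': 1, 'Happy': 2}

# 'Mixed' is a hypothetical typo that isn't in our mapping
s = pd.Series(['Happy', 'Neutral', 'Mixed'])
encoded = s.map(satisfaction_order)

# Any unmapped label becomes NaN — always check before modelling
n_unmapped = encoded.isna().sum()
print(f"Unmapped labels: {n_unmapped}")
```

If this count is above zero, fix the mapping (or the data) before moving on.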

Step 5 — One-Hot Encoding the Nominal Columns

For nominal columns (no order), we use one-hot encoding. This creates a separate Yes/No column for each category. "Tops" becomes a column called category_Tops that is 1 if the order is a Top, and 0 if not. The model can then treat each category as its own independent signal.

Here's how to one-hot encode the product category column:

# pd.get_dummies() creates one new column per category
# prefix='cat' keeps column names readable: cat_Tops, cat_Shoes, etc.
# drop_first=True removes one column to avoid a maths problem called multicollinearity
#   (if you know it's not Tops/Shoes/Dresses, it must be Accessories — so that column is redundant)
dummies = pd.get_dummies(df['category'], prefix='cat', drop_first=True)

print("One-hot encoded columns:")
print(dummies)

What just happened?

pandas' pd.get_dummies() automatically finds every unique value in the column and creates a True/False column for each one. We used drop_first=True to remove one column (Accessories) — because if all three other columns are False, the order must be Accessories. Keeping all four would give the model redundant information.

Row 0 is a Top: cat_Tops = True, the others are False. Row 3 is a Dress: cat_Dresses = True, the others are False. Row 6 is Accessories (the dropped category): all three are False. The model can figure that out on its own.
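In practice you'd usually attach the encoded columns back onto the DataFrame and drop the original text column. A minimal sketch on a cut-down four-row version of the data:

```python
import pandas as pd

df = pd.DataFrame({'category': ['Tops', 'Shoes', 'Dresses', 'Accessories'],
                   'order_value': [45, 89, 120, 55]})

dummies = pd.get_dummies(df['category'], prefix='cat', drop_first=True)

# Drop the original text column and attach the encoded ones side by side
df_model = pd.concat([df.drop(columns='category'), dummies], axis=1)

print(df_model.columns.tolist())
```

Note that pd.get_dummies() orders categories alphabetically, which is why Accessories is the column that drop_first=True removes.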

Step 6 — Connecting Categories to the Outcome

The final and most important step: does the category actually affect what we're trying to predict? Here, the model will predict returns. So the question is: do different product categories get returned more than others? Do higher-spending customers return less?

Let's look at average order value broken down by category and satisfaction:

# .groupby() splits the data by category, then calculates the mean order value per group
# This shows us: does a customer's category affect how much they spend?
print("--- Average order value by product category ---")
print(df.groupby('category')['order_value'].mean().sort_values(ascending=False).round(1))
print()

# Now check satisfaction — do unhappy customers spend more or less?
print("--- Average order value by satisfaction level ---")
# .reindex() forces the rows into the true ordinal order: Unhappy < Neutral < Happy
print(df.groupby('satisfaction')['order_value'].mean().reindex(['Unhappy', 'Neutral', 'Happy']).round(1))
print()

# Return reason breakdown: what % of each return reason comes from each category?
print("--- Return reasons by product category ---")
# pd.crosstab() counts how many times each combination of two categories appears
print(pd.crosstab(df['return_reason'], df['category']))

What just happened?

pandas' .groupby() splits the data into groups and computes a statistic for each. pd.crosstab() counts how many rows share each combination of two categorical columns — like a tally chart.

Three genuinely interesting findings pop out of this:

① Dresses have the highest average order value (£109) — the premium category. Tops are the cheapest (£37).

② Happy customers actually spend the least on average (£53). Unhappy customers average £64, and Neutral customers spend the most at £86. This is a counterintuitive but important finding — higher-value orders may raise expectations and leave more room for disappointment.

③ "Wrong size" returns come entirely from Tops and Shoes (3 each); Dresses and Accessories are never returned for sizing. That's an actionable insight for the returns model, and for the product team, who might need better sizing guides for tops and footwear.
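A small extra trick for reading crosstabs: the normalize argument converts the raw counts into proportions, which makes comparisons between categories of different sizes fairer. A sketch on a hypothetical four-order sample:

```python
import pandas as pd

reasons = pd.Series(['Wrong size', 'Damaged', 'Wrong size', 'Damaged'])
cats    = pd.Series(['Shoes', 'Dresses', 'Shoes', 'Shoes'])

# normalize='columns' expresses each category's return reasons as proportions,
# so every column sums to 1.0 regardless of how many orders it has
table = pd.crosstab(reasons, cats, normalize='columns')
print(table.round(2))
```

normalize also accepts 'index' (rows sum to 1) and 'all' (the whole table sums to 1).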

The Categorical Exploration Visual Summary

Here's a quick visual of the two most important findings — category frequency and return reasons by category.

Orders by Category

Tops
5 (36%)
Shoes
4 (29%)
Dresses
3 (21%)
Accessories
2 (14%)

Return Reason Split

Wrong size
6 (43%)
No reason
5 (36%)
Damaged
3 (21%)

Teacher's Note

The biggest mistake people make with categorical columns is jumping straight to encoding without exploring first. Encoding before exploring means you might encode a column that has no relationship with your target at all — wasting columns and confusing your model.

Always follow this order: count → frequency → rare check → ordinal/nominal → connect to outcome → then encode. Five minutes of exploration saves hours of debugging a model that silently underperforms because of a badly encoded feature.
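The first three steps of that checklist can be bundled into one small helper you reuse on every categorical column. A sketch (the function name and return format are our own invention, not a pandas API):

```python
import pandas as pd

def profile_categorical(series: pd.Series, rare_threshold: float = 0.15) -> dict:
    """Run the checklist on one column: cardinality, frequencies, rare-label check."""
    props = series.value_counts(normalize=True)
    return {
        'n_unique':    series.nunique(),
        'most_common': props.index[0],
        'frequencies': props.round(3).to_dict(),
        'rare_labels': [label for label, p in props.items() if p < rare_threshold],
    }

tiers = pd.Series(['Gold', 'Silver', 'Gold', 'Gold', 'Bronze', 'Gold', 'Gold'])
print(profile_categorical(tiers))
```

Loop it over your categorical columns and you have the count, frequency, and rare checks done in one pass; the ordinal/nominal decision and the connection to the outcome still need human judgment.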

Practice Questions

1. Which pandas method tells you how many unique values exist in a categorical column?



2. A column contains the values "Junior", "Mid", "Senior". These have a natural ranking. Is this column nominal or ordinal?



3. When using pd.get_dummies(), which argument removes one column to prevent the model from receiving redundant information?



Quiz

1. A product category called "Vintage Hats" appears in only 3 out of 10,000 orders (0.03%). What should you do with it?


2. You have a column with values "Low", "Medium", "High". What is the correct way to encode it?


3. Which pandas function creates a table that counts how many times each combination of two categorical columns appears together?


Up Next · Lesson 23

Numerical Exploration

The deep dive into numeric columns — distributions, percentiles, outlier zones, and how to build a complete numeric profile you can hand to any data scientist.