EDA Course
EDA for Classification
When your model needs to sort things into categories rather than predict a number, the EDA questions change completely. You stop asking "is this relationship linear?" and start asking "does this feature actually separate the classes?" The single most common mistake before building a classification model is failing to check whether your classes are balanced — and then shipping a model that looks 95% accurate but actually predicts the majority class for everything.
The Four Pre-Classification Checks
Four targeted questions replace the regression assumption checklist:
Class Balance
Are the classes roughly equal? A 95:5 imbalance means a model that always predicts the majority class is "95% accurate" — without learning anything useful.
Feature Separability
Does each feature have a meaningfully different distribution between the classes? If the distributions completely overlap, the feature can't help the model separate them.
Baseline Accuracy
What accuracy would a "dummy model" that always predicts the majority class achieve? Any real model must beat this to have learned anything meaningful.
Within-Class Distributions
Are there subgroups within a class that behave differently? A single class can hide multiple clusters — which complicates any boundary the model tries to draw.
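Checks 1 and 3 take only a few lines of plain Python. A minimal sketch, assuming labels arrive as a list of 0/1 values (the function name is illustrative):

```python
from collections import Counter

def naive_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    counts = Counter(labels)
    majority_count = max(counts.values())
    return majority_count / len(labels)

# Hypothetical labels: 18 non-churners, 2 churners
labels = [0] * 18 + [1] * 2
print(naive_baseline_accuracy(labels))  # 0.9 — any model must beat this
```

If your trained model can't beat this number, it has learned nothing the class frequencies didn't already tell you.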
The Dataset We'll Use
The scenario: You're a data scientist at a telecom company. The customer success team wants a model to predict which customers will churn in the next 30 days — so they can intervene before it happens. They've given you a dataset of 20 customers with four features. Before you touch a model, the team lead asks: "Run the classification EDA first. Tell me if our data is balanced, whether these features actually separate churners from non-churners, and what accuracy we'd need to beat just to prove the model is learning something."
import pandas as pd
import numpy as np
from scipy import stats
# Telecom churn dataset — 20 customers
df = pd.DataFrame({
    'customer_id':     range(1, 21),
    'monthly_calls':   [42, 8, 35, 5, 38, 6, 40, 4, 36, 7,
                        39, 9, 34, 6, 41, 5, 37, 8, 43, 3],
    'support_tickets': [1, 6, 2, 8, 1, 7, 1, 9, 2, 5,
                        1, 6, 2, 7, 1, 8, 2, 6, 1, 10],
    'contract_months': [24, 3, 18, 1, 22, 2, 26, 1, 20, 4,
                        23, 3, 19, 2, 25, 1, 21, 4, 28, 1],
    'avg_bill':        [85, 62, 78, 55, 82, 58, 88, 52, 80, 65,
                        84, 60, 76, 57, 87, 53, 81, 63, 90, 48],
    'churned':         [0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
                        0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    # 10 churned, 10 stayed — perfectly balanced in this dataset
})
n_total = len(df)
n_churned = df['churned'].sum()
n_stayed = n_total - n_churned
print(f"Dataset: {n_total} customers")
print(f" Churned (1): {n_churned} ({n_churned/n_total*100:.0f}%)")
print(f" Stayed (0): {n_stayed} ({n_stayed/n_total*100:.0f}%)")
Dataset: 20 customers
 Churned (1): 10 (50%)
 Stayed (0): 10 (50%)
What just happened?
This dataset is perfectly balanced — 50:50. That's the ideal case. In real telecom churn datasets, imbalance of 90:10 or worse is common (most customers don't churn in any given month). We'll see in Check 1 what that looks like and why it matters so much.
Check 1 — Class Balance and the Baseline Trap
The scenario: The team lead explains the stakes: "Last quarter we built a churn model and the data scientist said it was 92% accurate. But when we used it, it barely flagged anyone. Turned out the dataset was 92% non-churners, so the model just learned to say 'not churning' for everyone. I need you to tell me what our baseline accuracy is — because any model we build has to beat that number to be worth anything." You compute the class balance and the naive baseline.
# --- CURRENT DATASET ---
print("=== CLASS BALANCE CHECK ===\n")
majority_class_rate = max(n_churned, n_stayed) / n_total
print(f"Current dataset: {n_churned} churned / {n_stayed} stayed")
print(f"Naive baseline (always predict majority): {majority_class_rate*100:.0f}%")
print(f"→ Any model must beat {majority_class_rate*100:.0f}% accuracy to be useful\n")
# --- SIMULATE A REALISTIC IMBALANCED DATASET ---
# Show what the "92% accuracy trap" looks like with real numbers
print("Simulating what happens with an imbalanced real-world dataset:\n")
for churn_rate in [0.50, 0.20, 0.10, 0.05]:
    total = 1000
    churned = int(total * churn_rate)
    stayed = total - churned
    naive_acc = stayed / total * 100  # "always predict no churn" accuracy
    model_needed = naive_acc + 5      # a model needs to beat this to be worth building
    flag = "⚠ IMBALANCE" if churn_rate < 0.20 else "✓ Acceptable"
    print(f" Churn rate {churn_rate*100:>5.0f}%: naive accuracy = {naive_acc:.0f}% "
          f"model must beat {model_needed:.0f}% {flag}")
=== CLASS BALANCE CHECK ===

Current dataset: 10 churned / 10 stayed
Naive baseline (always predict majority): 50%
→ Any model must beat 50% accuracy to be useful

Simulating what happens with an imbalanced real-world dataset:

 Churn rate    50%: naive accuracy = 50% model must beat 55% ✓ Acceptable
 Churn rate    20%: naive accuracy = 80% model must beat 85% ✓ Acceptable
 Churn rate    10%: naive accuracy = 90% model must beat 95% ⚠ IMBALANCE
 Churn rate     5%: naive accuracy = 95% model must beat 100% ⚠ IMBALANCE
What just happened?
Plain arithmetic is all this check needs: the naive accuracy is simply the proportion of the majority class, which a model that predicts nothing but the most common class achieves automatically.
The simulation shows why the team lead was burned before. At 10% churn rate, a model that always says "not churning" is 90% accurate — and completely useless. A model that correctly identifies even 50% of churners but also falsely flags 5% of non-churners might be less accurate overall but enormously more valuable. This is why accuracy is the wrong metric for imbalanced classification — recall, precision, and F1-score are what you actually need to optimise.
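The gap between accuracy and usefulness shows up immediately in a confusion-matrix calculation. A sketch with hypothetical counts for 1,000 customers at a 92:8 split, where the model never flags anyone:

```python
# Hypothetical confusion matrix: model always predicts "not churning"
tn, fp = 920, 0  # non-churners: all correctly left alone
fn, tp = 80, 0   # churners: every single one missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of churners caught
precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of flags that were right
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
# accuracy=0.92 recall=0.00 precision=0.00 f1=0.00
```

Accuracy says 92%; recall says the model caught zero churners. That gap is exactly what burned the team last quarter.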
Check 2 — Feature Separability
The scenario: The team lead wants to know which features actually help predict churn before any model is built. "Show me whether the churned customers look different from the ones who stayed on each feature. If both groups have the same distribution on a feature, that feature is useless for classification — the model will just add noise." You compute the mean and distribution for each feature split by churn label, then run a statistical test on each one.
features = ['monthly_calls', 'support_tickets', 'contract_months', 'avg_bill']
churned = df[df['churned'] == 1]
stayed = df[df['churned'] == 0]
print("=== FEATURE SEPARABILITY ===\n")
print(f" {'Feature':<20} {'Stayed mean':>12} {'Churned mean':>13} "
f"{'Gap':>7} {'p-value':>9} Separable?")
print(" " + "─" * 78)
for feat in features:
    s_mean = stayed[feat].mean()
    c_mean = churned[feat].mean()
    gap = c_mean - s_mean
    # Mann-Whitney U test: non-parametric test for whether two distributions differ.
    # Does not assume normality — better suited than a t-test for small samples.
    # p < 0.05 means the two groups are statistically different on this feature.
    u_stat, p = stats.mannwhitneyu(stayed[feat], churned[feat], alternative='two-sided')
    sep = "✓ YES" if p < 0.05 else "✗ No"
    print(f" {feat:<20} {s_mean:>12.1f} {c_mean:>13.1f} "
          f"{gap:>+7.1f} {p:>9.4f} {sep}")
=== FEATURE SEPARABILITY ===

 Feature               Stayed mean  Churned mean     Gap   p-value Separable?
 ──────────────────────────────────────────────────────────────────────────────
 monthly_calls                38.9           6.1   -32.8    0.0000 ✓ YES
 support_tickets               1.4           7.2    +5.8    0.0000 ✓ YES
 contract_months              21.8           2.3   -19.5    0.0000 ✓ YES
 avg_bill                     83.1          57.3   -25.8    0.0000 ✓ YES
What just happened?
scipy's stats.mannwhitneyu() is the non-parametric equivalent of a t-test — it tests whether two distributions are different without assuming normality. With only 10 samples per group, the t-test's normality assumption is shaky; Mann-Whitney is more appropriate.
All four features are highly separable (all p < 0.0001). The patterns are clear: churned customers make far fewer calls (6.1 vs 38.9), raise many more support tickets (7.2 vs 1.4), have much shorter contract history (2.3 vs 21.8 months), and pay lower bills (57.3 vs 83.1). These two groups look nothing alike on any feature. This is an unusually clean dataset — in real churn data, you'd often find some features with complete overlap and others with partial separation. Each should be evaluated independently.
Check 3 — The Overlap Zone
The scenario: The team lead pushes further: "Knowing the means are different is good — but what about the edges? Are there high-support-ticket customers who didn't churn? Are there low-call customers who stayed? Because those edge cases are where the model will make mistakes, and I need to know how hard the classification problem really is." You look at the distribution overlap — the grey zone where both classes appear.
print("=== OVERLAP ZONE ANALYSIS ===\n")
for feat in features:
    s_vals = stayed[feat]
    c_vals = churned[feat]
    print(f" {feat}:")
    print(f"   Stayed range:  [{s_vals.min():.0f} – {s_vals.max():.0f}]")
    print(f"   Churned range: [{c_vals.min():.0f} – {c_vals.max():.0f}]")
    # The overlap zone is where the two class ranges intersect:
    # from the larger of the two minimums to the smaller of the two maximums.
    # If that interval is empty, the classes are fully separated on this feature.
    shared_lo = max(s_vals.min(), c_vals.min())
    shared_hi = min(s_vals.max(), c_vals.max())
    if shared_lo <= shared_hi:
        # Count how many customers from EACH class fall into the overlap zone
        overlap_stayed = ((s_vals >= shared_lo) & (s_vals <= shared_hi)).sum()
        overlap_churned = ((c_vals >= shared_lo) & (c_vals <= shared_hi)).sum()
        print(f"   Overlap zone: [{shared_lo:.0f} – {shared_hi:.0f}] "
              f"→ {overlap_stayed} stayed & {overlap_churned} churned in this zone")
    else:
        print(f"   Overlap zone: NONE — classes fully separated on this feature ✓")
    print()
=== OVERLAP ZONE ANALYSIS ===
monthly_calls:
Stayed range: [34 – 43]
Churned range: [3 – 9]
Overlap zone: NONE — classes fully separated on this feature ✓
support_tickets:
Stayed range: [1 – 2]
Churned range: [5 – 10]
Overlap zone: NONE — classes fully separated on this feature ✓
contract_months:
Stayed range: [18 – 28]
Churned range: [1 – 4]
Overlap zone: NONE — classes fully separated on this feature ✓
avg_bill:
Stayed range: [76 – 90]
Churned range: [48 – 65]
Overlap zone: NONE — classes fully separated on this feature ✓
What just happened?
pandas' .min() and .max() per group give the full range of each class. If the maximum of the lower group falls below the minimum of the higher group, there is a clean gap between them with no overlap. We check this for every feature.
All four features are fully separated — the ranges don't even touch. Monthly calls: churned customers range 3–9, stayed range 34–43. Contract months: churned range 1–4, stayed range 18–28. There is literally no customer where the classification is ambiguous on any individual feature. This is an ideal dataset for demonstrating the technique, but the team lead needs to understand that real churn data usually has significant overlap. The analysis approach is the same — the answers will just be less clean.
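The same shared-interval logic is worth having as a standalone helper for messier data. A sketch on hypothetical, partially overlapping samples (the function and values are illustrative, not from the dataset above):

```python
def overlap_zone(a, b):
    """Return the interval where two samples' ranges intersect, or None."""
    lo, hi = max(min(a), min(b)), min(max(a), max(b))
    return (lo, hi) if lo <= hi else None

# Hypothetical messier data: support tickets with partial overlap
stayed_tickets = [1, 2, 3, 4, 5]
churned_tickets = [3, 5, 6, 8, 10]
print(overlap_zone(stayed_tickets, churned_tickets))
# (3, 5) — customers with 3–5 tickets could belong to either class
```

The wider the overlap zone and the more customers inside it, the harder the classification problem really is.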
Check 4 — The Classification EDA Report
The scenario: The team lead needs a final document that captures all four checks and gives the modelling team clear guidance: which features to use, what imbalance strategy to apply (if any), what accuracy metric to use instead of raw accuracy, and what baseline the model needs to beat. You produce the complete pre-classification EDA summary.
print("=" * 58)
print(" PRE-CLASSIFICATION EDA REPORT — CHURN MODEL")
print("=" * 58)
# 1. Class balance
churn_rate = n_churned / n_total
baseline = max(churn_rate, 1 - churn_rate)
imbalance_flag = "⚠ IMBALANCED — use SMOTE or class_weight" if churn_rate < 0.30 \
else "✓ Acceptable balance"
print(f"\n 1. CLASS BALANCE")
print(f" Churn rate: {churn_rate*100:.0f}% | Non-churn: {(1-churn_rate)*100:.0f}%")
print(f" Naive baseline accuracy: {baseline*100:.0f}%")
print(f" {imbalance_flag}")
print(f" Recommended metric: F1-score or AUC-ROC, not raw accuracy\n")
# 2. Feature separability summary
print(f" 2. FEATURE SEPARABILITY (Mann-Whitney p-values)\n")
print(f" {'Feature':<20} {'p-value':>9} {'Recommendation'}")
print(" " + "─" * 52)
for feat in features:
    _, p = stats.mannwhitneyu(stayed[feat], churned[feat], alternative='two-sided')
    rec = ("✓ Keep — strong separator" if p < 0.01
           else "~ Review — moderate" if p < 0.05
           else "✗ Weak — may not help")
    print(f"    {feat:<20} {p:>9.4f} {rec}")
# 3. Overlap assessment
print(f"\n 3. OVERLAP ZONES")
print(f" All 4 features show complete class separation.")
print(f" No overlap zone — classification should be straightforward.\n")
# 4. Modelling recommendations
print(f" 4. MODELLING GUIDANCE")
print(f" • Baseline to beat: {baseline*100:.0f}% (naive classifier)")
print(f" • Metric to use: F1-score (harmonic mean of precision and recall)")
print(f" • All 4 features are statistically significant — include all")
print(f" • No transformation required (features are already linearly separable)")
print(f" • Recommend: Logistic Regression as interpretable first model")
print(f"\n{'='*58}")
==========================================================
PRE-CLASSIFICATION EDA REPORT — CHURN MODEL
==========================================================
1. CLASS BALANCE
Churn rate: 50% | Non-churn: 50%
Naive baseline accuracy: 50%
✓ Acceptable balance
Recommended metric: F1-score or AUC-ROC, not raw accuracy
2. FEATURE SEPARABILITY (Mann-Whitney p-values)
Feature p-value Recommendation
────────────────────────────────────────────────────
monthly_calls 0.0000 ✓ Keep — strong separator
support_tickets 0.0000 ✓ Keep — strong separator
contract_months 0.0000 ✓ Keep — strong separator
avg_bill 0.0000 ✓ Keep — strong separator
3. OVERLAP ZONES
All 4 features show complete class separation.
No overlap zone — classification should be straightforward.
4. MODELLING GUIDANCE
• Baseline to beat: 50% (naive classifier)
• Metric to use: F1-score (harmonic mean of precision and recall)
• All 4 features are statistically significant — include all
• No transformation required (features are already linearly separable)
• Recommend: Logistic Regression as interpretable first model
==========================================================
What just happened?
The report packages every EDA finding into a single handoff document with four clear sections. The team lead now has: the baseline they need to beat (50%), the right metric to use (F1-score), which features to include (all four), and a recommended starting model (Logistic Regression). The classification EDA did its job — not to build a model, but to make sure the person who does build it starts with all the right information.
Teacher's Note
Accuracy is almost always the wrong metric for classification. If you report accuracy on an imbalanced dataset, you are almost certainly misleading yourself and your stakeholders. Use F1-score when false negatives and false positives are both costly. Use recall when missing a positive (like a churning customer) is worse than a false alarm. Use precision when false alarms are more costly than misses. Choosing the metric is an EDA decision, not a modelling one — you make it by understanding the class balance and the business consequences of each type of error.
When classes are imbalanced, the standard fixes are: class_weight='balanced' in sklearn models (weights each class inversely to its frequency), SMOTE (synthetic oversampling of the minority class), or simply resampling to create a balanced training set. All three should be evaluated — but first, you need to know the imbalance exists. That's what this EDA step is for.
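What class_weight='balanced' does under the hood can be sketched in a few lines — sklearn weights each class by n_samples / (n_classes * count), so rarer classes cost more to misclassify:

```python
import numpy as np

def balanced_class_weights(y):
    """Mirror sklearn's class_weight='balanced': n_samples / (n_classes * count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical 90:10 imbalance — each minority error counts 9x as much
y = [0] * 90 + [1] * 10
print(balanced_class_weights(y))  # {0: 0.555..., 1: 5.0}
```

With a 90:10 split, each missed churner costs nine times as much as a mislabelled non-churner, which pushes the model away from the lazy majority-class solution.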
Practice Questions
1. The accuracy a model achieves by always predicting the majority class — without learning anything about the data — is called the what?
2. Which non-parametric test checks whether two distributions are significantly different — used here instead of a t-test because the samples are small and normality can't be assumed?
3. Which classification metric combines precision and recall into a single number — recommended instead of raw accuracy when both false positives and false negatives matter?
Quiz
1. A churn model achieves 92% accuracy on a dataset where 92% of customers did not churn. The model barely flags anyone. What happened?
2. A Mann-Whitney test returns p=0.62 for a feature vs the class label. What does this mean for your model?
3. Your churn dataset has 5% churners and 95% non-churners. The EDA confirms severe class imbalance. What should you do before training?
Up Next · Lesson 42
EDA for Time Series
Stationarity, autocorrelation, and seasonality detection — the targeted EDA checks before any forecasting model, and why a non-stationary series will silently break most standard algorithms.