EDA Course
EDA for Classification
When your model needs to sort things into categories rather than predict a number, the EDA questions change completely. You stop asking "is this relationship linear?" and start asking "does this feature actually separate the classes?" The single most common mistake before building a classification model is failing to check whether your classes are balanced — and then shipping a model that looks 95% accurate but actually predicts the majority class for everything.
The Four Pre-Classification Checks
Four targeted questions replace the regression assumption checklist:
Class Balance
Are the classes roughly equal? A 95:5 imbalance means a model that always predicts the majority class is "95% accurate" — without learning anything useful.
Feature Separability
Does each feature have a meaningfully different distribution between the classes? If the distributions completely overlap, the feature can't help the model separate them.
Baseline Accuracy
What accuracy would a "dummy model" that always predicts the majority class achieve? Any real model must beat this to have learned anything meaningful.
Within-Class Distributions
Are there subgroups within a class that behave differently? A single class can hide multiple clusters — which complicates any boundary the model tries to draw.
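Checks 1 and 3 take only a few lines of plain Python. A minimal sketch, assuming labels arrive as a list of 0/1 values (the function name is illustrative):

```python
from collections import Counter

def naive_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    counts = Counter(labels)
    majority_count = max(counts.values())
    return majority_count / len(labels)

# Hypothetical labels: 18 non-churners, 2 churners
labels = [0] * 18 + [1] * 2
print(naive_baseline_accuracy(labels))  # 0.9 — any model must beat this
```

If your trained model can't beat this number, it has learned nothing the class frequencies didn't already tell you.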
The Dataset We'll Use
The scenario: You're a data scientist at a telecom company. The customer success team wants a model to predict which customers will churn in the next 30 days — so they can intervene before it happens. They've given you a dataset of 20 customers with four features. Before you touch a model, the team lead asks: "Run the classification EDA first. Tell me if our data is balanced, whether these features actually separate churners from non-churners, and what accuracy we'd need to beat just to prove the model is learning something."
import pandas as pd
import numpy as np
from scipy import stats
# Telecom churn dataset — 20 customers
df = pd.DataFrame({
    'customer_id':     range(1, 21),
    'monthly_calls':   [42, 8, 35, 5, 38, 6, 40, 4, 36, 7,
                        39, 9, 34, 6, 41, 5, 37, 8, 43, 3],
    'support_tickets': [1, 6, 2, 8, 1, 7, 1, 9, 2, 5,
                        1, 6, 2, 7, 1, 8, 2, 6, 1, 10],
    'contract_months': [24, 3, 18, 1, 22, 2, 26, 1, 20, 4,
                        23, 3, 19, 2, 25, 1, 21, 4, 28, 1],
    'avg_bill':        [85, 62, 78, 55, 82, 58, 88, 52, 80, 65,
                        84, 60, 76, 57, 87, 53, 81, 63, 90, 48],
    'churned':         [0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
                        0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    # 10 churned, 10 stayed — perfectly balanced in this dataset
})
n_total = len(df)
n_churned = df['churned'].sum()
n_stayed = n_total - n_churned
print(f"Dataset: {n_total} customers")
print(f" Churned (1): {n_churned} ({n_churned/n_total*100:.0f}%)")
print(f" Stayed (0): {n_stayed} ({n_stayed/n_total*100:.0f}%)")
Dataset: 20 customers
 Churned (1): 10 (50%)
 Stayed (0): 10 (50%)
What just happened?
This dataset is perfectly balanced — 50:50. That's the ideal case. In real telecom churn datasets, imbalance of 90:10 or worse is common (most customers don't churn in any given month). We'll see in Check 1 what that looks like and why it matters so much.
Check 1 — Class Balance and the Baseline Trap
The scenario: The team lead explains the stakes: "Last quarter we built a churn model and the data scientist said it was 92% accurate. But when we used it, it barely flagged anyone. Turned out the dataset was 92% non-churners, so the model just learned to say 'not churning' for everyone. I need you to tell me what our baseline accuracy is — because any model we build has to beat that number to be worth anything." You compute the class balance and the naive baseline.
# --- CURRENT DATASET ---
print("=== CLASS BALANCE CHECK ===\n")
majority_class_rate = max(n_churned, n_stayed) / n_total
print(f"Current dataset: {n_churned} churned / {n_stayed} stayed")
print(f"Naive baseline (always predict majority): {majority_class_rate*100:.0f}%")
print(f"→ Any model must beat {majority_class_rate*100:.0f}% accuracy to be useful\n")
# --- SIMULATE A REALISTIC IMBALANCED DATASET ---
# Show what the "92% accuracy trap" looks like with real numbers
print("Simulating what happens with an imbalanced real-world dataset:\n")
for churn_rate in [0.50, 0.20, 0.10, 0.05]:
    total = 1000
    churned = int(total * churn_rate)
    stayed = total - churned
    naive_acc = stayed / total * 100  # "always predict no churn" accuracy
    model_needed = naive_acc + 5      # a model needs to beat this to be worth building
    flag = "⚠ IMBALANCE" if churn_rate < 0.20 else "✓ Acceptable"
    print(f" Churn rate {churn_rate*100:>5.0f}%: naive accuracy = {naive_acc:.0f}% "
          f"model must beat {model_needed:.0f}% {flag}")
=== CLASS BALANCE CHECK ===

Current dataset: 10 churned / 10 stayed
Naive baseline (always predict majority): 50%
→ Any model must beat 50% accuracy to be useful

Simulating what happens with an imbalanced real-world dataset:

 Churn rate    50%: naive accuracy = 50% model must beat 55% ✓ Acceptable
 Churn rate    20%: naive accuracy = 80% model must beat 85% ✓ Acceptable
 Churn rate    10%: naive accuracy = 90% model must beat 95% ⚠ IMBALANCE
 Churn rate     5%: naive accuracy = 95% model must beat 100% ⚠ IMBALANCE
What just happened?
Plain arithmetic is all this check needs: the naive accuracy is simply the proportion of the majority class, which a model that predicts nothing but the most common class achieves automatically.
The simulation shows why the team lead was burned before. At 10% churn rate, a model that always says "not churning" is 90% accurate — and completely useless. A model that correctly identifies even 50% of churners but also falsely flags 5% of non-churners might be less accurate overall but enormously more valuable. This is why accuracy is the wrong metric for imbalanced classification — recall, precision, and F1-score are what you actually need to optimise.
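The gap between accuracy and usefulness shows up immediately in a confusion-matrix calculation. A sketch with hypothetical counts for 1,000 customers at a 92:8 split, where the model never flags anyone:

```python
# Hypothetical confusion matrix: model always predicts "not churning"
tn, fp = 920, 0  # non-churners: all correctly left alone
fn, tp = 80, 0   # churners: every single one missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of churners caught
precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of flags that were right
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
# accuracy=0.92 recall=0.00 precision=0.00 f1=0.00
```

Accuracy says 92%; recall says the model caught zero churners. That gap is exactly what burned the team last quarter.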
Check 2 — Feature Separability
The scenario: The team lead wants to know which features actually help predict churn before any model is built. "Show me whether the churned customers look different from the ones who stayed on each feature. If both groups have the same distribution on a feature, that feature is useless for classification — the model will just add noise." You compute the mean and distribution for each feature split by churn label, then run a statistical test on each one.
features = ['monthly_calls', 'support_tickets', 'contract_months', 'avg_bill']
churned = df[df['churned'] == 1]
stayed = df[df['churned'] == 0]
print("=== FEATURE SEPARABILITY ===\n")
print(f" {'Feature':<20} {'Stayed mean':>12} {'Churned mean':>13} "
f"{'Gap':>7} {'p-value':>9} Separable?")
print(" " + "─" * 78)
for feat in features:
    s_mean = stayed[feat].mean()
    c_mean = churned[feat].mean()
    gap = c_mean - s_mean
    # Mann-Whitney U test: non-parametric test for whether two distributions differ.
    # Does not assume normality — better suited than a t-test for small samples.
    # p < 0.05 means the two groups are statistically different on this feature.
    u_stat, p = stats.mannwhitneyu(stayed[feat], churned[feat], alternative='two-sided')
    sep = "✓ YES" if p < 0.05 else "✗ No"
    print(f" {feat:<20} {s_mean:>12.1f} {c_mean:>13.1f} "
          f"{gap:>+7.1f} {p:>9.4f} {sep}")
=== FEATURE SEPARABILITY ===

 Feature               Stayed mean  Churned mean     Gap   p-value Separable?
 ──────────────────────────────────────────────────────────────────────────────
 monthly_calls                38.9           6.1   -32.8    0.0000 ✓ YES
 support_tickets               1.4           7.2    +5.8    0.0000 ✓ YES
 contract_months              21.8           2.3   -19.5    0.0000 ✓ YES
 avg_bill                     83.1          57.3   -25.8    0.0000 ✓ YES
What just happened?
scipy's stats.mannwhitneyu() is the non-parametric equivalent of a t-test — it tests whether two distributions are different without assuming normality. With only 10 samples per group, the t-test's normality assumption is shaky; Mann-Whitney is more appropriate.
All four features are highly separable (all p < 0.0001). The patterns are clear: churned customers make far fewer calls (6.1 vs 38.9), raise many more support tickets (7.2 vs 1.4), have much shorter contract history (2.3 vs 21.8 months), and pay lower bills (57.3 vs 83.1). These two groups look nothing alike on any feature. This is an unusually clean dataset — in real churn data, you'd often find some features with complete overlap and others with partial separation. Each should be evaluated independently.
Check 3 — The Overlap Zone
The scenario: The team lead pushes further: "Knowing the means are different is good — but what about the edges? Are there high-support-ticket customers who didn't churn? Are there low-call customers who stayed? Because those edge cases are where the model will make mistakes, and I need to know how hard the classification problem really is." You look at the distribution overlap — the grey zone where both classes appear.
print("=== OVERLAP ZONE ANALYSIS ===\n")
for feat in features:
    s_vals = stayed[feat]
    c_vals = churned[feat]
    print(f" {feat}:")
    print(f"   Stayed range:  [{s_vals.min():.0f} – {s_vals.max():.0f}]")
    print(f"   Churned range: [{c_vals.min():.0f} – {c_vals.max():.0f}]")
    # The overlap zone is where the two class ranges intersect:
    # from the larger of the two minimums to the smaller of the two maximums.
    # If that interval is empty, the classes are fully separated on this feature.
    shared_lo = max(s_vals.min(), c_vals.min())
    shared_hi = min(s_vals.max(), c_vals.max())
    if shared_lo <= shared_hi:
        # Count how many customers from EACH class fall into the overlap zone
        overlap_stayed = ((s_vals >= shared_lo) & (s_vals <= shared_hi)).sum()
        overlap_churned = ((c_vals >= shared_lo) & (c_vals <= shared_hi)).sum()
        print(f"   Overlap zone: [{shared_lo:.0f} – {shared_hi:.0f}] "
              f"→ {overlap_stayed} stayed & {overlap_churned} churned in this zone")
    else:
        print(f"   Overlap zone: NONE — classes fully separated on this feature ✓")
    print()
=== OVERLAP ZONE ANALYSIS ===
monthly_calls:
Stayed range: [34 – 43]
Churned range: [3 – 9]
Overlap zone: NONE — classes fully separated on this feature ✓
support_tickets:
Stayed range: [1 – 2]
Churned range: [5 – 10]
Overlap zone: NONE — classes fully separated on this feature ✓
contract_months:
Stayed range: [18 – 28]
Churned range: [1 – 4]
Overlap zone: NONE — classes fully separated on this feature ✓
avg_bill:
Stayed range: [76 – 90]
Churned range: [48 – 65]
Overlap zone: NONE — classes fully separated on this feature ✓
What just happened?
pandas' .min() and .max() per group give the full range of each class. If the maximum of the lower group falls below the minimum of the higher group, there is a clean gap between them with no overlap. We check this for every feature.
All four features are fully separated — the ranges don't even touch. Monthly calls: churned customers range 3–9, stayed range 34–43. Contract months: churned range 1–4, stayed range 18–28. There is literally no customer where the classification is ambiguous on any individual feature. This is an ideal dataset for demonstrating the technique, but the team lead needs to understand that real churn data usually has significant overlap. The analysis approach is the same — the answers will just be less clean.
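The same shared-interval logic is worth having as a standalone helper for messier data. A sketch on hypothetical, partially overlapping samples (the function and values are illustrative, not from the dataset above):

```python
def overlap_zone(a, b):
    """Return the interval where two samples' ranges intersect, or None."""
    lo, hi = max(min(a), min(b)), min(max(a), max(b))
    return (lo, hi) if lo <= hi else None

# Hypothetical messier data: support tickets with partial overlap
stayed_tickets = [1, 2, 3, 4, 5]
churned_tickets = [3, 5, 6, 8, 10]
print(overlap_zone(stayed_tickets, churned_tickets))
# (3, 5) — customers with 3–5 tickets could belong to either class
```

The wider the overlap zone and the more customers inside it, the harder the classification problem really is.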
Check 4 — The Classification EDA Report
The scenario: The team lead needs a final document that captures all four checks and gives the modelling team clear guidance: which features to use, what imbalance strategy to apply (if any), what accuracy metric to use instead of raw accuracy, and what baseline the model needs to beat. You produce the complete pre-classification EDA summary.
print("=" * 58)
print(" PRE-CLASSIFICATION EDA REPORT — CHURN MODEL")
print("=" * 58)
# 1. Class balance
churn_rate = n_churned / n_total
baseline = max(churn_rate, 1 - churn_rate)
imbalance_flag = "⚠ IMBALANCED — use SMOTE or class_weight" if churn_rate < 0.30 \
else "✓ Acceptable balance"
print(f"\n 1. CLASS BALANCE")
print(f" Churn rate: {churn_rate*100:.0f}% | Non-churn: {(1-churn_rate)*100:.0f}%")
print(f" Naive baseline accuracy: {baseline*100:.0f}%")
print(f" {imbalance_flag}")
print(f" Recommended metric: F1-score or AUC-ROC, not raw accuracy\n")
# 2. Feature separability summary
print(f" 2. FEATURE SEPARABILITY (Mann-Whitney p-values)\n")
print(f" {'Feature':<20} {'p-value':>9} {'Recommendation'}")
print(" " + "─" * 52)
for feat in features:
    _, p = stats.mannwhitneyu(stayed[feat], churned[feat], alternative='two-sided')
    rec = ("✓ Keep — strong separator" if p < 0.01
           else "~ Review — moderate" if p < 0.05
           else "✗ Weak — may not help")
    print(f"    {feat:<20} {p:>9.4f} {rec}")
# 3. Overlap assessment
print(f"\n 3. OVERLAP ZONES")
print(f" All 4 features show complete class separation.")
print(f" No overlap zone — classification should be straightforward.\n")
# 4. Modelling recommendations
print(f" 4. MODELLING GUIDANCE")
print(f" • Baseline to beat: {baseline*100:.0f}% (naive classifier)")
print(f" • Metric to use: F1-score (harmonic mean of precision and recall)")
print(f" • All 4 features are statistically significant — include all")
print(f" • No transformation required (features are already linearly separable)")
print(f" • Recommend: Logistic Regression as interpretable first model")
print(f"\n{'='*58}")
==========================================================
PRE-CLASSIFICATION EDA REPORT — CHURN MODEL
==========================================================
1. CLASS BALANCE
Churn rate: 50% | Non-churn: 50%
Naive baseline accuracy: 50%
✓ Acceptable balance
Recommended metric: F1-score or AUC-ROC, not raw accuracy
2. FEATURE SEPARABILITY (Mann-Whitney p-values)
Feature p-value Recommendation
────────────────────────────────────────────────────
monthly_calls 0.0000 ✓ Keep — strong separator
support_tickets 0.0000 ✓ Keep — strong separator
contract_months 0.0000 ✓ Keep — strong separator
avg_bill 0.0000 ✓ Keep — strong separator
3. OVERLAP ZONES
All 4 features show complete class separation.
No overlap zone — classification should be straightforward.
4. MODELLING GUIDANCE
• Baseline to beat: 50% (naive classifier)
• Metric to use: F1-score (harmonic mean of precision and recall)
• All 4 features are statistically significant — include all
• No transformation required (features are already linearly separable)
• Recommend: Logistic Regression as interpretable first model
==========================================================
What just happened?
The report packages every EDA finding into a single handoff document with four clear sections. The team lead now has: the baseline they need to beat (50%), the right metric to use (F1-score), which features to include (all four), and a recommended starting model (Logistic Regression). The classification EDA did its job — not to build a model, but to make sure the person who does build it starts with all the right information.
Teacher's Note
Accuracy is almost always the wrong metric for classification. If you report accuracy on an imbalanced dataset, you are almost certainly misleading yourself and your stakeholders. Use F1-score when false negatives and false positives are both costly. Use recall when missing a positive (like a churning customer) is worse than a false alarm. Use precision when false alarms are more costly than misses. Choosing the metric is an EDA decision, not a modelling one — you make it by understanding the class balance and the business consequences of each type of error.
When classes are imbalanced, the standard fixes are: class_weight='balanced' in sklearn models (weights each class inversely to its frequency), SMOTE (synthetic oversampling of the minority class), or simply resampling to create a balanced training set. All three should be evaluated — but first, you need to know the imbalance exists. That's what this EDA step is for.
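What class_weight='balanced' does under the hood can be sketched in a few lines — sklearn weights each class by n_samples / (n_classes * count), so rarer classes cost more to misclassify:

```python
import numpy as np

def balanced_class_weights(y):
    """Mirror sklearn's class_weight='balanced': n_samples / (n_classes * count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical 90:10 imbalance — each minority error counts 9x as much
y = [0] * 90 + [1] * 10
print(balanced_class_weights(y))  # {0: 0.555..., 1: 5.0}
```

With a 90:10 split, each missed churner costs nine times as much as a mislabelled non-churner, which pushes the model away from the lazy majority-class solution.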
Practice Questions
1. The accuracy a model achieves by always predicting the majority class — without learning anything about the data — is called the what?
2. Which non-parametric test checks whether two distributions are significantly different — used here instead of a t-test because the samples are small and normality can't be assumed?
3. Which classification metric combines precision and recall into a single number — recommended instead of raw accuracy when both false positives and false negatives matter?
Quiz
1. A churn model achieves 92% accuracy on a dataset where 92% of customers did not churn. The model barely flags anyone. What happened?
2. A Mann-Whitney test returns p=0.62 for a feature vs the class label. What does this mean for your model?
3. Your churn dataset has 5% churners and 95% non-churners. The EDA confirms severe class imbalance. What should you do before training?
Up Next · Lesson 42
EDA for Time Series
Stationarity, autocorrelation, and seasonality detection — the targeted EDA checks before any forecasting model, and why a non-stationary series will silently break most standard algorithms.