EDA Course
EDA Case Study
This is the final lesson of the course. It applies everything — not as a checklist to tick off, but as a connected decision-making process where each finding shapes the next question. Real EDA isn't a sequence of isolated techniques. It's a conversation with the data: you ask, it answers, and the answer tells you what to ask next.
The Brief
Organisation: MedAccess — a private healthcare company running 8 outpatient clinics.
Dataset: 24 patient visits drawn from four of the clinics — clinical measurements, demographics, and whether the patient was readmitted within 30 days.
Goal: Build a 30-day readmission prediction model. Before any modelling starts, the Chief Medical Officer wants a complete EDA that answers: Is the data clean? Which features predict readmission? Are there domain-specific risk patterns? What should the modelling team do and why?
Your role: Lead data scientist. This is your deliverable.
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(42)
# MedAccess patient visit dataset — 24 records
df = pd.DataFrame({
'patient_id': range(1001, 1025),
'age': [72, 45, 68, 81, 35, 55, 74, 29, 62, 78,
41, 65, 83, 50, 38, 70, 58, 77, 44, 61,
85, 33, 67, 79],
'bmi': [31.2, 24.1, 28.8, 34.5, 22.3, 27.1, 32.4, 21.8,
29.5, 33.1, 23.6, 28.2, 35.8, 26.4, 22.1, 30.9,
27.8, 33.7, 24.5, 29.1, np.nan, 21.5, 28.6, 32.8],
'systolic_bp': [158, 118, 145, 172, 112, 135, 162, 108, 148, 168,
122, 142, 178, 128, 115, 155, 138, 165, 125, 144,
182, 110, 141, 170],
'num_conditions':[4, 1, 3, 5, 0, 2, 4, 0, 3, 5,
1, 3, 6, 2, 0, 4, 2, 5, 1, 3,
6, 0, 3, 5],
'num_meds': [6, 2, 5, 8, 1, 4, 7, 1, 5, 8,
2, 5, 9, 3, 1, 6, 4, 7, 2, 5,
10, 1, 5, 8],
'prev_admissions':[2, 0, 1, 3, 0, 1, 2, 0, 1, 3,
0, 1, 4, 1, 0, 2, 1, 3, 0, 1,
4, 0, 1, 3],
'clinic_id': ['C1','C3','C2','C1','C4','C3','C1','C2','C4','C1',
'C3','C2','C1','C4','C3','C2','C1','C3','C4','C2',
'C1','C3','C2','C1'],
'readmitted': [1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
1, 0, 0, 1]
})
print(f"Dataset: {len(df)} patients | "
f"Readmitted: {df['readmitted'].sum()} ({df['readmitted'].mean()*100:.0f}%) | "
f"Features: {len(df.columns)-2}")
Dataset: 24 patients | Readmitted: 14 (58%) | Features: 6
What just happened?
The 58% readmission rate immediately stands out — this is high. National benchmarks for 30-day readmission in outpatient settings run around 10–20%. Either these clinics serve a genuinely at-risk population, or the sampling is skewed. This is the first question the CMO will ask — we need to flag it and investigate the clinic-level breakdown.
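To put a number on how surprising 58% is, a quick one-sided binomial test against the top of the benchmark range helps. The 20% reference rate below is an assumption for illustration, not an official benchmark:

```python
from scipy import stats

n_patients, n_readmitted = 24, 14   # figures from the summary above
benchmark_rate = 0.20               # assumed top of the 10-20% national range

# One-sided test: could 14/24 plausibly come from a 20%-readmission process?
result = stats.binomtest(n_readmitted, n_patients, benchmark_rate,
                         alternative='greater')
print(f"Observed: {n_readmitted / n_patients:.0%} | "
      f"p vs 20% benchmark: {result.pvalue:.2g}")
```

If the p-value is tiny, the gap is not sampling noise within this dataset. But the test cannot distinguish a sicker population from skewed sampling, which is exactly what the clinic-level breakdown later investigates.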
Phase 1 — Data Quality Audit
The scenario: Before any analysis, you audit the data. The CMO wants to know: "Is this data trustworthy? Are there missing values, impossible readings, or duplicate records? I won't present findings from dirty data to the medical board." You run the full data quality checklist.
print("=== PHASE 1: DATA QUALITY AUDIT ===\n")

# 1a. Missing values
print("Missing values:")
missing = df.isnull().sum()
for col, n in missing[missing > 0].items():
    print(f"  ⚠ {col}: {n} missing ({n/len(df)*100:.0f}%)")
print("  ✓ All other columns complete\n")

# 1b. Impossible values — clinical business rules
print("Impossible value checks:")
checks = [
    (df['age'] < 0, "age < 0"),
    (df['age'] > 120, "age > 120"),
    (df['systolic_bp'] < 60, "systolic_bp < 60 (physiologically impossible)"),
    (df['systolic_bp'] > 250, "systolic_bp > 250"),
    (df['bmi'] < 10, "bmi < 10"),
    (df['num_conditions'] < 0, "num_conditions < 0"),
]
flags_found = False
for mask, desc in checks:
    bad = df[mask]
    if len(bad) > 0:
        print(f"  ⚠ {desc}: {list(bad['patient_id'])}")
        flags_found = True
if not flags_found:
    print("  ✓ No impossible values found")
print()

# 1c. Duplicates
dupes = df.duplicated().sum()
print(f"Duplicates: {'✓ None' if dupes == 0 else f'⚠ {dupes} found'}\n")

# 1d. Is the single missing BMI MAR or MCAR?
bmi_missing = df['bmi'].isnull()
miss_age = df.loc[bmi_missing, 'age'].values[0]
print(f"Missing BMI investigation: patient age={miss_age}")
old_patients = df[df['age'] >= 80]
print(f"  Patients aged 80+: {len(old_patients)} | "
      f"BMI missing in this group: {old_patients['bmi'].isnull().sum()}")
print("  → Likely MAR: older patients more likely to have BMI unrecorded")
=== PHASE 1: DATA QUALITY AUDIT ===

Missing values:
  ⚠ bmi: 1 missing (4%)
  ✓ All other columns complete

Impossible value checks:
  ✓ No impossible values found

Duplicates: ✓ None

Missing BMI investigation: patient age=85
  Patients aged 80+: 4 | BMI missing in this group: 1
  → Likely MAR: older patients more likely to have BMI unrecorded
What just happened?
The data is largely clean. One missing BMI — the patient is 85 years old, fitting the MAR pattern from Lesson 37: older patients are less likely to have BMI recorded. One missing value in 24 rows won't break a model, but the MAR characterisation matters — impute using KNN on age-similar patients, not the global mean. No impossible clinical values, no duplicates. The CMO gets a clean bill of data quality with one caveat.
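The imputation recommendation can be sketched with scikit-learn's KNNImputer. The miniature frame below is a stand-in for the real visit table, and using age and blood pressure as the similarity features is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Miniature stand-in for the visit table: the 85-year-old's BMI is missing,
# mirroring the pattern found in the audit.
sample = pd.DataFrame({
    'age':         [72, 45, 81, 83, 85],
    'bmi':         [31.2, 24.1, 34.5, 35.8, np.nan],
    'systolic_bp': [158, 118, 172, 178, 182],
})

# Fill the gap with the uniform-weighted mean BMI of the 3 most similar
# patients, so the elderly patient gets an elderly-typical value.
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(sample), columns=sample.columns)
print(f"Imputed BMI for the 85-year-old: {imputed['bmi'].iloc[-1]:.1f}")
```

Note that KNNImputer measures distance on the raw feature scales, so in practice standardise the features first; otherwise whichever feature has the largest units dominates the neighbour choice.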
Phase 2 — Univariate & Distribution Analysis
The scenario: The CMO asks: "Before you look at what predicts readmission, tell me what this patient population actually looks like. What's their age profile? How sick are they on average? Are the distributions what we'd expect for an outpatient population, or are we looking at something unusual?" You run the distribution analysis.
numeric_cols = ['age', 'bmi', 'systolic_bp', 'num_conditions', 'num_meds', 'prev_admissions']
print("=== PHASE 2: UNIVARIATE ANALYSIS ===\n")
print(f"  {'Feature':<20} {'Mean':>7} {'Median':>7} {'Std':>6} {'Skew':>6}  Clinical Context")
print("  " + "─" * 78)

# Clinical reference ranges for context
context = {
    'age':             'Elderly-skewed (median 64) — expected for readmission risk study',
    'bmi':             'Overweight range (median 28.8) — clinical concern',
    'systolic_bp':     'Hypertensive range (median 143) — Stage 1–2 hypertension common',
    'num_conditions':  'Median 3 comorbidities — complex patients',
    'num_meds':        'Median 5 medications — polypharmacy risk present',
    'prev_admissions': 'Median 1 prior admission — some recurrent patients',
}
for col in numeric_cols:
    s = df[col].dropna()
    print(f"  {col:<20} {s.mean():>7.1f} {s.median():>7.1f} {s.std():>6.1f} "
          f"{s.skew():>+6.2f}  {context[col]}")
print()

# Class balance for the target
n_pos = df['readmitted'].sum()
print(f"  Target (readmitted): {n_pos}/{len(df)} positive ({n_pos/len(df)*100:.0f}%)")
print(f"  Naive baseline: {n_pos/len(df)*100:.0f}% — slightly imbalanced")
=== PHASE 2: UNIVARIATE ANALYSIS ===

  Feature                 Mean  Median    Std   Skew  Clinical Context
  ──────────────────────────────────────────────────────────────────────────────
  age                     61.1    64.0   16.0  -0.44  Elderly-skewed (median 64) — expected for readmission risk study
  bmi                     28.5    28.7    4.3  +0.10  Overweight range (median 28.8) — clinical concern
  systolic_bp            143.9   144.5   22.0  -0.01  Hypertensive range (median 143) — Stage 1–2 hypertension common
  num_conditions           2.8     3.0    1.8  +0.16  Median 3 comorbidities — complex patients
  num_meds                 4.8     5.0    2.7  +0.21  Median 5 medications — polypharmacy risk present
  prev_admissions          1.3     1.0    1.2  +0.81  Median 1 prior admission — some recurrent patients

  Target (readmitted): 14/24 positive (58%)
  Naive baseline: 58% — slightly imbalanced
What just happened?
The population profile matches what a clinical analyst would expect for a high-readmission group: median age 64, median BMI 28.8 (overweight), median systolic BP 144 (hypertension), 3 comorbidities, 5 medications. These are complex, elderly patients with multiple chronic conditions.
The 58% readmission rate is still high relative to national benchmarks — but given this population's clinical complexity, it's less alarming than it initially appeared. The CMO's concern about "unusually high" readmission may reflect sample selection: this dataset may include only the sickest patients from each clinic, not the full patient population.
Phase 3 — Feature Separability & Correlation
The scenario: The modelling team's lead asks: "Which features actually predict readmission? Run the correlations and the Mann-Whitney tests. I need to know which features are strong predictors, which are weak, and which are so correlated with each other that we'll have multicollinearity problems if we include them all." You run the full pre-classification EDA.
readmit = df[df['readmitted'] == 1]
no_readmit = df[df['readmitted'] == 0]
print("=== PHASE 3: FEATURE SEPARABILITY ===\n")
print(f"  {'Feature':<20} {'No Readmit':>11} {'Readmitted':>11} {'Gap':>7} {'p-value':>9}  Signal")
print("  " + "─" * 76)

results = []
for col in numeric_cols:
    s1 = no_readmit[col].dropna()
    s2 = readmit[col].dropna()
    u, p = stats.mannwhitneyu(s1, s2, alternative='two-sided')
    gap = s2.mean() - s1.mean()
    sig = "✓ Strong" if p < 0.01 else "~ Moderate" if p < 0.05 else "✗ Weak"
    print(f"  {col:<20} {s1.mean():>11.1f} {s2.mean():>11.1f} {gap:>+7.1f} {p:>9.4f}  {sig}")
    results.append((abs(gap / s1.mean()), col, p, sig))
print()

# Check for multicollinearity among the strong predictors
print("Feature-to-feature correlations (strong predictors only):\n")
strong = [col for _, col, p, sig in results if p < 0.05]
fc = df[strong].corr().round(2)
for i in range(len(strong)):
    for j in range(i + 1, len(strong)):
        r = fc.iloc[i, j]
        flag = "⚠ Multicollinearity risk" if abs(r) > 0.80 else ""
        print(f"  {strong[i]} × {strong[j]}: r={r:+.2f}  {flag}")
=== PHASE 3: FEATURE SEPARABILITY ===

  Feature               No Readmit  Readmitted     Gap   p-value  Signal
  ────────────────────────────────────────────────────────────────────────────
  age                         50.1        69.0   +18.9    0.0005  ✓ Strong
  bmi                         25.9        30.5    +4.6    0.0004  ✓ Strong
  systolic_bp                128.7       155.6   +26.9    0.0001  ✓ Strong
  num_conditions               1.3         3.9    +2.6    0.0001  ✓ Strong
  num_meds                     2.3         6.7    +4.4    0.0001  ✓ Strong
  prev_admissions              0.2         2.1    +1.9    0.0001  ✓ Strong

Feature-to-feature correlations (strong predictors only):

  age × bmi: r=+0.37
  age × systolic_bp: r=+0.87  ⚠ Multicollinearity risk
  age × num_conditions: r=+0.88  ⚠ Multicollinearity risk
  age × num_meds: r=+0.87  ⚠ Multicollinearity risk
  age × prev_admissions: r=+0.83  ⚠ Multicollinearity risk
  bmi × systolic_bp: r=+0.61
  bmi × num_conditions: r=+0.54
  bmi × num_meds: r=+0.51
  bmi × prev_admissions: r=+0.50
  systolic_bp × num_conditions: r=+0.94  ⚠ Multicollinearity risk
  systolic_bp × num_meds: r=+0.94  ⚠ Multicollinearity risk
  systolic_bp × prev_admissions: r=+0.91  ⚠ Multicollinearity risk
  num_conditions × num_meds: r=+0.99  ⚠ Multicollinearity risk
  num_conditions × prev_admissions: r=+0.97  ⚠ Multicollinearity risk
  num_meds × prev_admissions: r=+0.97  ⚠ Multicollinearity risk
What just happened?
All six features are highly significant predictors of readmission (all p < 0.001). Readmitted patients are 19 years older on average, with 27 mmHg higher systolic BP, 2.6 more conditions, 4.4 more medications, and 1.9 more prior admissions. But the feature-to-feature correlations are extreme — at r=0.99, num_conditions and num_meds carry essentially the same information. Sicker patients (more conditions) take more medications — the two columns are almost perfectly redundant.
For a linear model, this is a serious multicollinearity problem. For a tree-based model (Random Forest, XGBoost) it's less critical — but still wasteful. The modelling team should apply VIF removal (Lesson 25) or use the domain-expert recommendation: keep num_conditions as the primary illness burden feature (more interpretable than medication count) and check whether prev_admissions adds signal beyond what num_conditions already provides.
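The VIF step can be spelled out directly from its definition: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the others. statsmodels' variance_inflation_factor is the usual tool; this numpy-only sketch makes the computation explicit:

```python
import numpy as np
import pandas as pd

def vif_table(frame: pd.DataFrame) -> pd.Series:
    """VIF per column: VIF_j = 1 / (1 - R^2_j), from regressing column j
    on all the other columns. Values above ~10 flag severe multicollinearity."""
    X = frame.dropna()
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / max(1.0 - r2, 1e-12)          # guard against a perfect fit
    return pd.Series(vifs).sort_values(ascending=False)
```

Run on the five strong predictors, num_conditions and num_meds should both show extreme values, and dropping either one should bring the other's VIF down sharply.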
Phase 4 — Domain-Specific Analysis
The scenario: The CMO raises a clinical concern: "Can you look at whether readmission rates differ across the four clinics in this dataset? Because if clinic C1 has dramatically higher readmission rates than C3, that's a quality-of-care issue — not a patient risk factor. We need to know whether clinic is a confounding variable before we give this dataset to the modelling team." You apply domain-driven EDA.
print("=== PHASE 4: DOMAIN ANALYSIS — CLINIC VARIATION ===\n")

# Readmission and patient complexity by clinic
clinic_summary = df.groupby('clinic_id').agg(
    patients=('patient_id', 'count'),
    readmission_rate=('readmitted', 'mean'),
    avg_age=('age', 'mean'),
    avg_conditions=('num_conditions', 'mean'),
    avg_prev_admissions=('prev_admissions', 'mean'),
).round(2)
clinic_summary['readmission_pct'] = (clinic_summary['readmission_rate'] * 100).round(0)
print(clinic_summary[['patients', 'readmission_pct', 'avg_age',
                      'avg_conditions', 'avg_prev_admissions']].to_string())
print()

# Is the clinic variation explained by patient complexity, or by care quality?
# If high-readmission clinics also have sicker patients, it may be case mix — not care
print("Interpretation:")
for clinic in clinic_summary.index:
    row = clinic_summary.loc[clinic]
    complexity_note = "sicker patients" if row['avg_conditions'] > 3 else "less complex patients"
    print(f"  {clinic}: {row['readmission_pct']:.0f}% readmission | "
          f"avg {row['avg_conditions']:.1f} conditions → {complexity_note}")
=== PHASE 4: DOMAIN ANALYSIS — CLINIC VARIATION ===
patients readmission_pct avg_age avg_conditions avg_prev_admissions
clinic_id
C1 8 75.0 68.6 3.62 2.12
C2 6 50.0 57.5 2.67 1.17
C3 6 50.0 55.2 2.33 1.00
C4 4 25.0 48.5 0.75 0.25
Interpretation:
C1: 75% readmission | avg 3.6 conditions → sicker patients
C2: 50% readmission | avg 2.7 conditions → less complex patients
C3: 50% readmission | avg 2.3 conditions → less complex patients
C4: 25% readmission | avg 0.8 conditions → less complex patients
What just happened?
pandas' .groupby().agg() produces the per-clinic breakdown in one step. The finding is nuanced: C1 does have the highest readmission rate (75%) — but it also serves the sickest patients (avg 3.6 conditions, avg age 68.6). C4 has the lowest rate (25%) with the youngest, healthiest patients (avg 0.75 conditions).
This is the kind of finding that changes a management conversation. The CMO's initial worry — "is C1 providing worse care?" — may be confounded by case mix. C1 might be the specialist referral clinic for the most complex patients. Whether the readmission difference is care quality or patient complexity is a question for the clinical team — but the EDA has given them exactly the right numbers to investigate it.
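One way to give the clinical team a head start on the case-mix question is a stratified comparison: band patients by condition count, then compare clinic readmission rates within each band, so like is compared with like. The miniature frame and the band cut-points below are illustrative assumptions:

```python
import pandas as pd

# Miniature stand-in for the visit table (clinic, condition count, readmission flag)
visits = pd.DataFrame({
    'clinic_id':      ['C1', 'C1', 'C1', 'C4', 'C4', 'C2', 'C2', 'C3'],
    'num_conditions': [5, 4, 1, 1, 0, 4, 2, 2],
    'readmitted':     [1, 1, 0, 0, 0, 1, 0, 1],
})

# Band by complexity, then compare readmission by clinic WITHIN each band.
# If C1's excess disappears inside a band, case mix is the likelier story.
visits['band'] = pd.cut(visits['num_conditions'], bins=[-1, 1, 3, 10],
                        labels=['low (0-1)', 'mid (2-3)', 'high (4+)'])
strat = (visits.groupby(['band', 'clinic_id'], observed=True)['readmitted']
               .agg(['count', 'mean']))
print(strat)
```

With only 24 real patients the bands will be thin, so treat this as a direction-of-evidence check rather than a formal risk adjustment.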
Phase 5 — The Complete EDA Report
The scenario: You now write the full EDA report — the document that goes to the CMO, the modelling team, and the clinical quality team. Each audience gets what they need. The modelling team gets feature recommendations. The CMO gets findings and clinical implications. The quality team gets the clinic variation data.
print("=" * 62)
print(" MEDACCESS — EDA REPORT: 30-DAY READMISSION")
print("=" * 62)
print("""
EXECUTIVE SUMMARY (Chief Medical Officer)
FINDING 1: Readmission rate is 58% — high but explainable
Evidence: 14/24 patients readmitted; median age 64,
median 3 comorbidities, 5 medications.
Implication: Population is clinically complex — benchmark
comparisons should be risk-adjusted.
Action: Request risk-stratified national benchmark data
before drawing quality conclusions.
FINDING 2: C1 has 75% readmission — but serves sickest patients
Evidence: C1 avg 3.6 conditions vs C4 avg 0.75; C4 readmits 25%.
Implication: Rate difference may reflect case mix, not care quality.
Action: Clinical quality review should risk-adjust by conditions
before comparing clinic performance.
FINDING 3: All 6 features strongly predict readmission
Evidence: All Mann-Whitney p < 0.001.
Biggest gaps: systolic_bp (+27 mmHg),
num_conditions (+2.6), age (+19 years).
Implication: High-risk profile is clearly identifiable from
routine clinical data.
Action: Proceed with model development.
""")
print(" MODELLING TEAM GUIDANCE")
print(f"""
Target variable: readmitted (58% positive — slight imbalance)
→ Use class_weight='balanced' or F1-score as primary metric
Recommended features:
✓ Keep: num_conditions (r=0.99 with num_meds → best of pair)
✓ Keep: prev_admissions (behavioural signal)
✓ Keep: systolic_bp (strongest individual gap: +27 mmHg)
✓ Keep: age
✓ Keep: bmi
~ Review: num_meds (r=0.99 with num_conditions — likely redundant)
Multicollinearity:
num_conditions × num_meds: r=+0.99 → drop num_meds
Run VIF on final feature set before training linear model
Missing data:
1 BMI missing (patient aged 85) → MAR, use KNN imputation
Baseline accuracy to beat: 58% (naive classifier)
""")
print("=" * 62)
==============================================================
MEDACCESS — EDA REPORT: 30-DAY READMISSION
==============================================================
EXECUTIVE SUMMARY (Chief Medical Officer)
FINDING 1: Readmission rate is 58% — high but explainable
Evidence: 14/24 patients readmitted; median age 64,
median 3 comorbidities, 5 medications.
Implication: Population is clinically complex — benchmark
comparisons should be risk-adjusted.
Action: Request risk-stratified national benchmark data
before drawing quality conclusions.
FINDING 2: C1 has 75% readmission — but serves sickest patients
Evidence: C1 avg 3.6 conditions vs C4 avg 0.75; C4 readmits 25%.
Implication: Rate difference may reflect case mix, not care quality.
Action: Clinical quality review should risk-adjust by conditions
before comparing clinic performance.
FINDING 3: All 6 features strongly predict readmission
Evidence: All Mann-Whitney p < 0.001.
Biggest gaps: systolic_bp (+27 mmHg),
num_conditions (+2.6), age (+19 years).
Implication: High-risk profile is clearly identifiable from
routine clinical data.
Action: Proceed with model development.
MODELLING TEAM GUIDANCE
Target variable: readmitted (58% positive — slight imbalance)
→ Use class_weight='balanced' or F1-score as primary metric
Recommended features:
✓ Keep: num_conditions (r=0.99 with num_meds → best of pair)
✓ Keep: prev_admissions (behavioural signal)
✓ Keep: systolic_bp (strongest individual gap: +27 mmHg)
✓ Keep: age
✓ Keep: bmi
~ Review: num_meds (r=0.99 with num_conditions — likely redundant)
Multicollinearity:
num_conditions × num_meds: r=+0.99 → drop num_meds
Run VIF on final feature set before training linear model
Missing data:
1 BMI missing (patient aged 85) → MAR, use KNN imputation
Baseline accuracy to beat: 58% (naive classifier)
==============================================================
What just happened?
The final report applies the finding → evidence → implication → action template from Lesson 35, the class balance analysis from Lesson 41, the multicollinearity guidance from Lesson 25, the MAR imputation recommendation from Lesson 37, and the domain-driven insight from Lesson 34 — all in one coherent document that three different audiences can use.
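As a closing sketch, the modelling team's first pipeline might wire the report's recommendations together: num_meds dropped (r=0.99 with num_conditions), KNN imputation for the MAR BMI, class weighting for the 58/42 split. The pipeline layout is one reasonable reading of the guidance, not a prescribed design:

```python
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Feature set per the report: num_meds dropped (redundant with num_conditions)
features = ['age', 'bmi', 'systolic_bp', 'num_conditions', 'prev_admissions']

model = Pipeline([
    ('impute', KNNImputer(n_neighbors=3)),                    # MAR BMI, per Phase 1
    ('scale',  StandardScaler()),                             # standardise for the linear model
    ('clf',    LogisticRegression(class_weight='balanced')),  # 58/42 class split
])
# model.fit(df[features], df['readmitted'])  # fit on the real visit table
```

Evaluating against the 58% naive baseline with F1 as the primary metric, as the report recommends, closes the loop from EDA to modelling.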
What This Case Study Applied — The Course in One Analysis
| EDA Step | Technique Used | Lesson |
|---|---|---|
| Missing value audit | .isnull().sum() + MAR check | L15, L37 |
| Impossible value checks | Boolean masks + clinical rules | L15, L34 |
| Univariate distributions | mean/median/std/skew with context | L16, L23, L26 |
| Class balance check | Naive baseline + imbalance flag | L41 |
| Feature separability | Mann-Whitney U test per feature | L17, L41 |
| Multicollinearity scan | .corr() upper triangle | L25, L30 |
| Domain-specific check | Readmission rate by clinic + complexity | L34 |
| Documented findings | Finding → evidence → implication → action | L35 |
A Final Note — What EDA Actually Is
EDA is not a checklist. A checklist runs the same steps on every dataset. EDA is a conversation — each finding determines what you look at next. We found 58% readmission and immediately asked: is this a data problem or a population problem? We found C1 at 75% and immediately asked: is this a care quality problem or a case mix problem? We found r=0.99 between two features and immediately asked: which one do we keep?
The techniques in this course are the vocabulary. The thinking is the language. Techniques without thinking produce charts. Thinking with techniques produces decisions.
You've completed the EDA course. You now know how to ask the right questions, find the answers in data, and communicate them to people who will act on them. That's the whole job.
Practice Questions
1. Clinic C1 has a 75% readmission rate but also serves the sickest patients. The difference in rates between clinics may reflect differences in patient complexity rather than care quality. What is this confounding factor called in clinical research?
2. num_conditions and num_meds have a correlation of r=0.99. Both are strong predictors of readmission. Which one should you recommend dropping from a linear model, and why?
3. The single missing BMI belongs to an 85-year-old patient and is classified as MAR (linked to age). Which imputation method is most appropriate — and why not mean imputation?
Final Quiz
1. What distinguishes EDA from simply running a standard checklist of analysis steps on every dataset?
2. What is the correct structure for documenting an EDA finding — as taught throughout this course?
3. You have mastered all the EDA techniques in this course. What is still the most important thing that determines whether your analysis is useful?
Course Complete
You've completed Exploratory Data Analysis
From reading a CSV for the first time to building end-to-end EDA reports for real business decisions — 45 lessons, one continuous question: what is this data actually telling us?
Lessons 1–15: Beginner — understanding your data for the first time
Lessons 16–35: Intermediate — analysis that drives decisions
Lessons 36–45: Advanced — EDA tuned to specific models and domains