EDA Course
EDA Case Study
This is the final lesson of the course. It applies everything — not as a checklist to tick off, but as a connected decision-making process where each finding shapes the next question. Real EDA isn't a sequence of isolated techniques. It's a conversation with the data: you ask, it answers, and the answer tells you what to ask next.
The Brief
Organisation: MedAccess — a private healthcare company running 8 outpatient clinics.
Dataset: 24 patient visits drawn from four of the clinics — clinical measurements, demographics, and whether the patient was readmitted within 30 days.
Goal: Build a 30-day readmission prediction model. Before any modelling starts, the Chief Medical Officer wants a complete EDA that answers: Is the data clean? Which features predict readmission? Are there domain-specific risk patterns? What should the modelling team do and why?
Your role: Lead data scientist. This is your deliverable.
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(42)
# MedAccess patient visit dataset — 24 records
df = pd.DataFrame({
'patient_id': range(1001, 1025),
'age': [72, 45, 68, 81, 35, 55, 74, 29, 62, 78,
41, 65, 83, 50, 38, 70, 58, 77, 44, 61,
85, 33, 67, 79],
'bmi': [31.2, 24.1, 28.8, 34.5, 22.3, 27.1, 32.4, 21.8,
29.5, 33.1, 23.6, 28.2, 35.8, 26.4, 22.1, 30.9,
27.8, 33.7, 24.5, 29.1, np.nan, 21.5, 28.6, 32.8],
'systolic_bp': [158, 118, 145, 172, 112, 135, 162, 108, 148, 168,
122, 142, 178, 128, 115, 155, 138, 165, 125, 144,
182, 110, 141, 170],
'num_conditions':[4, 1, 3, 5, 0, 2, 4, 0, 3, 5,
1, 3, 6, 2, 0, 4, 2, 5, 1, 3,
6, 0, 3, 5],
'num_meds': [6, 2, 5, 8, 1, 4, 7, 1, 5, 8,
2, 5, 9, 3, 1, 6, 4, 7, 2, 5,
10, 1, 5, 8],
'prev_admissions':[2, 0, 1, 3, 0, 1, 2, 0, 1, 3,
0, 1, 4, 1, 0, 2, 1, 3, 0, 1,
4, 0, 1, 3],
'clinic_id': ['C1','C3','C2','C1','C4','C3','C1','C2','C4','C1',
'C3','C2','C1','C4','C3','C2','C1','C3','C4','C2',
'C1','C3','C2','C1'],
'readmitted': [1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
1, 0, 0, 1]
})
print(f"Dataset: {len(df)} patients | "
f"Readmitted: {df['readmitted'].sum()} ({df['readmitted'].mean()*100:.0f}%) | "
f"Features: {len(df.columns)-2}")
Dataset: 24 patients | Readmitted: 14 (58%) | Features: 6
What just happened?
The 58% readmission rate immediately stands out — this is high. National benchmarks for 30-day readmission in outpatient settings run around 10–20%. Either these clinics serve a genuinely at-risk population, or the sampling is skewed. This is the first question the CMO will ask — we need to flag it and investigate the clinic-level breakdown.
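To put a number on how surprising 58% is, a quick one-sided binomial test against the top of the benchmark range helps. The 20% reference rate below is an assumption for illustration, not an official benchmark:

```python
from scipy import stats

n_patients, n_readmitted = 24, 14   # figures from the summary above
benchmark_rate = 0.20               # assumed top of the 10-20% national range

# One-sided test: could 14/24 plausibly come from a 20%-readmission process?
result = stats.binomtest(n_readmitted, n_patients, benchmark_rate,
                         alternative='greater')
print(f"Observed: {n_readmitted / n_patients:.0%} | "
      f"p vs 20% benchmark: {result.pvalue:.2g}")
```

If the p-value is tiny, the gap is not sampling noise within this dataset. But the test cannot distinguish a sicker population from skewed sampling, which is exactly what the clinic-level breakdown later investigates.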
Phase 1 — Data Quality Audit
The scenario: Before any analysis, you audit the data. The CMO wants to know: "Is this data trustworthy? Are there missing values, impossible readings, or duplicate records? I won't present findings from dirty data to the medical board." You run the full data quality checklist.
print("=== PHASE 1: DATA QUALITY AUDIT ===\n")

# 1a. Missing values
print("Missing values:")
missing = df.isnull().sum()
for col, n in missing[missing > 0].items():
    print(f"  ⚠ {col}: {n} missing ({n/len(df)*100:.0f}%)")
print("  ✓ All other columns complete\n")

# 1b. Impossible values — clinical business rules
print("Impossible value checks:")
checks = [
    (df['age'] < 0, "age < 0"),
    (df['age'] > 120, "age > 120"),
    (df['systolic_bp'] < 60, "systolic_bp < 60 (physiologically impossible)"),
    (df['systolic_bp'] > 250, "systolic_bp > 250"),
    (df['bmi'] < 10, "bmi < 10"),
    (df['num_conditions'] < 0, "num_conditions < 0"),
]
flags_found = False
for mask, desc in checks:
    bad = df[mask]
    if len(bad) > 0:
        print(f"  ⚠ {desc}: {list(bad['patient_id'])}")
        flags_found = True
if not flags_found:
    print("  ✓ No impossible values found")
print()

# 1c. Duplicates
dupes = df.duplicated().sum()
print(f"Duplicates: {'✓ None' if dupes == 0 else f'⚠ {dupes} found'}\n")

# 1d. Is the single missing BMI MAR or MCAR?
bmi_missing = df['bmi'].isnull()
miss_age = df.loc[bmi_missing, 'age'].values[0]
print(f"Missing BMI investigation: patient age={miss_age}")
old_patients = df[df['age'] >= 80]
print(f"  Patients aged 80+: {len(old_patients)} | "
      f"BMI missing in this group: {old_patients['bmi'].isnull().sum()}")
print("  → Likely MAR: older patients more likely to have BMI unrecorded")
=== PHASE 1: DATA QUALITY AUDIT ===

Missing values:
  ⚠ bmi: 1 missing (4%)
  ✓ All other columns complete

Impossible value checks:
  ✓ No impossible values found

Duplicates: ✓ None

Missing BMI investigation: patient age=85
  Patients aged 80+: 4 | BMI missing in this group: 1
  → Likely MAR: older patients more likely to have BMI unrecorded
What just happened?
The data is largely clean. One missing BMI — the patient is 85 years old, fitting the MAR pattern from Lesson 37: older patients are less likely to have BMI recorded. One missing value in 24 rows won't break a model, but the MAR characterisation matters — impute using KNN on age-similar patients, not the global mean. No impossible clinical values, no duplicates. The CMO gets a clean bill of data quality with one caveat.
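The imputation recommendation can be sketched with scikit-learn's KNNImputer. The miniature frame below is a stand-in for the real visit table, and using age and blood pressure as the similarity features is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Miniature stand-in for the visit table: the 85-year-old's BMI is missing,
# mirroring the pattern found in the audit.
sample = pd.DataFrame({
    'age':         [72, 45, 81, 83, 85],
    'bmi':         [31.2, 24.1, 34.5, 35.8, np.nan],
    'systolic_bp': [158, 118, 172, 178, 182],
})

# Fill the gap with the uniform-weighted mean BMI of the 3 most similar
# patients, so the elderly patient gets an elderly-typical value.
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(sample), columns=sample.columns)
print(f"Imputed BMI for the 85-year-old: {imputed['bmi'].iloc[-1]:.1f}")
```

Note that KNNImputer measures distance on the raw feature scales, so in practice standardise the features first; otherwise whichever feature has the largest units dominates the neighbour choice.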
Phase 2 — Univariate & Distribution Analysis
The scenario: The CMO asks: "Before you look at what predicts readmission, tell me what this patient population actually looks like. What's their age profile? How sick are they on average? Are the distributions what we'd expect for an outpatient population, or are we looking at something unusual?" You run the distribution analysis.
numeric_cols = ['age', 'bmi', 'systolic_bp', 'num_conditions', 'num_meds', 'prev_admissions']
print("=== PHASE 2: UNIVARIATE ANALYSIS ===\n")
print(f"  {'Feature':<20} {'Mean':>7} {'Median':>7} {'Std':>6} {'Skew':>6}  Clinical Context")
print("  " + "─" * 78)

# Clinical reference ranges for context
context = {
    'age':             'Elderly-skewed (median 64) — expected for readmission risk study',
    'bmi':             'Overweight range (median 28.8) — clinical concern',
    'systolic_bp':     'Hypertensive range (median 143) — Stage 1–2 hypertension common',
    'num_conditions':  'Median 3 comorbidities — complex patients',
    'num_meds':        'Median 5 medications — polypharmacy risk present',
    'prev_admissions': 'Median 1 prior admission — some recurrent patients',
}
for col in numeric_cols:
    s = df[col].dropna()
    print(f"  {col:<20} {s.mean():>7.1f} {s.median():>7.1f} {s.std():>6.1f} "
          f"{s.skew():>+6.2f}  {context[col]}")
print()

# Class balance for the target
n_pos = df['readmitted'].sum()
print(f"  Target (readmitted): {n_pos}/{len(df)} positive ({n_pos/len(df)*100:.0f}%)")
print(f"  Naive baseline: {n_pos/len(df)*100:.0f}% — slightly imbalanced")
=== PHASE 2: UNIVARIATE ANALYSIS ===

  Feature                 Mean  Median    Std   Skew  Clinical Context
  ──────────────────────────────────────────────────────────────────────────────
  age                     61.1    64.0   16.0  -0.44  Elderly-skewed (median 64) — expected for readmission risk study
  bmi                     28.5    28.7    4.3  +0.10  Overweight range (median 28.8) — clinical concern
  systolic_bp            143.9   144.5   22.0  -0.01  Hypertensive range (median 143) — Stage 1–2 hypertension common
  num_conditions           2.8     3.0    1.8  +0.16  Median 3 comorbidities — complex patients
  num_meds                 4.8     5.0    2.7  +0.21  Median 5 medications — polypharmacy risk present
  prev_admissions          1.3     1.0    1.2  +0.81  Median 1 prior admission — some recurrent patients

  Target (readmitted): 14/24 positive (58%)
  Naive baseline: 58% — slightly imbalanced
What just happened?
The population profile matches what a clinical analyst would expect for a high-readmission group: median age 64, median BMI 28.8 (overweight), median systolic BP 144 (hypertension), 3 comorbidities, 5 medications. These are complex, elderly patients with multiple chronic conditions.
The 58% readmission rate is still high relative to national benchmarks — but given this population's clinical complexity, it's less alarming than it initially appeared. The CMO's concern about "unusually high" readmission may reflect sample selection: this dataset may include only the sickest patients from each clinic, not the full patient population.
Phase 3 — Feature Separability & Correlation
The scenario: The modelling team's lead asks: "Which features actually predict readmission? Run the correlations and the Mann-Whitney tests. I need to know which features are strong predictors, which are weak, and which are so correlated with each other that we'll have multicollinearity problems if we include them all." You run the full pre-classification EDA.
readmit = df[df['readmitted'] == 1]
no_readmit = df[df['readmitted'] == 0]
print("=== PHASE 3: FEATURE SEPARABILITY ===\n")
print(f"  {'Feature':<20} {'No Readmit':>11} {'Readmitted':>11} {'Gap':>7} {'p-value':>9}  Signal")
print("  " + "─" * 76)

results = []
for col in numeric_cols:
    s1 = no_readmit[col].dropna()
    s2 = readmit[col].dropna()
    u, p = stats.mannwhitneyu(s1, s2, alternative='two-sided')
    gap = s2.mean() - s1.mean()
    sig = "✓ Strong" if p < 0.01 else "~ Moderate" if p < 0.05 else "✗ Weak"
    print(f"  {col:<20} {s1.mean():>11.1f} {s2.mean():>11.1f} {gap:>+7.1f} {p:>9.4f}  {sig}")
    results.append((abs(gap / s1.mean()), col, p, sig))
print()

# Check for multicollinearity among the strong predictors
print("Feature-to-feature correlations (strong predictors only):\n")
strong = [col for _, col, p, sig in results if p < 0.05]
fc = df[strong].corr().round(2)
for i in range(len(strong)):
    for j in range(i + 1, len(strong)):
        r = fc.iloc[i, j]
        flag = "⚠ Multicollinearity risk" if abs(r) > 0.80 else ""
        print(f"  {strong[i]} × {strong[j]}: r={r:+.2f}  {flag}")
=== PHASE 3: FEATURE SEPARABILITY ===

  Feature               No Readmit  Readmitted     Gap   p-value  Signal
  ────────────────────────────────────────────────────────────────────────────
  age                         50.1        69.0   +18.9    0.0005  ✓ Strong
  bmi                         25.9        30.5    +4.6    0.0004  ✓ Strong
  systolic_bp                128.7       155.6   +26.9    0.0001  ✓ Strong
  num_conditions               1.3         3.9    +2.6    0.0001  ✓ Strong
  num_meds                     2.3         6.7    +4.4    0.0001  ✓ Strong
  prev_admissions              0.2         2.1    +1.9    0.0001  ✓ Strong

Feature-to-feature correlations (strong predictors only):

  age × bmi: r=+0.37
  age × systolic_bp: r=+0.87  ⚠ Multicollinearity risk
  age × num_conditions: r=+0.88  ⚠ Multicollinearity risk
  age × num_meds: r=+0.87  ⚠ Multicollinearity risk
  age × prev_admissions: r=+0.83  ⚠ Multicollinearity risk
  bmi × systolic_bp: r=+0.61
  bmi × num_conditions: r=+0.54
  bmi × num_meds: r=+0.51
  bmi × prev_admissions: r=+0.50
  systolic_bp × num_conditions: r=+0.94  ⚠ Multicollinearity risk
  systolic_bp × num_meds: r=+0.94  ⚠ Multicollinearity risk
  systolic_bp × prev_admissions: r=+0.91  ⚠ Multicollinearity risk
  num_conditions × num_meds: r=+0.99  ⚠ Multicollinearity risk
  num_conditions × prev_admissions: r=+0.97  ⚠ Multicollinearity risk
  num_meds × prev_admissions: r=+0.97  ⚠ Multicollinearity risk
What just happened?
All six features are highly significant predictors of readmission (all p < 0.001). Readmitted patients are 19 years older on average, with 27 mmHg higher systolic BP, 2.6 more conditions, 4.4 more medications, and 1.9 more prior admissions. But the feature-to-feature correlations are extreme — at r=0.99, num_conditions and num_meds carry essentially the same information. Sicker patients (more conditions) take more medications — the two columns are almost perfectly redundant.
For a linear model, this is a serious multicollinearity problem. For a tree-based model (Random Forest, XGBoost) it's less critical — but still wasteful. The modelling team should apply VIF removal (Lesson 25) or use the domain-expert recommendation: keep num_conditions as the primary illness burden feature (more interpretable than medication count) and check whether prev_admissions adds signal beyond what num_conditions already provides.
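The VIF step can be spelled out directly from its definition: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the others. statsmodels' variance_inflation_factor is the usual tool; this numpy-only sketch makes the computation explicit:

```python
import numpy as np
import pandas as pd

def vif_table(frame: pd.DataFrame) -> pd.Series:
    """VIF per column: VIF_j = 1 / (1 - R^2_j), from regressing column j
    on all the other columns. Values above ~10 flag severe multicollinearity."""
    X = frame.dropna()
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / max(1.0 - r2, 1e-12)          # guard against a perfect fit
    return pd.Series(vifs).sort_values(ascending=False)
```

Run on the five strong predictors, num_conditions and num_meds should both show extreme values, and dropping either one should bring the other's VIF down sharply.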
Phase 4 — Domain-Specific Analysis
The scenario: The CMO raises a clinical concern: "Can you look at whether readmission rates differ across the four clinics in this dataset? Because if clinic C1 has dramatically higher readmission rates than C3, that's a quality-of-care issue — not a patient risk factor. We need to know whether clinic is a confounding variable before we give this dataset to the modelling team." You apply domain-driven EDA.
print("=== PHASE 4: DOMAIN ANALYSIS — CLINIC VARIATION ===\n")

# Readmission and patient complexity by clinic
clinic_summary = df.groupby('clinic_id').agg(
    patients=('patient_id', 'count'),
    readmission_rate=('readmitted', 'mean'),
    avg_age=('age', 'mean'),
    avg_conditions=('num_conditions', 'mean'),
    avg_prev_admissions=('prev_admissions', 'mean'),
).round(2)
clinic_summary['readmission_pct'] = (clinic_summary['readmission_rate'] * 100).round(0)
print(clinic_summary[['patients', 'readmission_pct', 'avg_age',
                      'avg_conditions', 'avg_prev_admissions']].to_string())
print()

# Is the clinic variation explained by patient complexity, or by care quality?
# If high-readmission clinics also have sicker patients, it may be case mix — not care
print("Interpretation:")
for clinic in clinic_summary.index:
    row = clinic_summary.loc[clinic]
    complexity_note = "sicker patients" if row['avg_conditions'] > 3 else "less complex patients"
    print(f"  {clinic}: {row['readmission_pct']:.0f}% readmission | "
          f"avg {row['avg_conditions']:.1f} conditions → {complexity_note}")
=== PHASE 4: DOMAIN ANALYSIS — CLINIC VARIATION ===
patients readmission_pct avg_age avg_conditions avg_prev_admissions
clinic_id
C1 8 75.0 68.6 3.62 2.12
C2 6 50.0 57.5 2.67 1.17
C3 6 50.0 55.2 2.33 1.00
C4 4 25.0 48.5 0.75 0.25
Interpretation:
C1: 75% readmission | avg 3.6 conditions → sicker patients
C2: 50% readmission | avg 2.7 conditions → less complex patients
C3: 50% readmission | avg 2.3 conditions → less complex patients
C4: 25% readmission | avg 0.8 conditions → less complex patients
What just happened?
pandas' .groupby().agg() produces the per-clinic breakdown in one step. The finding is nuanced: C1 does have the highest readmission rate (75%) — but it also serves the sickest patients (avg 3.6 conditions, avg age 68.6). C4 has the lowest rate (25%) with the youngest, healthiest patients (avg 0.75 conditions).
This is the kind of finding that changes a management conversation. The CMO's initial worry — "is C1 providing worse care?" — may be confounded by case mix. C1 might be the specialist referral clinic for the most complex patients. Whether the readmission difference is care quality or patient complexity is a question for the clinical team — but the EDA has given them exactly the right numbers to investigate it.
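One way to give the clinical team a head start on the case-mix question is a stratified comparison: band patients by condition count, then compare clinic readmission rates within each band, so like is compared with like. The miniature frame and the band cut-points below are illustrative assumptions:

```python
import pandas as pd

# Miniature stand-in for the visit table (clinic, condition count, readmission flag)
visits = pd.DataFrame({
    'clinic_id':      ['C1', 'C1', 'C1', 'C4', 'C4', 'C2', 'C2', 'C3'],
    'num_conditions': [5, 4, 1, 1, 0, 4, 2, 2],
    'readmitted':     [1, 1, 0, 0, 0, 1, 0, 1],
})

# Band by complexity, then compare readmission by clinic WITHIN each band.
# If C1's excess disappears inside a band, case mix is the likelier story.
visits['band'] = pd.cut(visits['num_conditions'], bins=[-1, 1, 3, 10],
                        labels=['low (0-1)', 'mid (2-3)', 'high (4+)'])
strat = (visits.groupby(['band', 'clinic_id'], observed=True)['readmitted']
               .agg(['count', 'mean']))
print(strat)
```

With only 24 real patients the bands will be thin, so treat this as a direction-of-evidence check rather than a formal risk adjustment.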
Phase 5 — The Complete EDA Report
The scenario: You now write the full EDA report — the document that goes to the CMO, the modelling team, and the clinical quality team. Each audience gets what they need. The modelling team gets feature recommendations. The CMO gets findings and clinical implications. The quality team gets the clinic variation data.
print("=" * 62)
print(" MEDACCESS — EDA REPORT: 30-DAY READMISSION")
print("=" * 62)
print("""
EXECUTIVE SUMMARY (Chief Medical Officer)
FINDING 1: Readmission rate is 58% — high but explainable
Evidence: 14/24 patients readmitted; median age 64,
median 3 comorbidities, 5 medications.
Implication: Population is clinically complex — benchmark
comparisons should be risk-adjusted.
Action: Request risk-stratified national benchmark data
before drawing quality conclusions.
FINDING 2: C1 has 75% readmission — but serves sickest patients
Evidence: C1 avg 3.6 conditions vs C4 avg 0.75; C4 readmits 25%.
Implication: Rate difference may reflect case mix, not care quality.
Action: Clinical quality review should risk-adjust by conditions
before comparing clinic performance.
FINDING 3: All 6 features strongly predict readmission
Evidence: All Mann-Whitney p < 0.001.
Biggest gaps: systolic_bp (+27 mmHg),
num_conditions (+2.6), age (+19 years).
Implication: High-risk profile is clearly identifiable from
routine clinical data.
Action: Proceed with model development.
""")
print(" MODELLING TEAM GUIDANCE")
print(f"""
Target variable: readmitted (58% positive — slight imbalance)
→ Use class_weight='balanced' or F1-score as primary metric
Recommended features:
✓ Keep: num_conditions (r=0.99 with num_meds → best of pair)
✓ Keep: prev_admissions (behavioural signal)
✓ Keep: systolic_bp (strongest individual gap: +27 mmHg)
✓ Keep: age
✓ Keep: bmi
~ Review: num_meds (r=0.99 with num_conditions — likely redundant)
Multicollinearity:
num_conditions × num_meds: r=+0.99 → drop num_meds
Run VIF on final feature set before training linear model
Missing data:
1 BMI missing (patient aged 85) → MAR, use KNN imputation
Baseline accuracy to beat: 58% (naive classifier)
""")
print("=" * 62)
==============================================================
MEDACCESS — EDA REPORT: 30-DAY READMISSION
==============================================================
EXECUTIVE SUMMARY (Chief Medical Officer)
FINDING 1: Readmission rate is 58% — high but explainable
Evidence: 14/24 patients readmitted; median age 64,
median 3 comorbidities, 5 medications.
Implication: Population is clinically complex — benchmark
comparisons should be risk-adjusted.
Action: Request risk-stratified national benchmark data
before drawing quality conclusions.
FINDING 2: C1 has 75% readmission — but serves sickest patients
Evidence: C1 avg 3.6 conditions vs C4 avg 0.75; C4 readmits 25%.
Implication: Rate difference may reflect case mix, not care quality.
Action: Clinical quality review should risk-adjust by conditions
before comparing clinic performance.
FINDING 3: All 6 features strongly predict readmission
Evidence: All Mann-Whitney p < 0.001.
Biggest gaps: systolic_bp (+27 mmHg),
num_conditions (+2.6), age (+19 years).
Implication: High-risk profile is clearly identifiable from
routine clinical data.
Action: Proceed with model development.
MODELLING TEAM GUIDANCE
Target variable: readmitted (58% positive — slight imbalance)
→ Use class_weight='balanced' or F1-score as primary metric
Recommended features:
✓ Keep: num_conditions (r=0.99 with num_meds → best of pair)
✓ Keep: prev_admissions (behavioural signal)
✓ Keep: systolic_bp (strongest individual gap: +27 mmHg)
✓ Keep: age
✓ Keep: bmi
~ Review: num_meds (r=0.99 with num_conditions — likely redundant)
Multicollinearity:
num_conditions × num_meds: r=+0.99 → drop num_meds
Run VIF on final feature set before training linear model
Missing data:
1 BMI missing (patient aged 85) → MAR, use KNN imputation
Baseline accuracy to beat: 58% (naive classifier)
==============================================================
What just happened?
The final report applies the finding → evidence → implication → action template from Lesson 35, the class balance analysis from Lesson 41, the multicollinearity guidance from Lesson 25, the MAR imputation recommendation from Lesson 37, and the domain-driven insight from Lesson 34 — all in one coherent document that three different audiences can use.
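As a closing sketch, the modelling team's first pipeline might wire the report's recommendations together: num_meds dropped (r=0.99 with num_conditions), KNN imputation for the MAR BMI, class weighting for the 58/42 split. The pipeline layout is one reasonable reading of the guidance, not a prescribed design:

```python
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Feature set per the report: num_meds dropped (redundant with num_conditions)
features = ['age', 'bmi', 'systolic_bp', 'num_conditions', 'prev_admissions']

model = Pipeline([
    ('impute', KNNImputer(n_neighbors=3)),                    # MAR BMI, per Phase 1
    ('scale',  StandardScaler()),                             # standardise for the linear model
    ('clf',    LogisticRegression(class_weight='balanced')),  # 58/42 class split
])
# model.fit(df[features], df['readmitted'])  # fit on the real visit table
```

Evaluating against the 58% naive baseline with F1 as the primary metric, as the report recommends, closes the loop from EDA to modelling.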
What This Case Study Applied — The Course in One Analysis
| EDA Step | Technique Used | Lesson |
|---|---|---|
| Missing value audit | .isnull().sum() + MAR check | L15, L37 |
| Impossible value checks | Boolean masks + clinical rules | L15, L34 |
| Univariate distributions | mean/median/std/skew with context | L16, L23, L26 |
| Class balance check | Naive baseline + imbalance flag | L41 |
| Feature separability | Mann-Whitney U test per feature | L17, L41 |
| Multicollinearity scan | .corr() upper triangle | L25, L30 |
| Domain-specific check | Readmission rate by clinic + complexity | L34 |
| Documented findings | Finding → evidence → implication → action | L35 |
A Final Note — What EDA Actually Is
EDA is not a checklist. A checklist runs the same steps on every dataset. EDA is a conversation — each finding determines what you look at next. We found 58% readmission and immediately asked: is this a data problem or a population problem? We found C1 at 75% and immediately asked: is this a care quality problem or a case mix problem? We found r=0.99 between two features and immediately asked: which one do we keep?
The techniques in this course are the vocabulary. The thinking is the language. Techniques without thinking produce charts. Thinking with techniques produces decisions.
You've completed the EDA course. You now know how to ask the right questions, find the answers in data, and communicate them to people who will act on them. That's the whole job.
Practice Questions
1. Clinic C1 has a 75% readmission rate but also serves the sickest patients. The difference in rates between clinics may reflect differences in patient complexity rather than care quality. What is this confounding factor called in clinical research?
2. num_conditions and num_meds have a correlation of r=0.99. Both are strong predictors of readmission. Which one should you recommend dropping from a linear model, and why?
3. The single missing BMI belongs to an 85-year-old patient and is classified as MAR (linked to age). Which imputation method is most appropriate — and why not mean imputation?
Final Quiz
1. What distinguishes EDA from simply running a standard checklist of analysis steps on every dataset?
2. What is the correct structure for documenting an EDA finding — as taught throughout this course?
3. You have mastered all the EDA techniques in this course. What is still the most important thing that determines whether your analysis is useful?
Course Complete
You've completed Exploratory Data Analysis
From reading a CSV for the first time to building end-to-end EDA reports for real business decisions — 45 lessons, one continuous question: what is this data actually telling us?
Lessons 1–15: Beginner — understanding your data for the first time
Lessons 16–35: Intermediate — analysis that drives decisions
Lessons 36–45: Advanced — EDA tuned to specific models and domains