EDA Course
Advanced Imputation
Filling missing values with the column mean is quick — and often wrong. It ignores relationships between columns, distorts distributions, and pretends every missing value is the same kind of missing. This lesson covers the one question you must answer before imputing anything — why is the value missing? — the three possible answers, and the methods that actually respect the structure of your data.
The Three Kinds of Missingness
Before choosing an imputation strategy, you need to understand why the values are missing. The reason for missingness determines the correct fix:
Missing Completely at Random (MCAR)
The missing values have no relationship to any column — a sensor randomly failed, a form field was accidentally skipped. Safe to impute with mean, median, or a model. The missing rows are a random sample of all rows.
Missing at Random (MAR)
The missingness is related to other columns, but not to the missing value itself. Example: older patients are less likely to report their weight — the missing weight depends on age, not on how heavy they are. Imputation using other columns (KNN, iterative) works well.
Missing Not at Random (MNAR)
The missingness is directly related to the missing value itself. Example: high earners skip the income question because they have high incomes. Imputing this with mean income systematically undercounts high earners. No imputation method fixes MNAR — you need to address the data collection process.
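The three mechanisms are easy to simulate, and doing so builds intuition for why MNAR is the dangerous case. The sketch below is purely illustrative — the column names, probabilities, and thresholds are invented for the demo:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
age = rng.integers(20, 80, n)
income = rng.normal(50, 15, n)            # true mean income ≈ 50
demo = pd.DataFrame({'age': age, 'income': income})

# MCAR: every income value has the same 10% chance of going missing
mcar = rng.random(n) < 0.10
# MAR: older respondents skip more often — depends on AGE, not on income
mar = rng.random(n) < np.where(age > 60, 0.40, 0.05)
# MNAR: high earners skip — depends on the INCOME value itself
mnar = rng.random(n) < np.where(income > 65, 0.50, 0.05)

means = {}
for name, mask in [('MCAR', mcar), ('MAR', mar), ('MNAR', mnar)]:
    means[name] = demo.loc[~mask, 'income'].mean()
    print(f"{name}: mean of the incomes you can still observe = {means[name]:.1f}")
# Only the MNAR mask biases the observed mean (downward here),
# because the values that vanish are systematically the large ones
```

Under MCAR and this MAR setup the observed mean stays close to the true mean, because the surviving rows are still a representative sample of income; under MNAR the observed mean drops — and no amount of imputing from what you can see will recover the values you can't.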
The Dataset We'll Use
The scenario: You're a data scientist at a healthcare analytics firm. The clinical team has given you patient records for a readmission prediction model. The data has three columns with missing values, and the clinical director has explained why each one is missing: BMI is often not recorded for older patients (MAR — missingness linked to age), systolic blood pressure is missing randomly due to equipment failures (MCAR), and income is missing because high-income patients tend to skip the financial disclosure form (MNAR). Each requires a different approach.
import pandas as pd
import numpy as np
np.random.seed(42)
# Patient records — three columns with different types of missingness
df = pd.DataFrame({
'patient_id': range(1, 21),
'age': [34, 67, 45, 72, 28, 58, 41, 63, 37, 55,
29, 70, 48, 65, 31, 61, 44, 68, 52, 39],
'bmi': [24.1, np.nan, 28.3, np.nan, 22.5, np.nan, 26.8, np.nan,
23.9, 27.1, 21.8, np.nan, 29.4, np.nan, 22.1, np.nan,
25.6, np.nan, 28.8, 23.2],
# BMI missing for patients aged 58+ — MAR (linked to age)
'systolic_bp': [118, 145, np.nan, 152, 112, np.nan, 128, 141, np.nan,
138, 115, 149, np.nan, 144, 110, np.nan, 126, 147, 132, 119],
# systolic_bp missing randomly due to equipment failure — MCAR
'income_band': [3, np.nan, 2, np.nan, 2, 3, 1, np.nan, 2, 3,
1, np.nan, 2, np.nan, 1, 3, 2, np.nan, 3, 2],
# income_band missing because high earners (band 3+) skip form — MNAR
'readmitted': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
})
print("Missing value counts:")
print(df.isnull().sum())
print(f"\nTotal missing cells: {df.isnull().sum().sum()} of {df.shape[0]*df.shape[1]}")
Missing value counts:
patient_id     0
age            0
bmi            8
systolic_bp    5
income_band    6
readmitted     0
dtype: int64

Total missing cells: 19 of 120
What just happened?
19 missing values across three columns. The clinical director already told us why each one is missing — that context is gold. Without it, we might apply the wrong fix to each column. Now let's verify the MAR hypothesis for BMI before we impute it.
Step 1 — Verify the Missingness Pattern
The scenario: The clinical director said BMI missingness is linked to age — older patients skip the measurement. Before imputing, you need to verify this claim with data. If the director is wrong and BMI is actually MCAR, a simpler imputation method works just as well. If she's right — if older patients really are disproportionately missing BMI — then you need an imputation method that uses age as context.
from scipy import stats
# Create a binary flag: 1 = BMI is missing, 0 = BMI is present
df['bmi_missing'] = df['bmi'].isnull().astype(int)
# Compare age distribution between rows where BMI is missing vs present
age_missing = df[df['bmi_missing'] == 1]['age']
age_present = df[df['bmi_missing'] == 0]['age']
print("=== VERIFYING BMI MISSINGNESS PATTERN ===\n")
print(f" Age when BMI is MISSING: mean={age_missing.mean():.1f} median={age_missing.median():.0f}")
print(f" Age when BMI is PRESENT: mean={age_present.mean():.1f} median={age_present.median():.0f}")
print()
# t-test: is the age difference statistically significant?
t_stat, p_value = stats.ttest_ind(age_missing, age_present)
print(f" t-test: t={t_stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
print(f" → CONFIRMED: BMI missingness IS significantly linked to age (p<0.05)")
print(f" → This is MAR — use an imputation method that uses age as context")
else:
print(f" → NOT confirmed: missingness may be random (MCAR)")
print(f" → Simple mean/median imputation is acceptable")
# Clean up helper column
df.drop(columns='bmi_missing', inplace=True)
=== VERIFYING BMI MISSINGNESS PATTERN ===

 Age when BMI is MISSING: mean=65.5 median=66
 Age when BMI is PRESENT: mean=40.2 median=40

 t-test: t=7.35, p=0.0000
 → CONFIRMED: BMI missingness IS significantly linked to age (p<0.05)
 → This is MAR — use an imputation method that uses age as context
What just happened?
pandas' .isnull().astype(int) creates a binary flag column — 1 where BMI is missing, 0 where it's present. We then split the age column by that flag and compare means. scipy's stats.ttest_ind() tests whether the age difference between the two groups is statistically significant.
The verdict is decisive: patients with missing BMI have a mean age of 65.5 vs 40.2 for those with BMI recorded — a 25-year gap with p < 0.0001. The clinical director was right. If we impute BMI with the overall mean (25.3), we'd be assigning a young person's BMI to elderly patients. That would quietly introduce bias into the model. We need a method that accounts for age.
Step 2 — KNN Imputation for MAR Data (BMI)
The scenario: Now that you've confirmed BMI is MAR (linked to age), you need an imputation method that fills each missing value based on similar patients. KNN (K-Nearest Neighbours) imputation finds the K most similar patients — based on all their other features — and fills the missing value with their average. A 67-year-old patient gets a BMI estimate from other elderly patients, not from the whole dataset. This respects the structure the data is telling you about.
from sklearn.impute import KNNImputer
# KNNImputer fills each missing value with the average of the K nearest rows
# "nearest" = most similar based on non-missing columns
# n_neighbors=5: use the 5 most similar patients to estimate each missing BMI
knn = KNNImputer(n_neighbors=5)
# We impute on the numeric columns only — KNN requires all numbers
# We use age and systolic_bp as the context columns for BMI estimation
cols_for_knn = ['age', 'bmi', 'systolic_bp']
imputed_matrix = knn.fit_transform(df[cols_for_knn])
# fit_transform learns the patterns and returns a filled numpy array
df['bmi_knn'] = imputed_matrix[:, 1].round(1) # column index 1 = bmi
# Compare: what would mean imputation have given vs KNN?
bmi_mean = df['bmi'].mean()
df['bmi_mean_imputed'] = df['bmi'].fillna(round(bmi_mean, 1))
print("=== KNN vs MEAN IMPUTATION — BMI ===\n")
print(f"Overall BMI mean (used by simple imputation): {bmi_mean:.1f}\n")
print(f"{'PatientID':>10} {'Age':>5} {'Actual BMI':>11} {'KNN Imputed':>12} {'Mean Imputed':>13} {'Missing?':>9}")
print("─" * 70)
for _, row in df.iterrows():
was_missing = "← was NaN" if pd.isna(row['bmi']) else ""
actual = f"{row['bmi']:.1f}" if pd.notna(row['bmi']) else "NaN"
print(f" {int(row['patient_id']):>8} {int(row['age']):>5} {actual:>11} "
f"{row['bmi_knn']:>12.1f} {row['bmi_mean_imputed']:>13.1f} {was_missing}")
=== KNN vs MEAN IMPUTATION — BMI ===
Overall BMI mean (used by simple imputation): 25.3
PatientID Age Actual BMI KNN Imputed Mean Imputed Missing?
──────────────────────────────────────────────────────────────────────
1 34 24.1 24.1 24.1
2 67 NaN 28.1 25.3 ← was NaN
3 45 28.3 28.3 28.3
4 72 NaN 28.4 25.3 ← was NaN
5 28 22.5 22.5 22.5
6 58 NaN 27.3 25.3 ← was NaN
7 41 26.8 26.8 26.8
8 63 NaN 27.8 25.3 ← was NaN
9 37 23.9 23.9 23.9
10 55 27.1 27.1 27.1
11 29 21.8 21.8 21.8
12 70 NaN 28.2 25.3 ← was NaN
13 48 29.4 29.4 29.4
14 65 NaN 27.9 25.3 ← was NaN
15 31 22.1 22.1 22.1
16 61 NaN 27.6 25.3 ← was NaN
17 44 25.6 25.6 25.6
18 68 NaN 28.0 25.3 ← was NaN
19 52 28.8 28.8 28.8
20 39 23.2 23.2 23.2
What just happened?
sklearn's KNNImputer takes a matrix, finds the 5 most similar complete rows for each missing value, and fills in the average. fit_transform() trains and fills in one step, returning a numpy array. We use column index 1 to extract the BMI column.
The difference is stark. Mean imputation gives every missing patient 25.3 — a young person's BMI. KNN gives patient 2 (age 67) a BMI of 28.1, patient 4 (age 72) a BMI of 28.4 — both drawn from similar elderly patients in the dataset. These are plausible, contextually appropriate estimates. The mean-imputed values are not visibly wrong — but they systematically bias the model against elderly patients, which is exactly the kind of hidden error that degrades model performance without ever raising an error message.
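One caveat before reusing this pattern: KNNImputer measures similarity with a NaN-aware Euclidean distance on the raw values, so a column with a large numeric range (systolic BP at ~110–150) can dominate one with a small range (BMI at ~22–29). Standardising before imputing, then inverting the transform, puts every column on an equal footing. A minimal sketch of that scale-then-impute pattern — the toy values below are invented, and StandardScaler is safe here because it ignores NaNs when fitting and passes them through transform:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    'age':         [34, 67, 45, 72, 28, 58],
    'bmi':         [24.1, np.nan, 28.3, np.nan, 22.5, 26.8],
    'systolic_bp': [118, 145, 122, 152, 112, 138],
})

# Scale -> impute -> unscale, so no single column dominates the distance
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)                 # NaNs survive scaling
filled_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
filled = pd.DataFrame(scaler.inverse_transform(filled_scaled),
                      columns=toy.columns).round(1)
print(filled)
```

With n_neighbors=2 each fill is the average of two observed neighbours, so the imputed BMIs land inside the observed BMI range; on real data you would tune n_neighbors like any other hyperparameter.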
Step 3 — Median Imputation for MCAR Data (Systolic BP)
The scenario: Systolic BP is missing because equipment randomly failed — no relationship to any patient characteristic. The clinical director confirmed: "Those gaps are pure equipment noise. You can fill them with the median without worrying about introducing bias." For truly random missingness, simple median imputation is not just acceptable — it's often better than a complex method, because a complex method might overfit to noise in a small dataset.
from sklearn.impute import SimpleImputer
# SimpleImputer fills missing values with a single statistic
# strategy='median' is more robust than 'mean' for skewed distributions
# Median is not affected by extreme blood pressure readings
bp_imputer = SimpleImputer(strategy='median')
# .fit_transform() computes the median from non-missing rows, then fills
bp_filled = bp_imputer.fit_transform(df[['systolic_bp']])
df['systolic_bp_imputed'] = bp_filled[:, 0].round(0)  # fit_transform returns a 2-D (n, 1) array — take column 0
bp_median = df['systolic_bp'].median()
print("=== MEDIAN IMPUTATION — SYSTOLIC BP ===\n")
print(f"Median computed from non-missing rows: {bp_median:.0f} mmHg\n")
# Show before and after for missing rows only
missing_bp = df[df['systolic_bp'].isnull()][['patient_id','age',
'systolic_bp','systolic_bp_imputed']]
print(missing_bp.to_string(index=False))
print()
# Verify the imputed column has no more missing values
print(f"Missing systolic_bp after imputation: {df['systolic_bp_imputed'].isnull().sum()}")
=== MEDIAN IMPUTATION — SYSTOLIC BP ===
Median computed from non-missing rows: 132 mmHg
 patient_id  age  systolic_bp  systolic_bp_imputed
          3   45          NaN                132.0
          6   58          NaN                132.0
          9   37          NaN                132.0
         13   48          NaN                132.0
         16   61          NaN                132.0
Missing systolic_bp after imputation: 0
What just happened?
sklearn's SimpleImputer(strategy='median') computes the median of the non-missing values during .fit() and fills all NaNs with that value during .transform(). We use median rather than mean because blood pressure distributions can be skewed by extreme values — a patient with a hypertensive crisis of 200+ mmHg would inflate the mean but not the median.
All five missing patients get 132 mmHg — the median of the observed readings. For MCAR data this is correct: the missing rows are a random sample of all patients, so filling with the population median introduces no systematic bias. Simple, defensible, done.
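A side note that becomes relevant in the next step: SimpleImputer can also emit missingness flags itself. Passing add_indicator=True appends one binary indicator column per feature that had missing values, so the fill and the flag come out of a single transformer. A small sketch with made-up readings:

```python
import numpy as np
from sklearn.impute import SimpleImputer

bp = np.array([[118.0], [np.nan], [128.0], [np.nan], [132.0]])

# With add_indicator=True the output has two columns:
# column 0 = the median-filled values, column 1 = 1 where the value was missing
imp = SimpleImputer(strategy='median', add_indicator=True)
out = imp.fit_transform(bp)
print(out)
```

This is handy inside pipelines, where adding a hand-rolled flag column between steps is awkward.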
Step 4 — Flag-and-Inform for MNAR Data (Income)
The scenario: Income band is MNAR — the people who skip the income question are disproportionately high earners. The clinical director warns you directly: "Do not impute income with the median. The median will assign band 2 to people who are almost certainly band 3 or 4. You'll build a model that thinks high-income patients have average incomes — and that will cause it to misjudge their readmission risk." Instead, the right approach is to create a missingness indicator flag and let the model learn from the fact that the income was withheld.
# For MNAR: do NOT fill with median — it would introduce systematic bias
# Instead: create a missingness flag that the model can learn from
# The fact that income is missing IS itself a signal — it likely means high earner
df['income_missing_flag'] = df['income_band'].isnull().astype(int)
# income_missing_flag = 1 means "this patient declined to report income"
# A smart model will learn: when income_missing_flag=1, readmission risk may differ
# Show the relationship between income being missing and the target
print("=== MNAR ANALYSIS — INCOME BAND ===\n")
print("Readmission rate by income disclosure:\n")
disclosed = df[df['income_missing_flag'] == 0]
withheld = df[df['income_missing_flag'] == 1]
print(f" Income DISCLOSED: n={len(disclosed)} readmission rate = "
f"{disclosed['readmitted'].mean()*100:.0f}%")
print(f" Income WITHHELD: n={len(withheld)} readmission rate = "
f"{withheld['readmitted'].mean()*100:.0f}%")
print()
# Verify that the income_missing_flag column IS correlated with readmission
from scipy import stats
r, p = stats.pointbiserialr(df['income_missing_flag'], df['readmitted'])
print(f" Correlation of income_missing_flag with readmitted: r={r:.3f} p={p:.4f}")
if p < 0.05:
print(f" → The flag IS a significant predictor. Keep it as a feature.")
else:
print(f" → Weak signal, but still worth including as a low-cost feature.")
print()
print("Final column plan:")
print(" income_band: → keep as-is (NaNs will be imputed or handled by model)")
print(" income_missing_flag: → add as a new feature column")
=== MNAR ANALYSIS — INCOME BAND ===

Readmission rate by income disclosure:

 Income DISCLOSED: n=14 readmission rate = 21%
 Income WITHHELD:  n=6 readmission rate = 100%

 Correlation of income_missing_flag with readmitted: r=0.724 p=0.0003
 → The flag IS a significant predictor. Keep it as a feature.

Final column plan:
 income_band: → keep as-is (NaNs will be imputed or handled by model)
 income_missing_flag: → add as a new feature column
What just happened?
pandas' .isnull().astype(int) creates the missingness flag. stats.pointbiserialr() is the correct correlation test for a binary flag vs a continuous or binary target — equivalent to Pearson for this case.
The analysis confirms the MNAR hypothesis: patients who withheld income have a 100% readmission rate vs 21% for those who disclosed. The missingness itself is a clinical signal — these patients are different in ways that affect outcomes. Filling their income with the median (band 2) would have erased that signal. By keeping the flag, we let the model see that "income withheld" is a meaningful category, not noise.
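If the downstream model can't accept NaNs at all (most sklearn estimators can't), one common pattern for tree-based models is to pair the flag with an out-of-range sentinel fill, so the withheld rows stay separable from every real band. A minimal sketch — the sentinel value -1 is an arbitrary choice, valid only because no real income band is negative:

```python
import numpy as np
import pandas as pd

inc = pd.DataFrame({'income_band': [3, np.nan, 2, np.nan, 1, 3]})

# Keep the signal twice over: a flag the model can split on directly,
# plus a sentinel fill that no real band uses
inc['income_missing_flag'] = inc['income_band'].isnull().astype(int)
inc['income_band_filled'] = inc['income_band'].fillna(-1)
print(inc)
```

A tree can then split on either column; the sentinel never gets averaged into the real bands the way a median fill would be.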
The Imputation Decision Framework
| Missing type | How to detect it | Best strategy | sklearn tool |
|---|---|---|---|
| MCAR | Missingness flag uncorrelated with all other columns | Median or mean imputation | SimpleImputer(strategy='median') |
| MAR | Missingness flag correlated with other columns | KNN imputation or iterative imputation | KNNImputer or IterativeImputer |
| MNAR | Missingness correlated with the missing value itself (domain knowledge required) | Create a missingness indicator flag; fix data collection at source | df['col_missing'] = df['col'].isnull().astype(int) |
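The table's last column mentions IterativeImputer, which the lesson didn't demonstrate. It models each incomplete column as a regression on the other columns, cycling until the fills stabilise — often a stronger MAR option than KNN when the relationships are roughly linear. It is still flagged experimental in sklearn at the time of writing, so it needs an explicit enabling import. A minimal sketch on invented data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required)
from sklearn.impute import IterativeImputer

# Second column tracks roughly 2x the first, with two gaps
X = np.array([
    [1.0,  2.1],
    [2.0,  np.nan],
    [3.0,  6.2],
    [4.0,  np.nan],
    [5.0, 10.1],
    [6.0, 11.9],
])

# Each missing value is predicted from the other column by a regressor
# (BayesianRidge by default), iterating until the estimates converge
imp = IterativeImputer(max_iter=10, random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled.round(1))   # the gaps get filled following the ~2x trend
```

Because the imputer regresses rather than averages neighbours, it can extrapolate along a trend — useful when a missing row's context sits at the edge of the observed range, where KNN can only pull values inward.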
Teacher's Note
Imputation is not cleaning. It is estimation. Every imputed value is a guess. The goal is to make a guess that is consistent with the data's actual structure — which means understanding why the value is missing before deciding how to replace it.
The most important piece of this lesson isn't the code — it's the conversation you should have with the data owner or domain expert before touching a single NaN. "Why is this value missing?" is the most important question in imputation. The code only matters once you have the answer.
Practice Questions
1. High-income survey respondents are more likely to skip the income question because they have high incomes. What type of missingness is this?
2. Which sklearn imputer fills each missing value using the average of the K most similar rows — making it ideal for MAR data where missingness is linked to other columns?
3. When data is MNAR, imputing with median introduces systematic bias. What should you create instead, so the model can learn from the fact that the value was withheld?
Quiz
1. Patients over 60 are less likely to have their BMI recorded. The missing BMI values are not related to how high or low the BMI actually is — just to the patient's age. What type of missingness is this?
2. Why does KNN imputation produce better estimates than mean imputation for MAR data?
3. You find a column with 15% missing values. What is the most important first step before choosing an imputation method?
Up Next · Lesson 38
Feature Engineering via EDA
Use EDA insights to build features that models can't discover on their own — interaction terms, polynomial features, and target encodings grounded in what the data actually shows you.