Feature Engineering Course
Missing Data
Missing values are not just a nuisance to clean up — they carry information. A blank annual_income might mean the applicant refused to disclose it. A blank last_login might mean the user never returned. How you handle missing data — and whether you exploit the missingness itself as a feature — is one of the most consequential decisions in any ML pipeline.
Missing data handling follows three steps: detect where values are missing and how much, diagnose why they are missing (random chance vs structural absence vs deliberate non-response), then decide — impute with a statistic, flag the missingness as a feature, or drop. The decision depends on the mechanism and the model.
Three Types of Missingness
Not all missing values are the same. The reason a value is missing determines how you should handle it — and whether the missingness itself is a useful signal.
Missing Completely At Random
The missing value has nothing to do with any other variable. A sensor glitched. A form page failed to load. The missingness carries no information. Safe to impute with mean, median, or mode. The missing indicator feature will have near-zero correlation with the target.
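The near-zero-correlation claim is easy to check with a small simulation (hypothetical data, not the lesson's dataset): when the missingness mask is drawn independently of everything else, the indicator tells the model nothing about the target.

```python
import numpy as np
import pandas as pd

# Hypothetical MCAR simulation: knock out 20% of a sensor column at random
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'sensor_reading': rng.normal(50, 10, n),
    'target': rng.integers(0, 2, n)
})
mask = rng.random(n) < 0.20          # random mask, independent of everything
df.loc[mask, 'sensor_reading'] = np.nan

# Under MCAR the missing indicator carries no signal about the target
indicator = df['sensor_reading'].isnull().astype(int)
corr = indicator.corr(df['target'])
print(f"MCAR indicator vs target correlation: {corr:+.4f}")
```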
Missing At Random
The missingness is related to other observed variables, but not to the missing value itself. Older customers are less likely to fill in their email address — the missingness depends on age, not on the email. Imputation using other features is the right approach here.
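Under MAR the imputation can lean on the observed variable that drives the missingness. A minimal sketch with hypothetical column names, using a group-wise median so each missing value gets the typical value for its own group rather than one global statistic:

```python
import numpy as np
import pandas as pd

# Hypothetical MAR example: missingness depends on the observed age_band,
# not on the missing value itself
df = pd.DataFrame({
    'age_band': ['18-34', '18-34', '18-34', '55+', '55+', '55+'],
    'sessions_per_week': [6.0, np.nan, 8.0, 2.0, np.nan, 1.0]
})
# Group-wise median: each NaN is filled with its own group's typical value
df['sessions_per_week'] = (
    df.groupby('age_band')['sessions_per_week']
      .transform(lambda s: s.fillna(s.median()))
)
print(df['sessions_per_week'].tolist())  # [6.0, 7.0, 8.0, 2.0, 1.5, 1.0]
```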
Missing Not At Random
The missing value depends on the value itself. High earners omit their income. Dissatisfied customers skip the rating. The fact that it is missing tells you something important about what the value would have been. Create a binary indicator column — the missingness is a feature.
Detecting and Profiling Missing Values
The scenario: You're a data scientist at a lending company. Your manager drops a loan application dataset on your desk and says: "Before we start modelling, I need a full missing data audit. Tell me which columns have nulls, what percentage is missing, and which ones I should be most worried about." You build the standard missing data profile every data scientist runs at the start of a new project.
# pandas — core data library, always imported as pd
import pandas as pd
# numpy — numerical Python, imported as np
# np.nan is the standard pandas representation of a missing value
import numpy as np
# Loan application dataset — 12 rows with realistic missing value patterns
loan_df = pd.DataFrame({
'applicant_id': ['A01','A02','A03','A04','A05','A06',
'A07','A08','A09','A10','A11','A12'],
'annual_income': [52000, np.nan, 38000, 140000, np.nan,
48000, 112000, np.nan, 78000, 105000, np.nan, 61000],
'credit_score': [720, 810, np.nan, 890, 580, 695,
775, np.nan, 740, 820, 660, np.nan],
'employment_yrs':[5, 12, 2, 20, np.nan, 8, 15, 3, 7, 18, np.nan, 4],
'loan_amount': [150000,320000,85000,500000,210000,175000,
430000,95000,280000,360000,190000,240000],
'approved': [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
})
# .isnull().sum() counts missing values per column
# .isnull().mean() gives the proportion (0 to 1)
missing_count = loan_df.isnull().sum()
missing_pct = loan_df.isnull().mean().mul(100).round(1)
# Build a readable audit table
audit = pd.DataFrame({
'missing_count': missing_count,
'missing_pct': missing_pct,
'dtype': loan_df.dtypes
}).query('missing_count > 0').sort_values('missing_pct', ascending=False)
print("=== Missing Data Audit ===\n")
print(audit.to_string())
# Also check: total rows vs complete rows (no nulls anywhere)
complete_rows = loan_df.dropna().shape[0]
print(f"\nTotal rows : {len(loan_df)}")
print(f"Complete rows : {complete_rows}")
print(f"Rows with nulls : {len(loan_df) - complete_rows}")
=== Missing Data Audit ===
missing_count missing_pct dtype
annual_income 4 33.3 float64
credit_score 3 25.0 float64
employment_yrs 2 16.7 float64
Total rows : 12
Complete rows : 6
Rows with nulls : 6
What just happened?
.isnull().sum() counts the True values in a boolean mask — each True represents one missing cell. .isnull().mean() gives the proportion as a decimal, which .mul(100) converts to a percentage. Three columns have missing values: annual_income at 33% is the most urgent — one in three applicants didn't disclose their income. Six of the twelve rows have at least one null, meaning that if we used .dropna() to remove incomplete rows, we'd lose half the dataset. That makes imputation or indicator features the smarter path.
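One follow-up worth running on the same data (a sketch, beyond what the manager asked for): check which columns go missing together. Grouping rows by their null pattern can hint at a shared cause, such as a single broken form page.

```python
import numpy as np
import pandas as pd

loan_df = pd.DataFrame({
    'annual_income': [52000, np.nan, 38000, 140000, np.nan,
                      48000, 112000, np.nan, 78000, 105000, np.nan, 61000],
    'credit_score':  [720, 810, np.nan, 890, 580, 695,
                      775, np.nan, 740, 820, 660, np.nan],
    'employment_yrs':[5, 12, 2, 20, np.nan, 8, 15, 3, 7, 18, np.nan, 4],
})
# .isnull().value_counts() counts each unique True/False row pattern,
# most common pattern first
pattern_counts = loan_df.isnull().value_counts()
print(pattern_counts.to_string())
```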
Simple Imputation — Mean, Median, and Mode
The scenario: The analytics lead reviews the audit and says: "For employment_yrs — those are almost certainly MCAR, a data entry issue. Fill them with the median. For credit_score — same thing, fill with median. But for annual_income — I'm not so sure. High earners often skip that field. Don't impute it yet; we'll handle it separately." You apply targeted imputation and validate that the imputed values are reasonable.
import pandas as pd
import numpy as np
loan_df = pd.DataFrame({
'annual_income': [52000, np.nan, 38000, 140000, np.nan,
48000, 112000, np.nan, 78000, 105000, np.nan, 61000],
'credit_score': [720, 810, np.nan, 890, 580, 695,
775, np.nan, 740, 820, 660, np.nan],
'employment_yrs':[5, 12, 2, 20, np.nan, 8, 15, 3, 7, 18, np.nan, 4],
'approved': [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
})
# Median imputation — robust to outliers, preferred over mean for skewed distributions
# .fillna() replaces every NaN with the given value
credit_median = loan_df['credit_score'].median()
emp_median = loan_df['employment_yrs'].median()
loan_df['credit_score'] = loan_df['credit_score'].fillna(credit_median)
loan_df['employment_yrs'] = loan_df['employment_yrs'].fillna(emp_median)
print(f"credit_score median used for imputation : {credit_median}")
print(f"employment_yrs median used for imputation : {emp_median}")
# Verify imputation worked — should show 0 nulls for those two columns
print("\nMissing values after imputation:")
print(loan_df[['credit_score','employment_yrs','annual_income']].isnull().sum().to_string())
# Check the imputed values look sensible — are they in a reasonable range?
print("\ncredit_score summary after imputation:")
print(loan_df['credit_score'].describe().round(1).to_string())
credit_score median used for imputation : 740.0
employment_yrs median used for imputation : 7.5
Missing values after imputation:
credit_score      0
employment_yrs    0
annual_income     4
credit_score summary after imputation:
count     12.0
mean     742.5
std       79.4
min      580.0
25%      713.8
50%      740.0
75%      783.8
max      890.0
What just happened?
.median() computes the middle value of the non-null entries. .fillna(value) replaces every NaN with the specified value and returns a new Series — it does not modify in place unless you reassign or use inplace=True. Median is preferred over mean for imputation when the distribution is skewed — a single outlier won't drag the imputed value to an unrealistic number. The summary shows the imputed credit score of 740.0 sits right at the median, which is exactly what we want — it's an honest "we don't know, so we'll use the typical value" rather than a guess.
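The outlier argument is easy to demonstrate. A quick sketch with illustrative incomes and one extreme value: the mean is dragged far above anything typical, while the median barely notices.

```python
import numpy as np
import pandas as pd

# Illustrative skewed column: five ordinary incomes, one NaN, one outlier
income = pd.Series([38000, 48000, 52000, 61000, 78000, np.nan, 2_500_000])
print(f"mean   : {income.mean():,.0f}")    # 462,833, pulled up by the outlier
print(f"median : {income.median():,.0f}")  # 56,500, still a typical income
```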
Missing Indicator Features — When Absence Is the Signal
The scenario: The analytics lead was right to pause on annual_income. You investigate further and notice that three of the four applicants who left income blank were rejected. This is a textbook MNAR pattern — not disclosing income is itself a signal about the applicant's risk profile. You create a binary indicator feature for the missingness, impute the column with median, then validate that the indicator correlates with the target.
import pandas as pd
import numpy as np
loan_df = pd.DataFrame({
'annual_income': [52000, np.nan, 38000, 140000, np.nan,
48000, 112000, np.nan, 78000, 105000, np.nan, 61000],
'approved': [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
})
# Step 1: create the missing indicator BEFORE imputing
# .isnull() returns True where the value is NaN — .astype(int) converts to 1/0
# This must be done before fillna — once imputed, the missingness information is gone
loan_df['income_missing'] = loan_df['annual_income'].isnull().astype(int)
# Step 2: now impute the column — indicator has already captured the missingness
income_median = loan_df['annual_income'].median()
loan_df['annual_income'] = loan_df['annual_income'].fillna(income_median)
# Step 3: validate — does the missing indicator correlate with approval?
corr_income = loan_df['annual_income'].corr(loan_df['approved'])
corr_indicator = loan_df['income_missing'].corr(loan_df['approved'])
print("Correlation with approved:\n")
print(f" annual_income (after imputation) : {corr_income:+.4f}")
print(f" income_missing (indicator flag) : {corr_indicator:+.4f}")
# Show what the indicator captured
print("\nIncome missing vs approved:\n")
print(loan_df[['annual_income','income_missing','approved']].to_string(index=False))
# Cross-tabulation — what is the approval rate for each group?
print("\nApproval rate by income_missing:")
print(loan_df.groupby('income_missing')['approved'].mean().round(2).to_string())
Correlation with approved:
annual_income (after imputation) : +0.5154
income_missing (indicator flag) : -0.4781
Income missing vs approved:
annual_income income_missing approved
52000.0 0 1
69500.0 1 1
38000.0 0 0
140000.0 0 1
69500.0 1 0
48000.0 0 0
112000.0 0 1
69500.0 1 0
78000.0 0 1
105000.0 0 1
69500.0 1 0
61000.0 0 1
Approval rate by income_missing:
income_missing
0 0.75
1 0.25
What just happened?
The order of operations is critical here: create the indicator before imputing. Once .fillna() runs, every NaN is replaced and the missingness information is permanently gone. The results support the MNAR hypothesis: applicants who disclosed their income were approved 75% of the time, while those who didn't were approved only 25% of the time. At −0.478, the income_missing indicator is nearly as strong a predictor as the imputed income value itself (+0.515), and it carries a signal that imputation alone would have erased — the absence of data was itself one of the most informative signals in the column.
sklearn SimpleImputer — Production-Ready Imputation
The scenario: The team is building a production pipeline using sklearn. The lead says: "We need imputation that fits on training data and transforms test data consistently — if the training median for credit score is 740, that same value must be used when new applications arrive, not recalculated from the new batch." You replace the manual .fillna() calls with sklearn's SimpleImputer, which fits and stores the imputation statistics for reuse.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
loan_df = pd.DataFrame({
'credit_score': [720, 810, np.nan, 890, 580, 695,
775, np.nan, 740, 820, 660, np.nan],
'employment_yrs':[5, 12, 2, 20, np.nan, 8, 15, 3, 7, 18, np.nan, 4],
'approved': [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
})
# SimpleImputer fills missing values using a chosen strategy
# strategy='median' uses the column median (robust to outliers)
# strategy='mean' uses the column mean
# strategy='most_frequent' uses the mode (useful for categorical columns)
imputer = SimpleImputer(strategy='median')
# Columns to impute
cols_to_impute = ['credit_score', 'employment_yrs']
# .fit() learns the medians from the training data and stores them
imputer.fit(loan_df[cols_to_impute])
# .statistics_ shows exactly what values were learned — important to inspect
print("Learned imputation statistics:")
for col, stat in zip(cols_to_impute, imputer.statistics_):
print(f" {col:<20} median = {stat}")
# .transform() applies the stored statistics to fill missing values
# Returns a numpy array — convert back to DataFrame with proper column names
imputed_array = imputer.transform(loan_df[cols_to_impute])
loan_df[cols_to_impute] = imputed_array
# Verify no nulls remain
print(f"\nNulls remaining after imputation: {loan_df[cols_to_impute].isnull().sum().sum()}")
print("\nImputed dataset:")
print(loan_df[['credit_score','employment_yrs','approved']].to_string(index=False))
Learned imputation statistics:
credit_score median = 740.0
employment_yrs median = 7.5
Nulls remaining after imputation: 0
Imputed dataset:
credit_score employment_yrs approved
720.0 5.0 1
810.0 12.0 1
740.0 2.0 0
890.0 20.0 1
580.0 7.5 0
695.0 8.0 0
775.0 15.0 1
740.0 3.0 0
740.0 7.0 1
820.0 18.0 1
660.0 7.5 0
740.0 4.0 1
What just happened?
SimpleImputer from scikit-learn follows the standard sklearn fit/transform pattern. .fit() computes and stores the median of each column in .statistics_. .transform() fills missing values using those stored statistics. The key advantage over manual .fillna(): the imputer can be saved and loaded in production — new application batches always get imputed with the training-time median of 740.0, not a freshly recalculated value from whatever new data arrives. This prevents train/test leakage and keeps the pipeline deterministic.
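To make the production behaviour concrete, here is a sketch of the fitted imputer applied to incoming data (the new batch is hypothetical, not from the lesson): the stored training medians are used, never statistics recomputed from the batch itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Fit on the training data, exactly as above
train = pd.DataFrame({
    'credit_score':  [720, 810, np.nan, 890, 580, 695,
                      775, np.nan, 740, 820, 660, np.nan],
    'employment_yrs':[5, 12, 2, 20, np.nan, 8, 15, 3, 7, 18, np.nan, 4],
})
imputer = SimpleImputer(strategy='median')
imputer.fit(train)

# A hypothetical new application batch with its own, very different values
new_batch = pd.DataFrame({
    'credit_score':  [np.nan, 600.0],
    'employment_yrs':[1.0, np.nan],
})
filled = imputer.transform(new_batch)
print(filled)  # NaNs filled with the stored training medians, not batch medians
```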
Missing Data Decision Guide
| Situation | Likely type | Action | Create indicator? |
|---|---|---|---|
| Random sensor failure, form glitch | MCAR | Median / mean imputation | Usually not — low correlation with target |
| Older users skip optional fields | MAR | Impute using related features | Check correlation — often useful |
| High earners omit income | MNAR | Create indicator first, then impute | Yes — missingness is predictive |
| Column is 70%+ missing | Any | Consider dropping the column entirely | Only if remainder has signal |
| Categorical column has nulls | Any | Impute with mode or add "Unknown" category | Check correlation first |
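The last row of the guide deserves a sketch of its own. Both categorical options, with a hypothetical employment_type column; note that the "Unknown" route keeps the missingness visible to the model, while mode imputation erases it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'employment_type': ['salaried', 'self-employed', np.nan,
                        'salaried', np.nan, 'salaried']
})
# Option A: mode imputation, fills NaN with the most frequent category
mode_value = df['employment_type'].mode()[0]
option_a = df['employment_type'].fillna(mode_value)

# Option B: an explicit "Unknown" category preserves the missingness signal
option_b = df['employment_type'].fillna('Unknown')

print(option_a.value_counts().to_string())
print(option_b.value_counts().to_string())
```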
Teacher's Note
The most dangerous imputation mistake is computing imputation statistics on the full dataset before splitting into train and test. If you calculate the median of credit_score including test rows, that median leaks information from test into train. In production, the training median is fixed at the time of training — new data gets imputed with the historical value, not a fresh recalculation. Always fit imputers on training data only.
A second common mistake: deleting rows with missing values (.dropna()) when MNAR patterns exist. In the loan example, three of the four rows with missing income are rejections — dropping them would strip most of one outcome class from training and produce a model that systematically overestimates approval rates for applicants who hide their income.
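That second mistake is easy to quantify on the lesson's own data. A sketch: compare the approval rate before and after .dropna(); because most of the dropped rows are rejections, the surviving sample looks far more approvable than the real applicant pool.

```python
import numpy as np
import pandas as pd

loan_df = pd.DataFrame({
    'annual_income': [52000, np.nan, 38000, 140000, np.nan,
                      48000, 112000, np.nan, 78000, 105000, np.nan, 61000],
    'approved': [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
})
rate_before = loan_df['approved'].mean()            # 7/12, about 0.58
rate_after = loan_df.dropna()['approved'].mean()    # 6/8 = 0.75
print(f"Approval rate, full data    : {rate_before:.2f}")
print(f"Approval rate, after dropna : {rate_after:.2f}")
```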
Practice Questions
1. When the probability of a value being missing depends on the value itself — for example, high earners omitting their income — this is called ___.
2. When creating a missing indicator feature, you must create it ___ imputing the column — otherwise the missingness information is lost.
3. The scikit-learn class that fits on training data, stores imputation statistics, and applies them consistently to new data is called ___.
Quiz
1. Why is using .dropna() dangerous when MNAR missingness is present?
2. For a skewed numerical column like annual_income, which imputation statistic is preferable and why?
3. What is the correct way to use SimpleImputer in a train/test split to prevent data leakage?
Up Next · Lesson 9
Outliers
Detect outliers with IQR and Z-score methods, decide whether to cap, remove, or keep them, and learn when an outlier is actually the most valuable data point in your dataset.