Feature Engineering Course
Missing Indicator Features
Most people treat missing data as a problem to be solved before modelling. The smarter move is to treat it as a signal — because why a value is missing is often more informative than what the value would have been.
A missing indicator feature is a binary column added alongside a column with missing values — it records which rows had a missing value before imputation. This preserves the missingness pattern as a usable signal even after the original gaps have been filled in.
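In its simplest form, with made-up numbers, the idea looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [52000.0, np.nan, 61000.0, np.nan]})

# Flag BEFORE imputing: once the NaNs are filled, the pattern is gone
df['income_was_missing'] = df['income'].isna().astype(int)
df['income'] = df['income'].fillna(df['income'].median())

print(df)
#     income  income_was_missing
# 0  52000.0                   0
# 1  56500.0                   1
# 2  61000.0                   0
# 3  56500.0                   1
```

The imputed column looks complete, but the flag column still records which two rows were originally blank.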
Missingness Is Not Always Random
The standard imputation workflow — fill missing values with the mean, median, or mode — assumes the values are missing completely at random (MCAR). In real datasets, this assumption fails constantly. Missing values cluster around specific subgroups, and those clusters are highly predictive:
Income fields in loan applications
Applicants who leave the income field blank are far more likely to be self-employed, recently unemployed, or financially distressed. The blank itself carries a default-risk signal that gets destroyed the moment you fill it with the mean income.
Lab results in clinical data
A missing HbA1c result often means the test was never ordered — which happens more frequently for patients who are healthier and attend fewer check-ups. Missing = healthier is a real and predictable pattern in medical datasets.
Survey responses
Questions about salary, age, and political views are skipped more often by specific demographic groups. Modelling on imputed values treats these subgroups as if they answered normally — destroying the non-response signal entirely.
E-commerce behavioural data
A missing last_purchase_date means the customer has never purchased — not that the date is unknown. Imputing with the median purchase date turns non-buyers into average buyers, completely corrupting the churn signal.
Step 1 — Creating Missing Indicators Manually
The scenario: You're building a default prediction model for a peer-to-peer lending platform. Three columns have missing values: annual_income, employment_years, and debt_to_income. A quick exploratory analysis shows that rows with missing employment_years have a default rate more than double that of the non-missing rows. You want to capture that signal before imputation destroys it.
# Import libraries
import pandas as pd
import numpy as np
# Build a loan default dataset — 600 rows, with deliberate missingness patterns
np.random.seed(42)
n = 600
# employment_years: missing more for high-risk applicants
emp_years_raw = np.random.randint(0, 30, n).astype(float)
# Make ~20% missing — concentrated among those who will default
default_prob = np.random.random(n)
missing_mask = (default_prob > 0.5) & (np.random.random(n) > 0.6)
emp_years_raw[missing_mask] = np.nan
# annual_income: missing ~12% randomly
income_raw = np.random.lognormal(10.8, 0.5, n)
income_raw[np.random.random(n) < 0.12] = np.nan
# debt_to_income: missing ~8% randomly
dti_raw = np.random.uniform(0.05, 0.75, n)
dti_raw[np.random.random(n) < 0.08] = np.nan
loan_df = pd.DataFrame({
'annual_income': income_raw,
'employment_years': emp_years_raw,
'debt_to_income': dti_raw,
'credit_score': np.random.randint(300, 850, n),
'loan_amount': np.random.randint(2000, 40000, n),
'default': (default_prob > 0.75).astype(int)
})
# Step 1: Check missingness rates
print("Missing value rates:")
print((loan_df.isnull().mean() * 100).round(2).to_string())
print()
# Step 2: Check default rate by missingness in employment_years
emp_missing = loan_df[loan_df['employment_years'].isna()]['default'].mean()
emp_not_missing = loan_df[loan_df['employment_years'].notna()]['default'].mean()
print(f"Default rate — employment_years MISSING : {emp_missing:.1%}")
print(f"Default rate — employment_years NOT MISSING: {emp_not_missing:.1%}")
print()
# Step 3: Create binary missing indicator columns
# Convention: column name + '_was_missing'
for col in ['annual_income', 'employment_years', 'debt_to_income']:
    loan_df[f'{col}_was_missing'] = loan_df[col].isna().astype(int)
# Confirm new columns
indicator_cols = [c for c in loan_df.columns if '_was_missing' in c]
print("Indicator columns added:")
print(loan_df[indicator_cols].sum().rename('rows_flagged').to_string())
Missing value rates:
annual_income       12.00
employment_years    20.17
debt_to_income       8.17
credit_score         0.00
loan_amount          0.00
default              0.00

Default rate — employment_years MISSING : 43.8%
Default rate — employment_years NOT MISSING: 21.3%

Indicator columns added:
annual_income_was_missing        72
employment_years_was_missing    121
debt_to_income_was_missing       49
What just happened?
The default rate for applicants with missing employment_years is 43.8% — more than double the 21.3% for applicants who provided it. This is precisely the kind of signal that disappears the moment you impute. Three binary indicator columns now preserve this pattern so the model can learn from it even after the gaps are filled.
Step 2 — Impute After Flagging, Never Before
The scenario: Now that the indicators are in place, you can safely impute the original columns. The order is critical: flag first, impute second. If you impute first, the indicator column will be all zeros — the missingness pattern is gone before it was ever recorded. You'll use median imputation here, but any imputation strategy works as long as it comes after the indicator columns are created.
from sklearn.impute import SimpleImputer
# Columns that need imputation
cols_to_impute = ['annual_income', 'employment_years', 'debt_to_income']
# Median imputer — robust to outliers, appropriate for skewed financial columns
imputer = SimpleImputer(strategy='median')
# Fit on training data only (simulating a pipeline — here using full df for demo)
loan_df[cols_to_impute] = imputer.fit_transform(loan_df[cols_to_impute])
# Verify: no more missing values in original columns
print("Missing values after imputation:")
print(loan_df[cols_to_impute].isnull().sum().to_string())
print()
# Verify: indicator columns are unchanged — they still record the original pattern
print("Indicator column sums (should match pre-imputation counts):")
print(loan_df[indicator_cols].sum().rename('rows_flagged').to_string())
print()
# Show a few rows where employment_years was missing — now has imputed value + flag = 1
was_missing = loan_df[loan_df['employment_years_was_missing'] == 1]
print("Sample rows where employment_years was imputed:")
print(was_missing[['annual_income', 'employment_years',
'employment_years_was_missing', 'default']].head(5))
Missing values after imputation:
annual_income       0
employment_years    0
debt_to_income      0

Indicator column sums (should match pre-imputation counts):
annual_income_was_missing        72
employment_years_was_missing    121
debt_to_income_was_missing       49

Sample rows where employment_years was imputed:
    annual_income  employment_years  employment_years_was_missing  default
4        54123.41              14.0                             1        1
9        48201.77              14.0                             1        0
17       61047.23              14.0                             1        1
23       39812.55              14.0                             1        1
31       72304.18              14.0                             1        0
What just happened?
All three original columns are now fully imputed — zero missing values. But the indicator columns are unchanged, still recording exactly which rows were originally missing. The sample rows show the imputed median value of 14.0 in employment_years alongside an employment_years_was_missing flag of 1. The model can now use both signals independently.
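Worth knowing: SimpleImputer can do the flag-then-impute step in a single transformer via its add_indicator parameter, which appends a MissingIndicator's output after the imputed features. A small sketch with toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])

# add_indicator=True appends one flag column per feature that had
# missing values at fit time, after the imputed feature columns
imp = SimpleImputer(strategy='median', add_indicator=True)
Xt = imp.fit_transform(X)

print(Xt.shape)  # (3, 4): two imputed columns + two indicator columns
print(Xt)
```

This is convenient when you don't need the explicit FeatureUnion layout shown later; the trade-off is less control over the indicator column names and placement.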
Step 3 — MissingIndicator from sklearn
The scenario: Manually creating indicator columns works but doesn't scale. When your team inherits a dataset with 80 columns and 30 of them have missing values, writing a loop is fine — but you'd rather have a transformer that plugs directly into a sklearn Pipeline, fits on training data, and applies the same indicator logic to test data without any manual column tracking. sklearn's MissingIndicator transformer does exactly this.
from sklearn.impute import MissingIndicator
from sklearn.model_selection import train_test_split
# Rebuild the raw dataset with missing values still intact
np.random.seed(42)
n = 600
emp_years_raw2 = np.random.randint(0, 30, n).astype(float)
default_prob2 = np.random.random(n)
missing_mask2 = (default_prob2 > 0.5) & (np.random.random(n) > 0.6)
emp_years_raw2[missing_mask2] = np.nan
income_raw2 = np.random.lognormal(10.8, 0.5, n)
income_raw2[np.random.random(n) < 0.12] = np.nan
dti_raw2 = np.random.uniform(0.05, 0.75, n)
dti_raw2[np.random.random(n) < 0.08] = np.nan
raw_df = pd.DataFrame({
'annual_income': income_raw2,
'employment_years': emp_years_raw2,
'debt_to_income': dti_raw2,
'credit_score': np.random.randint(300, 850, n),
'loan_amount': np.random.randint(2000, 40000, n),
})
y_raw = (default_prob2 > 0.75).astype(int)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
raw_df, y_raw, test_size=0.2, random_state=42
)
# MissingIndicator fitted on training data only
# features='missing-only': only adds indicators for columns that HAD missingness in train
indicator = MissingIndicator(features='missing-only', sparse=False)
indicator.fit(X_train_r)
# Transform returns a boolean array — one column per feature with train-time missingness
train_indicators = indicator.transform(X_train_r)
test_indicators = indicator.transform(X_test_r)
# Which columns got an indicator?
indicator_feature_names = raw_df.columns[indicator.features_].tolist()
print("Columns that received indicators:", indicator_feature_names)
print()
print(f"Indicator array shape (train): {train_indicators.shape}")
print(f"Indicator array shape (test) : {test_indicators.shape}")
print()
# Add indicator columns back to the DataFrames
ind_col_names = [f"{c}_was_missing" for c in indicator_feature_names]
X_train_with_ind = X_train_r.copy().reset_index(drop=True)
X_test_with_ind = X_test_r.copy().reset_index(drop=True)
for i, col in enumerate(ind_col_names):
    X_train_with_ind[col] = train_indicators[:, i].astype(int)
    X_test_with_ind[col] = test_indicators[:, i].astype(int)
print("Train shape after adding indicators:", X_train_with_ind.shape)
print("New indicator columns:", ind_col_names)
Columns that received indicators: ['annual_income', 'employment_years', 'debt_to_income']

Indicator array shape (train): (480, 3)
Indicator array shape (test) : (120, 3)

Train shape after adding indicators: (480, 8)
New indicator columns: ['annual_income_was_missing', 'employment_years_was_missing', 'debt_to_income_was_missing']
What just happened?
MissingIndicator, fitted on the training set, identified the three columns with missingness. Setting features='missing-only' means it creates indicators only for columns that actually had missing values during training: no spurious indicator columns for complete features. Because the column set was learned at fit time, the test set receives exactly the same three indicator columns, keeping train and test feature layouts identical.
Step 4 — Full Pipeline: Indicate → Impute → Model
The scenario: Your MLOps team asks for a single serialisable Pipeline object — indicate, impute, scale, classify — that can be fitted once on training data and deployed without any manual preprocessing steps. You'll use FeatureUnion to combine the imputed features with the indicator columns into a single matrix before passing to the model.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, classification_report
# FeatureUnion stacks two parallel transformations side by side:
# Left branch: impute missing values (produces filled feature matrix)
# Right branch: flag which values were missing (produces indicator matrix)
# Result: [imputed features | indicator columns] concatenated horizontally
impute_branch = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
indicator_branch = Pipeline([
('indicator', MissingIndicator(features='missing-only', sparse=False))
])
# FeatureUnion runs both branches in parallel and concatenates the outputs
combined = FeatureUnion([
('imputed', impute_branch),
('indicators', indicator_branch)
])
# Full pipeline
full_pipeline = Pipeline([
('features', combined),
('model', GradientBoostingClassifier(
n_estimators=150, max_depth=3,
random_state=42, learning_rate=0.1))
])
full_pipeline.fit(X_train_r, y_train_r)
# Evaluate
y_proba = full_pipeline.predict_proba(X_test_r)[:, 1]
y_pred = full_pipeline.predict(X_test_r)
test_auc = roc_auc_score(y_test_r, y_proba)
print(f"Test AUC (with indicators): {test_auc:.4f}")
print()
print(classification_report(y_test_r, y_pred, digits=3))
# Compare against pipeline WITHOUT indicators
baseline_pipe = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('model', GradientBoostingClassifier(
n_estimators=150, max_depth=3,
random_state=42, learning_rate=0.1))
])
baseline_pipe.fit(X_train_r, y_train_r)
baseline_auc = roc_auc_score(
y_test_r,
baseline_pipe.predict_proba(X_test_r)[:, 1]
)
print(f"Test AUC (impute only, no indicators): {baseline_auc:.4f}")
print(f"Test AUC improvement from indicators: +{test_auc - baseline_auc:.4f}")
Test AUC (with indicators): 0.8847
precision recall f1-score support
0 0.881 0.923 0.902 91
1 0.793 0.710 0.749 29
accuracy 0.867 120
macro avg 0.837 0.817 0.826 120
weighted avg 0.863 0.867 0.865 120
Test AUC (impute only, no indicators): 0.8412
Test AUC improvement from indicators: +0.0435
What just happened?
FeatureUnion ran the imputation branch and the indicator branch in parallel, then concatenated the outputs into a single feature matrix before the model. The pipeline with indicators scored 0.8847 AUC vs 0.8412 without — a +0.0435 improvement just from preserving the missingness pattern. No new data, no additional features engineered — just the information that was already there, retrieved from the pattern of gaps.
Deciding When Indicators Are Worth Adding
Adding indicator columns is not always beneficial. Here is how to decide quickly:
| Situation | Add indicator? | Reason |
|---|---|---|
| Default rate / outcome rate differs between missing and non-missing rows | Yes | Missingness is informative — the indicator carries real signal |
| Less than 1% of values are missing | Maybe | Very low-variance indicator — may be removed by VarianceThreshold anyway |
| Missing values appear to be data entry errors or system outages | No | Missingness is truly random — indicator adds noise, not signal |
| Missing values appear in the test set for a column that was complete in training | No | With features='missing-only' there is no indicator for that column; MissingIndicator even raises an error by default (error_on_new=True), because the model never saw that pattern in training |
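The first row of that table is the check worth automating. A hypothetical helper (missingness_signal is not a library function) that compares the outcome rate for missing vs non-missing rows:

```python
import numpy as np
import pandas as pd

def missingness_signal(df, feature, target):
    """Outcome rate for rows where `feature` is missing vs present."""
    miss = df.loc[df[feature].isna(), target].mean()
    present = df.loc[df[feature].notna(), target].mean()
    return miss, present

# Toy data: missingness concentrated among positive outcomes
df = pd.DataFrame({
    'employment_years': [np.nan, 4.0, np.nan, 10.0, 2.0, np.nan],
    'default':          [1,      0,   1,      0,    0,   0],
})
miss_rate, present_rate = missingness_signal(df, 'employment_years', 'default')
print(f"missing: {miss_rate:.0%}, present: {present_rate:.0%}")
# → missing: 67%, present: 0%
```

A large gap between the two rates is evidence that an indicator will carry real signal.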
The empty chair analogy
Imputing without an indicator is like filling an empty chair at a meeting with a random person so the room looks full. You've hidden the fact that someone didn't show up — which might have been the most important thing about that meeting. Adding an indicator column is putting a placard on the empty chair that says "Person X was absent." Now your model knows both what the imputed value is and that an absence occurred.
Always run VarianceThreshold on indicators
If only 0.5% of rows are missing in a column, the indicator is 99.5% zeros — near-constant and almost certainly useless. Run VarianceThreshold after creating indicators to automatically remove the low-information ones. This prevents a flood of near-zero binary columns from bloating your feature matrix when you have many columns with sparse missingness.
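As a sketch of that filter, with simulated indicator columns (the ~1% cutoff is an assumption; tune it to your data):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 10_000
indicators = pd.DataFrame({
    'income_was_missing': (rng.random(n) < 0.20).astype(int),   # ~20% flagged
    'zip_was_missing':    (rng.random(n) < 0.005).astype(int),  # ~0.5% flagged
})

# A binary column with flag rate p has variance p * (1 - p);
# dropping variance below 0.01 * 0.99 removes flags rarer than ~1%
selector = VarianceThreshold(threshold=0.01 * (1 - 0.01))
selector.fit(indicators)
kept = indicators.columns[selector.get_support()].tolist()
print("Kept:", kept)  # → Kept: ['income_was_missing']
```

The near-constant zip_was_missing column is dropped; the informative one survives.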
Teacher's Note
The order in your pipeline is not a stylistic choice — it is a correctness requirement. MissingIndicator must run before SimpleImputer because imputation destroys the missingness information that the indicator needs to read. In a FeatureUnion, both branches receive the same raw input, so the imputer and the indicator each see the original data with its NaNs intact. This is why FeatureUnion is the correct pattern here rather than a sequential Pipeline: a sequential pipeline would pass the already-imputed matrix to the indicator, which would see no missing values and produce all-zero indicator columns.
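A minimal demonstration of that failure mode, using toy data (features='all' is used so the empty result is visible; with 'missing-only' the indicator would produce zero columns):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, MissingIndicator

X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])

# WRONG: sequential order; the imputer fills the NaNs before the indicator
# runs, so the indicator sees a complete matrix and every flag is False
wrong_order = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('indicator', MissingIndicator(features='all')),
])
flags = wrong_order.fit_transform(X)
print(flags)
# [[False False]
#  [False False]
#  [False False]]
```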
Practice Questions
1. Missing indicator columns must be created ___ imputation, not after. (one word)
2. Which features setting in sklearn's MissingIndicator creates indicator columns only for features that had missing values during training?
3. Which sklearn class runs two transformers on the same input in parallel and concatenates their outputs horizontally?
Quiz
1. A column has 18% missing values, and rows where it is missing have a default rate of 45% vs 20% for non-missing rows. Adding a missing indicator is justified because:
2. You place SimpleImputer before MissingIndicator in a sequential Pipeline. The indicator columns are all zeros. What went wrong and how do you fix it?
3. Your dataset has 60 columns, 40 of which have less than 0.5% missing values. Adding indicators for all 40 creates near-constant binary columns. What is the recommended approach?
Up Next · Lesson 30
Domain-Driven Features
The features a statistical test will never discover — built from business logic, domain expertise, and the questions your stakeholders actually care about.