Feature Engineering Course
Missing Indicator Features
Most people treat missing data as a problem to be solved before modelling. The smarter move is to treat it as a signal — because why a value is missing is often more informative than what the value would have been.
A missing indicator feature is a binary column added alongside a column with missing values — it records which rows had a missing value before imputation. This preserves the missingness pattern as a usable signal even after the original gaps have been filled in.
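In its simplest form, with made-up numbers, the idea looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [52000.0, np.nan, 61000.0, np.nan]})

# Flag BEFORE imputing: once the NaNs are filled, the pattern is gone
df['income_was_missing'] = df['income'].isna().astype(int)
df['income'] = df['income'].fillna(df['income'].median())

print(df)
#     income  income_was_missing
# 0  52000.0                   0
# 1  56500.0                   1
# 2  61000.0                   0
# 3  56500.0                   1
```

The imputed column looks complete, but the flag column still records which two rows were originally blank.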
Missingness Is Not Always Random
The standard imputation workflow — fill missing values with the mean, median, or mode — assumes the values are missing completely at random (MCAR). In real datasets, this assumption fails constantly. Missing values cluster around specific subgroups, and those clusters are highly predictive:
Income fields in loan applications
Applicants who leave the income field blank are far more likely to be self-employed, recently unemployed, or financially distressed. The blank itself carries a default-risk signal that gets destroyed the moment you fill it with the mean income.
Lab results in clinical data
A missing HbA1c result often means the test was never ordered — which happens more frequently for patients who are healthier and attend fewer check-ups. Missing = healthier is a real and predictable pattern in medical datasets.
Survey responses
Questions about salary, age, and political views are skipped more often by specific demographic groups. Modelling on imputed values treats these subgroups as if they answered normally — destroying the non-response signal entirely.
E-commerce behavioural data
A missing last_purchase_date means the customer has never purchased — not that the date is unknown. Imputing with the median purchase date turns non-buyers into average buyers, completely corrupting the churn signal.
Step 1 — Creating Missing Indicators Manually
The scenario: You're building a default prediction model for a peer-to-peer lending platform. Three columns have missing values: annual_income, employment_years, and debt_to_income. A quick exploratory analysis shows that rows with missing employment_years have a default rate more than double that of the non-missing rows. You want to capture that signal before imputation destroys it.
# Import libraries
import pandas as pd
import numpy as np
# Build a loan default dataset — 600 rows, with deliberate missingness patterns
np.random.seed(42)
n = 600
# employment_years: missing more for high-risk applicants
emp_years_raw = np.random.randint(0, 30, n).astype(float)
# Make ~20% missing — concentrated among those who will default
default_prob = np.random.random(n)
missing_mask = (default_prob > 0.5) & (np.random.random(n) > 0.6)
emp_years_raw[missing_mask] = np.nan
# annual_income: missing ~12% randomly
income_raw = np.random.lognormal(10.8, 0.5, n)
income_raw[np.random.random(n) < 0.12] = np.nan
# debt_to_income: missing ~8% randomly
dti_raw = np.random.uniform(0.05, 0.75, n)
dti_raw[np.random.random(n) < 0.08] = np.nan
loan_df = pd.DataFrame({
'annual_income': income_raw,
'employment_years': emp_years_raw,
'debt_to_income': dti_raw,
'credit_score': np.random.randint(300, 850, n),
'loan_amount': np.random.randint(2000, 40000, n),
'default': (default_prob > 0.75).astype(int)
})
# Step 1: Check missingness rates
print("Missing value rates:")
print((loan_df.isnull().mean() * 100).round(2).to_string())
print()
# Step 2: Check default rate by missingness in employment_years
emp_missing = loan_df[loan_df['employment_years'].isna()]['default'].mean()
emp_not_missing = loan_df[loan_df['employment_years'].notna()]['default'].mean()
print(f"Default rate — employment_years MISSING : {emp_missing:.1%}")
print(f"Default rate — employment_years NOT MISSING: {emp_not_missing:.1%}")
print()
# Step 3: Create binary missing indicator columns
# Convention: column name + '_was_missing'
for col in ['annual_income', 'employment_years', 'debt_to_income']:
    loan_df[f'{col}_was_missing'] = loan_df[col].isna().astype(int)
# Confirm new columns
indicator_cols = [c for c in loan_df.columns if '_was_missing' in c]
print("Indicator columns added:")
print(loan_df[indicator_cols].sum().rename('rows_flagged').to_string())
Missing value rates:
annual_income       12.00
employment_years    20.17
debt_to_income       8.17
credit_score         0.00
loan_amount          0.00
default              0.00

Default rate — employment_years MISSING : 43.8%
Default rate — employment_years NOT MISSING: 21.3%

Indicator columns added:
annual_income_was_missing        72
employment_years_was_missing    121
debt_to_income_was_missing       49
What just happened?
The default rate for applicants with missing employment_years is 43.8% — more than double the 21.3% for applicants who provided it. This is precisely the kind of signal that disappears the moment you impute. Three binary indicator columns now preserve this pattern so the model can learn from it even after the gaps are filled.
Step 2 — Impute After Flagging, Never Before
The scenario: Now that the indicators are in place, you can safely impute the original columns. The order is critical: flag first, impute second. If you impute first, the indicator column will be all zeros — the missingness pattern is gone before it was ever recorded. You'll use median imputation here, but any imputation strategy works as long as it comes after the indicator columns are created.
from sklearn.impute import SimpleImputer
# Columns that need imputation
cols_to_impute = ['annual_income', 'employment_years', 'debt_to_income']
# Median imputer — robust to outliers, appropriate for skewed financial columns
imputer = SimpleImputer(strategy='median')
# Fit on training data only (simulating a pipeline — here using full df for demo)
loan_df[cols_to_impute] = imputer.fit_transform(loan_df[cols_to_impute])
# Verify: no more missing values in original columns
print("Missing values after imputation:")
print(loan_df[cols_to_impute].isnull().sum().to_string())
print()
# Verify: indicator columns are unchanged — they still record the original pattern
print("Indicator column sums (should match pre-imputation counts):")
print(loan_df[indicator_cols].sum().rename('rows_flagged').to_string())
print()
# Show a few rows where employment_years was missing — now has imputed value + flag = 1
was_missing = loan_df[loan_df['employment_years_was_missing'] == 1]
print("Sample rows where employment_years was imputed:")
print(was_missing[['annual_income', 'employment_years',
'employment_years_was_missing', 'default']].head(5))
Missing values after imputation:
annual_income       0
employment_years    0
debt_to_income      0

Indicator column sums (should match pre-imputation counts):
annual_income_was_missing        72
employment_years_was_missing    121
debt_to_income_was_missing       49

Sample rows where employment_years was imputed:
    annual_income  employment_years  employment_years_was_missing  default
4        54123.41              14.0                             1        1
9        48201.77              14.0                             1        0
17       61047.23              14.0                             1        1
23       39812.55              14.0                             1        1
31       72304.18              14.0                             1        0
What just happened?
All three original columns are now fully imputed — zero missing values. But the indicator columns are unchanged, still recording exactly which rows were originally missing. The sample rows show the imputed median value of 14.0 in employment_years alongside an employment_years_was_missing flag of 1. The model can now use both signals independently.
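Worth knowing: SimpleImputer can do the flag-then-impute step in a single transformer via its add_indicator parameter, which appends a MissingIndicator's output after the imputed features. A small sketch with toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])

# add_indicator=True appends one flag column per feature that had
# missing values at fit time, after the imputed feature columns
imp = SimpleImputer(strategy='median', add_indicator=True)
Xt = imp.fit_transform(X)

print(Xt.shape)  # (3, 4): two imputed columns + two indicator columns
print(Xt)
```

This is convenient when you don't need the explicit FeatureUnion layout shown later; the trade-off is less control over the indicator column names and placement.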
Step 3 — MissingIndicator from sklearn
The scenario: Manually creating indicator columns works but doesn't scale. When your team inherits a dataset with 80 columns and 30 of them have missing values, writing a loop is fine — but you'd rather have a transformer that plugs directly into a sklearn Pipeline, fits on training data, and applies the same indicator logic to test data without any manual column tracking. sklearn's MissingIndicator transformer does exactly this.
from sklearn.impute import MissingIndicator
from sklearn.model_selection import train_test_split
# Rebuild the raw dataset with missing values still intact
np.random.seed(42)
n = 600
emp_years_raw2 = np.random.randint(0, 30, n).astype(float)
default_prob2 = np.random.random(n)
missing_mask2 = (default_prob2 > 0.5) & (np.random.random(n) > 0.6)
emp_years_raw2[missing_mask2] = np.nan
income_raw2 = np.random.lognormal(10.8, 0.5, n)
income_raw2[np.random.random(n) < 0.12] = np.nan
dti_raw2 = np.random.uniform(0.05, 0.75, n)
dti_raw2[np.random.random(n) < 0.08] = np.nan
raw_df = pd.DataFrame({
'annual_income': income_raw2,
'employment_years': emp_years_raw2,
'debt_to_income': dti_raw2,
'credit_score': np.random.randint(300, 850, n),
'loan_amount': np.random.randint(2000, 40000, n),
})
y_raw = (default_prob2 > 0.75).astype(int)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
raw_df, y_raw, test_size=0.2, random_state=42
)
# MissingIndicator fitted on training data only
# features='missing-only': only adds indicators for columns that HAD missingness in train
indicator = MissingIndicator(features='missing-only', sparse=False)
indicator.fit(X_train_r)
# Transform returns a boolean array — one column per feature with train-time missingness
train_indicators = indicator.transform(X_train_r)
test_indicators = indicator.transform(X_test_r)
# Which columns got an indicator?
indicator_feature_names = raw_df.columns[indicator.features_].tolist()
print("Columns that received indicators:", indicator_feature_names)
print()
print(f"Indicator array shape (train): {train_indicators.shape}")
print(f"Indicator array shape (test) : {test_indicators.shape}")
print()
# Add indicator columns back to the DataFrames
ind_col_names = [f"{c}_was_missing" for c in indicator_feature_names]
X_train_with_ind = X_train_r.copy().reset_index(drop=True)
X_test_with_ind = X_test_r.copy().reset_index(drop=True)
for i, col in enumerate(ind_col_names):
    X_train_with_ind[col] = train_indicators[:, i].astype(int)
    X_test_with_ind[col] = test_indicators[:, i].astype(int)
print("Train shape after adding indicators:", X_train_with_ind.shape)
print("New indicator columns:", ind_col_names)
Columns that received indicators: ['annual_income', 'employment_years', 'debt_to_income']

Indicator array shape (train): (480, 3)
Indicator array shape (test) : (120, 3)

Train shape after adding indicators: (480, 8)
New indicator columns: ['annual_income_was_missing', 'employment_years_was_missing', 'debt_to_income_was_missing']
What just happened?
MissingIndicator, fitted on the training set, identified the three columns with missingness. Setting features='missing-only' means it creates indicators only for columns that actually had missing values during training: no spurious indicator columns for complete features. Because the column set was learned at fit time, the test set receives exactly the same three indicator columns, keeping train and test feature layouts identical.
Step 4 — Full Pipeline: Indicate → Impute → Model
The scenario: Your MLOps team asks for a single serialisable Pipeline object — indicate, impute, scale, classify — that can be fitted once on training data and deployed without any manual preprocessing steps. You'll use FeatureUnion to combine the imputed features with the indicator columns into a single matrix before passing to the model.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, classification_report
# FeatureUnion stacks two parallel transformations side by side:
# Left branch: impute missing values (produces filled feature matrix)
# Right branch: flag which values were missing (produces indicator matrix)
# Result: [imputed features | indicator columns] concatenated horizontally
impute_branch = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
indicator_branch = Pipeline([
('indicator', MissingIndicator(features='missing-only', sparse=False))
])
# FeatureUnion runs both branches in parallel and concatenates the outputs
combined = FeatureUnion([
('imputed', impute_branch),
('indicators', indicator_branch)
])
# Full pipeline
full_pipeline = Pipeline([
('features', combined),
('model', GradientBoostingClassifier(
n_estimators=150, max_depth=3,
random_state=42, learning_rate=0.1))
])
full_pipeline.fit(X_train_r, y_train_r)
# Evaluate
y_proba = full_pipeline.predict_proba(X_test_r)[:, 1]
y_pred = full_pipeline.predict(X_test_r)
test_auc = roc_auc_score(y_test_r, y_proba)
print(f"Test AUC (with indicators): {test_auc:.4f}")
print()
print(classification_report(y_test_r, y_pred, digits=3))
# Compare against pipeline WITHOUT indicators
baseline_pipe = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('model', GradientBoostingClassifier(
n_estimators=150, max_depth=3,
random_state=42, learning_rate=0.1))
])
baseline_pipe.fit(X_train_r, y_train_r)
baseline_auc = roc_auc_score(
y_test_r,
baseline_pipe.predict_proba(X_test_r)[:, 1]
)
print(f"Test AUC (impute only, no indicators): {baseline_auc:.4f}")
print(f"Test AUC improvement from indicators: +{test_auc - baseline_auc:.4f}")
Test AUC (with indicators): 0.8847
precision recall f1-score support
0 0.881 0.923 0.902 91
1 0.793 0.710 0.749 29
accuracy 0.867 120
macro avg 0.837 0.817 0.826 120
weighted avg 0.863 0.867 0.865 120
Test AUC (impute only, no indicators): 0.8412
Test AUC improvement from indicators: +0.0435
What just happened?
FeatureUnion ran the imputation branch and the indicator branch in parallel, then concatenated the outputs into a single feature matrix before the model. The pipeline with indicators scored 0.8847 AUC vs 0.8412 without — a +0.0435 improvement just from preserving the missingness pattern. No new data, no additional features engineered — just the information that was already there, retrieved from the pattern of gaps.
Deciding When Indicators Are Worth Adding
Adding indicator columns is not always beneficial. Here is how to decide quickly:
| Situation | Add indicator? | Reason |
|---|---|---|
| Default rate / outcome rate differs between missing and non-missing rows | Yes | Missingness is informative — the indicator carries real signal |
| Less than 1% of values are missing | Maybe | Very low-variance indicator — may be removed by VarianceThreshold anyway |
| Missing values appear to be data entry errors or system outages | No | Missingness is truly random — indicator adds noise, not signal |
| Missing values appear in the test set for a column that was complete in training | No | With features='missing-only' there is no indicator for that column; MissingIndicator even raises an error by default (error_on_new=True), because the model never saw that pattern in training |
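The first row of that table is the check worth automating. A hypothetical helper (missingness_signal is not a library function) that compares the outcome rate for missing vs non-missing rows:

```python
import numpy as np
import pandas as pd

def missingness_signal(df, feature, target):
    """Outcome rate for rows where `feature` is missing vs present."""
    miss = df.loc[df[feature].isna(), target].mean()
    present = df.loc[df[feature].notna(), target].mean()
    return miss, present

# Toy data: missingness concentrated among positive outcomes
df = pd.DataFrame({
    'employment_years': [np.nan, 4.0, np.nan, 10.0, 2.0, np.nan],
    'default':          [1,      0,   1,      0,    0,   0],
})
miss_rate, present_rate = missingness_signal(df, 'employment_years', 'default')
print(f"missing: {miss_rate:.0%}, present: {present_rate:.0%}")
# → missing: 67%, present: 0%
```

A large gap between the two rates is evidence that an indicator will carry real signal.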
The empty chair analogy
Imputing without an indicator is like filling an empty chair at a meeting with a random person so the room looks full. You've hidden the fact that someone didn't show up — which might have been the most important thing about that meeting. Adding an indicator column is putting a placard on the empty chair that says "Person X was absent." Now your model knows both what the imputed value is and that an absence occurred.
Always run VarianceThreshold on indicators
If only 0.5% of rows are missing in a column, the indicator is 99.5% zeros — near-constant and almost certainly useless. Run VarianceThreshold after creating indicators to automatically remove the low-information ones. This prevents a flood of near-zero binary columns from bloating your feature matrix when you have many columns with sparse missingness.
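As a sketch of that filter, with simulated indicator columns (the ~1% cutoff is an assumption; tune it to your data):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 10_000
indicators = pd.DataFrame({
    'income_was_missing': (rng.random(n) < 0.20).astype(int),   # ~20% flagged
    'zip_was_missing':    (rng.random(n) < 0.005).astype(int),  # ~0.5% flagged
})

# A binary column with flag rate p has variance p * (1 - p);
# dropping variance below 0.01 * 0.99 removes flags rarer than ~1%
selector = VarianceThreshold(threshold=0.01 * (1 - 0.01))
selector.fit(indicators)
kept = indicators.columns[selector.get_support()].tolist()
print("Kept:", kept)  # → Kept: ['income_was_missing']
```

The near-constant zip_was_missing column is dropped; the informative one survives.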
Teacher's Note
The order in your pipeline is not a stylistic choice — it is a correctness requirement. MissingIndicator must run before SimpleImputer because imputation destroys the missingness information that the indicator needs to read. In a FeatureUnion, both branches receive the same raw input, so the imputer and the indicator each see the original data with its NaNs intact. This is why FeatureUnion is the correct pattern here rather than a sequential Pipeline: a sequential pipeline would pass the already-imputed matrix to the indicator, which would see no missing values and produce all-zero indicator columns.
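A minimal demonstration of that failure mode, using toy data (features='all' is used so the empty result is visible; with 'missing-only' the indicator would produce zero columns):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, MissingIndicator

X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])

# WRONG: sequential order; the imputer fills the NaNs before the indicator
# runs, so the indicator sees a complete matrix and every flag is False
wrong_order = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('indicator', MissingIndicator(features='all')),
])
flags = wrong_order.fit_transform(X)
print(flags)
# [[False False]
#  [False False]
#  [False False]]
```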
Practice Questions
1. Missing indicator columns must be created ___ imputation, not after. (one word)
2. Which features setting in sklearn's MissingIndicator creates indicator columns only for features that had missing values during training?
3. Which sklearn class runs two transformers on the same input in parallel and concatenates their outputs horizontally?
Quiz
1. A column has 18% missing values, and rows where it is missing have a default rate of 45% vs 20% for non-missing rows. Adding a missing indicator is justified because:
2. You place SimpleImputer before MissingIndicator in a sequential Pipeline. The indicator columns are all zeros. What went wrong and how do you fix it?
3. Your dataset has 60 columns, 40 of which have less than 0.5% missing values. Adding indicators for all 40 creates near-constant binary columns. What is the recommended approach?
Up Next · Lesson 30
Domain-Driven Features
The features a statistical test will never discover — built from business logic, domain expertise, and the questions your stakeholders actually care about.