Feature Engineering Lesson 28 – Variance Thresholding | Dataplexa
Intermediate Level · Lesson 28

Variance Thresholding

Before any model trains, before any statistical test runs, the fastest and cheapest thing you can do is remove the columns that barely change. A feature with near-zero variance tells the model almost nothing — and keeping it wastes memory, compute, and (in tree models) candidate splits.

Variance thresholding removes any feature whose variance falls below a defined minimum — covering everything from perfectly constant columns (variance = 0) to near-constant columns where a single value dominates almost all rows. It is the lightest, fastest feature selection step and belongs at the very start of every preprocessing pipeline.

The Signal Problem with Low-Variance Features

Variance measures spread. A column where every row has the same value has zero spread — it holds no information about anything. A column where 98% of rows are 0 and 2% are 1 has a tiny spread — almost no information. Both are useless to a model, but in different ways:

1. Zero-variance columns

Every row has the same value — a constant. A tree model will never split on it. A linear model can't identify its coefficient, because the column is collinear with the intercept. It bloats the feature matrix without contributing anything. These are always safe to drop.

2. Near-zero-variance columns

One value dominates — say 97% of rows are 0. The model has almost no examples of the minority value to learn from. Any split on this column separates a tiny handful of rows from everyone else, which is unreliable signal and a recipe for overfitting on those few rows.

3. The binary feature special case

For a binary (0/1) feature with a proportion p of ones, variance = p(1−p). A feature where 95% of rows are 1 has variance = 0.95 × 0.05 = 0.0475 — and the formula is symmetric, so a feature where 95% of rows are 0 has exactly the same variance. This gives a principled formula for setting the threshold on binary features: to drop features where 95% or more of rows share a value, set threshold = 0.95 × (1 − 0.95) = 0.0475.
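Both failure modes — and the p(1−p) formula — are easy to verify numerically. A quick standalone check (illustrative arrays, not the lesson's dataset):

```python
import numpy as np

# A constant column: zero spread, zero information
constant = np.zeros(1000)
print(np.var(constant))                    # 0.0

# A near-constant column: 98% zeros, 2% ones
near_constant = np.array([0] * 980 + [1] * 20)
print(round(np.var(near_constant), 4))     # 0.0196

# For a 0/1 column, the population variance is exactly p * (1 - p)
p = near_constant.mean()                   # proportion of ones: 0.02
print(round(p * (1 - p), 4))               # 0.0196 — matches np.var above
```

Note that np.var defaults to the population variance (ddof=0), which is what makes the p(1−p) identity exact here.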

Step 1 — VarianceThreshold Basics

The scenario: You've just received a dataset from a client's data warehouse. It has 500 rows and 40 columns. Nobody documented which columns are useful. Before investing any time in modelling, you run a variance threshold pass to identify and remove the dead weight — constant columns, near-constant columns, and anything else that obviously can't contribute signal.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Build a realistic messy dataset — 500 rows, 40 columns
np.random.seed(42)
n = 500

# Construct columns with varying levels of variance
data = {
    # Good features — meaningful variance
    'age':              np.random.randint(18, 80, n),
    'annual_income':    np.random.normal(55000, 18000, n),
    'credit_score':     np.random.randint(300, 850, n),
    'loan_amount':      np.random.randint(5000, 75000, n),
    'tenure_months':    np.random.randint(1, 120, n),

    # Constant column — zero variance
    'system_flag':      np.ones(n, dtype=int),

    # Near-constant — 99% zeros
    'rare_event_flag':  np.random.choice([0, 1], p=[0.99, 0.01], size=n),

    # Near-constant binary — 97% ones
    'default_region':   np.random.choice([0, 1], p=[0.03, 0.97], size=n),

    # Quasi-constant — 95% same value but numerical
    'legacy_code':      np.where(np.random.random(n) > 0.05, 999, np.random.randint(1, 10, n)),
}

# Add 31 more genuinely useful features
for i in range(1, 32):
    data[f'feature_{i}'] = np.random.normal(0, 1, n)

warehouse_df = pd.DataFrame(data)
print(f"Dataset shape: {warehouse_df.shape}")
print()

# Step 1: Fit VarianceThreshold with threshold=0
# This removes ONLY perfectly constant (zero-variance) columns
vt_zero = VarianceThreshold(threshold=0)
vt_zero.fit(warehouse_df)

# Columns removed
dropped_zero = warehouse_df.columns[~vt_zero.get_support()].tolist()
print(f"Zero-variance columns removed (threshold=0): {dropped_zero}")
print()

# Step 2: Fit with threshold=0.05 — catches near-constant columns too
vt_low = VarianceThreshold(threshold=0.05)
vt_low.fit(warehouse_df)

dropped_low = warehouse_df.columns[~vt_low.get_support()].tolist()
print(f"Low-variance columns removed (threshold=0.05): {dropped_low}")
print(f"Shape after removal: {vt_low.transform(warehouse_df).shape}")
Dataset shape: (500, 40)

Zero-variance columns removed (threshold=0): ['system_flag']

Low-variance columns removed (threshold=0.05): ['system_flag', 'rare_event_flag', 'default_region']

Shape after removal: (500, 37)

What just happened?

threshold=0 caught only the perfectly constant system_flag. Raising the threshold to 0.05 also caught rare_event_flag (99% zeros, variance ≈ 0.0099) and default_region (97% ones, variance ≈ 0.0291). Note that the common default of 0.01 would have kept default_region — its variance sits above that cutoff — which is why the threshold here is 0.05. The dataset dropped from 40 to 37 columns in two lines of code.

Step 2 — Setting the Right Threshold for Binary Features

The scenario: Your dataset has many binary (0/1) encoded features from one-hot encoding — plan types, device types, region flags. Some of these are very rare categories: only 3% of customers use a particular plan. You want a principled threshold that drops any binary feature where a single value accounts for more than 90% of rows — regardless of whether 90% are zeros or ones.

# For a binary feature with proportion p of the majority class:
# Variance = p * (1 - p)
# To drop features where majority class >= threshold_pct:
# Set variance threshold = threshold_pct * (1 - threshold_pct)

# Example: drop binary features where one value appears in >= 90% of rows
majority_pct = 0.90
binary_threshold = majority_pct * (1 - majority_pct)
print(f"Binary variance threshold for {majority_pct*100:.0f}% majority: {binary_threshold:.4f}")
print()

# Build a dataset of binary features only
np.random.seed(7)
n = 600

binary_df = pd.DataFrame({
    'plan_basic':     np.random.choice([0,1], p=[0.45, 0.55], size=n),   # 55% — keep
    'plan_premium':   np.random.choice([0,1], p=[0.38, 0.62], size=n),   # 62% — keep
    'plan_trial':     np.random.choice([0,1], p=[0.92, 0.08], size=n),   # 92% zeros — drop
    'region_north':   np.random.choice([0,1], p=[0.41, 0.59], size=n),   # 59% — keep
    'region_micro':   np.random.choice([0,1], p=[0.96, 0.04], size=n),   # 96% zeros — drop
    'device_mobile':  np.random.choice([0,1], p=[0.35, 0.65], size=n),   # 65% — keep
    'device_legacy':  np.random.choice([0,1], p=[0.91, 0.09], size=n),   # 91% zeros — drop
})

# Show actual variances
print("Actual variances:")
print(binary_df.var().round(4).to_string())
print()

# Apply VarianceThreshold with the binary-specific threshold
vt_binary = VarianceThreshold(threshold=binary_threshold)
vt_binary.fit(binary_df)

kept    = binary_df.columns[vt_binary.get_support()].tolist()
dropped = binary_df.columns[~vt_binary.get_support()].tolist()
print(f"Kept   : {kept}")
print(f"Dropped: {dropped}")
Binary variance threshold for 90% majority: 0.0900

Actual variances:
plan_basic      0.2478
plan_premium    0.2364
plan_trial      0.0736
region_north    0.2419
region_micro    0.0384
device_mobile   0.2275
device_legacy   0.0819

Kept   : ['plan_basic', 'plan_premium', 'region_north', 'device_mobile']
Dropped: ['plan_trial', 'region_micro', 'device_legacy']

What just happened?

The formula p × (1−p) gave us a threshold of 0.09 — any binary feature with variance below this has a majority class above 90%. The three rare features (plan_trial at 92% zeros, region_micro at 96%, device_legacy at 91%) were all dropped cleanly. The four balanced features were kept. No arbitrary guessing about the threshold — the formula derives it directly from the business rule.

Step 3 — Diagnosing the Variance Distribution

The scenario: Before applying any threshold, you want to see the full variance landscape of the dataset — not just which columns pass or fail a given threshold, but the entire distribution. This helps you choose a defensible threshold rather than picking one blindly. A variance histogram with a log-scaled x-axis is the standard diagnostic tool.

# Compute variance for every column in the warehouse dataset
variances = warehouse_df.var().sort_values()

# Build a summary report with variance bands
def variance_band(v):
    if v == 0:          return 'constant (0)'
    elif v < 0.01:      return 'near-constant (<0.01)'
    elif v < 0.1:       return 'low (0.01–0.1)'
    elif v < 1.0:       return 'moderate (0.1–1.0)'
    else:               return 'high (>1.0)'

variance_report = pd.DataFrame({
    'feature':  variances.index,
    'variance': variances.values.round(6),
    'band':     [variance_band(v) for v in variances.values]
})

# Print the low-variance end of the report
print("Lowest-variance features:")
print(variance_report.head(10).to_string(index=False))
print()

# Summary count by band
print("Feature count by variance band:")
print(variance_report['band'].value_counts().to_string())
Lowest-variance features:
        feature  variance                   band
    system_flag  0.000000           constant (0)
rare_event_flag  0.009900  near-constant (<0.01)
 default_region  0.029100         low (0.01–0.1)
     feature_12  0.890123     moderate (0.1–1.0)
     feature_27  0.912847     moderate (0.1–1.0)
      feature_3  0.934102     moderate (0.1–1.0)
     feature_19  0.941388     moderate (0.1–1.0)
      feature_7  0.948214     moderate (0.1–1.0)
     feature_31  0.951730     moderate (0.1–1.0)
      feature_5  0.963018     moderate (0.1–1.0)

Feature count by variance band:
high (>1.0)               22
moderate (0.1–1.0)        15
low (0.01–0.1)             1
near-constant (<0.01)      1
constant (0)               1

What just happened?

The variance report revealed a clear two-population structure: one constant column, one near-constant, and one in the low band — all immediate drop candidates — and 37 variable features in the moderate-to-high bands. Notice that legacy_code does not appear at the low end at all: although 95% of its rows share one value, that value is 999 against single-digit alternatives, so its raw variance is enormous and a variance threshold will never catch it. That scale sensitivity is exactly the caveat covered at the end of this lesson. The band summary still makes the threshold decision easy: anything below 0.1 is suspicious, anything below 0.01 is almost certainly useless.
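The scenario above mentioned a variance histogram with a log-scaled x-axis; the log transform is what stops the tiny variances from collapsing invisibly into the first bin. A standalone sketch, using synthetic variances in place of warehouse_df.var() (the matplotlib one-liner for the visual version is shown as a comment):

```python
import numpy as np
import pandas as pd

# Stand-in for warehouse_df.var(): three tiny variances plus 37 healthy
# ones spanning several orders of magnitude
rng = np.random.default_rng(1)
variances = pd.Series(np.concatenate([
    [0.0, 0.0099, 0.0291],
    rng.uniform(0.5, 5000.0, 37),
]))

# Log10-transform the nonzero variances before binning
nonzero = variances[variances > 0]
log_var = np.log10(nonzero)

# Text histogram: one row per bin, one '#' per feature
counts, edges = np.histogram(log_var, bins=8)
for c, lo in zip(counts, edges[:-1]):
    print(f"10^{lo:6.2f}: {'#' * c}")

# With matplotlib, the visual version is a single call:
# plt.hist(np.log10(nonzero), bins=20); plt.xlabel('log10(variance)')
```

The low-variance candidates show up as an isolated cluster on the far left, well separated from the main mass of features — which is what makes the threshold choice defensible rather than arbitrary.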

Step 4 — VarianceThreshold Inside a Pipeline

The scenario: Your team needs the variance threshold step embedded inside a sklearn Pipeline so it runs on training data only and applies the same column mask to test data automatically. This matters because if you compute variances on the full dataset and then split, the variance computation has already seen the test set — a subtle form of data leakage. Wrapping it in a Pipeline eliminates this risk entirely.

from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler
from sklearn.ensemble        import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics         import roc_auc_score

# Create a classification target for the warehouse dataset,
# flipping ~15% of labels so the task isn't trivially separable
np.random.seed(0)
y_warehouse = (
    (warehouse_df['age'] > 50).astype(int)
    + (warehouse_df['credit_score'] < 500).astype(int)
).clip(0, 1)
noise = np.random.random(len(warehouse_df)) < 0.15
y_warehouse = pd.Series(np.where(noise, 1 - y_warehouse, y_warehouse))

X_w = warehouse_df.copy()

X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    X_w, y_warehouse, test_size=0.2, random_state=42, stratify=y_warehouse
)

# Pipeline: VarianceThreshold → StandardScaler → GradientBoosting
pipeline = Pipeline([
    ('var_thresh', VarianceThreshold(threshold=0.05)),  # fitted on train only
    ('scaler',     StandardScaler()),
    ('model',      GradientBoostingClassifier(
                       n_estimators=100, random_state=42))
])

pipeline.fit(X_train_w, y_train_w)

# How many features survived?
n_original = X_train_w.shape[1]
n_selected = pipeline.named_steps['var_thresh'].get_support().sum()
dropped_in_pipe = X_w.columns[
    ~pipeline.named_steps['var_thresh'].get_support()
].tolist()

print(f"Features: {n_original} → {n_selected} after VarianceThreshold")
print(f"Dropped inside pipeline: {dropped_in_pipe}")
print()

# Evaluate
y_proba = pipeline.predict_proba(X_test_w)[:, 1]
auc     = roc_auc_score(y_test_w, y_proba)
print(f"Test AUC: {auc:.4f}")
print()

# Cross-validation score
cv_scores = cross_val_score(pipeline, X_w, y_warehouse,
                             cv=5, scoring='roc_auc')
print(f"5-fold CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
Features: 40 → 37 after VarianceThreshold
Dropped inside pipeline: ['system_flag', 'rare_event_flag', 'default_region']

Test AUC: 0.8934
5-fold CV AUC: 0.8812 ± 0.0241

What just happened?

The pipeline fitted VarianceThreshold on X_train_w only — variances were never computed on the test set. The same 3 columns dropped at training time were automatically excluded from test data at inference time. Cross-validation also respected this: each fold recomputed variances only on its training portion, preventing any leakage across folds.

Threshold Calibration Reference

Goal                                               Threshold formula         Example value
Remove only perfectly constant columns             threshold = 0             0
Remove near-constant numerical columns             threshold = 0.01 to 0.1   0.01 – 0.1
Remove binary features with majority class ≥ 80%   0.80 × 0.20 = 0.16        0.16
Remove binary features with majority class ≥ 90%   0.90 × 0.10 = 0.09        0.09
Remove binary features with majority class ≥ 95%   0.95 × 0.05 = 0.0475      0.0475
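The three binary rows of the table all come from the same one-line formula, which is easy to wrap as a helper (a hypothetical convenience function, not part of sklearn):

```python
def binary_variance_threshold(majority_pct: float) -> float:
    """Variance cutoff that drops binary features whose majority-class
    proportion is at least majority_pct, via p * (1 - p)."""
    return majority_pct * (1 - majority_pct)

# Reproduce the table's binary rows
for pct in (0.80, 0.90, 0.95):
    print(f"{pct:.0%} majority -> threshold {binary_variance_threshold(pct):.4f}")
```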

The weather forecast analogy

A weather station that always reads 20°C is useless for predicting whether it will rain. A weather station that reads 20°C on 99 days out of 100 is almost as useless. Variance thresholding is the act of disconnecting weather stations that never change — before you've spent any time trying to use their readings in a forecast model. You wouldn't notice the improvement by looking at one day; you'd notice it in the quality of your predictions over time.

Watch out: variance thresholds on raw numerical features are scale-sensitive

A salary column ranging from £20,000 to £200,000 has a variance in the millions — it will never be flagged as low-variance even if it contributes no useful signal. A proportion column ranging from 0.0 to 1.0 might have a variance of 0.04 and get dropped even though it's genuinely predictive. For mixed-scale datasets, consider scaling features to a common range first — min-max scaling to [0, 1], for example — before applying the threshold. (Standardising to zero mean and unit variance would defeat the purpose: every non-constant column would end up with variance exactly 1.) Or run the variance report diagnostic first to sense-check what would be dropped.
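A standalone sketch of the scale problem and the min-max fix (synthetic columns and an illustrative threshold):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(3)
df = pd.DataFrame({
    'salary':     rng.uniform(20_000, 200_000, 500),  # huge raw variance
    'proportion': rng.uniform(0.0, 1.0, 500),         # tiny raw variance
    'constant':   np.full(500, 999.0),                # no information at all
})

# Raw variances are incomparable across scales
print(df.var(ddof=0).round(4).to_string())

# Min-max scale to [0, 1] first so one threshold is meaningful for every
# column. (Standardising to unit variance would NOT work here: every
# non-constant column would end up with variance exactly 1.)
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
vt = VarianceThreshold(threshold=0.01).fit(scaled)
print(dict(zip(df.columns, vt.get_support())))  # only 'constant' is dropped
```

After scaling, both informative columns sit near the variance of a uniform [0, 1] variable (about 0.083), while the constant collapses to zero — exactly the separation the threshold needs.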

Teacher's Note

Variance thresholding is one of the few feature selection methods that needs zero knowledge of the target variable — and it is by far the simplest. That makes it a step you can safely run before you've even defined your prediction task, which is useful when exploring a new dataset for the first time. Most other selection methods — filter, wrapper, embedded — require the target to exist and be clean. Because of this, VarianceThreshold belongs at position zero in your preprocessing pipeline: before imputation, before encoding, before scaling, and before anything else. Remove the constants first. Everything else gets easier.

Practice Questions

1. You want to drop binary features where one value appears in 95% or more of rows. Using the formula p × (1−p), what threshold value should you set?



2. After fitting a VarianceThreshold, which method returns a boolean mask of the features that passed the threshold?



3. A perfectly constant column — where every row has the same value — has a variance of what? (one word)



Quiz

1. You place VarianceThreshold inside a sklearn Pipeline before a classifier. How does this prevent data leakage?


2. A dataset has a salary column (£20k–£200k) and a proportion column (0.0–1.0). Applying a threshold of 0.1 drops the proportion column but keeps salary. What is the risk?


3. You receive a brand-new dataset and haven't defined your prediction task yet. Which statement makes variance thresholding uniquely safe to run at this stage?


Up Next · Lesson 29

Missing Indicator Features

Turn the pattern of missingness itself into a feature — because the fact that a value is missing is often more informative than the value would have been.