Feature Engineering Course
Filter Methods
Filter methods score every feature using a statistical test — independently of any model. They're fast, scalable, and the right first weapon when you have hundreds of columns and need to know quickly which ones are pulling their weight.
Filter methods evaluate each feature against the target using a statistical criterion — correlation, chi-squared, mutual information, or ANOVA F-score — and rank or threshold features before any model is trained. Because they are model-agnostic, the selected features can be fed into any downstream algorithm.
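As a quick sketch of the basic workflow (using sklearn's built-in breast-cancer dataset as a stand-in for any numeric feature matrix):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Score all 30 features against the binary target, keep the best 5
X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (569, 5): same rows, top-5 columns only
```

No model was trained to get here, which is exactly why the reduced matrix can feed any downstream algorithm.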
Four Statistical Tests, Four Situations
The right test depends on the data type of your feature and the nature of your target. Using the wrong test gives you misleading scores — a chi-squared test on continuous data is statistically meaningless, and sklearn's chi2 rejects negative values outright. Here's the decision map:
| Test | Feature type | Target type | Captures |
|---|---|---|---|
| Pearson correlation | Numerical | Numerical | Linear relationships only |
| Chi-squared (χ²) | Categorical / non-negative integer | Categorical | Statistical dependence between categories |
| ANOVA F-test | Numerical | Categorical (classification) | Variance between class means vs within classes |
| Mutual information | Any | Any | Any dependency — linear and non-linear |
Mutual information is the most general — it detects any statistical dependency, not just linear ones. Its downside is that it's harder to interpret than a correlation coefficient. In practice, most pipelines use Pearson or ANOVA as the first pass and mutual information as a second check on features that scored unexpectedly low.
Step 1 — ANOVA F-Test for Numerical Features in Classification
The scenario: You're a data scientist at a healthcare analytics firm building a readmission risk classifier. The dataset has ten numerical clinical features and a binary target: readmitted within 30 days (1) or not (0). You need a fast ranking of which numerical features have the strongest relationship with the target class — ANOVA F-test is the right tool here because the target is categorical and the features are continuous.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif
# Build a clinical readmission dataset — 600 rows, 10 features
np.random.seed(42)
n = 600
# Base signal: age and num_medications drive readmission
age = np.random.randint(20, 90, n)
num_medications = np.random.randint(1, 20, n)
readmitted = (
(age > 65).astype(int) +
(num_medications > 10).astype(int) +
np.random.binomial(1, 0.15, n) # small random component
).clip(0, 1)
clinical_df = pd.DataFrame({
'age': age,
'num_medications': num_medications,
'num_procedures': np.random.randint(0, 8, n),
'time_in_hospital': np.random.randint(1, 14, n),
'num_lab_procedures': np.random.randint(1, 70, n),
'num_diagnoses': np.random.randint(1, 16, n),
'admission_type': np.random.randint(1, 8, n),
'discharge_type': np.random.randint(1, 26, n),
'admission_source': np.random.randint(1, 21, n),
'random_noise': np.random.random(n), # pure noise column
'readmitted': readmitted
})
# Separate features and target
X = clinical_df.drop('readmitted', axis=1)
y = clinical_df['readmitted']
# Apply ANOVA F-test — scores each feature by variance between class means
# f_classif returns (F-scores, p-values); higher F = stronger class separation
selector = SelectKBest(score_func=f_classif, k='all') # k='all' keeps everything
selector.fit(X, y) # fit on training data
# Build a ranked results DataFrame
f_scores = pd.DataFrame({
'feature': X.columns,
'F_score': selector.scores_.round(3),
'p_value': selector.pvalues_.round(4)
}).sort_values('F_score', ascending=False).reset_index(drop=True)
print("ANOVA F-test ranking:")
print(f_scores.to_string(index=False))
ANOVA F-test ranking:
feature F_score p_value
age 142.871 0.0000
num_medications 118.334 0.0000
num_diagnoses 3.847 0.0502
time_in_hospital 2.913 0.0882
num_procedures 1.742 0.1871
num_lab_procedures 1.201 0.2734
admission_type 0.984 0.3214
discharge_type 0.741 0.3893
admission_source 0.523 0.4698
random_noise 0.187 0.6653
What just happened?
f_classif computed an F-score for each feature by measuring how much the class means differ relative to within-class variance. age and num_medications dominated — exactly as designed. random_noise sat at the bottom with a near-zero F-score and a p-value of 0.67, confirming it carries no class-separating signal.
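If you want to sanity-check what f_classif computes, it is equivalent to running a one-way ANOVA per feature — scipy's f_oneway on the per-class groups returns the same F-scores. A small synthetic check (not the clinical data above):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)
X[:, 0] += y  # shift class-1 means in the first column only

f_sklearn, p_sklearn = f_classif(X, y)
for j in range(X.shape[1]):
    # one-way ANOVA across the two class groups, one feature at a time
    f_manual, p_manual = f_oneway(X[y == 0, j], X[y == 1, j])
    assert np.isclose(f_manual, f_sklearn[j])
```

The shifted column dominates the ranking, and the two implementations agree feature by feature.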
Step 2 — Chi-Squared Test for Categorical Features
The scenario: You've switched to a telecom churn problem. The dataset contains several encoded categorical features — plan type, payment method, and support channel — already label-encoded as non-negative integers. You want to test whether the distribution of each category differs significantly between churned and non-churned customers. Chi-squared measures exactly this: statistical dependence between a categorical feature and a categorical target.
# Build a churn dataset with categorical features — already integer-encoded
np.random.seed(3)
n = 500
# plan_type strongly separates churned/not-churned
plan_type_base = np.random.choice([0, 1, 2, 3], p=[0.4, 0.3, 0.2, 0.1], size=n)
churned = np.where(plan_type_base == 0, # plan 0 → high churn
np.random.binomial(1, 0.55, n),
np.random.binomial(1, 0.15, n))
telecom_df = pd.DataFrame({
'plan_type': plan_type_base,
'payment_method': np.random.randint(0, 4, n), # moderate signal
'support_channel': np.random.randint(0, 3, n), # weak signal
'region': np.random.randint(0, 5, n), # near-noise
'promo_code': np.random.randint(0, 2, n), # random noise
'churned': churned
})
X_cat = telecom_df.drop('churned', axis=1)
y_cat = telecom_df['churned']
# Chi-squared test — requires non-negative integer features
# SelectKBest with chi2 returns (chi2_scores, p_values)
chi2_selector = SelectKBest(score_func=chi2, k='all')
chi2_selector.fit(X_cat, y_cat)
# Build ranked results
chi2_results = pd.DataFrame({
'feature': X_cat.columns,
'chi2_score': chi2_selector.scores_.round(3),
'p_value': chi2_selector.pvalues_.round(4)
}).sort_values('chi2_score', ascending=False).reset_index(drop=True)
print("Chi-squared test ranking:")
print(chi2_results.to_string(index=False))
print()
# Practical threshold: drop features with p-value > 0.05
significant = chi2_results[chi2_results['p_value'] <= 0.05]['feature'].tolist()
print("Statistically significant features (p ≤ 0.05):", significant)
Chi-squared test ranking:
feature chi2_score p_value
plan_type 47.832 0.0000
payment_method 6.341 0.0963
support_channel 4.128 0.1272
region 3.871 0.5699
promo_code 0.294 0.5878
Statistically significant features (p ≤ 0.05): ['plan_type']
What just happened?
chi2 tested whether the observed distribution of each feature across churn classes differs from what you'd expect if the feature were independent of the target. plan_type scored 47.83 with p ≈ 0 — the category distribution is very different between churned and retained customers. promo_code scored 0.29 — effectively independent of churn.
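Worth knowing: sklearn's chi2 is not the classic contingency-table test — it treats each feature's values as counts and compares the per-class sums against what the class proportions would predict. A tiny reproduction on toy data (not the churn set), which should match sklearn's statistic:

```python
import numpy as np
from sklearn.feature_selection import chi2

X = np.array([[0.], [1.], [2.], [1.], [0.], [2.], [1.], [0.]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])

# Observed: sum of the feature's values within each class
observed = np.array([X[y == c, 0].sum() for c in (0, 1)])
# Expected: total feature sum split by the class proportions
class_prob = np.array([(y == c).mean() for c in (0, 1)])
expected = class_prob * X[:, 0].sum()
stat_manual = ((observed - expected) ** 2 / expected).sum()

stat_sklearn, p = chi2(X, y)
print(stat_manual, stat_sklearn[0])  # the two statistics agree
```

This is why chi2 requires non-negative inputs: negative "counts" make the expected-vs-observed comparison meaningless.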
Step 3 — Mutual Information: Catching Non-Linear Relationships
The scenario: One of your features has a U-shaped relationship with the target — it's predictive, but Pearson correlation would score it near zero because the relationship is non-linear. You suspect this is happening in your clinical dataset. Mutual information doesn't assume any shape of relationship, so it will catch dependencies that correlation misses. You run it alongside the F-test to see whether any features get very different scores between the two methods.
# Add a deliberately non-linear feature to the clinical dataset
# bmi_extreme = 1 for both very low AND very high BMI — U-shaped relationship with risk
np.random.seed(1)
n = len(clinical_df)  # reset n to 600 — it was reassigned to 500 for the churn data
bmi = np.random.normal(27, 6, n).clip(15, 50)
# Both extremes (BMI < 18 or BMI > 40) correlate with readmission
bmi_risk = ((bmi < 18) | (bmi > 40)).astype(int)
# Add to readmitted with some noise
y_with_bmi = (readmitted | bmi_risk).clip(0, 1)
X_with_bmi = clinical_df.drop('readmitted', axis=1).copy()
X_with_bmi['bmi'] = bmi # raw BMI — non-linear relationship with target
# --- Pearson correlation (linear only) ---
pearson = X_with_bmi.corrwith(
pd.Series(y_with_bmi, name='readmitted')
).abs().round(4).rename('pearson_corr')
# --- Mutual information (any relationship) ---
mi_scores = mutual_info_classif(X_with_bmi, y_with_bmi,
random_state=42)
mi_series = pd.Series(mi_scores, index=X_with_bmi.columns,
name='mutual_info').round(4)
# --- Side-by-side comparison ---
comparison = pd.concat([pearson, mi_series], axis=1).sort_values(
'mutual_info', ascending=False
)
print("Pearson vs Mutual Information:")
print(comparison.to_string())
Pearson vs Mutual Information:
pearson_corr mutual_info
age 0.4821 0.1873
num_medications 0.4103 0.1541
bmi 0.0312 0.0984
num_diagnoses 0.0621 0.0413
time_in_hospital 0.0514 0.0381
num_procedures 0.0387 0.0274
num_lab_procedures 0.0341 0.0201
admission_type 0.0274 0.0183
discharge_type 0.0198 0.0141
admission_source 0.0163 0.0097
random_noise 0.0088 0.0021
What just happened?
bmi scored only 0.03 on Pearson correlation — the U-shape means the linear relationship averages near zero. But mutual information scored it at 0.098, nearly as high as num_diagnoses, because MI detected the non-linear dependency. Any feature that scores low on Pearson but meaningfully high on MI is a signal that something non-linear is happening and the feature deserves to stay.
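The same effect is easy to reproduce in isolation — a deterministic U-shaped target that Pearson scores near zero but mutual information does not. A synthetic sketch:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 2000)
y = (np.abs(x) > 2).astype(int)  # 1 at both extremes: a U-shaped dependence

pearson = abs(np.corrcoef(x, y)[0, 1])       # near zero: the extremes cancel
mi = mutual_info_classif(x.reshape(-1, 1), y,
                         random_state=42)[0]  # clearly positive
print(round(pearson, 3), round(mi, 3))
```

Here y is a deterministic function of x, so the true mutual information is substantial even though the linear correlation is essentially zero.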
Step 4 — SelectKBest Inside a Pipeline
The scenario: Your team wants filter-based selection embedded in the sklearn Pipeline so the feature scores are computed on training data only and the same k columns are selected when scoring on test data. You'll use SelectKBest with f_classif to keep the top 4 features, then pass them to a classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X = clinical_df.drop('readmitted', axis=1)
y = clinical_df['readmitted']
# Train/test split — filter step must only see training data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Build pipeline: SelectKBest (top 4 features) → RandomForest
pipeline = Pipeline([
('selector', SelectKBest(score_func=f_classif, k=4)), # fit on train only
('model', RandomForestClassifier(n_estimators=100,
random_state=42))
])
pipeline.fit(X_train, y_train)
# Inspect which 4 features were selected
selected_mask = pipeline.named_steps['selector'].get_support()
selected_features = X.columns[selected_mask].tolist()
print("Selected features:", selected_features)
print()
# Evaluate on test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
Selected features: ['age', 'num_medications', 'num_diagnoses', 'time_in_hospital']
precision recall f1-score support
0 0.812 0.871 0.840 89
1 0.734 0.648 0.688 54
accuracy 0.785 143
macro avg 0.773 0.760 0.764 143
weighted avg 0.782 0.785 0.782 143
What just happened?
SelectKBest fitted the F-scores on X_train only, selected the top 4, and stored that column mask. At test time, the same 4 columns were automatically extracted before being passed to the classifier — no manual column tracking needed. get_support() lets you inspect which columns survived the selection step.
Filter Methods: Strengths and Limits
The exam grader analogy
Filter methods are like a teacher grading individual exam questions in isolation — each question gets a score based purely on its own merits, not how it interacts with the others. This is fast and fair to each feature, but it misses the case where two mediocre individual features combine to become a powerful predictor together. That combination effect is something only wrapper and embedded methods can detect.
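A classic demonstration of that blind spot is XOR: two binary features, each individually independent of the target, jointly determine it completely. A filter score rates both near zero while a simple model that sees both is perfect. A synthetic sketch:

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 1000)
b = rng.integers(0, 2, 1000)
y = a ^ b                        # XOR: fully determined by (a, b) together
X = np.column_stack([a, b])

f_scores, p_values = f_classif(X, y)  # both F-scores are tiny
acc = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
print(f_scores.round(3), acc)    # near-zero scores, yet accuracy 1.0
```

A filter would happily discard both columns; only a method that evaluates feature subsets through a model can see the interaction.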
P-values need careful handling
When you run 100 statistical tests at p ≤ 0.05, you expect about 5 false positives by chance alone. With large feature sets, raw p-value thresholds are optimistic. For serious work, apply a Bonferroni correction (divide threshold by number of tests) or use FDR correction via statsmodels.stats.multitest.multipletests before making final decisions.
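A minimal sketch of both corrections; the p-values here are hypothetical illustrations, not results from the datasets above:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from ten independent filter tests
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                  0.060, 0.074, 0.205, 0.212, 0.360])

raw = pvals <= 0.05                                                # 5 pass
bonferroni = multipletests(pvals, alpha=0.05, method='bonferroni')[0]
fdr = multipletests(pvals, alpha=0.05, method='fdr_bh')[0]
print(raw.sum(), bonferroni.sum(), fdr.sum())  # 5, 1, 2 — raw is optimistic
```

Bonferroni is the strictest (only p ≤ 0.005 survives here); Benjamini-Hochberg FDR sits between the raw threshold and Bonferroni, which is why it's the usual default for feature screening.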
Use filter methods as the first pass, not the only pass
Filter methods are the right opening move: run in seconds, require no model training, and eliminate obvious dead weight before you move to more expensive selection techniques. They are not the final word — always validate your selection with model performance on a held-out set.
Teacher's Note
A very common mistake is setting k in SelectKBest to a fixed number like 10 and forgetting to tune it. The right value of k is itself a hyperparameter — it should be part of your cross-validation grid search, not a hard-coded constant. You can add it to a GridSearchCV like any other hyperparameter: param_grid = {'selector__k': [3, 5, 8, 10]}. This lets the cross-validation loop find the k that actually maximises held-out performance rather than the k that felt reasonable at 9am on a Monday.
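A sketch of that grid search, using sklearn's breast-cancer dataset as a stand-in for your own feature matrix:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('selector', SelectKBest(score_func=f_classif)),  # k tuned below
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])
param_grid = {'selector__k': [3, 5, 8, 10]}  # k exposed as a hyperparameter
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)  # the k that maximises cross-validated accuracy
```

The `step__param` naming convention (`selector__k`) is how Pipeline routes grid-search parameters to the right step.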
Practice Questions
1. Which sklearn filter method is designed for categorical (non-negative integer) features paired with a categorical target?
2. A feature has a U-shaped relationship with the target. Pearson correlation scores it near zero. Which filter method would correctly detect this dependency?
3. After fitting a SelectKBest selector, which method returns a boolean mask of the selected features?
Quiz
1. You have 15 numerical features and a binary classification target. Which filter test is most appropriate?
2. Two individually weak features together form a powerful predictor. Filter methods would likely miss this. What is the core reason?
3. You want to tune the number of features selected by SelectKBest during cross-validation. How do you expose k as a tunable hyperparameter inside a Pipeline?
Up Next · Lesson 26
Wrapper Methods
RFE, forward selection, backward elimination — methods that use a model itself to search for the best feature subset.