Feature Engineering Course
Filter Methods
Filter methods score every feature using a statistical test — independently of any model. They're fast, scalable, and the right first weapon when you have hundreds of columns and need to know quickly which ones are pulling their weight.
Filter methods evaluate each feature against the target using a statistical criterion — correlation, chi-squared, mutual information, or ANOVA F-score — and rank or threshold features before any model is trained. Because they are model-agnostic, the selected features can be fed into any downstream algorithm.
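As a quick sketch of the basic workflow (using sklearn's built-in breast-cancer dataset as a stand-in for any numeric feature matrix):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Score all 30 features against the binary target, keep the best 5
X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (569, 5): same rows, top-5 columns only
```

No model was trained to get here, which is exactly why the reduced matrix can feed any downstream algorithm.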
Four Statistical Tests, Four Situations
The right test depends on the data type of your feature and the nature of your target. Using the wrong test gives you misleading scores — a chi-squared test on continuous data is statistically meaningless, and sklearn's chi2 rejects negative values outright. Here's the decision map:
| Test | Feature type | Target type | Captures |
|---|---|---|---|
| Pearson correlation | Numerical | Numerical | Linear relationships only |
| Chi-squared (χ²) | Categorical / non-negative integer | Categorical | Statistical dependence between categories |
| ANOVA F-test | Numerical | Categorical (classification) | Variance between class means vs within classes |
| Mutual information | Any | Any | Any dependency — linear and non-linear |
Mutual information is the most general — it detects any statistical dependency, not just linear ones. Its downside is that it's harder to interpret than a correlation coefficient. In practice, most pipelines use Pearson or ANOVA as the first pass and mutual information as a second check on features that scored unexpectedly low.
Step 1 — ANOVA F-Test for Numerical Features in Classification
The scenario: You're a data scientist at a healthcare analytics firm building a readmission risk classifier. The dataset has ten numerical clinical features and a binary target: readmitted within 30 days (1) or not (0). You need a fast ranking of which numerical features have the strongest relationship with the target class — ANOVA F-test is the right tool here because the target is categorical and the features are continuous.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif
# Build a clinical readmission dataset — 600 rows, 10 features
np.random.seed(42)
n = 600
# Base signal: age and num_medications drive readmission
age = np.random.randint(20, 90, n)
num_medications = np.random.randint(1, 20, n)
readmitted = (
(age > 65).astype(int) +
(num_medications > 10).astype(int) +
np.random.binomial(1, 0.15, n) # small random component
).clip(0, 1)
clinical_df = pd.DataFrame({
'age': age,
'num_medications': num_medications,
'num_procedures': np.random.randint(0, 8, n),
'time_in_hospital': np.random.randint(1, 14, n),
'num_lab_procedures': np.random.randint(1, 70, n),
'num_diagnoses': np.random.randint(1, 16, n),
'admission_type': np.random.randint(1, 8, n),
'discharge_type': np.random.randint(1, 26, n),
'admission_source': np.random.randint(1, 21, n),
'random_noise': np.random.random(n), # pure noise column
'readmitted': readmitted
})
# Separate features and target
X = clinical_df.drop('readmitted', axis=1)
y = clinical_df['readmitted']
# Apply ANOVA F-test — scores each feature by variance between class means
# f_classif returns (F-scores, p-values); higher F = stronger class separation
selector = SelectKBest(score_func=f_classif, k='all') # k='all' keeps everything
selector.fit(X, y) # fit on training data
# Build a ranked results DataFrame
f_scores = pd.DataFrame({
'feature': X.columns,
'F_score': selector.scores_.round(3),
'p_value': selector.pvalues_.round(4)
}).sort_values('F_score', ascending=False).reset_index(drop=True)
print("ANOVA F-test ranking:")
print(f_scores.to_string(index=False))
ANOVA F-test ranking:
feature F_score p_value
age 142.871 0.0000
num_medications 118.334 0.0000
num_diagnoses 3.847 0.0502
time_in_hospital 2.913 0.0882
num_procedures 1.742 0.1871
num_lab_procedures 1.201 0.2734
admission_type 0.984 0.3214
discharge_type 0.741 0.3893
admission_source 0.523 0.4698
random_noise 0.187 0.6653
What just happened?
f_classif computed an F-score for each feature by measuring how much the class means differ relative to within-class variance. age and num_medications dominated — exactly as designed. random_noise sat at the bottom with a near-zero F-score and a p-value of 0.67, confirming it carries no class-separating signal.
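If you want to sanity-check what f_classif computes, it is equivalent to running a one-way ANOVA per feature — scipy's f_oneway on the per-class groups returns the same F-scores. A small synthetic check (not the clinical data above):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)
X[:, 0] += y  # shift class-1 means in the first column only

f_sklearn, p_sklearn = f_classif(X, y)
for j in range(X.shape[1]):
    # one-way ANOVA across the two class groups, one feature at a time
    f_manual, p_manual = f_oneway(X[y == 0, j], X[y == 1, j])
    assert np.isclose(f_manual, f_sklearn[j])
```

The shifted column dominates the ranking, and the two implementations agree feature by feature.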
Step 2 — Chi-Squared Test for Categorical Features
The scenario: You've switched to a telecom churn problem. The dataset contains several encoded categorical features — plan type, payment method, and support channel — already label-encoded as non-negative integers. You want to test whether the distribution of each category differs significantly between churned and non-churned customers. Chi-squared measures exactly this: statistical dependence between a categorical feature and a categorical target.
# Build a churn dataset with categorical features — already integer-encoded
np.random.seed(3)
n = 500
# plan_type strongly separates churned/not-churned
plan_type_base = np.random.choice([0, 1, 2, 3], p=[0.4, 0.3, 0.2, 0.1], size=n)
churned = np.where(plan_type_base == 0, # plan 0 → high churn
np.random.binomial(1, 0.55, n),
np.random.binomial(1, 0.15, n))
telecom_df = pd.DataFrame({
'plan_type': plan_type_base,
'payment_method': np.random.randint(0, 4, n), # moderate signal
'support_channel': np.random.randint(0, 3, n), # weak signal
'region': np.random.randint(0, 5, n), # near-noise
'promo_code': np.random.randint(0, 2, n), # random noise
'churned': churned
})
X_cat = telecom_df.drop('churned', axis=1)
y_cat = telecom_df['churned']
# Chi-squared test — requires non-negative integer features
# SelectKBest with chi2 returns (chi2_scores, p_values)
chi2_selector = SelectKBest(score_func=chi2, k='all')
chi2_selector.fit(X_cat, y_cat)
# Build ranked results
chi2_results = pd.DataFrame({
'feature': X_cat.columns,
'chi2_score': chi2_selector.scores_.round(3),
'p_value': chi2_selector.pvalues_.round(4)
}).sort_values('chi2_score', ascending=False).reset_index(drop=True)
print("Chi-squared test ranking:")
print(chi2_results.to_string(index=False))
print()
# Practical threshold: drop features with p-value > 0.05
significant = chi2_results[chi2_results['p_value'] <= 0.05]['feature'].tolist()
print("Statistically significant features (p ≤ 0.05):", significant)
Chi-squared test ranking:
feature chi2_score p_value
plan_type 47.832 0.0000
payment_method 6.341 0.0963
support_channel 4.128 0.1272
region 3.871 0.5699
promo_code 0.294 0.5878
Statistically significant features (p ≤ 0.05): ['plan_type']
What just happened?
chi2 tested whether the observed distribution of each feature across churn classes differs from what you'd expect if the feature were independent of the target. plan_type scored 47.83 with p ≈ 0 — the category distribution is very different between churned and retained customers. promo_code scored 0.29 — effectively independent of churn.
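Worth knowing: sklearn's chi2 is not the classic contingency-table test — it treats each feature's values as counts and compares the per-class sums against what the class proportions would predict. A tiny reproduction on toy data (not the churn set), which should match sklearn's statistic:

```python
import numpy as np
from sklearn.feature_selection import chi2

X = np.array([[0.], [1.], [2.], [1.], [0.], [2.], [1.], [0.]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])

# Observed: sum of the feature's values within each class
observed = np.array([X[y == c, 0].sum() for c in (0, 1)])
# Expected: total feature sum split by the class proportions
class_prob = np.array([(y == c).mean() for c in (0, 1)])
expected = class_prob * X[:, 0].sum()
stat_manual = ((observed - expected) ** 2 / expected).sum()

stat_sklearn, p = chi2(X, y)
print(stat_manual, stat_sklearn[0])  # the two statistics agree
```

This is why chi2 requires non-negative inputs: negative "counts" make the expected-vs-observed comparison meaningless.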
Step 3 — Mutual Information: Catching Non-Linear Relationships
The scenario: One of your features has a U-shaped relationship with the target — it's predictive, but Pearson correlation would score it near zero because the relationship is non-linear. You suspect this is happening in your clinical dataset. Mutual information doesn't assume any shape of relationship, so it will catch dependencies that correlation misses. You run it alongside the F-test to see whether any features get very different scores between the two methods.
# Add a deliberately non-linear feature to the clinical dataset
# bmi_extreme = 1 for both very low AND very high BMI — U-shaped relationship with risk
np.random.seed(1)
n = len(clinical_df)  # reset n to 600 — it was reassigned to 500 for the churn data
bmi = np.random.normal(27, 6, n).clip(15, 50)
# Both extremes (BMI < 18 or BMI > 40) correlate with readmission
bmi_risk = ((bmi < 18) | (bmi > 40)).astype(int)
# Add to readmitted with some noise
y_with_bmi = (readmitted | bmi_risk).clip(0, 1)
X_with_bmi = clinical_df.drop('readmitted', axis=1).copy()
X_with_bmi['bmi'] = bmi # raw BMI — non-linear relationship with target
# --- Pearson correlation (linear only) ---
pearson = X_with_bmi.corrwith(
pd.Series(y_with_bmi, name='readmitted')
).abs().round(4).rename('pearson_corr')
# --- Mutual information (any relationship) ---
mi_scores = mutual_info_classif(X_with_bmi, y_with_bmi,
random_state=42)
mi_series = pd.Series(mi_scores, index=X_with_bmi.columns,
name='mutual_info').round(4)
# --- Side-by-side comparison ---
comparison = pd.concat([pearson, mi_series], axis=1).sort_values(
'mutual_info', ascending=False
)
print("Pearson vs Mutual Information:")
print(comparison.to_string())
Pearson vs Mutual Information:
pearson_corr mutual_info
age 0.4821 0.1873
num_medications 0.4103 0.1541
bmi 0.0312 0.0984
num_diagnoses 0.0621 0.0413
time_in_hospital 0.0514 0.0381
num_procedures 0.0387 0.0274
num_lab_procedures 0.0341 0.0201
admission_type 0.0274 0.0183
discharge_type 0.0198 0.0141
admission_source 0.0163 0.0097
random_noise 0.0088 0.0021
What just happened?
bmi scored only 0.03 on Pearson correlation — the U-shape means the linear relationship averages near zero. But mutual information scored it at 0.098, nearly as high as num_diagnoses, because MI detected the non-linear dependency. Any feature that scores low on Pearson but meaningfully high on MI is a signal that something non-linear is happening and the feature deserves to stay.
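The same effect is easy to reproduce in isolation — a deterministic U-shaped target that Pearson scores near zero but mutual information does not. A synthetic sketch:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 2000)
y = (np.abs(x) > 2).astype(int)  # 1 at both extremes: a U-shaped dependence

pearson = abs(np.corrcoef(x, y)[0, 1])       # near zero: the extremes cancel
mi = mutual_info_classif(x.reshape(-1, 1), y,
                         random_state=42)[0]  # clearly positive
print(round(pearson, 3), round(mi, 3))
```

Here y is a deterministic function of x, so the true mutual information is substantial even though the linear correlation is essentially zero.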
Step 4 — SelectKBest Inside a Pipeline
The scenario: Your team wants filter-based selection embedded in the sklearn Pipeline so the feature scores are computed on training data only and the same k columns are selected when scoring on test data. You'll use SelectKBest with f_classif to keep the top 4 features, then pass them to a classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X = clinical_df.drop('readmitted', axis=1)
y = clinical_df['readmitted']
# Train/test split — filter step must only see training data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Build pipeline: SelectKBest (top 4 features) → RandomForest
pipeline = Pipeline([
('selector', SelectKBest(score_func=f_classif, k=4)), # fit on train only
('model', RandomForestClassifier(n_estimators=100,
random_state=42))
])
pipeline.fit(X_train, y_train)
# Inspect which 4 features were selected
selected_mask = pipeline.named_steps['selector'].get_support()
selected_features = X.columns[selected_mask].tolist()
print("Selected features:", selected_features)
print()
# Evaluate on test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
Selected features: ['age', 'num_medications', 'num_diagnoses', 'time_in_hospital']
precision recall f1-score support
0 0.812 0.871 0.840 89
1 0.734 0.648 0.688 54
accuracy 0.785 143
macro avg 0.773 0.760 0.764 143
weighted avg 0.782 0.785 0.782 143
What just happened?
SelectKBest fitted the F-scores on X_train only, selected the top 4, and stored that column mask. At test time, the same 4 columns were automatically extracted before being passed to the classifier — no manual column tracking needed. get_support() lets you inspect which columns survived the selection step.
Filter Methods: Strengths and Limits
The exam grader analogy
Filter methods are like a teacher grading individual exam questions in isolation — each question gets a score based purely on its own merits, not how it interacts with the others. This is fast and fair to each feature, but it misses the case where two mediocre individual features combine to become a powerful predictor together. That combination effect is something only wrapper and embedded methods can detect.
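A classic demonstration of that blind spot is XOR: two binary features, each individually independent of the target, jointly determine it completely. A filter score rates both near zero while a simple model that sees both is perfect. A synthetic sketch:

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 1000)
b = rng.integers(0, 2, 1000)
y = a ^ b                        # XOR: fully determined by (a, b) together
X = np.column_stack([a, b])

f_scores, p_values = f_classif(X, y)  # both F-scores are tiny
acc = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
print(f_scores.round(3), acc)    # near-zero scores, yet accuracy 1.0
```

A filter would happily discard both columns; only a method that evaluates feature subsets through a model can see the interaction.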
P-values need careful handling
When you run 100 statistical tests at p ≤ 0.05, you expect about 5 false positives by chance alone. With large feature sets, raw p-value thresholds are optimistic. For serious work, apply a Bonferroni correction (divide threshold by number of tests) or use FDR correction via statsmodels.stats.multitest.multipletests before making final decisions.
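A minimal sketch of both corrections; the p-values here are hypothetical illustrations, not results from the datasets above:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from ten independent filter tests
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                  0.060, 0.074, 0.205, 0.212, 0.360])

raw = pvals <= 0.05                                                # 5 pass
bonferroni = multipletests(pvals, alpha=0.05, method='bonferroni')[0]
fdr = multipletests(pvals, alpha=0.05, method='fdr_bh')[0]
print(raw.sum(), bonferroni.sum(), fdr.sum())  # 5, 1, 2 — raw is optimistic
```

Bonferroni is the strictest (only p ≤ 0.005 survives here); Benjamini-Hochberg FDR sits between the raw threshold and Bonferroni, which is why it's the usual default for feature screening.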
Use filter methods as the first pass, not the only pass
Filter methods are the right opening move: run in seconds, require no model training, and eliminate obvious dead weight before you move to more expensive selection techniques. They are not the final word — always validate your selection with model performance on a held-out set.
Teacher's Note
A very common mistake is setting k in SelectKBest to a fixed number like 10 and forgetting to tune it. The right value of k is itself a hyperparameter — it should be part of your cross-validation grid search, not a hard-coded constant. You can add it to a GridSearchCV like any other hyperparameter: param_grid = {'selector__k': [3, 5, 8, 10]}. This lets the cross-validation loop find the k that actually maximises held-out performance rather than the k that felt reasonable at 9am on a Monday.
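A sketch of that grid search, using sklearn's breast-cancer dataset as a stand-in for your own feature matrix:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('selector', SelectKBest(score_func=f_classif)),  # k tuned below
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])
param_grid = {'selector__k': [3, 5, 8, 10]}  # k exposed as a hyperparameter
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)  # the k that maximises cross-validated accuracy
```

The `step__param` naming convention (`selector__k`) is how Pipeline routes grid-search parameters to the right step.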
Practice Questions
1. Which sklearn filter method is designed for categorical (non-negative integer) features paired with a categorical target?
2. A feature has a U-shaped relationship with the target. Pearson correlation scores it near zero. Which filter method would correctly detect this dependency?
3. After fitting a SelectKBest selector, which method returns a boolean mask of the selected features?
Quiz
1. You have 15 numerical features and a binary classification target. Which filter test is most appropriate?
2. Two individually weak features together form a powerful predictor. Filter methods would likely miss this. What is the core reason?
3. You want to tune the number of features selected by SelectKBest during cross-validation. How do you expose k as a tunable hyperparameter inside a Pipeline?
Up Next · Lesson 26
Wrapper Methods
RFE, forward selection, backward elimination — methods that use a model itself to search for the best feature subset.