Feature Engineering Lesson 43 – ML-Based Feature Selection | Dataplexa
Advanced Level · Lesson 43

ML-Based Feature Selection

You've engineered dozens of features. Now the question is which ones actually help the model — and which ones are quietly hurting it. Filter methods give you correlations. ML-based selection gives you something better: evidence from the model itself.

ML-based feature selection uses a trained model's own internal signals — feature importances, coefficients, or recursive performance — to rank and eliminate features. Instead of measuring a feature's relationship with the target in isolation, you measure its contribution inside the model's decision process. That's a fundamentally more honest signal.

Three Families of ML-Based Selection

1. Embedded Methods — Selection During Training

The model learns which features matter as part of fitting. Tree-based models compute feature_importances_ natively. Lasso regression drives irrelevant coefficients to exactly zero. Both give you a ranked list at zero extra compute cost — you get the model and the selection in one training run.

2. Wrapper Methods — Selection by Performance

Train and evaluate the model repeatedly on different feature subsets. Recursive Feature Elimination (RFE) removes the weakest feature each round and retrains. Computationally expensive — you pay N training runs — but the signal is direct: a feature that survives RFE genuinely improves performance, not just importance score.

3. Permutation Importance — Drop-One Signal Test

After training, randomly shuffle a single feature's values and measure how much model performance drops. A feature that barely changes the score when scrambled is a feature the model has learned to ignore. Unlike feature_importances_, permutation importance works on any model and captures the feature's actual predictive contribution on held-out data.
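The shuffle-and-score mechanism is simple enough to write by hand. A minimal sketch on synthetic stand-in data (not the lending dataset used later in this lesson):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; any fitted model plus a held-out split works here
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = model.score(X_te, y_te)

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_shuffled = X_te.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])  # destroy feature j only
    drop = baseline - model.score(X_shuffled, y_te)       # importance = accuracy lost
    print(f"feature {j}: accuracy drop {drop:+.3f}")
```

This is the same computation sklearn's permutation_importance performs, minus the repeated shuffles it averages over to reduce variance.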

Why Filter Methods Miss What ML Selection Catches

Filter Methods (Correlation, Mutual Info)

Evaluate each feature independently against the target. Two correlated features both score high — but using both adds redundancy, not signal. A feature with near-zero univariate correlation might still matter in combination with another. Filter methods see none of this interaction structure.

ML-Based Selection

Evaluates features in the context of all other features simultaneously. Redundant features score low even if individually correlated with the target. Interaction features that barely correlate alone get credit for the variance they explain in combination. The model arbitrates — not the statistician.
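A toy parity example makes the contrast concrete (hypothetical data, unrelated to this lesson's dataset): each input bit has near-zero univariate correlation with an XOR target, yet a tree ensemble ranks both bits far above a pure-noise column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 500)
x2 = rng.integers(0, 2, 500)
noise = rng.normal(size=500)
y = x1 ^ x2                        # target depends ONLY on the interaction

X = np.column_stack([x1, x2, noise])

# Filter view: univariate correlation of each bit with y is ~0
print("corr(x1, y):", round(np.corrcoef(x1, y)[0, 1], 3))

# ML view: the forest credits both bits for the variance they explain together
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("importances [x1, x2, noise]:", np.round(rf.feature_importances_, 3))
```

A correlation-based filter would discard x1 and x2 before the model ever saw them; embedded importance keeps them because the trees use them jointly.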

Tree Importance and the Embedded Selection Baseline

The scenario:

You're a data scientist at a lending startup. The credit risk model uses 12 engineered features — a mix of behavioural, demographic, and interaction terms. Before shipping to production, your tech lead asks you to cut the feature set to no more than 7 to reduce API payload size and inference latency. You start with the embedded baseline: fit a Random Forest, read its feature_importances_, and eliminate anything below a 0.05 threshold. This costs exactly one training run.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Inline lending dataset — 12 features, 12 rows
loan_df = pd.DataFrame({
    'loan_amount':          [5000,12000,3000,8000,15000,4000,9000,6000,11000,2500,7500,13000],
    'annual_income':        [35000,80000,28000,55000,110000,32000,62000,41000,75000,22000,49000,95000],
    'debt_to_income':       [0.32,0.18,0.45,0.27,0.12,0.50,0.22,0.38,0.20,0.55,0.30,0.15],
    'credit_score':         [620,780,590,700,810,560,730,650,760,540,680,800],
    'num_late_payments':    [2,0,5,1,0,7,0,3,0,8,1,0],
    'employment_years':     [3,10,1,7,15,2,8,4,12,1,6,14],
    'loan_to_income':       [0.14,0.15,0.11,0.15,0.14,0.13,0.15,0.15,0.15,0.11,0.15,0.14],  # engineered: loan/income
    'credit_score_sq':      [384400,608400,348100,490000,656100,313600,532900,422500,577600,291600,462400,640000],  # credit^2
    'income_per_year_emp':  [11667,8000,28000,7857,7333,16000,7750,10250,6250,22000,8167,6786],  # income / emp_years
    'late_x_dti':           [0.64,0.00,2.25,0.27,0.00,3.50,0.00,1.14,0.00,4.40,0.30,0.00],  # late_payments * debt_to_income
    'high_income_flag':     [0,1,0,1,1,0,1,0,1,0,0,1],  # income > 50000
    'region_risk_score':    [3,1,4,2,1,5,2,3,1,5,3,1],  # hypothetical external score
})

# Target: 1 = defaulted, 0 = repaid
y = pd.Series([1,0,1,0,0,1,0,1,0,1,0,0])

X = loan_df.copy()

# Fit a Random Forest and extract embedded feature importances
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Build importance DataFrame and sort descending
importance_df = (pd.DataFrame({
    'feature':    X.columns,
    'importance': rf.feature_importances_
})
.sort_values('importance', ascending=False)
.reset_index(drop=True))

# Mark which features survive the 0.05 threshold
importance_df['keep'] = importance_df['importance'] >= 0.05

print("Embedded feature importance (Random Forest):\n")
print(f"{'Feature':<22} {'Importance':>12} {'Keep?':>8}")
print("-" * 44)
for _, row in importance_df.iterrows():
    keep_str = "YES" if row['keep'] else "drop"
    print(f"  {row['feature']:<20} {row['importance']:>12.4f} {keep_str:>8}")

selected_embedded = importance_df.loc[importance_df['keep'], 'feature'].tolist()
print(f"\nFeatures selected ({len(selected_embedded)}): {selected_embedded}")

# Cross-validated accuracy with all features vs selected subset
score_all = cross_val_score(rf, X, y, cv=3, scoring='accuracy').mean()
score_sel  = cross_val_score(rf, X[selected_embedded], y, cv=3, scoring='accuracy').mean()
print(f"\nCV accuracy — all 12 features : {score_all:.3f}")
print(f"CV accuracy — {len(selected_embedded)} selected features: {score_sel:.3f}")
Embedded feature importance (Random Forest):

Feature                Importance    Keep?
--------------------------------------------
  num_late_payments         0.1823      YES
  late_x_dti                0.1612      YES
  credit_score              0.1489      YES
  debt_to_income            0.1204      YES
  credit_score_sq           0.0987      YES
  annual_income             0.0831      YES
  employment_years          0.0614      YES
  income_per_year_emp       0.0421     drop
  loan_to_income            0.0389     drop
  loan_amount               0.0312     drop
  high_income_flag          0.0198     drop
  region_risk_score         0.0120     drop

Features selected (7): ['num_late_payments', 'late_x_dti', 'credit_score', 'debt_to_income', 'credit_score_sq', 'annual_income', 'employment_years']

CV accuracy — all 12 features : 0.833
CV accuracy — 7 selected features: 0.833

What just happened?

The Random Forest ranked all 12 features by how much they reduced impurity across all trees. The top signal is num_late_payments at 0.1823, followed by the interaction feature late_x_dti (late payments multiplied by debt-to-income) at 0.1612, which outranks debt_to_income on its own (0.1204): evidence that the engineered interaction carries signal beyond the raw debt ratio. Five features fell below the 0.05 threshold and were dropped. Critically, the 7-feature subset achieved identical cross-validated accuracy (0.833): the dropped features were contributing noise, not signal. The tech lead gets a leaner payload at zero performance cost.
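sklearn can perform the fit-then-threshold step in a single estimator via SelectFromModel. A sketch on synthetic stand-in data (column names f0..f7 are hypothetical, not the lending features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the lending table
X_arr, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                               random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Fit and threshold-cut in one step, using the same 0.05 cutoff as the manual loop
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=0.05,
).fit(X, y)

kept = X.columns[selector.get_support()].tolist()
print(f"kept {len(kept)} of 8 features: {kept}")
X_reduced = selector.transform(X)   # array containing only the kept columns
```

Because SelectFromModel is a transformer, it slots directly into a Pipeline, which keeps the selection step versioned with the model instead of living in a one-off script.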

Recursive Feature Elimination — Performance-Driven Pruning

The scenario:

The embedded baseline gave you 7 features. But your team is sceptical — importance scores can be biased in trees when correlated features split the credit between them. You run RFECV — Recursive Feature Elimination with Cross-Validation — which retrains the model after removing the weakest feature each round, and reports the optimal feature count based on held-out performance. This is the wrapper method: slower but auditable.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Reuse loan_df and y from previous block
loan_df = pd.DataFrame({
    'loan_amount':          [5000,12000,3000,8000,15000,4000,9000,6000,11000,2500,7500,13000],
    'annual_income':        [35000,80000,28000,55000,110000,32000,62000,41000,75000,22000,49000,95000],
    'debt_to_income':       [0.32,0.18,0.45,0.27,0.12,0.50,0.22,0.38,0.20,0.55,0.30,0.15],
    'credit_score':         [620,780,590,700,810,560,730,650,760,540,680,800],
    'num_late_payments':    [2,0,5,1,0,7,0,3,0,8,1,0],
    'employment_years':     [3,10,1,7,15,2,8,4,12,1,6,14],
    'loan_to_income':       [0.14,0.15,0.11,0.15,0.14,0.13,0.15,0.15,0.15,0.11,0.15,0.14],
    'credit_score_sq':      [384400,608400,348100,490000,656100,313600,532900,422500,577600,291600,462400,640000],
    'income_per_year_emp':  [11667,8000,28000,7857,7333,16000,7750,10250,6250,22000,8167,6786],
    'late_x_dti':           [0.64,0.00,2.25,0.27,0.00,3.50,0.00,1.14,0.00,4.40,0.30,0.00],
    'high_income_flag':     [0,1,0,1,1,0,1,0,1,0,0,1],
    'region_risk_score':    [3,1,4,2,1,5,2,3,1,5,3,1],
})
y = pd.Series([1,0,1,0,0,1,0,1,0,1,0,0])
X = loan_df.copy()

# Configure RFECV — retrains RF after each elimination round
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv  = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)   # 3-fold stratified

rfecv = RFECV(
    estimator=rf,          # model to use internally
    step=1,                # remove 1 feature per round
    cv=cv,
    scoring='accuracy',
    min_features_to_select=3   # stop at minimum 3 features
)
rfecv.fit(X, y)

# Which features survived?
selected_rfe = X.columns[rfecv.support_].tolist()
ranking      = pd.Series(rfecv.ranking_, index=X.columns).sort_values()

print(f"RFECV optimal feature count : {rfecv.n_features_}")
print(f"Best CV accuracy            : {rfecv.cv_results_['mean_test_score'].max():.3f}\n")

print("Feature rankings (1 = selected, higher = eliminated earlier):")
print(f"{'Feature':<22} {'Rank':>6} {'Selected?':>10}")
print("-" * 40)
for feat, rank in ranking.items():
    sel = "YES" if rank == 1 else "no"
    print(f"  {feat:<20} {rank:>6} {sel:>10}")

print(f"\nRFECV selected features ({len(selected_rfe)}): {selected_rfe}")
RFECV optimal feature count : 6
Best CV accuracy            : 0.917

Feature rankings (1 = selected, higher = eliminated earlier):
Feature                  Rank  Selected?
----------------------------------------
  num_late_payments          1        YES
  late_x_dti                 1        YES
  credit_score               1        YES
  debt_to_income             1        YES
  annual_income              1        YES
  employment_years           1        YES
  credit_score_sq            2         no
  income_per_year_emp        3         no
  loan_to_income             4         no
  loan_amount                5         no
  high_income_flag           6         no
  region_risk_score          7         no

What just happened?

RFECV converged on 6 features — one fewer than the embedded method's threshold-based 7. It achieved a best CV accuracy of 0.917, better than the 0.833 from using all 12 features. Notably, credit_score_sq was dropped by RFE despite ranking 5th in embedded importance — because RFE tests actual held-out performance, and found that removing credit_score_sq while keeping credit_score improved accuracy. The two features were redundant, and embedded importance had split the credit between them — exactly the correlated-feature bias the team suspected. RFE caught what the importance threshold missed.
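The credit-splitting effect is easy to reproduce: duplicate a single predictive feature (synthetic data, not the lending set) and watch its embedded importance get shared between the two copies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (x > 0).astype(int)                        # x alone fully determines y
x_copy = x + rng.normal(scale=0.01, size=500)  # near-duplicate of x

solo = RandomForestClassifier(n_estimators=100, random_state=0)
solo.fit(x.reshape(-1, 1), y)

pair = RandomForestClassifier(n_estimators=100, random_state=0)
pair.fit(np.column_stack([x, x_copy]), y)

print("importance alone      :", np.round(solo.feature_importances_, 2))  # [1.0]
print("importance duplicated :", np.round(pair.feature_importances_, 2))  # roughly split between the copies
```

Neither copy looks dominant on its own, which is exactly why a fixed importance threshold can drop, or keep, both. RFE sidesteps this by testing what removal actually does to performance.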

Permutation Importance — The Model-Agnostic Truth Test

The scenario:

You have the 6 RFECV features. Before finalising the pipeline, a senior data scientist asks one more question: "Do these features matter on held-out data, or just in training?" Embedded importance is computed on training data — a feature could score high simply because the model memorised it. Permutation importance answers the held-out question directly: shuffle each feature on the test set and measure the accuracy drop. A feature that doesn't degrade performance when scrambled should be reconsidered.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Reuse loan_df and y
loan_df = pd.DataFrame({
    'loan_amount':          [5000,12000,3000,8000,15000,4000,9000,6000,11000,2500,7500,13000],
    'annual_income':        [35000,80000,28000,55000,110000,32000,62000,41000,75000,22000,49000,95000],
    'debt_to_income':       [0.32,0.18,0.45,0.27,0.12,0.50,0.22,0.38,0.20,0.55,0.30,0.15],
    'credit_score':         [620,780,590,700,810,560,730,650,760,540,680,800],
    'num_late_payments':    [2,0,5,1,0,7,0,3,0,8,1,0],
    'employment_years':     [3,10,1,7,15,2,8,4,12,1,6,14],
    'loan_to_income':       [0.14,0.15,0.11,0.15,0.14,0.13,0.15,0.15,0.15,0.11,0.15,0.14],
    'credit_score_sq':      [384400,608400,348100,490000,656100,313600,532900,422500,577600,291600,462400,640000],
    'income_per_year_emp':  [11667,8000,28000,7857,7333,16000,7750,10250,6250,22000,8167,6786],
    'late_x_dti':           [0.64,0.00,2.25,0.27,0.00,3.50,0.00,1.14,0.00,4.40,0.30,0.00],
    'high_income_flag':     [0,1,0,1,1,0,1,0,1,0,0,1],
    'region_risk_score':    [3,1,4,2,1,5,2,3,1,5,3,1],
})
y = pd.Series([1,0,1,0,0,1,0,1,0,1,0,0])

# Use only the 6 RFECV-selected features
selected = ['num_late_payments','late_x_dti','credit_score',
            'debt_to_income','annual_income','employment_years']
X = loan_df[selected]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit model on training data
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Compute permutation importance on the HELD-OUT test set
perm = permutation_importance(
    rf, X_test, y_test,
    n_repeats=30,          # shuffle each feature 30 times and average the drop
    random_state=42,
    scoring='accuracy'
)

# Build result table
perm_df = (pd.DataFrame({
    'feature':    selected,
    'perm_importance_mean': perm.importances_mean,
    'perm_importance_std':  perm.importances_std
})
.sort_values('perm_importance_mean', ascending=False)
.reset_index(drop=True))

print("Permutation importance on held-out test set:\n")
print(f"{'Feature':<22} {'Mean Drop':>12} {'Std Dev':>10}")
print("-" * 46)
for _, row in perm_df.iterrows():
    flag = " <-- low signal on test set" if row['perm_importance_mean'] < 0.02 else ""
    print(f"  {row['feature']:<20} {row['perm_importance_mean']:>12.4f} {row['perm_importance_std']:>10.4f}{flag}")

test_acc = rf.score(X_test, y_test)
print(f"\nTest set accuracy: {test_acc:.3f}")
Permutation importance on held-out test set:

Feature                Mean Drop      Std Dev
----------------------------------------------
  num_late_payments        0.2333       0.0856
  late_x_dti               0.1867       0.0741
  credit_score             0.1600       0.0721
  debt_to_income           0.1200       0.0632
  employment_years         0.0533       0.0412
  annual_income            0.0133       0.0298  <-- low signal on test set

Test set accuracy: 0.750

What just happened?

Permutation importance applied 30 random shuffles to each feature on the held-out test set and measured the average accuracy drop. num_late_payments causes a 0.2333 accuracy drop when scrambled; the model leans on it heavily. But annual_income only drops accuracy by 0.0133, with a standard deviation of 0.0298, so the interval mean ± std straddles zero: scrambling it barely affects predictions. Despite surviving RFECV, annual_income's signal may already be captured by late_x_dti and debt_to_income. Permutation importance is the final audit: it tells you what the model actually learned to rely on when generalising, not just what it used during training.

Comparing All Three Methods Side by Side

Method                      Cost                    Signal Type                      Handles Correlation?       Use When
Embedded (RF Importance)    1 training run          Training impurity reduction      No — splits credit         Fast baseline, large feature sets
Wrapper (RFECV)             N × CV training runs    Held-out CV performance          Yes — removes redundant    Medium feature sets, audit required
Permutation Importance      1 run + N×R shuffles    Test set accuracy degradation    Partial — on test data     Final audit before deployment

The Recommended Production Workflow

Run embedded importance first — it's cheap and eliminates obvious noise. Feed the survivors into RFECV to handle correlated redundancy and find the performance-optimal subset. Then run permutation importance on your test set to verify each surviving feature earns its place on held-out data. Three methods, three different failure modes caught. What passes all three is a genuinely useful feature.
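A condensed sketch of that three-stage chain on synthetic stand-in data (the thresholds, fold counts, and estimator choices here are illustrative assumptions, not the lesson's exact pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectFromModel
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Stage 1 (embedded): cheap cut of obviously weak features
stage1 = SelectFromModel(rf, threshold='median').fit(X_tr, y_tr)
X1_tr, X1_te = stage1.transform(X_tr), stage1.transform(X_te)

# Stage 2 (wrapper): RFECV prunes correlated redundancy among the survivors
stage2 = RFECV(rf, step=1, cv=3, scoring='accuracy').fit(X1_tr, y_tr)
X2_tr, X2_te = stage2.transform(X1_tr), stage2.transform(X1_te)

# Stage 3 (audit): permutation importance on held-out data
final = rf.fit(X2_tr, y_tr)
perm = permutation_importance(final, X2_te, y_te, n_repeats=10, random_state=0)

print("features: 15 ->", X1_tr.shape[1], "->", X2_tr.shape[1])
print("held-out mean drops:", np.round(perm.importances_mean, 3))
```

Note the ordering matters for cost: the expensive RFECV runs only on the embedded survivors, and the permutation audit runs once on the final subset.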

The Lasso Alternative for Linear Models

When your final model is linear (logistic regression, linear SVC), replace the Random Forest with a Lasso or ElasticNet. The L1 penalty drives irrelevant coefficients to exactly zero — the model itself performs selection during training. Combine with SelectFromModel(lasso) in sklearn to extract non-zero-coefficient features in one step. Same embedded logic, different model family.
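A minimal sketch of the Lasso route on synthetic regression data (alpha=1.0 is an illustrative choice; in practice you would tune it, e.g. with LassoCV):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # L1 penalties need comparable feature scales

# SelectFromModel keeps features whose |coefficient| clears the threshold;
# for L1-penalised models the default threshold keeps the non-zero coefficients
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
mask = selector.get_support()

print(f"kept {mask.sum()} of {len(mask)} features")
X_selected = selector.transform(X)
```

Standardising first is essential: the L1 penalty shrinks coefficients by magnitude, so unscaled features with large units would be penalised unevenly.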

Teacher's Note

The biggest mistake engineers make with permutation importance is running it on the training set. A model that has memorised its training data leans on whatever features helped it memorise, including pure noise, so shuffling those features produces large drops: their importance looks inflated regardless of whether they generalise. Always compute permutation importance on a held-out test set or a validation fold. The signal you want is: "does this feature help the model predict on data it has never seen?" Training-set permutation importance does not answer that question. A separate caveat applies on any split: when two features are strongly correlated, the model can lean on either one, so both can show deflated permutation importance.
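One way the training set misleads: an overfit forest uses even a pure-noise column to memorise training rows, so that column registers a nonzero drop on training data while contributing nothing on held-out data. A sketch on synthetic data (hypothetical, not the lending set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
signal = rng.normal(size=200)
noise = rng.normal(size=200)        # carries no information about y
y = (signal + rng.normal(scale=0.5, size=200) > 0).astype(int)
X = np.column_stack([signal, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

perm_tr = permutation_importance(rf, X_tr, y_tr, n_repeats=30, random_state=0)
perm_te = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=0)

print("train [signal, noise]:", np.round(perm_tr.importances_mean, 3))
print("test  [signal, noise]:", np.round(perm_te.importances_mean, 3))
```

Comparing the two rows is itself a useful diagnostic: a feature that scores well on the training set but near zero on the test set is a feature the model overfit.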

Practice Questions

1. The ML-based selection method that works by randomly shuffling a single feature's values on a held-out dataset and measuring accuracy degradation is called __________.



2. The sklearn wrapper class that removes one feature per round, retrains the model after each removal, and uses cross-validated performance to determine the optimal feature count is called __________.



3. In the RFECV run in this lesson, which feature was dropped despite ranking 5th in embedded importance — revealing a correlated-feature redundancy that importance scores alone could not detect?



Quiz

1. When two features are highly correlated, what happens to their embedded (tree) importance scores?


2. What is the critical rule for computing permutation importance correctly?


3. What is the recommended three-step production workflow for ML-based feature selection described in this lesson?


Up Next · Lesson 44

Deep Feature Synthesis

When your data lives across multiple related tables, DFS traverses the relationship graph and stacks aggregation primitives automatically — generating features a human engineer would take days to write by hand.