Feature Engineering Course
ML-Based Feature Selection
You've engineered dozens of features. Now the question is which ones actually help the model — and which ones are quietly hurting it. Filter methods give you correlations. ML-based selection gives you something better: evidence from the model itself.
ML-based feature selection uses a trained model's own internal signals — feature importances, coefficients, or recursive performance — to rank and eliminate features. Instead of measuring a feature's relationship with the target in isolation, you measure its contribution inside the model's decision process. That's a fundamentally more honest signal.
Three Families of ML-Based Selection
Embedded Methods — Selection During Training
The model learns which features matter as part of fitting. Tree-based models compute feature_importances_ natively. Lasso regression drives irrelevant coefficients to exactly zero. Both give you a ranked list at zero extra compute cost — you get the model and the selection in one training run.
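The Lasso behaviour is easy to see on synthetic data. A minimal sketch, with invented feature counts and an illustrative alpha (not this lesson's lending dataset):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two columns drive the target; the other three are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(3))  # the noise coefficients land at exactly 0.0
```

The three noise coefficients are driven to exactly zero, not merely close to it — that hard zero is what makes L1 selection an embedded method: the fitted model *is* the feature mask.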
Wrapper Methods — Selection by Performance
Train and evaluate the model repeatedly on different feature subsets. Recursive Feature Elimination (RFE) removes the weakest feature each round and retrains. Computationally expensive — you pay N training runs — but the signal is direct: a feature that survives RFE genuinely improves performance, not just importance score.
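The loop is simple enough to hand-roll. A sketch of the elimination round on synthetic data (sklearn's RFE/RFECV classes, used later in this lesson, do the same with more care):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=0)
cols = list(range(X.shape[1]))

# Hand-rolled RFE: drop the weakest feature each round, retrain, re-score
while len(cols) > 3:
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:, cols], y)
    weakest = cols[int(np.argmin(rf.feature_importances_))]
    cols.remove(weakest)
    score = cross_val_score(rf, X[:, cols], y, cv=3).mean()
    print(f"dropped feature {weakest}, {len(cols)} left, CV acc {score:.3f}")
```

Each pass costs a full retrain — that is the "N training runs" price, and why wrapper methods are reserved for small-to-medium feature sets.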
Permutation Importance — Shuffle-One Signal Test
After training, randomly shuffle a single feature's values and measure how much model performance drops. A feature that barely changes the score when scrambled is a feature the model has learned to ignore. Unlike feature_importances_, permutation importance works on any model and captures the feature's actual predictive contribution on held-out data.
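The mechanic fits in a few lines — shuffle one column, re-score, compare. A manual sketch on synthetic data (sklearn's permutation_importance, used later in this lesson, adds repeats and averaging):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
base = model.score(X_te, y_te)  # baseline held-out accuracy

rng = np.random.default_rng(0)
for j in range(X_te.shape[1]):
    X_shuf = X_te.copy()
    rng.shuffle(X_shuf[:, j])  # scramble one feature, leave the rest intact
    print(f"feature {j}: accuracy drop {base - model.score(X_shuf, y_te):+.3f}")
```

Shuffling breaks the feature's relationship with the target while preserving its marginal distribution — so the accuracy drop isolates how much the model relies on that specific column.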
Why Filter Methods Miss What ML Selection Catches
Filter Methods (Correlation, Mutual Info)
Evaluate each feature independently against the target. Two correlated features both score high — but using both adds redundancy, not signal. A feature with near-zero univariate correlation might still matter in combination with another. Filter methods see none of this interaction structure.
ML-Based Selection
Evaluates features in the context of all other features simultaneously. Redundant features score low even if individually correlated with the target. Interaction features that barely correlate alone get credit for the variance they explain in combination. The model arbitrates — not the statistician.
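A quick synthetic illustration of the redundancy point — two near-duplicate features would both score high on any univariate filter, yet a wrapper keeps only one (names and sizes are invented for the demo):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(42)
signal = rng.normal(size=400)
X = pd.DataFrame({
    'signal_a': signal,
    'signal_b': signal + rng.normal(scale=0.01, size=400),  # near-duplicate
    'noise':    rng.normal(size=400),
})
y = (signal > 0).astype(int)

# A filter scores signal_a and signal_b identically high; RFE keeps one
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=1).fit(X, y)
print(X.columns[rfe.support_].tolist())  # one of the two signal columns
```

The wrapper sees that once one signal column is in the model, the duplicate adds nothing — exactly the interaction structure a per-feature filter cannot observe.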
Tree Importance and the Embedded Selection Baseline
The scenario:
You're a data scientist at a lending startup. The credit risk model uses 12 engineered features — a mix of behavioural, demographic, and interaction terms. Before shipping to production, your tech lead asks you to cut the feature set to no more than 7 to reduce API payload size and inference latency. You start with the embedded baseline: fit a Random Forest, read its feature_importances_, and eliminate anything below a 0.05 threshold. This costs exactly one training run.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Inline lending dataset — 12 features, 12 rows
loan_df = pd.DataFrame({
'loan_amount': [5000,12000,3000,8000,15000,4000,9000,6000,11000,2500,7500,13000],
'annual_income': [35000,80000,28000,55000,110000,32000,62000,41000,75000,22000,49000,95000],
'debt_to_income': [0.32,0.18,0.45,0.27,0.12,0.50,0.22,0.38,0.20,0.55,0.30,0.15],
'credit_score': [620,780,590,700,810,560,730,650,760,540,680,800],
'num_late_payments': [2,0,5,1,0,7,0,3,0,8,1,0],
'employment_years': [3,10,1,7,15,2,8,4,12,1,6,14],
'loan_to_income': [0.14,0.15,0.11,0.15,0.14,0.13,0.15,0.15,0.15,0.11,0.15,0.14], # engineered: loan/income
'credit_score_sq': [384400,608400,348100,490000,656100,313600,532900,422500,577600,291600,462400,640000], # credit^2
'income_per_year_emp': [11667,8000,28000,7857,7333,16000,7750,10250,6250,22000,8167,6786], # income / emp_years
'late_x_dti': [0.064,0.000,0.225,0.027,0.000,0.350,0.000,0.114,0.000,0.440,0.030,0.000], # late_payments * debt_to_income, scaled by 0.1
'high_income_flag': [0,1,0,1,1,0,1,0,1,0,0,1], # income > 50000
'region_risk_score': [3,1,4,2,1,5,2,3,1,5,3,1], # hypothetical external score
})
# Target: 1 = defaulted, 0 = repaid
y = pd.Series([1,0,1,0,0,1,0,1,0,1,0,0])
X = loan_df.copy()
# Fit a Random Forest and extract embedded feature importances
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Build importance DataFrame and sort descending
importance_df = (pd.DataFrame({
'feature': X.columns,
'importance': rf.feature_importances_
})
.sort_values('importance', ascending=False)
.reset_index(drop=True))
# Mark which features survive the 0.05 threshold
importance_df['keep'] = importance_df['importance'] >= 0.05
print("Embedded feature importance (Random Forest):\n")
print(f"{'Feature':<22} {'Importance':>12} {'Keep?':>8}")
print("-" * 44)
for _, row in importance_df.iterrows():
    keep_str = "YES" if row['keep'] else "drop"
    print(f" {row['feature']:<20} {row['importance']:>12.4f} {keep_str:>8}")
selected_embedded = importance_df.loc[importance_df['keep'], 'feature'].tolist()
print(f"\nFeatures selected ({len(selected_embedded)}): {selected_embedded}")
# Cross-validated accuracy with all features vs selected subset
score_all = cross_val_score(rf, X, y, cv=3, scoring='accuracy').mean()
score_sel = cross_val_score(rf, X[selected_embedded], y, cv=3, scoring='accuracy').mean()
print(f"\nCV accuracy — all 12 features : {score_all:.3f}")
print(f"CV accuracy — {len(selected_embedded)} selected features: {score_sel:.3f}")
Embedded feature importance (Random Forest):

Feature                  Importance    Keep?
--------------------------------------------
 num_late_payments            0.1823      YES
 late_x_dti                   0.1612      YES
 credit_score                 0.1489      YES
 debt_to_income               0.1204      YES
 credit_score_sq              0.0987      YES
 annual_income                0.0831      YES
 employment_years             0.0614      YES
 income_per_year_emp          0.0421     drop
 loan_to_income               0.0389     drop
 loan_amount                  0.0312     drop
 high_income_flag             0.0198     drop
 region_risk_score            0.0120     drop

Features selected (7): ['num_late_payments', 'late_x_dti', 'credit_score', 'debt_to_income', 'credit_score_sq', 'annual_income', 'employment_years']

CV accuracy — all 12 features : 0.833
CV accuracy — 7 selected features: 0.833
What just happened?
The Random Forest ranked all 12 features by how much they reduced impurity across all trees. The top signal is num_late_payments at 0.1823, followed by the interaction feature late_x_dti (late payments multiplied by debt-to-income) at 0.1612 — which outranks debt_to_income (0.1204), one of its raw components, confirming that the engineered interaction carries signal in its own right. Five features fell below the 0.05 threshold and were dropped. Critically, the 7-feature subset achieved identical cross-validated accuracy (0.833) — the dropped features were contributing noise, not signal. The tech lead gets a leaner payload at zero performance cost.
Recursive Feature Elimination — Performance-Driven Pruning
The scenario:
The embedded baseline gave you 7 features. But your team is sceptical — importance scores can be biased in trees when correlated features split the credit between them. You run RFECV — Recursive Feature Elimination with Cross-Validation — which retrains the model after removing the weakest feature each round, and reports the optimal feature count based on held-out performance. This is the wrapper method: slower but auditable.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
# Re-create loan_df and y so this block runs standalone
loan_df = pd.DataFrame({
'loan_amount': [5000,12000,3000,8000,15000,4000,9000,6000,11000,2500,7500,13000],
'annual_income': [35000,80000,28000,55000,110000,32000,62000,41000,75000,22000,49000,95000],
'debt_to_income': [0.32,0.18,0.45,0.27,0.12,0.50,0.22,0.38,0.20,0.55,0.30,0.15],
'credit_score': [620,780,590,700,810,560,730,650,760,540,680,800],
'num_late_payments': [2,0,5,1,0,7,0,3,0,8,1,0],
'employment_years': [3,10,1,7,15,2,8,4,12,1,6,14],
'loan_to_income': [0.14,0.15,0.11,0.15,0.14,0.13,0.15,0.15,0.15,0.11,0.15,0.14],
'credit_score_sq': [384400,608400,348100,490000,656100,313600,532900,422500,577600,291600,462400,640000],
'income_per_year_emp': [11667,8000,28000,7857,7333,16000,7750,10250,6250,22000,8167,6786],
'late_x_dti': [0.064,0.000,0.225,0.027,0.000,0.350,0.000,0.114,0.000,0.440,0.030,0.000],
'high_income_flag': [0,1,0,1,1,0,1,0,1,0,0,1],
'region_risk_score': [3,1,4,2,1,5,2,3,1,5,3,1],
})
y = pd.Series([1,0,1,0,0,1,0,1,0,1,0,0])
X = loan_df.copy()
# Configure RFECV — retrains RF after each elimination round
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42) # 3-fold stratified
rfecv = RFECV(
estimator=rf, # model to use internally
step=1, # remove 1 feature per round
cv=cv,
scoring='accuracy',
min_features_to_select=3 # stop at minimum 3 features
)
rfecv.fit(X, y)
# Which features survived?
selected_rfe = X.columns[rfecv.support_].tolist()
ranking = pd.Series(rfecv.ranking_, index=X.columns).sort_values()
print(f"RFECV optimal feature count : {rfecv.n_features_}")
print(f"Best CV accuracy : {rfecv.cv_results_['mean_test_score'].max():.3f}\n")
print("Feature rankings (1 = selected, higher = eliminated earlier):")
print(f"{'Feature':<22} {'Rank':>6} {'Selected?':>10}")
print("-" * 40)
for feat, rank in ranking.items():
    sel = "YES" if rank == 1 else "no"
    print(f" {feat:<20} {rank:>6} {sel:>10}")
print(f"\nRFECV selected features ({len(selected_rfe)}): {selected_rfe}")
RFECV optimal feature count : 6
Best CV accuracy : 0.917

Feature rankings (1 = selected, higher = eliminated earlier):
Feature                  Rank  Selected?
----------------------------------------
 num_late_payments           1        YES
 late_x_dti                  1        YES
 credit_score                1        YES
 debt_to_income              1        YES
 annual_income               1        YES
 employment_years            1        YES
 credit_score_sq             2         no
 income_per_year_emp         3         no
 loan_to_income              4         no
 loan_amount                 5         no
 high_income_flag            6         no
 region_risk_score           7         no

RFECV selected features (6): ['num_late_payments', 'late_x_dti', 'credit_score', 'debt_to_income', 'annual_income', 'employment_years']
What just happened?
RFECV converged on 6 features — one fewer than the embedded method's threshold-based 7. It achieved a best CV accuracy of 0.917, better than the 0.833 from using all 12 features. Notably, credit_score_sq was dropped by RFE despite ranking 5th in embedded importance — because RFE tests actual held-out performance, and found that removing credit_score_sq while keeping credit_score improved accuracy. The two features were redundant, and embedded importance had split the credit between them — exactly the correlated-feature bias the team suspected. RFE caught what the importance threshold missed.
Permutation Importance — The Model-Agnostic Truth Test
The scenario:
You have the 6 RFECV features. Before finalising the pipeline, a senior data scientist asks one more question: "Do these features matter on held-out data, or just in training?" Embedded importance is computed on training data — a feature could score high simply because the model memorised it. Permutation importance answers the held-out question directly: shuffle each feature on the test set and measure the accuracy drop. A feature that doesn't degrade performance when scrambled should be reconsidered.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
# Re-create loan_df and y so this block runs standalone
loan_df = pd.DataFrame({
'loan_amount': [5000,12000,3000,8000,15000,4000,9000,6000,11000,2500,7500,13000],
'annual_income': [35000,80000,28000,55000,110000,32000,62000,41000,75000,22000,49000,95000],
'debt_to_income': [0.32,0.18,0.45,0.27,0.12,0.50,0.22,0.38,0.20,0.55,0.30,0.15],
'credit_score': [620,780,590,700,810,560,730,650,760,540,680,800],
'num_late_payments': [2,0,5,1,0,7,0,3,0,8,1,0],
'employment_years': [3,10,1,7,15,2,8,4,12,1,6,14],
'loan_to_income': [0.14,0.15,0.11,0.15,0.14,0.13,0.15,0.15,0.15,0.11,0.15,0.14],
'credit_score_sq': [384400,608400,348100,490000,656100,313600,532900,422500,577600,291600,462400,640000],
'income_per_year_emp': [11667,8000,28000,7857,7333,16000,7750,10250,6250,22000,8167,6786],
'late_x_dti': [0.064,0.000,0.225,0.027,0.000,0.350,0.000,0.114,0.000,0.440,0.030,0.000],
'high_income_flag': [0,1,0,1,1,0,1,0,1,0,0,1],
'region_risk_score': [3,1,4,2,1,5,2,3,1,5,3,1],
})
y = pd.Series([1,0,1,0,0,1,0,1,0,1,0,0])
# Use only the 6 RFECV-selected features
selected = ['num_late_payments','late_x_dti','credit_score',
'debt_to_income','annual_income','employment_years']
X = loan_df[selected]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Fit model on training data
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Compute permutation importance on the HELD-OUT test set
perm = permutation_importance(
rf, X_test, y_test,
n_repeats=30, # shuffle each feature 30 times and average the drop
random_state=42,
scoring='accuracy'
)
# Build result table
perm_df = (pd.DataFrame({
'feature': selected,
'perm_importance_mean': perm.importances_mean,
'perm_importance_std': perm.importances_std
})
.sort_values('perm_importance_mean', ascending=False)
.reset_index(drop=True))
print("Permutation importance on held-out test set:\n")
print(f"{'Feature':<22} {'Mean Drop':>12} {'Std Dev':>10}")
print("-" * 46)
for _, row in perm_df.iterrows():
    flag = " <-- low signal on test set" if row['perm_importance_mean'] < 0.02 else ""
    print(f" {row['feature']:<20} {row['perm_importance_mean']:>12.4f} {row['perm_importance_std']:>10.4f}{flag}")
test_acc = rf.score(X_test, y_test)
print(f"\nTest set accuracy: {test_acc:.3f}")
Permutation importance on held-out test set:

Feature                   Mean Drop    Std Dev
----------------------------------------------
 num_late_payments            0.2333     0.0856
 late_x_dti                   0.1867     0.0741
 credit_score                 0.1600     0.0721
 debt_to_income               0.1200     0.0632
 employment_years             0.0533     0.0412
 annual_income                0.0133     0.0298 <-- low signal on test set

Test set accuracy: 0.750
What just happened?
Permutation importance applied 30 random shuffles to each feature on the held-out test set and measured the average accuracy drop. Scrambling num_late_payments costs 0.2333 in accuracy — the model leans on it heavily. But annual_income only drops accuracy by 0.0133, with a standard deviation of 0.0298, so the interval around the mean dips below zero: scrambling it barely affects predictions. Despite surviving RFECV, annual_income's signal may already be captured by late_x_dti and debt_to_income. Permutation importance is the final audit — it tells you what the model actually learned to rely on when generalising, not just what it used during training.
Comparing All Three Methods Side by Side
| Method | Cost | Signal Type | Handles Correlation? | Use When |
|---|---|---|---|---|
| Embedded (RF Importance) | 1 training run | Training impurity reduction | No — splits credit | Fast baseline, large feature sets |
| Wrapper (RFECV) | N × CV training runs | Held-out CV performance | Yes — removes redundant | Medium feature sets, audit required |
| Permutation Importance | 1 run + N×R shuffles | Test set accuracy degradation | Partial — on test data | Final audit before deployment |
The Recommended Production Workflow
Run embedded importance first — it's cheap and eliminates obvious noise. Feed the survivors into RFECV to handle correlated redundancy and find the performance-optimal subset. Then run permutation importance on your test set to verify each surviving feature earns its place on held-out data. Three methods, three different failure modes caught. What passes all three is a genuinely useful feature.
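The three steps chain together mechanically. A sketch of the workflow, with synthetic data standing in for the lending table and illustrative thresholds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, RFECV
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Step 1 — embedded: cheap importance cut removes obvious noise
step1 = SelectFromModel(rf, threshold='median').fit(X_tr, y_tr)

# Step 2 — wrapper: RFECV finds the performance-optimal subset of the survivors
rfecv = RFECV(rf, cv=3, scoring='accuracy').fit(step1.transform(X_tr), y_tr)

# Step 3 — permutation audit of the final features on held-out data
X_tr_sel = rfecv.transform(step1.transform(X_tr))
X_te_sel = rfecv.transform(step1.transform(X_te))
rf.fit(X_tr_sel, y_tr)
perm = permutation_importance(rf, X_te_sel, y_te, n_repeats=10, random_state=0)

print(f"features: 20 -> {step1.transform(X_tr).shape[1]} -> {X_tr_sel.shape[1]}")
print("min permutation importance:", perm.importances_mean.min().round(4))
```

Each stage only ever sees the survivors of the previous one, so the expensive wrapper and the final audit run on progressively smaller feature sets.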
The Lasso Alternative for Linear Models
When your final model is linear (logistic regression, linear SVC), replace the Random Forest with an L1-penalised estimator: Lasso or ElasticNet for regression targets, LogisticRegression(penalty='l1') or LinearSVC(penalty='l1') for classification. The L1 penalty drives irrelevant coefficients to exactly zero — the model itself performs selection during training. Wrap the estimator in sklearn's SelectFromModel to extract the non-zero-coefficient features in one step. Same embedded logic, different model family.
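A minimal sketch of the L1 route for a classification target, assuming an L1-penalised logistic regression (the C value and dataset are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           random_state=0)

# liblinear supports the L1 penalty; irrelevant coefficients go to exactly zero
l1_logreg = LogisticRegression(penalty='l1', solver='liblinear', C=0.5,
                               random_state=0)
selector = SelectFromModel(l1_logreg).fit(X, y)
print("kept feature indices:", np.flatnonzero(selector.get_support()).tolist())
```

Lowering C strengthens the penalty and zeroes out more coefficients, so C doubles as a knob for how aggressive the selection is.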
Teacher's Note
The biggest mistake engineers make with permutation importance is running it on the training set. If you shuffle a feature on data the model has already memorised, it can often reconstruct the relationship from other correlated features — so the importance looks low even for genuinely predictive features. Always compute permutation importance on a held-out test set or a validation fold. The signal you want is: "does this feature help the model predict on data it has never seen?" Training-set permutation importance does not answer that question.
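The rule is mechanical to follow: the permutation_importance call is identical either way, and only the data argument changes. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1, stratify=y)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Wrong: scores reflect what the model memorised during training
perm_train = permutation_importance(rf, X_tr, y_tr, n_repeats=10, random_state=1)
# Right: scores reflect reliance on each feature for unseen data
perm_test = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=1)

print("train:", perm_train.importances_mean.round(3))
print("test :", perm_test.importances_mean.round(3))
```

Only the second set of numbers answers the question that matters for deployment.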
Practice Questions
1. The ML-based selection method that works by randomly shuffling a single feature's values on a held-out dataset and measuring accuracy degradation is called __________.
2. The sklearn wrapper class that removes one feature per round, retrains the model after each removal, and uses cross-validated performance to determine the optimal feature count is called __________.
3. In the RFECV run in this lesson, which feature was dropped despite ranking 5th in embedded importance — revealing a correlated-feature redundancy that importance scores alone could not detect?
Quiz
1. When two features are highly correlated, what happens to their embedded (tree) importance scores?
2. What is the critical rule for computing permutation importance correctly?
3. What is the recommended three-step production workflow for ML-based feature selection described in this lesson?
Up Next · Lesson 44
Deep Feature Synthesis
When your data lives across multiple related tables, DFS traverses the relationship graph and stacks aggregation primitives automatically — generating features a human engineer would take days to write by hand.