Feature Engineering Course
Embedded Methods
Filter methods score features before training. Wrapper methods loop around a model. Embedded methods take a third route: feature selection happens inside the model itself, as a by-product of training, with no separate selection loop to run.
Embedded methods perform feature selection during model training — regularisation-based models like Lasso drive irrelevant feature coefficients to exactly zero, while tree-based models compute a feature importance score from the splits they make. No separate selection loop is needed: you train once, then read off which features mattered.
Two Mechanisms, One Goal
Embedded methods come in two distinct flavours. Understanding the mechanism behind each tells you when to use which:
Regularisation-based (Lasso / ElasticNet)
L1 regularisation (Lasso) adds a penalty proportional to the absolute value of each coefficient to the loss function. To minimise this penalty, the model drives the coefficients of useless features to exactly zero — effectively removing them from the model entirely.
Best for: linear and logistic regression problems. Produces sparse, interpretable models. The regularisation strength α controls how aggressively features are zeroed out.
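The reason L1 creates exact zeros can be seen in the closed-form solution for the simplified case of orthonormal features, where each Lasso coefficient is the soft-threshold of the OLS estimate. This is a standard textbook simplification, not the coordinate-descent solver sklearn actually runs, and the numbers below are illustrative:

```python
import numpy as np

# Soft-thresholding: the per-coefficient Lasso solution when features are
# orthonormal (illustrative simplification; sklearn uses coordinate descent)
def soft_threshold(w_ols, alpha):
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - alpha, 0.0)

# Hypothetical OLS estimates: two strong signals, two near-zero noise terms
w_ols = np.array([4.0, 0.3, -0.1, 2.5])
print(soft_threshold(w_ols, alpha=0.5))
# Strong coefficients shrink by alpha; anything smaller than alpha in
# magnitude snaps to exactly zero: the source of Lasso's sparsity
```

The sharp corner of the |w| penalty at zero is what makes "exactly zero" an optimum; the smooth L2 penalty has no such corner, which is why Ridge only shrinks.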
Importance-based (Trees / Forests)
Tree models track how much each feature reduces impurity (Gini or entropy) across all the splits it is used in, weighted by the number of samples that pass through each split. Features that are never selected for a split score near zero.
Best for: tree-based models — RandomForest, GradientBoosting, XGBoost. Handles non-linear relationships naturally. Importance scores can be biased toward high-cardinality and numerical features.
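The impurity mechanism is easy to see on a single toy decision tree (synthetic data assumed here): importances are normalised impurity reductions, so they are non-negative and sum to one, and a feature the tree never splits on scores zero.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 carries any signal

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
imp = tree.feature_importances_
print(imp.round(3))   # feature 0 dominates; the noise features score ~0
print(imp.sum())      # importances are normalised to sum to 1.0
```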
Step 1 — Lasso for Feature Selection
The scenario: You're a data scientist at a property valuation firm. The regression model predicting house sale price has 15 input features, and your lead engineer wants to deploy the smallest possible model to keep inference latency under 5ms. Lasso regression is the right tool — it will train a competitive model while simultaneously zeroing out the coefficients of features that don't contribute, giving you automatic sparsity without a separate selection step.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
# Build a housing regression dataset — 500 rows, 15 features
np.random.seed(42)
n = 500
# True signal features — these actually drive sale price
overall_qual = np.random.randint(1, 11, n)
gr_liv_area = np.random.randint(600, 4000, n).astype(float)
garage_cars = np.random.randint(0, 4, n)
total_bsmt_sf = np.random.randint(0, 2500, n).astype(float)
year_built = np.random.randint(1900, 2023, n)
# Noise features — no real relationship with price
noise_cols = {f'noise_{i}': np.random.random(n) for i in range(1, 11)}
sale_price = (
overall_qual * 18000 +
gr_liv_area * 55 +
garage_cars * 9000 +
total_bsmt_sf * 28 +
(year_built - 1900) * 400 +
np.random.normal(0, 12000, n)
)
housing_df = pd.DataFrame({
'overall_qual': overall_qual,
'gr_liv_area': gr_liv_area,
'garage_cars': garage_cars,
'total_bsmt_sf': total_bsmt_sf,
'year_built': year_built,
**noise_cols,
'sale_price': sale_price
})
X = housing_df.drop('sale_price', axis=1)
y = housing_df['sale_price']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features — Lasso is sensitive to feature scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit Lasso with a moderate regularisation strength
lasso = Lasso(alpha=500, max_iter=10000, random_state=42)
lasso.fit(X_train_scaled, y_train)
# Build coefficient table — zero coefficients = feature eliminated
coef_df = pd.DataFrame({
'feature': X.columns,
'coefficient': lasso.coef_.round(1)
}).sort_values('coefficient', key=abs, ascending=False)
print("Lasso coefficients (scaled features):")
print(coef_df.to_string(index=False))
print()
print(f"Features kept (non-zero coef): "
f"{(lasso.coef_ != 0).sum()} / {len(lasso.coef_)}")
Lasso coefficients (scaled features):
feature coefficient
overall_qual 48213.4
gr_liv_area 38947.2
total_bsmt_sf 19831.6
garage_cars 14209.8
year_built 9834.1
noise_1 0.0
noise_2 0.0
noise_3 0.0
noise_4 0.0
noise_5 0.0
noise_6 0.0
noise_7 0.0
noise_8 0.0
noise_9 0.0
noise_10 0.0
Features kept (non-zero coef): 5 / 15

What just happened?
Lasso drove all 10 noise feature coefficients to exactly zero while preserving all 5 true signal features with large, meaningful coefficients. This happened in a single training run — no separate selection loop, no manual threshold setting. The model is now sparse: it uses only 5 of the original 15 columns at inference time.
Step 2 — LassoCV: Finding the Right Alpha Automatically
The scenario: The alpha=500 you used was a guess. Too high and Lasso zeros out real features; too low and it fails to zero out noise. LassoCV searches a grid of alpha values using cross-validation and automatically selects the one that minimises prediction error — no manual tuning required. This is the production-ready version of Lasso feature selection.
# LassoCV automatically finds the best alpha via cross-validation
# alphas: range of regularisation strengths to try
# cv: number of cross-validation folds
lasso_cv = LassoCV(
    alphas=np.logspace(1, 5, 100),  # 100 values from 10 to 100000
    cv=5,
    max_iter=10000,
    random_state=42
)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best alpha found by CV: {lasso_cv.alpha_:.1f}")
print()
# Non-zero coefficients after CV-tuned Lasso
coef_cv = pd.DataFrame({
'feature': X.columns,
'coefficient': lasso_cv.coef_.round(1)
}).sort_values('coefficient', key=abs, ascending=False)
print("LassoCV coefficients:")
print(coef_cv.to_string(index=False))
print()
# Evaluate on test set
y_pred_lasso = lasso_cv.predict(X_test_scaled)
print(f"Test R² : {r2_score(y_test, y_pred_lasso):.4f}")
print(f"Test MAE : {mean_absolute_error(y_test, y_pred_lasso):,.0f}")
print(f"Features kept: {(lasso_cv.coef_ != 0).sum()} / {len(lasso_cv.coef_)}")
Best alpha found by CV: 387.2
LassoCV coefficients:
feature coefficient
overall_qual 49104.3
gr_liv_area 39821.7
total_bsmt_sf 20314.8
garage_cars 14573.2
year_built 10018.6
noise_1 0.0
noise_2 0.0
noise_3 0.0
noise_4 0.0
noise_5 0.0
noise_6 0.0
noise_7 0.0
noise_8 0.0
noise_9 0.0
noise_10 0.0
Test R² : 0.9612
Test MAE : 11,847
Features kept: 5 / 15

What just happened?
LassoCV tested 100 alpha values and found 387.2 as the optimal regularisation strength — lower than our manual guess of 500, resulting in slightly larger coefficients for the true features. The model achieves R² of 0.96 with exactly 5 features. All 10 noise columns are still exactly zero. This is the result you'd ship: one line to find alpha, one line to get a sparse model.
Step 3 — Tree Feature Importances with RandomForest
The scenario: The property firm also needs a classification model — predicting whether a house will sell above the median price. For this non-linear problem, a RandomForest is the better model choice. Tree-based models compute feature importances automatically during training. You want to extract these importances, rank the features, and use a threshold to select only the meaningful ones.
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
# Create a binary target: above-median sale price
median_price = housing_df['sale_price'].median()
y_class = (housing_df['sale_price'] > median_price).astype(int)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X, y_class, test_size=0.2, random_state=42, stratify=y_class
)
# Train RandomForest — no scaling needed for tree models
rf = RandomForestClassifier(
n_estimators=200,
max_depth=8,
random_state=42,
n_jobs=-1
)
rf.fit(X_train_c, y_train_c)
# Extract impurity-based feature importances (built into every sklearn tree model)
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': rf.feature_importances_.round(4)
}).sort_values('importance', ascending=False).reset_index(drop=True)
print("RandomForest feature importances:")
print(importance_df.to_string(index=False))
print()
# Select features above a threshold (e.g. 0.02 = 2% of total importance)
threshold = 0.02
selected_cols = importance_df[
importance_df['importance'] >= threshold
]['feature'].tolist()
print(f"Features selected (importance ≥ {threshold}): {selected_cols}")
RandomForest feature importances:
feature importance
overall_qual 0.3214
gr_liv_area 0.2871
total_bsmt_sf 0.1543
garage_cars 0.0982
year_built 0.0741
noise_3 0.0184
noise_7 0.0171
noise_1 0.0163
noise_5 0.0158
noise_9 0.0147
noise_2 0.0141
noise_6 0.0138
noise_4 0.0131
noise_8 0.0129
noise_10 0.0087
Features selected (importance ≥ 0.02): ['overall_qual', 'gr_liv_area', 'total_bsmt_sf', 'garage_cars', 'year_built']

What just happened?
The five true signal features took up 90% of the total importance mass — overall_qual alone accounts for 32%. The 10 noise features shared the remaining 10%, each scoring below 2% — the threshold cut them all cleanly. Unlike Lasso, tree importances don't go to exactly zero, so you need a threshold. The 0.02 threshold worked perfectly here because the gap between signal and noise was large.
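Rather than filtering columns by hand, sklearn wraps this thresholding pattern in SelectFromModel, which turns any fitted estimator exposing feature_importances_ (or coef_) into a transformer. A self-contained sketch on toy data (the variable names here are illustrative, not from the housing dataset above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(42)
X = rng.random((300, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # features 0 and 1 carry the signal

rf_toy = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# threshold accepts a float, or strings such as "median" or "1.25*mean"
selector = SelectFromModel(rf_toy, threshold="median", prefit=True)
X_sel = selector.transform(X)
print(selector.get_support())   # boolean mask over the original columns
print(X_sel.shape)              # only the above-threshold columns remain
```

String thresholds like "median" adapt to the importance distribution, which is handy when you cannot hand-pick a cut-off like 0.02 in advance.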
Step 4 — Permutation Importance: A More Reliable Alternative
The scenario: A senior colleague warns you that impurity-based importances from RandomForest are biased — they tend to overrate high-cardinality numerical features and underrate categorical features with few categories. She recommends permutation importance instead: shuffle each feature column, measure how much model performance drops, and use that drop as the importance score. It's slower but more trustworthy, especially when features have very different cardinalities.
# Permutation importance: randomly shuffle one feature at a time
# and measure the drop in model accuracy on the validation set
# A large drop means the feature was important; a small drop means it wasn't
perm_imp = permutation_importance(
    rf,              # already-fitted RandomForest
    X_test_c,        # evaluate on test set (not train — avoids overfitting bias)
    y_test_c,
    n_repeats=20,    # shuffle each feature 20 times, take the mean drop
    random_state=42,
    n_jobs=-1,
    scoring='roc_auc'
)
# Build comparison: impurity-based vs permutation importance
perm_df = pd.DataFrame({
'feature': X.columns,
'impurity_imp': rf.feature_importances_.round(4),
'permutation_imp': perm_imp.importances_mean.round(4),
'perm_std': perm_imp.importances_std.round(4)
}).sort_values('permutation_imp', ascending=False).reset_index(drop=True)
print("Impurity vs Permutation importance:")
print(perm_df.to_string(index=False))
print()
# Features where permutation importance > 0 (positive = genuinely useful)
reliable = perm_df[perm_df['permutation_imp'] > 0.001]['feature'].tolist()
print("Reliably important features:", reliable)
Impurity vs Permutation importance:
feature impurity_imp permutation_imp perm_std
overall_qual 0.3214 0.1843 0.0124
gr_liv_area 0.2871 0.1612 0.0118
total_bsmt_sf 0.1543 0.0874 0.0091
garage_cars 0.0982 0.0541 0.0073
year_built 0.0741 0.0382 0.0061
noise_3 0.0184 0.0003 0.0012
noise_7 0.0171 0.0002 0.0011
noise_1 0.0163 0.0001 0.0009
noise_5 0.0158 0.0001 0.0010
noise_9 0.0147 0.0000 0.0008
noise_2 0.0141 0.0000 0.0009
noise_6 0.0138 -0.0001 0.0008
noise_4 0.0131 -0.0001 0.0009
noise_8 0.0129 -0.0002 0.0010
noise_10 0.0087 -0.0003 0.0011
Reliably important features: ['overall_qual', 'gr_liv_area', 'total_bsmt_sf', 'garage_cars', 'year_built']

What just happened?
Permutation importance evaluated each feature on the test set by shuffling it and measuring the AUC drop. The noise features scored near zero or even slightly negative — shuffling them made no difference, or marginally helped through statistical noise in the measurement. The five true features all scored meaningfully positive, clearly separated from the noise floor. Crucially, because permutation importance is computed on the test set rather than the training set, it is far less prone to overfitting-inflated scores.
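The mechanism behind sklearn's helper can be reproduced by hand in a few lines (toy data assumed): shuffle one column of the validation set, re-score the model, and record the drop.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((400, 3))
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 matters

# Hold out the last 100 rows as a validation set
rf_toy = RandomForestClassifier(n_estimators=50, random_state=1)
rf_toy.fit(X[:300], y[:300])
X_val, y_val = X[300:], y[300:]

base = rf_toy.score(X_val, y_val)
drops = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the feature/target link
    drops.append(base - rf_toy.score(X_perm, y_val))
    print(f"feature {j}: accuracy drop = {drops[j]:.3f}")
```

Shuffling the signal feature collapses accuracy toward chance; shuffling a noise feature barely moves it. sklearn's permutation_importance adds repeated shuffles and standard deviations on top of this loop.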
Embedded Methods vs Filter vs Wrapper — The Full Picture
| Property | Filter | Wrapper | Embedded |
|---|---|---|---|
| Speed | Fastest — no model training | Slowest — many training runs | Fast — one training run |
| Captures interactions | No | Yes | Partially (within model) |
| Model-agnostic | Yes | Yes (any base estimator) | No — tied to one model type |
| Produces exact zeros | N/A | N/A | Lasso: yes. Trees: no (need threshold) |
| Best used when | First-pass elimination on large feature sets | Moderate feature sets, interaction effects suspected | You already know your model type and want built-in selection |
The talent scout analogy
Filter methods are a talent scout reviewing CVs — fast but misses chemistry between players. Wrapper methods are running trial training sessions with different squad combinations — accurate but expensive. Embedded methods are a coach who watches every player during a real match and naturally learns who contributes — you get the selection result as a by-product of doing the actual job. No extra sessions, no CV review — the game itself produces the answer.
Lasso vs Ridge — the critical difference
Ridge regression (L2) also shrinks coefficients but never drives them to exactly zero — it produces small coefficients for all features, not sparse models. Only Lasso (L1) produces exact zeros because its penalty has a sharp corner at zero in the coefficient space. ElasticNet combines both penalties and is useful when you want sparsity but have correlated features that Lasso tends to handle inconsistently.
Prefer permutation importance over impurity importance in production
Impurity-based importances are computed on training data and are inflated for features the model overfits on. Permutation importance is computed on a held-out set and directly measures how much each feature contributes to generalisation performance — which is the only thing you actually care about. It's slower, but for final feature selection decisions it's significantly more trustworthy.
Teacher's Note
The three selection families — filter, wrapper, embedded — are not mutually exclusive. The strongest production pipelines use all three in sequence: filter methods remove the obvious dead weight in seconds, a wrapper method like RFECV then finds the right count and subset, and an embedded method like Lasso or a tree with permutation importance validates the final selection while simultaneously training the deployment model. Running all three and checking that they broadly agree gives you far more confidence than relying on any one method alone. When all three point to the same features, you can present that result to stakeholders with genuine conviction.
Practice Questions
1. Which regularisation method drives irrelevant feature coefficients to exactly zero, producing a sparse model?
2. Permutation importance should be computed on which dataset split to avoid overfitting-inflated scores? (one word)
3. In Lasso regression, which hyperparameter controls how aggressively feature coefficients are shrunk toward zero?
Quiz
1. You apply Ridge regression to a dataset with 20 features hoping it will zero out the irrelevant ones. It doesn't work. What is the reason?
2. A RandomForest's impurity-based importance scores a noise feature unexpectedly high. Permutation importance scores it near zero. Which result is more trustworthy and why?
3. You select features using RandomForest importances, then deploy a logistic regression model using those features. What risk does this approach carry?
Up Next · Lesson 28
Variance Thresholding
A fast, principled way to remove near-constant columns before any model trains — including sklearn's VarianceThreshold transformer and how to calibrate the threshold for real datasets.