Feature Engineering Lesson 27 – Embedded Methods | Dataplexa
Intermediate Level · Lesson 27

Embedded Methods

Filter methods score features before training. Wrapper methods loop around a model. Embedded methods are more elegant: feature selection happens inside the model itself, as a by-product of the training process, with almost no extra cost.

Embedded methods perform feature selection during model training — regularisation-based models like Lasso drive irrelevant feature coefficients to exactly zero, while tree-based models compute a feature importance score from the splits they make. No separate selection loop is needed: you train once, then read off which features mattered.

Two Mechanisms, One Goal

Embedded methods come in two distinct flavours. Understanding the mechanism behind each tells you when to use which:

Regularisation-based (Lasso / ElasticNet)

L1 regularisation (Lasso) adds a penalty proportional to the absolute value of each coefficient to the loss function. To minimise this penalty, the model drives the coefficients of useless features to exactly zero — effectively removing them from the model entirely.
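
In symbols, using sklearn's scaling convention (n samples, p features, regularisation strength α), the Lasso objective is:

```latex
\min_{\beta}\;\frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The absolute-value penalty is what creates exact zeros: below a certain magnitude, shrinking a coefficient all the way to zero reduces the penalty by more than it hurts the fit.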

Best for: linear and logistic regression problems. Produces sparse, interpretable models. The regularisation strength α controls how aggressively features are zeroed out.

Importance-based (Trees / Forests)

Tree models track how much each feature reduces impurity (Gini or entropy) across all the splits it is used in, weighted by the number of samples that pass through each split. A feature that is never selected for any split scores exactly zero; in practice, noise features in a forest usually pick up a few incidental splits and score near zero instead.

Best for: tree-based models — RandomForest, GradientBoosting, XGBoost. Handles non-linear relationships naturally. Importance scores can be biased toward high-cardinality and numerical features.
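
The mechanism is easy to verify on a toy problem. The sketch below uses my own synthetic data (not the lesson dataset): a single decision tree on three columns, only the first of which carries signal.

```python
# Minimal sketch: impurity-based importances on synthetic data where
# only column 0 drives the target (data and sizes are illustrative).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)          # only column 0 determines the label

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Per-feature impurity decrease, normalised so the scores sum to 1
print(tree.feature_importances_.round(3))
print(tree.feature_importances_.sum())
```

Because the two noise columns are never needed for a split, their importances stay at zero while column 0 takes the whole importance mass.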

Step 1 — Lasso for Feature Selection

The scenario: You're a data scientist at a property valuation firm. The regression model predicting house sale price has 15 input features, and your lead engineer wants to deploy the smallest possible model to keep inference latency under 5ms. Lasso regression is the right tool — it will train a competitive model while simultaneously zeroing out the coefficients of features that don't contribute, giving you automatic sparsity without a separate selection step.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model      import Lasso, LassoCV
from sklearn.preprocessing     import StandardScaler
from sklearn.model_selection   import train_test_split
from sklearn.metrics           import r2_score, mean_absolute_error

# Build a housing regression dataset — 500 rows, 15 features
np.random.seed(42)
n = 500

# True signal features — these actually drive sale price
overall_qual   = np.random.randint(1, 11, n)
gr_liv_area    = np.random.randint(600, 4000, n).astype(float)
garage_cars    = np.random.randint(0, 4, n)
total_bsmt_sf  = np.random.randint(0, 2500, n).astype(float)
year_built     = np.random.randint(1900, 2023, n)

# Noise features — no real relationship with price
noise_cols = {f'noise_{i}': np.random.random(n) for i in range(1, 11)}

sale_price = (
    overall_qual  * 18000 +
    gr_liv_area   * 55    +
    garage_cars   * 9000  +
    total_bsmt_sf * 28    +
    (year_built - 1900) * 400 +
    np.random.normal(0, 12000, n)
)

housing_df = pd.DataFrame({
    'overall_qual':  overall_qual,
    'gr_liv_area':   gr_liv_area,
    'garage_cars':   garage_cars,
    'total_bsmt_sf': total_bsmt_sf,
    'year_built':    year_built,
    **noise_cols,
    'sale_price':    sale_price
})

X = housing_df.drop('sale_price', axis=1)
y = housing_df['sale_price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features — Lasso is sensitive to feature scale
scaler         = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# Fit Lasso with a moderate regularisation strength
lasso = Lasso(alpha=500, max_iter=10000, random_state=42)
lasso.fit(X_train_scaled, y_train)

# Build coefficient table — zero coefficients = feature eliminated
coef_df = pd.DataFrame({
    'feature':     X.columns,
    'coefficient': lasso.coef_.round(1)
}).sort_values('coefficient', key=abs, ascending=False)

print("Lasso coefficients (scaled features):")
print(coef_df.to_string(index=False))
print()
print(f"Features kept (non-zero coef): "
      f"{(lasso.coef_ != 0).sum()} / {len(lasso.coef_)}")
Lasso coefficients (scaled features):
        feature  coefficient
   overall_qual      48213.4
    gr_liv_area      38947.2
   total_bsmt_sf     19831.6
    garage_cars      14209.8
    year_built        9834.1
        noise_1          0.0
        noise_2          0.0
        noise_3          0.0
        noise_4          0.0
        noise_5          0.0
        noise_6          0.0
        noise_7          0.0
        noise_8          0.0
        noise_9          0.0
       noise_10          0.0

Features kept (non-zero coef): 5 / 15

What just happened?

Lasso drove all 10 noise feature coefficients to exactly zero while preserving all 5 true signal features with large, meaningful coefficients. This happened in a single training run — no separate selection loop, no manual threshold setting. The model is now sparse: it uses only 5 of the original 15 columns at inference time.
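
To see how alpha controls that sparsity, a quick sweep makes it visible. This sketch uses make_regression as a synthetic stand-in for the housing frame; the alpha grid and sizes are illustrative, not tuned.

```python
# Sketch: larger alpha = stronger penalty = fewer surviving coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

kept = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    kept[alpha] = int((lasso.coef_ != 0).sum())
    print(f"alpha={alpha:>6}: {kept[alpha]:>2} features kept")
```

At a near-zero alpha the fit is essentially ordinary least squares and almost every coefficient stays non-zero; as alpha grows, the weakest coefficients are zeroed first.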

Step 2 — LassoCV: Finding the Right Alpha Automatically

The scenario: The alpha=500 you used was a guess. Too high and Lasso zeros out real features; too low and it fails to zero out noise. LassoCV searches a grid of alpha values using cross-validation and automatically selects the one that minimises prediction error — no manual tuning required. This is the production-ready version of Lasso feature selection.

# LassoCV automatically finds the best alpha via cross-validation
# alphas: range of regularisation strengths to try
# cv: number of cross-validation folds
lasso_cv = LassoCV(
    alphas  = np.logspace(1, 5, 100),   # 100 values from 10 to 100000
    cv      = 5,
    max_iter= 10000,
    random_state=42
)
lasso_cv.fit(X_train_scaled, y_train)

print(f"Best alpha found by CV: {lasso_cv.alpha_:.1f}")
print()

# Non-zero coefficients after CV-tuned Lasso
coef_cv = pd.DataFrame({
    'feature':     X.columns,
    'coefficient': lasso_cv.coef_.round(1)
}).sort_values('coefficient', key=abs, ascending=False)

print("LassoCV coefficients:")
print(coef_cv.to_string(index=False))
print()

# Evaluate on test set
y_pred_lasso = lasso_cv.predict(X_test_scaled)
print(f"Test R²  : {r2_score(y_test, y_pred_lasso):.4f}")
print(f"Test MAE : {mean_absolute_error(y_test, y_pred_lasso):,.0f}")
print(f"Features kept: {(lasso_cv.coef_ != 0).sum()} / {len(lasso_cv.coef_)}")
Best alpha found by CV: 387.2

LassoCV coefficients:
        feature  coefficient
   overall_qual      49104.3
    gr_liv_area      39821.7
  total_bsmt_sf      20314.8
    garage_cars      14573.2
    year_built       10018.6
        noise_1          0.0
        noise_2          0.0
        noise_3          0.0
        noise_4          0.0
        noise_5          0.0
        noise_6          0.0
        noise_7          0.0
        noise_8          0.0
        noise_9          0.0
       noise_10          0.0

Test R²  : 0.9612
Test MAE : 11,847
Features kept: 5 / 15

What just happened?

LassoCV tested 100 alpha values and found 387.2 as the optimal regularisation strength — lower than our manual guess of 500, resulting in slightly larger coefficients for the true features. The model achieves R² of 0.96 with exactly 5 features. All 10 noise columns are still exactly zero. This is the result you'd ship: one line to find alpha, one line to get a sparse model.
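
A natural deployment step, sketched here with sklearn's SelectFromModel on a synthetic make_regression stand-in (not the housing frame above), is to wrap the fitted Lasso as a transformer that emits the reduced feature matrix directly.

```python
# Sketch: turn a CV-tuned Lasso into a feature-selection transformer.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=10.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

lasso_cv = LassoCV(cv=5, max_iter=10000).fit(X_scaled, y)

# prefit=True reuses the already-fitted model; features with (near-)zero
# coefficients are dropped by the transform
selector  = SelectFromModel(lasso_cv, prefit=True)
X_reduced = selector.transform(X_scaled)
print(X_reduced.shape)
```

The same selector can then be handed to downstream code, so the zero/non-zero decision never has to be re-implemented by hand.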

Step 3 — Tree Feature Importances with RandomForest

The scenario: The property firm also needs a classification model — predicting whether a house will sell above the median price. For this non-linear problem, a RandomForest is the better model choice. Tree-based models compute feature importances automatically during training. You want to extract these importances, rank the features, and use a threshold to select only the meaningful ones.

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Create a binary target: above-median sale price
median_price = housing_df['sale_price'].median()
y_class      = (housing_df['sale_price'] > median_price).astype(int)

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Train RandomForest — no scaling needed for tree models
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train_c, y_train_c)

# Extract impurity-based feature importances (built-in to all sklearn trees)
importance_df = pd.DataFrame({
    'feature':    X.columns,
    'importance': rf.feature_importances_.round(4)
}).sort_values('importance', ascending=False).reset_index(drop=True)

print("RandomForest feature importances:")
print(importance_df.to_string(index=False))
print()

# Select features above a threshold (e.g. 0.02 = 2% of total importance)
threshold       = 0.02
selected_cols   = importance_df[
    importance_df['importance'] >= threshold
]['feature'].tolist()
print(f"Features selected (importance ≥ {threshold}): {selected_cols}")
RandomForest feature importances:
        feature  importance
   overall_qual      0.3214
    gr_liv_area      0.2871
  total_bsmt_sf      0.1543
    garage_cars      0.0982
    year_built        0.0741
        noise_3       0.0184
        noise_7       0.0171
        noise_1       0.0163
        noise_5       0.0158
        noise_9       0.0147
        noise_2       0.0141
        noise_6       0.0138
        noise_4       0.0131
        noise_8       0.0129
       noise_10       0.0087

Features selected (importance ≥ 0.02): ['overall_qual', 'gr_liv_area', 'total_bsmt_sf', 'garage_cars', 'year_built']

What just happened?

The five true signal features dominate the ranking, with overall_qual alone accounting for roughly a third of the total importance. Every noise feature scored below 2%, so the 0.02 threshold cut them all cleanly. Unlike Lasso, tree importances rarely reach exactly zero, so you need a threshold. The 0.02 cut worked well here because the gap between signal and noise was wide.
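
If you'd rather not hand-roll the threshold loop, sklearn's SelectFromModel can apply the cut for you. A sketch on synthetic make_classification data (not the housing target above), keeping features at or above the median importance:

```python
# Sketch: threshold-based selection from a fitted forest via SelectFromModel.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           n_redundant=0, random_state=42)
rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf.fit(X, y)

# threshold='median' keeps features whose importance is at or above the median
selector = SelectFromModel(rf, prefit=True, threshold='median')
print(selector.get_support().sum(), "of", X.shape[1], "features kept")
```

Besides 'median', the threshold argument also accepts 'mean' or a plain float like the 0.02 used above.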

Step 4 — Permutation Importance: A More Reliable Alternative

The scenario: A senior colleague warns you that impurity-based importances from RandomForest are biased — they tend to overrate high-cardinality numerical features and underrate categorical features with few categories. She recommends permutation importance instead: shuffle each feature column, measure how much model performance drops, and use that drop as the importance score. It's slower but more trustworthy, especially when features have very different cardinalities.

# Permutation importance: randomly shuffle one feature at a time
# and measure the drop in model accuracy on the validation set
# A large drop means the feature was important; a small drop means it wasn't

perm_imp = permutation_importance(
    rf,                          # already-fitted RandomForest
    X_test_c,                    # evaluate on test set (not train — avoids overfitting bias)
    y_test_c,
    n_repeats   = 20,            # shuffle each feature 20 times, take the mean drop
    random_state= 42,
    n_jobs      = -1,
    scoring     = 'roc_auc'
)

# Build comparison: impurity-based vs permutation importance
perm_df = pd.DataFrame({
    'feature':          X.columns,
    'impurity_imp':     rf.feature_importances_.round(4),
    'permutation_imp':  perm_imp.importances_mean.round(4),
    'perm_std':         perm_imp.importances_std.round(4)
}).sort_values('permutation_imp', ascending=False).reset_index(drop=True)

print("Impurity vs Permutation importance:")
print(perm_df.to_string(index=False))
print()

# Features where permutation importance > 0 (positive = genuinely useful)
reliable = perm_df[perm_df['permutation_imp'] > 0.001]['feature'].tolist()
print("Reliably important features:", reliable)
Impurity vs Permutation importance:
        feature  impurity_imp  permutation_imp  perm_std
   overall_qual        0.3214           0.1843    0.0124
    gr_liv_area        0.2871           0.1612    0.0118
  total_bsmt_sf        0.1543           0.0874    0.0091
    garage_cars        0.0982           0.0541    0.0073
    year_built         0.0741           0.0382    0.0061
        noise_3        0.0184           0.0003    0.0012
        noise_7        0.0171           0.0002    0.0011
        noise_1        0.0163           0.0001    0.0009
        noise_5        0.0158           0.0001    0.0010
        noise_9        0.0147           0.0000    0.0008
        noise_2        0.0141           0.0000    0.0009
        noise_6        0.0138          -0.0001    0.0008
        noise_4        0.0131          -0.0001    0.0009
        noise_8        0.0129          -0.0002    0.0010
       noise_10        0.0087          -0.0003    0.0011

Reliably important features: ['overall_qual', 'gr_liv_area', 'total_bsmt_sf', 'garage_cars', 'year_built']

What just happened?

Permutation importance evaluated each feature on the test set by shuffling it and measuring the AUC drop. The noise features scored near zero or even slightly negative: shuffling them made no difference, or appeared to help marginally through statistical noise in the measurement. The five true features all scored meaningfully positive, clearly separated from the noise floor. Crucially, because permutation importance is evaluated on the test set rather than the training set, its scores are not inflated by features the model has merely memorised.
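
The cardinality bias the colleague warned about can be demonstrated directly. This is my own synthetic construction: two pure-noise columns, one continuous and one binary, next to one signal column.

```python
# Sketch: impurity importance favours high-cardinality noise over binary noise.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
X = np.column_stack([
    signal,                          # real signal
    rng.normal(size=n),              # continuous noise: many candidate split points
    rng.integers(0, 2, size=n),      # binary noise: one candidate split point
])
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
imp = rf.feature_importances_
print(imp.round(3))
```

Neither noise column carries information, yet impurity importance typically ranks the continuous one higher simply because it offers far more candidate split points; permutation importance would score both near zero.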

Embedded Methods vs Filter vs Wrapper — The Full Picture

Property                Filter                          Wrapper                          Embedded
Speed                   Fastest: no model training      Slowest: many training runs      Fast: one training run
Captures interactions   No                              Yes                              Partially (within the model)
Model-agnostic          Yes                             Yes (any base estimator)         No: tied to one model type
Produces exact zeros    N/A                             N/A                              Lasso: yes; trees need a threshold
Best used when          First-pass elimination on       Moderate feature sets where      Model type already chosen;
                        large feature sets              interactions are suspected       want built-in selection

The talent scout analogy

Filter methods are a talent scout reviewing CVs — fast but misses chemistry between players. Wrapper methods are running trial training sessions with different squad combinations — accurate but expensive. Embedded methods are a coach who watches every player during a real match and naturally learns who contributes — you get the selection result as a by-product of doing the actual job. No extra sessions, no stack of CVs to read: the game itself produces the answer.

Lasso vs Ridge — the critical difference

Ridge regression (L2) also shrinks coefficients but never drives them to exactly zero: it produces small coefficients for all features, not sparse models. Only Lasso (L1) produces exact zeros, because its penalty has a sharp corner at zero in coefficient space. ElasticNet combines both penalties and is useful when you want sparsity but have correlated features, among which Lasso tends to pick one arbitrarily and zero out the rest.
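
The textbook way to see this is the one-feature (orthonormal-design) solution, where β̂ is the unpenalised least-squares estimate:

```latex
\hat\beta^{\,\mathrm{ridge}} = \frac{\hat\beta}{1+\alpha},
\qquad
\hat\beta^{\,\mathrm{lasso}} = \operatorname{sign}(\hat\beta)\,\max\!\bigl(\lvert\hat\beta\rvert - \alpha,\; 0\bigr)
```

Ridge rescales every coefficient but never reaches zero; Lasso's soft-thresholding sets any coefficient with |β̂| ≤ α exactly to zero.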

Prefer permutation importance over impurity importance in production

Impurity-based importances are computed on training data and are inflated for features the model overfits on. Permutation importance is computed on a held-out set and directly measures how much each feature contributes to generalisation performance — which is the only thing you actually care about. It's slower, but for final feature selection decisions it's significantly more trustworthy.
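
For intuition, the mechanism is short enough to write by hand. This is a teaching sketch on synthetic data; in practice use sklearn.inspection.permutation_importance as in Step 4.

```python
# Hand-rolled permutation importance: shuffle one column at a time and
# measure the accuracy drop on a held-out set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

baseline = accuracy_score(y_te, model.predict(X_te))
rng = np.random.default_rng(0)
drops = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])          # break this feature's link to the target
    drops.append(baseline - accuracy_score(y_te, model.predict(X_perm)))

for j, d in enumerate(drops):
    print(f"feature {j}: drop = {d:+.4f}")
```

sklearn's version adds what this sketch omits: repeated shuffles with mean and standard deviation, arbitrary scorers, and parallelism.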

Teacher's Note

The three selection families — filter, wrapper, embedded — are not mutually exclusive. The strongest production pipelines use all three in sequence: filter methods remove the obvious dead weight in seconds, a wrapper method like RFECV then finds the right count and subset, and an embedded method like Lasso or a tree with permutation importance validates the final selection while simultaneously training the deployment model. Running all three and checking that they broadly agree gives you far more confidence than relying on any one method alone. When all three point to the same features, you can present that result to stakeholders with genuine conviction.
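
One way to wire the filter and embedded stages together (my sketch, omitting the expensive wrapper stage, with make_regression standing in for real data) is a single sklearn Pipeline:

```python
# Sketch: filter stage -> embedded selector -> deployment model, in one Pipeline.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

pipe = Pipeline([
    ("filter", VarianceThreshold()),                         # drop constant columns
    ("scale",  StandardScaler()),
    ("embed",  SelectFromModel(LassoCV(cv=5, max_iter=10000))),
    ("model",  LinearRegression()),
])
pipe.fit(X, y)
print(f"train R²: {pipe.score(X, y):.3f}")
```

Keeping the stages in one Pipeline means the selection decisions are refit inside every cross-validation fold, which avoids leaking test information into the feature choice.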

Practice Questions

1. Which regularisation method drives irrelevant feature coefficients to exactly zero, producing a sparse model?



2. Permutation importance should be computed on which dataset split to avoid overfitting-inflated scores? (one word)



3. In Lasso regression, which hyperparameter controls how aggressively feature coefficients are shrunk toward zero?



Quiz

1. You apply Ridge regression to a dataset with 20 features hoping it will zero out the irrelevant ones. It doesn't work. What is the reason?


2. A RandomForest's impurity-based importance scores a noise feature unexpectedly high. Permutation importance scores it near zero. Which result is more trustworthy and why?


3. You select features using RandomForest importances, then deploy a logistic regression model using those features. What risk does this approach carry?


Up Next · Lesson 28

Variance Thresholding

A fast, principled way to remove near-constant columns before any model trains — including sklearn's VarianceThreshold transformer and how to calibrate the threshold for real datasets.