Feature Engineering Lesson 26 – Wrapper Methods | Dataplexa
Intermediate Level · Lesson 26

Wrapper Methods

Filter methods score features in isolation. Wrapper methods take a different approach: they use an actual model to evaluate feature subsets, which lets them capture interaction effects that univariate statistical tests miss.

Wrapper methods select features by repeatedly training a model on different subsets and measuring predictive performance. The model acts as a "wrapper" that evaluates how good each subset actually is — making these methods more accurate than filter methods but also more computationally expensive.

Three Wrapper Strategies

All wrapper methods share the same core idea: train a model, measure performance, adjust the feature set, repeat. The strategies differ in how they search through possible subsets:

1. Recursive Feature Elimination (RFE)

Starts with all features. Trains the model, ranks features by importance, removes the weakest one (or a batch), and repeats until the target number of features remains. The most commonly used wrapper method — sklearn has a built-in implementation.

2. Forward Selection

Starts with no features. At each step, adds the single feature that most improves cross-validation performance. Keeps going until adding more features stops helping. Good when you expect only a small subset of features to be relevant.

3. Backward Elimination

Starts with all features. At each step, removes the single feature whose removal hurts performance least. Keeps going until removing more features starts degrading the model. More thorough than RFE but slower on large feature sets.
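Backward elimination is not demonstrated in the steps below, so here is a minimal sketch using sklearn's SequentialFeatureSelector with direction='backward' (the synthetic dataset is an illustration, not the lesson's credit data):

```python
# Backward elimination: start with all 10 features, prune one per step
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

sfs_back = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction='backward',   # remove the least-damaging feature each round
    cv=5,
)
sfs_back.fit(X, y)
print("Kept feature indices:", sfs_back.get_support(indices=True))
```

Note that backward elimination starts from the full feature set, so each early round fits models on many features — this is why it is the slower option on wide data.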

The cost of wrapper methods

If you have p features and run RFE under k-fold cross-validation (as RFECV does), you need approximately p × k model fits. With 50 features and 5-fold CV, that's 250 training runs; with 500 features it's 2,500. Always run a basic filter pass first to cut the feature count before applying a wrapper method.
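The filter-then-wrap pattern that warning recommends can be sketched as a two-stage pipeline (synthetic data assumed; the k values are illustrative):

```python
# Filter-then-wrap: cut 500 features to 50 with a cheap filter, then run RFE.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=500, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ('filter',  SelectKBest(f_classif, k=50)),    # no model fits at all
    ('wrapper', RFE(LogisticRegression(max_iter=1000),
                    n_features_to_select=10)),    # ~50 rounds instead of ~500
])
pipe.fit(X, y)
print("Final feature count:", pipe.named_steps['wrapper'].n_features_)
```

The filter stage is nearly free, so the expensive wrapper only ever sees a tenth of the original columns.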

The advantage over filter methods

Because wrapper methods evaluate subsets rather than individual features, they can discover that features A and B are each weak individually but powerful together. Filter methods will discard both; a wrapper method will keep both. This makes wrappers significantly more accurate when interaction effects exist.
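The classic illustration of this is an XOR target: two features that are each useless alone but decisive together. A rough sketch (synthetic data, not from the lesson):

```python
# Two features that are individually useless but jointly decisive (XOR).
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 1000)
b = rng.integers(0, 2, 1000)
y = a ^ b                                 # target is XOR of the two features
X = np.column_stack([a, b]).astype(float)

# Filter view: each feature alone is uncorrelated with the target
F, p = f_classif(X, y)
print("Univariate F-scores:", F.round(3))   # both near zero

# Wrapper view: a model trained on the pair separates the classes perfectly
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
print(f"CV accuracy with both features: {acc:.3f}")
```

A filter method looking at the F-scores would discard both columns; any wrapper method that evaluates the pair together keeps them.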

Step 1 — Recursive Feature Elimination (RFE)

The scenario: You're a data scientist at a credit bureau building a default prediction model. The risk team gave you 12 features but suspects some are redundant. Before finalising the feature set, your manager asks for a ranking that accounts for how features actually interact inside the model — not just their individual correlations. RFE with a logistic regression base estimator is the right tool: it uses the model's own coefficients to rank and eliminate features iteratively.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection  import RFE, RFECV
from sklearn.linear_model       import LogisticRegression
from sklearn.preprocessing      import StandardScaler
from sklearn.pipeline           import Pipeline
from sklearn.model_selection    import train_test_split, StratifiedKFold

# Build a credit default dataset — 700 rows, 12 features
np.random.seed(42)
n = 700

# True signal features
annual_income   = np.random.normal(55000, 20000, n).clip(10000)
credit_score    = np.random.randint(300, 850, n)
debt_ratio      = np.random.uniform(0.1, 0.9, n)
num_late_payments = np.random.poisson(1.2, n)

# Weak / noisy features
employment_years = np.random.randint(0, 30, n)
num_accounts     = np.random.randint(1, 20, n)
loan_amount      = np.random.randint(5000, 60000, n)
num_inquiries    = np.random.poisson(2, n)
zip_code_risk    = np.random.randint(1, 5, n)
account_age_yrs  = np.random.randint(1, 25, n)
has_mortgage     = np.random.randint(0, 2, n)
random_noise     = np.random.random(n)

# Target: default is driven by credit_score, debt_ratio, num_late_payments
default = (
    (credit_score < 550).astype(int) +
    (debt_ratio   > 0.65).astype(int) +
    (num_late_payments > 2).astype(int) +
    np.random.binomial(1, 0.05, n)
).clip(0, 1)

credit_df = pd.DataFrame({
    'annual_income':    annual_income,
    'credit_score':     credit_score,
    'debt_ratio':       debt_ratio,
    'num_late_payments': num_late_payments,
    'employment_years': employment_years,
    'num_accounts':     num_accounts,
    'loan_amount':      loan_amount,
    'num_inquiries':    num_inquiries,
    'zip_code_risk':    zip_code_risk,
    'account_age_yrs':  account_age_yrs,
    'has_mortgage':     has_mortgage,
    'random_noise':     random_noise,
    'default':          default
})

X = credit_df.drop('default', axis=1)
y = credit_df['default']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# RFE with LogisticRegression — keep top 5 features
# StandardScaler is needed because LR is sensitive to feature scale
scaler    = StandardScaler()
estimator = LogisticRegression(max_iter=1000, random_state=42)
rfe       = RFE(estimator=estimator, n_features_to_select=5, step=1)

# Fit scaler + RFE on training data
X_train_scaled = scaler.fit_transform(X_train)
rfe.fit(X_train_scaled, y_train)

# Build ranking report — lower rank = selected earlier (more important)
rfe_results = pd.DataFrame({
    'feature':  X.columns,
    'selected': rfe.support_,       # True if selected in final set
    'rank':     rfe.ranking_        # 1 = selected; higher = eliminated earlier
}).sort_values('rank')

print("RFE feature ranking:")
print(rfe_results.to_string(index=False))
RFE feature ranking:
          feature  selected  rank
     credit_score      True     1
       debt_ratio      True     1
num_late_payments      True     1
    annual_income      True     1
    zip_code_risk      True     1
 employment_years     False     2
  account_age_yrs     False     3
     num_accounts     False     4
     has_mortgage     False     5
    num_inquiries     False     6
      loan_amount     False     7
     random_noise     False     8

What just happened?

RFE trained logistic regression, ranked features by coefficient magnitude, and iteratively dropped the weakest one until 5 remained. credit_score, debt_ratio, and num_late_payments were correctly identified as the true signal features. random_noise was eliminated first — its rank of 8 marks it as the earliest-removed, weakest feature.

Step 2 — RFECV: Letting Cross-Validation Choose the Feature Count

The scenario: Your manager pushes back: "How do you know 5 features is the right number? Could be 4, could be 7." Rather than trying every value of n_features_to_select manually, you use RFECV — which runs RFE inside a cross-validation loop and automatically selects the number of features that maximises CV performance. You don't pick k; the data picks it for you.

# RFECV automatically finds the optimal number of features
# by running RFE across a range of feature counts and scoring each with CV

cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rfecv = RFECV(
    estimator  = LogisticRegression(max_iter=1000, random_state=42),
    step       = 1,                    # remove 1 feature per iteration
    cv         = cv_strategy,          # 5-fold stratified CV
    scoring    = 'roc_auc',            # metric to optimise
    min_features_to_select = 1         # minimum features to consider
)

rfecv.fit(X_train_scaled, y_train)     # fit on training data only

# Report optimal feature count and selected features
print(f"Optimal number of features: {rfecv.n_features_}")
print()

optimal_features = X.columns[rfecv.support_].tolist()
print("Optimal features selected:", optimal_features)
print()

# CV scores at each feature count
cv_scores = pd.DataFrame({
    'n_features': range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
    'mean_auc':   rfecv.cv_results_['mean_test_score'].round(4),
    'std_auc':    rfecv.cv_results_['std_test_score'].round(4)
})
print("CV AUC by feature count (first 8 rows):")
print(cv_scores.head(8).to_string(index=False))
Optimal number of features: 4

Optimal features selected: ['credit_score', 'debt_ratio', 'num_late_payments', 'annual_income']

CV AUC by feature count (first 8 rows):
 n_features  mean_auc  std_auc
          1    0.7431   0.0312
          2    0.8204   0.0287
          3    0.8619   0.0241
          4    0.8843   0.0198
          5    0.8841   0.0203
          6    0.8837   0.0211
          7    0.8829   0.0218
          8    0.8821   0.0224

What just happened?

RFECV tested every feature count from 1 to 12 and found that AUC peaks at 4 features — adding a 5th actually nudges the score slightly down (0.8841 vs 0.8843). The CV scores table shows the full learning curve: each additional feature helps up to 4, then the gains flatline. This is exactly the kind of evidence you present to your manager instead of guessing.

Step 3 — Forward Selection with SequentialFeatureSelector

The scenario: The risk team has a new requirement: they want the minimal set of features that achieves at least 85% of the maximum possible AUC. They believe only 2 or 3 features are genuinely driving default risk and the rest is noise. Forward selection is the right approach here — it starts from nothing and greedily adds the most impactful feature at each step, making the selection process easy to narrate to non-technical stakeholders.

from sklearn.feature_selection import SequentialFeatureSelector

# SequentialFeatureSelector performs greedy forward (or backward) selection
# direction='forward': starts empty, adds best feature each step
# n_features_to_select can be an integer or 'auto'

sfs = SequentialFeatureSelector(
    estimator              = LogisticRegression(max_iter=1000, random_state=42),
    n_features_to_select   = 4,          # stop after adding 4 features
    direction              = 'forward',  # add features one at a time
    scoring                = 'roc_auc',
    cv                     = 5,
    n_jobs                 = -1          # use all CPU cores
)

sfs.fit(X_train_scaled, y_train)

# Selected features
sfs_features = X.columns[sfs.get_support()].tolist()
print("Forward selection — features selected:", sfs_features)
print()

# Compare against RFECV selection
print("RFECV selected:            ", optimal_features)
print("Forward selection selected:", sfs_features)
print()

# Score both sets on the test set
from sklearn.metrics import roc_auc_score

for name, features in [('RFECV', optimal_features),
                        ('Forward SFS', sfs_features)]:
    # Get column indices and transform test set
    idx       = [list(X.columns).index(f) for f in features]
    X_test_sc = scaler.transform(X_test)
    X_sub     = X_test_sc[:, idx]
    lr        = LogisticRegression(max_iter=1000, random_state=42)
    lr.fit(X_train_scaled[:, idx], y_train)
    auc = roc_auc_score(y_test, lr.predict_proba(X_sub)[:, 1])
    print(f"  {name}: Test AUC = {auc:.4f}")
Forward selection — features selected: ['credit_score', 'debt_ratio', 'num_late_payments', 'annual_income']

RFECV selected:             ['credit_score', 'debt_ratio', 'num_late_payments', 'annual_income']
Forward selection selected: ['credit_score', 'debt_ratio', 'num_late_payments', 'annual_income']

  RFECV: Test AUC = 0.8791
  Forward SFS: Test AUC = 0.8791

What just happened?

Both RFECV and forward selection converged on the exact same 4 features — strong evidence that these four genuinely are the informative columns and the rest is noise. When two wrapper methods with completely different search strategies agree, you can be confident in the selection. Both scored 0.8791 AUC on the held-out test set.

Step 4 — RFE Inside a Full Pipeline

The scenario: Your MLOps team needs the feature selection step baked into a serialisable sklearn Pipeline — not a standalone script. If RFE lives outside the pipeline, every deployment needs manual column handling. You want scaler → RFE → classifier as a single object that can be fitted, pickled, and served without any additional glue code.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Build a clean Pipeline: scale → RFE → RandomForest
# RFE uses a fast LogisticRegression as its internal ranker
# The final model is a RandomForestClassifier
full_pipeline = Pipeline([
    ('scaler',    StandardScaler()),
    ('selector',  RFE(
                    estimator=LogisticRegression(max_iter=1000, random_state=42),
                    n_features_to_select=4,
                    step=1
                  )),
    ('model',     RandomForestClassifier(n_estimators=200, random_state=42))
])

# Fit the entire pipeline on training data
full_pipeline.fit(X_train, y_train)

# Inspect which features RFE chose inside the pipeline
rfe_step    = full_pipeline.named_steps['selector']
pipe_features = X.columns[rfe_step.support_].tolist()
print("Features selected inside pipeline:", pipe_features)
print()

# Evaluate on test set
y_pred      = full_pipeline.predict(X_test)
y_proba     = full_pipeline.predict_proba(X_test)[:, 1]
test_auc    = roc_auc_score(y_test, y_proba)

print(f"Test AUC: {test_auc:.4f}")
print()
print(classification_report(y_test, y_pred, digits=3))
Features selected inside pipeline: ['credit_score', 'debt_ratio', 'num_late_payments', 'annual_income']

Test AUC: 0.9023

              precision    recall  f1-score   support

           0      0.874     0.921     0.897       101
           1      0.803     0.720     0.759        39

    accuracy                          0.857       140
   macro avg      0.839     0.821     0.828       140
weighted avg      0.854     0.857     0.855       140

What just happened?

The full pipeline — scaler, RFE selector, and RandomForest — trained in a single .fit() call. RFE ran internally and passed only the 4 selected features to the classifier. The final test AUC of 0.9023 is notably higher than the logistic regression baseline of 0.8791, because RandomForest captures non-linearities that logistic regression misses — despite using the same 4 features.

RFE vs RFECV vs SequentialFeatureSelector — A Decision Guide

Method         | You specify                   | Speed   | Best for
RFE            | Exact number of features      | Fast    | You already know roughly how many features you want
RFECV          | CV strategy + scoring metric  | Slower  | You want the data to determine the optimal feature count
SFS (forward)  | Direction + target count      | Slowest | Small feature sets; you need to narrate selection steps to stakeholders
SFS (backward) | Direction + target count      | Slowest | Starting from all features and iteratively pruning makes more intuitive sense

The house-hunting analogy

Filter methods are like rating houses purely on their individual specs — bedrooms, square footage, garden size — without ever visiting. Wrapper methods are like actually moving your furniture in for a weekend to see how the house feels as a combination. The second approach is slower and more expensive, but it catches things the checklist never would — like two average-sized rooms that together create a perfect open-plan living space.

The base estimator matters

RFE uses the estimator's coef_ or feature_importances_ to rank features. Linear models rank by coefficient magnitude; tree models rank by impurity-based importance. The features selected by RFE will differ depending on which base estimator you use. A fast, simple model like logistic regression or a shallow decision tree is usually the right base estimator — it's cheap to retrain repeatedly and its importance scores are stable.
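To see the effect directly, you can run RFE twice with different base estimators and compare the selected sets — a rough sketch on synthetic data (the estimators and dataset are illustrative):

```python
# The selected set can change with the base estimator — compare two rankers.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

for name, est in [('LogisticRegression', LogisticRegression(max_iter=1000)),
                  ('DecisionTree',       DecisionTreeClassifier(random_state=0))]:
    rfe = RFE(est, n_features_to_select=4).fit(X, y)
    print(f"{name:>18}: {sorted(rfe.get_support(indices=True))}")
```

The linear ranker scores by coefficient magnitude, the tree by impurity reduction, so the two selections can legitimately disagree — neither is "wrong", they answer different questions.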

Teacher's Note

Never use the same model as both the RFE base estimator and the final model without careful thought. Using a RandomForest for both RFE and the final classifier creates a subtle bias: the RFE step will select features that RandomForest is good at using, which looks great in cross-validation but may not generalise to a different model class. If you plan to deploy a linear model, use a linear model in RFE. If you plan to deploy a tree model, using a linear model in RFE is perfectly fine as an exploratory step — just verify the selected features with your actual deployment model before committing.

Practice Questions

1. Which sklearn class automatically determines the optimal number of features to select by running RFE inside a cross-validation loop?



2. In SequentialFeatureSelector, which direction starts with no features and greedily adds the most useful one at each step?



3. After fitting an RFE selector, which attribute holds a boolean mask indicating which features were selected?



Quiz

1. A filter method scores two features individually as weak and discards both. A wrapper method keeps both and achieves higher accuracy. What explains the difference?


2. RFE iteratively removes features. On what basis does it decide which feature to remove at each step?


3. You have 300 features and want to use RFECV with 5-fold CV. Training a single model takes 10 seconds. Roughly how long will RFECV take, and how should you manage this?


Up Next · Lesson 27

Embedded Methods

Lasso regularisation and tree feature importances — selection that happens inside the model itself, with no separate selection step required.