Feature Engineering Course
Feature Selection Basics
More features is not always better. Adding irrelevant or redundant columns to your model makes it slower to train, harder to interpret, and often less accurate. Feature selection is the discipline of deciding which columns actually earn their place.
Feature selection is the process of identifying and keeping only the subset of input columns that contribute useful, non-redundant signal to a model — improving generalisation, reducing training time, and making the model easier to explain to stakeholders.
The Problem with Too Many Features
It feels counterintuitive. Surely more information is better? But in machine learning, every irrelevant column you add to a dataset introduces a cost — sometimes a serious one. Here's what happens when you include features that don't belong:
The curse of dimensionality
In high-dimensional spaces, every data point becomes roughly equidistant from every other. Distance-based models like KNN and K-Means lose their ability to find meaningful clusters because noise dimensions swamp the signal dimensions.
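You can see the effect directly with a few lines of NumPy. This is an illustrative sketch on synthetic uniform data (not one of the lesson's datasets): as the number of dimensions grows, the gap between a point's nearest and farthest neighbour collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 random points; how distinct are the nearest and farthest
# neighbours of one reference point as the number of dimensions grows?
contrasts = {}
for d in [2, 10, 100, 1000]:
    X = rng.random((200, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    contrasts[d] = (dists.max() - dists.min()) / dists.min()
    print(f"dims={d:5d}  relative contrast={contrasts[d]:.2f}")
```

In low dimensions the nearest neighbour is dramatically closer than the farthest; in high dimensions all distances bunch together, which is exactly why distance-based models degrade.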
Overfitting on noise
A model with 200 features and 500 rows has enough parameters to memorise the training set rather than learning the real pattern. Train accuracy looks great; test accuracy collapses. Irrelevant features give the model junk to fit on.
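This failure mode is easy to reproduce. The sketch below (synthetic data, not the lesson's churn set) fits ordinary least squares to pure noise with 50 features and only 60 rows: the training fit looks respectable while the out-of-sample fit does not.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pure noise: 60 rows, 50 features, and a target unrelated to any of them
X_train, X_test = rng.normal(size=(60, 50)), rng.normal(size=(60, 50))
y_train, y_test = rng.normal(size=60), rng.normal(size=60)

# With 50 free parameters and 60 rows, least squares can fit most of
# the training noise even though there is no real pattern to learn
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def r2(X, y):
    ss_res = ((y - X @ coef) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

print(f"train R2: {r2(X_train, y_train):.2f}")  # inflated by memorised noise
print(f"test  R2: {r2(X_test, y_test):.2f}")    # typically near or below zero
```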
Slower training and inference
Every additional feature adds computation at both training and prediction time. In production systems serving thousands of requests per second, a leaner feature set directly translates to faster response times and lower infrastructure cost.
Harder to explain and maintain
A model that uses 8 well-chosen features is explainable to a business stakeholder. A model that uses 150 features is a black box — and every one of those features needs to be maintained, monitored, and kept live in production pipelines.
Three Families of Feature Selection
Feature selection methods fall into three broad families. Each operates differently and suits different situations. The next three lessons cover each family in depth — here we build the mental model for all three:
Lesson 25
Filter Methods
Score each feature independently using a statistical measure — correlation, chi-squared, mutual information. Fast, model-agnostic, and a good first pass. Run before training anything.
Lesson 26
Wrapper Methods
Use a model itself to evaluate subsets of features — forward selection, backward elimination, RFE. More accurate than filter methods but computationally expensive. Best for smaller feature sets.
Lesson 27
Embedded Methods
Selection happens during model training — Lasso regularisation, tree feature importances. No separate selection step needed; the model selects and trains simultaneously. Often the best balance of speed and accuracy.
Step 1 — Spotting Useless Features: Zero and Near-Zero Variance
The scenario: You've just joined a fintech team and inherited a churn prediction model. The feature matrix has 40 columns. Before doing anything sophisticated, you want to run a basic sanity check: are there any columns where almost every row has the same value? A column where 99% of rows are 0 is essentially a constant — it carries no information. These are the easiest features to drop, and finding them takes seconds.
# Import libraries
import pandas as pd
import numpy as np
# Build a churn dataset — 500 rows, 8 columns, some deliberately useless
np.random.seed(42)
churn_df = pd.DataFrame({
    'customer_id': range(1, 501),
    'monthly_spend': np.random.normal(120, 40, 500).clip(0),
    'tenure_months': np.random.randint(1, 72, 500),
    'support_tickets': np.random.poisson(1.5, 500),
    # Nearly constant — roughly 498 of 500 rows share the same value
    'country_code': np.random.choice([1, 2], p=[0.996, 0.004], size=500),
    # Completely constant — zero variance
    'api_version': np.ones(500, dtype=int),
    # Random noise — high variance but pure noise
    'random_noise': np.random.random(500),
    'churned': np.random.choice([0, 1], p=[0.75, 0.25], size=500)
})
# Step 1: Compute variance of each numerical column
variances = churn_df.drop('customer_id', axis=1).var()
print("Variance per column:")
print(variances.round(4).to_string())
print()
# Step 2: Flag columns with near-zero variance (threshold = 0.01)
low_var = variances[variances < 0.01]
print("Low / zero variance columns:")
print(low_var)
Variance per column:
monthly_spend      1591.2043
tenure_months       415.8732
support_tickets       1.4897
country_code          0.0040
api_version           0.0000
random_noise          0.0833
churned               0.1882

Low / zero variance columns:
country_code    0.004
api_version     0.000
dtype: float64
What just happened?
api_version has zero variance — every single row is 1. It contributes nothing. country_code is near-zero — 99.6% of rows share the same value, leaving the model almost nothing to split on. Both are immediate drop candidates before any modelling begins. Notice, though, that random_noise survives the check with a healthy variance of 0.0833 despite carrying no signal at all: variance filtering catches constants, not noise. Later steps are needed for the rest.
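One caveat before relying on a fixed variance cutoff: raw variance is not scale-free. A hypothetical sketch (not part of the churn dataset) shows the same measurement, expressed in different units, landing on opposite sides of the 0.01 threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
spend = rng.normal(120, 40, 500)

scale_df = pd.DataFrame({
    'spend_dollars': spend,               # variance in the thousands
    'spend_millions': spend / 1_000_000,  # identical information, tiny variance
})
print(scale_df.var())
# A 0.01 threshold keeps spend_dollars but flags spend_millions,
# even though both columns carry exactly the same signal.
```

For this reason it is common to standardise features, or use a relative threshold, before variance filtering.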
Step 2 — Removing Highly Correlated Features
The scenario: Two features in your dataset are nearly identical — they measure the same thing slightly differently. Keeping both doesn't give your model twice the information; it gives it the same information twice, plus multicollinearity problems in linear models. You need to detect these pairs and remove one from each duplicate pair. The tool is the correlation matrix.
# Build a dataset where some features are deliberately highly correlated
np.random.seed(7)
n = 400
base_income = np.random.normal(50000, 15000, n)  # true underlying signal
loan_df = pd.DataFrame({
    'annual_income': base_income,
    # gross_income is almost identical — just annual_income + tiny noise
    'gross_income': base_income + np.random.normal(0, 500, n),
    # monthly_income is annual / 12 — perfectly derivable
    'monthly_income': base_income / 12 + np.random.normal(0, 50, n),
    'credit_score': np.random.randint(300, 850, n),
    'loan_amount': np.random.randint(5000, 80000, n),
    'default': np.random.choice([0, 1], p=[0.82, 0.18], size=n)
})
# Step 1: Compute the correlation matrix for numerical features
corr_matrix = loan_df.drop('default', axis=1).corr().round(3)
print("Correlation matrix:")
print(corr_matrix.to_string())
print()
# Step 2: Find pairs with absolute correlation > 0.90
threshold = 0.90
cols = corr_matrix.columns
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):  # upper triangle only
        if abs(corr_matrix.iloc[i, j]) > threshold:
            print(f"High correlation: {cols[i]} vs {cols[j]} "
                  f"= {corr_matrix.iloc[i, j]:.3f} → drop: {cols[j]}")
            to_drop.add(cols[j])  # drop the second of the pair
print()
print("Columns to drop:", list(to_drop))
Correlation matrix:
annual_income gross_income monthly_income credit_score loan_amount
annual_income 1.000 1.000 1.000 0.012 0.021
gross_income 1.000 1.000 1.000 0.012 0.021
monthly_income 1.000 1.000 1.000 0.012 0.021
credit_score 0.012 0.012 0.012 1.000 0.003
loan_amount 0.021 0.021 0.021 0.003 1.000
High correlation: annual_income vs gross_income = 1.000 → drop: gross_income
High correlation: annual_income vs monthly_income = 1.000 → drop: monthly_income
Columns to drop: ['gross_income', 'monthly_income']

What just happened?
gross_income and monthly_income both correlate at 1.000 with annual_income — they are functionally the same column. Keeping all three gives a linear model severe multicollinearity: it can't separate the effect of each and will assign unstable, near-arbitrary coefficients. The loop through the upper triangle is the standard pattern for deduplicating correlated pairs without counting each pair twice.
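The coefficient instability is easy to demonstrate. This illustrative sketch (separate from the loan dataset above) fits ordinary least squares several times on near-duplicate income columns: the individual coefficients swing from run to run while their sum stays pinned to the true effect.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 400
income = rng.normal(50000, 15000, n)
y = 0.001 * income + rng.normal(0, 5, n)  # the true total effect is 0.001

results = []
for trial in range(3):
    # Two near-identical copies of the same underlying feature
    X = np.column_stack([income, income + rng.normal(0, 50, n)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    results.append(coef)
    # The split between the two columns is near-arbitrary; the sum is stable
    print(f"trial {trial}: coefs={coef.round(4)}, sum={coef.sum():.4f}")
```

The model cannot decide how to divide the effect between two columns that say the same thing, so it assigns unstable coefficients whose individual values mean little.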
Step 3 — Correlation with the Target
The scenario: You want a quick ranking of which features have the strongest linear relationship with the target variable. This isn't a final selection decision — it's a first-pass diagnostic that takes seconds and immediately shows you which features are working hard and which are likely noise. For regression targets you use Pearson correlation; for binary targets, point-biserial correlation (which pandas handles identically).
# Build a housing regression dataset — target is sale_price
np.random.seed(0)
n = 300
overall_qual = np.random.randint(1, 11, n)
housing_df = pd.DataFrame({
    'lot_area': np.random.lognormal(8.5, 0.4, n),
    'overall_qual': overall_qual,
    # grade_score is a near-duplicate of overall_qual (rescaled, slightly noisy)
    'grade_score': overall_qual * 10 + np.random.randint(0, 5, n),
    'year_built': np.random.randint(1920, 2020, n),
    'total_rooms': np.random.randint(3, 12, n),
    # random_col is pure noise — should rank near zero
    'random_col': np.random.random(n),
})
# sale_price is driven mainly by overall_qual and total_rooms
housing_df['sale_price'] = (
    housing_df['overall_qual'] * 30000 +
    housing_df['total_rooms'] * 8000 +
    housing_df['lot_area'] * 0.5 +
    np.random.normal(0, 15000, n)
).round(0)
# Step 1: Pearson correlation of each feature with the target
target_corr = (
    housing_df
    .drop('sale_price', axis=1)
    .corrwith(housing_df['sale_price'])
    .abs()  # absolute value — direction doesn't matter here
    .sort_values(ascending=False)
    .round(4)
)
print("Absolute correlation with sale_price:")
print(target_corr.to_string())
print()
# Step 2: Keep features above a minimum correlation threshold
min_corr = 0.05
keep_cols = target_corr[target_corr >= min_corr].index.tolist()
drop_cols = target_corr[target_corr < min_corr].index.tolist()
print(f"Keep ({len(keep_cols)}): {keep_cols}")
print(f"Drop ({len(drop_cols)}): {drop_cols}")
Absolute correlation with sale_price:
overall_qual    0.8821
grade_score     0.8804
total_rooms     0.6134
lot_area        0.1873
year_built      0.0412
random_col      0.0091

Keep (4): ['overall_qual', 'grade_score', 'total_rooms', 'lot_area']
Drop (2): ['year_built', 'random_col']
What just happened?
corrwith() computed each feature's Pearson correlation with sale_price in one call. overall_qual and total_rooms are the clear leaders — exactly as designed — and random_col sits near zero. Notice that grade_score ranks right beside overall_qual: as a near-duplicate it inherits almost the same target correlation, and a target-correlation ranking cannot tell you that two features are redundant with each other. Catching that redundancy is the job of the pairwise check from Step 2. Note also that Pearson only captures linear relationships; a feature with a strong non-linear relationship to the target would score deceptively low here, which is one reason filter methods alone aren't sufficient.
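To see the linear-only limitation concretely, here is a small synthetic sketch (not part of the housing data): a target that depends strongly but symmetrically on a feature has a Pearson correlation near zero, so a correlation filter would wrongly discard the feature.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 1000)
y = x ** 2 + rng.normal(0, 0.5, 1000)  # strong, but not linear

# Pearson sees almost nothing; a transform that linearises it sees a lot
print(f"corr(x, y)   = {np.corrcoef(x, y)[0, 1]:.3f}")
print(f"corr(|x|, y) = {np.corrcoef(np.abs(x), y)[0, 1]:.3f}")
```

Mutual information, covered in the next lesson, is one filter statistic that does detect this kind of relationship.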
Step 4 — Putting It Together: A Basic Selection Pipeline
The scenario: Your team wants a reproducible, three-step selection function that can be run on any new dataset before training: first remove zero-variance columns, then remove highly correlated duplicates, then rank remaining columns by target correlation and optionally trim the bottom. This covers the basics in a single reusable function.
def basic_feature_selection(df, target_col, var_threshold=0.01,
                            corr_threshold=0.90, min_target_corr=0.03):
    """
    Three-stage basic feature selection.
    Returns: (selected_columns, dropped_report)
    """
    dropped = {}  # track what was dropped and why
    features = df.drop(columns=[target_col])

    # --- Stage 1: Remove zero / near-zero variance ---
    variances = features.var()
    low_var_cols = variances[variances < var_threshold].index.tolist()
    features = features.drop(columns=low_var_cols)
    dropped['low_variance'] = low_var_cols

    # --- Stage 2: Remove highly correlated duplicates ---
    corr_mat = features.corr().abs()
    upper_tri = corr_mat.where(
        np.triu(np.ones(corr_mat.shape), k=1).astype(bool)
    )
    high_corr_cols = [
        col for col in upper_tri.columns
        if any(upper_tri[col] > corr_threshold)
    ]
    features = features.drop(columns=high_corr_cols)
    dropped['high_correlation'] = high_corr_cols

    # --- Stage 3: Filter by target correlation ---
    target_corr = features.corrwith(df[target_col]).abs()
    low_corr_cols = target_corr[target_corr < min_target_corr].index.tolist()
    features = features.drop(columns=low_corr_cols)
    dropped['low_target_corr'] = low_corr_cols

    return features.columns.tolist(), dropped
# Run on the housing dataset
selected, report = basic_feature_selection(
    housing_df,
    target_col='sale_price',
    var_threshold=0.01,
    corr_threshold=0.90,
    min_target_corr=0.05
)
print("Selected features:", selected)
print()
print("Dropped summary:")
for reason, cols in report.items():
    print(f"  {reason}: {cols}")
Selected features: ['overall_qual', 'total_rooms', 'lot_area']

Dropped summary:
  low_variance: []
  high_correlation: ['grade_score']
  low_target_corr: ['year_built', 'random_col']
What just happened?
The three-stage pipeline whittled six features down to three. Stage 1 found nothing with near-zero variance. Stage 2 caught grade_score as a near-duplicate of overall_qual. Stage 3 cut year_built and random_col for low target correlation. The three surviving features are exactly the three that were designed to drive sale_price.
Feature Selection vs Dimensionality Reduction
These two terms are often confused. They solve related but different problems:
Feature Selection
Keeps a subset of the original columns. The selected features remain interpretable — you can still say "this model uses annual income and credit score." Nothing is transformed or combined.
Best when: interpretability matters, you need to explain the model to stakeholders, or you want to reduce maintenance overhead.
Dimensionality Reduction (e.g. PCA)
Transforms the original columns into a smaller set of new components. No original feature survives intact — you get "Principal Component 1" which is a weighted combination of everything.
Best when: maximum compression is needed and interpretability is less important — image data, NLP embeddings, very high-dimensional datasets.
The job applicant analogy
Feature selection is like shortlisting candidates — you review all 200 applicants and invite 8 for interview. Each person on the shortlist is still themselves, fully identifiable. Dimensionality reduction is like creating a composite score from 200 applicants — you collapse them into a single "talent index." Useful, but you can no longer point to individual people.
The danger of selecting on the full dataset
If you compute target correlations on all your data — including the test set — and then select features, you have leaked the test set's relationship with the target into your feature selection step. Your model will look better than it really is. Always run feature selection only on training data, then apply the same column mask to validation and test sets.
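A minimal sketch of the safe pattern, using a hypothetical split and threshold: compute the column mask on training rows only, then apply the same mask to every other split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
df['target'] = df['a'] * 2 + rng.normal(size=200)

train, test = df.iloc[:150], df.iloc[150:]

# The selection decision uses ONLY the training rows
corr = train.drop(columns='target').corrwith(train['target']).abs()
selected = corr[corr >= 0.3].index.tolist()

# The same column mask is applied to both splits — the test set
# never influences which columns survive
X_train, X_test = train[selected], test[selected]
print("selected:", selected)
```

The same discipline applies to basic_feature_selection above: call it on the training frame, then subset validation and test frames with the returned column list.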
Teacher's Note
The three methods covered in this lesson — variance filtering, correlation deduplication, and target correlation ranking — are the quick preliminary pass. They are not a substitute for the more rigorous filter, wrapper, and embedded methods in the next three lessons. Think of them as the pre-screening step you run in the first five minutes to eliminate the obvious dead weight. Then you bring in the heavier tools. Skipping this preliminary pass and going straight to a wrapper method on a 200-feature dataset is a common and expensive mistake — you'll spend hours evaluating feature subsets that include constant columns and near-identical duplicates that should have been removed in thirty seconds.
Practice Questions
1. A column where every single row has the same value will have a ___ of zero. (one word)
2. Which pandas method computes the correlation between every column in a DataFrame and a single target Series in one call?
3. To avoid data leakage, feature selection must be performed using only which data split? (one word)
Quiz
1. A colleague says "PCA and feature selection are the same thing — both reduce the number of columns." What is the key difference?
2. Pearson correlation is used to rank features by their relationship with the target. What is its main limitation as a selection criterion?
3. Which family of feature selection methods uses a model itself to evaluate subsets of features, making it more accurate but computationally expensive?
Up Next · Lesson 25
Filter Methods
Chi-squared, mutual information, ANOVA F-test — the statistical toolkit for scoring every feature independently before a single model trains.