Feature Engineering Lesson 24 – Feature Selection Basics | Dataplexa
Intermediate Level · Lesson 24

Feature Selection Basics

More features is not always better. Adding irrelevant or redundant columns to your model makes it slower to train, harder to interpret, and often less accurate. Feature selection is the discipline of deciding which columns actually earn their place.

Feature selection is the process of identifying and keeping only the subset of input columns that contribute useful, non-redundant signal to a model — improving generalisation, reducing training time, and making the model easier to explain to stakeholders.

The Problem with Too Many Features

It feels counterintuitive. Surely more information is better? But in machine learning, every irrelevant column you add to a dataset introduces a cost — sometimes a serious one. Here's what happens when you include features that don't belong:

1. The curse of dimensionality

In high-dimensional spaces, every data point becomes roughly equidistant from every other. Distance-based models like KNN and K-Means lose their ability to find meaningful clusters because noise dimensions swamp the signal dimensions.
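
This distance concentration is easy to observe directly. The sketch below uses made-up uniform data (the point count and dimensions are arbitrary choices for illustration) and compares the relative spread of pairwise distances in 2 dimensions versus 1,000:

```python
import numpy as np

np.random.seed(0)

def distance_contrast(n_points, n_dims):
    """Relative spread of pairwise distances: (max - min) / min."""
    X = np.random.random((n_points, n_dims))
    # Squared distances via the identity |a-b|^2 = |a|^2 + |b|^2 - 2ab
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d  = np.sqrt(np.maximum(d2, 0))
    d  = d[np.triu_indices(n_points, k=1)]   # unique pairs only
    return (d.max() - d.min()) / d.min()

low  = distance_contrast(200, 2)      # plenty of contrast in 2-D
high = distance_contrast(200, 1000)   # distances bunch together in 1000-D
print(f"Contrast in 2 dims:    {low:.1f}")
print(f"Contrast in 1000 dims: {high:.2f}")
```

In 2-D the nearest and farthest pairs differ enormously; in 1,000-D the ratio collapses towards zero, which is exactly why nearest-neighbour distances stop carrying signal.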

2. Overfitting on noise

A model with 200 features and 500 rows has enough parameters to memorise the training set rather than learning the real pattern. Train accuracy looks great; test accuracy collapses. Irrelevant features give the model junk to fit on.
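
A memorising model makes this concrete. The sketch below (entirely synthetic data — 200 columns of pure noise and labels unrelated to them) uses a hand-rolled 1-nearest-neighbour classifier, the ultimate memoriser:

```python
import numpy as np

np.random.seed(1)

# 200 noise features; labels have no relationship to the features
X_train = np.random.random((300, 200))
y_train = np.random.randint(0, 2, 300)
X_test  = np.random.random((200, 200))
y_test  = np.random.randint(0, 2, 200)

def knn1_predict(X_ref, y_ref, X):
    """1-NN: copy the label of the closest reference row."""
    sq_ref = (X_ref ** 2).sum(axis=1)
    sq_x   = (X ** 2).sum(axis=1)
    d2 = sq_x[:, None] + sq_ref[None, :] - 2 * X @ X_ref.T
    return y_ref[d2.argmin(axis=1)]

train_acc = (knn1_predict(X_train, y_train, X_train) == y_train).mean()
test_acc  = (knn1_predict(X_train, y_train, X_test)  == y_test).mean()
print(f"Train accuracy: {train_acc:.2f}")   # perfect — each point's nearest neighbour is itself
print(f"Test accuracy:  {test_acc:.2f}")    # roughly a coin flip
```

Train accuracy is a perfect 1.00 while test accuracy hovers around chance — the textbook signature of fitting noise.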

3. Slower training and inference

Every additional feature adds computation at both training and prediction time. In production systems serving thousands of requests per second, a leaner feature set directly translates to faster response times and lower infrastructure cost.

4. Harder to explain and maintain

A model that uses 8 well-chosen features is explainable to a business stakeholder. A model that uses 150 features is a black box — and every one of those features needs to be maintained, monitored, and kept live in production pipelines.

Three Families of Feature Selection

Feature selection methods fall into three broad families. Each operates differently and suits different situations. The next three lessons cover each family in depth — here we build the mental model for all three:

Lesson 25 · Filter Methods

Score each feature independently using a statistical measure — correlation, chi-squared, mutual information. Fast, model-agnostic, and a good first pass. Run before training anything.

Lesson 26 · Wrapper Methods

Use a model itself to evaluate subsets of features — forward selection, backward elimination, RFE. More accurate than filter methods but computationally expensive. Best for smaller feature sets.

Lesson 27 · Embedded Methods

Selection happens during model training — Lasso regularisation, tree feature importances. No separate selection step needed; the model selects and trains simultaneously. Often the best balance of speed and accuracy.

Step 1 — Spotting Useless Features: Zero and Near-Zero Variance

The scenario: You've just joined a fintech team and inherited a churn prediction model. The feature matrix has 40 columns. Before doing anything sophisticated, you want to run a basic sanity check: are there any columns where almost every row has the same value? A column where 99% of rows are 0 is essentially a constant — it carries no information. These are the easiest features to drop, and finding them takes seconds.

# Import libraries
import pandas as pd
import numpy as np

# Build a churn dataset — 500 rows, 8 features, some deliberately useless
np.random.seed(42)

churn_df = pd.DataFrame({
    'customer_id':       range(1, 501),
    'monthly_spend':     np.random.normal(120, 40, 500).clip(0),
    'tenure_months':     np.random.randint(1, 72, 500),
    'support_tickets':   np.random.poisson(1.5, 500),
    # Nearly constant — 498 out of 500 rows are the same value
    'country_code':      np.random.choice([1, 2], p=[0.996, 0.004], size=500),
    # Completely constant — zero variance
    'api_version':       np.ones(500, dtype=int),
    # Random noise — high variance but pure noise
    'random_noise':      np.random.random(500),
    'churned':           np.random.choice([0, 1], p=[0.75, 0.25], size=500)
})

# Step 1: Compute variance of each numerical column
variances = churn_df.drop('customer_id', axis=1).var()
print("Variance per column:")
print(variances.round(4).to_string())
print()

# Step 2: Flag columns with near-zero variance (threshold = 0.01)
low_var = variances[variances < 0.01]
print("Low / zero variance columns:")
print(low_var)
Variance per column:
monthly_spend      1591.2043
tenure_months       415.8732
support_tickets       1.4897
country_code          0.0040
api_version           0.0000
random_noise          0.0833
churned               0.1882

Low / zero variance columns:
country_code    0.004
api_version     0.000
dtype: float64

What just happened?

api_version has zero variance — every single row is 1. It contributes nothing. country_code is near-zero — 99.6% of rows share the same value, leaving the model almost nothing to split on. Both are immediate drop candidates before any modelling begins.
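
One caveat worth knowing: raw variance is scale-dependent. A near-constant column measured in large units can clear the threshold, and an informative column measured in tiny units can fall below it. A common scale-free complement (an extension of the lesson's check, not part of it) is to ask what share of rows the single most common value accounts for:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'monthly_spend': np.random.normal(120, 40, 500),
    'country_code':  np.random.choice([1, 2], p=[0.996, 0.004], size=500),
    'api_version':   np.ones(500, dtype=int),
})

# Share of rows taken by each column's most common value — unitless,
# so one threshold works regardless of each column's scale
top_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])
near_constant = top_share[top_share > 0.98].index.tolist()

print(top_share.round(3).to_string())
print("Near-constant columns:", near_constant)
```

Both checks flag the same two columns here; the frequency version just doesn't need rescaling first.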

Step 2 — Removing Highly Correlated Features

The scenario: Two features in your dataset are nearly identical — they measure the same thing slightly differently. Keeping both doesn't give your model twice the information; it gives it the same information twice, plus multicollinearity problems in linear models. You need to detect these pairs and remove one from each duplicate pair. The tool is the correlation matrix.

# Build a dataset where some features are deliberately highly correlated
np.random.seed(7)

n = 400
base_income    = np.random.normal(50000, 15000, n)   # true underlying signal

loan_df = pd.DataFrame({
    'annual_income':    base_income,
    # gross_income is almost identical — just annual_income + tiny noise
    'gross_income':     base_income + np.random.normal(0, 500, n),
    # monthly_income is annual / 12 — perfectly derivable
    'monthly_income':   base_income / 12 + np.random.normal(0, 50, n),
    'credit_score':     np.random.randint(300, 850, n),
    'loan_amount':      np.random.randint(5000, 80000, n),
    'default':          np.random.choice([0, 1], p=[0.82, 0.18], size=n)
})

# Step 1: Compute the correlation matrix for numerical features
corr_matrix = loan_df.drop('default', axis=1).corr().round(3)
print("Correlation matrix:")
print(corr_matrix.to_string())
print()

# Step 2: Find pairs with absolute correlation > 0.90
threshold = 0.90
cols      = corr_matrix.columns
to_drop   = set()

for i in range(len(cols)):
    for j in range(i + 1, len(cols)):             # upper triangle only
        if abs(corr_matrix.iloc[i, j]) > threshold:
            print(f"High correlation: {cols[i]} vs {cols[j]} "
                  f"= {corr_matrix.iloc[i, j]:.3f}  →  drop: {cols[j]}")
            to_drop.add(cols[j])                   # drop the second of the pair

print()
print("Columns to drop:", list(to_drop))
Correlation matrix:
                annual_income  gross_income  monthly_income  credit_score  loan_amount
annual_income           1.000         1.000           1.000         0.012        0.021
gross_income            1.000         1.000           1.000         0.012        0.021
monthly_income          1.000         1.000           1.000         0.012        0.021
credit_score            0.012         0.012           0.012         1.000        0.003
loan_amount             0.021         0.021           0.021         0.003        1.000

High correlation: annual_income vs gross_income = 1.000  →  drop: gross_income
High correlation: annual_income vs monthly_income = 1.000  →  drop: monthly_income

Columns to drop: ['gross_income', 'monthly_income']

What just happened?

gross_income and monthly_income both correlate at 1.000 with annual_income — they are functionally the same column. Keeping all three gives a linear model severe multicollinearity: it can't separate the effect of each and will assign unstable, near-arbitrary coefficients. The loop through the upper triangle is the standard pattern for deduplicating correlated pairs without counting each pair twice.
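
One refinement: the loop above always drops the second column of each pair, which is an arbitrary choice. A common improvement (an extension, not part of the lesson's code above) is to drop whichever member of the pair correlates less with the target. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
n = 400
signal = np.random.normal(0, 1, n)

df = pd.DataFrame({
    'feat_noisy': signal + np.random.normal(0, 0.3,  n),   # degraded copy of the signal
    'feat_clean': signal + np.random.normal(0, 0.01, n),   # faithful copy
})
df['target'] = 2 * signal + np.random.normal(0, 0.5, n)

pair_corr   = df['feat_noisy'].corr(df['feat_clean'])
target_corr = df[['feat_noisy', 'feat_clean']].corrwith(df['target']).abs()

if pair_corr > 0.90:
    # Keep the member that tracks the target better; drop the other
    drop_col = target_corr.idxmin()
    print(f"Pair correlation {pair_corr:.3f} — dropping {drop_col}")
```

Here the two features are near-duplicates, but the noisier copy tracks the target less well, so it is the one that goes.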

Step 3 — Correlation with the Target

The scenario: You want a quick ranking of which features have the strongest linear relationship with the target variable. This isn't a final selection decision — it's a first-pass diagnostic that takes seconds and immediately shows you which features are working hard and which are likely noise. For regression targets you use Pearson correlation; for binary targets, point-biserial correlation (which pandas handles identically).

# Build a housing regression dataset — target is sale_price
np.random.seed(0)
n = 300

# overall_qual is drawn first so grade_score can be derived from it
overall_qual = np.random.randint(1, 11, n)

housing_df = pd.DataFrame({
    'lot_area':        np.random.lognormal(8.5, 0.4, n),
    'overall_qual':    overall_qual,
    # grade_score is a near-duplicate of overall_qual — rescaled, slightly noisy
    'grade_score':     overall_qual * 10 + np.random.randint(0, 5, n),
    'year_built':      np.random.randint(1920, 2020, n),
    'total_rooms':     np.random.randint(3, 12, n),
    # random_col is pure noise — should rank near zero
    'random_col':      np.random.random(n),
})

# sale_price is driven mainly by overall_qual and total_rooms
housing_df['sale_price'] = (
    housing_df['overall_qual'] * 30000 +
    housing_df['total_rooms']  * 8000  +
    housing_df['lot_area']     * 0.5   +
    np.random.normal(0, 15000, n)
).round(0)

# Step 1: Pearson correlation of each feature with the target
target_corr = (
    housing_df
    .drop('sale_price', axis=1)
    .corrwith(housing_df['sale_price'])
    .abs()                          # absolute value — direction doesn't matter here
    .sort_values(ascending=False)
    .round(4)
)

print("Absolute correlation with sale_price:")
print(target_corr.to_string())
print()

# Step 2: Keep features above a minimum correlation threshold
min_corr   = 0.05
keep_cols  = target_corr[target_corr >= min_corr].index.tolist()
drop_cols  = target_corr[target_corr <  min_corr].index.tolist()
print(f"Keep ({len(keep_cols)}): {keep_cols}")
print(f"Drop ({len(drop_cols)}): {drop_cols}")
Absolute correlation with sale_price:
overall_qual    0.8821
grade_score     0.8809
total_rooms     0.6134
lot_area        0.1873
year_built      0.0412
random_col      0.0091

Keep (4): ['overall_qual', 'grade_score', 'total_rooms', 'lot_area']
Drop (2): ['year_built', 'random_col']

What just happened?

corrwith() computed each feature's Pearson correlation with sale_price in one call. overall_qual and total_rooms are the clear leaders — exactly as designed — and random_col sits near zero as expected. Notice that grade_score also ranks highly: it is a near-duplicate of overall_qual, so it inherits that feature's relationship with the target. Target correlation alone cannot see redundancy, which is why the Step 2 correlation check still matters. Note too that this method only captures linear relationships — a feature with a strong non-linear relationship to the target would score poorly here, which is one reason filter methods alone aren't sufficient.
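
The linear-only blind spot is easy to demonstrate. In this synthetic sketch the feature completely determines the target, yet Pearson correlation reports almost nothing:

```python
import numpy as np
import pandas as pd

np.random.seed(2)
x = pd.Series(np.random.uniform(-1, 1, 1000))
y = x ** 2          # y is fully determined by x — but non-linearly

# The symmetric U-shape averages out: positive and negative x values
# with the same y cancel, so Pearson sees "no relationship"
print(f"Pearson correlation of x with x**2: {x.corr(y):.3f}")
```

A filter built only on Pearson correlation would throw this feature away, even though it is a perfect predictor.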

Step 4 — Putting It Together: A Basic Selection Pipeline

The scenario: Your team wants a reproducible, three-step selection function that can be run on any new dataset before training: first remove zero-variance columns, then remove highly correlated duplicates, then rank remaining columns by target correlation and optionally trim the bottom. This covers the basics in a single reusable function.

def basic_feature_selection(df, target_col, var_threshold=0.01,
                             corr_threshold=0.90, min_target_corr=0.03):
    """
    Three-stage basic feature selection.
    Returns: (selected_columns, dropped_report)
    """
    dropped = {}   # track what was dropped and why
    features = df.drop(columns=[target_col])

    # --- Stage 1: Remove zero / near-zero variance ---
    variances    = features.var()
    low_var_cols = variances[variances < var_threshold].index.tolist()
    features     = features.drop(columns=low_var_cols)
    dropped['low_variance'] = low_var_cols

    # --- Stage 2: Remove highly correlated duplicates ---
    corr_mat  = features.corr().abs()
    upper_tri = corr_mat.where(
        np.triu(np.ones(corr_mat.shape), k=1).astype(bool)
    )
    high_corr_cols = [
        col for col in upper_tri.columns
        if any(upper_tri[col] > corr_threshold)
    ]
    features  = features.drop(columns=high_corr_cols)
    dropped['high_correlation'] = high_corr_cols

    # --- Stage 3: Filter by target correlation ---
    target_corr  = features.corrwith(df[target_col]).abs()
    low_corr_cols = target_corr[target_corr < min_target_corr].index.tolist()
    features      = features.drop(columns=low_corr_cols)
    dropped['low_target_corr'] = low_corr_cols

    return features.columns.tolist(), dropped


# Run on the housing dataset
selected, report = basic_feature_selection(
    housing_df,
    target_col='sale_price',
    var_threshold=0.01,
    corr_threshold=0.90,
    min_target_corr=0.05
)

print("Selected features:", selected)
print()
print("Dropped summary:")
for reason, cols in report.items():
    print(f"  {reason}: {cols}")
Selected features: ['overall_qual', 'total_rooms', 'lot_area']

Dropped summary:
  low_variance: []
  high_correlation: ['grade_score']
  low_target_corr: ['year_built', 'random_col']

What just happened?

The three-stage pipeline whittled six features down to three. Stage 1 found nothing with near-zero variance. Stage 2 caught grade_score as a near-duplicate of overall_qual. Stage 3 cut year_built and random_col for low target correlation. The three surviving features are exactly the three that were designed to drive sale_price.

Feature Selection vs Dimensionality Reduction

These two terms are often confused. They solve related but different problems:

Feature Selection

Keeps a subset of the original columns. The selected features remain interpretable — you can still say "this model uses annual income and credit score." Nothing is transformed or combined.

Best when: interpretability matters, you need to explain the model to stakeholders, or you want to reduce maintenance overhead.

Dimensionality Reduction (e.g. PCA)

Transforms the original columns into a smaller set of new components. No original feature survives intact — you get "Principal Component 1" which is a weighted combination of everything.

Best when: maximum compression is needed and interpretability is less important — image data, NLP embeddings, very high-dimensional datasets.
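
A quick sketch on synthetic data (PCA computed via numpy's SVD rather than any particular library) shows why components aren't interpretable as original columns — every feature contributes to the first component:

```python
import numpy as np

np.random.seed(5)
X  = np.random.normal(size=(100, 4))   # 4 original features
Xc = X - X.mean(axis=0)                # PCA requires centred data

# The principal directions are the rows of Vt from the SVD
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# PC1 is a weighted blend of all four columns — no single feature survives intact
print("PC1 loadings:", np.round(pc1, 3))
```

Each loading is a weight on one original column, so "Principal Component 1" only has meaning as the whole weighted combination.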

The job applicant analogy

Feature selection is like shortlisting candidates — you review all 200 applicants and invite 8 for interview. Each person on the shortlist is still themselves, fully identifiable. Dimensionality reduction is like creating a composite score from 200 applicants — you collapse them into a single "talent index." Useful, but you can no longer point to individual people.

The danger of selecting on the full dataset

If you compute target correlations on all your data — including the test set — and then select features, you have leaked the test set's relationship with the target into your feature selection step. Your model will look better than it really is. Always run feature selection only on training data, then apply the same column mask to validation and test sets.
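
In code, the leak-free pattern looks like this — a minimal sketch on made-up data, with a naive head/tail split standing in for a proper train/test split:

```python
import numpy as np
import pandas as pd

np.random.seed(3)
df = pd.DataFrame(np.random.normal(size=(200, 5)),
                  columns=[f'f{i}' for i in range(5)])
df['target'] = 3 * df['f0'] + np.random.normal(0, 1, 200)

train, test = df.iloc[:150], df.iloc[150:]

# Compute selection statistics on the TRAINING rows only
corr = train.drop(columns='target').corrwith(train['target']).abs()
keep = corr[corr >= 0.10].index.tolist()

# Then apply the same column mask to every split — the test set
# never influences which columns are chosen
X_train, X_test = train[keep], test[keep]
print("Selected on train only:", keep)
```

The key line is that `keep` is derived from `train` alone; `test` is only ever indexed with a decision made elsewhere.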

Teacher's Note

The three methods covered in this lesson — variance filtering, correlation deduplication, and target correlation ranking — are the quick preliminary pass. They are not a substitute for the more rigorous filter, wrapper, and embedded methods in the next three lessons. Think of them as the pre-screening step you run in the first five minutes to eliminate the obvious dead weight. Then you bring in the heavier tools. Skipping this preliminary pass and going straight to a wrapper method on a 200-feature dataset is a common and expensive mistake — you'll spend hours evaluating feature subsets that include constant columns and near-identical duplicates that should have been removed in thirty seconds.

Practice Questions

1. A column where every single row has the same value will have a ___ of zero. (one word)



2. Which pandas method computes the correlation between every column in a DataFrame and a single target Series in one call?



3. To avoid data leakage, feature selection must be performed using only which data split? (one word)



Quiz

1. A colleague says "PCA and feature selection are the same thing — both reduce the number of columns." What is the key difference?


2. Pearson correlation is used to rank features by their relationship with the target. What is its main limitation as a selection criterion?


3. Which family of feature selection methods uses a model itself to evaluate subsets of features, making it more accurate but computationally expensive?


Up Next · Lesson 25

Filter Methods

Chi-squared, mutual information, ANOVA F-test — the statistical toolkit for scoring every feature independently before a single model trains.