ML Lesson 29 – Feature Selection

Feature Selection

In the previous lesson, we studied Principal Component Analysis and learned how to reduce dimensions by transforming features into new combinations. PCA helped us simplify the dataset, but it also changed the meaning of the features.

In many real-world machine learning projects, we want a different approach. Instead of creating new features, we want to keep the original ones and simply choose the most important among them.

This approach is called Feature Selection.


What Feature Selection Really Means

Feature selection is the process of identifying which input features actually contribute to model performance.

Some features provide strong signals. Some features add noise. Some features repeat information already present elsewhere.

Feature selection removes unnecessary features while preserving the original meaning of the data.

This makes models easier to interpret, faster to train, and often more accurate.


Feature Selection vs Dimensionality Reduction

Dimensionality reduction creates new features by combining existing ones.

Feature selection does not create anything new. It simply decides what to keep and what to remove.

In regulated industries like banking and healthcare, feature selection is often preferred because the model remains explainable.


Using Our Dataset

We continue working with the same dataset used throughout this module.

Dataplexa ML Housing & Customer Dataset

Our objective is still to predict loan approval, but now we want to understand which features matter most.


Preparing the Data

For feature selection, we keep the original feature values and the target variable.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset used throughout this module
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Separate the input features from the target variable
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Understanding Feature Importance Intuitively

Imagine a loan officer reviewing applications. Income and credit score may strongly influence decisions.

Other details, such as an attribute that barely varies from one applicant to the next, may not influence approval at all.

Feature selection tries to mathematically identify which attributes influence predictions the most.
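
One simple way to quantify this intuition is a univariate score such as mutual information, which measures how much knowing a feature tells us about the target. Below is a minimal sketch, assuming the features in X_train are numeric:

from sklearn.feature_selection import mutual_info_classif

# Score each feature by how much information it shares with the target
mi_scores = mutual_info_classif(X_train, y_train, random_state=42)

mi_df = pd.DataFrame({
    "feature": X_train.columns,
    "mutual_information": mi_scores
}).sort_values(by="mutual_information", ascending=False)

mi_df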


Selecting Features Using a Model

One common way to select features is by training a model that naturally measures importance.

Decision Trees and Random Forests are often used for this purpose.

from sklearn.ensemble import RandomForestClassifier

# Fit a forest; its trees record how useful each feature is for splitting
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Importance scores sum to 1 across all features
importances = model.feature_importances_

# Rank the features from most to least important
feature_importance_df = pd.DataFrame({
    "feature": X.columns,
    "importance": importances
}).sort_values(by="importance", ascending=False)

feature_importance_df

The output shows which features contribute most to predictions.
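
scikit-learn also packages this pattern as SelectFromModel. As a brief sketch, reusing the forest fitted above with its default threshold (the mean importance), we can let the selector pick the columns for us:

from sklearn.feature_selection import SelectFromModel

# Wrap the already-fitted forest; features above the mean importance are kept
selector = SelectFromModel(model, prefit=True)

selected_columns = X.columns[selector.get_support()]
print(selected_columns.tolist())

Calling selector.transform(X_train) would return the reduced array directly; we keep the column names explicit here for readability.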


Reducing the Feature Set

Once importance is calculated, we can keep only the most influential features.

This reduces noise and improves model clarity.

# Keep the names of the five most important features
top_features = feature_importance_df["feature"].head(5).tolist()

# Restrict both splits to those columns
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]

At this point, the dataset is simpler but still meaningful and interpretable.
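
To verify that the smaller feature set still predicts well, we can retrain the same model on the selected columns and compare test accuracy. A minimal sketch, reusing the objects defined above:

from sklearn.metrics import accuracy_score

# Retrain on the reduced feature set
model_selected = RandomForestClassifier(random_state=42)
model_selected.fit(X_train_selected, y_train)

# Evaluate on held-out data using only the selected columns
preds = model_selected.predict(X_test_selected)
print("Accuracy with top 5 features:", accuracy_score(y_test, preds))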


Real-World Interpretation

Banks often discover that only a handful of variables drive most loan approval decisions.

Feature selection helps reduce operational complexity and makes models easier to audit and explain.


Mini Practice

Try selecting only the top three features and training a simple classifier on them.

Observe how accuracy compares to using all features.


Exercises

Exercise 1:
Why is feature selection important for explainability?

Because the model keeps its original features, each prediction can be traced back to attributes that stakeholders already understand.

Exercise 2:
Does feature selection always increase accuracy?

No. Removing useful features can reduce performance if done incorrectly.

Quick Quiz

Q1. Is feature selection a supervised technique?

Usually, yes. Most methods use the target variable to evaluate feature relevance, although some simple filters (such as removing near-constant features) do not need the target at all.

In the next lesson, we will study Feature Engineering and learn how to create better features from raw data.