Feature Selection
In the previous lesson, we studied Principal Component Analysis and learned how to reduce dimensions by transforming features into new combinations. PCA helped us simplify the dataset, but it also changed the meaning of features.
In many real-world machine learning projects, we want a different approach. Instead of creating new features, we want to keep the original ones and simply choose the most important among them.
This approach is called Feature Selection.
What Feature Selection Really Means
Feature selection is the process of identifying which input features actually contribute to model performance.
Some features provide strong signals. Some features add noise. Some features repeat information already present elsewhere.
Feature selection removes unnecessary features while preserving the original meaning of the data.
This makes models easier to interpret, faster to train, and often more accurate.
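As a concrete illustration, here is a minimal sketch on a tiny made-up table (the column names are invented for this example, not taken from our dataset). It uses scikit-learn's VarianceThreshold to drop a column that never changes and therefore carries no signal.
# A minimal sketch: dropping a feature that never changes.
# The table and column names below are invented for illustration.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "income": [52000, 61000, 47000, 83000],
    "credit_score": [640, 710, 590, 760],
    "branch_code": [7, 7, 7, 7],   # constant, so it carries no information
})

selector = VarianceThreshold(threshold=0.0)  # remove zero-variance columns
selector.fit(toy)
kept = toy.columns[selector.get_support()]
print(list(kept))  # the constant 'branch_code' column is dropped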
Feature Selection vs Dimensionality Reduction
Dimensionality reduction creates new features by combining existing ones.
Feature selection does not create anything new. It simply decides what to keep and what to remove.
In regulated industries like banking and healthcare, feature selection is often preferred because the model remains explainable.
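To make the contrast concrete, here is a small sketch on toy numeric data (the values and column names are illustrative only): PCA returns unnamed combinations of the inputs, while selection simply keeps a subset of the original, named columns.
# A small sketch of the difference, on toy numeric data.
import pandas as pd
from sklearn.decomposition import PCA

toy = pd.DataFrame({
    "income": [52, 61, 47, 83],
    "credit_score": [640, 710, 590, 760],
    "age": [34, 45, 29, 51],
})

# Dimensionality reduction: PCA mixes all three columns into new components.
components = PCA(n_components=2).fit_transform(toy)
print(components.shape)      # (4, 2) -- the columns are no longer original features

# Feature selection: keep a subset of the original, named columns.
subset = toy[["income", "credit_score"]]
print(list(subset.columns))  # the kept columns retain their original meaning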
Using Our Dataset
We continue working with the same dataset used throughout this module.
Dataplexa ML Housing & Customer Dataset
Our objective is still to predict loan approval, but now we want to understand which features matter most.
Preparing the Data
For feature selection, we keep the original feature values and the target variable.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset and separate the inputs from the target.
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
X = df.drop("loan_approved", axis=1)   # all candidate features
y = df["loan_approved"]                # what we want to predict

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
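One practical caveat, stated here as an assumption rather than a known property of this file: if the CSV contains text-valued (categorical) columns, the tree model used below cannot consume them directly, and a simple encoding step is needed first. A minimal sketch:
# A hedged sketch, assuming some columns in X contain text categories.
# One-hot encode them, then align train and test to the same encoded columns.
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)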
Understanding Feature Importance Intuitively
Imagine a loan officer reviewing applications. Income and credit score may strongly influence decisions.
Other details, such as an attribute that barely changes from one applicant to the next, may not affect approval at all.
Feature selection tries to mathematically identify which attributes influence predictions the most.
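One simple way to put numbers on this intuition, before training any model, is a univariate score such as mutual information between each feature and the target. A minimal sketch on a tiny made-up table (the column names are invented for illustration):
# A minimal sketch: scoring each feature's relationship with the target
# using mutual information. The tiny table is invented for illustration.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

toy_X = pd.DataFrame({
    "income": [52, 61, 47, 83, 38, 75, 44, 90],
    "credit_score": [640, 710, 590, 760, 560, 730, 610, 780],
    "favourite_colour_code": [1, 3, 2, 1, 3, 2, 1, 2],  # unlikely to matter
})
toy_y = [0, 1, 0, 1, 0, 1, 0, 1]

scores = mutual_info_classif(toy_X, toy_y, random_state=42)
for name, score in zip(toy_X.columns, scores):
    print(f"{name}: {score:.3f}")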
Selecting Features Using a Model
One common way to select features is by training a model that naturally measures importance.
Decision Trees and Random Forests are often used for this purpose.
from sklearn.ensemble import RandomForestClassifier

# Train a forest, then read the importance score it assigns to each feature.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

importances = model.feature_importances_

# Pair each column name with its score and sort from most to least important.
feature_importance_df = pd.DataFrame({
    "feature": X_train.columns,
    "importance": importances
}).sort_values(by="importance", ascending=False)

feature_importance_df
The output shows which features contribute most to predictions.
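scikit-learn can also automate the keep-or-drop decision. As an alternative sketch, SelectFromModel wraps a forest and keeps the features whose importance clears a threshold; the "median" threshold used here is an illustrative assumption, not part of the lesson.
# An alternative sketch: let SelectFromModel decide which features to keep.
# "median" keeps the features whose importance is above the median importance;
# that threshold is an illustrative choice.
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    RandomForestClassifier(random_state=42), threshold="median"
)
selector.fit(X_train, y_train)
selected_columns = X_train.columns[selector.get_support()]
print(list(selected_columns))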
Reducing the Feature Set
Once importance is calculated, we can keep only the most influential features.
This reduces noise and improves model clarity.
# Keep only the five highest-ranked features.
top_features = feature_importance_df["feature"].head(5).tolist()

X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]
At this point, the dataset is simpler but still meaningful and interpretable.
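To check that the smaller feature set still predicts well, we can train the same kind of model twice and compare accuracy. The sketch below reuses the variables defined earlier in this lesson; the actual numbers depend on the data, so none are claimed here.
# A sketch comparing the same model with all features and with the top five.
from sklearn.metrics import accuracy_score

full_model = RandomForestClassifier(random_state=42)
full_model.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_model.predict(X_test))

selected_model = RandomForestClassifier(random_state=42)
selected_model.fit(X_train_selected, y_train)
selected_acc = accuracy_score(y_test, selected_model.predict(X_test_selected))

print(f"Accuracy with all features: {full_acc:.3f}")
print(f"Accuracy with top 5 features: {selected_acc:.3f}")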
Real-World Interpretation
Banks often discover that only a handful of variables drive most loan approval decisions.
Feature selection helps reduce operational complexity and makes models easier to audit and explain.
Mini Practice
Try selecting only the top three features and training a simple classifier on them.
Observe how its accuracy compares to using all of the features.
Exercises
Exercise 1:
Why is feature selection important for explainability?
Exercise 2:
Does feature selection always increase accuracy?
Quick Quiz
Q1. Is feature selection a supervised technique?
In the next lesson, we will study Feature Engineering and learn how to create better features from raw data.