Feature Selection
Feature Selection is the process of choosing the most important input features for a machine learning model and removing unnecessary or irrelevant ones. While feature engineering creates better features, feature selection decides which features actually matter.
Using too many features can confuse a model, slow training, and reduce accuracy. A smaller, well-chosen feature set often performs better than a large noisy one.
Why Feature Selection Is Important
Not all features contribute equally to predictions. Some add noise, some duplicate information that other features already carry, and some have no relationship with the target at all. Removing them:
- Improves model accuracy
- Reduces overfitting
- Speeds up training
- Makes models easier to understand
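As a concrete case of the "no relationship" problem, a feature that takes the same value in every row carries no information, and scikit-learn's VarianceThreshold can drop such features automatically. A minimal sketch (the column names here are made up for illustration):

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# 'team_size' never varies, so no model can learn anything from it
df = pd.DataFrame({
    'experience': [1, 2, 3, 4, 5],
    'team_size': [4, 4, 4, 4, 4]
})

# the default threshold of 0.0 removes only constant features
selector = VarianceThreshold()
reduced = selector.fit_transform(df)
print(selector.get_support())  # [ True False] -> 'team_size' is dropped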
Real-World Connection
Imagine predicting employee salary. Useful features may include experience and skills, while irrelevant features like employee ID or email address add no value. Feature selection removes such useless inputs.
Types of Feature Selection Methods
- Filter methods
- Wrapper methods
- Embedded methods
Filter Method: Correlation
Filter methods evaluate features using statistical measures. Correlation shows how strongly a feature is related to the target variable.
import pandas as pd

# Small illustrative dataset: years of experience, years of
# education, and salary (in thousands)
data = pd.DataFrame({
    'experience': [1, 2, 3, 4, 5],
    'education': [10, 12, 14, 16, 18],
    'salary': [30, 40, 50, 60, 70]
})

# corr() computes the pairwise correlation between all columns
correlation = data.corr()
print(correlation)
Correlation values close to +1 or -1 indicate strong relationships, while values near 0 indicate weak ones. Features whose correlation with the target is very low can often be removed.
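One simple way to act on this, sketched below with an arbitrary cutoff of 0.2, is to keep only the features whose absolute correlation with the target clears a threshold:

# absolute correlation of each feature with the target column
target_corr = data.corr()['salary'].drop('salary').abs()

# keep features above an illustrative cutoff of 0.2
selected = target_corr[target_corr > 0.2].index.tolist()
print(selected)  # ['experience', 'education'] for this toy data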
Wrapper Method: Recursive Feature Elimination (RFE)
Wrapper methods use a machine learning model to evaluate feature importance. Recursive Feature Elimination repeatedly removes the weakest features.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X, y = load_iris(return_X_y=True)

# RFE repeatedly fits the model and drops the weakest feature
# until only n_features_to_select remain
model = LogisticRegression(max_iter=200)
rfe = RFE(model, n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask: True = feature kept
The output is a boolean mask over the original features: True marks a selected feature, False a discarded one. Only the most useful ones remain.
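To map that mask back to readable feature names and actually shrink the data, the fitted selector's transform method can be used, as in this short sketch:

from sklearn.datasets import load_iris

# pair each feature name with its True/False entry in the mask
names = load_iris().feature_names
print([n for n, keep in zip(names, rfe.support_) if keep])

# transform() keeps only the selected columns
print(rfe.transform(X).shape)  # (150, 2)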
Embedded Method: Feature Importance
Embedded methods perform feature selection during model training. Tree-based models automatically rank features by importance.
from sklearn.ensemble import RandomForestClassifier

# X, y are the iris data loaded in the RFE example above
model = RandomForestClassifier(random_state=0)
model.fit(X, y)

# importances sum to 1; higher means more useful to the trees
print(model.feature_importances_)
Higher values indicate more important features. Features with very low importance can usually be dropped with little effect on accuracy.
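scikit-learn packages this idea as SelectFromModel, which trains the estimator and keeps only features whose importance reaches a threshold (by default, the mean importance). A minimal sketch:

from sklearn.feature_selection import SelectFromModel

# fits the forest internally and keeps features whose importance
# is at least the mean importance (the default threshold)
selector = SelectFromModel(RandomForestClassifier(random_state=0))
X_selected = selector.fit_transform(X, y)
print(selector.get_support())
print(X_selected.shape)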
Feature Selection vs Feature Engineering
- Feature engineering creates new features
- Feature selection removes unnecessary features
- Both work together to improve model performance, as the sketch below shows
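A tiny end-to-end sketch of the two together (the column names and values are invented for illustration): engineer a new ratio feature, then rank everything by correlation with the target to decide what to keep:

import pandas as pd

df = pd.DataFrame({
    'hours': [10, 20, 30, 40],
    'tasks_done': [5, 12, 18, 22],
    'salary': [30, 45, 55, 65]
})

# feature engineering: create a productivity ratio
df['tasks_per_hour'] = df['tasks_done'] / df['hours']

# feature selection: rank features by absolute correlation with the target
print(df.corr()['salary'].drop('salary').abs().sort_values(ascending=False))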
Practice Questions
Practice 1: What process removes irrelevant features?
Practice 2: Which metric measures relationship strength?
Practice 3: Which method removes weakest features recursively?
Quick Quiz
Quiz 1: Removing irrelevant features helps reduce?
Quiz 2: RFE belongs to which feature selection type?
Quiz 3: Tree models select features based on?
Coming up next: Model Evaluation Metrics — measuring how good your model really is.