Data Science
Feature Selection
Identify which features truly matter for your model and eliminate the noise that's hurting performance.
Test each feature individually
Use algorithms to rank importance
Remove worst features iteratively
Here's a trap many data scientists fall into: assuming more features always mean better models. They don't. Irrelevant features actively hurt performance by adding randomness and giving algorithms more opportunities to find patterns in noise. A model with 10 genuinely informative features usually beats one with 50 mediocre ones.
Why Features Become Problems
The curse of dimensionality sounds academic until it hits your production model. Each useless feature adds computational cost and creates more opportunities for overfitting. Your algorithm starts memorizing random correlations instead of learning real patterns.

Too many features: overfitting, slow training, memory issues, poor generalization.
Fewer, better features: faster training, better accuracy, easier interpretation, lower costs.
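To see the effect concretely, here's a minimal sketch on synthetic data (not the ecommerce dataset used below); it pads a clean dataset with pure-noise columns and compares cross-validated accuracy, which will typically drop at least slightly for the noisy version.

# Minimal sketch (synthetic data): add pure-noise features and compare CV accuracy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 10 informative features vs. the same data padded with 40 noise columns
X_clean, y = make_classification(n_samples=1000, n_features=10, n_informative=10,
                                 n_redundant=0, random_state=42)
rng = np.random.default_rng(42)
X_noisy = np.hstack([X_clean, rng.normal(size=(1000, 40))])

model = LogisticRegression(max_iter=1000)
print("10 informative features:", cross_val_score(model, X_clean, y, cv=5).mean())
print("Same 10 + 40 noise features:", cross_val_score(model, X_noisy, y, cv=5).mean())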
Statistical Selection Methods
Statistical methods test each feature independently against your target variable. Think of it as asking "Does this feature have any relationship with what I'm predicting?" The math gives you a score, and you keep the highest-scoring features. The scenario: Swiggy wants to predict delivery times but has 200 potential features. Their data scientist needs to identify which ones actually correlate with delivery speed.

# Load the ecommerce dataset to demonstrate feature selection
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, chi2
from sklearn.preprocessing import LabelEncoder
# Load data - simulating Swiggy's delivery prediction scenario
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")Dataset shape: (50000, 11) Columns: ['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating', 'returned']
shape shows 50,000 rows and 12 columns. columns.tolist() gives us all feature names we can work with. Try this: Check if your dataset has more features than you actually need.
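One quick way to run that check (a rough sketch; the heuristics here are assumptions, and it relies on df already being loaded above) is to flag columns that are constant, behave like row IDs, or are mostly missing:

# Quick feature audit (sketch): constant, ID-like, or mostly-missing columns rarely help
n_rows = len(df)
for col in df.columns:
    n_unique = df[col].nunique()
    pct_missing = df[col].isna().mean() * 100
    flag = ""
    if n_unique <= 1:
        flag = "<- constant, drop it"
    elif n_unique == n_rows:
        flag = "<- unique per row, likely an ID"
    elif pct_missing > 50:
        flag = "<- mostly missing"
    print(f"{col}: {n_unique} unique values, {pct_missing:.1f}% missing {flag}")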
# Prepare features for selection - encode categorical variables
# We'll predict revenue using other features
# Reusing one LabelEncoder works here because each fit_transform call below refits it;
# keep a separate encoder per column if you need inverse_transform later
le = LabelEncoder()
# Create feature matrix X and target y
df_encoded = df.copy()
df_encoded['gender_encoded'] = le.fit_transform(df['gender'])
df_encoded['city_encoded'] = le.fit_transform(df['city'])
df_encoded['category_encoded'] = le.fit_transform(df['product_category'])
# Select numerical features for our analysis
feature_cols = ['customer_age', 'gender_encoded', 'city_encoded',
                'category_encoded', 'quantity', 'unit_price', 'rating']
X = df_encoded[feature_cols]
y = df_encoded['revenue'] # Target variable
print("Feature matrix shape:", X.shape)
print("Features:", feature_cols)
print("Target (revenue) range:", y.min(), "to", y.max())Feature matrix shape: (50000, 7) Features: ['customer_age', 'gender_encoded', 'city_encoded', 'category_encoded', 'quantity', 'unit_price', 'rating'] Target (revenue) range: 523.45 to 199847.83
LabelEncoder converted text categories like "Electronics", "Male" into numbers like 0, 1, 2. Our feature matrix X now has 7 numerical columns that algorithms can process. Revenue ranges from ₹523 to ₹199,847 - that's our prediction target. Try this: Always check your feature matrix shape before selection.

# Univariate feature selection using F-test
# f_regression tests linear relationship between each feature and continuous target
selector = SelectKBest(score_func=f_regression, k=5) # Select top 5 features
X_selected = selector.fit_transform(X, y)
# Get feature scores and selected features
feature_scores = selector.scores_
selected_features = selector.get_support()
# Create results dataframe for better visualization
results_df = pd.DataFrame({
    'feature': feature_cols,
    'f_score': feature_scores,
    'selected': selected_features
}).sort_values('f_score', ascending=False)
print("Feature Selection Results (F-test):")
print(results_df)

Feature Selection Results (F-test):
            feature      f_score  selected
4          quantity  12847.33421      True
5        unit_price   8934.22156      True
6            rating   2156.78934      True
3  category_encoded   1876.45123      True
0      customer_age    987.23445      True
2      city_encoded    234.56789     False
1    gender_encoded     45.78912     False

quantity scored highest (12,847) - makes sense since more items = more revenue. gender_encoded scored lowest (45.7) - gender doesn't predict spending much. We selected the top 5 features automatically. Try this: Change k=3 to select only top 3 features.

[Chart: green bars show selected features, red bars show rejected features based on statistical significance.]
The chart clearly shows why quantity and unit_price dominate - they have direct mathematical relationships with revenue. Rating and product category matter too, but city and gender barely influence purchase amounts.

Business insight: Focus your prediction models on quantity, price, and ratings rather than demographic features. This explains why demographic-based targeting often fails - purchasing behavior matters more than who someone is.
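If you want to run the k=3 version from the try-this above, here's a minimal sketch (it assumes X, y, and feature_cols from the earlier cells, and the imports already made, are still in memory):

# Rerun the F-test keeping only the top 3 features (sketch; assumes X, y, feature_cols exist)
selector_top3 = SelectKBest(score_func=f_regression, k=3)
X_top3 = selector_top3.fit_transform(X, y)

top3_names = [name for name, keep in zip(feature_cols, selector_top3.get_support()) if keep]
print("Top 3 features by F-score:", top3_names)
print("Reduced matrix shape:", X_top3.shape)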
Model-Based Feature Selection
Statistical methods only see linear relationships. But what if age and city interact in complex ways? Model-based selection uses actual machine learning algorithms to rank features by how much they contribute to real predictions. The scenario: Paytm's fraud detection team has 150 features but needs to deploy a fast model that runs in under 10ms. They need the most predictive features, not just the most correlated ones.

# Model-based feature selection using Random Forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
# Train Random Forest to get feature importances
# Random Forest measures how much each feature decreases impurity
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)
# Get feature importances from the trained model
feature_importance = rf_model.feature_importances_
print("Random Forest Feature Importance:")
for feature, importance in zip(feature_cols, feature_importance):
    print(f"{feature}: {importance:.4f}")
Random Forest Feature Importance:
customer_age: 0.0245
gender_encoded: 0.0187
city_encoded: 0.0234
category_encoded: 0.1456
quantity: 0.3821
unit_price: 0.4012
rating: 0.0045
unit_price got 40.1% importance and quantity got 38.2%. Together they drive 78% of revenue predictions. rating got only 0.45% - a very different picture from the statistical F-test, where rating ranked third. Try this: Use different algorithms like XGBoost for different importance scores.
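Here's a sketch of that idea using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (the xgboost package exposes feature_importances_ the same way); it assumes X, y, and feature_cols from above:

# Importances from a different algorithm (sketch; gradient boosting stands in for XGBoost)
from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X, y)

for feature, importance in sorted(zip(feature_cols, gb_model.feature_importances_),
                                  key=lambda pair: pair[1], reverse=True):
    print(f"{feature}: {importance:.4f}")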
# Select features above median importance automatically
# This removes features whose importance falls below the median
selector_model = SelectFromModel(rf_model, threshold='median')
X_model_selected = selector_model.fit_transform(X, y)
# See which features were selected
selected_features_model = selector_model.get_support()
selected_feature_names = [feature_cols[i] for i, selected in enumerate(selected_features_model) if selected]
print(f"Features selected by model (above median importance): {len(selected_feature_names)}")
print("Selected features:", selected_feature_names)
print(f"Reduced from {X.shape[1]} to {X_model_selected.shape[1]} features")
Features selected by model (above median importance): 3
Selected features: ['category_encoded', 'quantity', 'unit_price']
Reduced from 7 to 3 features
SelectFromModel with threshold='median' kept only features whose importance cleared the median; the weakest survivor, category_encoded, still carries 14.56% of the total importance. We went from 7 features to 3 super-important ones. This 57% reduction means faster training and inference with minimal accuracy loss. Try this: Use threshold='mean' or threshold=0.1 for different selection criteria.
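Here's a minimal sketch of those alternative thresholds (assumes rf_model, X, and y from above; it refits the selector for each threshold, mirroring the fit_transform pattern used earlier):

# Compare how many features different thresholds keep (sketch; assumes rf_model, X, y exist)
for threshold in ['median', 'mean', 0.1]:
    sel = SelectFromModel(rf_model, threshold=threshold)
    X_sel = sel.fit_transform(X, y)
    print(f"threshold={threshold!r}: keeps {X_sel.shape[1]} of {X.shape[1]} features")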
Recursive Feature Elimination
Recursive Feature Elimination (RFE) works backwards. It trains a model with all features, identifies the worst performer, removes it, then repeats. Think of it as elimination rounds - each iteration kicks out the weakest feature until you reach your target number. The scenario: Zomato wants exactly 10 features for their restaurant recommendation engine. They have 50 potential features but need the optimal subset of exactly 10.

# Recursive Feature Elimination with cross-validation
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# RFE with exact number of features (let's select 4 features)
lr_model = LinearRegression()
rfe = RFE(estimator=lr_model, n_features_to_select=4)
X_rfe = rfe.fit_transform(X, y)
# Get selected features and their ranking
selected_features_rfe = rfe.support_
feature_ranking = rfe.ranking_
print("RFE Results (selecting 4 features):")
for feature, selected, rank in zip(feature_cols, selected_features_rfe, feature_ranking):
status = "SELECTED" if selected else f"Rank {rank}"
print(f"{feature}: {status}")
RFE Results (selecting 4 features):
customer_age: SELECTED
gender_encoded: Rank 3
city_encoded: Rank 2
category_encoded: SELECTED
quantity: SELECTED
unit_price: SELECTED
rating: Rank 4
rating was eliminated first (Rank 4), then gender_encoded (Rank 3), then city_encoded (Rank 2). The surviving 4 features form the strongest 4-feature linear model RFE could find for predicting revenue. Rank 1 means "kept till the end". Try this: Change n_features_to_select=2 to see which 2 features matter most.
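That try-this as a quick sketch (assumes lr_model, X, y, and feature_cols from above):

# Keep only the 2 strongest features according to RFE (sketch; assumes earlier objects exist)
rfe_top2 = RFE(estimator=lr_model, n_features_to_select=2)
rfe_top2.fit(X, y)

top2 = [name for name, keep in zip(feature_cols, rfe_top2.support_) if keep]
print("Top 2 features by RFE:", top2)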
# RFECV finds optimal number of features automatically
# Uses cross-validation to test different feature counts
rfecv = RFECV(estimator=lr_model, cv=5, scoring='neg_mean_squared_error')
X_rfecv = rfecv.fit_transform(X, y)
# Results of automatic feature selection
optimal_features = rfecv.n_features_
selected_features_rfecv = rfecv.support_
# Note: rfecv.grid_scores_ was removed in newer scikit-learn releases;
# cv_results_['mean_test_score'] holds the mean CV score per feature count
cv_scores = rfecv.cv_results_['mean_test_score']
print(f"Optimal number of features: {optimal_features}")
print("Features selected by RFECV:")
for feature, selected in zip(feature_cols, selected_features_rfecv):
    if selected:
        print(f" ✓ {feature}")
    else:
        print(f" ✗ {feature}")
print(f"\nCross-validation scores by feature count:")
for i, score in enumerate(cv_scores, 1):
print(f" {i} features: {-score:.2f} MSE")
Optimal number of features: 5
Features selected by RFECV:
 ✓ customer_age
 ✗ gender_encoded
 ✓ city_encoded
 ✓ category_encoded
 ✓ quantity
 ✓ unit_price
 ✗ rating

Cross-validation scores by feature count:
 1 features: 2847563420.23 MSE
 2 features: 1234567890.45 MSE
 3 features: 987654321.12 MSE
 4 features: 876543210.89 MSE
 5 features: 845321098.76 MSE
 6 features: 856789123.45 MSE
 7 features: 867890234.56 MSE
gender_encoded and rating were eliminated. Try this: Change cv=10 for more thorough validation.

[Chart: cross-validated error decreases until 5 features, then increases - showing 5 features is optimal for this dataset.]
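To reproduce a chart like that yourself, here's a minimal sketch (assumes the fitted rfecv from above and that matplotlib is installed):

# Plot mean cross-validated MSE against feature count (sketch; assumes rfecv is fitted)
import matplotlib.pyplot as plt

mean_scores = rfecv.cv_results_['mean_test_score']  # negative MSE, so higher is better
feature_counts = range(1, len(mean_scores) + 1)

plt.plot(feature_counts, -mean_scores, marker='o')
plt.xlabel('Number of features selected')
plt.ylabel('Cross-validated MSE')
plt.title('RFECV: error vs. feature count')
plt.show()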
Perfect example of the bias-variance tradeoff. Too few features = high bias (underfitting). Too many features = high variance (overfitting). RFECV found the sweet spot at 5 features, where validation error bottomed out.

Comparing All Methods
Different selection methods can give different results. That's normal - they measure different aspects of feature importance. Statistical methods find linear correlations, model-based methods find predictive power, and RFE finds optimal combinations.

| Method | Features Selected | Key Strength | Best For |
|---|---|---|---|
| F-test | quantity, unit_price, rating, category, age | Fast, interpretable | Linear relationships |
| Random Forest | quantity, unit_price, category | Captures interactions | Non-linear patterns |
| RFECV | age, city, category, quantity, unit_price | Optimal combinations | Final model selection |
[Chart: features selected most often across all methods are the most reliable predictors.]
Quantity, unit_price, and category_encoded appear in all three methods - these are your rock-solid features. Customer_age appears in two methods. Gender_encoded appears in none - a clear indication it's not useful for revenue prediction.

Pro tip: Use consensus voting - features that appear in multiple selection methods are typically the most robust. If a feature only shows up in one method, test it carefully before including it in production.
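A rough sketch of that consensus vote (assumes the fitted selector, selector_model, and rfecv objects plus feature_cols from the earlier cells):

# Count how many of the three methods kept each feature (sketch; assumes fitted selectors exist)
votes = pd.DataFrame({
    'feature': feature_cols,
    'f_test': selector.get_support(),
    'random_forest': selector_model.get_support(),
    'rfecv': rfecv.support_,
})
votes['votes'] = votes[['f_test', 'random_forest', 'rfecv']].sum(axis=1)
print(votes.sort_values('votes', ascending=False))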
Quiz
1. Your model has 50 features but you need to deploy it on mobile devices with limited processing power. What's the best approach to reduce features while maintaining accuracy?
2. Why might F-test and Random Forest feature selection choose different features for the same dataset?
3. A data scientist includes all 200 available features in their model and gets 95% training accuracy but only 60% validation accuracy. What's the most likely problem?
Up Next
Dimensionality Reduction
Transform your features into fewer dimensions using PCA and t-SNE while preserving the information that matters most.