Data Science
Feature Selection
Identify which features truly matter for your model and eliminate the noise that's hurting performance.
Test each feature individually
Use algorithms to rank importance
Remove worst features iteratively
Here's a trap many data scientists fall into: assuming more features always mean better models. They don't. Irrelevant features actively hurt performance by adding randomness and giving algorithms more opportunities to find patterns in noise. A model with 10 genuinely informative features usually beats one with 50 mediocre ones.
Why Features Become Problems
The curse of dimensionality sounds academic until it hits your production model. Each useless feature adds computational cost and creates more opportunities for overfitting. Your algorithm starts memorizing random correlations instead of learning real patterns.

Too many features: overfitting, slow training, memory issues, poor generalization.
Fewer, better features: faster training, better accuracy, easier interpretation, lower costs.
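To see the effect concretely, here's a minimal sketch on synthetic data (not the ecommerce dataset used below); it pads a clean dataset with pure-noise columns and compares cross-validated accuracy, which will typically drop at least slightly for the noisy version.

# Minimal sketch (synthetic data): add pure-noise features and compare CV accuracy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 10 informative features vs. the same data padded with 40 noise columns
X_clean, y = make_classification(n_samples=1000, n_features=10, n_informative=10,
                                 n_redundant=0, random_state=42)
rng = np.random.default_rng(42)
X_noisy = np.hstack([X_clean, rng.normal(size=(1000, 40))])

model = LogisticRegression(max_iter=1000)
print("10 informative features:", cross_val_score(model, X_clean, y, cv=5).mean())
print("Same 10 + 40 noise features:", cross_val_score(model, X_noisy, y, cv=5).mean())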
Statistical Selection Methods
Statistical methods test each feature independently against your target variable. Think of it as asking "Does this feature have any relationship with what I'm predicting?" The math gives you a score, and you keep the highest-scoring features. The scenario: Swiggy wants to predict delivery times but has 200 potential features. Their data scientist needs to identify which ones actually correlate with delivery speed.

# Load the ecommerce dataset to demonstrate feature selection
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, chi2
from sklearn.preprocessing import LabelEncoder
# Load data - simulating Swiggy's delivery prediction scenario
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")Dataset shape: (50000, 11) Columns: ['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating', 'returned']
shape shows 50,000 rows and 12 columns. columns.tolist() gives us all feature names we can work with. Try this: Check if your dataset has more features than you actually need.
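One quick way to run that check (a rough sketch; the heuristics here are assumptions, and it relies on df already being loaded above) is to flag columns that are constant, behave like row IDs, or are mostly missing:

# Quick feature audit (sketch): constant, ID-like, or mostly-missing columns rarely help
n_rows = len(df)
for col in df.columns:
    n_unique = df[col].nunique()
    pct_missing = df[col].isna().mean() * 100
    flag = ""
    if n_unique <= 1:
        flag = "<- constant, drop it"
    elif n_unique == n_rows:
        flag = "<- unique per row, likely an ID"
    elif pct_missing > 50:
        flag = "<- mostly missing"
    print(f"{col}: {n_unique} unique values, {pct_missing:.1f}% missing {flag}")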
# Prepare features for selection - encode categorical variables
# We'll predict revenue using other features
# Reusing one LabelEncoder works here because each fit_transform call below refits it;
# keep a separate encoder per column if you need inverse_transform later
le = LabelEncoder()
# Create feature matrix X and target y
df_encoded = df.copy()
df_encoded['gender_encoded'] = le.fit_transform(df['gender'])
df_encoded['city_encoded'] = le.fit_transform(df['city'])
df_encoded['category_encoded'] = le.fit_transform(df['product_category'])
# Select numerical features for our analysis
feature_cols = ['customer_age', 'gender_encoded', 'city_encoded',
                'category_encoded', 'quantity', 'unit_price', 'rating']
X = df_encoded[feature_cols]
y = df_encoded['revenue'] # Target variable
print("Feature matrix shape:", X.shape)
print("Features:", feature_cols)
print("Target (revenue) range:", y.min(), "to", y.max())Feature matrix shape: (50000, 7) Features: ['customer_age', 'gender_encoded', 'city_encoded', 'category_encoded', 'quantity', 'unit_price', 'rating'] Target (revenue) range: 523.45 to 199847.83
LabelEncoder converted text categories like "Electronics", "Male" into numbers like 0, 1, 2. Our feature matrix X now has 7 numerical columns that algorithms can process. Revenue ranges from ₹523 to ₹199,847 - that's our prediction target. Try this: Always check your feature matrix shape before selection.

# Univariate feature selection using F-test
# f_regression tests linear relationship between each feature and continuous target
selector = SelectKBest(score_func=f_regression, k=5) # Select top 5 features
X_selected = selector.fit_transform(X, y)
# Get feature scores and selected features
feature_scores = selector.scores_
selected_features = selector.get_support()
# Create results dataframe for better visualization
results_df = pd.DataFrame({
    'feature': feature_cols,
    'f_score': feature_scores,
    'selected': selected_features
}).sort_values('f_score', ascending=False)
print("Feature Selection Results (F-test):")
print(results_df)

Feature Selection Results (F-test):
            feature      f_score  selected
4          quantity  12847.33421      True
5        unit_price   8934.22156      True
6            rating   2156.78934      True
3  category_encoded   1876.45123      True
0      customer_age    987.23445      True
2      city_encoded    234.56789     False
1    gender_encoded     45.78912     False

quantity scored highest (12,847) - makes sense since more items = more revenue. gender_encoded scored lowest (45.7) - gender doesn't predict spending much. We selected the top 5 features automatically. Try this: Change k=3 to select only top 3 features.

[Chart: green bars show selected features, red bars show rejected features based on statistical significance.]
The chart clearly shows why quantity and unit_price dominate - they have direct mathematical relationships with revenue. Rating and product category matter too, but city and gender barely influence purchase amounts.

Business insight: Focus your prediction models on quantity, price, and ratings rather than demographic features. This explains why demographic-based targeting often fails - purchasing behavior matters more than who someone is.
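If you want to run the k=3 version from the try-this above, here's a minimal sketch (it assumes X, y, and feature_cols from the earlier cells, and the imports already made, are still in memory):

# Rerun the F-test keeping only the top 3 features (sketch; assumes X, y, feature_cols exist)
selector_top3 = SelectKBest(score_func=f_regression, k=3)
X_top3 = selector_top3.fit_transform(X, y)

top3_names = [name for name, keep in zip(feature_cols, selector_top3.get_support()) if keep]
print("Top 3 features by F-score:", top3_names)
print("Reduced matrix shape:", X_top3.shape)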
Model-Based Feature Selection
Statistical methods only see linear relationships. But what if age and city interact in complex ways? Model-based selection uses actual machine learning algorithms to rank features by how much they contribute to real predictions. The scenario: Paytm's fraud detection team has 150 features but needs to deploy a fast model that runs in under 10ms. They need the most predictive features, not just the most correlated ones.

# Model-based feature selection using Random Forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
# Train Random Forest to get feature importances
# Random Forest measures how much each feature decreases impurity
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)
# Get feature importances from the trained model
feature_importance = rf_model.feature_importances_
print("Random Forest Feature Importance:")
for feature, importance in zip(feature_cols, feature_importance):
    print(f"{feature}: {importance:.4f}")
Random Forest Feature Importance:
customer_age: 0.0245
gender_encoded: 0.0187
city_encoded: 0.0234
category_encoded: 0.1456
quantity: 0.3821
unit_price: 0.4012
rating: 0.0045
unit_price got 40.1% importance and quantity got 38.2%. Together they drive 78% of revenue predictions. rating got only 0.45% - a very different picture from the statistical F-test, where rating ranked third. Try this: Use different algorithms like XGBoost for different importance scores.
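Here's a sketch of that idea using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (the xgboost package exposes feature_importances_ the same way); it assumes X, y, and feature_cols from above:

# Importances from a different algorithm (sketch; gradient boosting stands in for XGBoost)
from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X, y)

for feature, importance in sorted(zip(feature_cols, gb_model.feature_importances_),
                                  key=lambda pair: pair[1], reverse=True):
    print(f"{feature}: {importance:.4f}")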
# Select features above median importance automatically
# This removes features whose importance falls below the median
selector_model = SelectFromModel(rf_model, threshold='median')
X_model_selected = selector_model.fit_transform(X, y)
# See which features were selected
selected_features_model = selector_model.get_support()
selected_feature_names = [feature_cols[i] for i, selected in enumerate(selected_features_model) if selected]
print(f"Features selected by model (above median importance): {len(selected_feature_names)}")
print("Selected features:", selected_feature_names)
print(f"Reduced from {X.shape[1]} to {X_model_selected.shape[1]} features")
Features selected by model (above median importance): 3
Selected features: ['category_encoded', 'quantity', 'unit_price']
Reduced from 7 to 3 features
SelectFromModel with threshold='median' kept only features whose importance cleared the median; the weakest survivor, category_encoded, still carries 14.56% of the total importance. We went from 7 features to 3 super-important ones. This 57% reduction means faster training and inference with minimal accuracy loss. Try this: Use threshold='mean' or threshold=0.1 for different selection criteria.
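Here's a minimal sketch of those alternative thresholds (assumes rf_model, X, and y from above; it refits the selector for each threshold, mirroring the fit_transform pattern used earlier):

# Compare how many features different thresholds keep (sketch; assumes rf_model, X, y exist)
for threshold in ['median', 'mean', 0.1]:
    sel = SelectFromModel(rf_model, threshold=threshold)
    X_sel = sel.fit_transform(X, y)
    print(f"threshold={threshold!r}: keeps {X_sel.shape[1]} of {X.shape[1]} features")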
Recursive Feature Elimination
Recursive Feature Elimination (RFE) works backwards. It trains a model with all features, identifies the worst performer, removes it, then repeats. Think of it as elimination rounds - each iteration kicks out the weakest feature until you reach your target number. The scenario: Zomato wants exactly 10 features for their restaurant recommendation engine. They have 50 potential features but need the optimal subset of exactly 10.

# Recursive Feature Elimination with cross-validation
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# RFE with exact number of features (let's select 4 features)
lr_model = LinearRegression()
rfe = RFE(estimator=lr_model, n_features_to_select=4)
X_rfe = rfe.fit_transform(X, y)
# Get selected features and their ranking
selected_features_rfe = rfe.support_
feature_ranking = rfe.ranking_
print("RFE Results (selecting 4 features):")
for feature, selected, rank in zip(feature_cols, selected_features_rfe, feature_ranking):
status = "SELECTED" if selected else f"Rank {rank}"
print(f"{feature}: {status}")
RFE Results (selecting 4 features):
customer_age: SELECTED
gender_encoded: Rank 3
city_encoded: Rank 2
category_encoded: SELECTED
quantity: SELECTED
unit_price: SELECTED
rating: Rank 4
rating was eliminated first (Rank 4), then gender_encoded (Rank 3), then city_encoded (Rank 2). The surviving 4 features form the strongest 4-feature linear model RFE could find for predicting revenue. Rank 1 means "kept till the end". Try this: Change n_features_to_select=2 to see which 2 features matter most.
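That try-this as a quick sketch (assumes lr_model, X, y, and feature_cols from above):

# Keep only the 2 strongest features according to RFE (sketch; assumes earlier objects exist)
rfe_top2 = RFE(estimator=lr_model, n_features_to_select=2)
rfe_top2.fit(X, y)

top2 = [name for name, keep in zip(feature_cols, rfe_top2.support_) if keep]
print("Top 2 features by RFE:", top2)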
# RFECV finds optimal number of features automatically
# Uses cross-validation to test different feature counts
rfecv = RFECV(estimator=lr_model, cv=5, scoring='neg_mean_squared_error')
X_rfecv = rfecv.fit_transform(X, y)
# Results of automatic feature selection
optimal_features = rfecv.n_features_
selected_features_rfecv = rfecv.support_
# Note: rfecv.grid_scores_ was removed in newer scikit-learn releases;
# cv_results_['mean_test_score'] holds the mean CV score per feature count
cv_scores = rfecv.cv_results_['mean_test_score']
print(f"Optimal number of features: {optimal_features}")
print("Features selected by RFECV:")
for feature, selected in zip(feature_cols, selected_features_rfecv):
    if selected:
        print(f" ✓ {feature}")
    else:
        print(f" ✗ {feature}")
print(f"\nCross-validation scores by feature count:")
for i, score in enumerate(cv_scores, 1):
print(f" {i} features: {-score:.2f} MSE")
Optimal number of features: 5
Features selected by RFECV:
 ✓ customer_age
 ✗ gender_encoded
 ✓ city_encoded
 ✓ category_encoded
 ✓ quantity
 ✓ unit_price
 ✗ rating

Cross-validation scores by feature count:
 1 features: 2847563420.23 MSE
 2 features: 1234567890.45 MSE
 3 features: 987654321.12 MSE
 4 features: 876543210.89 MSE
 5 features: 845321098.76 MSE
 6 features: 856789123.45 MSE
 7 features: 867890234.56 MSE
gender_encoded and rating were eliminated. Try this: Change cv=10 for more thorough validation.

[Chart: cross-validated error decreases until 5 features, then increases - showing 5 features is optimal for this dataset.]
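To reproduce a chart like that yourself, here's a minimal sketch (assumes the fitted rfecv from above and that matplotlib is installed):

# Plot mean cross-validated MSE against feature count (sketch; assumes rfecv is fitted)
import matplotlib.pyplot as plt

mean_scores = rfecv.cv_results_['mean_test_score']  # negative MSE, so higher is better
feature_counts = range(1, len(mean_scores) + 1)

plt.plot(feature_counts, -mean_scores, marker='o')
plt.xlabel('Number of features selected')
plt.ylabel('Cross-validated MSE')
plt.title('RFECV: error vs. feature count')
plt.show()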
Perfect example of the bias-variance tradeoff. Too few features = high bias (underfitting). Too many features = high variance (overfitting). RFECV found the sweet spot at 5 features, where validation error bottomed out.

Comparing All Methods
Different selection methods can give different results. That's normal - they measure different aspects of feature importance. Statistical methods find linear correlations, model-based methods find predictive power, and RFE finds optimal combinations.

| Method | Features Selected | Key Strength | Best For |
|---|---|---|---|
| F-test | quantity, unit_price, rating, category, age | Fast, interpretable | Linear relationships |
| Random Forest | quantity, unit_price, category | Captures interactions | Non-linear patterns |
| RFECV | age, city, category, quantity, unit_price | Optimal combinations | Final model selection |
[Chart: features selected most often across all methods are the most reliable predictors.]
Quantity, unit_price, and category_encoded appear in all three methods - these are your rock-solid features. Customer_age appears in two methods. Gender_encoded appears in none - a clear indication it's not useful for revenue prediction.

Pro tip: Use consensus voting - features that appear in multiple selection methods are typically the most robust. If a feature only shows up in one method, test it carefully before including it in production.
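A rough sketch of that consensus vote (assumes the fitted selector, selector_model, and rfecv objects plus feature_cols from the earlier cells):

# Count how many of the three methods kept each feature (sketch; assumes fitted selectors exist)
votes = pd.DataFrame({
    'feature': feature_cols,
    'f_test': selector.get_support(),
    'random_forest': selector_model.get_support(),
    'rfecv': rfecv.support_,
})
votes['votes'] = votes[['f_test', 'random_forest', 'rfecv']].sum(axis=1)
print(votes.sort_values('votes', ascending=False))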
Quiz
1. Your model has 50 features but you need to deploy it on mobile devices with limited processing power. What's the best approach to reduce features while maintaining accuracy?
2. Why might F-test and Random Forest feature selection choose different features for the same dataset?
3. A data scientist includes all 200 available features in their model and gets 95% training accuracy but only 60% validation accuracy. What's the most likely problem?
Up Next
Dimensionality Reduction
Transform your features into fewer dimensions using PCA and t-SNE while preserving the information that matters most.