Data Science · Lesson 9

Feature Scaling

Transform numeric features to consistent ranges using standardization and normalization techniques for optimal machine learning performance.

1. Identify features with different scales
2. Choose scaling method
3. Apply transformation
4. Validate results

Why Feature Scaling Matters

Picture this scenario at Flipkart. Your revenue column ranges from INR 500 to 200,000, while customer age spans 18 to 65. When you feed this raw data into a machine learning algorithm, guess which feature dominates? The revenue values completely overshadow age values — it's like comparing elephants to ants.

Most algorithms calculate distances between data points. K-means clustering, KNN, neural networks — they all suffer when features have vastly different scales. Revenue differences of thousands swamp age differences of decades. Your model becomes biased toward high-magnitude features.

Without Scaling

Revenue: 50,000 vs 100,000
Age: 25 vs 45
Distance dominated by revenue

With Scaling

Revenue: -0.5 vs 1.2
Age: -0.8 vs 0.9
Equal contribution to distance
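
To make this tangible, here is a minimal sketch that computes the Euclidean distance between the two customers above, before and after standardization. The column means and standard deviations used here are made-up illustrative values, not statistics from the lesson dataset.

# Distance between two customers described above: (revenue, age)
import numpy as np

customer_a = np.array([50_000, 25])
customer_b = np.array([100_000, 45])

# Raw Euclidean distance: almost entirely driven by revenue
print("Raw distance:", round(np.linalg.norm(customer_a - customer_b), 1))

# Standardize both features using illustrative column means and stds
means = np.array([75_000, 35])
stds = np.array([40_000, 12])
a_scaled = (customer_a - means) / stds
b_scaled = (customer_b - means) / stds

# Both features now contribute meaningfully to the distance
print("Scaled distance:", round(np.linalg.norm(a_scaled - b_scaled), 2))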

Standardization vs Normalization

The two main scaling approaches solve different problems. Standardization (Z-score normalization) transforms data to have mean 0 and standard deviation 1. Min-Max normalization squeezes values between 0 and 1.

Method | Formula | Range | Best For
Standardization | (x - μ) / σ | -∞ to +∞ | Normal distributions
Min-Max | (x - min) / (max - min) | 0 to 1 | Bounded distributions
Robust | (x - median) / IQR | Variable | Data with outliers

Honestly, standardization is the go-to choice in most real projects. Why? Most features in business data follow roughly normal distributions. Customer ages, product ratings, purchase quantities all cluster around a mean with symmetric tails. Standardization preserves the shape of these distributions while making scales comparable.
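
To make the formulas in the table above concrete, here is a small sketch that applies each one by hand to a toy revenue sample (the numbers are illustrative, not from the lesson dataset):

# Apply each scaling formula manually to a toy revenue sample
import numpy as np

x = np.array([500, 2_000, 8_000, 19_500, 199_899], dtype=float)

standardized = (x - x.mean()) / x.std()            # (x - mu) / sigma
min_max = (x - x.min()) / (x.max() - x.min())      # (x - min) / (max - min)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)            # (x - median) / IQR

print("Standardized:", standardized.round(2))
print("Min-Max:     ", min_max.round(3))
print("Robust:      ", robust.round(2))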

When to Choose Which Method

Standardization: Neural networks, SVM, logistic regression, PCA
Min-Max: Image processing, neural networks with sigmoid/tanh
Robust: Data with confirmed outliers, financial datasets

Implementing Standardization

The scenario: A data scientist at Swiggy needs to cluster customers for targeted campaigns. Revenue, age, and order frequency have completely different scales. The clustering algorithm keeps grouping customers purely by revenue brackets.

# Load and explore the scale differences
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('dataplexa_ecommerce.csv')
print("Original feature statistics:")
print(df[['customer_age', 'quantity', 'unit_price', 'revenue']].describe())

What just happened?

Look at the mean values: customer_age averages 41, while revenue averages 14,891. The revenue scale is roughly 360x larger! Try this: Calculate the coefficient of variation (std/mean) for each feature to see relative variability.
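
A quick way to follow that suggestion, assuming df is loaded as in the cell above:

# Coefficient of variation: relative spread, independent of each feature's units
features = ['customer_age', 'quantity', 'unit_price', 'revenue']
cv = df[features].std() / df[features].mean()
print("Coefficient of variation per feature:")
print(cv.round(3))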

Now apply standardization. The StandardScaler from scikit-learn handles the math automatically. But here's the critical part — you fit the scaler on training data only, then transform both training and test sets.

# Apply standardization - fit the scaler to learn each feature's mean and std
features = ['customer_age', 'quantity', 'unit_price', 'revenue']
scaler = StandardScaler()

# Fit the scaler (calculates mean and std)
scaler.fit(df[features])
print("Fitted means:", scaler.mean_)
print("Fitted standard deviations:", scaler.scale_)

What just happened?

The scaler stored the mean_ and scale_ (standard deviation) for each feature. Revenue has the largest standard deviation at 18,456. Try this: Save these parameters — you'll need identical scaling for future data.

# Transform the data using fitted parameters
df_scaled = df.copy()
df_scaled[features] = scaler.transform(df[features])

print("Standardized statistics:")
print(df_scaled[features].describe().round(3))

What just happened?

Perfect! All features now have mean ≈ 0 and std = 1. Notice the revenue max is 10.023 — that extreme value from 199,899 is now expressed as "10 standard deviations above mean." Try this: Check if any standardized values exceed ±3 to spot outliers.
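
One way to do that outlier check, assuming df_scaled from the previous cell:

# Flag rows where any standardized feature sits more than 3 std devs from the mean
outlier_mask = (df_scaled[features].abs() > 3).any(axis=1)
print(f"Rows with |z| > 3 in at least one feature: {outlier_mask.sum()}")
print(df_scaled.loc[outlier_mask, features].head())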

Revenue dominated with 14,891 mean vs age at 41. Standardization equalizes all means to zero.

The chart shows the dramatic difference. Before standardization, revenue's mean of 14,891 dwarfs everything else. After standardization, all features contribute equally to distance calculations. This is exactly what clustering algorithms need.

But there's a crucial detail most tutorials skip. What happens when new data arrives? You must use the same mean and standard deviation from your training data. Never fit the scaler on new data — it would use different parameters and break consistency.
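
In code, that means calling transform (never fit or fit_transform) with the scaler you already fitted. The new_orders DataFrame below is a hypothetical batch of fresh data with the same columns:

# Reuse the fitted scaler on incoming data: transform only, never refit
# new_orders is a hypothetical DataFrame with the same feature columns
new_orders_scaled = scaler.transform(new_orders[features])

# Wrong: refitting computes new means/stds and breaks consistency
# scaler.fit_transform(new_orders[features])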

Min-Max Normalization

Sometimes you need features bounded between 0 and 1. Neural networks with sigmoid activation functions perform better with this range. Image pixel values, probability scores, percentage features — they naturally fit 0-1 scaling.

The scenario: An ML engineer at OYO is building a recommendation system. Customer ratings (1-5), booking probability (0-1), and review sentiment scores (-1 to +1) need consistent 0-1 scaling for the neural network. We'll demonstrate the technique on the rating, age, and quantity columns of the e-commerce dataset used throughout this lesson.

# Min-Max scaling for bounded features
from sklearn.preprocessing import MinMaxScaler

# Demonstrate Min-Max on the rating column plus two other numeric features
rating_features = ['rating', 'customer_age', 'quantity']
minmax_scaler = MinMaxScaler()

print("Original ranges:")
for col in rating_features:
    print(f"{col}: {df[col].min():.1f} to {df[col].max():.1f}")

# Apply Min-Max transformation
df_minmax = df.copy()
df_minmax[rating_features] = minmax_scaler.fit_transform(df[rating_features])

print("Min-Max normalized ranges:")
for col in rating_features:
    print(f"{col}: {df_minmax[col].min():.3f} to {df_minmax[col].max():.3f}")
    
print("\nSample transformations:")
print(df_minmax[rating_features].head())

What just happened?

Each feature now spans exactly 0.000 to 1.000. A rating of 4.0 becomes 0.750 (3 out of 4 steps from min). Age 44 with range 18-65 becomes 0.553. Try this: Verify the formula: (44-18)/(65-18) = 0.553.
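
Checking that arithmetic in code, assuming the 18 to 65 age range:

# Manual Min-Max check for age 44 on an 18-65 range
age, age_min, age_max = 44, 18, 65
print(round((age - age_min) / (age_max - age_min), 3))   # 0.553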

📊 Data Insight

Min-Max scaling preserves relationships within features but eliminates scale differences. A customer with 5-star rating and age 65 becomes (1.0, 1.0) — both maximum values in their respective ranges.

Robust Scaling for Outliers

Here's where many tutorials fail you. They ignore outliers. Look at our revenue data: the maximum is INR 199,899 while the 75th percentile is only INR 19,567. That's a massive outlier. Standard scaling gets distorted because it relies on the mean and standard deviation.

Robust scaling uses median and interquartile range (IQR) instead. These statistics resist outlier influence. The formula: (x - median) / IQR where IQR = 75th percentile - 25th percentile.

# Compare scaling methods on outlier-heavy revenue data
from sklearn.preprocessing import RobustScaler

# Focus on revenue column with extreme outliers
revenue_data = df[['revenue']].copy()

# Apply all three scaling methods
standard_scaler = StandardScaler()
robust_scaler = RobustScaler()

revenue_data['standard'] = standard_scaler.fit_transform(revenue_data[['revenue']])
revenue_data['robust'] = robust_scaler.fit_transform(revenue_data[['revenue']])
revenue_data['minmax'] = minmax_scaler.fit_transform(revenue_data[['revenue']])

print("Scaling comparison for revenue outliers:")
print(revenue_data.describe().round(3))

What just happened?

Notice the maximum values: standard gives 10.023, robust gives 11.670. Both preserve the outlier magnitude relative to their scaling method. Min-Max squashes everything to 0-1, losing outlier information. Try this: Plot histograms of each scaling method to see distribution shapes.
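
If you want to try the histogram comparison, here is a minimal matplotlib sketch using the revenue_data frame built above:

# Compare distribution shapes produced by the three scaling methods
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, col in zip(axes, ['standard', 'robust', 'minmax']):
    ax.hist(revenue_data[col], bins=50)
    ax.set_title(f"{col} scaling")
plt.tight_layout()
plt.show()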

Robust scaling maintains outlier relationships while standard scaling shows extreme maximum. Min-Max compresses all variation.

The line chart reveals the key insight. Standard scaling shoots up to 10+ at the maximum — that outlier still dominates. Robust scaling also shows the outlier but with better distribution of normal values. Min-Max scaling flattens everything, losing valuable information about the outlier.

Common Mistake: Scaling Categorical Variables

Never apply scaling to one-hot encoded categories or to ordinal codes whose gaps carry meaning. Standardizing gender_Male from 0/1 to values like -0.8 and 1.2 destroys the binary meaning. Fix: scale only the continuous numeric features.
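
One way to enforce this is scikit-learn's ColumnTransformer, sketched below with the column names used in this lesson (gender_Male and city_Mumbai are assumed to exist as one-hot columns):

# Scale only continuous columns; one-hot dummies pass through unchanged
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

numeric_cols = ['customer_age', 'quantity', 'unit_price', 'revenue']

preprocess = ColumnTransformer(
    transformers=[('scale', StandardScaler(), numeric_cols)],
    remainder='passthrough'   # gender_Male, city_Mumbai, etc. stay 0/1
)

# Usage: preprocess.fit_transform(X_train), then preprocess.transform(X_test)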

Production Scaling Pipeline

Real ML projects need reproducible scaling. You can't manually copy-paste scaling parameters between notebooks. Here's the production approach used at companies like Paytm and Myntra.

# Production scaling pipeline with proper train/validation split
from sklearn.model_selection import train_test_split
import joblib

# Split data first - crucial for preventing data leakage
features_to_scale = ['customer_age', 'quantity', 'unit_price', 'revenue']
X = df[features_to_scale]
y = df['returned']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

# Fit scaler ONLY on training data
production_scaler = StandardScaler()
production_scaler.fit(X_train)

# Transform both training and test sets
X_train_scaled = production_scaler.transform(X_train)
X_test_scaled = production_scaler.transform(X_test)

# Save the fitted scaler for future use
joblib.dump(production_scaler, 'revenue_scaler.pkl')
print("Scaler saved successfully!")

# Verify no data leakage - test set statistics
print("\nTest set after scaling (should NOT have mean=0, std=1):")
print(pd.DataFrame(X_test_scaled, columns=features_to_scale).describe().round(3))

What just happened?

Perfect! The test set mean is 0.019 (not exactly 0) and std is 1.021 (not exactly 1). This proves we didn't fit on test data. The scaler used only training statistics. Try this: Load the saved scaler with joblib.load('revenue_scaler.pkl') to verify persistence.
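
Reloading the saved scaler looks like this, continuing from the cell above:

# Load the persisted scaler and confirm it reproduces the same transform
loaded_scaler = joblib.load('revenue_scaler.pkl')
print("Loaded means:", loaded_scaler.mean_)

X_test_reloaded = loaded_scaler.transform(X_test)
print("Matches original transform:", np.allclose(X_test_reloaded, X_test_scaled))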

85% of production ML pipelines correctly fit scalers on training data only, avoiding subtle data leakage bugs.

The doughnut chart shows industry reality. Most teams (85%) implement proper scaling pipelines. The 15% who fit on all the data, however, see mysteriously good validation scores that collapse in production. Data leakage through scaling is silent and deadly.

📊 Data Insight

Feature scaling can improve model performance substantially for distance-based algorithms. Neural networks show the biggest gains, while tree-based models (Random Forest, XGBoost) remain largely unaffected since they split on feature thresholds rather than compute distances.

Quiz

1. Your e-commerce dataset has revenue values ranging from INR 500 to INR 500,000 (a major outlier). Why would RobustScaler be better than StandardScaler for this feature?


2. You're building a customer churn prediction model. What's the correct way to apply feature scaling to avoid data leakage?


3. Your dataset contains customer_age (18-65), revenue (500-200000), gender_Male (0,1), and city_Mumbai (0,1). Which features should be scaled?


Up Next

Merging & Joining

Combine multiple datasets using pandas merge operations, building on your scaled features to create comprehensive customer profiles from disparate data sources.