Feature Engineering Course
Feature Scaling
A model that sees one feature in the thousands and another in fractions doesn't treat them equally — it gets overwhelmed by the big numbers. Scaling brings every feature onto a level playing field so the algorithm can actually learn from all of them.
Feature scaling transforms numerical features so they occupy a comparable range of values. It does not change the shape of the distribution — it just shifts and compresses the scale. Many algorithms — including linear regression, logistic regression, SVMs, KNN, and neural networks — are sensitive to the magnitude of feature values and perform poorly or fail to converge when features are on wildly different scales.
Why Scale at All?
Picture a dataset predicting apartment rent. One feature is square_footage — values from 400 to 3,000. Another is distance_to_metro_km — values from 0.1 to 4.5. A third is num_bedrooms — values from 1 to 5.
When gradient descent updates model weights, it computes a partial derivative for each one. The feature with the largest numerical range, square footage, produces the largest gradients and gets updated aggressively. The distance feature, with tiny values around 0–5, gets almost no update signal, so the model converges slowly and effectively ignores the small-scale features. Scaling fixes this by putting every feature's contribution to the gradient on a comparable footing.
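To make this concrete, here is a minimal sketch of the first gradient-descent step for linear regression on unscaled features. The rent figures are hypothetical and plain numpy is used, so this is an illustration rather than the lesson's pipeline:

```python
import numpy as np

# Hypothetical mini-batch: [square_footage, distance_to_metro_km]
X = np.array([[650.0, 0.4],
              [1200.0, 1.2],
              [2100.0, 3.5]])
y = np.array([1300.0, 2100.0, 3400.0])  # monthly rent

w = np.zeros(2)          # weights before the first update
residual = X @ w - y     # prediction error per row

# The MSE gradient for each weight is proportional to that feature's values
grad = X.T @ residual / len(y)
print(grad)  # the square-footage component dwarfs the distance component
```

With both columns standardized first, the two gradient components land on the same order of magnitude, so a single learning rate works for both weights.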
StandardScaler — Zero mean, unit variance
Subtracts the mean and divides by standard deviation. Output has mean 0 and std 1. Works well when data is roughly normally distributed. The default choice for most regression and classification models.
MinMaxScaler — Compress to a fixed range
Scales each feature to a specific range, usually 0 to 1. Sensitive to outliers because the min and max anchor the scale. Preferred for neural networks and image pixel values.
RobustScaler — Outlier-resistant scaling
Uses the median and interquartile range instead of the mean and standard deviation. Because both statistics depend only on the middle of the distribution, extreme values barely move them, so outliers no longer anchor the scale. The best choice when your data has extreme values you can't remove.
MaxAbsScaler — Preserves sparsity
Divides by the maximum absolute value, scaling to the range −1 to 1. Does not shift the data, so zero values remain zero. Designed for sparse matrices and text feature vectors.
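A small sketch of why this matters for sparse data, using a toy word-count matrix (scipy assumed available): StandardScaler would subtract a nonzero mean and turn the zeros into nonzero values, while MaxAbsScaler only divides, leaving the sparsity pattern intact.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Toy word-count matrix; most entries are zero, as in text features
X = csr_matrix(np.array([[4.0, 0.0, 2.0],
                         [0.0, 10.0, 0.0],
                         [8.0, 5.0, 0.0]]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)   # stays sparse: zeros are untouched

print(scaler.max_abs_)               # per-column max absolute value
print(X_scaled.toarray())            # nonzero entries now lie in (0, 1]
```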
StandardScaler in Practice
The scenario: You're a data scientist at a real estate startup building a rent prediction model using linear regression. Your dataset has three numerical features — square footage, distance to the nearest metro station, and number of bedrooms — all on completely different scales. Before fitting the model, you need to standardize every feature so none of them dominates the gradient updates during training.
# Import pandas and numpy
import pandas as pd
import numpy as np
# StandardScaler subtracts the mean and divides by standard deviation
from sklearn.preprocessing import StandardScaler
# Apartment rental data — three features on very different scales
housing_df = pd.DataFrame({
    'apt_id': ['APT01', 'APT02', 'APT03', 'APT04', 'APT05',
               'APT06', 'APT07', 'APT08', 'APT09', 'APT10'],
    'square_footage': [650, 1200, 850, 2100, 740,
                       1550, 920, 1800, 680, 1100],
    'distance_to_metro_km': [0.4, 1.2, 0.8, 3.5, 0.3,
                             2.1, 0.9, 4.2, 0.5, 1.6],
    'num_bedrooms': [1, 2, 2, 4, 1, 3, 2, 3, 1, 2]
})
# Define which columns need scaling — exclude ID and target columns
feature_cols = ['square_footage', 'distance_to_metro_km', 'num_bedrooms']
# Instantiate StandardScaler — no hyperparameters needed for basic use
scaler = StandardScaler()
# Fit computes the mean and std for each column from the training data
scaler.fit(housing_df[feature_cols])
# Transform applies (x - mean) / std to every value in each column
scaled_array = scaler.transform(housing_df[feature_cols])
# Convert the scaled numpy array back into a readable DataFrame
scaled_df = pd.DataFrame(scaled_array, columns=[f'{c}_scaled' for c in feature_cols])
# Print the mean and std that were learned during fit
print("Learned statistics (from fit):")
for col, mean, std in zip(feature_cols, scaler.mean_, scaler.scale_):
    print(f"  {col}: mean={mean:.2f}, std={std:.2f}")
print()
# Print a side-by-side comparison of raw and scaled values
comparison = housing_df[['apt_id', 'square_footage']].copy()
comparison['sqft_scaled'] = scaled_df['square_footage_scaled'].round(3)
comparison['dist_raw'] = housing_df['distance_to_metro_km']
comparison['dist_scaled'] = scaled_df['distance_to_metro_km_scaled'].round(3)
print(comparison.to_string(index=False))
Learned statistics (from fit):
  square_footage: mean=1159.00, std=476.56
  distance_to_metro_km: mean=1.55, std=1.27
  num_bedrooms: mean=2.10, std=0.94

apt_id square_footage sqft_scaled dist_raw dist_scaled
APT01 650 -1.068 0.4 -0.903
APT02 1200 0.086 1.2 -0.275
APT03 850 -0.648 0.8 -0.589
APT04 2100 1.975 3.5 1.531
APT05 740 -0.879 0.3 -0.981
APT06 1550 0.820 2.1 0.432
APT07 920 -0.502 0.9 -0.510
APT08 1800 1.345 4.2 2.080
APT09 680 -1.005 0.5 -0.824
APT10 1100 -0.124 1.6 0.039
What just happened?
StandardScaler computed the mean and standard deviation of all three feature columns during .fit(), then subtracted the mean and divided by the std for every value during .transform(). The table shows square footage (originally 650–2100) and distance (originally 0.3–4.2) now sitting on the same rough scale of about −1 to +2. APT04 at 2100 sq ft registers 1.975, meaning it sits roughly two standard deviations above the average apartment.
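Two conveniences worth knowing, sketched here on a trimmed-down frame of the same kind of data: fit_transform collapses the fit and transform calls into one, and inverse_transform maps scaled values back to the original units.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

housing = pd.DataFrame({
    'square_footage': [650, 1200, 850, 2100, 740],
    'distance_to_metro_km': [0.4, 1.2, 0.8, 3.5, 0.3],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(housing)   # fit() and transform() in one call

# inverse_transform undoes the scaling, which is handy for reporting
# model outputs back in original units
restored = scaler.inverse_transform(scaled)
print(restored[0])  # first row back in square feet and kilometres
```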
MinMaxScaler — Scaling to a Fixed Range
The scenario: You're a machine learning engineer at a fintech company preparing features for a neural network that predicts loan default probability. Neural networks work best when inputs are in the range 0 to 1 — sigmoid activations in particular saturate badly with large input values. Your team decides to apply MinMax scaling to all numerical features before feeding them into the network.
# Import pandas
import pandas as pd
# MinMaxScaler maps values to a fixed range — default is [0, 1]
from sklearn.preprocessing import MinMaxScaler
# Loan applicant features at different scales
loan_df = pd.DataFrame({
    'loan_id': ['L01', 'L02', 'L03', 'L04', 'L05',
                'L06', 'L07', 'L08', 'L09', 'L10'],
    'loan_amount': [5000, 15000, 8000, 25000, 6500,
                    12000, 30000, 9500, 7200, 18000],
    'interest_rate': [5.5, 8.2, 6.1, 11.4, 5.9,
                      7.8, 14.2, 6.7, 6.3, 9.1],
    'credit_score': [720, 640, 695, 580, 710,
                     660, 530, 700, 688, 625]
})
# Feature columns to scale
feature_cols = ['loan_amount', 'interest_rate', 'credit_score']
# feature_range=(0,1) is the default — values compress to exactly 0.0–1.0
mm_scaler = MinMaxScaler(feature_range=(0, 1))
# Fit learns the min and max of each column
mm_scaler.fit(loan_df[feature_cols])
# Transform applies (x - min) / (max - min) to every value
scaled = mm_scaler.transform(loan_df[feature_cols])
# Show min and max that were fitted per column
print("Fitted min and max per column:")
for col, mn, mx in zip(feature_cols, mm_scaler.data_min_, mm_scaler.data_max_):
    print(f"  {col}: min={mn}, max={mx}")
print()
# Print the original loan amount alongside its MinMax-scaled version
result = loan_df[['loan_id', 'loan_amount']].copy()
result['loan_amount_scaled'] = scaled[:, 0].round(3)
result['interest_rate'] = loan_df['interest_rate']
result['interest_scaled'] = scaled[:, 1].round(3)
print(result.to_string(index=False))
Fitted min and max per column:
loan_amount: min=5000.0, max=30000.0
interest_rate: min=5.5, max=14.2
credit_score: min=530.0, max=720.0
loan_id loan_amount loan_amount_scaled interest_rate interest_scaled
L01 5000 0.000 5.5 0.000
L02 15000 0.400 8.2 0.310
L03 8000 0.120 6.1 0.069
L04 25000 0.800 11.4 0.678
L05 6500 0.060 5.9 0.046
L06 12000 0.280 7.8 0.264
L07 30000 1.000 14.2 1.000
L08 9500 0.180 6.7 0.138
L09 7200 0.088 6.3 0.092
L10 18000 0.520 9.1 0.414
What just happened?
MinMaxScaler recorded the minimum and maximum of each column during .fit(), then compressed every value into the 0–1 range. L01 at $5,000, the smallest loan, becomes exactly 0.0, and L07 at $30,000, the largest, becomes exactly 1.0. All other loans sit proportionally between those anchors. The same logic applies to the interest-rate columns.
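One caveat worth a sketch: the min and max are frozen at fit time, so a value outside the training range maps outside [0, 1] at prediction time. The `clip` parameter used below is an assumption that your scikit-learn is version 0.24 or newer, where it was added.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[5000.0], [15000.0], [30000.0]])  # loan amounts seen at fit time
scaler = MinMaxScaler().fit(train)

# A larger loan at prediction time lands OUTSIDE the 0-1 range
print(scaler.transform([[40000.0]]))   # (40000 - 5000) / 25000 = 1.4

# clip=True (scikit-learn >= 0.24) pins such values back into the range
clipped = MinMaxScaler(clip=True).fit(train)
print(clipped.transform([[40000.0]]))  # clipped to the range maximum
```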
RobustScaler — When Outliers Are Present
The scenario: You're a data scientist at a logistics company modeling delivery time. One feature is package_weight_kg. Most packages weigh 0.5 to 10 kg. But once a month, an industrial shipment goes through at 800 kg. You can't remove it — it's a legitimate order. If you use StandardScaler, that single extreme value inflates the standard deviation and compresses every normal package into a tiny sliver near zero. RobustScaler solves this by using the median and IQR instead.
# Import pandas and both scalers for comparison
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler
# Delivery data — package_weight_kg has one extreme industrial shipment
delivery_df = pd.DataFrame({
    'order_id': ['O01', 'O02', 'O03', 'O04', 'O05',
                 'O06', 'O07', 'O08', 'O09', 'O10'],
    'package_weight_kg': [1.2, 3.5, 0.8, 5.1, 2.3,
                          4.7, 800.0, 1.9, 3.1, 2.8]
})
# Fit RobustScaler — uses median and IQR, both immune to extreme values
robust = RobustScaler()
robust.fit(delivery_df[['package_weight_kg']])
# Fit StandardScaler on the same column for comparison
standard = StandardScaler()
standard.fit(delivery_df[['package_weight_kg']])
# Apply both transformations
delivery_df['robust_scaled'] = robust.transform(delivery_df[['package_weight_kg']]).round(3)
delivery_df['standard_scaled'] = standard.transform(delivery_df[['package_weight_kg']]).round(3)
# Print the center and scale that RobustScaler learned
print(f"RobustScaler center (median): {robust.center_[0]:.2f}")
print(f"RobustScaler scale (IQR): {robust.scale_[0]:.2f}")
print(f"StandardScaler mean: {standard.mean_[0]:.2f}")
print(f"StandardScaler std: {standard.scale_[0]:.2f}")
print()
print(delivery_df.to_string(index=False))
RobustScaler center (median): 2.95
RobustScaler scale (IQR): 2.40
StandardScaler mean: 82.54
StandardScaler std: 239.16

order_id package_weight_kg robust_scaled standard_scaled
O01 1.2 -0.729 -0.340
O02 3.5 0.229 -0.330
O03 0.8 -0.896 -0.342
O04 5.1 0.896 -0.324
O05 2.3 -0.271 -0.336
O06 4.7 0.729 -0.325
O07 800.0 332.104 3.000
O08 1.9 -0.438 -0.337
O09 3.1 0.062 -0.332
O10 2.8 -0.062 -0.333
What just happened?
Both scalers processed the same column, but the results are completely different. StandardScaler let the 800 kg outlier inflate the mean and std so badly that all normal packages cluster between −0.34 and −0.32, practically identical. RobustScaler computes its center and scale from the median and IQR, which the outlier barely moves, so the normal packages spread usefully from −0.90 to +0.90. The outlier still shows up at 332; it hasn't been removed, but it no longer poisons the rest of the feature.
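The learned statistics are easy to sanity-check against numpy directly. This sketch recomputes the median and IQR for the same package weights and compares them with what RobustScaler stored:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

weights = np.array([1.2, 3.5, 0.8, 5.1, 2.3, 4.7,
                    800.0, 1.9, 3.1, 2.8]).reshape(-1, 1)

scaler = RobustScaler().fit(weights)

# center_ is the per-column median; scale_ is Q3 - Q1
q1, median, q3 = np.percentile(weights, [25, 50, 75])
print(median, q3 - q1)                      # numpy's answer
print(scaler.center_[0], scaler.scale_[0])  # matches the fitted scaler
```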
Scaler Selection Guide
| Scaler | Formula | Output Range | Use When |
|---|---|---|---|
| StandardScaler | (x − μ) / σ | Unbounded (~−3 to 3) | Linear models, PCA, SVMs, normal-ish data |
| MinMaxScaler | (x − min) / (max − min) | 0 to 1 | Neural networks, image data, bounded inputs |
| RobustScaler | (x − median) / IQR | Unbounded | Data with outliers you cannot remove |
| MaxAbsScaler | x / max(\|x\|) | −1 to 1 | Sparse data, text features, TF-IDF vectors |
Which Models Need Scaling — and Which Do Not
Scale-sensitive — always scale
Linear Regression
Logistic Regression
Support Vector Machines
K-Nearest Neighbors
K-Means Clustering
Neural Networks
PCA / dimensionality reduction
Scale-invariant — scaling optional
Decision Trees
Random Forest
Gradient Boosting (XGBoost, LightGBM)
Naive Bayes
Rule-based models
Tree models split on thresholds — the absolute scale of a feature doesn't affect which split is chosen.
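A quick sketch demonstrating this on synthetic data (not the lesson's dataset): the same decision tree trained on raw and standardized copies of the features produces identical predictions, because scaling only relabels the split thresholds.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Synthetic features on wildly different scales: sqft and metro distance
X = rng.uniform(low=[400, 0.1], high=[3000, 4.5], size=(200, 2))
y = (X[:, 0] > 1500).astype(int)   # label depends on a sqft threshold

raw_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

scaler = StandardScaler().fit(X)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)

# Thresholds differ in units, but the partitions (and predictions) match
same = (raw_tree.predict(X) == scaled_tree.predict(scaler.transform(X))).all()
print(same)
```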
Fit on train, transform on both
This rule from Lesson 10 applies to all scalers identically. Fit the scaler on training data only — never on the test set. Fitting on the full dataset leaks test statistics into training and inflates evaluation metrics.
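In code, the rule looks like this (toy data purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy single feature
y = np.arange(20)

# Step 1: split FIRST, before the scaler sees anything
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Step 2: fit on the training fold only
scaler = StandardScaler().fit(X_train)

# Step 3: transform both folds using the training-fold statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"mean learned from training rows only: {scaler.mean_[0]:.2f}")
```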
Scaling does not fix skewness
StandardScaler centers and compresses a distribution — it does not make a right-skewed feature normally distributed. If a feature has heavy skew, apply a log or power transformation first, then scale. Transformation and scaling solve different problems.
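A sketch of the two-step recipe on synthetic right-skewed data, with log-normal draws standing in for something like income (scipy assumed available for the skew measurement):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
incomes = rng.lognormal(mean=10, sigma=1, size=1000)  # heavy right skew

# Scaling alone: an affine transform never changes skewness
scaled = StandardScaler().fit_transform(incomes.reshape(-1, 1)).ravel()

# Transform first, then scale: log1p pulls in the long right tail
log_then_scaled = StandardScaler().fit_transform(
    np.log1p(incomes).reshape(-1, 1)).ravel()

print(f"skew after scaling only:  {skew(scaled):.2f}")
print(f"skew after log1p + scale: {skew(log_then_scaled):.2f}")
```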
Teacher's Note
A very common mistake is applying fit_transform() on the entire dataset before the train-test split. This means the scaler has already seen the test set's statistics — its mean, its std, its min and max — before the model ever touches the test data. That's data leakage, and it makes your test metrics optimistic and unreliable. The correct order is: split first, then fit the scaler on the training fold only, then transform both folds. If you're using a scikit-learn Pipeline, this is handled automatically — which is another strong reason to prefer pipelines in production code.
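A minimal Pipeline sketch on synthetic data: inside cross-validation, the whole pipeline is refit per fold, so the scaler is always fit on that fold's training portion only and merely applied to the held-out portion.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Scaling and modelling travel together, so no test statistics can leak
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```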
Practice Questions
1. Which scaler transforms data to have mean 0 and standard deviation 1?
2. Which scaler uses the median and interquartile range to resist the influence of outliers?
3. To avoid data leakage, a scaler should be fitted on the ________ data only.
Quiz
1. You are building a neural network and need all features compressed into the range 0 to 1. Which scaler should you use?
2. Which type of model does NOT require feature scaling?
3. What does feature scaling change about a distribution?
Up Next · Lesson 13
Encoding Basics
Turn text categories into numbers your model can actually use — label encoding, one-hot encoding, and when to choose each.