Feature Engineering Course
Feature Scaling
A model that sees one feature in the thousands and another in fractions doesn't treat them equally — it gets overwhelmed by the big numbers. Scaling brings every feature onto a level playing field so the algorithm can actually learn from all of them.
Feature scaling transforms numerical features so they occupy a comparable range of values. It does not change the shape of the distribution — it just shifts and compresses the scale. Many algorithms — including linear regression, logistic regression, SVMs, KNN, and neural networks — are sensitive to the magnitude of feature values and perform poorly or fail to converge when features are on wildly different scales.
Why Scale at All?
Picture a dataset predicting apartment rent. One feature is square_footage — values from 400 to 3,000. Another is distance_to_metro_km — values from 0.1 to 4.5. A third is num_bedrooms — values from 1 to 5.
When gradient descent updates model weights, it computes a partial derivative for each one. The feature with the largest numerical range, square footage, produces the largest gradients and gets updated aggressively. The distance feature, with tiny values around 0–5, gets almost no update signal, so the model converges slowly and effectively ignores the small-scale features. Scaling fixes this by putting every feature's contribution to the gradient on a comparable footing.
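To make this concrete, here is a minimal sketch of the first gradient-descent step for linear regression on unscaled features. The rent figures are hypothetical and plain numpy is used, so this is an illustration rather than the lesson's pipeline:

```python
import numpy as np

# Hypothetical mini-batch: [square_footage, distance_to_metro_km]
X = np.array([[650.0, 0.4],
              [1200.0, 1.2],
              [2100.0, 3.5]])
y = np.array([1300.0, 2100.0, 3400.0])  # monthly rent

w = np.zeros(2)          # weights before the first update
residual = X @ w - y     # prediction error per row

# The MSE gradient for each weight is proportional to that feature's values
grad = X.T @ residual / len(y)
print(grad)  # the square-footage component dwarfs the distance component
```

With both columns standardized first, the two gradient components land on the same order of magnitude, so a single learning rate works for both weights.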
StandardScaler — Zero mean, unit variance
Subtracts the mean and divides by standard deviation. Output has mean 0 and std 1. Works well when data is roughly normally distributed. The default choice for most regression and classification models.
MinMaxScaler — Compress to a fixed range
Scales each feature to a specific range, usually 0 to 1. Sensitive to outliers because the min and max anchor the scale. Preferred for neural networks and image pixel values.
RobustScaler — Outlier-resistant scaling
Uses the median and interquartile range instead of the mean and standard deviation. Because both statistics depend only on the middle of the distribution, extreme values barely move them, so outliers no longer anchor the scale. The best choice when your data has extreme values you can't remove.
MaxAbsScaler — Preserves sparsity
Divides by the maximum absolute value, scaling to the range −1 to 1. Does not shift the data, so zero values remain zero. Designed for sparse matrices and text feature vectors.
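A small sketch of why this matters for sparse data, using a toy word-count matrix (scipy assumed available): StandardScaler would subtract a nonzero mean and turn the zeros into nonzero values, while MaxAbsScaler only divides, leaving the sparsity pattern intact.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Toy word-count matrix; most entries are zero, as in text features
X = csr_matrix(np.array([[4.0, 0.0, 2.0],
                         [0.0, 10.0, 0.0],
                         [8.0, 5.0, 0.0]]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)   # stays sparse: zeros are untouched

print(scaler.max_abs_)               # per-column max absolute value
print(X_scaled.toarray())            # nonzero entries now lie in (0, 1]
```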
StandardScaler in Practice
The scenario: You're a data scientist at a real estate startup building a rent prediction model using linear regression. Your dataset has three numerical features — square footage, distance to the nearest metro station, and number of bedrooms — all on completely different scales. Before fitting the model, you need to standardize every feature so none of them dominates the gradient updates during training.
# Import pandas and numpy
import pandas as pd
import numpy as np
# StandardScaler subtracts the mean and divides by standard deviation
from sklearn.preprocessing import StandardScaler
# Apartment rental data — three features on very different scales
housing_df = pd.DataFrame({
    'apt_id': ['APT01', 'APT02', 'APT03', 'APT04', 'APT05',
               'APT06', 'APT07', 'APT08', 'APT09', 'APT10'],
    'square_footage': [650, 1200, 850, 2100, 740,
                       1550, 920, 1800, 680, 1100],
    'distance_to_metro_km': [0.4, 1.2, 0.8, 3.5, 0.3,
                             2.1, 0.9, 4.2, 0.5, 1.6],
    'num_bedrooms': [1, 2, 2, 4, 1, 3, 2, 3, 1, 2]
})
# Define which columns need scaling — exclude ID and target columns
feature_cols = ['square_footage', 'distance_to_metro_km', 'num_bedrooms']
# Instantiate StandardScaler — no hyperparameters needed for basic use
scaler = StandardScaler()
# Fit computes the mean and std for each column from the training data
scaler.fit(housing_df[feature_cols])
# Transform applies (x - mean) / std to every value in each column
scaled_array = scaler.transform(housing_df[feature_cols])
# Convert the scaled numpy array back into a readable DataFrame
scaled_df = pd.DataFrame(scaled_array, columns=[f'{c}_scaled' for c in feature_cols])
# Print the mean and std that were learned during fit
print("Learned statistics (from fit):")
for col, mean, std in zip(feature_cols, scaler.mean_, scaler.scale_):
    print(f"  {col}: mean={mean:.2f}, std={std:.2f}")
print()
# Print a side-by-side comparison of raw and scaled values
comparison = housing_df[['apt_id', 'square_footage']].copy()
comparison['sqft_scaled'] = scaled_df['square_footage_scaled'].round(3)
comparison['dist_raw'] = housing_df['distance_to_metro_km']
comparison['dist_scaled'] = scaled_df['distance_to_metro_km_scaled'].round(3)
print(comparison.to_string(index=False))
Learned statistics (from fit):
  square_footage: mean=1159.00, std=476.56
  distance_to_metro_km: mean=1.55, std=1.27
  num_bedrooms: mean=2.10, std=0.94

apt_id square_footage sqft_scaled dist_raw dist_scaled
APT01 650 -1.068 0.4 -0.903
APT02 1200 0.086 1.2 -0.275
APT03 850 -0.648 0.8 -0.589
APT04 2100 1.975 3.5 1.531
APT05 740 -0.879 0.3 -0.981
APT06 1550 0.820 2.1 0.432
APT07 920 -0.502 0.9 -0.510
APT08 1800 1.345 4.2 2.080
APT09 680 -1.005 0.5 -0.824
APT10 1100 -0.124 1.6 0.039
What just happened?
StandardScaler computed the mean and standard deviation of all three feature columns during .fit(), then subtracted the mean and divided by the std for every value during .transform(). The table shows square footage (originally 650–2100) and distance (originally 0.3–4.2) now sitting on the same rough scale of about −1 to +2. APT04 at 2100 sq ft registers 1.975, meaning it sits roughly two standard deviations above the average apartment.
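Two conveniences worth knowing, sketched here on a trimmed-down frame of the same kind of data: fit_transform collapses the fit and transform calls into one, and inverse_transform maps scaled values back to the original units.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

housing = pd.DataFrame({
    'square_footage': [650, 1200, 850, 2100, 740],
    'distance_to_metro_km': [0.4, 1.2, 0.8, 3.5, 0.3],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(housing)   # fit() and transform() in one call

# inverse_transform undoes the scaling, which is handy for reporting
# model outputs back in original units
restored = scaler.inverse_transform(scaled)
print(restored[0])  # first row back in square feet and kilometres
```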
MinMaxScaler — Scaling to a Fixed Range
The scenario: You're a machine learning engineer at a fintech company preparing features for a neural network that predicts loan default probability. Neural networks work best when inputs are in the range 0 to 1 — sigmoid activations in particular saturate badly with large input values. Your team decides to apply MinMax scaling to all numerical features before feeding them into the network.
# Import pandas
import pandas as pd
# MinMaxScaler maps values to a fixed range — default is [0, 1]
from sklearn.preprocessing import MinMaxScaler
# Loan applicant features at different scales
loan_df = pd.DataFrame({
    'loan_id': ['L01', 'L02', 'L03', 'L04', 'L05',
                'L06', 'L07', 'L08', 'L09', 'L10'],
    'loan_amount': [5000, 15000, 8000, 25000, 6500,
                    12000, 30000, 9500, 7200, 18000],
    'interest_rate': [5.5, 8.2, 6.1, 11.4, 5.9,
                      7.8, 14.2, 6.7, 6.3, 9.1],
    'credit_score': [720, 640, 695, 580, 710,
                     660, 530, 700, 688, 625]
})
# Feature columns to scale
feature_cols = ['loan_amount', 'interest_rate', 'credit_score']
# feature_range=(0,1) is the default — values compress to exactly 0.0–1.0
mm_scaler = MinMaxScaler(feature_range=(0, 1))
# Fit learns the min and max of each column
mm_scaler.fit(loan_df[feature_cols])
# Transform applies (x - min) / (max - min) to every value
scaled = mm_scaler.transform(loan_df[feature_cols])
# Show min and max that were fitted per column
print("Fitted min and max per column:")
for col, mn, mx in zip(feature_cols, mm_scaler.data_min_, mm_scaler.data_max_):
    print(f"  {col}: min={mn}, max={mx}")
print()
# Print the original loan amount alongside its MinMax-scaled version
result = loan_df[['loan_id', 'loan_amount']].copy()
result['loan_amount_scaled'] = scaled[:, 0].round(3)
result['interest_rate'] = loan_df['interest_rate']
result['interest_scaled'] = scaled[:, 1].round(3)
print(result.to_string(index=False))
Fitted min and max per column:
loan_amount: min=5000.0, max=30000.0
interest_rate: min=5.5, max=14.2
credit_score: min=530.0, max=720.0
loan_id loan_amount loan_amount_scaled interest_rate interest_scaled
L01 5000 0.000 5.5 0.000
L02 15000 0.400 8.2 0.310
L03 8000 0.120 6.1 0.069
L04 25000 0.800 11.4 0.678
L05 6500 0.060 5.9 0.046
L06 12000 0.280 7.8 0.264
L07 30000 1.000 14.2 1.000
L08 9500 0.180 6.7 0.138
L09 7200 0.088 6.3 0.092
L10 18000 0.520 9.1 0.414
What just happened?
MinMaxScaler recorded the minimum and maximum of each column during .fit(), then compressed every value into the 0–1 range. L01 at $5,000, the smallest loan, becomes exactly 0.0, and L07 at $30,000, the largest, becomes exactly 1.0. All other loans sit proportionally between those anchors. The same logic applies to the interest-rate columns.
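One caveat worth a sketch: the min and max are frozen at fit time, so a value outside the training range maps outside [0, 1] at prediction time. The `clip` parameter used below is an assumption that your scikit-learn is version 0.24 or newer, where it was added.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[5000.0], [15000.0], [30000.0]])  # loan amounts seen at fit time
scaler = MinMaxScaler().fit(train)

# A larger loan at prediction time lands OUTSIDE the 0-1 range
print(scaler.transform([[40000.0]]))   # (40000 - 5000) / 25000 = 1.4

# clip=True (scikit-learn >= 0.24) pins such values back into the range
clipped = MinMaxScaler(clip=True).fit(train)
print(clipped.transform([[40000.0]]))  # clipped to the range maximum
```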
RobustScaler — When Outliers Are Present
The scenario: You're a data scientist at a logistics company modeling delivery time. One feature is package_weight_kg. Most packages weigh 0.5 to 10 kg. But once a month, an industrial shipment goes through at 800 kg. You can't remove it — it's a legitimate order. If you use StandardScaler, that single extreme value inflates the standard deviation and compresses every normal package into a tiny sliver near zero. RobustScaler solves this by using the median and IQR instead.
# Import pandas and both scalers for comparison
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler
# Delivery data — package_weight_kg has one extreme industrial shipment
delivery_df = pd.DataFrame({
    'order_id': ['O01', 'O02', 'O03', 'O04', 'O05',
                 'O06', 'O07', 'O08', 'O09', 'O10'],
    'package_weight_kg': [1.2, 3.5, 0.8, 5.1, 2.3,
                          4.7, 800.0, 1.9, 3.1, 2.8]
})
# Fit RobustScaler — uses median and IQR, both immune to extreme values
robust = RobustScaler()
robust.fit(delivery_df[['package_weight_kg']])
# Fit StandardScaler on the same column for comparison
standard = StandardScaler()
standard.fit(delivery_df[['package_weight_kg']])
# Apply both transformations
delivery_df['robust_scaled'] = robust.transform(delivery_df[['package_weight_kg']]).round(3)
delivery_df['standard_scaled'] = standard.transform(delivery_df[['package_weight_kg']]).round(3)
# Print the center and scale that RobustScaler learned
print(f"RobustScaler center (median): {robust.center_[0]:.2f}")
print(f"RobustScaler scale (IQR): {robust.scale_[0]:.2f}")
print(f"StandardScaler mean: {standard.mean_[0]:.2f}")
print(f"StandardScaler std: {standard.scale_[0]:.2f}")
print()
print(delivery_df.to_string(index=False))
RobustScaler center (median): 2.95
RobustScaler scale (IQR): 2.40
StandardScaler mean: 82.54
StandardScaler std: 239.16

order_id package_weight_kg robust_scaled standard_scaled
O01 1.2 -0.729 -0.340
O02 3.5 0.229 -0.330
O03 0.8 -0.896 -0.342
O04 5.1 0.896 -0.324
O05 2.3 -0.271 -0.336
O06 4.7 0.729 -0.325
O07 800.0 332.104 3.000
O08 1.9 -0.438 -0.337
O09 3.1 0.062 -0.332
O10 2.8 -0.062 -0.333
What just happened?
Both scalers processed the same column, but the results are completely different. StandardScaler let the 800 kg outlier inflate the mean and std so badly that all normal packages cluster between −0.34 and −0.32, practically identical. RobustScaler computes its center and scale from the median and IQR, which the outlier barely moves, so the normal packages spread usefully from −0.90 to +0.90. The outlier still shows up at 332; it hasn't been removed, but it no longer poisons the rest of the feature.
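The learned statistics are easy to sanity-check against numpy directly. This sketch recomputes the median and IQR for the same package weights and compares them with what RobustScaler stored:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

weights = np.array([1.2, 3.5, 0.8, 5.1, 2.3, 4.7,
                    800.0, 1.9, 3.1, 2.8]).reshape(-1, 1)

scaler = RobustScaler().fit(weights)

# center_ is the per-column median; scale_ is Q3 - Q1
q1, median, q3 = np.percentile(weights, [25, 50, 75])
print(median, q3 - q1)                      # numpy's answer
print(scaler.center_[0], scaler.scale_[0])  # matches the fitted scaler
```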
Scaler Selection Guide
| Scaler | Formula | Output Range | Use When |
|---|---|---|---|
| StandardScaler | (x − μ) / σ | Unbounded (~−3 to 3) | Linear models, PCA, SVMs, normal-ish data |
| MinMaxScaler | (x − min) / (max − min) | 0 to 1 | Neural networks, image data, bounded inputs |
| RobustScaler | (x − median) / IQR | Unbounded | Data with outliers you cannot remove |
| MaxAbsScaler | x / max(\|x\|) | −1 to 1 | Sparse data, text features, TF-IDF vectors |
Which Models Need Scaling — and Which Do Not
Scale-sensitive — always scale
Linear Regression
Logistic Regression
Support Vector Machines
K-Nearest Neighbors
K-Means Clustering
Neural Networks
PCA / dimensionality reduction
Scale-invariant — scaling optional
Decision Trees
Random Forest
Gradient Boosting (XGBoost, LightGBM)
Naive Bayes
Rule-based models
Tree models split on thresholds — the absolute scale of a feature doesn't affect which split is chosen.
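A quick sketch demonstrating this on synthetic data (not the lesson's dataset): the same decision tree trained on raw and standardized copies of the features produces identical predictions, because scaling only relabels the split thresholds.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Synthetic features on wildly different scales: sqft and metro distance
X = rng.uniform(low=[400, 0.1], high=[3000, 4.5], size=(200, 2))
y = (X[:, 0] > 1500).astype(int)   # label depends on a sqft threshold

raw_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

scaler = StandardScaler().fit(X)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)

# Thresholds differ in units, but the partitions (and predictions) match
same = (raw_tree.predict(X) == scaled_tree.predict(scaler.transform(X))).all()
print(same)
```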
Fit on train, transform on both
This rule from Lesson 10 applies to all scalers identically. Fit the scaler on training data only — never on the test set. Fitting on the full dataset leaks test statistics into training and inflates evaluation metrics.
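In code, the rule looks like this (toy data purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy single feature
y = np.arange(20)

# Step 1: split FIRST, before the scaler sees anything
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Step 2: fit on the training fold only
scaler = StandardScaler().fit(X_train)

# Step 3: transform both folds using the training-fold statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"mean learned from training rows only: {scaler.mean_[0]:.2f}")
```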
Scaling does not fix skewness
StandardScaler centers and compresses a distribution — it does not make a right-skewed feature normally distributed. If a feature has heavy skew, apply a log or power transformation first, then scale. Transformation and scaling solve different problems.
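A sketch of the two-step recipe on synthetic right-skewed data, with log-normal draws standing in for something like income (scipy assumed available for the skew measurement):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
incomes = rng.lognormal(mean=10, sigma=1, size=1000)  # heavy right skew

# Scaling alone: an affine transform never changes skewness
scaled = StandardScaler().fit_transform(incomes.reshape(-1, 1)).ravel()

# Transform first, then scale: log1p pulls in the long right tail
log_then_scaled = StandardScaler().fit_transform(
    np.log1p(incomes).reshape(-1, 1)).ravel()

print(f"skew after scaling only:  {skew(scaled):.2f}")
print(f"skew after log1p + scale: {skew(log_then_scaled):.2f}")
```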
Teacher's Note
A very common mistake is applying fit_transform() on the entire dataset before the train-test split. This means the scaler has already seen the test set's statistics — its mean, its std, its min and max — before the model ever touches the test data. That's data leakage, and it makes your test metrics optimistic and unreliable. The correct order is: split first, then fit the scaler on the training fold only, then transform both folds. If you're using a scikit-learn Pipeline, this is handled automatically — which is another strong reason to prefer pipelines in production code.
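A minimal Pipeline sketch on synthetic data: inside cross-validation, the whole pipeline is refit per fold, so the scaler is always fit on that fold's training portion only and merely applied to the held-out portion.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Scaling and modelling travel together, so no test statistics can leak
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```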
Practice Questions
1. Which scaler transforms data to have mean 0 and standard deviation 1?
2. Which scaler uses the median and interquartile range to resist the influence of outliers?
3. To avoid data leakage, a scaler should be fitted on the ________ data only.
Quiz
1. You are building a neural network and need all features compressed into the range 0 to 1. Which scaler should you use?
2. Which type of model does NOT require feature scaling?
3. What does feature scaling change about a distribution?
Up Next · Lesson 13
Encoding Basics
Turn text categories into numbers your model can actually use — label encoding, one-hot encoding, and when to choose each.