ML Lesson 5 – Feature Scaling | Dataplexa

Feature Scaling

In the previous lesson, we cleaned and prepared our dataset by handling missing values, removing duplicates, and fixing data issues.

Now we move to the next critical step in Machine Learning — Feature Scaling. This step becomes extremely important once we start training models.


Why Feature Scaling is Needed

Machine Learning models learn patterns based on numerical values. If features have very different ranges, the model may become biased toward features with larger values.

In our dataset, consider the following columns:

  • house_size → values in hundreds or thousands
  • location_score → values between 1 and 10
  • customer_income → values in tens of thousands

Without scaling, features like customer income may dominate the learning process, even if they are not the most important predictors.
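The size of this effect is easy to see numerically. Below is a minimal sketch with two made-up rows (the values are illustrative, not taken from the dataset):

```python
import numpy as np

# Two hypothetical houses: [house_size, location_score, customer_income]
a = np.array([1200.0, 8.0, 45000.0])
b = np.array([1500.0, 2.0, 46000.0])

# Raw Euclidean distance between the two rows
raw_dist = np.linalg.norm(a - b)  # ~1044

# The income gap (1000) swamps the location_score gap (6),
# even though a 6-point location difference may matter far more.
print(raw_dist)
print(abs(a[2] - b[2]), abs(a[1] - b[1]))
```

Almost the entire distance comes from customer_income; location_score is effectively invisible to any model that compares rows this way.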


Real-World Example (Easy to Understand)

Imagine evaluating students using two criteria:

  • Marks scored (out of 100)
  • Attendance score (out of 10)

If you simply add these numbers, marks will dominate attendance. Scaling both to the same range ensures fair comparison.
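The arithmetic behind this example can be sketched as follows (the student names and scores are made up):

```python
# Hypothetical students: (marks out of 100, attendance out of 10)
students = {"A": (60, 9), "B": (90, 2)}

# Naive sum: marks dominate, so B wins despite very poor attendance
naive = {name: marks + att for name, (marks, att) in students.items()}
print(naive)  # {'A': 69, 'B': 92}

# Scale each criterion to the 0-1 range before adding
scaled = {name: marks / 100 + att / 10 for name, (marks, att) in students.items()}
print(scaled)  # A now ranks higher than B
```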

Machine Learning works the same way.


Types of Feature Scaling

There are two commonly used feature scaling techniques in Machine Learning:

1. Standardization (Z-Score Scaling)

Standardization converts values so that they have:

  • Mean = 0
  • Standard Deviation = 1

This method works very well for algorithms like Linear Regression, Logistic Regression, and SVM.
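The transformation itself is z = (x − mean) / std. A quick check with toy numbers (StandardScaler uses the same population standard deviation as NumPy's default):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# z-score: subtract the mean, then divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z)         # values symmetric around 0
print(z.mean())  # ~0
print(z.std())   # ~1
```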

2. Normalization (Min-Max Scaling)

Normalization scales values between 0 and 1.

This method is commonly used when features must stay within a fixed, bounded range, for example pixel intensities or inputs to neural networks.
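A minimal sketch with toy values, using sklearn's MinMaxScaler, which implements x' = (x − min) / (max − min):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with a small toy range of values
x = np.array([[10.0], [25.0], [40.0], [100.0]])

# Smallest value maps to 0, largest to 1, everything else in between
scaled = MinMaxScaler().fit_transform(x)
print(scaled.ravel())
```

One caveat: Min-Max scaling is sensitive to outliers, because a single extreme value compresses all the other values toward one end of the range.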


Loading the Dataset (Same Dataset)

We continue working with the Dataplexa ML Housing & Customer Dataset.

import pandas as pd

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
df.head()

Applying Feature Scaling (Standardization)

We will scale numerical features so that no single feature dominates the model.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

numerical_features = [
    "house_size",
    "bedrooms",
    "location_score",
    "age_of_house",
    "customer_income"
]

df[numerical_features] = scaler.fit_transform(df[numerical_features])
df.head()

After scaling, all numerical features are centered around zero and have similar ranges.
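One practical caveat worth noting: in a full modeling workflow the scaler is normally fitted on the training split only, and the test split is transformed with those same statistics, so information from the test data does not leak into training. A sketch with synthetic values (the column names mirror the lesson's dataset, but the data here is randomly generated):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "house_size": rng.uniform(500, 3000, size=100),
    "customer_income": rng.uniform(20000, 90000, size=100),
})

train, test = train_test_split(df, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from train only
test_scaled = scaler.transform(test)        # reuse the same mean and std

print(train_scaled.mean(axis=0))  # ~[0, 0] on the training split
```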


What Happens If We Skip Feature Scaling?

If scaling is skipped:

  • Distance-based algorithms like KNN perform poorly
  • Gradient-based algorithms converge slowly
  • Model accuracy may drop significantly

This is why feature scaling is considered a mandatory step in most ML workflows.
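The KNN point can be demonstrated directly. With made-up rows, the nearest neighbour of a query point changes once features are standardized:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: columns are [location_score, customer_income]
X = np.array([
    [8.0, 60000.0],   # same neighbourhood quality as the query
    [1.0, 50500.0],   # very different neighbourhood, similar income
    [5.0, 30000.0],
    [3.0, 80000.0],
])
query = np.array([[8.0, 50000.0]])

def nearest(points, q):
    # index of the closest row by Euclidean distance
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

print(nearest(X, query))  # 1 -> income gap dominates the raw distance

scaler = StandardScaler().fit(X)
print(nearest(scaler.transform(X), scaler.transform(query)))  # 0
```

On raw values the query matches the income-similar row; after standardization it matches the row with the same location_score, which is the behaviour we actually want.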


Real-World Industry Insight

In recommendation systems, user behavior features may vary widely in scale. Without scaling, recommendations become biased.

In financial systems, income and transaction counts must be scaled to prevent unstable predictions.


Mini Practice

Think about our dataset and answer the following:

  • Which feature had the largest original range?
  • Why should bedrooms also be scaled even though values are small?

Exercises

Exercise 1:
Why does feature scaling improve model performance?

It puts all features on a comparable scale, so no feature dominates the learning process simply because of its units or magnitude.

Exercise 2:
Which algorithms are most affected if scaling is skipped?

Distance-based algorithms such as KNN and SVM, and gradient-descent-based models such as Linear and Logistic Regression. Tree-based models are largely unaffected.

Exercise 3:
Should categorical features be scaled?

No, only numerical features should be scaled.
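In scikit-learn this split is conveniently expressed with a ColumnTransformer, sketched here on a tiny made-up frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame mixing numeric and categorical columns
df = pd.DataFrame({
    "house_size": [900.0, 1500.0, 2400.0],
    "city": ["Pune", "Delhi", "Pune"],  # categorical: left untouched
})

ct = ColumnTransformer(
    [("num", StandardScaler(), ["house_size"])],
    remainder="passthrough",  # pass categorical columns through unscaled
)
out = ct.fit_transform(df)
print(out)  # scaled house_size alongside the original city labels
```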

Quick Quiz

Q1. What is the main goal of feature scaling?

To bring all numerical features to a comparable scale.

Q2. Which scaling method centers data around zero?

Standardization (Z-score scaling).

In the next lesson, we will continue using the same dataset and learn how to perform data cleaning in deeper detail, including handling outliers and inconsistent data.