ML Lesson 6 – Data Cleaning | Dataplexa

Data Cleaning (Advanced)

In the previous lesson, we applied feature scaling to ensure all numerical features contribute equally to the learning process.

Now we move deeper into one of the most important and time-consuming steps in real-world Machine Learning projects — advanced data cleaning.

Industry truth: practitioners commonly estimate that around 80% of ML project effort happens before model training. If the data is poor, even the best algorithm will fail.


What Does Data Cleaning Really Mean?

Data cleaning is not just about removing missing values. It is about making sure the data truly represents reality and does not mislead the model.

Advanced data cleaning includes:

  • Detecting and treating outliers
  • Handling inconsistent values
  • Fixing data entry errors
  • Ensuring logical data relationships
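Inconsistent values and entry errors often hide in text columns, where the same category appears under several spellings. A minimal sketch of normalizing them, using a small made-up frame (the `city` column is illustrative, not from the course dataset):

```python
import pandas as pd

# Illustrative data: one city recorded with inconsistent spellings
# and stray whitespace (not from the course dataset).
df_demo = pd.DataFrame({"city": ["New York", "new york ", "NY", "Boston"]})

# Normalize case and whitespace, then map known aliases to one label.
df_demo["city"] = df_demo["city"].str.strip().str.lower()
df_demo["city"] = df_demo["city"].replace({"ny": "new york"})

print(df_demo["city"].unique())  # ['new york' 'boston']
```

The same pattern (strip, lowercase, map aliases) handles most categorical inconsistencies before encoding.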

Using Our Dataset (No Change)

We continue using the same dataset you downloaded earlier:

Dataplexa ML Housing & Customer Dataset

import pandas as pd

# Load the dataset into a DataFrame and preview the first rows
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
df.head()
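Before cleaning anything, it helps to get a quick statistical summary and a missing-value count. A sketch on a tiny stand-in frame, since the course CSV is not bundled here:

```python
import pandas as pd

# Stand-in frame for illustration; on the course data you would
# run the same two calls on the loaded df.
df_demo = pd.DataFrame({"house_price": [120_000, 250_000, 50_000_000]})

print(df_demo.describe())    # count, mean, std, min, quartiles, max
print(df_demo.isna().sum())  # missing values per column
```

Extreme min/max values in `describe()` are often the first hint that outliers exist.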

Outliers – The Hidden Problem

Outliers are extreme values that are very different from the rest of the data. They may occur due to:

  • Data entry mistakes
  • Measurement errors
  • Rare but valid real-world events

If not handled carefully, outliers can distort model learning.


Real-World Example

Suppose most houses in our dataset cost between $100,000 and $500,000.

If one record shows a house price of $50,000,000, that single value can pull the model's predictions away from the typical range.

This is why outlier handling is critical.
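A quick way to see the damage: compare the mean and the median of a small set of hypothetical prices once one extreme record is included (the numbers below are made up for illustration):

```python
import pandas as pd

# Five typical prices plus one extreme record.
prices = pd.Series([120_000, 250_000, 300_000, 410_000, 480_000, 50_000_000])

print(prices.mean())    # pulled far above the typical range
print(prices.median())  # stays near the typical range (355,000.0)
```

One record is enough to drag the mean into the millions while the median barely moves, which is exactly the distortion an untreated outlier passes on to the model.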


Detecting Outliers Using IQR

One common statistical method to detect outliers is Interquartile Range (IQR).

Q1 = df["house_price"].quantile(0.25)
Q3 = df["house_price"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df["house_price"] < lower_bound) | 
              (df["house_price"] > upper_bound)]

outliers.head()

This code helps us identify values that fall far outside the normal range.


Handling Outliers (Safe Approach)

In real projects, we usually do NOT delete outliers blindly.

Instead, we may:

  • Cap extreme values
  • Replace with median values
  • Analyze whether the value is valid

For example, capping values at the IQR bounds computed above:

df["house_price"] = df["house_price"].clip(
    lower=lower_bound,
    upper=upper_bound
)

This technique is called winsorization and is widely used in industry.
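The list above also mentions replacing outliers with the median. A self-contained sketch of that alternative, reusing the IQR bounds on the same hypothetical prices (on the course dataset the mask would use `df["house_price"]` and the bounds computed earlier):

```python
import pandas as pd

# Hypothetical prices with one extreme record.
prices = pd.Series([120_000, 250_000, 300_000, 410_000, 480_000, 50_000_000])

# IQR bounds, as in the detection step.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace out-of-range values with the median instead of clipping.
median_price = prices.median()
mask = (prices < lower) | (prices > upper)
cleaned = prices.mask(mask, median_price)

print(cleaned.iloc[-1])  # the $50,000,000 record becomes 355,000.0
```

Median replacement keeps the row (and its other features) while removing the extreme influence; clipping instead preserves the fact that the value was at the high end.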


Fixing Logical Inconsistencies

Data may look clean numerically but still be logically wrong.

Example:

  • House age = 120 years
  • Bedrooms = 0

Such values may confuse the model.

# Drop rows that break simple logical rules
df = df[df["bedrooms"] > 0]
df = df[df["age_of_house"] <= 100]
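Before dropping rows it is worth counting how many each rule removes, so a typo in a condition does not silently delete most of the dataset. A sketch on made-up rows (the thresholds mirror the filters above):

```python
import pandas as pd

# Hypothetical rows to illustrate checking rules before dropping.
df_demo = pd.DataFrame({
    "bedrooms": [3, 0, 2],
    "age_of_house": [15, 40, 120],
})

# Count violations of each rule first.
print((df_demo["bedrooms"] <= 0).sum())       # 1
print((df_demo["age_of_house"] > 100).sum())  # 1

# Then apply both filters in one step.
df_demo = df_demo[(df_demo["bedrooms"] > 0) & (df_demo["age_of_house"] <= 100)]
print(len(df_demo))  # 1
```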

Why We Do This Before Model Training

Machine Learning models learn patterns, not common sense.

If incorrect values exist, the model assumes they are valid. That is why cleaning must happen before training begins.


Mini Practice

Think carefully:

  • Should we treat a billionaire's house purchase as an outlier and remove it?
  • When should outliers be kept?

Exercises

Exercise 1:
What is the main risk of keeping extreme outliers?

Outliers can distort model learning and reduce prediction accuracy.

Exercise 2:
Why is median preferred over mean when replacing outliers?

Median is less affected by extreme values.

Exercise 3:
Should all outliers always be removed?

No. Some outliers may represent valid rare real-world cases.

Quick Quiz

Q1. What does IQR stand for?

Interquartile Range.

Q2. Why should data cleaning happen before modeling?

Models blindly learn patterns from the data they are given.

In the next lesson, we will continue using the same dataset and learn how data visualization helps us understand patterns before selecting algorithms.