Data Cleaning (Advanced)
In the previous lesson, we applied feature scaling to ensure all numerical features contribute equally to the learning process.
Now we move deeper into one of the most important and time-consuming steps in real-world Machine Learning projects — advanced data cleaning.
A common industry estimate: roughly 80% of ML work happens before model training. If the data is poor, even the best algorithm will fail.
What Does Data Cleaning Really Mean?
Data cleaning is not just about removing missing values. It is about making sure the data truly represents reality and does not mislead the model.
Advanced data cleaning includes:
- Detecting and treating outliers
- Handling inconsistent values
- Fixing data entry errors
- Ensuring logical data relationships
Using Our Dataset (No Change)
We continue using the same dataset you downloaded earlier:
Dataplexa ML Housing & Customer Dataset
import pandas as pd

# Load the same dataset used in the previous lessons
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
df.head()
Outliers – The Hidden Problem
Outliers are extreme values that are very different from the rest of the data. They may occur due to:
- Data entry mistakes
- Measurement errors
- Rare but valid real-world events
If not handled carefully, outliers can distort model learning.
Real-World Example
Suppose most houses in our dataset cost between $100,000 and $500,000.
If one record shows a house price of $50,000,000, the model may shift its predictions unfairly.
This is why outlier handling is critical.
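A quick numeric sketch makes the distortion concrete. The prices below are made up for illustration and are not taken from the dataset:

```python
# Hypothetical prices: five typical homes plus one extreme outlier
typical = [120_000, 180_000, 250_000, 320_000, 480_000]
with_outlier = typical + [50_000_000]

mean_typical = sum(typical) / len(typical)
mean_with_outlier = sum(with_outlier) / len(with_outlier)

print(mean_typical)       # 270000.0
print(mean_with_outlier)  # roughly 8.56 million: one record dominates the average
```

A single extreme record pulls the average price from $270,000 to over $8.5 million, which is exactly the kind of shift that can mislead a model trained on this feature.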
Detecting Outliers Using IQR
One common statistical method to detect outliers is Interquartile Range (IQR).
# Quartiles of the price column
Q1 = df["house_price"].quantile(0.25)
Q3 = df["house_price"].quantile(0.75)
IQR = Q3 - Q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df["house_price"] < lower_bound) |
              (df["house_price"] > upper_bound)]
outliers.head()
This code helps us identify values that fall far outside the normal range.
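To see the IQR rule in isolation, here is a self-contained sketch on a small synthetic series (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Synthetic prices: mostly clustered, with one extreme value
prices = pd.Series([100, 120, 130, 140, 150, 160, 170, 1000])

Q1 = prices.quantile(0.25)
Q3 = prices.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Only values outside [lower, upper] are flagged
flagged = prices[(prices < lower) | (prices > upper)]
print(flagged.tolist())  # [1000]
```

The clustered values all fall inside the bounds; only the extreme 1000 is flagged.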
Handling Outliers (Safe Approach)
In real projects, we usually do NOT delete outliers blindly.
Instead, we may:
- Cap extreme values
- Replace with median values
- Analyze if the value is valid
# Cap prices at the IQR bounds instead of deleting rows
df["house_price"] = df["house_price"].clip(
    lower=lower_bound,
    upper=upper_bound
)
This capping technique is a form of winsorization and is widely used in industry.
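Capping is only one option from the list above; another is replacing flagged values with the median. A minimal sketch, using a made-up frame with the same column name as our dataset:

```python
import pandas as pd

# Illustrative frame with one extreme price (values are made up)
df_demo = pd.DataFrame(
    {"house_price": [100_000.0, 150_000.0, 200_000.0, 250_000.0, 50_000_000.0]}
)

Q1 = df_demo["house_price"].quantile(0.25)
Q3 = df_demo["house_price"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Median of the in-range values only, so the outlier cannot influence it
in_range = df_demo["house_price"].between(lower, upper)
median_in_range = df_demo.loc[in_range, "house_price"].median()

# Replace out-of-range values with that median
df_demo.loc[~in_range, "house_price"] = median_in_range
print(df_demo["house_price"].tolist())
```

Because the median is computed from the in-range values, the replacement value stays representative of typical prices.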
Fixing Logical Inconsistencies
Data may look clean numerically but still be logically wrong.
Example:
- House age = 120 years
- Bedrooms = 0
Such values may confuse the model.
# Drop rows that violate basic domain logic
df = df[df["bedrooms"] > 0]          # a house must have at least one bedroom
df = df[df["age_of_house"] <= 100]   # ages above 100 are treated as entry errors here
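In practice it helps to report how many rows each rule removes before dropping them. A small sketch on an illustrative frame (column names match the lesson; the values are made up):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "bedrooms": [3, 0, 2, 4],
    "age_of_house": [10, 25, 120, 40],
})

# Each rule is a boolean mask: True means the row passes
rules = {
    "bedrooms > 0": df_demo["bedrooms"] > 0,
    "age_of_house <= 100": df_demo["age_of_house"] <= 100,
}

for name, mask in rules.items():
    print(f"{name}: drops {(~mask).sum()} row(s)")

# Keep only rows that satisfy every rule
valid = pd.concat(rules, axis=1).all(axis=1)
df_demo = df_demo[valid]
print(len(df_demo))  # 2 rows survive
```

Logging the per-rule counts makes it easy to spot a filter that is silently removing far more data than expected.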
Why We Do This Before Model Training
Machine Learning models learn patterns, not common sense.
If incorrect values exist, the model assumes they are valid. That is why cleaning must happen before training begins.
Mini Practice
Think carefully:
- Should we remove a billionaire's house purchase as an outlier?
- When should outliers be kept?
Exercises
Exercise 1:
What is the main risk of keeping extreme outliers?
Exercise 2:
Why is median preferred over mean when replacing outliers?
Exercise 3:
Should all outliers always be removed?
Quick Quiz
Q1. What does IQR stand for?
Q2. Why should data cleaning happen before modeling?
In the next lesson, we will continue using the same dataset and learn how data visualization helps us understand patterns before selecting algorithms.