Data Preprocessing
In the previous lesson, you learned the complete Machine Learning workflow. Now we move to the first hands-on technical step of that workflow — Data Preprocessing.
From this lesson onward, we will work with a single dataset throughout the entire Machine Learning module. Each lesson will build on, transform, and reuse the same data step by step.
The Dataset Used in This Course
For this Machine Learning course, we will use a real-world inspired dataset called the Dataplexa ML Housing & Customer Dataset.
This dataset is designed specifically for learning purposes and will be reused in data preprocessing, feature scaling, regression, classification, clustering, model evaluation, and the final ML project.
Download the Dataset
Before continuing, download the dataset using the button below.
Download Dataplexa ML Housing & Customer Dataset (CSV)
After downloading, place the CSV file inside your project folder (if it arrives as a ZIP archive, extract it first).
What is Data Preprocessing?
Data preprocessing is the process of converting raw data into a clean and structured format suitable for Machine Learning models.
In real-world applications, data is never perfect. It may contain missing values, incorrect entries, duplicate records, and inconsistent formats.
If we train a model on such data, the model will learn incorrect patterns and produce unreliable results.
Understanding the Dataset Structure
Let us first load the dataset and understand what kind of data we are working with.
import pandas as pd
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
print(df.head())
Each row represents information about a house and a customer. Some columns are inputs (features), and some will later be used as targets.
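Beyond head(), pandas can summarize the shape and column types of the data, which is useful before any preprocessing. Here is a minimal sketch using a small stand-in DataFrame with a few of the column names this lesson uses later; the real CSV may contain more columns:

```python
import pandas as pd

# Stand-in for the downloaded CSV; the real file may have more columns.
df = pd.DataFrame({
    "house_size": [1200.0, 1500.0, None],
    "bedrooms": [3, 2, 4],
    "price": [250000.0, 310000.0, 400000.0],
})

print(df.shape)   # number of (rows, columns)
print(df.dtypes)  # data type of each column
```

Running the same calls on the full dataset tells you immediately how many records you have and whether any column was loaded with an unexpected type.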
Why Data Preprocessing is Critical
Consider a real-world housing price prediction system.
If house size is missing or incorrectly recorded, the model may predict unrealistic prices. Similarly, if customer income is wrong, classification decisions may fail.
Data preprocessing ensures:
- The model learns from accurate information
- Noise and bias are reduced
- Predictions become trustworthy
Handling Missing Values
Missing values occur when information is not recorded or is lost during collection. This is extremely common in real-world datasets.
Instead of deleting large amounts of data, we usually replace missing values with meaningful estimates.
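Before replacing anything, it helps to count how many values are actually missing in each column. A short sketch, using a small example frame with column names matching this lesson's dataset:

```python
import pandas as pd

# Small example frame with deliberately missing entries
df = pd.DataFrame({
    "house_size": [1200.0, None, 1800.0, None],
    "price": [250000.0, 310000.0, None, 400000.0],
})

# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)  # house_size: 2 missing, price: 1 missing
```

If only a small fraction of a column is missing, replacement is usually safe; if most of a column is missing, that column may need different handling.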
Real-World Explanation
If the size of a house is missing, we can reasonably estimate it using the average size of other houses.
# Fill missing numeric values using the mean or median
# (assigning the result back avoids the deprecated inplace pattern)
df["house_size"] = df["house_size"].fillna(df["house_size"].mean())
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())
df["location_score"] = df["location_score"].fillna(df["location_score"].median())
df["price"] = df["price"].fillna(df["price"].mean())
This approach is widely used in industry when missing data is limited.
Removing Duplicate Records
Duplicate records occur when the same data is stored multiple times. This may happen due to system errors or multiple data sources.
If duplicates are not removed, the model may give extra importance to repeated records.
# Remove duplicate rows
df = df.drop_duplicates()
Removing duplicates ensures that each data point contributes equally.
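It is good practice to count duplicates before dropping them, so you know how much data is affected. A minimal sketch with a small example frame (the real dataset's duplicates would be found the same way):

```python
import pandas as pd

# Example frame where the first two rows are identical
df = pd.DataFrame({
    "house_size": [1200.0, 1200.0, 1500.0],
    "price": [250000.0, 250000.0, 310000.0],
})

# duplicated() flags rows that repeat an earlier row exactly
print(df.duplicated().sum())  # 1 fully duplicated row

df = df.drop_duplicates()
print(len(df))  # 2 unique rows remain
```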
Checking and Correcting Data Types
Machine Learning models work with numeric values. If numeric data is stored as text, the model cannot process it correctly.
For example, house size must be numeric, not a string. Ensuring correct data types is a fundamental preprocessing step.
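If a numeric column was loaded as text, pd.to_numeric can convert it; with errors="coerce", entries that cannot be parsed become NaN (which you can then handle like any other missing value) instead of raising an error. A sketch with a hypothetical bad entry:

```python
import pandas as pd

# house_size accidentally stored as text, with one unparseable entry
df = pd.DataFrame({"house_size": ["1200", "1500", "unknown"]})
print(df["house_size"].dtype)  # object (text)

# errors="coerce" turns unparseable entries into NaN instead of failing
df["house_size"] = pd.to_numeric(df["house_size"], errors="coerce")
print(df["house_size"].dtype)  # float64
```

After conversion, the "unknown" entry shows up as a missing value, so the imputation techniques from earlier in this lesson apply to it as well.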
Real-World Impact of Poor Preprocessing
In banking systems, incorrect preprocessing can lead to loan approvals for risky customers or rejection of valid ones.
In healthcare, missing or wrong values may result in incorrect diagnoses.
That is why preprocessing is never skipped in professional ML projects.
Mini Practice
Think about the dataset you loaded:
- Which columns might have missing values in real life?
- Which columns are more important for predicting house price?
- Why is customer income important for classification?
Exercises
Exercise 1: Why should we not train a model directly on raw data?
Exercise 2: Name one real-world problem caused by duplicate data.
Exercise 3: Why is replacing missing values sometimes better than deleting rows?
Quick Quiz
Q1. What is the main goal of data preprocessing?
Q2. Which value is commonly used to replace missing numerical data?
In the next lesson, we will continue working with the same dataset and learn how to apply feature scaling so that Machine Learning models treat all features fairly.