ML Lesson X – Data Preprocessing | Dataplexa

Data Preprocessing

In the previous lesson, you learned the complete Machine Learning workflow. Now we move to the first hands-on technical step of that workflow — Data Preprocessing.

From this lesson onward, we will work with a single dataset throughout the entire Machine Learning module. Each lesson will transform, improve, and build on the same data step by step.


The Dataset Used in This Course

For this Machine Learning course, we will use a real-world inspired dataset called the Dataplexa ML Housing & Customer Dataset.

This dataset is designed specifically for learning purposes and will be reused in data preprocessing, feature scaling, regression, classification, clustering, model evaluation, and the final ML project.

Download the Dataset

Before continuing, download the dataset using the button below.

Download Dataplexa ML Housing & Customer Dataset (CSV)

After downloading, place the CSV file inside your project folder (if the download arrives as a ZIP archive, extract it first).


What is Data Preprocessing?

Data preprocessing is the process of converting raw data into a clean and structured format suitable for Machine Learning models.

In real-world applications, data is never perfect. It may contain missing values, incorrect entries, duplicate records, and inconsistent formats.

If we train a model on such data, the model will learn incorrect patterns and produce unreliable results.


Understanding the Dataset Structure

Let us first load the dataset and understand what kind of data we are working with.

import pandas as pd

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
print(df.head())

Each row represents information about a house and a customer. Some columns are inputs (features), and some will later be used as targets.
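Before cleaning anything, it helps to get a quick overview of the columns, their data types, and how many values are missing. The calls below work on any DataFrame; they are shown here on a small illustrative frame with made-up values, since the real dataset has many more rows and columns:

```python
import pandas as pd

# Small illustrative frame with the kinds of columns used in this lesson
df = pd.DataFrame({
    "house_size": [1200.0, None, 1430.0],
    "bedrooms": [3, 2, None],
    "price": [250000.0, 180000.0, None],
})

df.info()                  # column names, dtypes, non-null counts
print(df.isnull().sum())   # number of missing values per column
print(df.shape)            # (rows, columns)
```

Run the same three calls on the downloaded dataset to see which columns need attention.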


Why Data Preprocessing is Critical

Consider a real-world housing price prediction system.

If house size is missing or incorrectly recorded, the model may predict unrealistic prices. Similarly, if customer income is wrong, classification decisions may fail.

Data preprocessing ensures:

  • The model learns from accurate information
  • Noise and bias are reduced
  • Predictions become trustworthy

Handling Missing Values

Missing values occur when information is never recorded or is lost. This is extremely common in real-world datasets.

Instead of deleting large amounts of data, we usually replace missing values with meaningful estimates.

Real-World Explanation

If the size of a house is missing, we can reasonably estimate it using the average size of other houses.

# Fill missing numeric values with the mean or median of each column
# (assignment is preferred over fillna(..., inplace=True), which is
# unreliable on a single column and deprecated in recent pandas)
df["house_size"] = df["house_size"].fillna(df["house_size"].mean())
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())
df["location_score"] = df["location_score"].fillna(df["location_score"].median())
df["price"] = df["price"].fillna(df["price"].mean())

This approach is widely used in industry when missing data is limited.
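After filling, it is good practice to confirm that no missing values remain in the columns you treated. A minimal check, using a small illustrative frame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"house_size": [1200.0, None, 1430.0],
                   "bedrooms": [3.0, 2.0, None]})

# Same fill strategy as above: mean for size, median for bedrooms
df["house_size"] = df["house_size"].fillna(df["house_size"].mean())
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

print(df.isnull().sum())   # every count should now be 0
```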


Removing Duplicate Records

Duplicate records occur when the same data is stored multiple times. This may happen due to system errors or multiple data sources.

If duplicates are not removed, the model may give extra importance to repeated records.

# Remove duplicate rows
df = df.drop_duplicates()

Removing duplicates ensures that each data point contributes equally.
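You can also count how many duplicate rows exist before dropping them, which tells you how large the problem was. A short sketch on a small illustrative frame (the first and third rows are identical):

```python
import pandas as pd

df = pd.DataFrame({"house_size": [1200, 950, 1200],
                   "price": [250000, 180000, 250000]})

print(df.duplicated().sum())   # number of repeated rows (here: 1)
df = df.drop_duplicates()
print(len(df))                 # rows remaining (here: 2)
```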


Checking and Correcting Data Types

Machine Learning models work with numeric values. If numeric data is stored as text, the model cannot process it correctly.

For example, house size must be numeric, not a string. Ensuring correct data types is a fundamental preprocessing step.
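In pandas, you can inspect column types with `df.dtypes` and convert text columns to numbers with `pd.to_numeric`. The sketch below uses a small illustrative frame where house size was imported as text and one entry is unreadable:

```python
import pandas as pd

# Illustrative frame: house_size stored as text, with one bad entry
df = pd.DataFrame({"house_size": ["1200", "950", "n/a"],
                   "price": [250000, 180000, 310000]})

print(df.dtypes)  # house_size is object (text), not numeric

# Convert text to numbers; entries that cannot be parsed become NaN
df["house_size"] = pd.to_numeric(df["house_size"], errors="coerce")

print(df.dtypes)  # house_size is now float64
```

Any NaN values produced by the conversion can then be filled using the same mean/median strategy shown earlier.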


Real-World Impact of Poor Preprocessing

In banking systems, incorrect preprocessing can lead to loan approvals for risky customers or rejection of valid ones.

In healthcare, missing or wrong values may result in incorrect diagnoses.

That is why preprocessing is never skipped in professional ML projects.


Mini Practice

Think about the dataset you loaded:

  • Which columns might have missing values in real life?
  • Which columns are more important for predicting house price?
  • Why is customer income important for classification?

Exercises

Exercise 1: Why should we not train a model directly on raw data?

Because raw data contains errors, missing values, and inconsistencies that negatively affect model performance.

Exercise 2: Name one real-world problem caused by duplicate data.

Duplicate data can bias the model and cause incorrect predictions.

Exercise 3: Why is replacing missing values sometimes better than deleting rows?

Deleting rows may remove valuable data, while replacement preserves information.

Quick Quiz

Q1. What is the main goal of data preprocessing?

To convert raw data into a clean, structured format suitable for ML models.

Q2. Which value is commonly used to replace missing numerical data?

Mean or median values.

In the next lesson, we will continue working with the same dataset and learn how to apply feature scaling so that Machine Learning models treat all features fairly.