Pandas Lesson 8 – Data Cleaning | Dataplexa

Data Cleaning in Pandas

Data cleaning is the process of preparing raw data for analysis by fixing errors, inconsistencies, and unwanted values.

Even after handling missing values, datasets often contain duplicated rows, incorrect formats, inconsistent text, or unnecessary columns.


Why Data Cleaning Is Important

Clean data ensures:

  • Accurate calculations
  • Reliable analysis results
  • Better visualizations
  • Fewer errors in machine learning models

In real projects, data cleaning often takes more time than the analysis itself.


Loading the Dataset

We continue using the same dataset. Make sure it is loaded correctly.

import pandas as pd

df = pd.read_csv("dataplexa_pandas_sales.csv")

Removing Duplicate Rows

Duplicate rows can distort totals and averages.

To check for duplicates (this returns a boolean Series that is True for every row that repeats an earlier row):

df.duplicated()

To remove duplicate rows:

df = df.drop_duplicates()
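
As a minimal sketch of how these two calls behave, the example below uses a small made-up DataFrame. The `sample` data and its column names are assumptions for illustration, not the contents of the lesson's CSV file.

```python
import pandas as pd

# Hypothetical sample data; the real dataset's columns may differ.
sample = pd.DataFrame({
    "region": ["East", "West", "East", "North"],
    "sales": [100, 200, 100, 150],
})

# duplicated() marks each row that repeats an earlier row.
dup_mask = sample.duplicated()
dup_count = int(dup_mask.sum())  # how many repeated rows exist

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(dup_count)     # 1
print(len(deduped))  # 3
```

Summing the boolean mask is a quick way to see how many rows would be removed before actually dropping them.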

Standardizing Column Names

Inconsistent column names make code harder to read.

A common practice is to convert all column names to lowercase and replace spaces with underscores.

df.columns = df.columns.str.lower().str.replace(" ", "_")

This makes column access predictable and clean.
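
The sketch below shows the same idea on hypothetical column names (they are assumptions, not the actual headers in the lesson's file). It also strips leading and trailing whitespace first, which is an extra step beyond the line above.

```python
import pandas as pd

# Hypothetical, messy column names for illustration only.
sample = pd.DataFrame(columns=["Order ID", " Region ", "Sales Amount"])

sample.columns = (
    sample.columns
    .str.strip()                          # drop leading/trailing spaces
    .str.lower()                          # lowercase everything
    .str.replace(" ", "_", regex=False)   # inner spaces become underscores
)

cleaned_names = list(sample.columns)
print(cleaned_names)  # ['order_id', 'region', 'sales_amount']
```

Stripping before replacing avoids names such as `_region_` when a header has stray spaces around it.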


Cleaning Text Data

Text columns often contain extra spaces, mixed letter cases, or inconsistent values.

Example: Cleaning a region column.

df["region"] = df["region"].str.strip().str.title()

This removes extra spaces and standardizes text formatting.
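
A short sketch of the effect, using made-up region values (assumed for illustration):

```python
import pandas as pd

# Hypothetical values with stray spaces, mixed case, and one missing entry.
sample = pd.DataFrame({"region": ["  east", "WEST ", "north east", None]})

# strip() trims whitespace at both ends; title() capitalizes each word.
sample["region"] = sample["region"].str.strip().str.title()

regions = sample["region"].tolist()
print(regions[:3])  # ['East', 'West', 'North East']
```

Missing entries stay missing, so they still need to be handled separately.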


Fixing Incorrect Data Types

Sometimes numeric data is stored as text. Arithmetic and aggregations such as sums or means then fail or give misleading results.

Check data types:

df.dtypes

Convert a column to numeric:

df["sales"] = pd.to_numeric(df["sales"], errors="coerce")

With errors="coerce", values that cannot be parsed as numbers become NaN instead of raising an error.
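
Because coercion silently turns bad values into NaN, it can help to look at which entries failed. The following is a minimal sketch on hypothetical values; the `sample` data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical text column where two entries are not valid numbers.
sample = pd.DataFrame({"sales": ["100", "250.5", "n/a", "abc"]})

raw = sample["sales"]                            # keep the original text
converted = pd.to_numeric(raw, errors="coerce")  # unparseable -> NaN

# Entries that had a value before conversion but are NaN afterwards.
failed = raw[converted.isna() & raw.notna()]
invalid_count = len(failed)

sample["sales"] = converted
print(invalid_count)     # 2
print(failed.tolist())   # ['n/a', 'abc']
```

Inspecting `failed` before overwriting the column makes it easier to decide whether those rows should be fixed, filled, or dropped.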


Removing Unnecessary Columns

Datasets often include columns that are not useful.

Remove unwanted columns:

df = df.drop(columns=["unnecessary_column"])

Keeping only relevant columns improves performance and clarity.
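
By default, drop raises a KeyError if a listed column does not exist. One option, sketched below on hypothetical columns (`notes` and `internal_id` are made-up names), is to pass errors="ignore" so that missing columns are skipped.

```python
import pandas as pd

# Hypothetical data; "internal_id" is deliberately absent.
sample = pd.DataFrame({"region": ["East"], "sales": [100], "notes": ["n/a"]})

# errors="ignore" drops the columns that exist and skips the rest.
trimmed = sample.drop(columns=["notes", "internal_id"], errors="ignore")

remaining = list(trimmed.columns)
print(remaining)  # ['region', 'sales']
```

This is convenient in reusable cleaning scripts, although the default behavior is safer when a missing column should be treated as a mistake.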


Handling Out-of-Range Values

Some values may be logically incorrect, such as negative sales.

Filter out invalid values:

df = df[df["sales"] >= 0]

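This keeps only rows with non-negative sales. Rows where sales is NaN are removed as well, because any comparison with NaN evaluates to False.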
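
The sketch below compares two ways of filtering on hypothetical values (the `sample` data is assumed for illustration): one that drops only negative sales, and one that also drops missing sales.

```python
import numpy as np
import pandas as pd

# Hypothetical sales values: one negative, one missing.
sample = pd.DataFrame({"sales": [120.0, -15.0, np.nan, 0.0]})

# Option 1: remove only the negative rows; NaN rows are kept.
negative_mask = sample["sales"] < 0
kept_with_nan = sample[~negative_mask]

# Option 2: keep rows where sales >= 0; NaN rows are dropped too.
kept_strict = sample[sample["sales"] >= 0]

print(len(kept_with_nan))  # 3
print(len(kept_strict))    # 2
```

Which option fits depends on whether missing sales have already been handled earlier in the cleaning process.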


Final Data Validation

After cleaning, always recheck the dataset.

df.info()
df.describe()

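info() lists each column's data type and non-null count, and describe() shows summary statistics for numeric columns. Use them to confirm that types, counts, and value ranges look as expected before starting the analysis.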
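
Beyond reading the summaries, a few explicit checks can be written in code. The sketch below is one possible approach on a small made-up DataFrame; the column names and the choice of checks are assumptions for illustration.

```python
import pandas as pd

# Hypothetical cleaned data.
sample = pd.DataFrame({
    "region": ["East", "West"],
    "sales": [100.0, 200.0],
})

# Each check is True when the cleaned data meets the expectation.
checks = {
    "no_duplicates": not sample.duplicated().any(),
    "sales_is_numeric": pd.api.types.is_numeric_dtype(sample["sales"]),
    "no_negative_sales": bool((sample["sales"].dropna() >= 0).all()),
}

for name, passed in checks.items():
    print(name, "OK" if passed else "FAILED")
```

Checks like these can be rerun whenever the source file changes, so problems are caught before the analysis starts.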


Practice Exercise

Using the dataset:

  • Remove duplicate rows
  • Standardize all column names
  • Clean text values in categorical columns
  • Ensure numeric columns contain valid numbers

What’s Next?

Now that the dataset is clean, the next lesson focuses on sorting and ranking data to organize information meaningfully.