Data Cleaning in Pandas
Data cleaning is the process of preparing raw data for analysis by fixing errors, inconsistencies, and unwanted values.
Even after handling missing values, datasets often contain duplicated rows, incorrect formats, inconsistent text, or unnecessary columns.
Why Data Cleaning Is Important
Clean data ensures:
- Accurate calculations
- Reliable analysis results
- Better visualizations
- Fewer errors in machine learning models
In real projects, data cleaning often takes more time than the analysis itself.
Loading the Dataset
We continue using the same dataset. Make sure it is loaded correctly.
import pandas as pd
df = pd.read_csv("dataplexa_pandas_sales.csv")
Removing Duplicate Rows
Duplicate rows can distort totals and averages.
To check for duplicates (the result is a boolean Series that is True for every row repeating an earlier one):
df.duplicated()
To remove duplicate rows:
df = df.drop_duplicates()
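The two calls above can be tried on a small table before running them on the full dataset. This is a minimal sketch: the sample DataFrame and its column names (order_id, region, sales) are hypothetical and are not taken from dataplexa_pandas_sales.csv.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
sample = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "region": ["East", "West", "West", "East"],
    "sales": [100, 250, 250, 75],
})

# duplicated() marks each row that repeats an earlier row as True,
# so summing the mask counts the duplicate rows.
dup_mask = sample.duplicated()
dup_count = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence by default (keep="first").
deduped = sample.drop_duplicates().reset_index(drop=True)

# subset= restricts the comparison to the listed columns.
one_row_per_region = sample.drop_duplicates(subset=["region"])

print(dup_count)
print(deduped)
```

Counting duplicates before dropping them is a cheap way to notice when far more rows than expected are about to disappear.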
Standardizing Column Names
Inconsistent column names make code harder to read.
A common practice is to convert all column names to lowercase and replace spaces with underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_")
This makes column access predictable and clean.
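Column names sometimes also carry leading or trailing spaces, which would turn into stray underscores. One possible variation, shown here on hypothetical column names, strips that whitespace first:

```python
import pandas as pd

# Hypothetical column names with stray spaces and mixed case.
sample = pd.DataFrame(columns=[" Order ID", "Region ", "Total Sales"])

sample.columns = (
    sample.columns
    .str.strip()                          # drop leading/trailing whitespace
    .str.lower()                          # lowercase everything
    .str.replace(" ", "_", regex=False)   # inner spaces become underscores
)

cleaned_names = list(sample.columns)
print(cleaned_names)
```

Passing regex=False treats the space as a literal character rather than a pattern.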
Cleaning Text Data
Text columns often contain extra spaces, mixed letter cases, or inconsistent values.
Example: Cleaning a region column.
df["region"] = df["region"].str.strip().str.title()
This removes leading and trailing spaces and converts each value to title case, so " east " becomes "East".
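Note that str.strip() only touches the ends of each value. If repeated spaces inside a value also need fixing, one option is to collapse them with a regular expression. The sketch below uses made-up region values, not values from the lesson's dataset:

```python
import pandas as pd

# Made-up region values: padding, mixed case, a double space, and a missing entry.
regions = pd.Series(["  east", "WEST ", "north  east", None])

cleaned = (
    regions
    .str.strip()                            # remove padding at both ends
    .str.replace(r"\s+", " ", regex=True)   # collapse runs of whitespace
    .str.title()                            # "north east" -> "North East"
)

print(cleaned)
```

Missing values pass through the .str methods unchanged, so they still need to be handled separately.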
Fixing Incorrect Data Types
Sometimes numeric data is stored as text, so arithmetic and aggregations fail or give unexpected results.
Check data types:
df.dtypes
Convert a column to numeric:
df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
With errors="coerce", values that cannot be parsed as numbers become NaN instead of raising an error.
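Because coercion silently replaces bad values, it helps to look at which entries were lost. The following sketch uses hypothetical raw values; the comma-removal step is one possible fix for thousands separators, not something the dataset is known to need.

```python
import pandas as pd

# Hypothetical raw values as they might appear in a text column.
raw_sales = pd.Series(["100", "250.5", "n/a", "1,200"])

# First attempt: anything unparseable becomes NaN.
numeric = pd.to_numeric(raw_sales, errors="coerce")

# Inspect the original strings that could not be converted.
rejected = raw_sales[numeric.isna() & raw_sales.notna()]

# Removing thousands separators first recovers "1,200".
no_commas = raw_sales.str.replace(",", "", regex=False)
numeric_fixed = pd.to_numeric(no_commas, errors="coerce")

print(rejected.tolist())
print(numeric_fixed)
```

Reviewing the rejected values before overwriting the column makes it clear whether a NaN reflects truly missing data or only a formatting problem.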
Removing Unnecessary Columns
Datasets often include columns that are not useful.
Remove unwanted columns:
df = df.drop(columns=["unnecessary_column"])
Keeping only relevant columns reduces memory use and makes the dataset easier to work with.
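By default, drop() raises a KeyError if a listed column does not exist. The sketch below shows two alternatives on a hypothetical table; the column names notes and internal_id are invented for the example.

```python
import pandas as pd

# Hypothetical table; "notes" stands in for an unneeded column.
sample = pd.DataFrame({
    "region": ["East"],
    "sales": [100],
    "notes": ["n/a"],
})

# errors="ignore" skips names that are not present ("internal_id" here).
trimmed = sample.drop(columns=["notes", "internal_id"], errors="ignore")

# Alternatively, select the columns to keep instead of listing those to drop.
selected = sample[["region", "sales"]]

print(list(trimmed.columns))
```

Selecting the columns to keep is often safer when the source file may gain new columns over time.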
Handling Out-of-Range Values
Some values may be logically incorrect, such as negative sales.
Filter out invalid values:
df = df[df["sales"] >= 0]
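This keeps only rows with non-negative sales. Rows where sales is NaN are dropped as well, because any comparison with NaN evaluates to False.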
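How the filter treats missing values is easy to overlook, so here is a small sketch on made-up numbers. It shows the plain filter next to a variant that keeps missing rows for later handling.

```python
import pandas as pd

# Made-up sales values: one negative, one missing.
sample = pd.DataFrame({"sales": [100.0, -20.0, None, 0.0]})

# The plain filter drops the negative row and the missing row.
kept = sample[sample["sales"] >= 0]

# To keep missing values for later handling, allow them explicitly.
kept_with_missing = sample[(sample["sales"] >= 0) | sample["sales"].isna()]

print(len(kept), len(kept_with_missing))
```

Which variant is appropriate depends on whether missing sales were already dealt with in an earlier step.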
Final Data Validation
After cleaning, always recheck the dataset.
df.info()
df.describe()
These summaries show column types, non-null counts, and basic statistics, which helps confirm the dataset is ready for analysis.
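Visual summaries can be complemented with a few explicit checks. This is a minimal sketch under assumptions: the cleaned table and the check names are hypothetical, and the rules should be adapted to whatever the real dataset requires.

```python
import pandas as pd

# Hypothetical cleaned table standing in for the real dataset.
cleaned = pd.DataFrame({
    "region": ["East", "West", "North"],
    "sales": [100.0, 250.5, 75.0],
})

# Each entry maps a check name to True (passed) or False (failed).
checks = {
    "no_duplicate_rows": not cleaned.duplicated().any(),
    "sales_is_numeric": pd.api.types.is_numeric_dtype(cleaned["sales"]),
    "sales_non_negative": bool((cleaned["sales"] >= 0).all()),
    "no_missing_values": not cleaned.isna().any().any(),
}

failed = [name for name, ok in checks.items() if not ok]
print("Failed checks:", failed)
```

An empty list of failed checks means every rule passed for this table.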
Practice Exercise
Using the dataset:
- Remove duplicate rows
- Standardize all column names
- Clean text values in categorical columns
- Ensure numeric columns contain valid numbers
What’s Next?
Now that the dataset is clean, the next lesson focuses on sorting and ranking data to organize information meaningfully.