Handling Duplicate Data in Pandas
Duplicate data is a common issue in real-world datasets. It can occur due to repeated data entry, system errors, or merging multiple data sources.
In this lesson, you will learn how to detect, analyze, and remove duplicates using Pandas.
Loading the Dataset
We continue working with the same dataset used throughout this course.
import pandas as pd
df = pd.read_csv("dataplexa_pandas_sales.csv")
What Are Duplicates?
Duplicates occur when two or more rows contain the same values, either in every column or in the columns that should uniquely identify a record.
For example:
- Same order ID appears multiple times
- Customer records repeated
- Same transaction imported twice
Checking for Duplicate Rows
You can check whether rows are duplicated using the duplicated() method.
df.duplicated()
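This returns a boolean Series with one value per row:
- True → the row repeats an earlier row
- False → the row is unique, or is the first occurrence of a repeated row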
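To make the True/False marking concrete, here is a minimal sketch on a small made-up DataFrame. The frame and its column names (order_id, amount) are assumptions for illustration, not the actual contents of the course dataset.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
sample = pd.DataFrame({
    "order_id": [101, 102, 101, 103],
    "amount": [250.0, 90.0, 250.0, 40.0],
})

# With the default keep="first", only the later copy of a repeated row is flagged.
mask = sample.duplicated()
print(mask.tolist())  # [False, False, True, False]
```

The first row with order 101 is treated as the original, so only the third row is marked True.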
Viewing Duplicate Rows
To display only the duplicated rows:
df[df.duplicated()]
This shows the second and later copies of each repeated row (the first occurrence is not included), so you can inspect what data is repeated.
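When you want every copy side by side, including the first one, one option is to pass keep=False to duplicated(). A minimal sketch, again on a hypothetical frame with assumed column names:

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
sample = pd.DataFrame({
    "order_id": [101, 102, 101, 103],
    "customer": ["Ana", "Ben", "Ana", "Cy"],
})

# keep=False flags every member of a duplicate group, not just the later copies.
all_copies = sample[sample.duplicated(keep=False)]
print(all_copies)
```

Here both rows for order 101 are returned, which makes it easier to compare them before deciding which one to keep.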
Counting Duplicate Records
To find how many duplicates exist:
df.duplicated().sum()
This returns the number of rows flagged as duplicates, that is, rows that repeat an earlier row. First occurrences are not counted.
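The count depends on what you choose to flag. The sketch below, on a made-up frame with assumed column names, compares the number of extra copies with the number of rows involved in any duplicate group.

```python
import pandas as pd

# Hypothetical sample data: one row appears three times.
sample = pd.DataFrame({
    "order_id": [101, 102, 101, 101, 103],
    "amount": [250.0, 90.0, 250.0, 250.0, 40.0],
})

# Extra copies only (the default keep="first" leaves the first occurrence unflagged).
extra_rows = sample.duplicated().sum()

# Every row that belongs to a duplicate group, first occurrence included.
rows_involved = sample.duplicated(keep=False).sum()

print(extra_rows, rows_involved)  # 2 3
```

The first number tells you how many rows drop_duplicates() would remove; the second tells you how many rows are affected in total.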
Removing Duplicate Rows
To remove duplicate rows and keep only the first occurrence:
df.drop_duplicates(inplace=True)
After this operation, each distinct row appears exactly once in df.
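As an alternative to inplace=True, drop_duplicates() can return a new DataFrame and leave the original untouched, which keeps the raw data available for comparison. A minimal sketch on hypothetical data (the frame and its column names are assumptions):

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
sample = pd.DataFrame({
    "order_id": [101, 102, 101],
    "amount": [250.0, 90.0, 250.0],
})

# Without inplace=True, the result is a new DataFrame.
# reset_index(drop=True) renumbers the rows after the removal.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(len(sample), len(deduped))  # 3 2
```

Whether to modify in place or assign to a new variable is a style choice; the rows removed are the same either way.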
Removing Duplicates Based on Specific Columns
Sometimes duplicates are defined by a specific column, such as order_id.
df.drop_duplicates(subset=["order_id"], inplace=True)
This keeps the first row for each order ID and removes any later rows with the same ID, even if their other columns differ.
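The difference between full-row and key-based deduplication shows up when two rows share a key but differ elsewhere. The sketch below uses a made-up frame; the amount column is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample data: order 101 appears twice with different amounts.
sample = pd.DataFrame({
    "order_id": [101, 102, 101],
    "amount": [250.0, 90.0, 275.0],
})

# Full-row comparison: no two rows are identical, so nothing is removed.
full_row = sample.drop_duplicates()

# Key-based comparison: the second row for order 101 is removed.
by_key = sample.drop_duplicates(subset=["order_id"])

print(len(full_row), len(by_key))  # 3 2
print(by_key["amount"].tolist())   # [250.0, 90.0]
```

Because the surviving row is simply the first one encountered, it is worth checking that the discarded rows do not carry the values you actually want.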
Keeping Last Occurrence Instead of First
By default, Pandas keeps the first occurrence. You can keep the last one instead.
df.drop_duplicates(keep="last", inplace=True)
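Keeping the last occurrence is often combined with subset, for example when later rows hold more recent information. A minimal sketch, assuming a hypothetical status column and that rows are already in chronological order:

```python
import pandas as pd

# Hypothetical sample data; assumes later rows are more recent.
sample = pd.DataFrame({
    "order_id": [101, 102, 101],
    "status": ["pending", "shipped", "shipped"],
})

# For each order_id, keep only the last row seen.
latest = sample.drop_duplicates(subset=["order_id"], keep="last")

print(latest["status"].tolist())  # ['shipped', 'shipped']
```

The "pending" row for order 101 is dropped in favor of the later "shipped" row. If the data is not already ordered, sort it first so that "last" means what you intend.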
Duplicate Detection After Cleaning
Always verify your data after removing duplicates.
df.duplicated().sum()
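If the result is 0, no fully identical rows remain. If you removed duplicates based on a key column, pass the same subset to duplicated() to confirm that the key is now unique.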
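A verification step can check both full rows and the key column. The sketch below is one way to do it on hypothetical data; the column names are assumptions.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
sample = pd.DataFrame({
    "order_id": [101, 102, 101],
    "amount": [250.0, 90.0, 275.0],
})

cleaned = sample.drop_duplicates(subset=["order_id"])

# Check the same definition of "duplicate" that was used for cleaning.
remaining_by_key = cleaned.duplicated(subset=["order_id"]).sum()

# Series.is_unique is a quick alternative for a single key column.
key_is_unique = cleaned["order_id"].is_unique

print(remaining_by_key, key_is_unique)  # 0 True
```

In a script, these checks could be turned into assert statements so that the pipeline stops if duplicates slip through.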
Why Handling Duplicates Is Important
- Prevents incorrect aggregations
- Improves data accuracy
- Reduces bias in analysis
- Ensures reliable reporting
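To illustrate the first point, the sketch below shows how a single repeated row inflates a total. The data is made up for illustration and the column names are assumptions.

```python
import pandas as pd

# Hypothetical sample data: the order 101 row was imported twice.
sample = pd.DataFrame({
    "order_id": [101, 102, 101],
    "amount": [250.0, 90.0, 250.0],
})

total_with_duplicates = sample["amount"].sum()
total_after_cleaning = sample.drop_duplicates()["amount"].sum()

print(total_with_duplicates)  # 590.0
print(total_after_cleaning)   # 340.0
```

The duplicated order is counted twice in the first total, which overstates sales by the amount of that order.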
Practice Exercise
Using the dataset:
- Identify duplicated rows
- Count total duplicates
- Remove duplicates based on a key column
- Verify that duplicates are removed
What’s Next?
In the next lesson, you will learn how to apply functions to data using powerful Pandas techniques.