Pandas Lesson 13 – Duplicates | Dataplexa

Handling Duplicate Data in Pandas

Duplicate data is a common issue in real-world datasets. It can occur due to repeated data entry, system errors, or merging multiple data sources.

In this lesson, you will learn how to detect, analyze, and remove duplicates using Pandas.

Loading the Dataset

We continue working with the same dataset used throughout this course.

import pandas as pd

df = pd.read_csv("dataplexa_pandas_sales.csv")

A row is a duplicate when it repeats the values of another row, either in every column or in a chosen subset of columns.
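To make the distinction concrete, here is a minimal sketch using a small hand-built DataFrame rather than the course CSV. The data and every column name except order_id are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows; every column name except order_id is an assumption.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103, 103],
    "product": ["Pen", "Ink", "Ink", "Pad", "Pad"],
    "quantity": [2, 1, 1, 5, 7],
})

# Rows 1 and 2 agree in every column, so they are exact duplicates.
exact_match = bool((sample.loc[1] == sample.loc[2]).all())

# Rows 3 and 4 differ in quantity, so they are not exact duplicates...
full_match = bool((sample.loc[3] == sample.loc[4]).all())

# ...but they do match when only order_id and product are compared.
key_columns = ["order_id", "product"]
key_match = bool((sample.loc[3, key_columns] == sample.loc[4, key_columns]).all())

print(exact_match, full_match, key_match)  # True False True
```

Whether rows 3 and 4 count as duplicates depends on which columns you treat as identifying a record.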


You can check whether rows are duplicated using the duplicated() method.

df.duplicated()

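This returns a boolean Series with one value per row. True marks a row that repeats an earlier row; the first occurrence of each row is marked False.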
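The sketch below shows what that boolean Series looks like on a three-row DataFrame, and how the keep parameter of duplicated() changes which copies are flagged. The data is made up for illustration, and the amount column is an assumption.

```python
import pandas as pd

# Hypothetical data: the second and third rows are identical.
sample = pd.DataFrame({
    "order_id": [101, 102, 102],
    "amount": [20.0, 35.5, 35.5],
})

mask_first = sample.duplicated()            # default keep="first": flag later copies
mask_last = sample.duplicated(keep="last")  # flag earlier copies instead
mask_all = sample.duplicated(keep=False)    # flag every copy

print(mask_first.tolist())  # [False, False, True]
print(mask_last.tolist())   # [False, True, False]
print(mask_all.tolist())    # [False, True, True]
```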

To display only the duplicated rows:

df[df.duplicated()]

This helps you visually inspect what data is repeated.
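With the default settings the filter shows only the repeated copies, not the rows they repeat. One possible way to see every member of each duplicate group side by side is to pass keep=False and sort the result. This is a sketch on made-up data; the amount column is an assumption.

```python
import pandas as pd

# Hypothetical data with two duplicated orders scattered through the table.
sample = pd.DataFrame({
    "order_id": [103, 101, 102, 101, 103],
    "amount": [50.0, 20.0, 35.5, 20.0, 50.0],
})

# keep=False flags every member of a duplicate group, not just the repeats.
all_copies = sample[sample.duplicated(keep=False)]

# Sorting places matching rows next to each other for easier inspection.
all_copies = all_copies.sort_values(["order_id", "amount"])
print(all_copies)
```

Order 102 appears only once, so it is left out of the result.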

To find how many duplicates exist:

df.duplicated().sum()

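This returns the number of rows flagged as duplicates. The first occurrence of each row is not counted, so the result is the number of rows that drop_duplicates() would remove.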
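The count can be read in more than one way. The following sketch, on made-up data, separates the number of surplus rows from the number of rows that belong to any duplicate group, and also expresses the surplus as a fraction of the table.

```python
import pandas as pd

# Hypothetical data: order 102 appears three times with identical values.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 102, 103],
    "amount": [20.0, 35.5, 35.5, 35.5, 50.0],
})

# Surplus copies: rows that repeat an earlier row.
surplus_rows = int(sample.duplicated().sum())

# Rows involved: every row that has at least one identical twin.
rows_involved = int(sample.duplicated(keep=False).sum())

# The mean of a boolean Series is the fraction of True values.
surplus_share = float(sample.duplicated().mean())

print(surplus_rows, rows_involved, surplus_share)  # 2 3 0.4
```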

To remove duplicate rows and keep only the first occurrence:

df.drop_duplicates(inplace=True)

After this operation, only the first occurrence of each row remains. The surviving rows keep their original index labels, so the index may contain gaps.
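As an alternative to inplace=True, drop_duplicates() can return a new DataFrame and leave the original untouched. The sketch below, on made-up data, also shows the ignore_index parameter (available since pandas 1.0), which renumbers the result from 0.

```python
import pandas as pd

# Hypothetical data: the row at index 2 repeats the row at index 1.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [20.0, 35.5, 35.5, 50.0],
})

# Returns a new DataFrame; sample itself is not modified.
deduped = sample.drop_duplicates()
print(deduped.index.tolist())     # [0, 1, 3]

# ignore_index=True relabels the remaining rows 0, 1, 2, ...
renumbered = sample.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Assigning the result to a new name keeps the raw data available for comparison if the cleanup turns out to be too aggressive.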

Sometimes duplicates are defined by a specific column, such as order_id.

df.drop_duplicates(subset=["order_id"], inplace=True)

This keeps the first row for each order ID and removes every later row with the same ID, even if those rows differ in other columns.
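Because a subset-based drop can discard rows whose other columns differ, it may be worth checking for such conflicts first. The sketch below is one way to do that on made-up data; the status column is an assumption.

```python
import pandas as pd

# Hypothetical data: order 101 is an exact repeat, order 102 has conflicting statuses.
sample = pd.DataFrame({
    "order_id": [101, 101, 102, 102],
    "status": ["shipped", "shipped", "pending", "shipped"],
})

# Rows that share an order_id with another row.
same_id = sample.duplicated(subset=["order_id"], keep=False)

# Rows that are identical to another row in every column.
exact_copy = sample.duplicated(keep=False)

# Same ID but not an exact copy: these rows disagree in some other column.
conflicts = sample[same_id & ~exact_copy]
print(conflicts)

# Dropping by order_id keeps the first row per ID, so "pending" wins for 102.
deduped = sample.drop_duplicates(subset=["order_id"])
print(deduped["status"].tolist())  # ['shipped', 'pending']
```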

By default, Pandas keeps the first occurrence. You can keep the last one instead.

df.drop_duplicates(keep="last", inplace=True)
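Note that "last" refers to row position, not to time. If the goal is to keep the most recent record, one common approach is to sort before dropping. The sketch below assumes a hypothetical updated_at timestamp column, which may not exist in the course dataset.

```python
import pandas as pd

# Hypothetical data: the newest record for order 102 is not the last row.
sample = pd.DataFrame({
    "order_id": [102, 101, 102],
    "status": ["shipped", "shipped", "pending"],
    "updated_at": pd.to_datetime(["2024-01-08", "2024-01-05", "2024-01-06"]),
})

# keep="last" on the unsorted table keeps the row that appears last.
by_position = sample.drop_duplicates(subset=["order_id"], keep="last")

# Sorting by timestamp first makes the last row per ID the most recent one.
by_time = (
    sample.sort_values("updated_at")
          .drop_duplicates(subset=["order_id"], keep="last")
)

print(by_position.set_index("order_id").loc[102, "status"])  # pending
print(by_time.set_index("order_id").loc[102, "status"])      # shipped
```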

Always verify your data after removing duplicates.

df.duplicated().sum()

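If the result is 0, the dataset contains no fully duplicated rows. Called without arguments, duplicated() compares every column, so to confirm a column-based cleanup you need to pass the same subset that you used when dropping.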
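The check can be wrapped in a small function so that the same test is run for full rows and for key columns. count_duplicates below is a hypothetical helper, not part of Pandas, and the data is made up for illustration.

```python
import pandas as pd

def count_duplicates(frame, subset=None):
    """Hypothetical helper: number of rows that repeat an earlier row."""
    return int(frame.duplicated(subset=subset).sum())

# Hypothetical data: order 102 appears twice with different statuses.
sample = pd.DataFrame({
    "order_id": [101, 102, 102],
    "status": ["shipped", "pending", "shipped"],
})

cleaned = sample.drop_duplicates()
print(count_duplicates(cleaned))                       # 0: no identical rows
print(count_duplicates(cleaned, subset=["order_id"]))  # 1: order 102 still repeats

cleaned = cleaned.drop_duplicates(subset=["order_id"])
print(count_duplicates(cleaned, subset=["order_id"]))  # 0
print(cleaned["order_id"].is_unique)                   # True
```

A full-row check of 0 does not guarantee that identifiers are unique, which is why the subset check is run separately.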


In the next lesson, you will learn how to apply functions to data using powerful Pandas techniques.