Pandas Lesson 13 – Duplicates | Dataplexa

Handling Duplicate Data in Pandas

Duplicate data is a common issue in real-world datasets. It can occur due to repeated data entry, system errors, or merging multiple data sources.

In this lesson, you will learn how to detect, analyze, and remove duplicates using Pandas.


Loading the Dataset

We continue working with the same dataset used throughout this course.

import pandas as pd

df = pd.read_csv("dataplexa_pandas_sales.csv")

What Are Duplicates?

Duplicates occur when two or more rows contain the same values, either across all columns or in a chosen subset of columns such as a key.

For example:

  • Same order ID appears multiple times
  • Customer records repeated
  • Same transaction imported twice

Checking for Duplicate Rows

You can check whether rows are duplicated using the duplicated() method.

df.duplicated()

This returns a boolean Series with one value per row:

  • True → the row repeats an earlier row
  • False → the row is unique, or it is the first occurrence of a repeated row
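To make the True/False marking concrete, here is a minimal sketch on a small hand-made DataFrame. The sample values and the amount column are assumptions for illustration only; they are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical sample data (not the course CSV): order 102 was entered twice.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [250.0, 80.0, 80.0, 40.0],
})

# With the default keep="first", only the second copy of order 102 is flagged.
flags = sample.duplicated()
print(flags.tolist())  # [False, False, True, False]
```

Note that the first copy of order 102 is marked False even though it has a twin further down.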

Viewing Duplicate Rows

To display only the duplicated rows:

df[df.duplicated()]

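This helps you visually inspect what data is repeated. By default the first occurrence of each repeated row is not shown; pass keep=False to duplicated() to see every copy.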
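The difference between the default filter and keep=False can be sketched as follows. The sample DataFrame is again a hypothetical stand-in for the course data.

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 2 are identical.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [250.0, 80.0, 80.0, 40.0],
})

# Default (keep="first"): shows only the later copies.
later_copies = sample[sample.duplicated()]

# keep=False: flags every member of a duplicate group, so both copies
# appear together, which makes side-by-side inspection easier.
all_copies = sample[sample.duplicated(keep=False)]

print(later_copies)
print(all_copies)
```

Sorting the result by the key column (for example with sort_values) can help when the copies are far apart in the original file.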


Counting Duplicate Records

To find how many duplicates exist:

df.duplicated().sum()

This returns the number of rows flagged as duplicates, that is, the extra copies beyond each first occurrence.
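As a rough illustration of what the count means, the sketch below uses a hypothetical sample in which one row appears three times. Summing works because True counts as 1 and False as 0.

```python
import pandas as pd

# Hypothetical sample data: order 102 appears three times.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 102, 103],
    "amount": [250.0, 80.0, 80.0, 80.0, 40.0],
})

# Extra copies beyond the first occurrence (default keep="first").
extra_copies = int(sample.duplicated().sum())

# All rows that belong to a duplicate group, first occurrence included.
rows_in_groups = int(sample.duplicated(keep=False).sum())

print(extra_copies, rows_in_groups)  # 2 3
```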


Removing Duplicate Rows

To remove duplicate rows and keep only the first occurrence:

df.drop_duplicates(inplace=True)

After this operation, only the first occurrence of each repeated row remains in df.
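An alternative to inplace=True is to assign the result to a new name, which leaves the original DataFrame available for comparison. The following is a minimal sketch on hypothetical sample data; the reset_index call is an optional extra, not part of the lesson's code.

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 2 are identical.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [250.0, 80.0, 80.0, 40.0],
})

# drop_duplicates() returns a new DataFrame; the original is unchanged.
# reset_index(drop=True) renumbers the rows 0..n-1 after the removal.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(len(sample), len(deduped))  # 4 3
```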


Removing Duplicates Based on Specific Columns

Sometimes duplicates are defined by a specific column, such as order_id.

df.drop_duplicates(subset=["order_id"], inplace=True)

This keeps the first row for each order ID and removes any later rows with the same order ID, even if their other columns differ.
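The sketch below shows why the subset matters. In this hypothetical sample, order 102 appears twice with different amounts, so a full-row check finds nothing while a check on order_id alone does.

```python
import pandas as pd

# Hypothetical sample data: order 102 appears twice with different amounts.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [250.0, 80.0, 95.0, 40.0],
})

# No two rows are identical across all columns.
full_row_dupes = int(sample.duplicated().sum())

# Restricting the comparison to order_id finds the repeat.
key_dupes = int(sample.duplicated(subset=["order_id"]).sum())

# The first row per order_id is kept, so the 80.0 row survives.
by_order = sample.drop_duplicates(subset=["order_id"])

print(full_row_dupes, key_dupes)  # 0 1
print(by_order["amount"].tolist())  # [250.0, 80.0, 40.0]
```

Because the dropped row carried a different amount, it is worth inspecting such rows before removing them.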


Keeping Last Occurrence Instead of First

By default, Pandas keeps the first occurrence. You can keep the last one instead.

df.drop_duplicates(keep="last", inplace=True)
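The keep parameter is most visible when it is combined with subset. The sketch below compares keep="last" with keep=False, which drops every row of a duplicate group. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: order 102 appears twice with different amounts.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [250.0, 80.0, 95.0, 40.0],
})

# keep="last": the later row for order 102 (amount 95.0) survives.
latest = sample.drop_duplicates(subset=["order_id"], keep="last")

# keep=False: every row of a repeated order_id is dropped.
no_repeats = sample.drop_duplicates(subset=["order_id"], keep=False)

print(latest["amount"].tolist())  # [250.0, 95.0, 40.0]
print(no_repeats["order_id"].tolist())  # [101, 103]
```

Keeping the last occurrence is only meaningful if the row order reflects something, such as import time, so sorting first may be needed.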

Duplicate Detection After Cleaning

Always verify your data after removing duplicates.

df.duplicated().sum()

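If the result is 0, no fully identical rows remain. If you removed duplicates based on specific columns, pass the same subset to duplicated() when verifying.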
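A verification step that matches the cleaning step might look like the sketch below. It assumes order_id is the key column and uses hypothetical sample data; Series.is_unique is shown as an additional check.

```python
import pandas as pd

# Hypothetical sample data: order 102 appears twice with different amounts.
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [250.0, 80.0, 95.0, 40.0],
})

cleaned = sample.drop_duplicates(subset=["order_id"])

# Verify with the same subset that was used for the removal.
remaining = int(cleaned.duplicated(subset=["order_id"]).sum())

# Equivalent check on a single key column.
key_is_unique = cleaned["order_id"].is_unique

print(remaining, key_is_unique)  # 0 True
```

A full-row check on the uncleaned sample would also have returned 0 here, which is why the check should use the same columns as the removal.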


Why Handling Duplicates Is Important

  • Prevents incorrect aggregations, such as double-counted totals
  • Improves data accuracy
  • Reduces bias from over-represented records
  • Ensures reliable reporting

Practice Exercise

Using the dataset:

  • Identify duplicated rows
  • Count total duplicates
  • Remove duplicates based on a key column
  • Verify that duplicates are removed

What’s Next?

In the next lesson, you will learn how to apply functions to data using powerful Pandas techniques.