EDA Lesson 32 – Handling Duplicates | Dataplexa
Intermediate Level · Lesson 32

Handling Duplicates

Duplicate rows are quiet troublemakers. They don't cause error messages. They don't show up as missing values. They just sit there, making your counts too high, your averages wrong, and your model overconfident on data it's already seen. This lesson shows you how to find every type of duplicate — and make a smart decision about each one.

Three Types of Duplicate — Not All Are Equal

People assume "duplicate" means an identical copy of a row. But in real data there are three types, each needing different treatment:

🔴

Exact Duplicates

Every single column is identical between two rows. Usually a data pipeline error — the same record was written twice. Safe to drop the second copy.

🟡

Key Duplicates

The same ID appears more than once, but other columns differ. Example: a customer_id appearing in two orders. These aren't errors — they're legitimate rows. Dropping them would lose data.

🟠

Near-Duplicates

Most columns match, but one or two differ slightly — a slightly different timestamp, a typo in a name, a rounding difference. The trickiest type to detect and decide on.
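A quick sketch of how the first two types surface in pandas, using a tiny toy table (hypothetical values, not the lesson's dataset):

```python
import pandas as pd

# Toy table with one of each duplicate type
toy = pd.DataFrame({
    'id':     [1, 1, 2, 2, 3, 4],
    'name':   ['Ann', 'Ann', 'Bob', 'Bob', 'Cara', 'Carra'],
    'amount': [10.0, 10.0, 20.0, 25.0, 30.0, 30.0],
})

# Exact duplicate: rows 0 and 1 match on every column
print(toy.duplicated().sum())             # 1 exact copy

# Key duplicate: id 2 appears twice, but with different amounts
print(toy.duplicated(subset='id').sum())  # 2 repeated ids (rows 1 and 3)

# Near-duplicate: rows 4 and 5 differ only by a typo in the name.
# pandas has no built-in finder for this; you need a custom rule.
```

Note that `.duplicated(subset='id')` flags both the harmless exact copy (row 1) and the genuine conflict (row 3): the three types overlap, which is why the lesson checks them in order.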

The Dataset We'll Use

The scenario: You're a data engineer at an online retailer. Your team has just merged three data sources — the orders database, a marketing export, and a manual CSV update — into one combined file for analysis. Your manager has asked you to clean it before it goes to the modelling team. She's warned you: "The merge process has a known bug that sometimes writes rows twice. And the manual CSV had some copy-paste errors." You need to find and handle every type of duplicate before this data gets used.

import pandas as pd
import numpy as np

# Merged orders dataset — contains all three types of duplicate
df = pd.DataFrame({
    'order_id':    [1001, 1002, 1003, 1004, 1005, 1001, 1006, 1007, 1007, 1008, 1009, 1010],
    'customer_id': [201,  202,  203,  201,  204,  201,  205,  206,  206,  202,  207,  202 ],
    'product':     ['Laptop','Phone','Tablet','Laptop','Monitor','Laptop',
                    'Keyboard','Headphones','Headphones','Phone','Mouse','Phone'],
    'amount':      [999,  349,  529,  999,  449,  999,  89,   149,  149,  349,  39,  350],
    #                                                    ^^^                           ^^^
    # order_id 1001 appears at rows 0 and 5 — exact duplicate
    # order_id 1007 appears at rows 7 and 8 — exact duplicate
    # customer 202 appears in rows 1, 9, 11 — this is fine (3 separate orders)
    # row 11: amount 350 vs row 1: amount 349 — near-duplicate for phone order?
    'status':      ['shipped','delivered','processing','shipped','delivered',
                    'shipped','processing','delivered','delivered','delivered',
                    'shipped','delivered'],
    'date':        ['2024-03-01','2024-03-02','2024-03-03','2024-03-01','2024-03-04',
                    '2024-03-01','2024-03-05','2024-03-06','2024-03-06','2024-03-07',
                    '2024-03-08','2024-03-09']
})

print(f"Rows loaded: {len(df)}")
print(df.to_string(index=True))

What just happened?

Even eyeballing the table you can spot things: rows 0 and 5 are identical (order 1001), rows 7 and 8 are identical (order 1007). Customer 202 appears three times — but rows 1, 9, and 11 have different order IDs and dates, so those are legitimate repeat customers. Row 11 has amount 350 while row 1 has 349 — same customer, same product, a week apart. Is that a near-duplicate or a genuine second purchase? That's the judgment call we'll work through.

Step 1 — Find Exact Duplicates

The scenario: Your manager said the merge bug writes whole rows twice. Start by finding every row that is a complete, 100% identical copy of another row — same order ID, same customer, same amount, same everything. These are always errors and always safe to drop.

# .duplicated() returns True for every row that is an exact copy of an earlier row
# keep='first' means: mark all copies EXCEPT the first occurrence as duplicates
exact_dupes = df[df.duplicated(keep='first')]

print(f"Exact duplicate rows found: {len(exact_dupes)}\n")
print(exact_dupes)
print()

# Remove exact duplicates — keep the first occurrence of each
df_clean = df.drop_duplicates(keep='first')
print(f"Rows after removing exact duplicates: {len(df_clean)}  (removed {len(df) - len(df_clean)})")

What just happened?

pandas' .duplicated(keep='first') scans every row and returns True for any row that is an exact match of a previous one. keep='first' means the first occurrence is considered the "real" row — all later copies are flagged. .drop_duplicates(keep='first') removes all the flagged copies in one call.

Two exact duplicates found and removed: the second copy of order 1001 (row 5) and the second copy of order 1007 (row 8). We're now at 10 rows. But we're not done — we still need to check for key duplicates and near-duplicates.
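The keep argument has three settings, and it's worth seeing all three side by side. A minimal sketch on a throwaway frame (illustrative values):

```python
import pandas as pd

s = pd.DataFrame({'x': ['a', 'b', 'a', 'a']})

print(s.duplicated(keep='first').tolist())  # [False, False, True, True]  — later copies flagged
print(s.duplicated(keep='last').tolist())   # [True, False, True, False]  — earlier copies flagged
print(s.duplicated(keep=False).tolist())    # [True, False, True, True]   — every copy flagged
```

keep='last' is occasionally useful when the most recent copy of a record is the trustworthy one; keep=False is for inspection, as Step 2 shows.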

Step 2 — Check for Key Duplicates

The scenario: Now that exact duplicates are gone, your manager raises a second concern: "Can you check whether any order_id appears more than once? An order ID is supposed to be unique — if the same ID shows up with different data, that's a conflict we need to investigate." This is a key duplicate — same unique identifier, different content.

# Check whether any order_id appears more than once
# subset='order_id' means: only look at the order_id column when checking for duplicates
key_dupes = df_clean[df_clean.duplicated(subset='order_id', keep=False)]
# keep=False marks ALL occurrences — not just the copies — so we can see both rows

print(f"Rows with duplicate order_id: {len(key_dupes)}")
if len(key_dupes) > 0:
    print("\nAll rows with conflicting order IDs:")
    print(key_dupes.sort_values('order_id'))
else:
    print("No key duplicates found — order_id is unique after exact duplicate removal.")
print()

# Also check customer_id — a customer appearing twice is NOT a problem
# (they can place multiple orders), so we just want to know the count
cust_order_counts = df_clean.groupby('customer_id')['order_id'].count()
repeat_customers  = cust_order_counts[cust_order_counts > 1]
print(f"Customers with multiple orders: {len(repeat_customers)}")
print(repeat_customers)

What just happened?

pandas' .duplicated(subset='order_id') checks for duplicates only within the specified column — ignoring all other columns. keep=False is the key difference from Step 1: it marks every occurrence of a duplicate (including the first), so you can see all conflicting rows side by side and decide which one to keep.

Good news: after removing exact duplicates, every order_id is now unique — no conflicting records. Customers 201 and 202 each appear multiple times, but with different order IDs — those are legitimate repeat purchases. This is an important distinction: a duplicated customer ID is expected and desirable. A duplicated order ID is a problem.
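subset also accepts a list of columns, which is how you'd check a composite key. A self-contained sketch on a hypothetical mini-table, asking whether (customer_id, date) together should be unique:

```python
import pandas as pd

# Hypothetical mini-table: should (customer_id, date) be unique?
t = pd.DataFrame({
    'order_id':    [1, 2, 3, 4],
    'customer_id': [201, 202, 201, 203],
    'date':        ['2024-03-01', '2024-03-02', '2024-03-01', '2024-03-02'],
})

# Composite key: flag rows where the same customer appears twice on the same date
combo = t[t.duplicated(subset=['customer_id', 'date'], keep=False)]
print(combo)   # orders 1 and 3: customer 201 twice on 2024-03-01
```

Whether a repeated (customer, date) pair is a problem depends on the business rule, which is exactly the kind of question to put to the data owner rather than assume.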

Step 3 — Hunt for Near-Duplicates

The scenario: Your manager's final concern is the manual CSV. "Someone copy-pasted some rows and may have changed one or two values slightly — same customer, same product, a few days apart, slightly different amount. I need you to flag anything that looks like it might be the same real-world order entered twice with minor differences." This is the near-duplicate problem — and there's no single method that catches it. You have to define what "near" means for your data.

# Near-duplicate rule for this dataset:
# "Same customer + same product + amount within 5% + within 7 days = suspicious"
df_clean['date'] = pd.to_datetime(df_clean['date'])   # make sure date is datetime not string

near_dupes = []

rows = df_clean.reset_index(drop=True)   # reset index after dropping rows

for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        row_a = rows.iloc[i]
        row_b = rows.iloc[j]

        # Condition 1: same customer and same product
        same_customer_product = (row_a['customer_id'] == row_b['customer_id'] and
                                  row_a['product']     == row_b['product'])

        # Condition 2: amount within 5% of each other
        amount_diff_pct = abs(row_a['amount'] - row_b['amount']) / row_a['amount']
        similar_amount  = amount_diff_pct < 0.05

        # Condition 3: dates within 7 days of each other
        date_diff = abs((row_a['date'] - row_b['date']).days)
        close_dates = date_diff <= 7

        if same_customer_product and similar_amount and close_dates:
            near_dupes.append({
                'row_a_idx':    i,
                'row_b_idx':    j,
                'order_a':      row_a['order_id'],
                'order_b':      row_b['order_id'],
                'customer':     row_a['customer_id'],
                'product':      row_a['product'],
                'amount_a':     row_a['amount'],
                'amount_b':     row_b['amount'],
                'amount_diff%': round(amount_diff_pct * 100, 1),
                'days_apart':   date_diff
            })

if near_dupes:
    print(f"Near-duplicate pairs found: {len(near_dupes)}\n")
    print(pd.DataFrame(near_dupes).to_string(index=False))
else:
    print("No near-duplicates found.")

What just happened?

pandas' .iloc[] accesses rows by integer position so we can compare each pair. pd.to_datetime() converts the date strings so we can do arithmetic (subtract two dates to get the gap in days). The loop checks every combination of two rows against our three near-duplicate criteria.

The rule flags four suspicious pairs. Orders 1001 and 1004 are identical laptop orders for customer 201 on the same day, under different order IDs: a strong copy-paste candidate. Customer 202 has three phone orders: 1002 and 1008 are 5 days apart with identical amounts, 1002 and 1010 are 7 days apart with a £1 difference (0.3%), and 1008 and 1010 are 2 days apart. But here's where analysis stops and judgment begins. Are these copy-paste errors, or is customer 202 genuinely buying phones repeatedly? You need to escalate this to the business, not silently delete it.
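One caveat: the pairwise loop is O(n²), fine for 10 rows but painful for 100,000. A hedged alternative (a sketch under the same three criteria, not the only way to do this): sort each (customer, product) group by date and compare only adjacent rows, which catches most near-duplicate runs in a single pass.

```python
import pandas as pd

def adjacent_near_dupes(frame, amount_tol=0.05, day_tol=7):
    """Sketch: compare each row only to the previous row within its
    (customer_id, product) group, after sorting by date.
    Avoids the full O(n^2) scan; 'date' must already be datetime."""
    out = frame.sort_values(['customer_id', 'product', 'date']).copy()
    grp = out.groupby(['customer_id', 'product'])
    prev_amount = grp['amount'].shift()   # previous row's amount within the group
    prev_date   = grp['date'].shift()     # previous row's date within the group
    amount_ok = (out['amount'] - prev_amount).abs() / prev_amount < amount_tol
    date_ok   = (out['date'] - prev_date).dt.days <= day_tol
    return out[amount_ok & date_ok]
```

The trade-off: this only compares neighbours, so two similar orders separated by a dissimilar one inside the same group could slip through. For a one-off audit of a small file, the exhaustive loop above is the safer choice.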

Step 4 — The Duplicate Audit Report

The scenario: Your manager wants a written summary she can forward to the data owner before any rows are deleted. The rule in your team is: exact duplicates can be dropped automatically, but near-duplicates and key conflicts must be flagged for human review. You build a function that produces the full audit trail.

def duplicate_audit(original_df, cleaned_df, near_dupe_pairs):
    """Produces a plain-English duplicate audit report."""
    exact_removed  = len(original_df) - len(cleaned_df)
    near_flagged   = len(near_dupe_pairs)

    print("=" * 52)
    print("  DUPLICATE AUDIT REPORT")
    print("=" * 52)
    print(f"\n  Original row count:    {len(original_df)}")
    print(f"  After exact removal:   {len(cleaned_df)}")
    print(f"  Exact duplicates found and removed: {exact_removed}")
    print(f"  → Action taken: auto-dropped (pipeline error confirmed)\n")

    # Compute the verdict rather than hard-coding it, so the report
    # stays honest if the function is reused on different data
    id_verdict = ("PASSED — all IDs unique" if cleaned_df['order_id'].is_unique
                  else "FAILED — conflicting IDs remain")
    print(f"  Key duplicate check (order_id):  {id_verdict}")
    n_customers = cleaned_df['customer_id'].nunique()
    print(f"  Key duplicate check (customer):  {n_customers} unique customers, "
          f"{len(cleaned_df) - n_customers} repeat-buyer rows (expected)\n")

    print(f"  Near-duplicate pairs flagged: {near_flagged}")
    if near_flagged > 0:
        print(f"  → Action required: HUMAN REVIEW before deletion\n")
        for pair in near_dupe_pairs:
            print(f"    Orders {pair['order_a']} & {pair['order_b']} — "
                  f"Customer {pair['customer']}, {pair['product']}, "
                  f"£{pair['amount_a']} vs £{pair['amount_b']}, "
                  f"{pair['days_apart']} days apart")
    print(f"\n  Final row count ready for modelling: {len(cleaned_df)}")
    print("=" * 52)

duplicate_audit(df, df_clean, near_dupes)

What just happened?

pandas' .unique() returns the distinct values in a column, and .nunique() counts them directly. Subtracting the unique customer count from the total row count tells us how many rows are "repeat appearances" — and whether that's expected (customers with multiple orders) or suspicious.

The audit report clearly separates what was done automatically (exact duplicates removed), what passed cleanly (key duplicate check on order_id), and what needs a human decision (near-duplicate pairs). This is the output you forward to the data owner — not a notebook with 50 cells of code. One clear document, three sections, explicit next actions.
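Since the report is built from print() calls, one way to get it into a file or an email body is to capture stdout with the standard library. A small sketch using contextlib.redirect_stdout (the report lines here are stand-ins for a real duplicate_audit call):

```python
import io
from contextlib import redirect_stdout

# Capture anything a print-based report function writes,
# so it can be saved or forwarded instead of only shown on screen
buffer = io.StringIO()
with redirect_stdout(buffer):
    print("DUPLICATE AUDIT REPORT")       # stand-in for duplicate_audit(...)
    print("Exact duplicates removed: 2")

report_text = buffer.getvalue()
# with open('duplicate_audit.txt', 'w') as f:
#     f.write(report_text)
```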

The Decision Framework

Not all duplicates get the same treatment. Here's the decision logic:

Type                        | How to find it           | Safe to auto-drop? | Action
Exact duplicate             | .duplicated()            | Yes ✓              | drop_duplicates(keep='first')
Key duplicate (ID conflict) | .duplicated(subset='id') | No ✗               | Investigate which version is correct
Same ID, different context  | .groupby(id).count()     | No ✗               | Keep all — these are legitimate repeat records
Near-duplicate              | Custom rule loop         | Human review ⚠     | Flag and escalate — never auto-delete
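The first three rows of the framework can be condensed into a small triage helper. This is a sketch with names of my own invention, not a library function; it only reports counts, because deciding what to drop stays with a human:

```python
import pandas as pd

def duplicate_triage(frame, key):
    """Summarise duplicate counts for one key column.
    Returns counts only; the drop/keep decision is deliberately not automated."""
    exact = frame.duplicated().sum()                 # whole-row copies: safe to drop
    deduped = frame.drop_duplicates()                # remove them before the key check
    key_conflicts = deduped.duplicated(subset=key, keep=False).sum()
    return {
        'exact_duplicates': int(exact),
        'key_conflicts':    int(key_conflicts),
        'note': 'near-duplicates need a custom rule (see Step 3)',
    }
```

Run against the lesson's merged orders frame with key 'order_id', this would report the two exact copies and, after their removal, zero ID conflicts, matching Steps 1 and 2.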

Teacher's Note

The most dangerous duplicate is the one you don't drop. If you silently keep two copies of the same transaction, your revenue totals are overstated, your model trains on duplicate examples and becomes overconfident, and any analysis you run is wrong. This is why the duplicate check comes before everything else in the data cleaning workflow.

But the second most dangerous thing is deleting legitimate rows because they looked suspicious. Customer 202 buying three phones in a week might seem odd — but maybe they're a reseller. The rule is: auto-drop what you can prove is an error (exact duplicates from a known pipeline bug), and escalate everything else. Documentation of every decision is what protects you when someone asks "why are there only 10 rows in this dataset?"

Practice Questions

1. In df.duplicated(), which value of the keep argument marks ALL occurrences of a duplicate — including the first one — so you can see every conflicting row?



2. Which argument in df.duplicated() lets you check for duplicates based on only specific columns rather than the entire row?



3. A near-duplicate is found — same customer, same product, amounts differ by £1, dates 6 days apart. Should you auto-drop it or flag it for human review?



Quiz

1. Why is leaving exact duplicate rows in a dataset harmful to a machine learning model?


2. After deduplication, customer_id 202 still appears three times in the orders table. Is this a problem?


3. pandas' .duplicated() only finds exact matches. How do you find near-duplicates?


Up Next · Lesson 33

Feature Extraction Basics

Turn raw columns into smarter features — extract day of week from dates, parse text into numbers, and create ratio features that a model couldn't see on its own.