Pandas Lesson 28 – Outliers | Dataplexa

Outlier Detection in Pandas

In real-world datasets, not all values follow normal patterns. Some values are unusually high or unusually low. These values are called outliers.

Detecting outliers is a critical step in data cleaning and analysis. If ignored, outliers can distort averages, trends, and models.

What Is an Outlier?

An outlier is a data point that differs significantly from most other values.

Examples:

A sales value far higher than normal daily sales
An unusually low price caused by data entry error
A sudden spike due to a one-time event

Why Outlier Detection Matters

Outliers can:

Skew averages and totals
Mislead business decisions
Reduce accuracy of machine learning models

Outliers are not always wrong — sometimes they carry important insights. The goal is to identify them, not blindly remove them.

Understanding the Dataset Column

In our dataset, we will focus on the numeric column:

sales_amount

This column is ideal for demonstrating outlier detection techniques.

Method 1: Quick Inspection Using describe()

The first step is to understand the data distribution.

sales["sales_amount"].describe()

This output helps you see:

Minimum value
Maximum value
Mean and quartiles

A minimum or maximum that sits far from the quartiles may indicate outliers.
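As a minimal sketch of what this looks like in practice, the snippet below builds a small stand-in DataFrame and compares the maximum with the 75th percentile. The eight values are hypothetical and only serve the illustration; they are not the lesson's dataset.

```python
import pandas as pd

# Hypothetical data for illustration only (not the lesson's real dataset).
sales = pd.DataFrame(
    {"sales_amount": [120, 135, 128, 142, 131, 125, 138, 2500]}
)

summary = sales["sales_amount"].describe()
print(summary)

# A maximum many times larger than the 75th percentile is worth a closer look.
gap_ratio = summary["max"] / summary["75%"]
print(f"max is {gap_ratio:.1f}x the 75th percentile")
```

In this stand-in data the mean is also pulled well above the median (the 50% row), which is another hint that a few large values dominate.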

Method 2: Interquartile Range (IQR)

The IQR method is one of the most common techniques for detecting outliers.

Steps:

Calculate Q1 (25th percentile)
Calculate Q3 (75th percentile)
Compute IQR = Q3 − Q1

Q1 = sales["sales_amount"].quantile(0.25)
Q3 = sales["sales_amount"].quantile(0.75)
IQR = Q3 - Q1

Q1, Q3, IQR

Identifying Outliers Using IQR

Any value below the lower bound or above the upper bound is considered an outlier:

Lower bound = Q1 − 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = sales[
    (sales["sales_amount"] < lower_bound) |
    (sales["sales_amount"] > upper_bound)
]

outliers
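The IQR steps can also be wrapped in a small function so they are easy to reuse on other columns. The sketch below is self-contained; the helper name iqr_bounds, its k parameter, and the sample values are assumptions made for this illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds; iqr_bounds is a hypothetical helper name."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical data for illustration only.
sales = pd.DataFrame(
    {"sales_amount": [120, 135, 128, 142, 131, 125, 138, 2500]}
)

lower_bound, upper_bound = iqr_bounds(sales["sales_amount"])

# Rows strictly below the lower bound or strictly above the upper bound.
outliers = sales[
    (sales["sales_amount"] < lower_bound) |
    (sales["sales_amount"] > upper_bound)
]
print(lower_bound, upper_bound)
print(outliers)
```

The multiplier 1.5 is a convention rather than a rule; a larger k flags fewer values.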

Counting Outliers

To see how many outliers exist:

outliers.shape[0]

This helps decide whether outliers are rare or widespread.
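A raw count is easier to judge next to the total number of rows. As a rough sketch on hypothetical data, a boolean mask gives both the count and the share of flagged rows:

```python
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.DataFrame(
    {"sales_amount": [120, 135, 128, 142, 131, 125, 138, 2500]}
)

q1 = sales["sales_amount"].quantile(0.25)
q3 = sales["sales_amount"].quantile(0.75)
iqr = q3 - q1

# Boolean mask: True where the value falls outside the IQR bounds.
is_outlier = (
    (sales["sales_amount"] < q1 - 1.5 * iqr) |
    (sales["sales_amount"] > q3 + 1.5 * iqr)
)

n_outliers = int(is_outlier.sum())  # True counts as 1
share = is_outlier.mean()           # fraction of rows flagged
print(f"{n_outliers} of {len(sales)} rows flagged ({share:.1%})")
```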

Method 3: Z-Score Technique

A Z-score measures how many standard deviations a value lies from the mean.

Values with a Z-score greater than 3 or less than -3 are often considered outliers.

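mean = sales["sales_amount"].mean()
std = sales["sales_amount"].std()  # sample standard deviation (ddof=1)

z_scores = (sales["sales_amount"] - mean) / std

# Use a separate name so the IQR-based outliers are not overwritten
z_outliers = sales[z_scores.abs() > 3]
z_outliers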
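The following self-contained sketch applies the Z-score rule to hypothetical data: 19 typical values plus one extreme value. The numbers are invented for the illustration.

```python
import pandas as pd

# Hypothetical data: 19 typical days plus one extreme value.
values = [
    120, 135, 128, 142, 131, 125, 138, 133, 127, 140,
    129, 136, 124, 139, 132, 126, 137, 130, 134, 2500,
]
sales = pd.DataFrame({"sales_amount": values})

col = sales["sales_amount"]
# Series.std() uses ddof=1 (sample standard deviation) by default.
z_scores = (col - col.mean()) / col.std()

z_outliers = sales[z_scores.abs() > 3]
print(z_outliers)
```

One caveat about sample size: with the sample standard deviation, the largest possible absolute Z-score in n rows is (n - 1) / sqrt(n). For ten or fewer rows this stays below 3, so the rule cannot flag anything in very small datasets. The extreme value also inflates the mean and standard deviation it is measured against, which is one reason the IQR method is often preferred for skewed data.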

Handling Outliers

Once detected, you have multiple options:

Remove them if they are errors
Cap them at upper or lower limits
Analyze them separately

Example: Removing outliers using IQR.

clean_data = sales[
    (sales["sales_amount"] >= lower_bound) &
    (sales["sales_amount"] <= upper_bound)
]

clean_data
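Capping, the second option in the list above, keeps every row and replaces extreme values with the nearest bound. A minimal sketch using Series.clip on hypothetical data follows; the column name sales_amount_capped is an assumption made for this example.

```python
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.DataFrame(
    {"sales_amount": [120, 135, 128, 142, 131, 125, 138, 2500]}
)

q1 = sales["sales_amount"].quantile(0.25)
q3 = sales["sales_amount"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# clip() replaces values below/above the limits with the limits themselves.
capped = sales.assign(
    sales_amount_capped=sales["sales_amount"].clip(
        lower=lower_bound, upper=upper_bound
    )
)
print(capped)
```

Writing the capped values to a new column keeps the original values available for comparison or separate analysis.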

Real-World Consideration

Before removing outliers, always ask:

Is this value possible in reality?
Is it caused by an error or a special event?
Will removing it change business conclusions?

Practice Exercise

Using the dataset:

Detect outliers in sales_amount
Count how many outliers exist
Create a cleaned dataset without them

What’s Next?

In the next lesson, you will learn about Performance Optimization and how to make Pandas code faster and more efficient.
