Pandas Lesson 28 – Outliers | Dataplexa

Outlier Detection in Pandas

In real-world datasets, not all values follow normal patterns. Some values are unusually high or unusually low. These values are called outliers.

Detecting outliers is a critical step in data cleaning and analysis. If ignored, outliers can distort averages, trends, and models.


What Is an Outlier?

An outlier is a data point that differs significantly from most other values.

Examples:

  • A sales value far higher than normal daily sales
  • An unusually low price caused by a data entry error
  • A sudden spike due to a one-time event

Why Outlier Detection Matters

Outliers can:

  • Skew averages and totals
  • Mislead business decisions
  • Reduce the accuracy of machine learning models

Outliers are not always wrong; sometimes they carry important insights. The goal is to identify them, not blindly remove them.


Understanding the Dataset Column

In our dataset, we will focus on the numeric column:

  • sales_amount

This column is ideal for demonstrating outlier detection techniques.
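The lesson does not show how `sales` is loaded, so here is a minimal sketch of a stand-in DataFrame for trying the snippets that follow. The values and the `order_id` column are hypothetical and are not the lesson's actual dataset; only the `sales_amount` column name comes from the text.

```python
import pandas as pd

# Hypothetical sample data standing in for the lesson's dataset.
# Most amounts sit near 130; two values (5 and 950) are deliberately extreme.
sales = pd.DataFrame({
    "order_id": list(range(1, 13)),
    "sales_amount": [120.0, 135.0, 128.0, 142.0, 131.0, 125.0,
                     138.0, 5.0, 129.0, 133.0, 140.0, 950.0],
})

print(sales.head())
```

If your data lives in a file, you would typically load it with `pd.read_csv` instead of building it by hand.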


Method 1: Inspecting Summary Statistics Using describe()

The first step is to understand the data distribution by looking at its summary statistics.

sales["sales_amount"].describe()

This output helps you see:

  • Minimum value
  • Maximum value
  • Mean and quartiles

A minimum or maximum that lies far from the quartiles, or a mean that is far from the median (the 50% row), suggests that outliers may be present.
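As a rough illustration of reading this output, the sketch below runs describe() on a small set of hypothetical amounts (not the lesson's data) and compares the mean with the median. A large gap between the two is one quick hint that extreme values are pulling the mean.

```python
import pandas as pd

# Hypothetical amounts: most near 130, plus two extreme values (5 and 950).
amounts = pd.Series(
    [120.0, 135.0, 128.0, 142.0, 131.0, 125.0,
     138.0, 5.0, 129.0, 133.0, 140.0, 950.0],
    name="sales_amount",
)

summary = amounts.describe()
print(summary)

# The "50%" row of describe() is the median.
mean_median_gap = summary["mean"] - summary["50%"]
print(f"mean - median = {mean_median_gap:.2f}")
```

Here the median stays close to the typical value while the mean is pulled upward by the single very large amount.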


Method 2: Interquartile Range (IQR)

The IQR method is one of the most common techniques for detecting outliers.

Steps:

  • Calculate Q1 (25th percentile)
  • Calculate Q3 (75th percentile)
  • Compute IQR = Q3 − Q1

Q1 = sales["sales_amount"].quantile(0.25)
Q3 = sales["sales_amount"].quantile(0.75)
IQR = Q3 - Q1

Q1, Q3, IQR

Identifying Outliers Using IQR

Any value below the lower bound or above the upper bound is considered an outlier:

  • Lower bound = Q1 − 1.5 × IQR
  • Upper bound = Q3 + 1.5 × IQR

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = sales[
    (sales["sales_amount"] < lower_bound) |
    (sales["sales_amount"] > upper_bound)
]

outliers
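The IQR steps above can be wrapped in a small helper so they are easy to reuse on other columns. This is a sketch only: the function name `iqr_bounds`, the parameter `k`, and the sample values are assumptions made for illustration, not part of the lesson's dataset.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds as Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample data with two deliberately extreme values.
sales = pd.DataFrame({
    "sales_amount": [120.0, 135.0, 128.0, 142.0, 131.0, 125.0,
                     138.0, 5.0, 129.0, 133.0, 140.0, 950.0],
})

lower_bound, upper_bound = iqr_bounds(sales["sales_amount"])

# Same filter as in the lesson: rows strictly outside the bounds.
outliers = sales[
    (sales["sales_amount"] < lower_bound) |
    (sales["sales_amount"] > upper_bound)
]
print(lower_bound, upper_bound)
print(outliers)
```

The multiplier 1.5 is the conventional default; a larger `k` (for example 3) flags only more extreme values.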

Counting Outliers

To see how many outliers exist:

outliers.shape[0]

This helps decide whether outliers are rare or widespread.
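To judge "rare or widespread", a share of rows is often easier to interpret than a raw count. A minimal sketch, assuming hypothetical amounts and the same 1.5 × IQR rule: summing a boolean mask gives the count, and its mean gives the fraction of rows flagged.

```python
import pandas as pd

# Hypothetical amounts, not the lesson's dataset.
amounts = pd.Series([120.0, 135.0, 128.0, 142.0, 131.0, 125.0,
                     138.0, 5.0, 129.0, 133.0, 140.0, 950.0])

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# True where a value falls outside the IQR bounds.
mask = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

outlier_count = int(mask.sum())   # True counts as 1
outlier_share = mask.mean()       # fraction of rows flagged

print(f"{outlier_count} outliers ({outlier_share:.1%} of rows)")
```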


Method 3: Z-Score Technique

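A Z-score measures how many standard deviations a value lies from the mean.

Values with a Z-score greater than 3 or less than -3 are often considered outliers. This rule of thumb works best when the data is roughly normally distributed.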

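# Mean and sample standard deviation (pandas uses ddof=1 by default)
mean = sales["sales_amount"].mean()
std = sales["sales_amount"].std()

# Number of standard deviations each value lies from the mean
z_scores = (sales["sales_amount"] - mean) / std

# Keep rows whose absolute Z-score exceeds 3; stored under a new name
# so the IQR result in `outliers` is not overwritten
z_outliers = sales[z_scores.abs() > 3]
z_outliers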
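On small samples the Z-score rule can give different results from the IQR rule, because extreme values inflate the mean and standard deviation they are measured against. The sketch below uses hypothetical numbers (not the lesson's dataset) in which one extreme value is flagged and another is missed.

```python
import pandas as pd

# Hypothetical sample: most values near 130, plus 5 and 950.
sales = pd.DataFrame({
    "sales_amount": [120.0, 135.0, 128.0, 142.0, 131.0, 125.0,
                     138.0, 5.0, 129.0, 133.0, 140.0, 950.0],
})

mean_amount = sales["sales_amount"].mean()
std_amount = sales["sales_amount"].std()  # sample standard deviation (ddof=1)

z_scores = (sales["sales_amount"] - mean_amount) / std_amount

# Flag rows more than 3 standard deviations from the mean.
z_outliers = sales[z_scores.abs() > 3]

print(z_scores.round(2))
print(z_outliers)
# Only the 950 row is flagged. The value 5 is far from the bulk of the data,
# but the 950 inflates the standard deviation so much that its Z-score
# is only about -0.76.
```

When a column is heavily skewed or the sample is small, it is worth comparing the Z-score result with the IQR result before deciding which rows to treat as outliers.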

Handling Outliers

Once detected, you have multiple options:

  • Remove them if they are errors
  • Cap them at upper or lower limits
  • Analyze them separately

Example: Removing outliers using the IQR bounds computed earlier.

clean_data = sales[
    (sales["sales_amount"] >= lower_bound) &
    (sales["sales_amount"] <= upper_bound)
]

clean_data
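Capping, the second option listed above, keeps every row but pulls extreme values back to the bounds. A minimal sketch using `Series.clip`, with hypothetical data and a hypothetical new column name `sales_amount_capped`:

```python
import pandas as pd

# Hypothetical sample data, not the lesson's dataset.
sales = pd.DataFrame({
    "sales_amount": [120.0, 135.0, 128.0, 142.0, 131.0, 125.0,
                     138.0, 5.0, 129.0, 133.0, 140.0, 950.0],
})

q1 = sales["sales_amount"].quantile(0.25)
q3 = sales["sales_amount"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# clip() replaces values below lower_bound with lower_bound and values
# above upper_bound with upper_bound; values in between are unchanged.
capped = sales.assign(
    sales_amount_capped=sales["sales_amount"].clip(
        lower=lower_bound, upper=upper_bound
    )
)
print(capped)
```

Writing the capped values to a new column keeps the original amounts available for comparison.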

Real-World Consideration

Before removing outliers, always ask:

  • Is this value possible in reality?
  • Is it caused by an error or a special event?
  • Will removing it change business conclusions?
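One way to approach the last question is to compute the same summary figures with and without the flagged rows and see how much they move. The sketch below does this on hypothetical amounts, using the 1.5 × IQR rule; the column labels `with_outliers` and `without_outliers` are illustrative names only.

```python
import pandas as pd

# Hypothetical amounts, not the lesson's dataset.
amounts = pd.Series([120.0, 135.0, 128.0, 142.0, 131.0, 125.0,
                     138.0, 5.0, 129.0, 133.0, 140.0, 950.0])

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1
mask = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

# Same statistics on the full data and on the data with outliers excluded.
stats = ["mean", "median", "sum"]
comparison = pd.DataFrame({
    "with_outliers": amounts.agg(stats),
    "without_outliers": amounts[~mask].agg(stats),
})
print(comparison)
```

In this sample the median does not change, while the mean and total shift noticeably, which is the kind of difference worth checking before reporting either figure.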

Practice Exercise

Using the dataset:

  • Detect outliers in sales_amount
  • Count how many outliers exist
  • Create a cleaned dataset without them

What’s Next?

In the next lesson, you will learn about Performance Optimization and how to make Pandas code faster and more efficient.