Outlier Detection in Pandas
In real-world datasets, not all values follow normal patterns. Some values are unusually high or unusually low. These values are called outliers.
Detecting outliers is a critical step in data cleaning and analysis. If ignored, outliers can distort averages, trends, and models.
What Is an Outlier?
An outlier is a data point that differs significantly from most other values.
Examples:
- A sales value far higher than normal daily sales
- An unusually low price caused by data entry error
- A sudden spike due to a one-time event
Why Outlier Detection Matters
Outliers can:
- Skew averages and totals
- Mislead business decisions
- Reduce accuracy of machine learning models
Outliers are not always wrong; sometimes they carry important insights. The goal is to identify them, not blindly remove them.
Understanding the Dataset Column
In our dataset, we will focus on the numeric column:
sales_amount
This column is ideal for demonstrating outlier detection techniques.
Method 1: Inspecting Summary Statistics with describe()
The first step is to understand the data distribution.
sales["sales_amount"].describe()
This output helps you see:
- Minimum value
- Maximum value
- Mean and quartiles
A minimum or maximum that lies far from the quartiles, or a mean that sits well away from the median (the 50% row), may indicate outliers.
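As a rough illustration of reading this output, the sketch below builds a small, made-up sales DataFrame (the values are hypothetical and not taken from the lesson's dataset) and compares the maximum with the 75th percentile.

```python
import pandas as pd

# Hypothetical sample data; the lesson's real `sales` DataFrame is not shown here.
sales = pd.DataFrame(
    {"sales_amount": [120, 135, 128, 140, 132, 125, 138, 130, 950]}
)

# describe() returns count, mean, std, min, quartiles, and max as a Series.
summary = sales["sales_amount"].describe()
print(summary)

# A maximum far above the 75th percentile is a hint worth investigating.
gap_above_q3 = summary["max"] - summary["75%"]
print(gap_above_q3)
```

Here the gap between the maximum and the upper quartile is many times larger than the spread of the other values, which suggests a closer look at that row.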
Method 2: Interquartile Range (IQR)
The IQR method is one of the most common techniques for detecting outliers.
Steps:
- Calculate Q1 (25th percentile)
- Calculate Q3 (75th percentile)
- Compute IQR = Q3 − Q1
Q1 = sales["sales_amount"].quantile(0.25)
Q3 = sales["sales_amount"].quantile(0.75)
IQR = Q3 - Q1
Q1, Q3, IQR
Identifying Outliers Using IQR
Any value below the lower bound or above the upper bound is considered an outlier:
- Lower bound = Q1 − 1.5 × IQR
- Upper bound = Q3 + 1.5 × IQR
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = sales[
    (sales["sales_amount"] < lower_bound) |
    (sales["sales_amount"] > upper_bound)
]
outliers
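To see the IQR steps end to end, here is a minimal sketch that wraps the bounds calculation in a helper and applies it to made-up data. The function name iqr_bounds, its k parameter, and the sample values are assumptions introduced for illustration, not part of the lesson's dataset.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds as Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample data with one obviously extreme value.
sales = pd.DataFrame(
    {"sales_amount": [120, 135, 128, 140, 132, 125, 138, 130, 950]}
)

amounts = sales["sales_amount"]
lower_bound, upper_bound = iqr_bounds(amounts)

# Rows that fall outside the [lower_bound, upper_bound] range.
outliers = sales[(amounts < lower_bound) | (amounts > upper_bound)]
print(lower_bound, upper_bound)
print(outliers)
```

For this sample, Q1 is 128 and Q3 is 138, so the IQR is 10 and the bounds are 113 and 153; only the 950 row falls outside them.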
Counting Outliers
To see how many outliers exist:
outliers.shape[0]
This helps decide whether outliers are rare or widespread.
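A raw count is easier to judge next to the total number of rows. The sketch below, again on hypothetical data, computes both the count and the share of rows flagged by the IQR rule; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample data: ten rows, one extreme value.
sales = pd.DataFrame(
    {"sales_amount": [120, 135, 128, 140, 132, 125, 138, 130, 127, 950]}
)

amounts = sales["sales_amount"]
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1

# Boolean mask: True where a row lies outside the IQR bounds.
is_outlier = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

outlier_count = int(is_outlier.sum())   # number of flagged rows
outlier_share = is_outlier.mean()       # fraction of flagged rows
print(f"{outlier_count} of {len(sales)} rows ({outlier_share:.1%})")
```

A small share suggests isolated anomalies, while a large share may mean the distribution is simply skewed and the 1.5 × IQR rule is too strict for it.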
Method 3: Z-Score Technique
A Z-score measures how many standard deviations a value lies from the mean.
Values with a Z-score above 3 or below -3 are often considered outliers.
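# Standardize each value: its distance from the mean in standard deviations
mean = sales["sales_amount"].mean()
std = sales["sales_amount"].std()  # sample standard deviation (ddof=1 by default)
z_scores = (sales["sales_amount"] - mean) / std
# Keep rows whose absolute Z-score exceeds 3
z_outliers = sales[z_scores.abs() > 3]
z_outliers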
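The Z-score rule can be tried on a small made-up sample, as in the sketch below. The data and the names typical_days and z_outliers are hypothetical; the sample is deliberately larger than ten rows for the reason given after the code.

```python
import pandas as pd

# Hypothetical sample data: 20 ordinary days plus one extreme value.
typical_days = [120, 135, 128, 140, 132, 125, 138, 130, 127, 133] * 2
sales = pd.DataFrame({"sales_amount": typical_days + [950]})

amounts = sales["sales_amount"]

# Standardize: subtract the mean, divide by the sample standard deviation.
z_scores = (amounts - amounts.mean()) / amounts.std()

# Flag rows whose absolute Z-score exceeds the conventional threshold of 3.
z_outliers = sales[z_scores.abs() > 3]
print(z_outliers)
```

Note that the extreme value itself inflates both the mean and the standard deviation. With the sample standard deviation, no Z-score can exceed (n − 1) / √n, so in a sample of ten rows or fewer the threshold of 3 is never reached and the IQR method is usually the safer choice.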
Handling Outliers
Once detected, you have multiple options:
- Remove them if they are errors
- Cap them at upper or lower limits
- Analyze them separately
Example: Removing outliers using IQR.
clean_data = sales[
    (sales["sales_amount"] >= lower_bound) &
    (sales["sales_amount"] <= upper_bound)
]
clean_data
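The capping option listed earlier keeps every row but pulls extreme values back to the bounds. One way to sketch it is with Series.clip, shown here on hypothetical data; the name capped is illustrative.

```python
import pandas as pd

# Hypothetical sample data with one extreme value.
sales = pd.DataFrame(
    {"sales_amount": [120, 135, 128, 140, 132, 125, 138, 130, 950]}
)

amounts = sales["sales_amount"]
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# clip() replaces values below/above the bounds with the bounds themselves.
# assign() returns a new DataFrame, so the original `sales` is left unchanged.
capped = sales.assign(
    sales_amount=amounts.clip(lower=lower_bound, upper=upper_bound)
)
print(capped)
```

Capping preserves the row count, which matters when other columns in the same row are still valid, at the cost of replacing the true extreme with an artificial value.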
Real-World Consideration
Before removing outliers, always ask:
- Is this value possible in reality?
- Is it caused by an error or a special event?
- Will removing it change business conclusions?
Practice Exercise
Using the dataset:
- Detect outliers in sales_amount
- Count how many outliers exist
- Create a cleaned dataset without them
What’s Next?
In the next lesson, you will learn about Performance Optimization and how to make Pandas code faster and more efficient.