EDA Course
Detecting Outliers
Outliers are the rows that don't play by the rules — they can be your most important data points, or the most dangerous ones, and knowing which requires you to find them first.
The Row That Broke the Average
Picture a small startup with 9 employees, all earning between $55,000 and $80,000. The average salary comes out to $68,000, which is perfectly reasonable. Then the CEO's $1,200,000 compensation package gets added to the dataset. Suddenly the average jumps to $181,200, and every statistic you calculated just became a lie.
That's what outliers do. They're data points that sit abnormally far from the bulk of your data, and they distort every mean, correlation, and model you build — unless you catch them first.
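The arithmetic is easy to check. The sketch below uses nine hypothetical salaries (made up for illustration, chosen to lie between $55,000 and $80,000 and to average exactly $68,000) and compares the mean with the median before and after the CEO's package is added.

```python
from statistics import mean, median

# Hypothetical salaries (an assumption for illustration): nine values
# between $55,000 and $80,000 whose mean is exactly $68,000
staff = [55_000, 60_000, 62_000, 65_000, 68_000, 70_000, 74_000, 78_000, 80_000]
ceo = 1_200_000

with_ceo = staff + [ceo]

print(f"Mean without CEO:   {mean(staff):,.0f}")
print(f"Mean with CEO:      {mean(with_ceo):,.0f}")
print(f"Median without CEO: {median(staff):,.0f}")
print(f"Median with CEO:    {median(with_ceo):,.0f}")
```

With these made-up figures the mean moves from 68,000 to 181,200, while the median only moves from 68,000 to 69,000. One extreme row is enough to drag the mean far away from the typical value.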
⚠️ Outlier or Error? — Not every outlier is a mistake. A marathon runner logging a 2-hour finish time is a legitimate record. A marathon finish time of 2 minutes is a data entry error. Before removing anything, always ask: is this value surprising, or is it impossible?
Method 1 — The IQR Rule
The Interquartile Range (IQR) rule is one of the most widely used outlier detection methods. It measures the spread of the middle 50% of your data, then flags anything that falls too far outside that range as a potential outlier.
The formula is simple: calculate Q1 (the 25th percentile) and Q3 (the 75th percentile). The IQR is Q3 minus Q1. Any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is flagged.
Lower Fence: Q1 − 1.5 × IQR. Values below this are low outliers.
Upper Fence: Q3 + 1.5 × IQR. Values above this are high outliers.
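As a minimal sketch of the two fences, the helper below computes them with only the standard library. The function name `iqr_fences` and the sample values are made up for illustration. `statistics.quantiles(..., method="inclusive")` interpolates linearly between data points, which is the same convention as the default of pandas' `Series.quantile`.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the IQR rule (hypothetical helper)."""
    # n=4 yields the three quartile cut points: Q1, median, Q3
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]   # made-up values for illustration
lower, upper = iqr_fences(sample)
flagged = [v for v in sample if v < lower or v > upper]

print(f"Fences: {lower} to {upper}")
print(f"Flagged: {flagged}")
```

Here Q1 is 4 and Q3 is 8, so the IQR is 4, the fences sit at −2 and 14, and only the value 30 is flagged.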
The scenario: An e-commerce platform logs order values. Most orders are between $10 and $200, but a few look suspicious — either tiny test transactions or unusually large bulk orders. Let's find them.
import pandas as pd
import numpy as np
# E-commerce order values — most normal, a few suspicious ones
orders = pd.DataFrame({
'order_id': range(1001, 1011),
# Order amounts in USD — includes two extreme values
'amount': [45, 120, 89, 0.50, 155, 78, 1850, 62, 99, 134]
})
# Step 1: Calculate Q1 and Q3 (25th and 75th percentile)
Q1 = orders['amount'].quantile(0.25)
Q3 = orders['amount'].quantile(0.75)
# Step 2: Calculate the IQR (interquartile range)
IQR = Q3 - Q1
# Step 3: Define the fences — anything outside is flagged
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
print(f"Q1: {Q1} | Q3: {Q3} | IQR: {IQR}")
print(f"Lower fence: {lower_fence:.2f}")
print(f"Upper fence: {upper_fence:.2f}")
print()
# Step 4: Boolean mask — True where a value IS an outlier
is_outlier = (orders['amount'] < lower_fence) | (orders['amount'] > upper_fence)
# Step 5: Filter to show only the flagged rows
print("Flagged outliers:")
print(orders[is_outlier])
Q1: 66.0 | Q3: 130.5 | IQR: 64.5
Lower fence: -30.75
Upper fence: 227.25

Flagged outliers:
   order_id  amount
6      1007  1850.0
💡 What just happened?
The IQR method flagged one order: the $1,850 bulk order (possibly legitimate but worth investigating). The $0.50 micro-transaction (almost certainly a test or error) was not flagged. The lower fence came out negative (–$30.75), and no order value can fall below that, so on this dataset the IQR rule can only catch high-side outliers. That's useful context too: a suspiciously small value like this one has to be caught with a domain rule, such as a minimum plausible order amount.
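When the lower fence is negative, as it is for these order amounts, the IQR rule cannot flag small values such as the $0.50 order. One possible complement is a plain domain check. The sketch below assumes a hypothetical business rule, `MIN_PLAUSIBLE_ORDER = 1.00`, which is not part of the dataset and would need to come from someone who knows the platform.

```python
from statistics import quantiles

amounts = [45, 120, 89, 0.50, 155, 78, 1850, 62, 99, 134]

# Hypothetical business rule (an assumption, not from the data):
# genuine orders are never below $1.00
MIN_PLAUSIBLE_ORDER = 1.00

# Quartiles with linear interpolation, matching pandas' default
q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Statistical check: outside the IQR fences
statistical = [a for a in amounts if a < lower_fence or a > upper_fence]
# Domain check: below what the business considers possible
implausible = [a for a in amounts if a < MIN_PLAUSIBLE_ORDER]

print("Flagged by IQR fences:  ", statistical)
print("Flagged by domain floor:", implausible)
```

The two checks answer different questions: the fences ask whether a value is surprising relative to the rest of the data, and the floor asks whether it is plausible at all.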
Method 2 — The Z-Score
The Z-score measures how many standard deviations a value sits away from the mean. A Z-score of 0 means you're exactly at the mean. A Z-score of 3 means you're three standard deviations out — and that's where most analysts draw the line.
The formula: Z = (value − mean) / standard deviation. Any value with |Z| > 3 is typically flagged as an outlier. This method assumes your data is roughly normally distributed — if it's heavily skewed, the IQR method is usually more reliable.
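One caveat is worth checking before applying the |Z| > 3 cutoff to a small dataset. When Z is computed with the sample standard deviation (ddof=1, the default of pandas' `.std()`), no single value in a sample of n points can score above (n − 1)/√n. The sketch below prints that bound for a few sample sizes and confirms it on a deliberately lopsided, made-up sample. The helper name `max_abs_z` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def max_abs_z(n):
    """Largest |Z| a single value can reach in a sample of n points
    when Z uses the sample standard deviation (ddof=1)."""
    return (n - 1) / sqrt(n)

for n in (10, 11, 30, 100):
    print(n, round(max_abs_z(n), 2))

# Sanity check on the most lopsided sample possible:
# nine identical values and one value far away (made-up data)
extreme = [0] * 9 + [1_000_000]
z = (extreme[-1] - mean(extreme)) / stdev(extreme)
print(round(z, 3))
```

With 10 points the bound is about 2.85, so a cutoff of 3 can never fire. At least 11 points are needed before |Z| > 3 is even possible, and considerably more before it is a meaningful test.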
The scenario: A hospital logs patient blood pressure readings. One reading looks suspiciously high. We'll use Z-scores to find it.
import pandas as pd
import numpy as np
# Hospital patient systolic blood pressure readings (mmHg)
bp = pd.DataFrame({
'patient': ['P01','P02','P03','P04','P05','P06','P07','P08','P09','P10'],
# Normal range is roughly 110–140 mmHg; P07 looks alarming
'systolic': [118, 122, 130, 125, 119, 135, 240, 128, 121, 116]
})
# Calculate mean and standard deviation for the column
mean_bp = bp['systolic'].mean()
std_bp = bp['systolic'].std()
# Compute Z-score for every reading: (value - mean) / std
bp['z_score'] = (bp['systolic'] - mean_bp) / std_bp
# Round for readability
bp['z_score'] = bp['z_score'].round(2)
# Flag any reading where the absolute Z-score exceeds 2.5.
# With only 10 readings and the sample standard deviation, |Z| can never
# exceed about 2.85, so the usual cutoff of 3 could not flag anything here.
bp['is_outlier'] = bp['z_score'].abs() > 2.5
print(bp)
print()
print("Outliers detected:")
print(bp[bp['is_outlier']])
  patient  systolic  z_score  is_outlier
0     P01       118    -0.47       False
1     P02       122    -0.36       False
2     P03       130    -0.15       False
3     P04       125    -0.28       False
4     P05       119    -0.44       False
5     P06       135    -0.01       False
6     P07       240     2.81        True
7     P08       128    -0.20       False
8     P09       121    -0.39       False
9     P10       116    -0.52       False

Outliers detected:
  patient  systolic  z_score  is_outlier
6     P07       240     2.81        True
💡 What just happened?
Patient P07's reading of 240 mmHg has a Z-score of 2.81, nearly three standard deviations above the mean. Every other patient sits between –0.52 and –0.01, which tells you the rest of the dataset is very homogeneous. Notice that P07 still falls short of the usual cutoff of 3: the extreme reading itself inflates the mean (135.4) and the standard deviation (about 37.2), and with only 10 readings no single value can score above roughly 2.85. That is why this example uses 2.5 instead. P07 almost certainly needs a second measurement or a data correction. The Z-score didn't just flag it; it quantified how extreme it is.
Visual — IQR Fence Diagram
Here's a visual representation of how IQR fences work on the order values dataset. The green zone is the "safe" range — everything outside the fences is flagged.
IQR fence diagram for order amounts ($): lower fence –$30.75 (no values below it), Q1 $66, Q3 $130.50, upper fence $227.25.
Method 3 — Visualising with a Boxplot
The boxplot is the most intuitive outlier detection tool because it shows you the IQR rule visually. The box spans Q1 to Q3, the line inside is the median, and each whisker extends to the most extreme data point that still lies inside its fence. Any point beyond the whiskers is drawn individually; those are your outliers.
The scenario: A delivery company tracks how many minutes late each driver's routes run. We want to see the distribution and spot any extreme delays with a quick visual check.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Delivery delay times in minutes — most on time, one badly late
delays = pd.DataFrame({
'driver': ['D01','D02','D03','D04','D05','D06','D07','D08','D09','D10'],
# Minutes late — negative means early; D09 had a major incident
'mins_late': [3, -2, 7, 5, 12, -1, 8, 4, 95, 6]
})
# Create a horizontal boxplot — seaborn handles outlier dots automatically
fig, ax = plt.subplots(figsize=(8, 3))
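# markerfacecolor / markeredgecolor control the colour of the outlier dots
sns.boxplot(x=delays['mins_late'], ax=ax, color='#7dd3fc',
            flierprops=dict(marker='o', markerfacecolor='#dc2626',
                            markeredgecolor='#dc2626', markersize=8))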
# Label the plot clearly
ax.set_xlabel('Minutes Late', fontsize=12)
ax.set_title('Delivery Delay Distribution — D09 Appears as a Red Dot', fontsize=13)
plt.tight_layout()
plt.show()
A horizontal boxplot renders. The box spans roughly 3 to 8 minutes (Q1 = 3.25, Q3 = 7.75), the median line sits at 5.5 minutes, and the whiskers reach –2 and 12 minutes. A single red dot appears far to the right at 95 minutes: driver D09. This one point is visually unmistakable against the tight cluster of normal values.
💡 What just happened?
Seaborn automatically applies the IQR rule when drawing a boxplot — any point beyond the whiskers gets plotted as an individual dot. You didn't have to write a single line of detection logic; the visual did the work. This is why boxplots are the go-to first look for outlier detection when you're exploring a new dataset. D09's 95-minute delay is the only dot outside the whiskers, making it immediately obvious.
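To see the numbers behind the picture, the sketch below recomputes the box, fences, whiskers, and outlier dots by hand. It assumes the default whisker length of 1.5 × IQR and quartiles computed by linear interpolation; the variable names are illustrative.

```python
from statistics import quantiles

mins_late = [3, -2, 7, 5, 12, -1, 8, 4, 95, 6]

# Quartiles with linear interpolation between data points
q1, med, q3 = quantiles(mins_late, n=4, method="inclusive")
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = [m for m in mins_late if lower_fence <= m <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is drawn as an individual dot
fliers = [m for m in mins_late if m < lower_fence or m > upper_fence]

print(f"Box: {q1} to {q3}, median {med}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outlier dots: {fliers}")
```

The box runs from 3.25 to 7.75 with the median at 5.5, the fences sit at −3.5 and 14.5, the whiskers end at −2 and 12, and 95 is the only point left over.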
Comparing IQR vs Z-Score Side by Side
Both methods work, but they're not always going to agree — and understanding why they disagree is part of EDA judgment.
| | IQR Method | Z-Score Method |
|---|---|---|
| Best for | Skewed data, non-normal distributions | Roughly normal distributions |
| Sensitive to outliers? | Far less: quartiles are rank-based | Yes: mean and std shift with extreme values |
| Threshold | 1.5 × IQR beyond Q1/Q3 | \|Z\| > 3 (or 2.5 to flag more aggressively) |
| Typical use | EDA, boxplots, financial data | Scientific data, sensor readings, ML pre-processing |
The scenario: Let's run both methods on the same dataset and see if they agree — or disagree — on what counts as an outlier.
import pandas as pd
import numpy as np
# Property sale prices — a typical skewed real-estate dataset
homes = pd.DataFrame({
'address': ['12 Oak St','7 Pine Ave','3 Elm Rd','55 Birch Ln',
'9 Cedar Ct','21 Maple Dr','88 Willow Blvd','4 Ash Pl',
'33 Ivy Way','16 Spruce Cl'],
# Sale price in thousands — one is a luxury mansion, one is a teardown
'price_k': [320, 410, 295, 380, 450, 15, 390, 425, 2800, 360]
})
# --- IQR METHOD ---
Q1 = homes['price_k'].quantile(0.25)
Q3 = homes['price_k'].quantile(0.75)
IQR = Q3 - Q1
homes['iqr_outlier'] = (homes['price_k'] < Q1 - 1.5*IQR) | \
(homes['price_k'] > Q3 + 1.5*IQR)
# --- Z-SCORE METHOD ---
mean_p = homes['price_k'].mean()
std_p = homes['price_k'].std()
homes['z_score'] = ((homes['price_k'] - mean_p) / std_p).round(2)
homes['zscore_outlier'] = homes['z_score'].abs() > 3
# Show the full comparison
print(homes[['address', 'price_k', 'iqr_outlier', 'z_score', 'zscore_outlier']])
          address  price_k  iqr_outlier  z_score  zscore_outlier
0       12 Oak St      320        False    -0.34           False
1      7 Pine Ave      410        False    -0.22           False
2        3 Elm Rd      295        False    -0.37           False
3     55 Birch Ln      380        False    -0.26           False
4      9 Cedar Ct      450        False    -0.17           False
5     21 Maple Dr       15         True    -0.72           False
6  88 Willow Blvd      390        False    -0.25           False
7        4 Ash Pl      425        False    -0.20           False
8      33 Ivy Way     2800         True     2.81           False
9    16 Spruce Cl      360        False    -0.28           False
💡 What just happened?
The IQR method flagged both the $15k teardown and the $2.8M mansion, but the Z-score flagged neither. Why? Because the $2,800k mansion inflated the mean (to $584.5k) and the standard deviation (to about $788k) so heavily that the mansion itself only scores a Z of 2.81, under the 3.0 threshold, and the teardown scores just –0.72. With only 10 rows, no value could score above about 2.85 in any case. The IQR method is far less affected because it uses quartiles, not the mean. This is a textbook example of why IQR is preferred for skewed real-world data like property prices.
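The difference in sensitivity can be made concrete by recomputing the summary statistics with and without the mansion. This is a rough sketch using only the standard library; the `summary` helper is hypothetical, and the quartiles use linear interpolation to match pandas' default.

```python
from statistics import mean, stdev, quantiles

prices = [320, 410, 295, 380, 450, 15, 390, 425, 2800, 360]   # in $ thousands
without_mansion = [p for p in prices if p != 2800]

def summary(values):
    """Mean, sample std, Q1 and Q3 for a list of numbers (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"mean": mean(values), "std": stdev(values), "q1": q1, "q3": q3}

with_stats = summary(prices)
without_stats = summary(without_mansion)

for key in ("mean", "std", "q1", "q3"):
    print(f"{key:>4}: with mansion {with_stats[key]:8.1f} | without {without_stats[key]:8.1f}")
```

Dropping a single row cuts the standard deviation to roughly a sixth of its previous value and pulls the mean down sharply, while Q1 and Q3 each move by little more than $10k. That stability is what lets the IQR fences still see both extremes.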
🍎 Teacher's Note
Detection is just the beginning. Once you've flagged outliers, your next step is to investigate — not immediately delete. Always ask: does this value make sense given the business context? A $2.8M house in a luxury suburb is normal. A $2.8M grocery bill is not. In the next lesson, we'll cover the strategies for what to actually do with them once you've found them.
Practice Questions
1. What does IQR stand for in the context of outlier detection?
2. What is the standard Z-score threshold beyond which a value is typically flagged as an outlier?
3. In the IQR method, what multiplier is applied to the IQR to calculate the upper and lower fences?
Quiz
1. A dataset of house prices contains one extreme luxury property worth $50 million. Which outlier detection method is more reliable for this data?
2. Given Q1 = 50 and Q3 = 150, what is the upper fence using the IQR method?
3. You detect a value flagged as an outlier by both IQR and Z-score. What should you do first?
Up Next · Lesson 9
Outlier Handling
Now that you can find outliers, learn when to remove them, when to cap them, and when to leave them exactly where they are.