EDA Lesson 8 – Detecting Outliers | Dataplexa

Beginner Level · Lesson 8

Detecting Outliers

Outliers are the rows that don't play by the rules — they can be your most important data points, or the most dangerous ones, and knowing which requires you to find them first.

The Row That Broke the Average

Picture a small startup with 9 employees all earning between $55,000 and $80,000. The average salary comes out to $68,000, which is perfectly reasonable. Then the CEO's $1,200,000 compensation package gets added to the dataset. Suddenly the average jumps to $181,200, a figure that describes nobody on the payroll, and every statistic built on that mean is now misleading.

That's what outliers do. They're data points that sit abnormally far from the bulk of your data, and they distort every mean, correlation, and model you build — unless you catch them first.
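A quick way to check that arithmetic is to run it. The sketch below uses nine hypothetical salaries (made up here so that they fall between $55,000 and $80,000 and average exactly $68,000; they are not from a real dataset) and compares how the mean and the median react when the CEO's package is added.

```python
from statistics import mean, median

# Hypothetical salaries for the 9 employees (illustrative values only,
# chosen to average exactly $68,000)
salaries = [55_000, 60_000, 62_000, 65_000, 68_000,
            70_000, 74_000, 78_000, 80_000]

mean_before = mean(salaries)
median_before = median(salaries)

# Add the CEO's $1,200,000 compensation package
with_ceo = salaries + [1_200_000]
mean_after = mean(with_ceo)
median_after = median(with_ceo)

print(f"Mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"Median: {median_before:,.0f} -> {median_after:,.0f}")
```

With these values the mean moves from 68,000 to 181,200 while the median only moves from 68,000 to 69,000, which is why a rank-based statistic is a useful cross-check whenever an extreme value is suspected.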

⚠️ Outlier or Error? — Not every outlier is a mistake. A marathon runner logging a 2-hour finish time is a legitimate record. A marathon finish time of 2 minutes is a data entry error. Before removing anything, always ask: is this value surprising, or is it impossible?

Method 1 — The IQR Rule

The IQR rule is one of the most widely used outlier detection methods. The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and the rule flags anything that falls too far outside that range as a potential outlier.

The formula is simple: calculate Q1 (the 25th percentile) and Q3 (the 75th percentile). The IQR is Q3 minus Q1. Any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is flagged.

Lower Fence: Q1 − 1.5 × IQR. Values below this are low outliers.

Upper Fence: Q3 + 1.5 × IQR. Values above this are high outliers.
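To make the fence arithmetic concrete, here is a minimal sketch that computes the quartiles by hand. It assumes quartiles are found by linear interpolation between the two nearest ranks (the default behaviour of pandas' `quantile`). The helper names `quantile_linear` and `iqr_fences` and the toy data are hypothetical, chosen only so the numbers are easy to follow.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    data = sorted(values)
    pos = q * (len(data) - 1)              # fractional index into the sorted data
    lower = int(pos)                       # rank just below that position
    upper = min(lower + 1, len(data) - 1)  # rank just above (clamped at the end)
    frac = pos - lower                     # how far between the two ranks
    return data[lower] + frac * (data[upper] - data[lower])


def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the k * IQR rule."""
    q1 = quantile_linear(values, 0.25)
    q3 = quantile_linear(values, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy data: Q1 = 20, Q3 = 40, so IQR = 20
sample = [10, 20, 30, 40, 100]

low, high = iqr_fences(sample)
flagged = [v for v in sample if v < low or v > high]

print(f"Fences: {low} to {high}")
print(f"Flagged: {flagged}")
```

Here the fences come out at -10 and 70, so only the value 100 is flagged.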

The scenario: An e-commerce platform logs order values. Most orders are between $10 and $200, but a few look suspicious — either tiny test transactions or unusually large bulk orders. Let's find them.

import pandas as pd
import numpy as np

# E-commerce order values — most normal, a few suspicious ones
orders = pd.DataFrame({
    'order_id': range(1001, 1011),
    # Order amounts in USD — includes two extreme values
    'amount':   [45, 120, 89, 0.50, 155, 78, 1850, 62, 99, 134]
})

# Step 1: Calculate Q1 and Q3 (25th and 75th percentile)
Q1 = orders['amount'].quantile(0.25)
Q3 = orders['amount'].quantile(0.75)

# Step 2: Calculate the IQR (interquartile range)
IQR = Q3 - Q1

# Step 3: Define the fences — anything outside is flagged
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

print(f"Q1: {Q1}  |  Q3: {Q3}  |  IQR: {IQR}")
print(f"Lower fence: {lower_fence:.2f}")
print(f"Upper fence: {upper_fence:.2f}")
print()

# Step 4: Boolean mask — True where a value IS an outlier
is_outlier = (orders['amount'] < lower_fence) | (orders['amount'] > upper_fence)

# Step 5: Filter to show only the flagged rows
print("Flagged outliers:")
print(orders[is_outlier])

Q1: 66.0  |  Q3: 130.5  |  IQR: 64.5
Lower fence: -30.75
Upper fence: 227.25

Flagged outliers:
   order_id  amount
6      1007  1850.0

💡 What just happened?

The IQR method flagged one order: the $1,850 bulk order (possibly legitimate but worth investigating). The $0.50 micro-transaction (almost certainly a test or error) was not flagged, because the lower fence came out negative (-$30.75) and no real order value can fall below it. With this data the IQR rule can only catch high-side outliers, so suspiciously small amounts need a separate check, such as a minimum order value. That's useful context too.
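Because the lower fence for this data is negative, the IQR rule by itself cannot flag a small positive amount such as the $0.50 order. One possible approach, sketched below, is to pair the statistical fence with a domain rule. The `MIN_VALID_ORDER` threshold of $1.00 is a hypothetical business rule introduced for illustration, not something the dataset defines. The sketch uses only the standard library; `statistics.quantiles` with `method="inclusive"` interpolates linearly between ranks.

```python
from statistics import quantiles

# Same order data as in the example above: (order_id, amount in USD)
orders = list(zip(range(1001, 1011),
                  [45, 120, 89, 0.50, 155, 78, 1850, 62, 99, 134]))
amounts = [amount for _, amount in orders]

# Quartiles with linear interpolation between ranks
q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Hypothetical business rule: anything under $1 is treated as a test charge
MIN_VALID_ORDER = 1.00

# Collect every order that needs a second look, with the reason(s) why
review = {}
for order_id, amount in orders:
    reasons = []
    if amount < lower_fence or amount > upper_fence:
        reasons.append("outside IQR fences")
    if amount < MIN_VALID_ORDER:
        reasons.append("below minimum order value")
    if reasons:
        review[order_id] = reasons

for order_id, reasons in review.items():
    print(order_id, reasons)
```

Keeping the reason next to each flag makes the later investigation easier: order 1007 is statistically unusual, while order 1004 is suspicious only because of the domain rule.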

Method 2 — The Z-Score

The Z-score measures how many standard deviations a value sits away from the mean. A Z-score of 0 means you're exactly at the mean. A Z-score of 3 means you're three standard deviations out — and that's where most analysts draw the line.

The formula: Z = (value − mean) / standard deviation. Any value with |Z| > 3 is typically flagged as an outlier. This method assumes your data is roughly normally distributed — if it's heavily skewed, the IQR method is usually more reliable.
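One caveat is worth checking before relying on the |Z| > 3 rule with small datasets. When the standard deviation is the sample standard deviation (the default for pandas' `std`), the largest |Z| a single value can reach among n points is (n − 1) / √n. The sketch below is a small illustration with made-up numbers: it tabulates that ceiling and confirms it with a deliberately extreme sample. The helper name `max_abs_z` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def max_abs_z(n):
    """Largest |Z| a single value can reach among n points (sample std)."""
    return (n - 1) / sqrt(n)


for n in (5, 10, 11, 30, 100):
    print(f"n = {n:>3}: max |Z| = {max_abs_z(n):.2f}")

# Made-up extreme case: nine identical readings and one huge value
values = [120] * 9 + [10_000]
z_extreme = (values[-1] - mean(values)) / stdev(values)
print(f"Extreme sample of 10: Z = {z_extreme:.3f}")
```

With 10 values the ceiling is about 2.85, so a cut-off of 3 can only trigger once a column has at least 11 values. For small samples, a lower threshold or the IQR rule is the safer choice.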

The scenario: A hospital logs patient blood pressure readings. One reading looks suspiciously high. We'll use Z-scores to find it.

import pandas as pd
import numpy as np

# Hospital patient systolic blood pressure readings (mmHg)
bp = pd.DataFrame({
    'patient': ['P01','P02','P03','P04','P05','P06','P07','P08','P09','P10'],
    # Normal range is roughly 110–140 mmHg; P07 looks alarming
    'systolic': [118, 122, 130, 125, 119, 135, 240, 128, 121, 116]
})

# Calculate mean and standard deviation for the column
mean_bp = bp['systolic'].mean()
std_bp  = bp['systolic'].std()

# Compute Z-score for every reading: (value - mean) / std
bp['z_score'] = (bp['systolic'] - mean_bp) / std_bp

# Round for readability
bp['z_score'] = bp['z_score'].round(2)

# Flag any reading where the absolute Z-score exceeds 2.5.
# With only 10 readings, a sample Z-score can never exceed about 2.85,
# so the usual cut-off of 3 could not flag anything here.
bp['is_outlier'] = bp['z_score'].abs() > 2.5

print(bp)
print()
print("Outliers detected:")
print(bp[bp['is_outlier']])

  patient  systolic  z_score  is_outlier
0     P01       118    -0.47       False
1     P02       122    -0.36       False
2     P03       130    -0.15       False
3     P04       125    -0.28       False
4     P05       119    -0.44       False
5     P06       135    -0.01       False
6     P07       240     2.81        True
7     P08       128    -0.20       False
8     P09       121    -0.39       False
9     P10       116    -0.52       False

Outliers detected:
  patient  systolic  z_score  is_outlier
6     P07       240     2.81        True

💡 What just happened?

Patient P07's reading of 240 mmHg has a Z-score of 2.81, almost three standard deviations above the mean. Every other patient sits between -0.52 and -0.01, so P07 stands well apart from the rest. Notice that 2.81 is below the usual cut-off of 3: P07's own reading inflates the mean (135.4) and the standard deviation (about 37.2), and in a sample of 10 no Z-score can exceed about 2.85. That is why the code uses the stricter 2.5 threshold. P07 almost certainly needs a second measurement or a data correction. The Z-score didn't just flag it; it quantified how extreme it is.

Visual — IQR Fence Diagram

Here's a visual representation of how IQR fences work on the order values dataset. The green zone is the "safe" range — everything outside the fences is flagged.

[Figure: IQR Fence, Order Amounts ($). Markers: Q1 $66, Q3 $130.50, lower fence -$30.75 (no values below it), upper fence $227.25. Legend: Safe zone (IQR range), Normal values, Flagged outliers.]

Method 3 — Visualising with a Boxplot

The boxplot is the most intuitive outlier detection tool because it shows you the IQR rule visually. The box spans Q1 to Q3, the line inside is the median, and each whisker extends to the most extreme data point that still lies within the fences. Any point beyond the whiskers is drawn individually, and those are your outliers.

The scenario: A delivery company tracks how many minutes late each driver's routes run. We want to see the distribution and spot any extreme delays with a quick visual check.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Delivery delay times in minutes — most on time, one badly late
delays = pd.DataFrame({
    'driver': ['D01','D02','D03','D04','D05','D06','D07','D08','D09','D10'],
    # Minutes late — negative means early; D09 had a major incident
    'mins_late': [3, -2, 7, 5, 12, -1, 8, 4, 95, 6]
})

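# Create a horizontal boxplot; seaborn draws points beyond the whiskers as individual markers.
# The marker face and edge colours are set explicitly so the outlier dot renders red.
fig, ax = plt.subplots(figsize=(8, 3))
sns.boxplot(x=delays['mins_late'], ax=ax, color='#7dd3fc',
            flierprops=dict(marker='o', markerfacecolor='#dc2626',
                            markeredgecolor='#dc2626', markersize=8))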

# Label the plot clearly
ax.set_xlabel('Minutes Late', fontsize=12)
ax.set_title('Delivery Delay Distribution — D09 Appears as a Red Dot', fontsize=13)
plt.tight_layout()
plt.show()

A horizontal boxplot renders. The box spans approximately 3 to 8 minutes (Q1 = 3.25, Q3 = 7.75),
the median line sits at 5.5 minutes, and the whiskers reach -2 and 12 minutes.
A single red dot appears far to the right at 95 minutes: driver D09.
This one point is visually unmistakable against the tight cluster of normal values.

💡 What just happened?

Seaborn automatically applies the IQR rule when drawing a boxplot — any point beyond the whiskers gets plotted as an individual dot. You didn't have to write a single line of detection logic; the visual did the work. This is why boxplots are the go-to first look for outlier detection when you're exploring a new dataset. D09's 95-minute delay is the only dot outside the whiskers, making it immediately obvious.
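If you want the plot's geometry as numbers, the same rule can be reproduced by hand. This sketch assumes the default whisker length of 1.5 × IQR and quartiles interpolated linearly between ranks, which is how the matplotlib defaults that seaborn builds on are assumed to behave here. It uses only the standard library.

```python
from statistics import quantiles

# Minutes late, same values as the delivery example
mins_late = [3, -2, 7, 5, 12, -1, 8, 4, 95, 6]

# Quartiles with linear interpolation between ranks
q1, med, q3 = quantiles(mins_late, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = [v for v in mins_late if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is drawn as an individual point
fliers = [v for v in mins_late if v < lower_fence or v > upper_fence]

print(f"Box: {q1} to {q3} (median {med})")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Fliers: {fliers}")
```

The fences (-3.5 and 14.5) are never drawn on the plot. The whiskers stop at -2 and 12 because those are the last real data points inside the fences, and 95 is the only value left over.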

Comparing IQR vs Z-Score Side by Side

Both methods work, but they're not always going to agree — and understanding why they disagree is part of EDA judgment.

Aspect	IQR Method	Z-Score Method
Best for	Skewed data, non-normal distributions	Roughly normal distributions
Sensitive to outliers?	No: quartiles are rank-based, so extreme values barely move them	Yes: mean and std shift with extreme values
Threshold	1.5 × IQR beyond Q1/Q3	|Z| > 3 (or 2.5 for stricter)
Typical use	EDA, boxplots, financial data	Scientific data, sensor readings, ML pre-processing

The scenario: Let's run both methods on the same dataset and see if they agree — or disagree — on what counts as an outlier.

import pandas as pd
import numpy as np

# Property sale prices — a typical skewed real-estate dataset
homes = pd.DataFrame({
    'address': ['12 Oak St','7 Pine Ave','3 Elm Rd','55 Birch Ln',
                '9 Cedar Ct','21 Maple Dr','88 Willow Blvd','4 Ash Pl',
                '33 Ivy Way','16 Spruce Cl'],
    # Sale price in thousands — one is a luxury mansion, one is a teardown
    'price_k':  [320, 410, 295, 380, 450, 15, 390, 425, 2800, 360]
})

# --- IQR METHOD ---
Q1  = homes['price_k'].quantile(0.25)
Q3  = homes['price_k'].quantile(0.75)
IQR = Q3 - Q1
homes['iqr_outlier'] = (homes['price_k'] < Q1 - 1.5*IQR) | \
                     (homes['price_k'] > Q3 + 1.5*IQR)

# --- Z-SCORE METHOD ---
mean_p = homes['price_k'].mean()
std_p  = homes['price_k'].std()
homes['z_score']     = ((homes['price_k'] - mean_p) / std_p).round(2)
homes['zscore_outlier'] = homes['z_score'].abs() > 3

# Show the full comparison
print(homes[['address', 'price_k', 'iqr_outlier', 'z_score', 'zscore_outlier']])

          address  price_k  iqr_outlier  z_score  zscore_outlier
0       12 Oak St      320        False    -0.34           False
1      7 Pine Ave      410        False    -0.22           False
2        3 Elm Rd      295        False    -0.37           False
3     55 Birch Ln      380        False    -0.26           False
4      9 Cedar Ct      450        False    -0.17           False
5     21 Maple Dr       15         True    -0.72           False
6  88 Willow Blvd      390        False    -0.25           False
7        4 Ash Pl      425        False    -0.20           False
8      33 Ivy Way     2800         True     2.81           False
9    16 Spruce Cl      360        False    -0.28           False

💡 What just happened?

The IQR method flagged both the $15k teardown and the $2.8M mansion, but the Z-score didn't flag either of them. Why? Because the mansion inflated the mean and standard deviation so heavily that its own $2,800k price only scores a Z of 2.81, under the 3.0 threshold, and the teardown scores just -0.72. The IQR method is not thrown off by this because it uses quartiles, not the mean. This is a textbook example of why IQR is preferred for skewed real-world data like property prices.
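To see how much the mansion distorts its own yardstick, one illustrative check is to score it against the other nine homes only. This leave-one-out comparison is a sketch added for intuition, not a standard detection step, and it uses only the standard library.

```python
from statistics import mean, stdev

# Sale prices in thousands, same values as the homes example
prices = [320, 410, 295, 380, 450, 15, 390, 425, 2800, 360]
mansion = 2800

# Z-score with the mansion included in the mean and std
z_included = (mansion - mean(prices)) / stdev(prices)

# Z-score measured against the other nine homes only
others = [p for p in prices if p != mansion]
z_excluded = (mansion - mean(others)) / stdev(others)

print(f"Mean/std with mansion:    {mean(prices):.1f} / {stdev(prices):.1f}")
print(f"Mean/std without mansion: {mean(others):.1f} / {stdev(others):.1f}")
print(f"Z with mansion included:  {z_included:.2f}")
print(f"Z against the other nine: {z_excluded:.2f}")
```

Including the mansion pushes the standard deviation from roughly 131 to roughly 788, which is what drags its Z-score down to 2.81. Measured against the other nine homes, the same price sits far beyond any reasonable threshold.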

🍎 Teacher's Note

Detection is just the beginning. Once you've flagged outliers, your next step is to investigate — not immediately delete. Always ask: does this value make sense given the business context? A $2.8M house in a luxury suburb is normal. A $2.8M grocery bill is not. In the next lesson, we'll cover the strategies for what to actually do with them once you've found them.

Practice Questions

1. What does IQR stand for in the context of outlier detection?

2. What is the standard Z-score threshold beyond which a value is typically flagged as an outlier?

3. In the IQR method, what multiplier is applied to the IQR to calculate the upper and lower fences?

Up Next · Lesson 9

Outlier Handling

Now that you can find outliers, learn when to remove them, when to cap them, and when to leave them exactly where they are.
