EDA Course
Advanced Outlier Detection
The IQR rule from Lesson 23 works well on a single column. But real outliers often hide in the relationship between columns — a transaction that looks normal on its own but is bizarre when you know the customer's usual behaviour. This lesson covers three methods that catch what the IQR rule misses.
Why the IQR Rule Isn't Always Enough
The IQR rule checks one column at a time: is this value more than 1.5 IQRs beyond the edges of the box? It's fast, simple, and effective for univariate outliers. But consider a transaction dataset with two columns — transaction_amount and customer_avg_spend. A transaction of £5,000 might be a perfectly normal purchase for one customer and an alarming anomaly for another. Looking at £5,000 in isolation, the IQR rule misses that context entirely.
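To make the gap concrete, here is a minimal sketch with made-up numbers (not the lesson's dataset): large purchases are common in this sample, so a £500 transaction sails past the univariate IQR fence — even though it is 10× that customer's usual spend.

```python
import pandas as pd

# Hypothetical mini-dataset: big purchases are common overall,
# so £500 looks unremarkable on the 'amount' column alone...
df = pd.DataFrame({
    'amount':       [45, 60, 500, 480, 520, 55, 490],
    'customer_avg': [50, 58,  50, 475, 510, 52, 485],
})

q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df['iqr_flag'] = (df['amount'] < low) | (df['amount'] > high)
print(df['iqr_flag'].sum())  # 0 — the univariate IQR rule flags nothing

# ...yet row 2 is £500 from a customer who averages £50 — a 10x jump
# that only the customer_avg context reveals
df['ratio'] = df['amount'] / df['customer_avg']
print(df['ratio'].round(2).tolist())
```

The ratio column here is just a hand-rolled illustration of "context"; the methods below do this properly across several columns at once.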
Three methods handle these harder cases:
Z-Score
How many standard deviations from the mean is this value? Best for normally distributed data. Fast and interpretable. Sensitive to extreme outliers skewing the mean.
Isolation Forest
A machine learning approach. Outliers are easier to "isolate" with random cuts — the algorithm finds them by how quickly they separate from the rest. Works across multiple columns simultaneously.
Local Outlier Factor (LOF)
Compares a point's local density to that of its neighbours. A point that is far from its nearest neighbours — even if it's not globally extreme — is flagged as an outlier. Best for datasets where normal behaviour forms clusters of varying density.
The Dataset We'll Use
The scenario: You're a fraud analyst at a payments company. Your team has been asked to flag suspicious transactions in a sample of 20 customer purchases before they go to the manual review team. The fraud manager has given you one instruction: "Don't just look at the amount. Look at whether the transaction makes sense given everything else we know about that customer. A £400 purchase might be completely normal for one person and completely alarming for another." You need to use multiple methods and compare what each one catches.
import pandas as pd
import numpy as np
# Transaction dataset — 20 purchases with customer context
df = pd.DataFrame({
'txn_id': range(1, 21),
'amount': [45, 62, 51, 480, 38, 55, 44, 890, 49, 60,
42, 58, 470, 46, 53, 39, 52, 1250, 48, 61],
'customer_avg': [50, 55, 48, 52, 42, 58, 46, 54, 51, 62,
44, 57, 49, 48, 50, 40, 55, 51, 47, 60],
# customer_avg = this customer's typical spend — the context the fraud manager mentioned
'hour_of_day': [14, 10, 16, 3, 11, 15, 13, 2, 14, 9,
12, 16, 4, 11, 14, 10, 13, 1, 15, 11],
# late-night transactions (hours 1–4) are a fraud signal
'transactions_today': [1, 2, 1, 8, 1, 2, 1, 12, 1, 2,
1, 2, 6, 1, 2, 1, 1, 15, 1, 2],
# very high transaction count in one day is a fraud signal
'is_fraud': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
# ground truth — 4 fraudulent transactions: 4, 8, 13, 18
})
print("Known fraud transactions:")
print(df[df['is_fraud']==1][['txn_id','amount','customer_avg','hour_of_day',
'transactions_today']].to_string(index=False))
Known fraud transactions:
txn_id amount customer_avg hour_of_day transactions_today
4 480 52 3 8
8 890 54 2 12
13 470 49 4 6
18 1250 51 1 15
What just happened?
The four fraud transactions share a clear pattern: amounts far above the customer's average spend, occurring in the early hours (1–4am), with an unusually high number of transactions on that day. The fraud manager was right — each signal alone is manageable, but the combination makes these transactions unmistakably suspicious. Let's see which detection method catches all four.
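Before reaching for machine learning, it's worth seeing that this particular combination can even be written as a hand-coded rule. A sketch on a subset of the rows above (the thresholds — 3× average, hours 1–4, 5+ transactions — are illustrative, not from the lesson):

```python
import pandas as pd

# Subset of the lesson's transactions: fraud rows 4 and 8, two normals
df = pd.DataFrame({
    'txn_id':             [1,   4,   8,  9],
    'amount':             [45, 480, 890, 49],
    'customer_avg':       [50,  52,  54, 51],
    'hour_of_day':        [14,   3,   2, 14],
    'transactions_today': [1,    8,  12,  1],
})

# Encode each signal as a boolean; the combination is what matters
above_avg = df['amount'] > 3 * df['customer_avg']   # far above this customer's norm
late_night = df['hour_of_day'].between(1, 4)        # early-hours transaction
many_txns = df['transactions_today'] >= 5           # burst of activity in one day
df['rule_flag'] = above_avg & late_night & many_txns
print(df[df['rule_flag']]['txn_id'].tolist())  # [4, 8]
```

Hand-written rules like this are brittle — they only catch patterns you thought to encode — which is exactly why the rest of the lesson uses methods that learn what "unusual" looks like from the data itself.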
Method 1 — Z-Score
The scenario: The fraud manager wants a quick first pass using the simplest method. "Start with z-scores," she says. "Flag anything more than 2 standard deviations from the mean on amount. It won't catch everything but it'll catch the obvious ones fast." You run the z-score on the amount column and see what it gets.
from scipy import stats
# Z-score: how many standard deviations from the mean is each value?
# A positive z-score means above the mean, negative means below
# A threshold of |z| > 2 flags roughly the most extreme 5% of a normal distribution (about 2.5% in each tail)
df['z_amount'] = stats.zscore(df['amount'], ddof=1).round(2)  # ddof=1 matches pandas' sample std
# Flag any transaction with |z| > 2
ZSCORE_THRESHOLD = 2.0
df['flag_zscore'] = (df['z_amount'].abs() > ZSCORE_THRESHOLD).astype(int)
print("=== Z-SCORE RESULTS ===\n")
print(f"Mean amount: £{df['amount'].mean():.1f} | Std: £{df['amount'].std():.1f}\n")
print(f"{'TxnID':>6} {'Amount':>8} {'Z-Score':>8} {'Flagged?':>9} {'Is Fraud?':>10}")
print("─" * 52)
for _, row in df.iterrows():
    flag = "⚠ YES" if row['flag_zscore'] else " no"
    fraud = "✓ FRAUD" if row['is_fraud'] else ""
    print(f" {int(row['txn_id']):>4} £{row['amount']:>7} {row['z_amount']:>8.2f} "
          f"{flag:>9} {fraud}")
flagged = df['flag_zscore'].sum()
caught = df[(df['flag_zscore']==1) & (df['is_fraud']==1)]['txn_id'].count()
print(f"\nFlagged: {flagged} | Fraud caught: {caught}/4 | False positives: {flagged-caught}")
=== Z-SCORE RESULTS ===
Mean amount: £194.7 | Std: £331.6
TxnID Amount Z-Score Flagged? Is Fraud?
────────────────────────────────────────────────────
1 £45 -0.45 no
2 £62 -0.40 no
3 £51 -0.43 no
4 £480 0.86 no ✓ FRAUD ← missed!
5 £38 -0.47 no
6 £55 -0.42 no
7 £44 -0.45 no
8 £890 2.10 ⚠ YES ✓ FRAUD
9 £49 -0.44 no
10 £60 -0.41 no
11 £42 -0.46 no
12 £58 -0.41 no
13 £470 0.83 no ✓ FRAUD ← missed!
14 £46 -0.45 no
15 £53 -0.43 no
16 £39 -0.47 no
17 £52 -0.43 no
18 £1250 3.18 ⚠ YES ✓ FRAUD
19 £48 -0.44 no
20 £61 -0.40 no
Flagged: 2 | Fraud caught: 2/4 | False positives: 0
What just happened?
scipy's stats.zscore() computes the z-score for every value in the column — subtracting the mean and dividing by the standard deviation. A z-score of +2.1 means the value sits 2.1 standard deviations above the mean.
Z-score caught 2 of 4 fraud transactions — and zero false positives. But it missed transactions 4 (£480) and 13 (£470) because, with £890 and £1,250 in the dataset, the mean and standard deviation are inflated — making £480 look almost normal. This is z-score's known weakness: extreme outliers distort the mean and std, which shrinks the relative z-scores of moderate outliers.
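A common fix for this weakness — beyond the lesson's pipeline, but worth knowing — is the modified z-score, which swaps the mean and std for the median and MAD (median absolute deviation), so a single £1,250 value can't inflate the yardstick. A sketch on the same amount column:

```python
import numpy as np

amounts = np.array([45, 62, 51, 480, 38, 55, 44, 890, 49, 60,
                    42, 58, 470, 46, 53, 39, 52, 1250, 48, 61])

# Modified z-score (Iglewicz-Hoaglin): median and MAD are robust —
# the extreme values barely move them (median 52.5, MAD 8.0)
median = np.median(amounts)
mad = np.median(np.abs(amounts - median))
mod_z = 0.6745 * (amounts - median) / mad

# With the conventional |modified z| > 3.5 cutoff, all four large
# amounts stand out — including the £470/£480 plain z-score missed
flagged = np.where(np.abs(mod_z) > 3.5)[0] + 1  # 1-based txn ids
print(flagged)  # txns 4, 8, 13, 18
```

It is still univariate, though — it would also flag a legitimately large purchase — so it complements rather than replaces the multivariate methods below.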
Method 2 — Isolation Forest
The scenario: After seeing z-scores miss two fraud cases, the fraud manager escalates: "We need something that looks at all four signals together — amount, customer average, hour, and transaction count. One weird column isn't always fraud. Four weird columns at once almost certainly is." Isolation Forest is built for exactly this: it detects anomalies across multiple columns simultaneously.
from sklearn.ensemble import IsolationForest
# IsolationForest is part of sklearn — Python's main ML library
# It works by building random decision trees and seeing how quickly
# each point gets "isolated" from the rest
features = ['amount', 'customer_avg', 'hour_of_day', 'transactions_today']
X = df[features]
# contamination: estimated proportion of outliers in the data
# We have 4 fraud in 20 rows = 0.20
model = IsolationForest(contamination=0.20, random_state=42)
# random_state=42 ensures the same result every time (reproducibility)
# .fit_predict() trains and predicts in one step
# Returns -1 for outliers, +1 for inliers
preds = model.fit_predict(X)
df['flag_iforest'] = (preds == -1).astype(int) # convert -1/+1 to 1/0
print("=== ISOLATION FOREST RESULTS ===\n")
print(f"{'TxnID':>6} {'Amount':>8} {'Hour':>5} {'TxnsToday':>10} {'Flagged?':>9} {'Is Fraud?':>10}")
print("─" * 60)
for _, row in df.iterrows():
    flag = "⚠ YES" if row['flag_iforest'] else " no"
    fraud = "✓ FRAUD" if row['is_fraud'] else ""
    print(f" {int(row['txn_id']):>4} £{row['amount']:>7} {int(row['hour_of_day']):>5} "
          f"{int(row['transactions_today']):>10} {flag:>9} {fraud}")
flagged = df['flag_iforest'].sum()
caught = df[(df['flag_iforest']==1) & (df['is_fraud']==1)]['txn_id'].count()
print(f"\nFlagged: {flagged} | Fraud caught: {caught}/4 | False positives: {flagged-caught}")
=== ISOLATION FOREST RESULTS ===
TxnID Amount Hour TxnsToday Flagged? Is Fraud?
────────────────────────────────────────────────────────────
1 £45 14 1 no
2 £62 10 2 no
3 £51 16 1 no
4 £480 3 8 ⚠ YES ✓ FRAUD
5 £38 11 1 no
6 £55 15 2 no
7 £44 13 1 no
8 £890 2 12 ⚠ YES ✓ FRAUD
9 £49 14 1 no
10 £60 9 2 no
11 £42 12 1 no
12 £58 16 2 no
13 £470 4 6 ⚠ YES ✓ FRAUD
14 £46 11 1 no
15 £53 14 2 no
16 £39 10 1 no
17 £52 13 1 no
18 £1250 1 15 ⚠ YES ✓ FRAUD
19 £48 15 1 no
20 £61 11 2 no
Flagged: 4 | Fraud caught: 4/4 | False positives: 0
What just happened?
sklearn's IsolationForest builds random decision trees and measures how many splits it takes to isolate each point. Normal points sit in dense regions — they take many splits to separate. Outliers sit alone — they isolate in just a few splits. The contamination=0.20 parameter tells the model to expect 20% outliers (matching our known fraud rate of 4/20).
Perfect result: 4 flagged, 4 fraud, 0 false positives. Isolation Forest caught all four because it considered all columns together. Transaction 4 (£480 at 3am with 8 transactions that day) was clearly anomalous in the multivariate space even though £480 wasn't extreme when looked at alone. This is the power of multivariate outlier detection.
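When the review team wants a ranking rather than a yes/no flag, a fitted IsolationForest also exposes a continuous anomaly score via decision_function(). A self-contained sketch with illustrative values (a subset of the lesson's columns, not its full dataset):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative sample: three extreme rows (480/890/1250 at late-night
# hours with transaction bursts) among five normal ones
X = pd.DataFrame({
    'amount':             [45, 62, 51, 480, 38, 890, 49, 1250],
    'hour_of_day':        [14, 10, 16,   3, 11,   2, 14,    1],
    'transactions_today': [ 1,  2,  1,   8,  1,  12,  1,   15],
})

# decision_function() gives a continuous score per row: the lower
# (more negative), the more anomalous. fit_predict()'s -1/+1 labels
# are just this score thresholded at zero.
model = IsolationForest(random_state=42).fit(X)
scores = model.decision_function(X)

# Sort into a review queue, most suspicious first
queue = X.assign(score=scores.round(3)).sort_values('score')
print(queue)
```

The ranking doesn't depend on the contamination setting — only the position of the -1/+1 cutoff does — which makes scores a safer output when you're unsure of the true outlier rate.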
Method 3 — Local Outlier Factor
The scenario: A senior analyst on the fraud team asks you to run a third method as a cross-check. "Isolation Forest is good for global outliers. But Local Outlier Factor catches points that are locally unusual — they might not look extreme globally, but they're far from their nearest neighbours. Run it as a sanity check and see if it agrees with Isolation Forest on which four transactions are suspicious."
from sklearn.neighbors import LocalOutlierFactor
# LOF measures how isolated each point is relative to its k nearest neighbours
# If a point's neighbours are much denser than the point's own neighbourhood,
# it gets a high LOF score — it's "locally" unusual
# n_neighbors: how many neighbours to compare against (5 is a common starting point)
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.20)
# Note: in its default (novelty=False) mode, LOF has no separate .predict() —
# fit_predict() trains and predicts on the same data in one step
preds_lof = lof.fit_predict(X) # X is the same 4-feature matrix from Isolation Forest
df['flag_lof'] = (preds_lof == -1).astype(int)
# LOF also provides a score — the more negative, the more anomalous
df['lof_score'] = lof.negative_outlier_factor_.round(3)
print("=== LOCAL OUTLIER FACTOR RESULTS ===\n")
print(f"{'TxnID':>6} {'Amount':>8} {'LOF Score':>10} {'Flagged?':>9} {'Is Fraud?':>10}")
print("─" * 54)
for _, row in df.iterrows():
    flag = "⚠ YES" if row['flag_lof'] else " no"
    fraud = "✓ FRAUD" if row['is_fraud'] else ""
    print(f" {int(row['txn_id']):>4} £{row['amount']:>7} {row['lof_score']:>10.3f} {flag:>9} {fraud}")
flagged = df['flag_lof'].sum()
caught = df[(df['flag_lof']==1) & (df['is_fraud']==1)]['txn_id'].count()
print(f"\nFlagged: {flagged} | Fraud caught: {caught}/4 | False positives: {flagged-caught}")
=== LOCAL OUTLIER FACTOR RESULTS ===
TxnID Amount LOF Score Flagged? Is Fraud?
──────────────────────────────────────────────────────
1 £45 -1.041 no
2 £62 -1.028 no
3 £51 -1.035 no
4 £480 -4.821 ⚠ YES ✓ FRAUD
5 £38 -1.033 no
6 £55 -1.029 no
7 £44 -1.038 no
8 £890 -6.912 ⚠ YES ✓ FRAUD
9 £49 -1.036 no
10 £60 -1.031 no
11 £42 -1.040 no
12 £58 -1.030 no
13 £470 -4.714 ⚠ YES ✓ FRAUD
14 £46 -1.037 no
15 £53 -1.032 no
16 £39 -1.041 no
17 £52 -1.034 no
18 £1250 -12.441 ⚠ YES ✓ FRAUD
19 £48 -1.036 no
20 £61 -1.029 no
Flagged: 4 | Fraud caught: 4/4 | False positives: 0
What just happened?
sklearn's LocalOutlierFactor compares each point's local density with the density of its 5 nearest neighbours. Normal transactions (LOF score around −1.03) are in dense, uniform regions. The four fraud transactions score between −4.7 and −12.4 — they're in sparse, isolated regions of the feature space. The more negative the score, the more anomalous.
LOF agrees perfectly with Isolation Forest: same 4 flagged, same 0 false positives. When two methods independently agree, confidence in the result increases significantly. Transaction 18 (LOF score −12.4) is particularly anomalous — not just an outlier, but an extreme one sitting far from any cluster of normal behaviour.
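A practical follow-on the lesson doesn't cover: to score tomorrow's transactions against a model fitted on known-normal history, LOF offers a novelty=True mode, which does expose a separate .predict() for unseen data. A sketch with illustrative values (hypothetical known-normal rows, not the lesson's dataset):

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical known-normal history in the lesson's columns
normal = pd.DataFrame({
    'amount':             [45, 62, 51, 38, 55, 44, 49, 60, 42, 58],
    'hour_of_day':        [14, 10, 16, 11, 15, 13, 14,  9, 12, 16],
    'transactions_today': [ 1,  2,  1,  1,  2,  1,  1,  2,  1,  2],
})

# novelty=True enables .predict() on data not seen at fit time
lof = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(normal)

# Two incoming transactions: one ordinary, one £975 at 2am with 11 txns
new_txns = pd.DataFrame({
    'amount':             [52, 975],
    'hour_of_day':        [13,   2],
    'transactions_today': [ 1,  11],
})
print(lof.predict(new_txns))  # -1 = outlier, +1 = inlier
```

Fitting on normal-only history is the key difference from the in-sample fit_predict() used above: the model learns what "usual" looks like, then judges new arrivals against it.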
Step 4 — Compare All Three Methods
The scenario: The fraud manager wants a final summary table before the manual review team gets the flagged transactions. "Show me which method flagged which transaction and whether there's consensus. A transaction flagged by all three methods is an immediate escalation. One flagged by only the z-score but not the others might just be a big but legitimate purchase." You build the comparison table.
# Count how many methods flagged each transaction
df['methods_agreed'] = df['flag_zscore'] + df['flag_iforest'] + df['flag_lof']
# Build the final comparison report — only show transactions flagged by at least one method
flagged_any = df[df['methods_agreed'] > 0].copy()
print("=== OUTLIER DETECTION — METHOD COMPARISON ===\n")
print(f"{'TxnID':>6} {'Amount':>8} {'Z-Score':>8} {'IForest':>8} {'LOF':>5} "
f"{'Consensus':>10} {'Is Fraud?':>10}")
print("─" * 64)
for _, row in flagged_any.iterrows():
    z = "✓" if row['flag_zscore'] else "✗"
    ifo = "✓" if row['flag_iforest'] else "✗"
    lof = "✓" if row['flag_lof'] else "✗"
    n = int(row['methods_agreed'])
    consensus = "ALL 3" if n == 3 else "2/3" if n == 2 else "1/3"
    fraud = "✓ FRAUD" if row['is_fraud'] else "? Unknown"
    print(f" {int(row['txn_id']):>4} £{row['amount']:>7} {z:>8} {ifo:>8} {lof:>5} "
          f"{consensus:>10} {fraud}")
print(f"\nSummary:")
print(f" Flagged by all 3 methods: {(df['methods_agreed']==3).sum()} transactions → Immediate escalation")
print(f" Flagged by 2/3 methods: {(df['methods_agreed']==2).sum()} transactions → Priority review")
print(f" Flagged by 1/3 methods: {(df['methods_agreed']==1).sum()} transactions → Standard review")
=== OUTLIER DETECTION — METHOD COMPARISON ===
TxnID Amount Z-Score IForest LOF Consensus Is Fraud?
────────────────────────────────────────────────────────────────────
4 £480 ✗ ✓ ✓ 2/3 ✓ FRAUD
8 £890 ✓ ✓ ✓ ALL 3 ✓ FRAUD
13 £470 ✗ ✓ ✓ 2/3 ✓ FRAUD
18 £1250 ✓ ✓ ✓ ALL 3 ✓ FRAUD
Summary:
Flagged by all 3 methods: 2 transactions → Immediate escalation
Flagged by 2/3 methods: 2 transactions → Priority review
Flagged by 1/3 methods: 0 transactions → Standard review
What just happened?
pandas column addition — df['flag_zscore'] + df['flag_iforest'] + df['flag_lof'] — sums the three binary flag columns into a consensus score per row. Filtering on methods_agreed > 0 keeps only rows flagged by at least one method.
The consensus approach gives the review team a natural priority order. Transactions 8 and 18 (flagged by all three) go to the front of the queue — when three independent methods agree, the evidence is strong. Transactions 4 and 13 (flagged by two methods) get priority review. This graduated response is how real fraud detection systems work: not a binary yes/no, but a confidence-weighted queue.
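The tiering step itself can be factored into a small reusable mapping. A sketch with hypothetical flag counts (the tier names follow the summary above; the data is illustrative):

```python
import pandas as pd

# Hypothetical per-transaction consensus counts (0-3 methods flagged each)
flags = pd.DataFrame({
    'txn_id':         [4, 8, 13, 18, 2],
    'methods_agreed': [2, 3,  2,  3, 0],
})

# Map consensus count to a review tier, then sort so the strongest
# consensus reaches the review team first
tiers = {3: 'immediate escalation', 2: 'priority review',
         1: 'standard review', 0: 'no action'}
flags['tier'] = flags['methods_agreed'].map(tiers)
queue = flags.sort_values('methods_agreed', ascending=False)
print(queue.to_string(index=False))
```

Keeping the thresholds in one dict makes the policy easy to change — e.g. promoting 2/3 agreement to immediate escalation during a fraud spike — without touching the detection code.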
When to Use Which Method
| Method | Best for | Weakness | Scales to many columns? |
|---|---|---|---|
| Z-Score | Quick univariate check on normally distributed data | Extreme outliers inflate mean/std, masking moderate outliers | ❌ One column at a time |
| Isolation Forest | Multi-column anomaly detection at scale | Requires setting contamination; less interpretable | ✅ Many columns |
| LOF | Local anomalies in dense datasets with clusters | Slow on large datasets; sensitive to n_neighbors choice | ✅ Many columns |
Teacher's Note
Never run a single outlier detection method and trust it blindly. Each method has a different definition of "unusual." Z-score asks: is this far from the mean? Isolation Forest asks: is this hard to separate from the group? LOF asks: is this far from its neighbours? A point can be unusual by one definition but perfectly normal by another.
The consensus approach — counting how many methods agree — is your most defensible result. When you go to the fraud team and say "this transaction was flagged by all three independent methods," you have a much stronger case than "the z-score said so." In practice, teams often use two methods and prioritise the intersection.
Practice Questions
1. In IsolationForest and LocalOutlierFactor, which parameter tells the algorithm what proportion of the data it should expect to be outliers?
2. When IsolationForest.fit_predict() labels a point as an outlier, what integer value does it return for that row?
3. Which outlier detection method is most sensitive to extreme values distorting the mean and standard deviation — making it less reliable when the dataset already contains very large outliers?
Quiz
1. A transaction of £500 is completely normal for one customer but highly suspicious for another whose average spend is £45. Which outlier method is best suited to flag this?
2. Why does Isolation Forest work — what is the underlying intuition for how it finds outliers?
3. Z-score flags 2 fraud transactions, Isolation Forest flags 4, and LOF flags 4. The three methods partially disagree. What is the best approach?
Up Next · Lesson 37
Advanced Imputation
Go beyond filling with the mean — KNN imputation, iterative imputation, and how to choose the right strategy based on why values are missing in the first place.