EDA Course
Advanced Outlier Detection
The IQR rule from Lesson 23 works well on a single column. But real outliers often hide in the relationship between columns — a transaction that looks normal on its own but is bizarre when you know the customer's usual behaviour. This lesson covers three methods that catch what the IQR rule misses.
Why the IQR Rule Isn't Always Enough
The IQR rule checks one column at a time: is this value more than 1.5 IQRs beyond the edges of the box? It's fast, simple, and effective for univariate outliers. But consider a transaction dataset with two columns — transaction_amount and customer_avg_spend. A transaction of £5,000 might be a perfectly normal purchase for one customer and an alarming anomaly for another. Looking at £5,000 in isolation, the IQR rule misses that context entirely.
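To make the gap concrete, here is a minimal sketch with made-up numbers (not the lesson's dataset): large purchases are common in this sample, so a £500 transaction sails past the univariate IQR fence — even though it is 10× that customer's usual spend.

```python
import pandas as pd

# Hypothetical mini-dataset: big purchases are common overall,
# so £500 looks unremarkable on the 'amount' column alone...
df = pd.DataFrame({
    'amount':       [45, 60, 500, 480, 520, 55, 490],
    'customer_avg': [50, 58,  50, 475, 510, 52, 485],
})

q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df['iqr_flag'] = (df['amount'] < low) | (df['amount'] > high)
print(df['iqr_flag'].sum())  # 0 — the univariate IQR rule flags nothing

# ...yet row 2 is £500 from a customer who averages £50 — a 10x jump
# that only the customer_avg context reveals
df['ratio'] = df['amount'] / df['customer_avg']
print(df['ratio'].round(2).tolist())
```

The ratio column here is just a hand-rolled illustration of "context"; the methods below do this properly across several columns at once.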
Three methods handle these harder cases:
Z-Score
How many standard deviations from the mean is this value? Best for normally distributed data. Fast and interpretable. Sensitive to extreme outliers skewing the mean.
Isolation Forest
A machine learning approach. Outliers are easier to "isolate" with random cuts — the algorithm finds them by how quickly they separate from the rest. Works across multiple columns simultaneously.
Local Outlier Factor (LOF)
Compares a point's local density to that of its neighbours. A point that is far from its nearest neighbours — even if it's not globally extreme — is flagged as an outlier. Best for datasets where normal behaviour forms clusters of varying density.
The Dataset We'll Use
The scenario: You're a fraud analyst at a payments company. Your team has been asked to flag suspicious transactions in a sample of 20 customer purchases before they go to the manual review team. The fraud manager has given you one instruction: "Don't just look at the amount. Look at whether the transaction makes sense given everything else we know about that customer. A £400 purchase might be completely normal for one person and completely alarming for another." You need to use multiple methods and compare what each one catches.
import pandas as pd
import numpy as np
# Transaction dataset — 20 purchases with customer context
df = pd.DataFrame({
'txn_id': range(1, 21),
'amount': [45, 62, 51, 480, 38, 55, 44, 890, 49, 60,
42, 58, 470, 46, 53, 39, 52, 1250, 48, 61],
'customer_avg': [50, 55, 48, 52, 42, 58, 46, 54, 51, 62,
44, 57, 49, 48, 50, 40, 55, 51, 47, 60],
# customer_avg = this customer's typical spend — the context the fraud manager mentioned
'hour_of_day': [14, 10, 16, 3, 11, 15, 13, 2, 14, 9,
12, 16, 4, 11, 14, 10, 13, 1, 15, 11],
# late-night transactions (hours 1–4) are a fraud signal
'transactions_today': [1, 2, 1, 8, 1, 2, 1, 12, 1, 2,
1, 2, 6, 1, 2, 1, 1, 15, 1, 2],
# very high transaction count in one day is a fraud signal
'is_fraud': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
# ground truth — 4 fraudulent transactions: 4, 8, 13, 18
})
print("Known fraud transactions:")
print(df[df['is_fraud']==1][['txn_id','amount','customer_avg','hour_of_day',
'transactions_today']].to_string(index=False))
Known fraud transactions:
txn_id amount customer_avg hour_of_day transactions_today
4 480 52 3 8
8 890 54 2 12
13 470 49 4 6
18 1250 51 1 15
What just happened?
The four fraud transactions share a clear pattern: amounts far above the customer's average spend, occurring in the early hours (1–4am), with an unusually high number of transactions on that day. The fraud manager was right — each signal alone is manageable, but the combination makes these transactions unmistakably suspicious. Let's see which detection method catches all four.
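Before reaching for machine learning, it's worth seeing that this particular combination can even be written as a hand-coded rule. A sketch on a subset of the rows above (the thresholds — 3× average, hours 1–4, 5+ transactions — are illustrative, not from the lesson):

```python
import pandas as pd

# Subset of the lesson's transactions: fraud rows 4 and 8, two normals
df = pd.DataFrame({
    'txn_id':             [1,   4,   8,  9],
    'amount':             [45, 480, 890, 49],
    'customer_avg':       [50,  52,  54, 51],
    'hour_of_day':        [14,   3,   2, 14],
    'transactions_today': [1,    8,  12,  1],
})

# Encode each signal as a boolean; the combination is what matters
above_avg = df['amount'] > 3 * df['customer_avg']   # far above this customer's norm
late_night = df['hour_of_day'].between(1, 4)        # early-hours transaction
many_txns = df['transactions_today'] >= 5           # burst of activity in one day
df['rule_flag'] = above_avg & late_night & many_txns
print(df[df['rule_flag']]['txn_id'].tolist())  # [4, 8]
```

Hand-written rules like this are brittle — they only catch patterns you thought to encode — which is exactly why the rest of the lesson uses methods that learn what "unusual" looks like from the data itself.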
Method 1 — Z-Score
The scenario: The fraud manager wants a quick first pass using the simplest method. "Start with z-scores," she says. "Flag anything more than 2 standard deviations from the mean on amount. It won't catch everything but it'll catch the obvious ones fast." You run the z-score on the amount column and see what it gets.
from scipy import stats
# Z-score: how many standard deviations from the mean is each value?
# A positive z-score means above the mean, negative means below
# A threshold of |z| > 2 flags roughly the most extreme 5% of a normal distribution (about 2.5% in each tail)
df['z_amount'] = stats.zscore(df['amount'], ddof=1).round(2)  # ddof=1 matches pandas' sample std
# Flag any transaction with |z| > 2
ZSCORE_THRESHOLD = 2.0
df['flag_zscore'] = (df['z_amount'].abs() > ZSCORE_THRESHOLD).astype(int)
print("=== Z-SCORE RESULTS ===\n")
print(f"Mean amount: £{df['amount'].mean():.1f} | Std: £{df['amount'].std():.1f}\n")
print(f"{'TxnID':>6} {'Amount':>8} {'Z-Score':>8} {'Flagged?':>9} {'Is Fraud?':>10}")
print("─" * 52)
for _, row in df.iterrows():
    flag = "⚠ YES" if row['flag_zscore'] else " no"
    fraud = "✓ FRAUD" if row['is_fraud'] else ""
    print(f" {int(row['txn_id']):>4} £{row['amount']:>7} {row['z_amount']:>8.2f} "
          f"{flag:>9} {fraud}")
flagged = df['flag_zscore'].sum()
caught = df[(df['flag_zscore']==1) & (df['is_fraud']==1)]['txn_id'].count()
print(f"\nFlagged: {flagged} | Fraud caught: {caught}/4 | False positives: {flagged-caught}")
=== Z-SCORE RESULTS ===
Mean amount: £194.7 | Std: £331.6
TxnID Amount Z-Score Flagged? Is Fraud?
────────────────────────────────────────────────────
1 £45 -0.45 no
2 £62 -0.40 no
3 £51 -0.43 no
4 £480 0.86 no ✓ FRAUD ← missed!
5 £38 -0.47 no
6 £55 -0.42 no
7 £44 -0.45 no
8 £890 2.10 ⚠ YES ✓ FRAUD
9 £49 -0.44 no
10 £60 -0.41 no
11 £42 -0.46 no
12 £58 -0.41 no
13 £470 0.83 no ✓ FRAUD ← missed!
14 £46 -0.45 no
15 £53 -0.43 no
16 £39 -0.47 no
17 £52 -0.43 no
18 £1250 3.18 ⚠ YES ✓ FRAUD
19 £48 -0.44 no
20 £61 -0.40 no
Flagged: 2 | Fraud caught: 2/4 | False positives: 0
What just happened?
scipy's stats.zscore() computes the z-score for every value in the column — subtracting the mean and dividing by the standard deviation. A z-score of +2.1 means the value sits 2.1 standard deviations above the mean.
Z-score caught 2 of 4 fraud transactions — and zero false positives. But it missed transactions 4 (£480) and 13 (£470) because, with £890 and £1,250 in the dataset, the mean and standard deviation are inflated — making £480 look almost normal. This is z-score's known weakness: extreme outliers distort the mean and std, which shrinks the relative z-scores of moderate outliers.
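A common fix for this weakness — beyond the lesson's pipeline, but worth knowing — is the modified z-score, which swaps the mean and std for the median and MAD (median absolute deviation), so a single £1,250 value can't inflate the yardstick. A sketch on the same amount column:

```python
import numpy as np

amounts = np.array([45, 62, 51, 480, 38, 55, 44, 890, 49, 60,
                    42, 58, 470, 46, 53, 39, 52, 1250, 48, 61])

# Modified z-score (Iglewicz-Hoaglin): median and MAD are robust —
# the extreme values barely move them (median 52.5, MAD 8.0)
median = np.median(amounts)
mad = np.median(np.abs(amounts - median))
mod_z = 0.6745 * (amounts - median) / mad

# With the conventional |modified z| > 3.5 cutoff, all four large
# amounts stand out — including the £470/£480 plain z-score missed
flagged = np.where(np.abs(mod_z) > 3.5)[0] + 1  # 1-based txn ids
print(flagged)  # txns 4, 8, 13, 18
```

It is still univariate, though — it would also flag a legitimately large purchase — so it complements rather than replaces the multivariate methods below.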
Method 2 — Isolation Forest
The scenario: After seeing z-scores miss two fraud cases, the fraud manager escalates: "We need something that looks at all four signals together — amount, customer average, hour, and transaction count. One weird column isn't always fraud. Four weird columns at once almost certainly is." Isolation Forest is built for exactly this: it detects anomalies across multiple columns simultaneously.
from sklearn.ensemble import IsolationForest
# IsolationForest is part of sklearn — Python's main ML library
# It works by building random decision trees and seeing how quickly
# each point gets "isolated" from the rest
features = ['amount', 'customer_avg', 'hour_of_day', 'transactions_today']
X = df[features]
# contamination: estimated proportion of outliers in the data
# We have 4 fraud in 20 rows = 0.20
model = IsolationForest(contamination=0.20, random_state=42)
# random_state=42 ensures the same result every time (reproducibility)
# .fit_predict() trains and predicts in one step
# Returns -1 for outliers, +1 for inliers
preds = model.fit_predict(X)
df['flag_iforest'] = (preds == -1).astype(int) # convert -1/+1 to 1/0
print("=== ISOLATION FOREST RESULTS ===\n")
print(f"{'TxnID':>6} {'Amount':>8} {'Hour':>5} {'TxnsToday':>10} {'Flagged?':>9} {'Is Fraud?':>10}")
print("─" * 60)
for _, row in df.iterrows():
    flag = "⚠ YES" if row['flag_iforest'] else " no"
    fraud = "✓ FRAUD" if row['is_fraud'] else ""
    print(f" {int(row['txn_id']):>4} £{row['amount']:>7} {int(row['hour_of_day']):>5} "
          f"{int(row['transactions_today']):>10} {flag:>9} {fraud}")
flagged = df['flag_iforest'].sum()
caught = df[(df['flag_iforest']==1) & (df['is_fraud']==1)]['txn_id'].count()
print(f"\nFlagged: {flagged} | Fraud caught: {caught}/4 | False positives: {flagged-caught}")
=== ISOLATION FOREST RESULTS ===
TxnID Amount Hour TxnsToday Flagged? Is Fraud?
────────────────────────────────────────────────────────────
1 £45 14 1 no
2 £62 10 2 no
3 £51 16 1 no
4 £480 3 8 ⚠ YES ✓ FRAUD
5 £38 11 1 no
6 £55 15 2 no
7 £44 13 1 no
8 £890 2 12 ⚠ YES ✓ FRAUD
9 £49 14 1 no
10 £60 9 2 no
11 £42 12 1 no
12 £58 16 2 no
13 £470 4 6 ⚠ YES ✓ FRAUD
14 £46 11 1 no
15 £53 14 2 no
16 £39 10 1 no
17 £52 13 1 no
18 £1250 1 15 ⚠ YES ✓ FRAUD
19 £48 15 1 no
20 £61 11 2 no
Flagged: 4 | Fraud caught: 4/4 | False positives: 0
What just happened?
sklearn's IsolationForest builds random decision trees and measures how many splits it takes to isolate each point. Normal points sit in dense regions — they take many splits to separate. Outliers sit alone — they isolate in just a few splits. The contamination=0.20 parameter tells the model to expect 20% outliers (matching our known fraud rate of 4/20).
Perfect result: 4 flagged, 4 fraud, 0 false positives. Isolation Forest caught all four because it considered all columns together. Transaction 4 (£480 at 3am with 8 transactions that day) was clearly anomalous in the multivariate space even though £480 wasn't extreme when looked at alone. This is the power of multivariate outlier detection.
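When the review team wants a ranking rather than a yes/no flag, a fitted IsolationForest also exposes a continuous anomaly score via decision_function(). A self-contained sketch with illustrative values (a subset of the lesson's columns, not its full dataset):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative sample: three extreme rows (480/890/1250 at late-night
# hours with transaction bursts) among five normal ones
X = pd.DataFrame({
    'amount':             [45, 62, 51, 480, 38, 890, 49, 1250],
    'hour_of_day':        [14, 10, 16,   3, 11,   2, 14,    1],
    'transactions_today': [ 1,  2,  1,   8,  1,  12,  1,   15],
})

# decision_function() gives a continuous score per row: the lower
# (more negative), the more anomalous. fit_predict()'s -1/+1 labels
# are just this score thresholded at zero.
model = IsolationForest(random_state=42).fit(X)
scores = model.decision_function(X)

# Sort into a review queue, most suspicious first
queue = X.assign(score=scores.round(3)).sort_values('score')
print(queue)
```

The ranking doesn't depend on the contamination setting — only the position of the -1/+1 cutoff does — which makes scores a safer output when you're unsure of the true outlier rate.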
Method 3 — Local Outlier Factor
The scenario: A senior analyst on the fraud team asks you to run a third method as a cross-check. "Isolation Forest is good for global outliers. But Local Outlier Factor catches points that are locally unusual — they might not look extreme globally, but they're far from their nearest neighbours. Run it as a sanity check and see if it agrees with Isolation Forest on which four transactions are suspicious."
from sklearn.neighbors import LocalOutlierFactor
# LOF measures how isolated each point is relative to its k nearest neighbours
# If a point's neighbours are much denser than the point's own neighbourhood,
# it gets a high LOF score — it's "locally" unusual
# n_neighbors: how many neighbours to compare against (5 is a common starting point)
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.20)
# Note: in its default (novelty=False) mode, LOF has no separate .predict() —
# fit_predict() trains and predicts on the same data in one step
preds_lof = lof.fit_predict(X) # X is the same 4-feature matrix from Isolation Forest
df['flag_lof'] = (preds_lof == -1).astype(int)
# LOF also provides a score — the more negative, the more anomalous
df['lof_score'] = lof.negative_outlier_factor_.round(3)
print("=== LOCAL OUTLIER FACTOR RESULTS ===\n")
print(f"{'TxnID':>6} {'Amount':>8} {'LOF Score':>10} {'Flagged?':>9} {'Is Fraud?':>10}")
print("─" * 54)
for _, row in df.iterrows():
    flag = "⚠ YES" if row['flag_lof'] else " no"
    fraud = "✓ FRAUD" if row['is_fraud'] else ""
    print(f" {int(row['txn_id']):>4} £{row['amount']:>7} {row['lof_score']:>10.3f} {flag:>9} {fraud}")
flagged = df['flag_lof'].sum()
caught = df[(df['flag_lof']==1) & (df['is_fraud']==1)]['txn_id'].count()
print(f"\nFlagged: {flagged} | Fraud caught: {caught}/4 | False positives: {flagged-caught}")
=== LOCAL OUTLIER FACTOR RESULTS ===
TxnID Amount LOF Score Flagged? Is Fraud?
──────────────────────────────────────────────────────
1 £45 -1.041 no
2 £62 -1.028 no
3 £51 -1.035 no
4 £480 -4.821 ⚠ YES ✓ FRAUD
5 £38 -1.033 no
6 £55 -1.029 no
7 £44 -1.038 no
8 £890 -6.912 ⚠ YES ✓ FRAUD
9 £49 -1.036 no
10 £60 -1.031 no
11 £42 -1.040 no
12 £58 -1.030 no
13 £470 -4.714 ⚠ YES ✓ FRAUD
14 £46 -1.037 no
15 £53 -1.032 no
16 £39 -1.041 no
17 £52 -1.034 no
18 £1250 -12.441 ⚠ YES ✓ FRAUD
19 £48 -1.036 no
20 £61 -1.029 no
Flagged: 4 | Fraud caught: 4/4 | False positives: 0
What just happened?
sklearn's LocalOutlierFactor compares each point's local density with the density of its 5 nearest neighbours. Normal transactions (LOF score around −1.03) are in dense, uniform regions. The four fraud transactions score between −4.7 and −12.4 — they're in sparse, isolated regions of the feature space. The more negative the score, the more anomalous.
LOF agrees perfectly with Isolation Forest: same 4 flagged, same 0 false positives. When two methods independently agree, confidence in the result increases significantly. Transaction 18 (LOF score −12.4) is particularly anomalous — not just an outlier, but an extreme one sitting far from any cluster of normal behaviour.
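A practical follow-on the lesson doesn't cover: to score tomorrow's transactions against a model fitted on known-normal history, LOF offers a novelty=True mode, which does expose a separate .predict() for unseen data. A sketch with illustrative values (hypothetical known-normal rows, not the lesson's dataset):

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical known-normal history in the lesson's columns
normal = pd.DataFrame({
    'amount':             [45, 62, 51, 38, 55, 44, 49, 60, 42, 58],
    'hour_of_day':        [14, 10, 16, 11, 15, 13, 14,  9, 12, 16],
    'transactions_today': [ 1,  2,  1,  1,  2,  1,  1,  2,  1,  2],
})

# novelty=True enables .predict() on data not seen at fit time
lof = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(normal)

# Two incoming transactions: one ordinary, one £975 at 2am with 11 txns
new_txns = pd.DataFrame({
    'amount':             [52, 975],
    'hour_of_day':        [13,   2],
    'transactions_today': [ 1,  11],
})
print(lof.predict(new_txns))  # -1 = outlier, +1 = inlier
```

Fitting on normal-only history is the key difference from the in-sample fit_predict() used above: the model learns what "usual" looks like, then judges new arrivals against it.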
Step 4 — Compare All Three Methods
The scenario: The fraud manager wants a final summary table before the manual review team gets the flagged transactions. "Show me which method flagged which transaction and whether there's consensus. A transaction flagged by all three methods is an immediate escalation. One flagged by only the z-score but not the others might just be a big but legitimate purchase." You build the comparison table.
# Count how many methods flagged each transaction
df['methods_agreed'] = df['flag_zscore'] + df['flag_iforest'] + df['flag_lof']
# Build the final comparison report — only show transactions flagged by at least one method
flagged_any = df[df['methods_agreed'] > 0].copy()
print("=== OUTLIER DETECTION — METHOD COMPARISON ===\n")
print(f"{'TxnID':>6} {'Amount':>8} {'Z-Score':>8} {'IForest':>8} {'LOF':>5} "
f"{'Consensus':>10} {'Is Fraud?':>10}")
print("─" * 64)
for _, row in flagged_any.iterrows():
    z = "✓" if row['flag_zscore'] else "✗"
    ifo = "✓" if row['flag_iforest'] else "✗"
    lof = "✓" if row['flag_lof'] else "✗"
    n = int(row['methods_agreed'])
    consensus = "ALL 3" if n == 3 else "2/3" if n == 2 else "1/3"
    fraud = "✓ FRAUD" if row['is_fraud'] else "? Unknown"
    print(f" {int(row['txn_id']):>4} £{row['amount']:>7} {z:>8} {ifo:>8} {lof:>5} "
          f"{consensus:>10} {fraud}")
print(f"\nSummary:")
print(f" Flagged by all 3 methods: {(df['methods_agreed']==3).sum()} transactions → Immediate escalation")
print(f" Flagged by 2/3 methods: {(df['methods_agreed']==2).sum()} transactions → Priority review")
print(f" Flagged by 1/3 methods: {(df['methods_agreed']==1).sum()} transactions → Standard review")
=== OUTLIER DETECTION — METHOD COMPARISON ===
TxnID Amount Z-Score IForest LOF Consensus Is Fraud?
────────────────────────────────────────────────────────────────────
4 £480 ✗ ✓ ✓ 2/3 ✓ FRAUD
8 £890 ✓ ✓ ✓ ALL 3 ✓ FRAUD
13 £470 ✗ ✓ ✓ 2/3 ✓ FRAUD
18 £1250 ✓ ✓ ✓ ALL 3 ✓ FRAUD
Summary:
Flagged by all 3 methods: 2 transactions → Immediate escalation
Flagged by 2/3 methods: 2 transactions → Priority review
Flagged by 1/3 methods: 0 transactions → Standard review
What just happened?
pandas column addition — df['flag_zscore'] + df['flag_iforest'] + df['flag_lof'] — sums the three binary flag columns into a consensus score per row. Filtering on methods_agreed > 0 keeps only rows flagged by at least one method.
The consensus approach gives the review team a natural priority order. Transactions 8 and 18 (flagged by all three) go to the front of the queue — when three independent methods agree, the evidence is strong. Transactions 4 and 13 (flagged by two methods) get priority review. This graduated response is how real fraud detection systems work: not a binary yes/no, but a confidence-weighted queue.
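The tiering step itself can be factored into a small reusable mapping. A sketch with hypothetical flag counts (the tier names follow the summary above; the data is illustrative):

```python
import pandas as pd

# Hypothetical per-transaction consensus counts (0-3 methods flagged each)
flags = pd.DataFrame({
    'txn_id':         [4, 8, 13, 18, 2],
    'methods_agreed': [2, 3,  2,  3, 0],
})

# Map consensus count to a review tier, then sort so the strongest
# consensus reaches the review team first
tiers = {3: 'immediate escalation', 2: 'priority review',
         1: 'standard review', 0: 'no action'}
flags['tier'] = flags['methods_agreed'].map(tiers)
queue = flags.sort_values('methods_agreed', ascending=False)
print(queue.to_string(index=False))
```

Keeping the thresholds in one dict makes the policy easy to change — e.g. promoting 2/3 agreement to immediate escalation during a fraud spike — without touching the detection code.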
When to Use Which Method
| Method | Best for | Weakness | Scales to many columns? |
|---|---|---|---|
| Z-Score | Quick univariate check on normally distributed data | Extreme outliers inflate mean/std, masking moderate outliers | ❌ One column at a time |
| Isolation Forest | Multi-column anomaly detection at scale | Requires setting contamination; less interpretable | ✅ Many columns |
| LOF | Local anomalies in dense datasets with clusters | Slow on large datasets; sensitive to n_neighbors choice | ✅ Many columns |
Teacher's Note
Never run a single outlier detection method and trust it blindly. Each method has a different definition of "unusual." Z-score asks: is this far from the mean? Isolation Forest asks: is this hard to separate from the group? LOF asks: is this far from its neighbours? A point can be unusual by one definition but perfectly normal by another.
The consensus approach — counting how many methods agree — is your most defensible result. When you go to the fraud team and say "this transaction was flagged by all three independent methods," you have a much stronger case than "the z-score said so." In practice, teams often use two methods and prioritise the intersection.
Practice Questions
1. In IsolationForest and LocalOutlierFactor, which parameter tells the algorithm what proportion of the data it should expect to be outliers?
2. When IsolationForest.fit_predict() labels a point as an outlier, what integer value does it return for that row?
3. Which outlier detection method is most sensitive to extreme values distorting the mean and standard deviation — making it less reliable when the dataset already contains very large outliers?
Quiz
1. A transaction of £500 is completely normal for one customer but highly suspicious for another whose average spend is £45. Which outlier method is best suited to flag this?
2. Why does Isolation Forest work — what is the underlying intuition for how it finds outliers?
3. Z-score flags 2 fraud transactions, Isolation Forest flags 4, and LOF flags 4. The three methods partially disagree. What is the best approach?
Up Next · Lesson 37
Advanced Imputation
Go beyond filling with the mean — KNN imputation, iterative imputation, and how to choose the right strategy based on why values are missing in the first place.