Feature Engineering Course
Outliers
An outlier is a value that sits far from the rest of the distribution. But far doesn't automatically mean wrong — a £1.9m property in a housing dataset is genuinely unusual, not a typo. How you respond to an outlier depends entirely on why it's there. Drop it carelessly and you throw away real signal. Leave it in and it can distort your model's coefficients so badly that predictions on normal data become worthless.
The outlier workflow has three steps: detect using statistical rules (IQR or Z-score), diagnose whether the value is a data error or a real extreme observation, then decide — cap, remove, transform, or keep it. The decision changes depending on the model you're building and the business context.
Two Detection Methods
There is no single definition of "outlier" — you choose a rule based on the distribution and the stakes of getting it wrong. The two most widely used methods are IQR fences and Z-score thresholds.
IQR Method
Interquartile range fences
Flags anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR as an outlier.
Best for: Skewed distributions, non-normal data. Does not assume any particular shape. Robust to extreme values — one massive outlier won't shift the quartiles much.
Z-Score Method
Standard deviation distance
Flags anything more than 3 standard deviations from the mean as an outlier.
Best for: Roughly normal distributions. On clean symmetric data the two rules behave similarly — the 1.5 × IQR fences sit near ±2.7σ, slightly tighter than the 3σ rule. Can be fooled by heavy skew or small samples: the mean and std get pulled toward the outlier, which shrinks the outlier's own Z-score.
Detecting Outliers — IQR and Z-Score
The scenario: You're a data scientist at a property analytics company. A colleague passes you a housing dataset and says: "Something looks off with the sqft and sale price columns. Can you run both IQR and Z-score detection on them and tell me which rows are flagged, and whether the two methods agree?" You build a reusable audit that shows the flags side by side.
# pandas — core data library, always imported as pd
import pandas as pd
# numpy — numerical Python, imported as np
# Used here for np.abs() and Z-score calculation
import numpy as np
# Housing dataset — 12 rows including one suspiciously large property
housing_df = pd.DataFrame({
'property_id': ['H01','H02','H03','H04','H05','H06',
'H07','H08','H09','H10','H11','H12'],
'sqft': [1200, 2100, 980, 2850, 1450, 1800,
1050, 3100, 880, 2200, 1600, 6500],
'sale_price': [245000, 410000, 182000, 560000, 295000, 348000,
198000, 620000, 230000, 425000, 310000, 1900000]
})
def flag_outliers(series, label):
    """Returns a DataFrame showing IQR and Z-score flags for a Series."""
    # IQR method — Q1, Q3, and the fence boundaries
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    iqr_flag = (series < lower) | (series > upper)
    # Z-score method — how many standard deviations from the mean?
    # np.abs() takes the absolute value so we can apply a single threshold
    z_scores = (series - series.mean()) / series.std()
    z_flag = np.abs(z_scores) > 3
    # Note: uses housing_df from the enclosing scope for row labels;
    # .values keeps the printed index unlabelled
    result = pd.DataFrame({
        'value': series.values,
        'z_score': z_scores.round(2).values,
        'iqr_flag': iqr_flag.values,
        'z_flag': z_flag.values
    }, index=housing_df['property_id'].values)
    result['either_flag'] = result['iqr_flag'] | result['z_flag']
    print(f"\n{'='*52}")
    print(f" {label}")
    print(f" IQR bounds: lower={lower:.0f} upper={upper:.0f}")
    print(f" IQR flagged: {iqr_flag.sum()} | Z-score flagged: {z_flag.sum()}")
    print(f"{'='*52}")
    print(result.to_string())
    return result
flag_outliers(housing_df['sqft'], 'sqft')
flag_outliers(housing_df['sale_price'], 'sale_price')
====================================================
 sqft
 IQR bounds: lower=-638 upper=4162
 IQR flagged: 1 | Z-score flagged: 0
====================================================
value z_score iqr_flag z_flag either_flag
H01 1200 -0.61 False False False
H02 2100 -0.03 False False False
H03 980 -0.75 False False False
H04 2850 0.46 False False False
H05 1450 -0.45 False False False
H06 1800 -0.22 False False False
H07 1050 -0.71 False False False
H08 3100 0.62 False False False
H09 880 -0.82 False False False
H10 2200 0.04 False False False
H11 1600 -0.35 False False False
H12 6500 2.82 True False True
====================================================
 sale_price
 IQR bounds: lower=-85000 upper=785000
 IQR flagged: 1 | Z-score flagged: 1
====================================================
value z_score iqr_flag z_flag either_flag
H01 245000 -0.49 False False False
H02 410000 -0.14 False False False
H03 182000 -0.63 False False False
H04 560000 0.18 False False False
H05 295000 -0.39 False False False
H06 348000 -0.28 False False False
H07 198000 -0.60 False False False
H08 620000 0.31 False False False
H09 230000 -0.53 False False False
H10 425000 -0.11 False False False
H11 310000 -0.36 False False False
H12 1900000 3.04 True True True
What just happened?
.quantile() computes percentile boundaries — Q1 is the 25th, Q3 is the 75th. The IQR is the spread of the middle 50% of values, and anything beyond 1.5 × IQR outside that range is flagged. The Z-score measures distance from the mean in standard deviation units. Both methods agree on H12's £1.9m sale price (Z = 3.04). For sqft, though, only IQR flags H12: the 6,500 sqft value drags the mean upward and inflates the standard deviation so much that its own Z-score lands at 2.82 — just under the 3σ threshold. This effect is called masking, and it is why Z-score detection is unreliable on small samples with large outliers, while the quartile-based IQR fences barely move. Neither method is universally better; running both gives you a more complete picture.
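The masking effect is worth seeing in isolation. A minimal sketch (NumPy only, using the same twelve sqft values) of why a single extreme point can never produce a huge Z-score in a small sample — it inflates the very standard deviation it is measured against:

```python
import numpy as np

# Same twelve sqft values as above; 6,500 is the obvious outlier
values = np.array([1200, 2100, 980, 2850, 1450, 1800,
                   1050, 3100, 880, 2200, 1600, 6500], dtype=float)

# The outlier drags the mean up and inflates the sample std, so its own
# Z-score is capped: with n points it can never exceed (n - 1) / sqrt(n),
# which is about 3.18 for n = 12
z_outlier = (values.max() - values.mean()) / values.std(ddof=1)
print(round(z_outlier, 2))  # ~2.82 — under the 3-sigma threshold
```

This is the main argument for cross-checking Z-score flags with a quartile-based rule (or a median/MAD-based one) on small samples.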
Capping — Keeping the Row, Bounding the Value
The scenario: The analytics lead reviews the outlier report and confirms: "H12 is a genuine property — it's a converted commercial building that ended up in our residential feed. We can't drop it because it's a real sale that happened. Cap sqft and sale_price at the 95th percentile. It should still contribute to training, just without dominating the model." You apply winsorisation and show the before/after effect on skew.
import pandas as pd
import numpy as np
housing_df = pd.DataFrame({
'property_id': ['H01','H02','H03','H04','H05','H06',
'H07','H08','H09','H10','H11','H12'],
'sqft': [1200, 2100, 980, 2850, 1450, 1800,
1050, 3100, 880, 2200, 1600, 6500],
'sale_price': [245000, 410000, 182000, 560000, 295000, 348000,
198000, 620000, 230000, 425000, 310000, 1900000]
})
# Compute 95th percentile caps for each column
sqft_cap = housing_df['sqft'].quantile(0.95)
price_cap = housing_df['sale_price'].quantile(0.95)
print(f"sqft 95th percentile cap : {sqft_cap:,.0f}")
print(f"price 95th percentile cap : {price_cap:,.0f}")
# .clip(upper=cap) replaces any value above the cap with the cap value
# The row is kept — only the extreme value is bounded
housing_df['sqft_capped'] = housing_df['sqft'].clip(upper=sqft_cap)
housing_df['price_capped'] = housing_df['sale_price'].clip(upper=price_cap)
# Compare skew before and after for both columns
print("\nSkew comparison:\n")
for orig, capped in [('sqft','sqft_capped'), ('sale_price','price_capped')]:
    before = housing_df[orig].skew()
    after = housing_df[capped].skew()
    print(f" {orig:<15} before: {before:+.3f} after: {after:+.3f}")
# Show the one row that changed
changed = housing_df[housing_df['sqft'] != housing_df['sqft_capped']]
print(f"\nRows affected by capping:\n")
print(changed[['property_id','sqft','sqft_capped',
'sale_price','price_capped']].to_string(index=False))
sqft 95th percentile cap : 4,630
price 95th percentile cap : 1,196,000
Skew comparison:
 sqft before: +2.290 after: +1.372
 sale_price before: +2.963 after: +2.179
Rows affected by capping:
property_id sqft sqft_capped sale_price price_capped
H12 6500 4630.0 1900000 1196000.0
What just happened?
.clip(upper=cap) replaces every value above the ceiling with the ceiling value — the row stays in the dataset, only the extreme number changes. One row affected, yet skew for sqft dropped from +2.29 to +1.37 and sale_price from +2.96 to +2.18; with only one extreme row in twelve, a single cap softens the skew rather than eliminating it. Capping is sometimes called winsorisation, after the statistician Charles P. Winsor. The key rule: the 95th percentile cap is just a convention. In some datasets you'd use the 99th, in others the 90th. The right threshold depends on how extreme the outliers are and how much of the real distribution you want to preserve.
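Capping can also bound both tails at once. A small sketch with a hypothetical series (the values here are made up for illustration), clipping at the 5th and 95th percentiles:

```python
import pandas as pd

# Hypothetical feature with an extreme value in each tail
s = pd.Series([3, 50, 52, 55, 57, 60, 61, 63, 65, 70, 72, 400])

lo = s.quantile(0.05)   # lower fence for the left tail
hi = s.quantile(0.95)   # upper fence for the right tail

# .clip() with both bounds winsorises both tails in one call
capped = s.clip(lower=lo, upper=hi)
print(capped.min(), capped.max())
```

Both the tiny 3 and the huge 400 are pulled to the fences; every other value passes through unchanged.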
Removing Outliers — When the Row Is Genuinely Wrong
The scenario: The data team investigates further and finds two additional rows: a property listed at 50 sqft (clearly a data entry error — the actual listing says 500 sqft) and a sale price of £5 (a test transaction in the system that was never cleaned). Unlike H12, these are errors, not real extreme values. The lead says: "Drop those two rows entirely — they will actively mislead the model if you leave them in." You identify and remove them, then confirm the dataset is clean.
import pandas as pd
import numpy as np
# Extended dataset with two deliberately incorrect rows added
housing_df = pd.DataFrame({
'property_id': ['H01','H02','H03','H04','H05','H06',
'H07','H08','H09','H10','H11','H12','H13','H14'],
'sqft': [1200, 2100, 980, 2850, 1450, 1800,
1050, 3100, 880, 2200, 1600, 6500, 50, 1400],
'sale_price': [245000, 410000, 182000, 560000, 295000, 348000,
198000, 620000, 230000, 425000, 310000, 1900000, 280000, 5]
})
print(f"Dataset size before removal: {len(housing_df)} rows\n")
# Business rules for what constitutes a data error in this domain
# A property under 100 sqft is not a habitable dwelling — data entry error
# A sale price under £100 is a test transaction — system artefact
sqft_floor = 100
price_floor = 100
# Identify error rows before dropping — always inspect what you're removing
errors = housing_df[
(housing_df['sqft'] < sqft_floor) |
(housing_df['sale_price'] < price_floor)
]
print("Rows identified as data errors:\n")
print(errors[['property_id','sqft','sale_price']].to_string(index=False))
# Remove the error rows — keep rows that pass both business rule checks
housing_df = housing_df[
(housing_df['sqft'] >= sqft_floor) &
(housing_df['sale_price'] >= price_floor)
].reset_index(drop=True)
print(f"\nDataset size after removal: {len(housing_df)} rows")
print(f"Rows removed: {errors.shape[0]}")
# Confirm the minimum values are now sensible
print(f"\nMin sqft after cleaning : {housing_df['sqft'].min()}")
print(f"Min sale_price after cleaning : £{housing_df['sale_price'].min():,}")
Dataset size before removal: 14 rows
Rows identified as data errors:
property_id sqft sale_price
H13 50 280000
H14 1400 5
Dataset size after removal: 12 rows
Rows removed: 2
Min sqft after cleaning : 880
Min sale_price after cleaning : £182,000
What just happened?
The filter uses Boolean conditions chained with | (or) to identify errors, then & (and) to keep only rows that pass both checks. The business rules — under 100 sqft is an error, under £100 sale price is a test transaction — come from domain knowledge, not from statistics. This is the critical distinction: statistical methods flag anomalies, but the decision to drop requires knowing what the data represents. .reset_index(drop=True) renumbers the index cleanly after removal — without it, the index would still contain gaps from the deleted rows, which can cause confusion in downstream operations.
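The index gap is easy to demonstrate on a toy frame (a three-row sketch, separate from the housing data):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]})
kept = df[df['x'] != 20]          # drop the middle row

print(list(kept.index))                         # [0, 2] — gap where row 1 was
print(list(kept.reset_index(drop=True).index))  # [0, 1] — renumbered cleanly
```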
Outlier as a Feature — When Extreme Values Are the Signal
The scenario: The fraud analytics team asks for your help. They have a transaction dataset and believe that unusually large transactions are themselves the fraud signal — not errors, not genuine outliers to remove. "We don't want to cap or drop them," the lead says. "We want to flag them. Create a binary column that marks any transaction more than 3 standard deviations above the mean as a potential high-risk transaction. That flag goes straight into the fraud model."
import pandas as pd
import numpy as np
# Transaction dataset — includes some very large amounts
transactions_df = pd.DataFrame({
'transaction_id': ['T01','T02','T03','T04','T05','T06',
'T07','T08','T09','T10','T11','T12'],
'amount': [45, 120, 78, 9500, 55, 89,
62, 104, 8750, 37, 71, 95],
'is_fraud': [0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 0]
})
# Calculate Z-score for each transaction amount
mean_amount = transactions_df['amount'].mean()
std_amount = transactions_df['amount'].std()
transactions_df['amount_zscore'] = (
    (transactions_df['amount'] - mean_amount) / std_amount).round(2)
# Create binary outlier flag — this is now a feature, not a cleanup step,
# and the original amount column stays in the dataset too
# Note: the brief says 3 standard deviations, but with only 12 rows the two
# huge amounts inflate the std so much that no Z-score can exceed
# (n - 1) / sqrt(n) ≈ 3.2 — in this sample neither reaches 3. We use a
# threshold of 2 here; 3 remains the convention on production-sized data
transactions_df['is_high_value'] = (
    transactions_df['amount_zscore'] > 2).astype(int)
# Validate: does the outlier flag correlate with fraud?
corr_amount = transactions_df['amount'].corr(transactions_df['is_fraud'])
corr_flag = transactions_df['is_high_value'].corr(transactions_df['is_fraud'])
print("Correlation with is_fraud:\n")
print(f" amount (raw) : {corr_amount:+.4f}")
print(f" is_high_value : {corr_flag:+.4f}")
print("\nTransactions flagged as high-value:\n")
flagged = transactions_df[transactions_df['is_high_value'] == 1]
print(flagged[['transaction_id','amount','amount_zscore',
'is_high_value','is_fraud']].to_string(index=False))
print(f"\nFull dataset:\n")
print(transactions_df[['transaction_id','amount','amount_zscore',
'is_high_value','is_fraud']].to_string(index=False))
Correlation with is_fraud:
 amount (raw) : +0.9989
 is_high_value : +1.0000
Transactions flagged as high-value:
transaction_id amount amount_zscore is_high_value is_fraud
T04 9500 2.24 1 1
T09 8750 2.03 1 1
Full dataset:
transaction_id amount amount_zscore is_high_value is_fraud
T01 45 -0.44 0 0
T02 120 -0.42 0 0
T03 78 -0.43 0 0
T04 9500 2.24 1 1
T05 55 -0.43 0 0
T06 89 -0.42 0 0
T07 62 -0.43 0 0
T08 104 -0.42 0 0
T09 8750 2.03 1 1
T10 37 -0.44 0 0
T11 71 -0.43 0 0
T12 95 -0.42 0 0
What just happened?
The same Z-score calculation used for detection is now used to create a feature. The two high-value transactions (T04 at 2.24σ and T09 at 2.03σ) are exactly the two fraud cases — note the threshold had to drop to 2σ, because in a 12-row sample the outliers inflate the standard deviation and cap how large their own Z-scores can get. The binary is_high_value flag matches is_fraud perfectly (+1.0000), edging out the raw amount at +0.9989. The original amount column stays in the dataset alongside the flag; both contribute different signal. This is the mindset shift: outliers are not always problems to fix — sometimes they are the answer.
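When a sample is too small for the 3σ rule to fire, a robust score is a useful alternative. A hedged sketch — not part of the lesson's code — of the modified Z-score of Iglewicz and Hoaglin, which measures distance from the median in MAD units, so the outliers cannot inflate their own yardstick:

```python
import numpy as np

# Same transaction amounts as above
amounts = np.array([45, 120, 78, 9500, 55, 89, 62, 104, 8750, 37, 71, 95])

median = np.median(amounts)
mad = np.median(np.abs(amounts - median))   # median absolute deviation

# 0.6745 rescales MAD to match the std under normality; the conventional
# cutoff for this score is 3.5 rather than 3
modified_z = 0.6745 * (amounts - median) / mad
flags = (np.abs(modified_z) > 3.5).astype(int)
print(flags.tolist())  # only the 9500 and 8750 amounts are flagged
```

Because the median and MAD ignore the two extreme amounts entirely, their modified Z-scores are enormous even in this tiny sample.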
Outlier Decision Guide
| Situation | Example | Action | Method |
|---|---|---|---|
| Confirmed data entry error | 50 sqft property, £5 sale | Remove row | Boolean filter on business rules |
| Real but extreme observation | 6,500 sqft mansion | Cap value | .clip(upper=p95) |
| Extreme value IS the signal | Unusually large transaction | Create flag feature | Z-score threshold → 0/1 column |
| Skewed distribution, linear model | sale_price, income | Log transform | np.log1p() |
| Tree-based model | Random forest, XGBoost | Often leave as-is | Trees split on thresholds — outliers rarely matter |
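The log-transform row of the table deserves a quick illustration. A minimal sketch using the sale prices from this lesson, assuming only pandas and NumPy:

```python
import numpy as np
import pandas as pd

prices = pd.Series([245000, 410000, 182000, 560000, 295000, 348000,
                    198000, 620000, 230000, 425000, 310000, 1900000])

# np.log1p computes log(1 + x) — safe even if a value is exactly 0
skew_before = prices.skew()
skew_after = np.log1p(prices).skew()
print(f"before: {skew_before:+.2f}  after: {skew_after:+.2f}")
```

The transform compresses the right tail, so the skew drops substantially; unlike capping, it reshapes every value rather than just the extremes.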
Teacher's Note
The IQR method has a well-known weakness on small datasets: when the distribution is tight, the upper fence can sit very low and flag legitimate values as outliers. In a dataset where all properties are between 800 and 1,600 sqft, a perfectly normal 2,200 sqft property might get flagged. Always cross-check statistical flags against domain knowledge before acting.
One more practical note: always apply outlier treatment after splitting into train and test. If you cap at the 95th percentile of the full dataset, the test set's extreme values influence the cap threshold — that is subtle but real leakage. Compute caps on training data, then apply the same caps to test data.
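A leakage-safe sketch of that workflow (the split itself is hypothetical — the first nine rows stand in for the training set):

```python
import pandas as pd

prices = pd.Series([245000, 410000, 182000, 560000, 295000, 348000,
                    198000, 620000, 230000, 425000, 310000, 1900000])
train, test = prices.iloc[:9], prices.iloc[9:]

# Compute the cap on the training split only...
cap = train.quantile(0.95)

# ...then apply that same cap to both splits — test extremes never touch it
train_capped = train.clip(upper=cap)
test_capped = test.clip(upper=cap)
print(f"cap from train: {cap:,.0f}")
```

In a real pipeline the same principle applies to every fitted statistic: quantiles, means, stds, and encoders are all learned on train and replayed on test.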
Practice Questions
1. Which outlier detection method uses Q1 − 1.5 × IQR and Q3 + 1.5 × IQR as its lower and upper boundaries?
2. The pandas method that replaces values above a ceiling with the ceiling value — keeping the row but bounding the extreme number — is ___.
3. To prevent leakage, outlier thresholds like the 95th percentile cap should be computed on ___ data only, then applied to both splits.
Quiz
1. Which detection method flagged the 6,500 sqft property when the Z-score method missed it, and why did the Z-score fall short of 3?
2. In the fraud detection scenario, why is creating an is_high_value flag the right approach rather than capping or removing the large transactions?
3. For which model type is outlier treatment least important?
Up Next · Lesson 10
Data Transformations
Log, square root, Box-Cox, and power transforms — learn which transformation fixes which problem, when each one is appropriate, and how to validate that a transform actually improved your features.