Feature Engineering Lesson 9 – Outliers | Dataplexa
Beginner Level · Lesson 9

Outliers

An outlier is a value that sits far from the rest of the distribution. But far doesn't automatically mean wrong — a £1.9m property in a housing dataset is genuinely unusual, not a typo. How you respond to an outlier depends entirely on why it's there. Drop it carelessly and you throw away real signal. Leave it in and it can distort your model's coefficients so badly that predictions on normal data become worthless.

The outlier workflow has three steps: detect using statistical rules (IQR or Z-score), diagnose whether the value is a data error or a real extreme observation, then decide — cap, remove, transform, or keep it. The decision changes depending on the model you're building and the business context.

Two Detection Methods

There is no single definition of "outlier" — you choose a rule based on the distribution and the stakes of getting it wrong. The two most widely used methods are IQR fences and Z-score thresholds.

IQR Method

Interquartile range fences

Flags anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR as an outlier.

Best for: Skewed distributions, non-normal data. Does not assume any particular shape. Robust to extreme values — one massive outlier won't shift the quartiles much.

Z-Score Method

Standard deviation distance

Flags anything more than 3 standard deviations from the mean as an outlier.

Best for: Roughly normal distributions. More sensitive than IQR on clean symmetric data. Can be fooled when the distribution is very skewed — the mean and std get pulled toward the outlier.

Detecting Outliers — IQR and Z-Score

The scenario: You're a data scientist at a property analytics company. A colleague passes you a housing dataset and says: "Something looks off with the sqft and sale price columns. Can you run both IQR and Z-score detection on them and tell me which rows are flagged, and whether the two methods agree?" You build a reusable audit that shows the flags side by side.

# pandas — core data library, always imported as pd
import pandas as pd

# numpy — numerical Python, imported as np
# Used here for np.abs() and Z-score calculation
import numpy as np

# Housing dataset — 12 rows including one suspiciously large property
housing_df = pd.DataFrame({
    'property_id': ['H01','H02','H03','H04','H05','H06',
                    'H07','H08','H09','H10','H11','H12'],
    'sqft':        [1200, 2100, 980, 2850, 1450, 1800,
                    1050, 3100, 880, 2200, 1600, 6500],
    'sale_price':  [245000, 410000, 182000, 560000, 295000, 348000,
                    198000, 620000, 230000, 425000, 310000, 1900000]
})

def flag_outliers(series, label):
    """Returns a DataFrame showing IQR and Z-score flags for a Series."""
    # IQR method — Q1, Q3, and the fence boundaries
    Q1  = series.quantile(0.25)
    Q3  = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    iqr_flag = (series < lower) | (series > upper)

    # Z-score method — how many standard deviations from the mean?
    # np.abs() takes the absolute value so we can apply a single threshold
    z_scores = (series - series.mean()) / series.std()
    z_flag   = np.abs(z_scores) > 3

    result = pd.DataFrame({
        'value':    series.values,
        'z_score':  z_scores.round(2).values,
        'iqr_flag': iqr_flag.values,
        'z_flag':   z_flag.values
    }, index=housing_df['property_id'].values)  # .values keeps the printout free of an index-name row
    result['either_flag'] = result['iqr_flag'] | result['z_flag']

    print(f"\n{'='*52}")
    print(f" {label}")
    print(f"  IQR bounds: lower={lower:.0f}  upper={upper:.0f}")
    print(f"  IQR flagged: {iqr_flag.sum()}  |  Z-score flagged: {z_flag.sum()}")
    print(f"{'='*52}")
    print(result.to_string())
    return result

flag_outliers(housing_df['sqft'], 'sqft')
flag_outliers(housing_df['sale_price'], 'sale_price')
====================================================
 sqft
  IQR bounds: lower=-638  upper=4162
  IQR flagged: 1  |  Z-score flagged: 0
====================================================
      value  z_score  iqr_flag  z_flag  either_flag
H01    1200    -0.61     False   False        False
H02    2100    -0.03     False   False        False
H03     980    -0.75     False   False        False
H04    2850     0.46     False   False        False
H05    1450    -0.45     False   False        False
H06    1800    -0.22     False   False        False
H07    1050    -0.71     False   False        False
H08    3100     0.62     False   False        False
H09     880    -0.82     False   False        False
H10    2200     0.04     False   False        False
H11    1600    -0.35     False   False        False
H12    6500     2.82      True   False         True

====================================================
 sale_price
  IQR bounds: lower=-85000  upper=785000
  IQR flagged: 1  |  Z-score flagged: 1
====================================================
        value  z_score  iqr_flag  z_flag  either_flag
H01    245000    -0.49     False   False        False
H02    410000    -0.14     False   False        False
H03    182000    -0.63     False   False        False
H04    560000     0.18     False   False        False
H05    295000    -0.39     False   False        False
H06    348000    -0.28     False   False        False
H07    198000    -0.60     False   False        False
H08    620000     0.31     False   False        False
H09    230000    -0.53     False   False        False
H10    425000    -0.11     False   False        False
H11    310000    -0.36     False   False        False
H12   1900000     3.04      True    True         True

What just happened?

.quantile() computes percentile boundaries — Q1 is the 25th, Q3 is the 75th. The IQR is the spread of the middle 50% of values, and anything beyond 1.5 × IQR outside that range is flagged. The Z-score measures distance from the mean in standard deviation units — values beyond |3| are rare in a normal distribution. Both methods agree that H12's £1.9m sale price is an outlier: 3.04 standard deviations, and far above the £785k upper fence. On sqft, though, they disagree. The IQR fence catches 6,500 sqft, but its Z-score is only 2.82 — the outlier inflates the mean and standard deviation enough to keep its own score under the threshold. Neither method is universally better; running both gives you a more complete picture.
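That sqft behaviour has a name — masking: on small samples, a huge value drags the mean and standard deviation toward itself and ducks under its own Z-score threshold. A common robust alternative, not part of the lesson code above, is the modified Z-score built from the median and the median absolute deviation (MAD); the 0.6745 rescaling and 3.5 cutoff follow the usual Iglewicz–Hoaglin convention.

```python
import numpy as np
import pandas as pd

sqft = pd.Series([1200, 2100, 980, 2850, 1450, 1800,
                  1050, 3100, 880, 2200, 1600, 6500])

# Ordinary Z-score — the 6,500 value inflates both the mean and the std,
# so its own score stays modest
z = (sqft - sqft.mean()) / sqft.std()

# Modified Z-score — median and MAD barely move when one value explodes;
# 0.6745 rescales MAD to match the std of a normal distribution,
# and 3.5 is the conventional cutoff
median = sqft.median()
mad = (sqft - median).abs().median()
modified_z = 0.6745 * (sqft - median) / mad

print(f"plain z for 6500    : {z.iloc[-1]:.2f}")           # stays under 3
print(f"modified z for 6500 : {modified_z.iloc[-1]:.2f}")  # far over 3.5
```

Because the median and MAD are barely moved by one extreme value, the outlier can no longer hide itself from its own score.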

Capping — Keeping the Row, Bounding the Value

The scenario: The analytics lead reviews the outlier report and confirms: "H12 is a genuine property — it's a converted commercial building that ended up in our residential feed. We can't drop it because it's a real sale that happened. Cap sqft and sale_price at the 95th percentile. It should still contribute to training, just without dominating the model." You apply winsorisation and show the before/after effect on skew.

import pandas as pd
import numpy as np

housing_df = pd.DataFrame({
    'property_id': ['H01','H02','H03','H04','H05','H06',
                    'H07','H08','H09','H10','H11','H12'],
    'sqft':        [1200, 2100, 980, 2850, 1450, 1800,
                    1050, 3100, 880, 2200, 1600, 6500],
    'sale_price':  [245000, 410000, 182000, 560000, 295000, 348000,
                    198000, 620000, 230000, 425000, 310000, 1900000]
})

# Compute 95th percentile caps for each column
sqft_cap  = housing_df['sqft'].quantile(0.95)
price_cap = housing_df['sale_price'].quantile(0.95)

print(f"sqft 95th percentile cap   : {sqft_cap:,.0f}")
print(f"price 95th percentile cap  : {price_cap:,.0f}")

# .clip(upper=cap) replaces any value above the cap with the cap value
# The row is kept — only the extreme value is bounded
housing_df['sqft_capped']  = housing_df['sqft'].clip(upper=sqft_cap)
housing_df['price_capped'] = housing_df['sale_price'].clip(upper=price_cap)

# Compare skew before and after for both columns
print("\nSkew comparison:\n")
for orig, capped in [('sqft','sqft_capped'), ('sale_price','price_capped')]:
    before = housing_df[orig].skew()
    after  = housing_df[capped].skew()
    print(f"  {orig:<15}  before: {before:+.2f}   after: {after:+.2f}")

# Show the one row that changed
changed = housing_df[housing_df['sqft'] != housing_df['sqft_capped']]
print("\nRows affected by capping:\n")
print(changed[['property_id','sqft','sqft_capped',
               'sale_price','price_capped']].to_string(index=False))
sqft 95th percentile cap   : 4,630
price 95th percentile cap  : 1,196,000

Skew comparison:

  sqft             before: +2.29   after: +1.37
  sale_price       before: +2.96   after: +2.18

Rows affected by capping:

 property_id  sqft  sqft_capped  sale_price  price_capped
         H12  6500       4630.0     1900000    1196000.0

What just happened?

.clip(upper=cap) replaces every value above the ceiling with the ceiling value — the row stays in the dataset, only the extreme number changes. One row was affected, yet skew for sqft dropped from +2.29 to +1.37 and sale_price from +2.96 to +2.18. With only 12 rows the 95th percentile still sits well above the bulk of the data (4,630 sqft), so some right skew remains — a lower cap would pull it down further. Capping is sometimes called winsorisation, after Charles P. Winsor, the statistician the technique is named for. The key rule: the 95th percentile cap is just a convention. In some datasets you'd use the 99th, in others the 90th. The right threshold depends on how extreme the outliers are and how much of the real distribution you want to preserve.
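The capping above was one-sided. .clip() also accepts a lower bound, so two-sided winsorisation — bounding both tails — is a single call. A sketch on the same sale prices, capping at the 5th and 95th percentiles:

```python
import pandas as pd

prices = pd.Series([245000, 410000, 182000, 560000, 295000, 348000,
                    198000, 620000, 230000, 425000, 310000, 1900000])

# Two-sided winsorisation: bound both tails at the 5th / 95th percentiles
lo, hi = prices.quantile([0.05, 0.95])
winsorised = prices.clip(lower=lo, upper=hi)

# No rows removed — every value now lies inside [lo, hi]
print(f"bounds applied : {lo:,.0f} to {hi:,.0f}")
print(f"values changed : {(winsorised != prices).sum()}")
```

Here both the cheapest and the most expensive sale get pulled to the boundary — useful when a column has suspicious values at both ends.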

Removing Outliers — When the Row Is Genuinely Wrong

The scenario: The data team investigates further and finds two additional rows: a property listed at 50 sqft (clearly a data entry error — the actual listing says 500 sqft) and a sale price of £5 (a test transaction in the system that was never cleaned). Unlike H12, these are errors, not real extreme values. The lead says: "Drop those two rows entirely — they will actively mislead the model if you leave them in." You identify and remove them, then confirm the dataset is clean.

import pandas as pd
import numpy as np

# Extended dataset with two deliberately incorrect rows added
housing_df = pd.DataFrame({
    'property_id': ['H01','H02','H03','H04','H05','H06',
                    'H07','H08','H09','H10','H11','H12','H13','H14'],
    'sqft':        [1200, 2100, 980, 2850, 1450, 1800,
                    1050, 3100, 880, 2200, 1600, 6500, 50, 1400],
    'sale_price':  [245000, 410000, 182000, 560000, 295000, 348000,
                    198000, 620000, 230000, 425000, 310000, 1900000, 280000, 5]
})

print(f"Dataset size before removal: {len(housing_df)} rows\n")

# Business rules for what constitutes a data error in this domain
# A property under 100 sqft is not a habitable dwelling — data entry error
# A sale price under £100 is a test transaction — system artefact
sqft_floor  = 100
price_floor = 100

# Identify error rows before dropping — always inspect what you're removing
errors = housing_df[
    (housing_df['sqft'] < sqft_floor) |
    (housing_df['sale_price'] < price_floor)
]
print("Rows identified as data errors:\n")
print(errors[['property_id','sqft','sale_price']].to_string(index=False))

# Remove the error rows — keep rows that pass both business rule checks
housing_df = housing_df[
    (housing_df['sqft'] >= sqft_floor) &
    (housing_df['sale_price'] >= price_floor)
].reset_index(drop=True)

print(f"\nDataset size after removal: {len(housing_df)} rows")
print(f"Rows removed: {errors.shape[0]}")

# Confirm the minimum values are now sensible
print(f"\nMin sqft after cleaning    : {housing_df['sqft'].min()}")
print(f"Min sale_price after cleaning : £{housing_df['sale_price'].min():,}")
Dataset size before removal: 14 rows

Rows identified as data errors:

 property_id  sqft  sale_price
         H13    50      280000
         H14  1400           5

Dataset size after removal: 12 rows
Rows removed: 2

Min sqft after cleaning    : 880
Min sale_price after cleaning : £182,000

What just happened?

The filter uses Boolean conditions chained with | (or) to identify errors, then & (and) to keep only rows that pass both checks. The business rules — under 100 sqft is an error, under £100 sale price is a test transaction — come from domain knowledge, not from statistics. This is the critical distinction: statistical methods flag anomalies, but the decision to drop requires knowing what the data represents. .reset_index(drop=True) renumbers the index cleanly after removal — without it, the index would still contain gaps from the deleted rows, which can cause confusion in downstream operations.
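A practical variation on outright removal, not shown in the lesson code: split the frame with the error mask and its negation so the dropped rows land in a quarantine frame for auditing instead of disappearing silently. A minimal sketch with three of the rows:

```python
import pandas as pd

df = pd.DataFrame({
    'property_id': ['H01', 'H13', 'H14'],
    'sqft':        [1200, 50, 1400],
    'sale_price':  [245000, 280000, 5],
})

# Build the error mask once, then split with the mask and its negation (~)
is_error = (df['sqft'] < 100) | (df['sale_price'] < 100)
clean      = df[~is_error].reset_index(drop=True)
quarantine = df[is_error].reset_index(drop=True)

print(f"clean: {len(clean)} rows  |  quarantined: {len(quarantine)} rows")
print(quarantine[['property_id', 'sqft', 'sale_price']].to_string(index=False))
```

The quarantine frame can be written out for the data team to investigate — removals stay reversible and explainable.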

Outlier as a Feature — When Extreme Values Are the Signal

The scenario: The fraud analytics team asks for your help. They have a transaction dataset and believe that unusually large transactions are themselves the fraud signal — not errors, not genuine outliers to remove. "We don't want to cap or drop them," the lead says. "We want to flag them. Create a binary column that marks any transaction more than 2 standard deviations above the mean as a potential high-risk transaction. That flag goes straight into the fraud model."

import pandas as pd
import numpy as np

# Transaction dataset — includes some very large amounts
transactions_df = pd.DataFrame({
    'transaction_id': ['T01','T02','T03','T04','T05','T06',
                       'T07','T08','T09','T10','T11','T12'],
    'amount':         [45, 120, 78, 9500, 55, 89,
                       62, 104, 8750, 37, 71, 95],
    'is_fraud':       [0, 0, 0, 1, 0, 0,
                       0, 0, 1, 0, 0, 0]
})

# Calculate Z-score for each transaction amount
mean_amount = transactions_df['amount'].mean()
std_amount  = transactions_df['amount'].std()
transactions_df['amount_zscore'] = (
    (transactions_df['amount'] - mean_amount) / std_amount).round(2)

# Create binary outlier flag — 1 if the Z-score exceeds 2
# A 3-sigma cutoff would flag nothing here: the two huge amounts inflate
# the std so much that they mask themselves (the largest sits near 2.2 sigma)
# This is now a feature, not a cleanup step — we keep the original amount too
transactions_df['is_high_value'] = (
    transactions_df['amount_zscore'] > 2).astype(int)

# Validate: does the outlier flag correlate with fraud?
corr_amount = transactions_df['amount'].corr(transactions_df['is_fraud'])
corr_flag   = transactions_df['is_high_value'].corr(transactions_df['is_fraud'])

print("Correlation with is_fraud:\n")
print(f"  amount (raw)     : {corr_amount:+.4f}")
print(f"  is_high_value    : {corr_flag:+.4f}")

print("\nTransactions flagged as high-value:\n")
flagged = transactions_df[transactions_df['is_high_value'] == 1]
print(flagged[['transaction_id','amount','amount_zscore',
               'is_high_value','is_fraud']].to_string(index=False))

print(f"\nFull dataset:\n")
print(transactions_df[['transaction_id','amount','amount_zscore',
                        'is_high_value','is_fraud']].to_string(index=False))
Correlation with is_fraud:

  amount (raw)     : +0.9989
  is_high_value    : +1.0000

Transactions flagged as high-value:

 transaction_id  amount  amount_zscore  is_high_value  is_fraud
            T04    9500           2.24              1         1
            T09    8750           2.03              1         1

Full dataset:

 transaction_id  amount  amount_zscore  is_high_value  is_fraud
            T01      45          -0.44              0         0
            T02     120          -0.42              0         0
            T03      78          -0.43              0         0
            T04    9500           2.24              1         1
            T05      55          -0.43              0         0
            T06      89          -0.42              0         0
            T07      62          -0.43              0         0
            T08     104          -0.42              0         0
            T09    8750           2.03              1         1
            T10      37          -0.44              0         0
            T11      71          -0.43              0         0
            T12      95          -0.42              0         0

What just happened?

The same Z-score calculation used for detection is now used to create a feature. Note the threshold: 2σ rather than 3σ. The two enormous amounts inflate the standard deviation (to roughly 3,500 against a mean of about 1,580), so T04 only reaches 2.24σ and T09 2.03σ — a 3σ cutoff would flag nothing at all. At 2σ the flag isolates exactly the two fraud cases: is_high_value correlates with is_fraud at +1.0000, edging out the raw amount at +0.9989. The original amount column stays in the dataset alongside the flag; both contribute different signal. This is the mindset shift: outliers are not always problems to fix — sometimes they are the answer.

Outlier Decision Guide

Situation                          | Example                     | Action              | Method
Confirmed data entry error         | 50 sqft property, £5 sale   | Remove row          | Boolean filter on business rules
Real but extreme observation       | 6,500 sqft mansion          | Cap value           | .clip(upper=p95)
Extreme value IS the signal        | Unusually large transaction | Create flag feature | Z-score threshold → 0/1 column
Skewed distribution, linear model  | sale_price, income          | Log transform       | np.log1p()
Tree-based model                   | Random forest, XGBoost      | Often leave as-is   | Trees split on thresholds — outliers rarely matter
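The log-transform row gets its full treatment in the next lesson, but the one-liner is worth a preview. np.log1p computes log(1 + x) — defined at zero and compresses large values hard — so the extreme sale stops dominating the distribution. A quick sketch on the lesson's sale prices:

```python
import numpy as np
import pandas as pd

prices = pd.Series([245000, 410000, 182000, 560000, 295000, 348000,
                    198000, 620000, 230000, 425000, 310000, 1900000])

# log1p(x) = log(1 + x) — safe at zero, squashes the right tail
logged = np.log1p(prices)

print(f"skew before log1p : {prices.skew():+.2f}")
print(f"skew after log1p  : {logged.skew():+.2f}")
```

The skew drops substantially but does not vanish — one reason the next lesson compares several transforms rather than treating log as a cure-all.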

Teacher's Note

The IQR method has a well-known weakness on small datasets: when the distribution is tight, the upper fence can sit very low and flag legitimate values as outliers. In a dataset where all properties are between 800 and 1,600 sqft, a perfectly normal 2,200 sqft property might get flagged. Always cross-check statistical flags against domain knowledge before acting.

One more practical note: always apply outlier treatment after splitting into train and test. If you cap at the 95th percentile of the full dataset, the test set's extreme values influence the cap threshold — that is subtle but real leakage. Compute caps on training data, then apply the same caps to test data.
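In code, that rule is simply: fit the threshold on the training split, reuse it on the test split. A sketch with a hand-made split (in a real pipeline the split would come from something like scikit-learn's train_test_split):

```python
import pandas as pd

# Hypothetical pre-made split of the sqft column
train_sqft = pd.Series([1200, 2100, 980, 2850, 1450, 1800, 1050, 3100])
test_sqft  = pd.Series([880, 2200, 1600, 6500])

# Fit the threshold on the TRAINING split only...
cap = train_sqft.quantile(0.95)

# ...then apply that same cap to both splits — the test outlier cannot
# influence the threshold it is capped by
train_capped = train_sqft.clip(upper=cap)
test_capped  = test_sqft.clip(upper=cap)

print(f"cap learned from train : {cap:.0f}")
print(f"test max before/after  : {test_sqft.max()} -> {test_capped.max():.0f}")
```

Notice the 6,500 sqft outlier sits in the test split and still gets capped at the train-derived threshold — no information flows from test to train.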

Practice Questions

1. Which outlier detection method uses Q1 − 1.5 × IQR and Q3 + 1.5 × IQR as its lower and upper boundaries?



2. The pandas method that replaces values above a ceiling with the ceiling value — keeping the row but bounding the extreme number — is ___.



3. To prevent leakage, outlier thresholds like the 95th percentile cap should be computed on ___ data only, then applied to both splits.



Quiz

1. In the sqft audit, which detection method flagged the 6,500 sqft property, and why did the Z-score method miss it?


2. In the fraud detection scenario, why is creating an is_high_value flag the right approach rather than capping or removing the large transactions?


3. For which model type is outlier treatment least important?


Up Next · Lesson 10

Data Transformations

Log, square root, Box-Cox, and power transforms — learn which transformation fixes which problem, when each one is appropriate, and how to validate that a transform actually improved your features.