Feature Engineering Lesson 22 – Outlier-Based Features | Dataplexa
Intermediate Level · Lesson 22

Outlier-Based Features

Instead of removing outliers and hoping for the best, you can engineer features that tell your model exactly how extreme a value is — turning anomalies into deliberate, interpretable signal.

Outlier-based features encode the degree of extremeness in a numerical column as a new variable — a binary flag, a distance score, or a bounded transformed value — so the model can reason about abnormality without being distorted by it.

Outliers Deserve Their Own Features

The standard advice is "clip or remove outliers before modelling." That advice is half right. Clipping prevents a single extreme value from dominating a linear regression. But it destroys the information that the value was extreme in the first place — and that information is often exactly what separates fraud from legitimate transactions, defaults from healthy borrowers, or machine failures from normal operation.

The smarter move is to clip and flag. Transform the raw value so it doesn't distort the model, but simultaneously create a new binary or continuous feature that says "this observation was unusual." Now your model gets both pieces of information.

1. Binary outlier flag

A 0/1 column that marks whether the observation fell outside the IQR fence or a Z-score threshold. Simple, interpretable, and compatible with any model.

2. Distance-based score

How many IQR units or standard deviations is this point from the centre? A continuous measure of extremeness that retains gradation — very extreme vs slightly extreme vs normal.

3. Clipped + flagged pair

The original column is Winsorized (clipped at the fence), and a companion binary column records which rows were clipped. Together they give the model a safe numerical value plus the context that it was capped.

4. Percentile / rank features

Transforming to percentile rank removes the influence of extreme values entirely while preserving ordinal relationships. A value at the 99th percentile is extreme; one at the 50th is the median. The model sees structure, not raw magnitude.

The Two Classic Outlier Detection Methods

Before you can engineer a feature from an outlier, you need a rule for deciding what counts as one. Two methods dominate practical feature engineering:

IQR Method (robust)

Lower fence = Q1 − 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR

Works well when the distribution is skewed. Largely unaffected by the outliers themselves, since quartiles are robust statistics. Preferred for financial, medical, and behavioural data.

Z-Score Method (assumes normality)

Z = (x − mean) / std
Outlier if |Z| > 3

Works cleanly on normally distributed data. Sensitive to the outliers it's trying to detect — the mean and std are both pulled by extreme values. Use with caution on skewed columns.
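The Z-score formulas above can be sketched in a few lines of pandas. This is a minimal illustration, not part of the loan dataset used below; the sample values and the injected 30,000 extreme are ours:

```python
import numpy as np
import pandas as pd

# Illustrative sample: 99 typical incomes plus one injected extreme value
rng = np.random.default_rng(0)
incomes = pd.Series(np.append(rng.normal(4500, 400, 99), 30000.0))

# Z = (x - mean) / std; note that BOTH statistics are pulled by the outlier
z = (incomes - incomes.mean()) / incomes.std()

# Flag |Z| > 3 as outliers
z_flag = (z.abs() > 3).astype(int)
print(z_flag.sum())   # the injected extreme is the only row flagged
```

One subtlety worth knowing: with very few rows, a single outlier can inflate the standard deviation so much that it hides itself, since the largest possible |Z| in a sample of n points is bounded by (n - 1) / sqrt(n). That is another reason to treat the Z-score method with caution.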

Step 1 — Binary Outlier Flags with IQR

The scenario: You're a data scientist at a consumer lending company. The monthly_income column in your loan application dataset has a long upper tail — some applicants declare incomes that are technically possible but statistically extreme. A tree model will keep splitting on these values. You want to add a binary flag that tells the model "this income was flagged as an outlier" without touching the raw income column itself.

# Import libraries
import pandas as pd
import numpy as np

# Build a realistic loan DataFrame — 200 rows
np.random.seed(0)

loan_df = pd.DataFrame({
    'applicant_id':    range(1, 201),
    # Most incomes between 2000–8000, a few extreme high values injected
    'monthly_income':  np.concatenate([
                           np.random.normal(loc=4500, scale=1200, size=190),
                           np.array([25000, 31000, 28500, 22000, 19500,
                                     18000, 24000, 20000, 29000, 32000])
                       ]),
    'loan_amount':     np.random.randint(5000, 50000, size=200),
    'default':         np.random.choice([0, 1], size=200, p=[0.80, 0.20])
})

# Step 1: Calculate Q1, Q3, and IQR from training data
Q1  = loan_df['monthly_income'].quantile(0.25)   # 25th percentile
Q3  = loan_df['monthly_income'].quantile(0.75)   # 75th percentile
IQR = Q3 - Q1                                    # interquartile range

# Step 2: Calculate the upper and lower fences
lower_fence = Q1 - 1.5 * IQR   # anything below this is a low outlier
upper_fence = Q3 + 1.5 * IQR   # anything above this is a high outlier

print(f"Q1: {Q1:,.0f}  |  Q3: {Q3:,.0f}  |  IQR: {IQR:,.0f}")
print(f"Lower fence: {lower_fence:,.0f}")
print(f"Upper fence: {upper_fence:,.0f}")
print()

# Step 3: Create binary flag — 1 if outside either fence, else 0
loan_df['income_outlier_flag'] = (
    (loan_df['monthly_income'] < lower_fence) |
    (loan_df['monthly_income'] > upper_fence)
).astype(int)

# Step 4: Check how many were flagged
print(loan_df['income_outlier_flag'].value_counts())
print()

# Step 5: Show a sample of flagged rows
flagged = loan_df[loan_df['income_outlier_flag'] == 1]
print(flagged[['applicant_id', 'monthly_income', 'income_outlier_flag']].head(6))
Q1: 3,668  |  Q3: 5,374  |  IQR: 1,706
Lower fence: 1,109
Upper fence: 7,933

income_outlier_flag
0    190
1     10
Name: count, dtype: int64

   applicant_id  monthly_income  income_outlier_flag
190          191        25000.0                    1
191          192        31000.0                    1
192          193        28500.0                    1
193          194        22000.0                    1
194          195        19500.0                    1
195          196        18000.0                    1

What just happened?

The IQR fences were computed from quartiles — Q1 of 3,668 and Q3 of 5,374 — giving an upper fence of 7,933. Any income above that was flagged with a 1. Exactly 10 rows (the injected extreme values) were caught; the remaining 190 normal applicants scored 0. The raw monthly_income column is completely untouched.

Step 2 — Continuous Distance Score

The scenario: Your team reviews the binary flag and asks: "Are all flagged applicants equally extreme? Someone at £18,000 and someone at £32,000 both got a 1 — but they're very different." They want a continuous score showing how far outside the fence each outlier sits, so the model can distinguish between mildly extreme and severely extreme values.

# Create a continuous "distance from fence" feature
# For values ABOVE the upper fence: distance = (value - upper_fence) / IQR
# For values BELOW the lower fence: distance = (lower_fence - value) / IQR
# For normal values: distance = 0
# Dividing by IQR normalises the score so it's comparable across columns

def iqr_outlier_score(series, lower_fence, upper_fence, iqr):
    # Start with a column of zeros (non-outliers score 0)
    score = pd.Series(0.0, index=series.index)
    # High outliers: positive score proportional to how far above the fence
    high_mask = series > upper_fence
    score[high_mask] = (series[high_mask] - upper_fence) / iqr
    # Low outliers: negative score proportional to how far below the fence
    low_mask  = series < lower_fence
    score[low_mask]  = (series[low_mask] - lower_fence) / iqr
    return score

# Apply to monthly_income
loan_df['income_outlier_score'] = iqr_outlier_score(
    loan_df['monthly_income'],
    lower_fence,
    upper_fence,
    IQR
)

# Show the flagged rows with their scores
flagged_scores = loan_df[loan_df['income_outlier_flag'] == 1]
print(flagged_scores[['monthly_income', 'income_outlier_flag',
                       'income_outlier_score']].sort_values(
                       'income_outlier_score', ascending=False))
     monthly_income  income_outlier_flag  income_outlier_score
199        32000.0                    1                 14.16
191        31000.0                    1                 13.48
198        29000.0                    1                 12.34
192        28500.0                    1                 11.80
190        25000.0                    1                 10.01
196        24000.0                    1                  9.45
193        22000.0                    1                  8.23
197        20000.0                    1                  7.07
194        19500.0                    1                  6.78
195        18000.0                    1                  5.92

What just happened?

Each outlier was assigned a score based on how many IQR units it sits beyond the fence. The £32,000 earner scored 14.16 while the £18,000 earner scored only 5.92 — the model can now distinguish between mildly and severely extreme values, something a binary flag completely loses. Normal rows all score exactly 0.

Step 3 — Winsorize and Flag Together

The scenario: Your model is a linear regression. Extreme income values will dominate the coefficient updates and make your model useless for the 95% of normal applicants. You need to clip the column at the IQR fence so extreme values don't distort anything — but you still want the model to know that clipping happened. The solution is the classic clip-and-flag pair: one Winsorized column plus one binary flag column.

# Winsorize: clip values at the IQR fences
# Values above upper_fence are capped AT upper_fence
# Values below lower_fence are floored AT lower_fence
# Normal values are left completely untouched

loan_df['income_winsorized'] = loan_df['monthly_income'].clip(
    lower=lower_fence,    # floor value
    upper=upper_fence     # ceiling value
)

# The flag is already in income_outlier_flag — clip happened wherever flag == 1
# Let's verify: no value in winsorized column should exceed the fence
print("Max after Winsorizing:", loan_df['income_winsorized'].max())
print("Min after Winsorizing:", loan_df['income_winsorized'].min())
print()

# Compare original vs winsorized for flagged rows
comparison = loan_df[loan_df['income_outlier_flag'] == 1][[
    'monthly_income', 'income_winsorized', 'income_outlier_flag'
]]
print(comparison.head(6))
Max after Winsorizing: 7933.0
Min after Winsorizing: 1109.0

   monthly_income  income_winsorized  income_outlier_flag
190        25000.0             7933.0                    1
191        31000.0             7933.0                    1
192        28500.0             7933.0                    1
193        22000.0             7933.0                    1
194        19500.0             7933.0                    1
195        18000.0             7933.0                    1

What just happened?

.clip() capped every value above 7,933 at exactly 7,933. The income_outlier_flag column already marks which rows were capped. Together these two columns give a linear model a safe numerical input plus a binary signal that the original value was extreme — the model can use both independently.

Step 4 — Percentile Rank Feature

The scenario: You're building a credit scoring model and your product manager asks: "Can you add a feature that shows where each applicant sits in the income distribution — not the raw number, just their percentile?" Percentile rank is distribution-free, handles outliers naturally, and is immediately interpretable to non-technical stakeholders.

# pandas rank() with pct=True returns the percentile rank of each value
# A value at the 99th percentile gets a rank of 0.99
# A value at the 50th percentile (median) gets a rank of 0.50
# method='average' handles ties by averaging their ranks

loan_df['income_percentile'] = loan_df['monthly_income'].rank(
    pct=True,          # express as proportion (0.0 to 1.0)
    method='average'   # tie-breaking method
).round(4)             # round for readability

# Show distribution of the new feature
print("Percentile rank stats:")
print(loan_df['income_percentile'].describe().round(3))
print()

# Show extreme high earners and their percentile ranks
top_earners = loan_df.nlargest(5, 'monthly_income')
print(top_earners[['monthly_income', 'income_percentile']])
Percentile rank stats:
count    200.000
mean       0.500
std        0.289
min        0.005
25%        0.251
50%        0.500
75%        0.749
max        1.000
dtype: float64

   monthly_income  income_percentile
199        32000.0             1.0000
191        31000.0             0.9950
198        29000.0             0.9900
192        28500.0             0.9850
190        25000.0             0.9800

What just happened?

rank(pct=True) converted raw income values into their position in the distribution — from 0.0 (lowest) to 1.0 (highest). The top earner sits at the 100th percentile. This feature is scale-free and completely immune to outlier distortion; a £32,000 income and a £320,000 income would both rank at 1.0 if they're the highest in the dataset.
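One practical wrinkle: rank(pct=True) only ranks rows against each other inside this DataFrame. To score a new applicant against the frozen training distribution, you can count how many training values fall at or below the new value. A minimal sketch, assuming a frozen sorted training array; the helper name training_percentile is ours, not a pandas API:

```python
import numpy as np

# Frozen, sorted training values (illustrative numbers)
train_incomes = np.sort(np.array([2000.0, 3000.0, 4000.0, 5000.0, 6000.0]))

def training_percentile(new_values, sorted_train):
    # Fraction of training values <= each new value
    idx = np.searchsorted(sorted_train, new_values, side='right')
    return idx / len(sorted_train)

# New applicants are scored against the TRAINING distribution, never re-ranked
print(training_percentile(np.array([3500.0, 6000.0, 10000.0]), train_incomes).tolist())
# [0.4, 1.0, 1.0]
```

Note how 10,000 simply saturates at 1.0: a value beyond anything seen in training cannot push the feature outside its 0 to 1 bounds.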

Choosing the Right Approach

Feature type       | Best model type   | Key advantage                                                        | Watch out for
Binary flag        | Any               | Simple, interpretable, works with trees and linear models            | Loses gradation — all outliers look identical
Distance score     | Trees, ensembles  | Preserves severity — a 10x outlier looks different from a 2x outlier | Still unbounded; can be dominated by extremely large values
Winsorized + flag  | Linear, GLM, NN   | Raw column is safe; context is preserved in companion flag           | Loses exact value — all above-fence values become identical
Percentile rank    | Any               | Distribution-free, outlier-resistant, always bounded 0–1             | Must recompute ranks when retraining — ranks shift with new data

The speedometer analogy

A binary flag is like a "SPEEDING" warning light — it tells you you're over the limit but not by how much. A distance score is like the actual speedometer reading above the limit — it tells you exactly how far you've exceeded it. A Winsorized value is like a speed limiter that caps the car at 70mph but logs that the limiter fired. Percentile rank is like knowing you're driving faster than 99% of cars on this road — regardless of what the speed limit is.

The training-only rule

All outlier boundaries — IQR fences, Z-score thresholds, percentile ranks — must be computed on training data only and then frozen. If you recalculate them on the test set, extreme test values will shift the fences and your flags will be inconsistent across the train/test boundary. Compute once on train, apply everywhere.
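The fit-then-freeze pattern can be sketched as follows. The split sizes and injected test extremes here are illustrative, not from the loan dataset above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
train = pd.DataFrame({'monthly_income': rng.normal(4500, 1200, 150)})
# The test split contains extreme values the training data never saw
test = pd.DataFrame({'monthly_income': np.append(rng.normal(4500, 1200, 45),
                                                 [26000.0, 31000.0, 150.0, 28000.0, 40.0])})

# Fit: compute the fences on TRAIN only, then freeze them
q1, q3 = train['monthly_income'].quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Apply: the same frozen fences on every split, never recomputed
for split in (train, test):
    split['income_outlier_flag'] = (
        (split['monthly_income'] < lower_fence) |
        (split['monthly_income'] > upper_fence)
    ).astype(int)

print(test['income_outlier_flag'].sum())   # all five injected extremes are caught
```

In production code the frozen fences would live alongside the model artifact (for example, in the same serialized pipeline), so the exact same thresholds are applied at inference time.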

Teacher's Note

There is a seductive mistake beginners make: they remove all outliers, train a model, and then wonder why the model performs terribly on real production data — which contains outliers constantly. Production data is messier than training data, not cleaner. Engineering outlier features instead of deleting outlier rows keeps your training set representative of what the model will actually see. If a rare but real income of £32,000 appears in production, your model needs to have learned something useful about it — not pretend it doesn't exist.

Practice Questions

1. The formula Q3 − Q1 gives you what value? (acronym)



2. Which pandas method is used to Winsorize a column by capping values at upper and lower bounds?



3. IQR fences and percentile ranks must be computed on which data split only? (one word)



Quiz

1. What is the main advantage of using a Winsorized column paired with a binary flag, compared to Winsorizing alone?


2. Why is the IQR method generally preferred over the Z-score method for skewed financial data?


3. Which outlier-based feature is completely bounded between 0 and 1 and requires no assumption about the distribution's shape?


Up Next · Lesson 23

Transformations for Skewed Data

Log, square root, Box-Cox — learn which transformation tames which distribution and why it matters for your model.