EDA Lesson 9 – Outliers Handling | Dataplexa
Beginner Level · Lesson 9

Outlier Handling

Detecting outliers is step one — now comes the harder question: what do you actually do with them? Delete, cap, transform, or keep? The answer changes every time, and getting it wrong can quietly destroy your analysis.

You Found the Outlier. Now What?

Imagine you're analysing employee salaries at a mid-sized company. You run your IQR check and flag a $4,200,000 annual salary. Your first instinct might be to delete it. But wait — that's the CEO. Removing that row would be factually wrong. The data is correct; it just reflects real inequality in the company structure.

On the other hand, if you're analysing customer ages and you find someone listed as 847 years old, that's clearly a typo. Keeping it would corrupt every model you build. Same technique to detect both — completely different response required.

🧭 The Four Handling Strategies

Every outlier situation maps to one of these: Remove it (when it's a data error), Cap it (when it's real but extreme), Transform it (when skew is causing the problem), or Keep it (when it carries genuine signal). We'll cover all four with real code.

Strategy 1 — Removing Outliers

Removal is appropriate when you have strong reason to believe an outlier is a recording error, a corrupted entry, or a value physically impossible in context. You identify the offending rows using your IQR or Z-score mask from Lesson 8, then simply filter them out.

The scenario: You're a data analyst at a fitness app company. Your team has collected daily step counts from 10,000 users, but you've noticed during quality checks that some entries are clearly broken — values of 0 steps logged for the full day (device not worn), and absurdly high counts like 180,000 steps (sensor malfunction — far beyond anything the team's validity rules consider physically plausible in a single day). Your manager has asked you to clean the dataset before it goes into the monthly health trend report. Keeping these bad rows would inflate the "average daily steps" metric used in marketing materials. You need to identify and remove only the impossible values, leaving genuine high performers untouched.

import pandas as pd   # data manipulation — DataFrame, filtering, stats
import numpy as np    # NaN values and numerical operations

# Daily step counts from fitness app users — some clearly broken
steps = pd.DataFrame({
    'user_id': ['U001','U002','U003','U004','U005',
                'U006','U007','U008','U009','U010'],
    # Step counts — U004 is zero (device off), U008 is a sensor fault
    'steps':   [8200, 11500, 7300, 0, 9400,
               12100, 6800, 180000, 8900, 10200]
})

# Print summary before cleaning so we can compare
print("Before removal:")
print(f"  Rows: {len(steps)}")
print(f"  Mean steps: {steps['steps'].mean():.0f}")
print(f"  Max steps:  {steps['steps'].max()}")
print()

# Define business rules: valid steps are between 500 and 100,000
valid_mask = (steps['steps'] >= 500) & (steps['steps'] <= 100000)

# Apply the mask — keep only rows where both conditions are True
steps_clean = steps[valid_mask].copy()

# Print summary after cleaning
print("After removal:")
print(f"  Rows: {len(steps_clean)}")
print(f"  Mean steps: {steps_clean['steps'].mean():.0f}")
print(f"  Max steps:  {steps_clean['steps'].max()}")
print()

# Show which rows were removed
removed = steps[~valid_mask]
print("Removed rows:")
print(removed)

💡 What just happened?

We used pandas — specifically boolean masking with & to filter rows matching two conditions simultaneously. The mean daily steps dropped from a misleading 25,440 all the way down to 9,300 just by removing two bad rows. That's the silent damage outliers do — they inflate a stat used in a marketing deck, and nobody questions the number because it came from "the data." Notice we used business rules (500–100,000) rather than the IQR formula here. Domain knowledge often beats pure statistics when the bad values are physically impossible, not just statistically extreme.

Strategy 2 — Capping (Winsorization)

Capping, also called Winsorization, is the technique of replacing extreme values with the boundary value of an acceptable range — rather than deleting the row entirely. It's the right call when the outlier is real and legitimate, but its extremity would distort a model or chart.

Think of it like this: you're not saying the CEO doesn't exist or that the data is wrong. You're saying "for the purpose of this statistical model, I'll treat anyone earning above the 95th percentile as if they earn exactly the 95th percentile value." The row stays. The extreme influence is tamed.

The scenario: You work at a loan processing company. Your team is preparing a dataset of applicant annual incomes to feed into a credit scoring model. The incomes are real — nobody's data is wrong — but three ultra-high earners (a doctor, an exec, and a professional athlete) have incomes so much higher than everyone else that they'll dominate the model's variance and throw off predictions for average applicants. Your task is to cap values at the 5th and 95th percentiles so the model treats extreme earners proportionally, without losing those rows from the dataset.

import pandas as pd   # DataFrame and quantile calculations
import numpy as np    # numerical support

# Loan applicant annual income in thousands — three extreme earners
applicants = pd.DataFrame({
    'applicant': ['A01','A02','A03','A04','A05',
                  'A06','A07','A08','A09','A10'],
    # Income in $K — A08 is a doctor, A09 an exec, A10 a sports star
    'income_k': [42, 55, 61, 48, 53,
                67, 58, 320, 510, 1200]
})

# Calculate the 5th and 95th percentile caps
lower_cap = applicants['income_k'].quantile(0.05)
upper_cap = applicants['income_k'].quantile(0.95)
print(f"Lower cap (5th pct):  ${lower_cap:.1f}K")
print(f"Upper cap (95th pct): ${upper_cap:.1f}K")
print()

# clip() replaces anything below lower_cap or above upper_cap
# with the cap value itself — the row is kept, just the value is bounded
applicants['income_capped'] = applicants['income_k'].clip(
    lower=lower_cap, upper=upper_cap
)

# Compare original vs capped values side by side
print(applicants[['applicant', 'income_k', 'income_capped']])
print()

# Show how the mean and std changed after capping
print(f"Original  — Mean: {applicants['income_k'].mean():.1f}K  Std: {applicants['income_k'].std():.1f}K")
print(f"Capped    — Mean: {applicants['income_capped'].mean():.1f}K  Std: {applicants['income_capped'].std():.1f}K")

💡 What just happened?

We used pandas — specifically the .clip() method, which is pandas' built-in Winsorization tool. It accepts a lower and upper bound and replaces anything outside those bounds with the boundary value. A01's $42K income was nudged up to $44.7K (the 5th percentile), and A10's $1,200K was pulled down to $889.5K (the 95th). Nobody was deleted. The standard deviation dropped from roughly 371K to roughly 285K — about 23% less spread, which means our credit model won't be dominated by three exceptional earners. All 10 rows are still in the dataset.

Visual — Before and After Capping

A side-by-side dot plot shows how capping tames the extreme values without removing them from the dataset entirely.

Income Distribution — Before vs After Capping ($K)

[Dot plot placeholder: the same income values plotted on a $0K–$1,200K axis before and after capping. Legend: normal values, outliers (original), capped value. After capping, the extreme $1,200K value sits at the 95th-percentile cap.]

Strategy 3 — Log Transformation

Sometimes the problem isn't a single bad row — it's that your entire column has a heavily right-skewed distribution. Incomes, house prices, social media follower counts, and transaction values all tend to cluster near zero with a long tail of very large numbers. In these cases, applying a log transformation compresses the tail and brings the distribution closer to normal, making every statistical method you apply more reliable.

The scenario: You're preparing a dataset of online ad campaign spend for a machine learning model at a digital marketing agency. The spend values range from $5 to $180,000 — the agency has everything from tiny local businesses to global brands. You can't remove any rows because every client matters, and capping would distort budget planning reports. Your data scientist lead has asked you to apply a log transformation before model training so the skew doesn't throw off the regression coefficients. You need to add a log-transformed column alongside the original so both versions are available in the output file.

import pandas as pd   # DataFrame operations and descriptive stats
import numpy as np    # np.log1p() — log transformation that handles zeros safely

# Digital ad campaign daily spend in USD — huge range across client sizes
campaigns = pd.DataFrame({
    'campaign': ['C01','C02','C03','C04','C05',
                 'C06','C07','C08','C09','C10'],
    # Spend ranges from a $5 micro-budget to a $180K enterprise campaign
    'spend_usd': [5, 120, 850, 200, 3400,
                 75, 12000, 450, 68000, 180000]
})

# np.log1p() computes log(1 + x) — safer than np.log() because
# log(0) is undefined; log1p(0) safely returns 0 instead
campaigns['log_spend'] = np.log1p(campaigns['spend_usd'])
campaigns['log_spend'] = campaigns['log_spend'].round(3)

# Compare skew and std before vs after transformation
print("Original spend_usd:")
print(f"  Skewness: {campaigns['spend_usd'].skew():.2f}")
print(f"  Std Dev:  {campaigns['spend_usd'].std():.0f}")
print()
print("Log-transformed log_spend:")
print(f"  Skewness: {campaigns['log_spend'].skew():.2f}")
print(f"  Std Dev:  {campaigns['log_spend'].std():.2f}")
print()
print(campaigns[['campaign', 'spend_usd', 'log_spend']])

💡 What just happened?

We used numpy for the np.log1p() function — numpy is the go-to library for mathematical operations on arrays and Series. We used log1p specifically (not plain log) because it safely handles zero values by computing log(1 + x) instead of log(x), preventing the "log of zero is undefined" error. The result is dramatic: skewness drops from roughly 2.6 (heavily right-skewed) to roughly 0.2 (nearly symmetric). The standard deviation collapses from roughly 57,900 to about 3.2 — the log scale compresses the enormous differences between a $5 and $180,000 budget into a manageable range of 1.79 to 12.10.

Strategy 4 — Keeping Outliers Intentionally

Sometimes the right answer is to do nothing at all. If you're building a fraud detection model, the outliers are the signal — they're the fraudulent transactions you're trying to find. Removing them would make your model blind to exactly what it needs to detect.

The scenario: You're on the data team at a bank. Your fraud analytics system has flagged several transactions in a customer dataset as statistical outliers using IQR. Before the data goes to the fraud model team, your job is to produce a report showing which rows are flagged, their outlier status, and their transaction amounts — so the modelling team can confirm they're keeping those rows as intentional signal, not removing them as noise. You need to add a clean boolean flag column and print a full summary.

import pandas as pd   # DataFrame operations and value_counts
import numpy as np    # NaN and numerical support

# Bank transaction amounts — mostly routine, two flagged as suspicious
transactions = pd.DataFrame({
    'txn_id':  ['T001','T002','T003','T004','T005',
               'T006','T007','T008','T009','T010'],
    # Transaction value in USD — T003 and T009 are suspiciously large
    'amount':  [45, 120, 9800, 88, 200,
               67, 155, 91, 14200, 73],
    # Whether the transaction has already been manually reviewed
    'reviewed':['No','No','Yes','No','No',
               'No','No','No','No','No']
})

# Apply IQR method to flag outliers — but we will NOT remove them
Q1  = transactions['amount'].quantile(0.25)
Q3  = transactions['amount'].quantile(0.75)
IQR = Q3 - Q1
transactions['flagged_outlier'] = (
    (transactions['amount'] < Q1 - 1.5 * IQR) |
    (transactions['amount'] > Q3 + 1.5 * IQR)
)

# Print the full dataset — all rows intact, outliers clearly labelled
print(transactions)
print()

# Summary count of flagged vs normal transactions
print("Outlier flag summary:")
print(transactions['flagged_outlier'].value_counts())

💡 What just happened?

We used pandas for boolean masking and .value_counts() — a fast way to tally up how many True vs False values are in a column. Both suspicious transactions (T003 at $9,800 and T009 at $14,200) are flagged, but the dataset still has 10 rows. The flagged_outlier column is the deliverable here — it lets the fraud model team use those rows as positive class examples without the data team silently destroying them. This is a pattern you'll see constantly in production data pipelines: flag and document, never silently drop.

Decision Guide — Which Strategy When

Strategy  | Use When                                         | Watch Out For
Remove    | Data error, impossible value, typo               | Losing too many rows (>5%)
Cap       | Value is real but will dominate the model        | Distorting reports that need actual values
Transform | Whole column is skewed, not just one row         | Interpreting results back in original units
Keep      | Outlier is the signal (fraud, anomaly detection) | Forgetting to document why you kept it

🍎 Teacher's Note

Here's a rule of thumb that will save you from bad decisions: never handle outliers before you understand them. Before you remove, cap, or transform anything, print the rows, check the source, and ask a domain expert if needed. The one time you silently delete a real data point is the one time it was the most important row in the dataset. Document every decision you make — even "I chose to keep this outlier because it's legitimate high-value behaviour" belongs in your analysis notes.

Practice Questions

1. Which pandas method is used to cap outliers by replacing values beyond a boundary with the boundary value itself?



2. Which numpy function applies a log transformation that safely handles zero values by computing log(1 + x)?



3. If you're building a fraud detection model, what is the correct outlier handling strategy for suspiciously large transactions?



Quiz

1. A dataset of house prices has one legitimate $10M mansion entry. It's real, but it's skewing your regression model. What is the best handling strategy?


2. A transaction value column has skewness of 3.1. No single row is a data error — the column is just heavily right-skewed. Which approach is most appropriate?


3. You're handing a transaction dataset to a fraud modelling team. The flagged outlier rows are the fraud signals they need to train on. What should you do?


Up Next · Lesson 10

Distributions

Learn to recognise the shape of your data — normal, skewed, bimodal, and uniform — and understand why the distribution type changes everything about how you analyse it.