EDA Lesson 9 – Outliers Handling | Dataplexa
Beginner Level · Lesson 9

Outlier Handling

Detecting outliers is step one — now comes the harder question: what do you actually do with them? Delete, cap, transform, or keep? The answer changes every time, and getting it wrong can quietly destroy your analysis.

You Found the Outlier. Now What?

Imagine you're analysing employee salaries at a mid-sized company. You run your IQR check and flag a $4,200,000 annual salary. Your first instinct might be to delete it. But wait — that's the CEO. Removing that row would be factually wrong. The data is correct; it just reflects real inequality in the company structure.

On the other hand, if you're analysing customer ages and you find someone listed as 847 years old, that's clearly a typo. Keeping it would corrupt every model you build. Same technique to detect both — completely different response required.

🧭 The Four Handling Strategies

Every outlier situation maps to one of these: Remove it (when it's a data error), Cap it (when it's real but extreme), Transform it (when skew is causing the problem), or Keep it (when it carries genuine signal). We'll cover all four with real code.

Strategy 1 — Removing Outliers

Removal is appropriate when you have strong reason to believe an outlier is a recording error, a corrupted entry, or a value physically impossible in context. You identify the offending rows using your IQR or Z-score mask from Lesson 8, then simply filter them out.

The scenario: You're a data analyst at a fitness app company. Your team has collected daily step counts from 10,000 users, but you've noticed during quality checks that some entries are clearly broken — values of 0 steps logged for the full day (device not worn), and absurdly high counts like 180,000 steps (sensor malfunction — far beyond anything the team's validity rules consider physically plausible in a single day). Your manager has asked you to clean the dataset before it goes into the monthly health trend report. Keeping these bad rows would inflate the "average daily steps" metric used in marketing materials. You need to identify and remove only the impossible values, leaving genuine high performers untouched.

import pandas as pd   # data manipulation — DataFrame, filtering, stats
import numpy as np    # NaN values and numerical operations

# Daily step counts from fitness app users — some clearly broken
steps = pd.DataFrame({
    'user_id': ['U001','U002','U003','U004','U005',
                'U006','U007','U008','U009','U010'],
    # Step counts — U004 is zero (device off), U008 is a sensor fault
    'steps':   [8200, 11500, 7300, 0, 9400,
               12100, 6800, 180000, 8900, 10200]
})

# Print summary before cleaning so we can compare
print("Before removal:")
print(f"  Rows: {len(steps)}")
print(f"  Mean steps: {steps['steps'].mean():.0f}")
print(f"  Max steps:  {steps['steps'].max()}")
print()

# Define business rules: valid steps are between 500 and 100,000
valid_mask = (steps['steps'] >= 500) & (steps['steps'] <= 100000)

# Apply the mask — keep only rows where both conditions are True
steps_clean = steps[valid_mask].copy()

# Print summary after cleaning
print("After removal:")
print(f"  Rows: {len(steps_clean)}")
print(f"  Mean steps: {steps_clean['steps'].mean():.0f}")
print(f"  Max steps:  {steps_clean['steps'].max()}")
print()

# Show which rows were removed
removed = steps[~valid_mask]
print("Removed rows:")
print(removed)

💡 What just happened?

We used pandas — specifically boolean masking with & to filter rows matching two conditions simultaneously. The mean daily steps dropped from a misleading 25,440 all the way down to 9,300 just by removing two bad rows. That's the silent damage outliers do — they inflate a stat used in a marketing deck, and nobody questions the number because it came from "the data." Notice we used business rules (500–100,000) rather than the IQR formula here. Domain knowledge often beats pure statistics when the bad values are physically impossible, not just statistically extreme.

Strategy 2 — Capping (Winsorization)

Capping, also called Winsorization, is the technique of replacing extreme values with the boundary value of an acceptable range — rather than deleting the row entirely. It's the right call when the outlier is real and legitimate, but its extremity would distort a model or chart.

Think of it like this: you're not saying the CEO doesn't exist or that the data is wrong. You're saying "for the purpose of this statistical model, I'll treat anyone earning above the 95th percentile as if they earn exactly the 95th percentile value." The row stays. The extreme influence is tamed.

The scenario: You work at a loan processing company. Your team is preparing a dataset of applicant annual incomes to feed into a credit scoring model. The incomes are real — nobody's data is wrong — but three ultra-high earners (a doctor, an exec, and a professional athlete) have incomes so much higher than everyone else that they'll dominate the model's variance and throw off predictions for average applicants. Your task is to cap values at the 5th and 95th percentiles so the model treats extreme earners proportionally, without losing those rows from the dataset.

import pandas as pd   # DataFrame and quantile calculations
import numpy as np    # numerical support

# Loan applicant annual income in thousands — three extreme earners
applicants = pd.DataFrame({
    'applicant': ['A01','A02','A03','A04','A05',
                  'A06','A07','A08','A09','A10'],
    # Income in $K — A08 is a doctor, A09 an exec, A10 a sports star
    'income_k': [42, 55, 61, 48, 53,
                67, 58, 320, 510, 1200]
})

# Calculate the 5th and 95th percentile caps
lower_cap = applicants['income_k'].quantile(0.05)
upper_cap = applicants['income_k'].quantile(0.95)
print(f"Lower cap (5th pct):  ${lower_cap:.1f}K")
print(f"Upper cap (95th pct): ${upper_cap:.1f}K")
print()

# clip() replaces anything below lower_cap or above upper_cap
# with the cap value itself — the row is kept, just the value is bounded
applicants['income_capped'] = applicants['income_k'].clip(
    lower=lower_cap, upper=upper_cap
)

# Compare original vs capped values side by side
print(applicants[['applicant', 'income_k', 'income_capped']])
print()

# Show how the mean and std changed after capping
print(f"Original  — Mean: {applicants['income_k'].mean():.1f}K  Std: {applicants['income_k'].std():.1f}K")
print(f"Capped    — Mean: {applicants['income_capped'].mean():.1f}K  Std: {applicants['income_capped'].std():.1f}K")

💡 What just happened?

We used pandas — specifically the .clip() method, which is pandas' built-in Winsorization tool. It accepts a lower and upper bound and replaces anything outside those bounds with the boundary value. A01's $42K income was nudged up to $44.7K (the 5th percentile), and A10's $1,200K was pulled down to $889.5K (the 95th). Nobody was deleted. The standard deviation dropped from roughly 371K to roughly 285K — about 23% less spread, which means our credit model won't be dominated by three exceptional earners. All 10 rows are still in the dataset.

Visual — Before and After Capping

A side-by-side dot plot shows how capping tames the extreme values without removing them from the dataset entirely.

Income Distribution — Before vs After Capping ($K)

[Dot plot placeholder: the same income values plotted on a $0K–$1,200K axis before and after capping. Legend: normal values, outliers (original), capped value. After capping, the extreme $1,200K value sits at the 95th-percentile cap.]

Strategy 3 — Log Transformation

Sometimes the problem isn't a single bad row — it's that your entire column has a heavily right-skewed distribution. Incomes, house prices, social media follower counts, and transaction values all tend to cluster near zero with a long tail of very large numbers. In these cases, applying a log transformation compresses the tail and brings the distribution closer to normal, making every statistical method you apply more reliable.

The scenario: You're preparing a dataset of online ad campaign spend for a machine learning model at a digital marketing agency. The spend values range from $5 to $180,000 — the agency has everything from tiny local businesses to global brands. You can't remove any rows because every client matters, and capping would distort budget planning reports. Your data scientist lead has asked you to apply a log transformation before model training so the skew doesn't throw off the regression coefficients. You need to add a log-transformed column alongside the original so both versions are available in the output file.

import pandas as pd   # DataFrame operations and descriptive stats
import numpy as np    # np.log1p() — log transformation that handles zeros safely

# Digital ad campaign daily spend in USD — huge range across client sizes
campaigns = pd.DataFrame({
    'campaign': ['C01','C02','C03','C04','C05',
                 'C06','C07','C08','C09','C10'],
    # Spend ranges from a $5 micro-budget to a $180K enterprise campaign
    'spend_usd': [5, 120, 850, 200, 3400,
                 75, 12000, 450, 68000, 180000]
})

# np.log1p() computes log(1 + x) — safer than np.log() because
# log(0) is undefined; log1p(0) safely returns 0 instead
campaigns['log_spend'] = np.log1p(campaigns['spend_usd'])
campaigns['log_spend'] = campaigns['log_spend'].round(3)

# Compare skew and std before vs after transformation
print("Original spend_usd:")
print(f"  Skewness: {campaigns['spend_usd'].skew():.2f}")
print(f"  Std Dev:  {campaigns['spend_usd'].std():.0f}")
print()
print("Log-transformed log_spend:")
print(f"  Skewness: {campaigns['log_spend'].skew():.2f}")
print(f"  Std Dev:  {campaigns['log_spend'].std():.2f}")
print()
print(campaigns[['campaign', 'spend_usd', 'log_spend']])

💡 What just happened?

We used numpy for the np.log1p() function — numpy is the go-to library for mathematical operations on arrays and Series. We used log1p specifically (not plain log) because it safely handles zero values by computing log(1 + x) instead of log(x), preventing the "log of zero is undefined" error. The result is dramatic: skewness drops from roughly 2.6 (heavily right-skewed) to roughly 0.2 (nearly symmetric). The standard deviation collapses from roughly 57,900 to about 3.2 — the log scale compresses the enormous differences between a $5 and $180,000 budget into a manageable range of 1.79 to 12.10.

Strategy 4 — Keeping Outliers Intentionally

Sometimes the right answer is to do nothing at all. If you're building a fraud detection model, the outliers are the signal — they're the fraudulent transactions you're trying to find. Removing them would make your model blind to exactly what it needs to detect.

The scenario: You're on the data team at a bank. Your fraud analytics system has flagged several transactions in a customer dataset as statistical outliers using IQR. Before the data goes to the fraud model team, your job is to produce a report showing which rows are flagged, their outlier status, and their transaction amounts — so the modelling team can confirm they're keeping those rows as intentional signal, not removing them as noise. You need to add a clean boolean flag column and print a full summary.

import pandas as pd   # DataFrame operations and value_counts
import numpy as np    # NaN and numerical support

# Bank transaction amounts — mostly routine, two flagged as suspicious
transactions = pd.DataFrame({
    'txn_id':  ['T001','T002','T003','T004','T005',
               'T006','T007','T008','T009','T010'],
    # Transaction value in USD — T003 and T009 are suspiciously large
    'amount':  [45, 120, 9800, 88, 200,
               67, 155, 91, 14200, 73],
    # Whether the transaction has already been manually reviewed
    'reviewed':['No','No','Yes','No','No',
               'No','No','No','No','No']
})

# Apply IQR method to flag outliers — but we will NOT remove them
Q1  = transactions['amount'].quantile(0.25)
Q3  = transactions['amount'].quantile(0.75)
IQR = Q3 - Q1
transactions['flagged_outlier'] = (
    (transactions['amount'] < Q1 - 1.5 * IQR) |
    (transactions['amount'] > Q3 + 1.5 * IQR)
)

# Print the full dataset — all rows intact, outliers clearly labelled
print(transactions)
print()

# Summary count of flagged vs normal transactions
print("Outlier flag summary:")
print(transactions['flagged_outlier'].value_counts())

💡 What just happened?

We used pandas for boolean masking and .value_counts() — a fast way to tally up how many True vs False values are in a column. Both suspicious transactions (T003 at $9,800 and T009 at $14,200) are flagged, but the dataset still has 10 rows. The flagged_outlier column is the deliverable here — it lets the fraud model team use those rows as positive class examples without the data team silently destroying them. This is a pattern you'll see constantly in production data pipelines: flag and document, never silently drop.

Decision Guide — Which Strategy When

Strategy  | Use When                                         | Watch Out For
Remove    | Data error, impossible value, typo               | Losing too many rows (>5%)
Cap       | Value is real but will dominate the model        | Distorting reports that need actual values
Transform | Whole column is skewed, not just one row         | Interpreting results back in original units
Keep      | Outlier is the signal (fraud, anomaly detection) | Forgetting to document why you kept it

🍎 Teacher's Note

Here's a rule of thumb that will save you from bad decisions: never handle outliers before you understand them. Before you remove, cap, or transform anything, print the rows, check the source, and ask a domain expert if needed. The one time you silently delete a real data point is the one time it was the most important row in the dataset. Document every decision you make — even "I chose to keep this outlier because it's legitimate high-value behaviour" belongs in your analysis notes.

Practice Questions

1. Which pandas method is used to cap outliers by replacing values beyond a boundary with the boundary value itself?



2. Which numpy function applies a log transformation that safely handles zero values by computing log(1 + x)?



3. If you're building a fraud detection model, what is the correct outlier handling strategy for suspiciously large transactions?



Quiz

1. A dataset of house prices has one legitimate $10M mansion entry. It's real, but it's skewing your regression model. What is the best handling strategy?


2. A transaction value column has skewness of 3.1. No single row is a data error — the column is just heavily right-skewed. Which approach is most appropriate?


3. You're handing a transaction dataset to a fraud modelling team. The flagged outlier rows are the fraud signals they need to train on. What should you do?


Up Next · Lesson 10

Distributions

Learn to recognise the shape of your data — normal, skewed, bimodal, and uniform — and understand why the distribution type changes everything about how you analyse it.