Data Cleaning · Lesson 6

Outliers

Spot extreme values that could skew your analysis and learn four proven methods to handle them without losing valuable insights.

Essential Outlier Skills

Statistical Detection · Box Plots · Z-Score Method · IQR Technique · Business Context · Treatment Strategies

The Outlier Pipeline

Every data analyst faces this. You load customer data and see one purchase of ₹15 lakhs while others range from ₹500 to ₹5,000. Is it a data error? A genuine luxury purchase? The decision affects your entire analysis.

1. Detect potential outliers using statistical methods
2. Visualize with box plots and scatter plots
3. Apply business context to validate findings
4. Choose treatment: keep, cap, transform, or remove

Detection Methods

Four approaches dominate outlier detection. Each catches different types of extreme values. The Z-score method works for normal distributions, while IQR handles skewed data better. Box plots reveal the story visually.

Z-Score Method

Values beyond ±2.5 or ±3 standard deviations. Best for normal distributions.

IQR Method

1.5 × IQR beyond Q1/Q3. Robust to skewed data and non-normal shapes.

Percentile Method

Bottom 1% and top 1% flagged. Simple but may miss legitimate extremes.

Domain Rules

Business logic: age > 100, negative prices. Context beats statistics.

The scenario: Swiggy's data team notices some delivery orders with ₹50,000+ values. Before building ML models for demand forecasting, they need to understand these extremes.

# Load and examine revenue distribution
import pandas as pd
import numpy as np

df = pd.read_csv('dataplexa_ecommerce.csv')
print("Revenue statistics:")
print(df['revenue'].describe())
print(f"\nRevenue range: ₹{df['revenue'].min():,.0f} to ₹{df['revenue'].max():,.0f}")

What just happened?

The std (₹38,142) exceeds the mean (₹24,567), a clear sign of outliers stretching the right tail. The max value of ₹198,750 is 8x the 75th percentile. Try this: Calculate how many standard deviations the max is from the mean.
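Here's one way to answer that prompt, as a quick sketch reusing the df loaded above:

# How many standard deviations is the max from the mean?
max_z = (df['revenue'].max() - df['revenue'].mean()) / df['revenue'].std()
print(f"Max revenue sits {max_z:.2f} standard deviations above the mean")

With the statistics above, that works out to about 4.57, matching the top Z-score you'll see below.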

Z-Score Detection

Z-scores tell you how many standard deviations each point sits from the mean. Values beyond ±2.5 deserve investigation. But here's the catch — Z-scores assume normal distribution. If your data is skewed (like revenue usually is), you'll get false positives.

# Calculate Z-scores for revenue
from scipy import stats

df['revenue_zscore'] = np.abs(stats.zscore(df['revenue']))

# Find outliers using Z-score > 2.5
outliers_zscore = df[df['revenue_zscore'] > 2.5]
print(f"Z-score outliers found: {len(outliers_zscore)} ({len(outliers_zscore)/len(df)*100:.1f}%)")
print(f"\nTop 5 Z-score outliers:")
# Sort first so .head() really returns the 5 highest Z-scores
top5 = outliers_zscore.sort_values('revenue_zscore', ascending=False)
print(top5[['revenue', 'revenue_zscore', 'product_category']].head())

What just happened?

We flagged 2.1% of records as outliers using the Z-score method. Notice how Electronics dominates the outliers — high-end laptops and phones can legitimately cost ₹1.5+ lakhs. The highest Z-score of 4.57 means that record is 4.57 standard deviations from the mean. Try this: Check if these "outliers" cluster in specific cities.
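To follow that city prompt, a value_counts() is enough. A quick sketch (the city column appears again later in this lesson):

# Do Z-score outliers cluster in specific cities?
print(outliers_zscore['city'].value_counts().head())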

IQR Method (More Robust)

The Interquartile Range method works regardless of data distribution. It flags values beyond Q1 - 1.5×IQR (lower fence) or Q3 + 1.5×IQR (upper fence). This catches the same extreme values a box plot would show as dots.

# IQR method for outlier detection
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1

# Calculate fences
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

print(f"Q1: ₹{Q1:,.0f}")
print(f"Q3: ₹{Q3:,.0f}")
print(f"IQR: ₹{IQR:,.0f}")
print(f"Lower fence: ₹{lower_fence:,.0f}")
print(f"Upper fence: ₹{upper_fence:,.0f}")

# Find IQR outliers
outliers_iqr = df[(df['revenue'] < lower_fence) | (df['revenue'] > upper_fence)]
print(f"\nIQR outliers found: {len(outliers_iqr)} ({len(outliers_iqr)/len(df)*100:.1f}%)")

What just happened?

The IQR method flagged 12.3% as outliers — much more than Z-score's 2.1%. The upper_fence of ₹66,790 means anything above this gets flagged. Notice the lower_fence is negative, so no valid revenue values will be flagged as low outliers. Try this: Compare which method finds more legitimate outliers in your domain.
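A direct way to compare the two methods is to intersect the flagged rows. A minimal sketch building on the variables above:

# How much do Z-score and IQR agree?
zscore_ids = set(outliers_zscore.index)
iqr_ids = set(outliers_iqr.index)

print(f"Flagged by both: {len(zscore_ids & iqr_ids)}")
print(f"IQR only: {len(iqr_ids - zscore_ids)}")
print(f"Z-score only: {len(zscore_ids - iqr_ids)}")

With this data, every Z-score outlier is also an IQR outlier: the Z-score cutoff (mean + 2.5 × std ≈ ₹120K) sits far above the upper fence of ₹66,790.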

Visualizing Outliers

Numbers tell part of the story. Charts reveal the pattern. Box plots show outliers as individual dots beyond the whiskers. Scatter plots reveal relationships between variables where outliers might hide.
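Here's a sketch of how you might draw that box plot, assuming matplotlib and seaborn are available:

# Box plot: dots beyond the whiskers are the flagged outliers
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='product_category', y='revenue')
plt.title('Revenue by Product Category')
plt.ylabel('Revenue (₹)')
plt.tight_layout()
plt.show()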

Electronics shows the highest outlier concentration, with several purchases above ₹1.5L — likely premium laptops and phones

The box plot reveals Electronics has the most extreme outliers. Those dots above ₹150K represent genuine luxury purchases — not data errors. Context matters more than statistics. A ₹200K iPhone purchase is normal; a ₹200K food order needs investigation.

Business decisions flow from this insight. Marketing teams can target high-value Electronics customers differently. Supply chain teams can plan for premium inventory. Don't just remove outliers — understand them.

High-revenue outliers span all age groups — age isn't a predictor of luxury purchases in e-commerce

The scatter plot shows outliers aren't age-dependent. That ₹198K purchase came from a 24-year-old customer. Your assumptions about "typical" customer behavior might be wrong. Data reveals truth.
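To reproduce that scatter plot, a sketch (the 'age' column name is an assumption about this dataset):

# Scatter: high-revenue outliers appear across all ages
# Note: 'age' is an assumed column name
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.scatter(df['age'], df['revenue'], alpha=0.3)
plt.xlabel('Customer Age')
plt.ylabel('Revenue (₹)')
plt.title('Revenue vs Customer Age')
plt.show()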

Treatment Strategies

Once you find outliers, you have four choices. Each has consequences for your analysis. The wrong choice can destroy model performance or hide crucial business insights.

❌ Before Treatment
Mean revenue: ₹24,567 · Std deviation: ₹38,142 · Max value: ₹198,750 · Outliers: 1,847 (12.3%)

✅ After Capping at 95th Percentile
Mean revenue: ₹22,134 · Std deviation: ₹28,967 · Max value: ₹78,450 · Outliers: 421 (2.8%)

The scenario: Zomato's pricing team needs to set delivery fees. Extreme orders skew their average calculation, leading to fees that alienate 95% of customers.

# Method 1: Capping (Winsorization)
# Replace extreme values with percentile limits
df_capped = df.copy()
lower_cap = df['revenue'].quantile(0.05)  # 5th percentile
upper_cap = df['revenue'].quantile(0.95)  # 95th percentile

df_capped['revenue_capped'] = df['revenue'].clip(lower_cap, upper_cap)

print(f"Original range: ₹{df['revenue'].min():,.0f} to ₹{df['revenue'].max():,.0f}")
print(f"Capped range: ₹{df_capped['revenue_capped'].min():,.0f} to ₹{df_capped['revenue_capped'].max():,.0f}")
print(f"Values capped: {(df['revenue'] != df_capped['revenue_capped']).sum()}")

What just happened?

Capping replaced extreme values with the 5th and 95th percentile limits. The .clip() method does this automatically. We capped 1,498 values (10% of data) while preserving the core distribution. Try this: Compare mean and median before/after capping.
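Acting on that prompt takes two lines (a sketch reusing df and df_capped):

# The median barely moves; capping only touches the tails
print(f"Mean: ₹{df['revenue'].mean():,.0f} -> ₹{df_capped['revenue_capped'].mean():,.0f}")
print(f"Median: ₹{df['revenue'].median():,.0f} -> ₹{df_capped['revenue_capped'].median():,.0f}")

The median stays put because the 50th percentile lies well inside the caps; only the mean gets pulled by the tails.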

📊 Data Insight

Capping reduced the standard deviation by 24% while keeping 90% of the original values intact, which helps ML models that are sensitive to extreme variance.

Log Transformation for Skewed Data

Revenue data is naturally skewed: most purchases are small, a few are large. Log transformation pulls extreme values closer to the center. It works well most of the time; the usual trap is forgetting to transform back to the original scale when interpreting results.

# Method 2: Log transformation to reduce skewness
df['revenue_log'] = np.log1p(df['revenue'])  # log1p handles zeros

# Check normality improvement
from scipy.stats import skew
original_skew = skew(df['revenue'])
log_skew = skew(df['revenue_log'])

print(f"Original revenue skewness: {original_skew:.2f}")
print(f"Log-transformed skewness: {log_skew:.2f}")
print(f"Skewness reduction: {((original_skew - log_skew) / original_skew * 100):.1f}%")

# Show transformation effect
print(f"\nOriginal ₹198,750 becomes: {np.log1p(198750):.2f}")
print(f"Original ₹512 becomes: {np.log1p(512):.2f}")
print(f"Ratio compressed from {198750/512:.0f}x to {np.log1p(198750)/np.log1p(512):.1f}x")

What just happened?

Log transformation reduced skewness by 84.9%, making the data much more normal. The extreme ratio of 388x compressed to just 2.0x in log space. We used log1p() instead of log() to handle any zero values safely. Try this: Use expm1() to transform back to original scale.
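And the round trip that prompt suggests, as a quick sketch:

# expm1() exactly inverts log1p()
log_value = np.log1p(198750)
print(f"Back-transformed: ₹{np.expm1(log_value):,.0f}")  # prints ₹198,750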

Domain-specific rules combined with statistical methods deliver the best model performance — context beats pure math

The chart reveals a crucial insight: domain knowledge trumps statistical methods. Pure removal hurts less than keeping all outliers, but combining business rules with statistical techniques delivers the best results.

Why does domain expertise matter? Because ₹200K iPhone purchases are normal, ₹200K grocery orders are not. Statistical methods can't distinguish between legitimate luxury purchases and data entry errors. You can.

Common Mistake: Automatic Outlier Removal

Never remove outliers without understanding them first. That "outlier" might be your most valuable customer segment. Always investigate before acting.
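If investigation does confirm data errors, removal itself is one line. A sketch reusing the IQR fences from earlier:

# Drop rows outside the IQR fences, only after validating they're errors
df_clean = df[(df['revenue'] >= lower_fence) & (df['revenue'] <= upper_fence)].copy()
print(f"Rows removed: {len(df) - len(df_clean)} ({(len(df) - len(df_clean)) / len(df) * 100:.1f}%)")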

Business Context Matters

Statistical outliers aren't always business outliers. A customer buying ₹150K worth of electronics during a Diwali sale isn't unusual; it's profitable. The same customer buying ₹150K worth of groceries needs investigation.

Real analysts validate outliers through a business lens. Check seasonal patterns, promotional periods, product categories, and customer history. That "outlier" might reveal your next growth opportunity.

# Business context validation of outliers
high_value_orders = df[df['revenue'] > 100000]

print("High-value order analysis:")
print(f"Total high-value orders: {len(high_value_orders)}")
print("\nBy product category:")
print(high_value_orders['product_category'].value_counts())
print("\nBy city:")
print(high_value_orders['city'].value_counts())
print(f"\nAverage rating for high-value orders: {high_value_orders['rating'].mean():.2f}")
print(f"Return rate for high-value orders: {high_value_orders['returned'].mean()*100:.1f}%")

What just happened?

Business validation reveals these "outliers" are legitimate: 77% are Electronics (luxury phones/laptops), concentrated in metro cities, with 4.2 rating and low 3.7% return rate. No Food/Books orders exceed ₹100K — exactly what you'd expect. Try this: Check if these customers have repeat purchase patterns.

📊 Data Insight

High-value "outliers" show strong business health: 4.2/5 rating, 3.7% return rate, and 77% from Electronics — these are your premium customers, not data errors.

Pro tip: Create separate models for different value tiers. Your ₹500 grocery customer behaves differently than your ₹150K electronics customer. One-size-fits-all models miss nuance.
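One way to build those tiers is pd.cut. The bin edges below are illustrative, not taken from this dataset:

# Segment customers into value tiers before modeling each separately
tier_bins = [0, 5000, 50000, float('inf')]
tier_labels = ['budget', 'mid', 'premium']
df['value_tier'] = pd.cut(df['revenue'], bins=tier_bins, labels=tier_labels)
print(df['value_tier'].value_counts())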

Advanced Techniques

Beyond basic methods lie powerful techniques. Isolation Forest detects outliers in high-dimensional data. DBSCAN clustering finds groups and flags isolated points. Local Outlier Factor considers neighborhood density.

# Advanced: Isolation Forest for multivariate outliers  
from sklearn.ensemble import IsolationForest

# Use multiple features for outlier detection
features = ['revenue', 'quantity', 'rating']
X = df[features]

# Isolation Forest (contamination = expected outlier proportion)
iso_forest = IsolationForest(contamination=0.05, random_state=42)
df['outlier_iso'] = iso_forest.fit_predict(X)

# -1 means outlier, 1 means normal
iso_outliers = df[df['outlier_iso'] == -1]
print(f"Isolation Forest found {len(iso_outliers)} outliers ({len(iso_outliers)/len(df)*100:.1f}%)")
print(f"Average revenue of outliers: ₹{iso_outliers['revenue'].mean():,.0f}")
print(f"Average revenue of normal: ₹{df[df['outlier_iso']==1]['revenue'].mean():,.0f}")

What just happened?

Isolation Forest considers multiple variables simultaneously, finding 5% outliers with average revenue 3x higher than normal. The contamination=0.05 parameter tells it to expect 5% outliers. Unlike univariate methods, this catches complex patterns like "high revenue + low rating" combinations. Try this: Experiment with different contamination rates.
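That contamination experiment is a three-line loop (a sketch reusing X from above):

# Sweep contamination to see how the flag count scales
for c in [0.01, 0.05, 0.10]:
    labels = IsolationForest(contamination=c, random_state=42).fit_predict(X)
    print(f"contamination={c}: {(labels == -1).sum()} outliers")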

Honestly, multivariate methods are underrated. While everyone focuses on single-variable outliers, the real insights hide in variable combinations. A low-rated expensive purchase tells a story univariate methods miss.
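Local Outlier Factor, mentioned above, follows the same fit-predict pattern. A sketch with an assumed neighbour count:

# LOF flags points whose local density is much lower than their neighbours'
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
df['outlier_lof'] = lof.fit_predict(df[['revenue', 'quantity', 'rating']])
print(f"LOF outliers: {(df['outlier_lof'] == -1).sum()}")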

Wrapping Up

Master outlier detection and you control data quality. These extreme values carry the most business insight — handle them right and your models perform better, your insights get sharper, and your recommendations become actionable. Remember: context beats statistics every time.

Quiz

1. You're analyzing Flipkart's revenue data and notice IQR method flags 12.3% outliers while Z-score method flags only 2.1%. Why this difference?


2. Your outlier detection flags a ₹150,000 Electronics purchase during Diwali festival. What should you do first?


3. How would you apply a log transformation to revenue data that might contain zero values?


Up Next

Transformations

Scale, normalize, and encode your data to unlock machine learning model performance and reveal hidden patterns.