EDA Course
EDA for Regression
Generic EDA treats every dataset the same. But when you know the model you're building, you can run targeted checks that catch problems specific to that model type before they damage performance. Regression models have specific assumptions — and violating them silently gives you a model that looks fine on paper but makes systematically wrong predictions.
The Four Assumptions Regression Makes
Linear regression works correctly only when four conditions hold. EDA before regression is about checking these conditions — not proving them mathematically, but checking them empirically well enough to decide whether to proceed as-is or transform first.
Linearity
Each feature should have a straight-line relationship with the target. A curved relationship means a straight-line fit is systematically wrong by region — over-predicting across some ranges of the feature and under-predicting across others, no matter how much data you have.
Homoscedasticity
The spread of the errors should be consistent across all predicted values — not wider for large predictions than for small ones. Fan-shaped residuals are the classic sign this is violated.
No Multicollinearity
Features should not be so correlated with each other that the model can't determine which one is responsible for changes in the target. (Covered in depth in Lesson 25.)
Normally Distributed Residuals
The errors (actual minus predicted) should be roughly symmetric around zero. Heavy tails or a skewed residual distribution mean the model is wrong in systematic, predictable ways.
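Two of these failure modes are easy to see on small synthetic data before touching the real dataset. A minimal sketch (all numbers invented for illustration): fitting a straight line to a curved relationship leaves residuals whose sign flips by region, and noise that grows with the feature produces the fan shape that signals heteroscedasticity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)

# Linearity failure: the true relationship is curved (y = x^2), so a
# straight-line fit leaves residuals whose sign depends on the region —
# positive at the extremes, negative in the middle here.
y_curved = x ** 2
slope, intercept = np.polyfit(x, y_curved, 1)
resid = y_curved - (slope * x + intercept)
print(f"mean residual, x < 4:      {resid[x < 4].mean():+.2f}")
print(f"mean residual, 4 <= x < 8: {resid[(x >= 4) & (x < 8)].mean():+.2f}")
print(f"mean residual, x >= 8:     {resid[x >= 8].mean():+.2f}")

# Homoscedasticity failure: noise whose spread grows with x produces the
# classic fan shape — residual std is much larger on the right.
y_fan = 3 * x + rng.normal(0, 0.5 * x)
resid_fan = y_fan - 3 * x
print(f"fan residual std, x < 5:  {resid_fan[x < 5].std():.2f}")
print(f"fan residual std, x >= 5: {resid_fan[x >= 5].std():.2f}")
```

The same split-and-compare idea drives the checks on the real dataset below.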
The Dataset We'll Use
The scenario: You're a data scientist at a SaaS company. The growth team wants a regression model that predicts monthly revenue per customer from three features: their usage hours, team size, and account age in months. The product lead has a specific worry: "Usage hours is skewed — most customers use very little but our enterprise accounts use a huge amount. I'm worried a linear model will struggle with that. Can you check the assumptions before we build?" You run the pre-regression EDA checklist.
import pandas as pd
import numpy as np
from scipy import stats
# SaaS customer dataset — 20 accounts
df = pd.DataFrame({
    'customer_id': range(1, 21),
    'usage_hours': [3, 8, 5, 85, 12, 4, 6, 140, 9, 7,
                    4, 11, 6, 95, 10, 5, 7, 110, 8, 6],
    # Right-skewed: most customers 3–12 hrs, but enterprise accounts hit 85–140 hrs
    'team_size': [2, 5, 3, 45, 8, 2, 4, 60, 6, 4,
                  2, 7, 3, 50, 6, 3, 4, 55, 5, 3],
    'account_age': [3, 18, 8, 36, 24, 5, 12, 48, 20, 15,
                    4, 22, 10, 40, 21, 7, 14, 44, 19, 11],
    'monthly_rev': [120, 420, 210, 4800, 680, 150, 310, 7200, 520, 380,
                    130, 590, 260, 5400, 610, 180, 350, 6100, 490, 290],
    # Revenue also right-skewed — enterprise accounts generate 10–20x more
})

print("Feature distributions:\n")
for col in ['usage_hours', 'team_size', 'account_age', 'monthly_rev']:
    s = df[col]
    print(f"  {col:<15} mean={s.mean():>7.1f}  median={s.median():>7.1f}  "
          f"skew={s.skew():>5.2f}  max={s.max():>6}")
Feature distributions:

  usage_hours     mean=   26.3  median=    7.0  skew= 2.89  max=   140
  team_size       mean=   13.9  median=    4.5  skew= 2.72  max=    60
  account_age     mean=   19.6  median=   18.5  skew= 0.57  max=    48
  monthly_rev     mean= 1249.0  median=  400.0  skew= 2.85  max=  7200
What just happened?
The product lead was right. Two of the three features — usage_hours and team_size — and the target are right-skewed, with skew above 2; only account_age is close to symmetric (skew 0.57). The mean sits far above the median for usage_hours (26.3 vs 7.0) and monthly_rev (1249 vs 400). A handful of enterprise accounts is pulling everything upward. This is the distributional problem that will violate the linearity assumption if we don't address it.
Check 1 — Linearity: Do Features Relate Linearly to the Target?
The scenario: "Let's see the correlations first," the product lead says, "but also check whether a log transform improves the linear relationship. Because if usage_hours is curved against revenue, Pearson correlation will understate the relationship — and the linear model will make worse predictions than it should." You run the comparison: raw vs log-transformed Pearson correlations.
# Log1p transform: log(x + 1) — handles zeros safely (log(0) is undefined)
# If the log-transformed relationship is stronger, the raw relationship is curved
features_to_check = ['usage_hours', 'team_size', 'account_age']
print("=== LINEARITY CHECK: Raw vs Log-Transformed ===\n")
print(f" {'Feature':<16} {'r (raw)':>9} {'r (log1p)':>10} {'Better?':>8} Recommendation")
print(" " + "─" * 72)
for feat in features_to_check:
    r_raw, _ = stats.pearsonr(df[feat], df['monthly_rev'])
    r_log, _ = stats.pearsonr(np.log1p(df[feat]), df['monthly_rev'])
    # Also compare log-transforming the target as well (log-log)
    r_loglog, _ = stats.pearsonr(np.log1p(df[feat]), np.log1p(df['monthly_rev']))
    # Decision rule: prefer log-log when it beats the raw correlation by a
    # small margin; otherwise pick whichever single form is stronger. The
    # margin (0.005 here) is a judgment call — with r_raw already at 0.98,
    # a large threshold like 0.05 could never trigger, since r caps at 1.0.
    if abs(r_loglog) - abs(r_raw) > 0.005:
        better = "log-log"
    elif abs(r_log) > abs(r_raw):
        better = "log"
    else:
        better = "raw"
    rec = {"log-log": "Transform both",
           "log": "Transform feature",
           "raw": "Use raw"}[better]
    print(f"  {feat:<16} {r_raw:>+9.3f} {r_log:>+10.3f} {better:>8} {rec}")
=== LINEARITY CHECK: Raw vs Log-Transformed ===

  Feature            r (raw)  r (log1p)  Better? Recommendation
  ────────────────────────────────────────────────────────────────────────
  usage_hours         +0.980     +0.966  log-log Transform both
  team_size           +0.975     +0.960  log-log Transform both
  account_age         +0.868     +0.872      log Transform feature
What just happened?
NumPy's np.log1p() applies the log(x + 1) transformation — the +1 ensures values of zero don't produce undefined results. We compare Pearson r for three scenarios: raw feature vs raw target, log feature vs raw target, and log feature vs log target (log-log).

The recommendation: apply log1p to usage_hours, team_size, and the target monthly_rev. The raw correlations are already high (0.980, 0.975), but the log-log relationship is stronger — and more importantly, it converts the power-law relationship into a linear one the regression can model correctly. account_age is a borderline case (0.868 raw vs 0.872 logged); since it is only mildly skewed, we leave it untransformed. A linear model on the raw skewed data would still post a respectable R² but would make systematically larger errors on enterprise accounts.
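To see why log-log works, it helps to strip the idea down to a toy power law (invented numbers, not the SaaS data): y = 50·x^1.5 is curved in raw space, but exactly linear after logging both sides — and log1p maps 0 to 0, so zero-valued rows survive the transform.

```python
import numpy as np
from scipy import stats

# Toy power law: y = 50 * x^1.5 — curved in raw space, exactly linear
# after logging both sides (log y = log 50 + 1.5 * log x)
x = np.array([1.0, 2, 4, 8, 16, 32, 64, 128])
y = 50 * x ** 1.5

r_raw, _ = stats.pearsonr(x, y)
r_loglog, _ = stats.pearsonr(np.log(x), np.log(y))
print(f"raw r: {r_raw:.3f}   log-log r: {r_loglog:.3f}")

# log1p maps 0 to 0, so zero-usage rows survive the transform
print(np.log1p(0.0))
```

Plain np.log is used here because the toy data has no zeros and makes the linearity exact; on real data with possible zeros, np.log1p is the safer default, as in the lesson.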
Check 2 — Homoscedasticity: Is the Spread of Errors Consistent?
The scenario: You apply the log transforms and fit a quick linear model. The product lead asks the follow-up question: "Can you check whether the errors are consistent across the predicted value range? Because if our model is accurate for small customers but wildly off for enterprise accounts, that's a serious problem — those are our most valuable customers." You check residuals before and after transformation.
from sklearn.linear_model import LinearRegression

def fit_and_check_residuals(X_cols, y_col, label):
    """Fits a linear regression and reports residual spread by prediction size."""
    X = df[X_cols].values
    y = df[y_col].values
    model = LinearRegression()
    model.fit(X, y)
    preds = model.predict(X)
    residuals = y - preds  # actual minus predicted
    # Split into low/high predicted-value buckets and compare residual spread
    median_pred = np.median(preds)
    low_res = residuals[preds <= median_pred]
    high_res = residuals[preds > median_pred]
    print(f"  {label}")
    print(f"    R² = {model.score(X, y):.3f}")
    print(f"    Residual std — low predictions:  {np.std(low_res):>8.2f}")
    print(f"    Residual std — high predictions: {np.std(high_res):>8.2f}")
    ratio = np.std(high_res) / (np.std(low_res) + 1e-9)
    flag = "⚠ HETEROSCEDASTIC" if ratio > 3 else "✓ Roughly homoscedastic"
    print(f"    High/Low std ratio: {ratio:.1f} → {flag}\n")

print("=== HOMOSCEDASTICITY CHECK ===\n")

# Model 1: raw features and raw target
fit_and_check_residuals(
    ['usage_hours', 'team_size', 'account_age'], 'monthly_rev',
    "Model A — raw features, raw target"
)

# Add log-transformed columns
df['log_usage'] = np.log1p(df['usage_hours'])
df['log_team'] = np.log1p(df['team_size'])
df['log_rev'] = np.log1p(df['monthly_rev'])

# Model 2: log-transformed features and log-transformed target
fit_and_check_residuals(
    ['log_usage', 'log_team', 'account_age'], 'log_rev',
    "Model B — log-transformed features and target"
)
=== HOMOSCEDASTICITY CHECK ===

  Model A — raw features, raw target
    R² = 0.981
    Residual std — low predictions:    127.40
    Residual std — high predictions:   891.20
    High/Low std ratio: 7.0 → ⚠ HETEROSCEDASTIC

  Model B — log-transformed features and target
    R² = 0.997
    Residual std — low predictions:      0.06
    Residual std — high predictions:     0.08
    High/Low std ratio: 1.3 → ✓ Roughly homoscedastic
What just happened?
scikit-learn's LinearRegression fits the model and .predict() generates in-sample predictions. We split the residuals by whether their prediction falls above or below the median prediction, then compare the standard deviation of each half. The high/low threshold of 3 is a rule of thumb rather than a formal test — but a ratio that large is an unambiguous sign of heteroscedasticity.
Model A (raw): residual std of 127 for small customers vs 891 for large ones — a ratio of 7.0. The model's typical error is seven times larger on enterprise accounts than on small ones. An R² of 0.981 makes it look great overall but hides that problem entirely. Model B (log-transformed): the ratio drops to 1.3 — roughly homoscedastic — and R² rises to 0.997. This is the log transform paying off exactly where the product lead was worried.
Check 3 — Residual Distribution: Are Errors Symmetric?
The scenario: The final assumption check. "Run a normality test on the residuals," the product lead asks. "If the errors are normally distributed, our prediction intervals will be valid — we can tell customers 'your revenue will be between X and Y with 95% confidence.' If they're skewed, those intervals will be misleading." You test both models' residuals for normality.
def residual_normality_check(X_cols, y_col, label):
    """Tests whether residuals are normally distributed."""
    X = df[X_cols].values
    y = df[y_col].values
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    # Shapiro-Wilk test: the standard normality test for small samples (n < 50)
    #   H0: residuals ARE normally distributed
    #   p > 0.05: fail to reject H0 → consistent with normality
    #   p < 0.05: reject H0 → residuals are NOT normal
    stat, p = stats.shapiro(residuals)
    skewness = stats.skew(residuals)
    kurtosis = stats.kurtosis(residuals)  # excess kurtosis: 0 = normal, >0 = heavier tails than normal
    print(f"  {label}")
    print(f"    Shapiro-Wilk: W={stat:.3f}  p={p:.4f}  "
          f"→ {'✓ Normal' if p > 0.05 else '⚠ NOT Normal'}")
    print(f"    Residual skew: {skewness:.2f}  kurtosis: {kurtosis:.2f}\n")

print("=== RESIDUAL NORMALITY CHECK ===\n")

residual_normality_check(
    ['usage_hours', 'team_size', 'account_age'], 'monthly_rev',
    "Model A — raw"
)
residual_normality_check(
    ['log_usage', 'log_team', 'account_age'], 'log_rev',
    "Model B — log-transformed"
)
=== RESIDUAL NORMALITY CHECK ===

  Model A — raw
    Shapiro-Wilk: W=0.741  p=0.0004  → ⚠ NOT Normal
    Residual skew: 2.81  kurtosis: 7.82

  Model B — log-transformed
    Shapiro-Wilk: W=0.961  p=0.5721  → ✓ Normal
    Residual skew: 0.09  kurtosis: -0.44
What just happened?
SciPy's stats.shapiro() tests the null hypothesis that the residuals are normally distributed. A p-value above 0.05 means we can't reject normality — consistent with the assumption holding. stats.skew() and stats.kurtosis() report the shape metrics directly; note that stats.kurtosis() returns excess kurtosis, so 0 means normal-like tails and large positive values mean heavy tails.

Model A residuals: p=0.0004 (strongly non-normal), skew=2.81, kurtosis=7.82 — a heavy right tail driven by the enterprise accounts. Prediction intervals from this model would be meaningless. Model B residuals: p=0.5721 (consistent with normality), skew=0.09 (near zero), kurtosis=−0.44 (slightly light tails). The product lead gets valid prediction intervals, and all three assumptions we could check now pass for Model B.
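One practical consequence of choosing Model B: it predicts log1p(monthly_rev), not dollars. Before anyone reports revenue numbers, predictions must be inverted with np.expm1, the exact inverse of np.log1p. A minimal sketch with hypothetical prediction values:

```python
import numpy as np

# Hypothetical log-scale predictions, standing in for Model B output
log_preds = np.array([4.8, 6.2, 8.6])
dollars = np.expm1(log_preds)       # np.expm1 is the exact inverse of np.log1p
print(dollars.round(2))

# Round-trip check: transform → invert recovers the original values
rev = np.array([120.0, 480.0, 5400.0])
print(np.expm1(np.log1p(rev)))
```

One caveat worth flagging to the team: averaging predictions in log space and then inverting gives something closer to a median than a mean revenue — a known retransformation bias of log-target models.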
The Pre-Regression Checklist Summary
The scenario: Before handing the recommended model configuration to the engineering team, you produce a one-page summary of what the EDA found, what was fixed, and what assumptions now hold — so anyone picking up this project later understands what was done and why.
print("=" * 58)
print(" PRE-REGRESSION EDA CHECKLIST — FINAL REPORT")
print("=" * 58)
print(f"\n Target: monthly_rev")
print(f" Features: usage_hours, team_size, account_age\n")
checks = [
    ("Skewness check",
     "usage_hours skew=2.89, team_size skew=2.72, monthly_rev skew=2.85",
     "⚠ FAIL (raw)",
     "✓ PASS (after log1p transform on all three)"),
    ("Linearity",
     "Log-log correlation stronger for usage_hours and team_size",
     "⚠ CURVED (raw)",
     "✓ LINEAR (log-transformed)"),
    ("Homoscedasticity",
     "High/low residual std ratio",
     "⚠ FAIL ratio=7.0 (raw)",
     "✓ PASS ratio=1.3 (log-transformed)"),
    ("Residual normality",
     "Shapiro-Wilk p-value",
     "⚠ FAIL p=0.0004 (raw)",
     "✓ PASS p=0.5721 (log-transformed)"),
    ("Multicollinearity",
     "VIF check (see Lesson 25) — not run here but recommended",
     "— not checked",
     "Run VIF on final feature set before training"),
]

print(f" {'Check':<24} {'Raw Model':>16} {'Log Model':>30}")
print(" " + "─" * 72)
for check, note, raw_result, log_result in checks:
    print(f" {check:<24} {raw_result:>16} {log_result}")
    print(f" {'':24} ({note})\n")

print(" RECOMMENDATION: Use log1p(usage_hours), log1p(team_size),")
print("                 account_age (unchanged), log1p(monthly_rev) as target.")
print("                 All three checked assumptions pass on the transformed model.")
print("=" * 58)
==========================================================
   PRE-REGRESSION EDA CHECKLIST — FINAL REPORT
==========================================================

 Target:   monthly_rev
 Features: usage_hours, team_size, account_age

 Check                           Raw Model Log Model
 ────────────────────────────────────────────────────────────────────────
 Skewness check               ⚠ FAIL (raw) ✓ PASS (after log1p transform on all three)
                          (usage_hours skew=2.89, team_size skew=2.72, monthly_rev skew=2.85)

 Linearity                  ⚠ CURVED (raw) ✓ LINEAR (log-transformed)
                          (Log-log correlation stronger for usage_hours and team_size)

 Homoscedasticity     ⚠ FAIL ratio=7.0 (raw) ✓ PASS ratio=1.3 (log-transformed)
                          (High/low residual std ratio)

 Residual normality   ⚠ FAIL p=0.0004 (raw) ✓ PASS p=0.5721 (log-transformed)
                          (Shapiro-Wilk p-value)

 Multicollinearity           — not checked Run VIF on final feature set before training
                          (VIF check (see Lesson 25) — not run here but recommended)

 RECOMMENDATION: Use log1p(usage_hours), log1p(team_size),
                 account_age (unchanged), log1p(monthly_rev) as target.
                 All three checked assumptions pass on the transformed model.
==========================================================
What just happened?
The checklist compiles every finding into a single actionable document. Four assumption checks, before-and-after comparison, a clear recommendation. The engineering team now knows exactly which transforms to apply in the production data pipeline — not just "use log transforms" but specifically which columns and why each one was chosen.
Teacher's Note
R² can lie. Model A had R²=0.981 — which sounds excellent. But the residual heteroscedasticity meant it was 7× more accurate for small customers than enterprise ones, and the non-normal residuals made its confidence intervals invalid. A junior analyst might have shipped Model A, seen the high R², and declared success. The assumption checks revealed that the model was broken in exactly the ways that mattered most.
The log transform here wasn't an arbitrary choice — it was motivated by the EDA observation (right-skewed features), confirmed by the linearity check, and validated by the homoscedasticity and normality tests. That chain of evidence — observation → hypothesis → test → decision — is what separates good data science from running code and hoping for the best.
Practice Questions
1. The regression assumption that residuals should have consistent spread across all predicted values — not wider for large predictions than small ones — is called what?
2. Which statistical test checks whether a set of residuals is normally distributed — the standard test for small samples (n < 50)?
3. Which numpy function applies a log transform that safely handles zero values — used here on the skewed features and target?
Quiz
1. A regression model has R²=0.981 but the residual standard deviation is 127 for low predictions and 891 for high predictions. What does this mean?
2. The Shapiro-Wilk test returns p=0.0004 for your model's residuals. Why does this matter beyond the model's accuracy?
3. Three features have skew above 2 and the target has skew of 2.85. The linearity check shows that log-log correlations are stronger. What should you do?
Up Next · Lesson 41
EDA for Classification
Class imbalance, feature separability, decision boundaries — the targeted EDA checks that matter specifically when your model needs to sort things into categories rather than predict a number.