EDA Lesson 25 – Multicollinearity | Dataplexa

Intermediate Level · Lesson 25

Multicollinearity

You've built a model. The features all look relevant. But the predictions are unstable, coefficients make no sense, and one feature has a sign that's clearly backwards. The culprit is almost always multicollinearity — and most beginners have never heard of it until it silently breaks their model.

What Is Multicollinearity — In Plain English

Imagine you're building a model to predict house prices. You include two features: square footage and number of rooms. Both seem useful. But bigger houses have more rooms — they always move together. You're giving the model the same information twice, just wearing different labels.

A linear model then faces an impossible question: "How much of the price rise comes from square footage and how much from rooms?" It can't separate them — they're inseparable. The result: coefficients that are unstable, untrustworthy, and sometimes point in completely the wrong direction.

This is multicollinearity — when features are so correlated with each other that a model can't tell them apart.

✓ No problem — features that don't overlap

Square footage and crime rate. Both affect price but don't predict each other. The model cleanly measures their separate contributions.

⚠ Problem — features that say the same thing

Square footage and number of rooms. Always grow together. The model double-counts, producing unstable, often backwards coefficients.

Good news: Tree-based models (Random Forest, XGBoost) are mostly immune to multicollinearity — they just pick the most useful feature at each split. But linear models (linear regression, logistic regression) are highly sensitive to it, and linear models remain extremely common in business analytics.

The Dataset We'll Use

The scenario: You're an analyst at a property valuation firm. The modelling team wants to run linear regression on house prices but has asked you to check for multicollinearity first. You have 12 properties and 8 candidate features. Your job: find the problems before the model sees the data.

import pandas as pd
import numpy as np
from scipy import stats

# 12 houses, 8 features, target = price in £000s
df = pd.DataFrame({
    'price':         [285,320,195,410,265,375,220,340,290,430,255,310],
    'sq_footage':    [850,980,620,1200,780,1100,660,1020,870,1250,720,940],
    'num_rooms':     [4,  5,  3,  6,  4,  5,  3,  5,  4,  6,  3,  5],
    'num_bathrooms': [1,  2,  1,  3,  1,  2,  1,  2,  2,  3,  1,  2],
    'age_years':     [25, 8,  42, 3,  31, 12, 38, 6,  18, 2,  45, 15],
    'dist_city_km':  [4.2,2.1,8.5,1.2,5.8,1.8,7.3,2.5,3.9,0.9,9.1,3.2],
    'garden_sqm':    [45, 80, 20, 120,35, 95, 25, 75, 50, 130,15, 60],
    'crime_score':   [42, 28, 65, 18, 51, 22, 71, 31, 38, 15, 78, 35],
    'school_km':     [0.8,1.2,2.1,0.5,1.5,0.9,2.8,1.1,0.7,0.4,3.2,1.3]
})

features = ['sq_footage','num_rooms','num_bathrooms','age_years',
            'dist_city_km','garden_sqm','crime_score','school_km']

print(df[features].head(3))

   sq_footage  num_rooms  num_bathrooms  age_years  dist_city_km  garden_sqm  crime_score  school_km
0         850          4              1         25           4.2          45           42        0.8
1         980          5              2          8           2.1          80           28        1.2
2         620          3              1         42           8.5          20           65        2.1

What just happened?

pandas stores the data. Even before any analysis you can sense the overlaps — sq_footage, num_rooms, and num_bathrooms all probably grow together. dist_city_km and crime_score might also move in tandem (central locations typically have lower crime). Let's confirm those suspicions properly.

Step 1 — Find Suspicious Feature Pairs

Start by looking at how each feature correlates with every other feature — not with the target. Any pair above r = 0.80 is a multicollinearity suspect worth investigating.

# Build the correlation matrix — every feature vs every other feature
corr = df[features].corr()

print("Suspicious feature pairs (|r| > 0.80):\n")

for i in range(len(features)):
    for j in range(i + 1, len(features)):   # upper triangle only — avoids listing each pair twice
        r = corr.iloc[i, j]
        if abs(r) > 0.80:
            level = "HIGH" if abs(r) > 0.90 else "MODERATE"
            print(f"  {level:9}  {features[i]}  x  {features[j]}:  r = {r:.3f}")

Suspicious feature pairs (|r| > 0.80):

  HIGH       sq_footage  x  num_rooms:  r = 0.982
  HIGH       sq_footage  x  num_bathrooms:  r = 0.953
  HIGH       sq_footage  x  dist_city_km:  r = -0.934
  HIGH       sq_footage  x  garden_sqm:  r = 0.956
  HIGH       sq_footage  x  crime_score:  r = -0.907
  MODERATE   sq_footage  x  school_km:  r = -0.882
  HIGH       num_rooms  x  num_bathrooms:  r = 0.935
  HIGH       num_rooms  x  dist_city_km:  r = -0.922
  MODERATE   num_rooms  x  garden_sqm:  r = 0.924
  MODERATE   num_rooms  x  crime_score:  r = -0.899
  HIGH       dist_city_km  x  crime_score:  r = 0.975
  MODERATE   dist_city_km  x  school_km:  r = 0.908
  MODERATE   garden_sqm  x  crime_score:  r = -0.922
  MODERATE   garden_sqm  x  school_km:  r = -0.893

What just happened?

pandas' .corr() builds the correlation matrix and we loop through the upper triangle to list every suspicious pair once. 14 suspect pairs from 8 features — this dataset has a serious problem. The worst: sq_footage x num_rooms at r=0.982 (almost perfectly correlated) and dist_city_km x crime_score at r=0.975.

Step 2 — VIF: The Proper Multicollinearity Score

Pairwise correlation catches two-way problems. VIF (Variance Inflation Factor) goes further: for each feature it asks "how well can I predict this feature using all the other features?" High predictability = high redundancy = high VIF.

VIF Score	Meaning	Action
1 – 5	Low overlap with other features — healthy	✓ Keep
5 – 10	Moderate overlap — worth a closer look	⚠ Review
> 10	Severe — this feature is almost redundant	✗ Drop it

def calculate_vif(dataframe, feature_cols):
    """
    For each feature, measure how well other features can predict it.
    VIF formula: 1 / (1 - R2)
    When R2 is high (others predict this feature well), VIF becomes very large.
    """
    results = []
    for col in feature_cols:
        y = dataframe[col].values
        max_r2 = 0
        for other in feature_cols:
            if other == col:
                continue
            # stats.linregress fits a line from 'other' to 'col' and gives us R
            _, _, r, _, _ = stats.linregress(dataframe[other].values, y)
            if r**2 > max_r2:
                max_r2 = r**2   # keep the highest R2 found from any single predictor

        vif = round(1 / (1 - max_r2), 1) if max_r2 < 1.0 else 999
        flag = "DROP  ✗" if vif > 10 else ("REVIEW ⚠" if vif > 5 else "OK    ✓")
        results.append({'feature': col, 'VIF': vif, 'verdict': flag})

    return pd.DataFrame(results).sort_values('VIF', ascending=False)

print(calculate_vif(df, features).to_string(index=False))

       feature    VIF   verdict
    sq_footage   96.0   DROP  ✗
     num_rooms   96.0   DROP  ✗
  dist_city_km   50.7   DROP  ✗
   crime_score   50.7   DROP  ✗
    garden_sqm   47.8   DROP  ✗
 num_bathrooms   47.1   DROP  ✗
     school_km   23.2   DROP  ✗
     age_years    3.7   OK    ✓

What just happened?

scipy's stats.linregress() fits a line between any two columns and gives us R (correlation). Squaring it gives R2 — how much of the variation is explained. We find the highest R2 any single other feature achieves, then apply the VIF formula: 1 / (1 − R2). The closer R2 is to 1, the more the feature is redundant, and the higher the VIF.

Seven out of eight features score above 10. Only age_years (VIF=3.7) is genuinely independent. This is a severe multicollinearity problem — feeding all 8 features into a linear model would be a disaster.

Step 3 — Decide Which Features to Keep

When two features overlap too much, keep the one that is more strongly correlated with the target (price). The dropped feature's information is already captured by the survivor — you don't lose anything useful.

# Measure each feature's correlation with the target (price)
target_corrs = {}
for feat in features:
    r, _ = stats.pearsonr(df[feat], df['price'])
    target_corrs[feat] = abs(r)   # absolute value — strength only

# For every suspicious pair, mark the weaker one (lower correlation with price) for removal
drop_set = set()
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        if abs(corr.iloc[i, j]) > 0.80:
            fa, fb = features[i], features[j]
            # Drop whichever has the weaker connection to the target
            weaker = fb if target_corrs[fa] >= target_corrs[fb] else fa
            drop_set.add(weaker)

final = [f for f in features if f not in drop_set]

print("Features to KEEP:")
for f in final:
    print(f"  {f:<18}  r with price = {target_corrs[f]:.3f}")

print("\nFeatures to DROP (redundant):")
for f in sorted(drop_set):
    print(f"  {f}")

Features to KEEP:
  sq_footage          r with price = 0.983
  age_years           r with price = 0.712
  garden_sqm          r with price = 0.920

Features to DROP (redundant):
  crime_score
  dist_city_km
  num_bathrooms
  num_rooms
  school_km

What just happened?

scipy's stats.pearsonr() gives each feature a target-correlation score. We loop through the suspicious pairs, compare those scores, and mark the weaker one for removal. From 8 features down to 3 — but these 3 features cover the most important non-overlapping signals. sq_footage (r=0.983) is the dominant predictor, garden_sqm (r=0.920) adds something else, and age_years (r=0.712, VIF=3.7) is clean and independent.

Step 4 — See the Damage Multicollinearity Causes

Let's see this with our own eyes. Run the same linear regression twice — all 8 features vs the clean 3 — and look at what happens to the coefficients:

y = df['price'].values

def fit_linear(X_data, feat_names):
    """Fit OLS linear regression and print coefficients."""
    X_b = np.column_stack([np.ones(len(X_data)), X_data])   # add intercept column of 1s
    coeffs = np.linalg.lstsq(X_b, y, rcond=None)[0][1:]     # skip intercept, keep features
    for name, coef in zip(feat_names, coeffs):
        print(f"  {name:<18}  coefficient = {coef:>+.3f}")

print("=== ALL 8 FEATURES (multicollinearity present) ===")
fit_linear(df[features].values, features)

print("\n=== 3 CLEAN FEATURES (multicollinearity removed) ===")
clean = ['sq_footage', 'age_years', 'garden_sqm']
fit_linear(df[clean].values, clean)

=== ALL 8 FEATURES (multicollinearity present) ===
  sq_footage          coefficient = +0.179
  num_rooms           coefficient = +18.322
  num_bathrooms       coefficient = +34.213
  age_years           coefficient = -2.109
  dist_city_km        coefficient = -4.882
  garden_sqm          coefficient = +0.034
  crime_score         coefficient = +1.122
  school_km           coefficient = -22.438

=== 3 CLEAN FEATURES (multicollinearity removed) ===
  sq_footage          coefficient = +0.284
  age_years           coefficient = -1.482
  garden_sqm          coefficient = +0.801

Did you spot the problem?

Look at crime_score in the 8-feature model — it has a positive coefficient (+1.122). The model is claiming: higher crime = higher price. That is completely wrong. We know from the data that crime is negatively correlated with price (r = −0.893).

This happened because crime_score and dist_city_km are almost identical (r=0.975). The model got confused trying to split their effects and accidentally reversed the crime coefficient's sign. This is the textbook symptom of multicollinearity. The 3-feature model has sensible, interpretable coefficients throughout — age gets a negative coefficient (older houses are cheaper) and sq_footage gets a positive one (bigger houses cost more). Both make intuitive sense.

Teacher's Note

Multicollinearity doesn't always hurt predictions — but it always hurts interpretation. A model with redundant features might still make decent predictions overall. But you can't trust individual coefficients. If your manager asks "does crime rate affect house prices in our model?" — you can't answer from a multicollinear model because the coefficients are lying.

In business analytics, explanation usually matters as much as accuracy. Stakeholders make decisions based on model insights, not just predictions. Fix multicollinearity before you try to explain your model to anyone.

Practice Questions

1. Multicollinearity mainly damages which type of model — linear models or tree-based models like Random Forest?

2. A VIF score above what number is generally the threshold for dropping a feature?

3. When two features overlap too much and you must drop one, you should keep the one more correlated with what?

Quiz

Up Next · Lesson 26

Visualising Distributions

Histograms, KDE curves, and box plots — build the visuals that make distribution shape instantly obvious to any audience.

← Previous Course Index Next →

EDA Course

Multicollinearity

What Is Multicollinearity — In Plain English

The Dataset We'll Use

Step 1 — Find Suspicious Feature Pairs

Step 2 — VIF: The Proper Multicollinearity Score

Step 3 — Decide Which Features to Keep

Step 4 — See the Damage Multicollinearity Causes

Practice Questions

Quiz