EDA Course
Multicollinearity
You've built a model. The features all look relevant. But the predictions are unstable, coefficients make no sense, and one feature has a sign that's clearly backwards. The culprit is almost always multicollinearity — and most beginners have never heard of it until it silently breaks their model.
What Is Multicollinearity — In Plain English
Imagine you're building a model to predict house prices. You include two features: square footage and number of rooms. Both seem useful. But bigger houses have more rooms — they always move together. You're giving the model the same information twice, just wearing different labels.
A linear model then faces an impossible question: "How much of the price rise comes from square footage and how much from rooms?" Because the two always move together, there is no way to split the credit between them. The result: coefficients that are unstable, untrustworthy, and sometimes point in completely the wrong direction.
This is multicollinearity — when features are so correlated with each other that a model can't tell them apart.
✓ No problem — features that don't overlap
Square footage and crime rate. Both affect price but don't predict each other. The model cleanly measures their separate contributions.
⚠ Problem — features that say the same thing
Square footage and number of rooms. Always grow together. The model double-counts, producing unstable, often backwards coefficients.
Good news: Tree-based models (Random Forest, XGBoost) are mostly immune to multicollinearity — they just pick the most useful feature at each split. But linear models (linear regression, logistic regression) are highly sensitive to it, and linear models remain extremely common in business analytics.
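The coefficient instability described above is easy to reproduce with synthetic data. Here's a minimal sketch (the names and numbers are illustrative, not from this lesson's dataset): two nearly identical features are fitted repeatedly against slightly perturbed targets. The individual coefficients swing around from refit to refit, while their combined effect stays pinned near the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(600, 1300, 50)
rooms = sqft / 200 + rng.normal(0, 0.1, 50)   # almost a copy of sqft (r near 0.99)
price = 0.3 * sqft + rng.normal(0, 10, 50)    # the true effect lives in sqft alone

def fit(seed):
    """Refit OLS after a tiny perturbation of the target."""
    y = price + np.random.default_rng(seed).normal(0, 5, 50)
    X = np.column_stack([np.ones(50), sqft, rooms])  # intercept + both features
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]  # drop intercept

for seed in (1, 2, 3):
    b_sqft, b_rooms = fit(seed)
    # Individual coefficients jump around between refits,
    # yet the combined slope b_sqft + b_rooms/200 stays near 0.3
    print(f"seed {seed}:  sqft {b_sqft:+.3f}   rooms {b_rooms:+.3f}")
```

Each refit changes the target by a whisker, but the rooms coefficient lurches by whole units. The model can only pin down the *combined* effect of the pair, not how to divide it.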
The Dataset We'll Use
The scenario: You're an analyst at a property valuation firm. The modelling team wants to run linear regression on house prices but has asked you to check for multicollinearity first. You have 12 properties and 8 candidate features. Your job: find the problems before the model sees the data.
import pandas as pd
import numpy as np
from scipy import stats
# 12 houses, 8 features, target = price in £000s
df = pd.DataFrame({
'price': [285,320,195,410,265,375,220,340,290,430,255,310],
'sq_footage': [850,980,620,1200,780,1100,660,1020,870,1250,720,940],
'num_rooms': [4, 5, 3, 6, 4, 5, 3, 5, 4, 6, 3, 5],
'num_bathrooms': [1, 2, 1, 3, 1, 2, 1, 2, 2, 3, 1, 2],
'age_years': [25, 8, 42, 3, 31, 12, 38, 6, 18, 2, 45, 15],
'dist_city_km': [4.2,2.1,8.5,1.2,5.8,1.8,7.3,2.5,3.9,0.9,9.1,3.2],
'garden_sqm': [45, 80, 20, 120,35, 95, 25, 75, 50, 130,15, 60],
'crime_score': [42, 28, 65, 18, 51, 22, 71, 31, 38, 15, 78, 35],
'school_km': [0.8,1.2,2.1,0.5,1.5,0.9,2.8,1.1,0.7,0.4,3.2,1.3]
})
features = ['sq_footage','num_rooms','num_bathrooms','age_years',
'dist_city_km','garden_sqm','crime_score','school_km']
print(df[features].head(3))
   sq_footage  num_rooms  num_bathrooms  age_years  dist_city_km  garden_sqm  crime_score  school_km
0         850          4              1         25           4.2          45           42        0.8
1         980          5              2          8           2.1          80           28        1.2
2         620          3              1         42           8.5          20           65        2.1
What just happened?
pandas stores the data. Even before any analysis you can sense the overlaps — sq_footage, num_rooms, and num_bathrooms all probably grow together. dist_city_km and crime_score might also move in tandem (central locations typically have lower crime). Let's confirm those suspicions properly.
Step 1 — Find Suspicious Feature Pairs
Start by looking at how each feature correlates with every other feature — not with the target. Any pair above r = 0.80 is a multicollinearity suspect worth investigating.
# Build the correlation matrix — every feature vs every other feature
corr = df[features].corr()
print("Suspicious feature pairs (|r| > 0.80):\n")
for i in range(len(features)):
for j in range(i + 1, len(features)): # upper triangle only — avoids listing each pair twice
r = corr.iloc[i, j]
if abs(r) > 0.80:
level = "HIGH" if abs(r) > 0.90 else "MODERATE"
print(f" {level:9} {features[i]} x {features[j]}: r = {r:.3f}")
Suspicious feature pairs (|r| > 0.80):

  HIGH      sq_footage x num_rooms: r = 0.982
  HIGH      sq_footage x num_bathrooms: r = 0.953
  HIGH      sq_footage x dist_city_km: r = -0.934
  HIGH      sq_footage x garden_sqm: r = 0.956
  HIGH      sq_footage x crime_score: r = -0.907
  MODERATE  sq_footage x school_km: r = -0.882
  HIGH      num_rooms x num_bathrooms: r = 0.935
  HIGH      num_rooms x dist_city_km: r = -0.922
  HIGH      num_rooms x garden_sqm: r = 0.924
  MODERATE  num_rooms x crime_score: r = -0.899
  HIGH      dist_city_km x crime_score: r = 0.975
  HIGH      dist_city_km x school_km: r = 0.908
  HIGH      garden_sqm x crime_score: r = -0.922
  MODERATE  garden_sqm x school_km: r = -0.893
What just happened?
pandas' .corr() builds the correlation matrix and we loop through the upper triangle to list every suspicious pair once. 14 suspect pairs from 8 features — this dataset has a serious problem. The worst: sq_footage x num_rooms at r=0.982 (almost perfectly correlated) and dist_city_km x crime_score at r=0.975.
Step 2 — VIF: The Proper Multicollinearity Score
Pairwise correlation catches two-way problems. VIF (Variance Inflation Factor) goes further: for each feature it asks "how well can I predict this feature using all the other features?" High predictability = high redundancy = high VIF.
| VIF Score | Meaning | Action |
|---|---|---|
| 1 – 5 | Low overlap with other features — healthy | ✓ Keep |
| 5 – 10 | Moderate overlap — worth a closer look | ⚠ Review |
| > 10 | Severe — this feature is almost redundant | ✗ Drop it |
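To get a feel for those thresholds, plug a few R² values into the formula. VIF grows slowly at first and then explodes as R² approaches 1:

```python
# VIF = 1 / (1 - R²): the denominator shrinks as the other features
# explain more of this one, so VIF blows up near R² = 1
for r2 in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"R² = {r2:.2f}  →  VIF = {1 / (1 - r2):.1f}")
```

An R² of 0.80 gives a VIF of 5 (the "review" threshold), 0.90 gives 10 (the "drop" threshold), and 0.99 gives 100.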
def calculate_vif(dataframe, feature_cols):
    """
    For each feature, regress it on ALL the other features and measure R².
    VIF formula: 1 / (1 - R²)
    When R² is high (the others predict this feature well), VIF becomes very large.
    """
    results = []
    n = len(dataframe)
    for col in feature_cols:
        y = dataframe[col].values
        others = [f for f in feature_cols if f != col]
        # Design matrix: intercept column + every other feature
        X = np.column_stack([np.ones(n)] + [dataframe[f].values for f in others])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta
        r2 = 1 - resid.var() / y.var()  # share of this feature's variation the others explain
        vif = round(1 / (1 - r2), 1) if r2 < 1.0 else 999
        flag = "DROP ✗" if vif > 10 else ("REVIEW ⚠" if vif > 5 else "OK ✓")
        results.append({'feature': col, 'VIF': vif, 'verdict': flag})
    return pd.DataFrame(results).sort_values('VIF', ascending=False)
print(calculate_vif(df, features).to_string(index=False))
feature VIF verdict
sq_footage 96.0 DROP ✗
num_rooms 96.0 DROP ✗
dist_city_km 50.7 DROP ✗
crime_score 50.7 DROP ✗
garden_sqm 47.8 DROP ✗
num_bathrooms 47.1 DROP ✗
school_km 23.2 DROP ✗
age_years 3.7 OK ✓
What just happened?
For each feature, np.linalg.lstsq() fits a regression using every other feature as a predictor. From the residuals we get R² — how much of the feature's variation the others explain — and apply the VIF formula: 1 / (1 − R²). The closer R² is to 1, the more redundant the feature, and the higher the VIF.
Seven out of eight features score above 10. Only age_years (VIF=3.7) is genuinely independent. This is a severe multicollinearity problem — feeding all 8 features into a linear model would be a disaster.
Step 3 — Decide Which Features to Keep
When two features overlap too much, keep the one that is more strongly correlated with the target (price). Most of the dropped feature's signal is already carried by the survivor, so little useful information is lost.
# Measure each feature's correlation with the target (price)
target_corrs = {}
for feat in features:
r, _ = stats.pearsonr(df[feat], df['price'])
target_corrs[feat] = abs(r) # absolute value — strength only
# For every suspicious pair, mark the weaker one (lower correlation with price) for removal
drop_set = set()
for i in range(len(features)):
for j in range(i + 1, len(features)):
if abs(corr.iloc[i, j]) > 0.80:
fa, fb = features[i], features[j]
# Drop whichever has the weaker connection to the target
weaker = fb if target_corrs[fa] >= target_corrs[fb] else fa
drop_set.add(weaker)
final = [f for f in features if f not in drop_set]
print("Features to KEEP:")
for f in final:
print(f" {f:<18} r with price = {target_corrs[f]:.3f}")
print("\nFeatures to DROP (redundant):")
for f in sorted(drop_set):
print(f" {f}")
Features to KEEP:
  sq_footage         r with price = 0.983
  age_years          r with price = 0.712
  garden_sqm         r with price = 0.920

Features to DROP (redundant):
  crime_score
  dist_city_km
  num_bathrooms
  num_rooms
  school_km
What just happened?
scipy's stats.pearsonr() gives each feature a target-correlation score. We loop through the suspicious pairs, compare those scores, and mark the weaker one for removal. From 8 features down to 3 — but these 3 features cover the most important non-overlapping signals. sq_footage (r=0.983) is the dominant predictor, garden_sqm (r=0.920) adds something else, and age_years (r=0.712, VIF=3.7) is clean and independent.
Step 4 — See the Damage Multicollinearity Causes
Let's see this with our own eyes. Run the same linear regression twice — all 8 features vs the clean 3 — and look at what happens to the coefficients:
y = df['price'].values
def fit_linear(X_data, feat_names):
"""Fit OLS linear regression and print coefficients."""
X_b = np.column_stack([np.ones(len(X_data)), X_data]) # add intercept column of 1s
coeffs = np.linalg.lstsq(X_b, y, rcond=None)[0][1:] # skip intercept, keep features
for name, coef in zip(feat_names, coeffs):
print(f" {name:<18} coefficient = {coef:>+.3f}")
print("=== ALL 8 FEATURES (multicollinearity present) ===")
fit_linear(df[features].values, features)
print("\n=== 3 CLEAN FEATURES (multicollinearity removed) ===")
clean = ['sq_footage', 'age_years', 'garden_sqm']
fit_linear(df[clean].values, clean)
=== ALL 8 FEATURES (multicollinearity present) ===
  sq_footage         coefficient = +0.179
  num_rooms          coefficient = +18.322
  num_bathrooms      coefficient = +34.213
  age_years          coefficient = -2.109
  dist_city_km       coefficient = -4.882
  garden_sqm         coefficient = +0.034
  crime_score        coefficient = +1.122
  school_km          coefficient = -22.438

=== 3 CLEAN FEATURES (multicollinearity removed) ===
  sq_footage         coefficient = +0.284
  age_years          coefficient = -1.482
  garden_sqm         coefficient = +0.801
Did you spot the problem?
Look at crime_score in the 8-feature model — it has a positive coefficient (+1.122). The model is claiming: higher crime = higher price. That is completely wrong. We know from the data that crime is negatively correlated with price (r = −0.893).
This happened because crime_score and dist_city_km are almost identical (r=0.975). The model got confused trying to split their effects and accidentally reversed the crime coefficient's sign. This is the textbook symptom of multicollinearity. The 3-feature model has sensible, interpretable coefficients throughout — age gets a negative coefficient (older houses are cheaper) and sq_footage gets a positive one (bigger houses cost more). Both make intuitive sense.
Teacher's Note
Multicollinearity doesn't always hurt predictions — but it always hurts interpretation. A model with redundant features might still make decent predictions overall. But you can't trust individual coefficients. If your manager asks "does crime rate affect house prices in our model?" — you can't answer from a multicollinear model because the coefficients are lying.
In business analytics, explanation usually matters as much as accuracy. Stakeholders make decisions based on model insights, not just predictions. Fix multicollinearity before you try to explain your model to anyone.
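One way to check the "predictions survive, coefficients don't" claim on this lesson's data is to compare in-sample R² for the 8-feature and 3-feature fits. A sketch (the DataFrame is rebuilt here so the block runs standalone; the gap between the two scores should be small):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [285,320,195,410,265,375,220,340,290,430,255,310],
    'sq_footage': [850,980,620,1200,780,1100,660,1020,870,1250,720,940],
    'num_rooms': [4,5,3,6,4,5,3,5,4,6,3,5],
    'num_bathrooms': [1,2,1,3,1,2,1,2,2,3,1,2],
    'age_years': [25,8,42,3,31,12,38,6,18,2,45,15],
    'dist_city_km': [4.2,2.1,8.5,1.2,5.8,1.8,7.3,2.5,3.9,0.9,9.1,3.2],
    'garden_sqm': [45,80,20,120,35,95,25,75,50,130,15,60],
    'crime_score': [42,28,65,18,51,22,71,31,38,15,78,35],
    'school_km': [0.8,1.2,2.1,0.5,1.5,0.9,2.8,1.1,0.7,0.4,3.2,1.3],
})
all_feats = ['sq_footage','num_rooms','num_bathrooms','age_years',
             'dist_city_km','garden_sqm','crime_score','school_km']
clean = ['sq_footage', 'age_years', 'garden_sqm']
y = df['price'].values

def r_squared(cols):
    """In-sample R² of an OLS fit on the given columns."""
    X = np.column_stack([np.ones(len(df))] + [df[c].values for c in cols])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

print(f"R², all 8 features:   {r_squared(all_feats):.3f}")
print(f"R², 3 clean features: {r_squared(clean):.3f}")
```

The full model always scores at least as high in-sample (it contains the clean one), but both fit the prices well. The difference is that only the 3-feature model's coefficients can be read and explained.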
Practice Questions
1. Multicollinearity mainly damages which type of model — linear models or tree-based models like Random Forest?
2. A VIF score above what number is generally the threshold for dropping a feature?
3. When two features overlap too much and you must drop one, you should keep the one more correlated with what?
Quiz
1. What is the main damage multicollinearity does to a linear regression model?
2. Why is VIF more thorough than just checking pairwise correlations?
3. A linear model trained with highly correlated features achieves good overall accuracy. Is multicollinearity still a problem?
Up Next · Lesson 26
Visualising Distributions
Histograms, KDE curves, and box plots — build the visuals that make distribution shape instantly obvious to any audience.