EDA Course
Correlation Maps
A correlation map — also called a heatmap — is the single output that shows you every feature relationship at once. Before any modelling starts, it answers three questions in one glance: which features relate to the target, which features are redundant with each other, and where are the multicollinearity risks hiding?
Why One Chart Beats a Table of Numbers
You've already seen correlation matrices — a grid of r-values from −1 to +1. The problem with reading them as numbers: a 7×7 matrix has 21 unique pairs. Nobody scans 21 numbers and spots the patterns efficiently. Colour changes that completely. Your eye jumps immediately to the darkest cells — the strongest relationships — without reading a single number.
A well-built correlation heatmap takes about 10 seconds to read. It saves hours of manually scanning a correlation table looking for problems.
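That pair count is just the combinations formula n(n−1)/2, and it grows fast as features are added. A quick check:

```python
from math import comb

n = 7                 # number of variables in the matrix
pairs = comb(n, 2)    # unique off-diagonal pairs: n*(n-1)/2
print(pairs)          # 21
```

At 20 features that's already 190 pairs, which is why colour beats scanning numbers.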
The Dataset We'll Use
The scenario: You're a data analyst at an insurance company. The actuarial team wants to build a model predicting annual claim costs. They've given you a dataset with seven potential features and asked you to deliver a correlation analysis before modelling begins — specifically: which features predict claim costs, and are any features so correlated with each other that the model might struggle?
import pandas as pd
import numpy as np
# Insurance dataset — 14 policyholders, 7 features + 1 target
df = pd.DataFrame({
    'annual_claim':   [1200, 4500,  800, 6200, 2100, 5800,  950, 4100,
                       1800, 5500, 1100, 3900, 2400, 6800],  # target — what we predict
    'age':            [28, 55, 22, 61, 35, 58, 24, 52,
                       31, 57, 26, 49, 38, 63],
    'bmi':            [22.1, 31.8, 20.5, 34.2, 25.6, 33.1, 21.0, 30.4,
                       23.8, 32.5, 21.5, 28.9, 26.2, 35.1],
    'num_conditions': [0, 3, 0, 4, 1, 4, 0, 3,
                       0, 4, 0, 2, 1, 5],
    'smoker':         [0, 1, 0, 1, 0, 1, 0, 1,
                       0, 1, 0, 1, 0, 1],  # 1=smoker, 0=non-smoker
    'exercise_hrs':   [8, 1, 10, 0, 5, 1, 9, 2,
                       7, 1, 8, 3, 6, 0],
    'stress_score':   [3, 8, 2, 9, 5, 8, 2, 7,
                       4, 8, 3, 6, 5, 9],
    'region_code':    [1, 3, 1, 2, 2, 3, 1, 2,
                       1, 3, 2, 1, 3, 2]  # 1=North, 2=Midlands, 3=South
})
print(f"Shape: {df.shape} | Columns: {list(df.columns)}")
Shape: (14, 8) | Columns: ['annual_claim', 'age', 'bmi', 'num_conditions', 'smoker', 'exercise_hrs', 'stress_score', 'region_code']
What just happened?
Seven features, one target. Even before any analysis, you can spot likely relationships from domain knowledge: older people tend to claim more, smokers tend to claim more, more exercise probably means fewer claims. The correlation map will confirm — or challenge — those assumptions with actual data.
Step 1 — Build the Full Correlation Matrix
The correlation matrix is the foundation. We compute it first, then read it systematically — target correlations first, then feature-to-feature pairs.
# Build the full correlation matrix — every column vs every other column
corr = df.corr(method='pearson').round(2)
# Step 1a: Read the TARGET row first
# Sort by absolute correlation to rank features by predictive strength
target_row = corr['annual_claim'].drop('annual_claim') # remove self-correlation
target_ranked = target_row.abs().sort_values(ascending=False)
print("=== FEATURE CORRELATIONS WITH annual_claim (target) ===\n")
for feat in target_ranked.index:
    r = target_row[feat]
    bar = '█' * int(abs(r) * 20)   # simple text bar, scaled to 20 characters
    sign = '+' if r > 0 else '-'
    print(f" {feat:<18} r = {r:+.2f} {sign} {bar}")
=== FEATURE CORRELATIONS WITH annual_claim (target) ===

 num_conditions     r = +0.98 + ███████████████████
 smoker             r = +0.97 + ███████████████████
 stress_score       r = +0.97 + ███████████████████
 exercise_hrs       r = -0.97 - ███████████████████
 age                r = +0.96 + ███████████████████
 bmi                r = +0.95 + ███████████████████
 region_code        r = +0.18 + ███
What just happened?
pandas' .corr() computes the full matrix. We extract the annual_claim column (the target row), drop the self-correlation (1.0), then sort by absolute value so the most predictive features rank first.
Six of seven features are very strongly correlated with claim costs (|r| of 0.95 or higher). exercise_hrs is negative (more exercise, lower claims). region_code at 0.18 is essentially useless: it has almost no linear relationship with claims. Drop it before modelling.
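That "drop the weak features" decision can be automated with a simple threshold. A minimal sketch, using toy values hand-copied from the ranking above rather than recomputed:

```python
import pandas as pd

# Toy target correlations, echoing the ranking above (illustrative, not recomputed)
target_r = pd.Series({'num_conditions': 0.98, 'smoker': 0.97, 'stress_score': 0.97,
                      'exercise_hrs': -0.97, 'age': 0.96, 'bmi': 0.95,
                      'region_code': 0.18})

LOW_SIGNAL = 0.20   # below this, a feature carries almost no linear signal
weak = target_r[target_r.abs() < LOW_SIGNAL].index.tolist()
print(weak)   # ['region_code']
```

The absolute value matters: a strong negative correlation like exercise_hrs is signal, not weakness.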
Step 2 — Check Feature-to-Feature Redundancy
Strong correlations with the target are good. But if the features are also strongly correlated with each other, you have a multicollinearity problem. Let's find all the problem pairs in one pass.
features = [c for c in df.columns if c != 'annual_claim']
feat_corr = df[features].corr().round(2)
THRESHOLD = 0.80 # pairs above this are potential multicollinearity problems
print(f"=== FEATURE PAIRS WITH |r| > {THRESHOLD} ===\n")
found = False
for i in range(len(features)):
    for j in range(i + 1, len(features)):   # upper triangle only
        r = feat_corr.iloc[i, j]
        if abs(r) > THRESHOLD:
            fa, fb = features[i], features[j]
            # Which feature has the stronger correlation with the target? Keep that one.
            keep = fa if abs(corr.loc[fa, 'annual_claim']) >= abs(corr.loc[fb, 'annual_claim']) else fb
            drop = fb if keep == fa else fa
            print(f" {fa} × {fb} r = {r:+.2f}")
            print(f"   → Keep '{keep}', consider dropping '{drop}'\n")
            found = True
if not found:
    print(" No highly correlated feature pairs found.")
=== FEATURE PAIRS WITH |r| > 0.8 ===

 age × bmi r = +0.97
   → Keep 'age', consider dropping 'bmi'

 age × num_conditions r = +0.95
   → Keep 'num_conditions', consider dropping 'age'

 age × smoker r = +0.93
   → Keep 'smoker', consider dropping 'age'

 age × exercise_hrs r = -0.97
   → Keep 'exercise_hrs', consider dropping 'age'

 age × stress_score r = +0.95
   → Keep 'stress_score', consider dropping 'age'

 bmi × num_conditions r = +0.96
   → Keep 'num_conditions', consider dropping 'bmi'

 bmi × smoker r = +0.94
   → Keep 'smoker', consider dropping 'bmi'

 bmi × exercise_hrs r = -0.96
   → Keep 'exercise_hrs', consider dropping 'bmi'

 bmi × stress_score r = +0.95
   → Keep 'stress_score', consider dropping 'bmi'

 num_conditions × smoker r = +0.96
   → Keep 'num_conditions', consider dropping 'smoker'

 num_conditions × exercise_hrs r = -0.97
   → Keep 'num_conditions', consider dropping 'exercise_hrs'

 num_conditions × stress_score r = +0.97
   → Keep 'num_conditions', consider dropping 'stress_score'

 smoker × exercise_hrs r = -0.96
   → Keep 'smoker', consider dropping 'exercise_hrs'

 smoker × stress_score r = +0.96
   → Keep 'smoker', consider dropping 'stress_score'

 exercise_hrs × stress_score r = -0.95
   → Keep 'exercise_hrs', consider dropping 'stress_score'
What just happened?
15 flagged pairs from 6 features. Every feature is strongly correlated with every other — because they're all measuring the same underlying thing: how unhealthy is this person? Age, BMI, smoking status, number of conditions, exercise, and stress are six different proxies for one concept.
For a linear model, keeping all six would be a disaster. The recommendation from the algorithm: keep num_conditions (it has the strongest target correlation at 0.98) and potentially smoker as it adds binary information. Everything else is largely redundant. This is the Lesson 25 multicollinearity check — applied through the lens of a heatmap.
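Multicollinearity this severe is usually confirmed with a VIF check (the Lesson 25 tool). VIF can be computed with plain numpy by regressing each feature on all the others: VIF = 1/(1 − R²). A sketch below, run on a small synthetic frame (`demo` is illustrative data, not the insurance dataset):

```python
import numpy as np
import pandas as pd

def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF for each column: 1 / (1 - R^2) from regressing that column
    on all the other columns (with an intercept)."""
    out = {}
    X = frame.to_numpy(dtype=float)
    n, k = X.shape
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[frame.columns[j]] = 1.0 / max(1 - r2, 1e-12)  # guard divide-by-zero
    return pd.Series(out)

# Synthetic demo: x2 is nearly a copy of x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    'x1': x1,
    'x2': x1 + rng.normal(scale=0.1, size=200),  # highly collinear with x1
    'x3': rng.normal(size=200),                  # independent
})
print(vif(demo).round(1))   # x1 and x2 large (collinear), x3 near 1
```

A common rule of thumb flags VIF above 5 or 10; on the insurance features, the six health proxies would all blow past that.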
Step 3 — The Correlation Heatmap (seaborn)
Now we draw the heatmap. The analysis already told us what we'll see — the chart makes it undeniable. One seaborn call, three argument choices that matter.
import seaborn as sns
import matplotlib.pyplot as plt
# Build the heatmap — one call does everything
sns.heatmap(
    corr,              # the correlation matrix DataFrame
    annot=True,        # print the r-value inside each cell
    fmt='.2f',         # format: 2 decimal places
    cmap='coolwarm',   # diverging palette: blue=negative, white=zero, red=positive
    vmin=-1,           # force the colour scale to span -1 to +1
    vmax=1,
    center=0,          # white = zero correlation
    square=True,       # square cells are easier to read
    linewidths=0.5     # thin lines between cells for readability
)
plt.title('Correlation Heatmap — Insurance Features')
plt.tight_layout() # prevent labels being cut off
plt.show()
What just happened?
seaborn's sns.heatmap() takes the correlation matrix DataFrame and draws it as a colour grid. The key arguments: annot=True shows the number in each cell so you can read the exact value when needed. cmap='coolwarm' uses a diverging colour scale — blue for negative correlations, red for positive, white for near-zero. vmin=-1, vmax=1 forces the colour scale to be consistent — without this, seaborn might scale relative to your data and make moderate correlations look stronger than they are.
When you see this heatmap: the entire block of features (excluding region_code) will be dark red — all strongly positively correlated with each other and with the target. Exercise hours will show as dark blue throughout its row and column — the only negative relationship. Region code's row and column will be near-white — almost no relationship with anything.
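One optional refinement: a correlation matrix is symmetric, so half the cells are duplicates. seaborn's `mask` argument can hide the upper triangle. A minimal sketch with a small random stand-in matrix (the lesson's `corr` would slot straight in):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')   # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny stand-in correlation matrix; replace with the lesson's `corr`
rng = np.random.default_rng(1)
corr = pd.DataFrame(np.corrcoef(rng.normal(size=(50, 3)), rowvar=False),
                    columns=list('abc'), index=list('abc'))

# True = hide this cell; np.triu masks the upper triangle AND the diagonal of 1.0s
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', vmin=-1, vmax=1, square=True)
plt.title('Lower triangle only')
plt.tight_layout()
```

With 8 columns that removes 36 redundant cells and noticeably declutters the chart.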
The Heatmap — HTML Version
Here's what the correlation heatmap looks like with the actual r-values from our dataset. Every cell is coloured by its correlation value:
Correlation Heatmap — Insurance Dataset
The entire feature block (except region_code) is dark red — a classic sign of a highly correlated feature set measuring one underlying concept.
Step 4 — Reading the Heatmap: A Systematic Process
Don't just look at the heatmap — read it systematically. Three passes, each answering a different question.
def read_heatmap(corr_matrix, target_col, high_thresh=0.80, low_thresh=0.20):
    """
    Systematic three-pass heatmap reading:
    Pass 1 — Which features correlate with the target?
    Pass 2 — Which features are redundant with each other?
    Pass 3 — Which features add almost no signal?
    """
    features = [c for c in corr_matrix.columns if c != target_col]

    print("PASS 1: Target correlations\n")
    for f in features:
        r = corr_matrix.loc[f, target_col]
        tag = ("✓ Strong signal" if abs(r) > high_thresh
               else "⚠ Weak signal" if abs(r) < low_thresh
               else "~ Moderate")
        print(f" {f:<18} r={r:+.2f} {tag}")

    print("\nPASS 2: Redundant feature pairs\n")
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            r = corr_matrix.loc[features[i], features[j]]
            if abs(r) > high_thresh:
                print(f" {features[i]} × {features[j]} r={r:+.2f} ← potential redundancy")

    print("\nPASS 3: Features to consider dropping\n")
    for f in features:
        r_target = abs(corr_matrix.loc[f, target_col])
        if r_target < low_thresh:
            print(f" {f} r={r_target:.2f} with target ← likely safe to drop")
read_heatmap(corr, 'annual_claim')
PASS 1: Target correlations

 age                r=+0.96 ✓ Strong signal
 bmi                r=+0.95 ✓ Strong signal
 num_conditions     r=+0.98 ✓ Strong signal
 smoker             r=+0.97 ✓ Strong signal
 exercise_hrs       r=-0.97 ✓ Strong signal
 stress_score       r=+0.97 ✓ Strong signal
 region_code        r=+0.18 ⚠ Weak signal

PASS 2: Redundant feature pairs

 age × bmi r=+0.97 ← potential redundancy
 age × num_conditions r=+0.95 ← potential redundancy
 age × smoker r=+0.93 ← potential redundancy
 age × exercise_hrs r=-0.97 ← potential redundancy
 age × stress_score r=+0.95 ← potential redundancy
 bmi × num_conditions r=+0.96 ← potential redundancy
 bmi × smoker r=+0.94 ← potential redundancy
 bmi × exercise_hrs r=-0.96 ← potential redundancy
 bmi × stress_score r=+0.95 ← potential redundancy
 num_conditions × smoker r=+0.96 ← potential redundancy
 num_conditions × exercise_hrs r=-0.97 ← potential redundancy
 num_conditions × stress_score r=+0.97 ← potential redundancy
 smoker × exercise_hrs r=-0.96 ← potential redundancy
 smoker × stress_score r=+0.96 ← potential redundancy
 exercise_hrs × stress_score r=-0.95 ← potential redundancy

PASS 3: Features to consider dropping

 region_code r=0.18 with target ← likely safe to drop
What just happened?
The three-pass function turns a heatmap into a structured analysis decision. Pass 1 ranks features by signal. Pass 2 flags redundancy pairs. Pass 3 identifies candidates for dropping — in this case, just region_code.
The actionable recommendation for the actuarial team: keep num_conditions as the primary feature (r=0.98), add smoker for binary information (it captures something the count doesn't), and apply VIF removal (Lesson 25) to decide the final feature set. Drop region_code — it explains almost nothing about claim costs.
Teacher's Note
The heatmap is a starting point, not a final answer. It shows correlation, and Pearson correlation captures linear relationships only. A feature with r = 0.15 against the target might still be valuable if the relationship is non-linear (a curve, a threshold, a step). And two correlated features might both be worth keeping if they capture different aspects of the underlying concept.
The workflow is: heatmap first for a fast overview, then VIF for multicollinearity depth, then domain knowledge to make the final call. The data scientist who uses all three tools makes better decisions than one who relies on any single one.
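A quick illustration of the non-linearity caveat, on synthetic data rather than the insurance set: a strong U-shaped relationship can produce a Pearson r near zero, while a simple engineered feature exposes it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.uniform(-3, 3, size=1000))
y = x**2 + rng.normal(scale=0.3, size=1000)   # strong U-shaped dependence on x

r_linear = x.corr(y)          # near zero: the two arms of the U cancel out
r_engineered = (x**2).corr(y) # large: the squared feature makes it linear
print(f"corr(x, y)   = {r_linear:+.2f}")
print(f"corr(x^2, y) = {r_engineered:+.2f}")
```

This is exactly the kind of relationship a heatmap alone would hide, and why the scatter plots from earlier lessons still belong in the workflow.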
Practice Questions
1. Which seaborn colour palette uses blue for negative correlations, white for near-zero, and red for positive — the standard choice for correlation heatmaps?
2. Which argument in sns.heatmap() prints the actual r-value number inside each coloured cell?
3. A feature has r = +0.18 with the target variable. Its entire row and column in the heatmap are near-white. What should you do with this feature?
Quiz
1. A feature shows r = 0.12 with the target in the heatmap. Can you safely drop it?
2. Your heatmap shows the entire off-diagonal feature block in dark red. What does this mean?
3. Why should you always set vmin=-1 and vmax=1 in sns.heatmap() for a correlation matrix?
Up Next · Lesson 31
Time-Based EDA
When your data has a date column, the analysis changes completely — trends, seasonality, and time-based patterns become the story. Learn to read data over time.