EDA Course
Correlation Maps
A correlation map — also called a heatmap — is the single output that shows you every feature relationship at once. Before any modelling starts, it answers three questions in one glance: which features relate to the target, which features are redundant with each other, and where are the multicollinearity risks hiding?
Why One Chart Beats a Table of Numbers
You've already seen correlation matrices — a grid of r-values from −1 to +1. The problem with reading them as numbers: a 7×7 matrix has 21 unique pairs. Nobody scans 21 numbers and spots the patterns efficiently. Colour changes that completely. Your eye jumps immediately to the darkest cells — the strongest relationships — without reading a single number.
A well-built correlation heatmap takes about 10 seconds to read. It saves hours of manually scanning a correlation table looking for problems.
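That pair count is just the combinations formula n(n−1)/2, and it grows fast as features are added. A quick check:

```python
from math import comb

n = 7                 # number of variables in the matrix
pairs = comb(n, 2)    # unique off-diagonal pairs: n*(n-1)/2
print(pairs)          # 21
```

At 20 features that's already 190 pairs, which is why colour beats scanning numbers.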
The Dataset We'll Use
The scenario: You're a data analyst at an insurance company. The actuarial team wants to build a model predicting annual claim costs. They've given you a dataset with seven potential features and asked you to deliver a correlation analysis before modelling begins — specifically: which features predict claim costs, and are any features so correlated with each other that the model might struggle?
import pandas as pd
import numpy as np
# Insurance dataset — 14 policyholders, 7 features + 1 target
df = pd.DataFrame({
    'annual_claim':   [1200, 4500,  800, 6200, 2100, 5800,  950, 4100,
                       1800, 5500, 1100, 3900, 2400, 6800],  # target — what we predict
    'age':            [28, 55, 22, 61, 35, 58, 24, 52,
                       31, 57, 26, 49, 38, 63],
    'bmi':            [22.1, 31.8, 20.5, 34.2, 25.6, 33.1, 21.0, 30.4,
                       23.8, 32.5, 21.5, 28.9, 26.2, 35.1],
    'num_conditions': [0, 3, 0, 4, 1, 4, 0, 3,
                       0, 4, 0, 2, 1, 5],
    'smoker':         [0, 1, 0, 1, 0, 1, 0, 1,
                       0, 1, 0, 1, 0, 1],  # 1=smoker, 0=non-smoker
    'exercise_hrs':   [8, 1, 10, 0, 5, 1, 9, 2,
                       7, 1, 8, 3, 6, 0],
    'stress_score':   [3, 8, 2, 9, 5, 8, 2, 7,
                       4, 8, 3, 6, 5, 9],
    'region_code':    [1, 3, 1, 2, 2, 3, 1, 2,
                       1, 3, 2, 1, 3, 2]  # 1=North, 2=Midlands, 3=South
})
print(f"Shape: {df.shape} | Columns: {list(df.columns)}")
Shape: (14, 8) | Columns: ['annual_claim', 'age', 'bmi', 'num_conditions', 'smoker', 'exercise_hrs', 'stress_score', 'region_code']
What just happened?
Seven features, one target. Even before any analysis, you can spot likely relationships from domain knowledge: older people tend to claim more, smokers tend to claim more, more exercise probably means fewer claims. The correlation map will confirm — or challenge — those assumptions with actual data.
Step 1 — Build the Full Correlation Matrix
The correlation matrix is the foundation. We compute it first, then read it systematically — target correlations first, then feature-to-feature pairs.
# Build the full correlation matrix — every column vs every other column
corr = df.corr(method='pearson').round(2)
# Step 1a: Read the TARGET row first
# Sort by absolute correlation to rank features by predictive strength
target_row = corr['annual_claim'].drop('annual_claim') # remove self-correlation
target_ranked = target_row.abs().sort_values(ascending=False)
print("=== FEATURE CORRELATIONS WITH annual_claim (target) ===\n")
for feat in target_ranked.index:
    r = target_row[feat]
    bar = '█' * int(abs(r) * 20)   # simple text bar, scaled to 20 characters
    sign = '+' if r > 0 else '-'
    print(f" {feat:<18} r = {r:+.2f} {sign} {bar}")
=== FEATURE CORRELATIONS WITH annual_claim (target) ===

 num_conditions     r = +0.98 + ███████████████████
 smoker             r = +0.97 + ███████████████████
 stress_score       r = +0.97 + ███████████████████
 exercise_hrs       r = -0.97 - ███████████████████
 age                r = +0.96 + ███████████████████
 bmi                r = +0.95 + ███████████████████
 region_code        r = +0.18 + ███
What just happened?
pandas' .corr() computes the full matrix. We extract the annual_claim column (the target row), drop the self-correlation (1.0), then sort by absolute value so the most predictive features rank first.
Six of seven features are very strongly correlated with claim costs (|r| of 0.95 or higher). exercise_hrs is negative (more exercise, lower claims). region_code at 0.18 is essentially useless: it has almost no linear relationship with claims. Drop it before modelling.
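That "drop the weak features" decision can be automated with a simple threshold. A minimal sketch, using toy values hand-copied from the ranking above rather than recomputed:

```python
import pandas as pd

# Toy target correlations, echoing the ranking above (illustrative, not recomputed)
target_r = pd.Series({'num_conditions': 0.98, 'smoker': 0.97, 'stress_score': 0.97,
                      'exercise_hrs': -0.97, 'age': 0.96, 'bmi': 0.95,
                      'region_code': 0.18})

LOW_SIGNAL = 0.20   # below this, a feature carries almost no linear signal
weak = target_r[target_r.abs() < LOW_SIGNAL].index.tolist()
print(weak)   # ['region_code']
```

The absolute value matters: a strong negative correlation like exercise_hrs is signal, not weakness.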
Step 2 — Check Feature-to-Feature Redundancy
Strong correlations with the target are good. But if the features are also strongly correlated with each other, you have a multicollinearity problem. Let's find all the problem pairs in one pass.
features = [c for c in df.columns if c != 'annual_claim']
feat_corr = df[features].corr().round(2)
THRESHOLD = 0.80 # pairs above this are potential multicollinearity problems
print(f"=== FEATURE PAIRS WITH |r| > {THRESHOLD} ===\n")
found = False
for i in range(len(features)):
    for j in range(i + 1, len(features)):   # upper triangle only
        r = feat_corr.iloc[i, j]
        if abs(r) > THRESHOLD:
            fa, fb = features[i], features[j]
            # Which feature has the stronger correlation with the target? Keep that one.
            keep = fa if abs(corr.loc[fa, 'annual_claim']) >= abs(corr.loc[fb, 'annual_claim']) else fb
            drop = fb if keep == fa else fa
            print(f" {fa} × {fb} r = {r:+.2f}")
            print(f"   → Keep '{keep}', consider dropping '{drop}'\n")
            found = True
if not found:
    print(" No highly correlated feature pairs found.")
=== FEATURE PAIRS WITH |r| > 0.8 ===

 age × bmi r = +0.97
   → Keep 'age', consider dropping 'bmi'

 age × num_conditions r = +0.95
   → Keep 'num_conditions', consider dropping 'age'

 age × smoker r = +0.93
   → Keep 'smoker', consider dropping 'age'

 age × exercise_hrs r = -0.97
   → Keep 'exercise_hrs', consider dropping 'age'

 age × stress_score r = +0.95
   → Keep 'stress_score', consider dropping 'age'

 bmi × num_conditions r = +0.96
   → Keep 'num_conditions', consider dropping 'bmi'

 bmi × smoker r = +0.94
   → Keep 'smoker', consider dropping 'bmi'

 bmi × exercise_hrs r = -0.96
   → Keep 'exercise_hrs', consider dropping 'bmi'

 bmi × stress_score r = +0.95
   → Keep 'stress_score', consider dropping 'bmi'

 num_conditions × smoker r = +0.96
   → Keep 'num_conditions', consider dropping 'smoker'

 num_conditions × exercise_hrs r = -0.97
   → Keep 'num_conditions', consider dropping 'exercise_hrs'

 num_conditions × stress_score r = +0.97
   → Keep 'num_conditions', consider dropping 'stress_score'

 smoker × exercise_hrs r = -0.96
   → Keep 'smoker', consider dropping 'exercise_hrs'

 smoker × stress_score r = +0.96
   → Keep 'smoker', consider dropping 'stress_score'

 exercise_hrs × stress_score r = -0.95
   → Keep 'exercise_hrs', consider dropping 'stress_score'
What just happened?
15 flagged pairs from 6 features. Every feature is strongly correlated with every other — because they're all measuring the same underlying thing: how unhealthy is this person? Age, BMI, smoking status, number of conditions, exercise, and stress are six different proxies for one concept.
For a linear model, keeping all six would be a disaster. The recommendation from the algorithm: keep num_conditions (it has the strongest target correlation at 0.98) and potentially smoker as it adds binary information. Everything else is largely redundant. This is the Lesson 25 multicollinearity check — applied through the lens of a heatmap.
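Multicollinearity this severe is usually confirmed with a VIF check (the Lesson 25 tool). VIF can be computed with plain numpy by regressing each feature on all the others: VIF = 1/(1 − R²). A sketch below, run on a small synthetic frame (`demo` is illustrative data, not the insurance dataset):

```python
import numpy as np
import pandas as pd

def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF for each column: 1 / (1 - R^2) from regressing that column
    on all the other columns (with an intercept)."""
    out = {}
    X = frame.to_numpy(dtype=float)
    n, k = X.shape
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[frame.columns[j]] = 1.0 / max(1 - r2, 1e-12)  # guard divide-by-zero
    return pd.Series(out)

# Synthetic demo: x2 is nearly a copy of x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    'x1': x1,
    'x2': x1 + rng.normal(scale=0.1, size=200),  # highly collinear with x1
    'x3': rng.normal(size=200),                  # independent
})
print(vif(demo).round(1))   # x1 and x2 large (collinear), x3 near 1
```

A common rule of thumb flags VIF above 5 or 10; on the insurance features, the six health proxies would all blow past that.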
Step 3 — The Correlation Heatmap (seaborn)
Now we draw the heatmap. The analysis already told us what we'll see — the chart makes it undeniable. One seaborn call, three argument choices that matter.
import seaborn as sns
import matplotlib.pyplot as plt
# Build the heatmap — one call does everything
sns.heatmap(
    corr,              # the correlation matrix DataFrame
    annot=True,        # print the r-value inside each cell
    fmt='.2f',         # format: 2 decimal places
    cmap='coolwarm',   # diverging palette: blue=negative, white=zero, red=positive
    vmin=-1,           # force the colour scale to span -1 to +1
    vmax=1,
    center=0,          # white = zero correlation
    square=True,       # square cells are easier to read
    linewidths=0.5     # thin lines between cells for readability
)
plt.title('Correlation Heatmap — Insurance Features')
plt.tight_layout() # prevent labels being cut off
plt.show()
What just happened?
seaborn's sns.heatmap() takes the correlation matrix DataFrame and draws it as a colour grid. The key arguments: annot=True shows the number in each cell so you can read the exact value when needed. cmap='coolwarm' uses a diverging colour scale — blue for negative correlations, red for positive, white for near-zero. vmin=-1, vmax=1 forces the colour scale to be consistent — without this, seaborn might scale relative to your data and make moderate correlations look stronger than they are.
When you see this heatmap: the entire block of features (excluding region_code) will be dark red — all strongly positively correlated with each other and with the target. Exercise hours will show as dark blue throughout its row and column — the only negative relationship. Region code's row and column will be near-white — almost no relationship with anything.
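One optional refinement: a correlation matrix is symmetric, so half the cells are duplicates. seaborn's `mask` argument can hide the upper triangle. A minimal sketch with a small random stand-in matrix (the lesson's `corr` would slot straight in):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')   # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny stand-in correlation matrix; replace with the lesson's `corr`
rng = np.random.default_rng(1)
corr = pd.DataFrame(np.corrcoef(rng.normal(size=(50, 3)), rowvar=False),
                    columns=list('abc'), index=list('abc'))

# True = hide this cell; np.triu masks the upper triangle AND the diagonal of 1.0s
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', vmin=-1, vmax=1, square=True)
plt.title('Lower triangle only')
plt.tight_layout()
```

With 8 columns that removes 36 redundant cells and noticeably declutters the chart.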
The Heatmap — HTML Version
Here's what the correlation heatmap looks like with the actual r-values from our dataset. Every cell is coloured by its correlation value:
Correlation Heatmap — Insurance Dataset
The entire feature block (except region_code) is dark red — a classic sign of a highly correlated feature set measuring one underlying concept.
Step 4 — Reading the Heatmap: A Systematic Process
Don't just look at the heatmap — read it systematically. Three passes, each answering a different question.
def read_heatmap(corr_matrix, target_col, high_thresh=0.80, low_thresh=0.20):
    """
    Systematic three-pass heatmap reading:
    Pass 1 — Which features correlate with the target?
    Pass 2 — Which features are redundant with each other?
    Pass 3 — Which features add almost no signal?
    """
    features = [c for c in corr_matrix.columns if c != target_col]

    print("PASS 1: Target correlations\n")
    for f in features:
        r = corr_matrix.loc[f, target_col]
        tag = ("✓ Strong signal" if abs(r) > high_thresh
               else "⚠ Weak signal" if abs(r) < low_thresh
               else "~ Moderate")
        print(f" {f:<18} r={r:+.2f} {tag}")

    print("\nPASS 2: Redundant feature pairs\n")
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            r = corr_matrix.loc[features[i], features[j]]
            if abs(r) > high_thresh:
                print(f" {features[i]} × {features[j]} r={r:+.2f} ← potential redundancy")

    print("\nPASS 3: Features to consider dropping\n")
    for f in features:
        r_target = abs(corr_matrix.loc[f, target_col])
        if r_target < low_thresh:
            print(f" {f} r={r_target:.2f} with target ← likely safe to drop")
read_heatmap(corr, 'annual_claim')
PASS 1: Target correlations

 age                r=+0.96 ✓ Strong signal
 bmi                r=+0.95 ✓ Strong signal
 num_conditions     r=+0.98 ✓ Strong signal
 smoker             r=+0.97 ✓ Strong signal
 exercise_hrs       r=-0.97 ✓ Strong signal
 stress_score       r=+0.97 ✓ Strong signal
 region_code        r=+0.18 ⚠ Weak signal

PASS 2: Redundant feature pairs

 age × bmi r=+0.97 ← potential redundancy
 age × num_conditions r=+0.95 ← potential redundancy
 age × smoker r=+0.93 ← potential redundancy
 age × exercise_hrs r=-0.97 ← potential redundancy
 age × stress_score r=+0.95 ← potential redundancy
 bmi × num_conditions r=+0.96 ← potential redundancy
 bmi × smoker r=+0.94 ← potential redundancy
 bmi × exercise_hrs r=-0.96 ← potential redundancy
 bmi × stress_score r=+0.95 ← potential redundancy
 num_conditions × smoker r=+0.96 ← potential redundancy
 num_conditions × exercise_hrs r=-0.97 ← potential redundancy
 num_conditions × stress_score r=+0.97 ← potential redundancy
 smoker × exercise_hrs r=-0.96 ← potential redundancy
 smoker × stress_score r=+0.96 ← potential redundancy
 exercise_hrs × stress_score r=-0.95 ← potential redundancy

PASS 3: Features to consider dropping

 region_code r=0.18 with target ← likely safe to drop
What just happened?
The three-pass function turns a heatmap into a structured analysis decision. Pass 1 ranks features by signal. Pass 2 flags redundancy pairs. Pass 3 identifies candidates for dropping — in this case, just region_code.
The actionable recommendation for the actuarial team: keep num_conditions as the primary feature (r=0.98), add smoker for binary information (it captures something the count doesn't), and apply VIF removal (Lesson 25) to decide the final feature set. Drop region_code — it explains almost nothing about claim costs.
Teacher's Note
The heatmap is a starting point, not a final answer. It shows correlation, and Pearson correlation captures linear relationships only. A feature with r = 0.15 against the target might still be valuable if the relationship is non-linear (a curve, a threshold, a step). And two correlated features might both be worth keeping if they capture different aspects of the underlying concept.
The workflow is: heatmap first for a fast overview, then VIF for multicollinearity depth, then domain knowledge to make the final call. The data scientist who uses all three tools makes better decisions than one who relies on any single one.
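A quick illustration of the non-linearity caveat, on synthetic data rather than the insurance set: a strong U-shaped relationship can produce a Pearson r near zero, while a simple engineered feature exposes it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.uniform(-3, 3, size=1000))
y = x**2 + rng.normal(scale=0.3, size=1000)   # strong U-shaped dependence on x

r_linear = x.corr(y)          # near zero: the two arms of the U cancel out
r_engineered = (x**2).corr(y) # large: the squared feature makes it linear
print(f"corr(x, y)   = {r_linear:+.2f}")
print(f"corr(x^2, y) = {r_engineered:+.2f}")
```

This is exactly the kind of relationship a heatmap alone would hide, and why the scatter plots from earlier lessons still belong in the workflow.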
Practice Questions
1. Which seaborn colour palette uses blue for negative correlations, white for near-zero, and red for positive — the standard choice for correlation heatmaps?
2. Which argument in sns.heatmap() prints the actual r-value number inside each coloured cell?
3. A feature has r = +0.18 with the target variable. Its entire row and column in the heatmap are near-white. What should you do with this feature?
Quiz
1. A feature shows r = 0.12 with the target in the heatmap. Can you safely drop it?
2. Your heatmap shows the entire off-diagonal feature block in dark red. What does this mean?
3. Why should you always set vmin=-1 and vmax=1 in sns.heatmap() for a correlation matrix?
Up Next · Lesson 31
Time-Based EDA
When your data has a date column, the analysis changes completely — trends, seasonality, and time-based patterns become the story. Learn to read data over time.