EDA Course
Visualising Relationships
A correlation number tells you how strong a straight-line relationship is. A chart shows you what that relationship actually looks like. Is it a clean straight line? Does it curve? Are there clusters? Do outliers break the pattern? You can't answer those questions with a single number: you need to see the data.
The Three Questions Relationship Charts Answer
When you put two variables on a chart together, you're really asking three questions at once:
What direction does the relationship go?
When one goes up, does the other go up too (positive)? Or down (negative)? Or does it not move at all?
How strong is it?
Are points tightly clustered around a line (strong) or scattered loosely (weak)?
Are there exceptions?
Are there outliers that break the pattern? Clusters that suggest subgroups? A curve instead of a line?
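The first two questions can be put into numbers. The sketch below is a minimal, hand-rolled Pearson correlation (the helper name `pearson_r`, the toy lists, and the 0.1 cut-off for "no direction" are all our own assumptions, not part of the lesson's dataset). The sign of r gives the direction and its size gives the strength. The third question, the exceptions, is the one no single number can answer.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of x and y around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data, chosen only to show the three directions
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with x
falling = [10, 8, 6, 4, 2]   # moves against x
flat = [3, 1, 4, 1, 3]       # no straight-line trend

for label, y in [("rising", rising), ("falling", falling), ("flat", flat)]:
    r = pearson_r(x, y)
    # 0.1 is an arbitrary illustrative cut-off, not a standard threshold
    direction = "positive" if r > 0.1 else "negative" if r < -0.1 else "none"
    print(f"{label:<8} |r| = {abs(r):.2f}  direction: {direction}")
```

The closer |r| is to 1, the tighter the points sit around a straight line.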
The Dataset We'll Use
The scenario: You're a data analyst at a fitness app company. Your growth team wants to understand what drives users to upgrade to premium. You have data on 16 users — their daily steps, active minutes, sleep hours, app sessions per week, and whether they upgraded (1 = yes, 0 = no). Let's visualise the relationships between these features to find out what actually separates upgraders from non-upgraders.
import pandas as pd
import numpy as np

# Fitness app users: 16 users, 4 features + upgrade outcome
df = pd.DataFrame({
    'user_id':     range(1, 17),
    'daily_steps': [4200, 9800, 3100, 11200, 5400, 10500, 2800, 9200,
                    4800, 8900, 3500, 10800, 6100, 9500, 2500, 11500],
    'active_mins': [28, 72, 18, 88, 35, 80, 15, 68,
                    30, 65, 22, 85, 40, 74, 12, 91],
    'sleep_hours': [5.2, 7.8, 5.5, 8.1, 6.2, 7.5, 5.1, 7.2,
                    5.8, 7.0, 5.3, 8.3, 6.8, 7.6, 4.9, 8.0],
    'sessions_pw': [3, 8, 2, 10, 4, 9, 1, 7,
                    3, 7, 2, 9, 5, 8, 1, 11],
    'upgraded':    [0, 1, 0, 1, 0, 1, 0, 1,
                    0, 1, 0, 1, 0, 1, 0, 1]
})
print(df.to_string(index=False))
user_id daily_steps active_mins sleep_hours sessions_pw upgraded
1 4200 28 5.2 3 0
2 9800 72 7.8 8 1
3 3100 18 5.5 2 0
4 11200 88 8.1 10 1
5 5400 35 6.2 4 0
6 10500 80 7.5 9 1
7 2800 15 5.1 1 0
8 9200 68 7.2 7 1
9 4800 30 5.8 3 0
10 8900 65 7.0 7 1
11 3500 22 5.3 2 0
12 10800 85 8.3 9 1
13 6100 40 6.8 5 0
14 9500 74 7.6 8 1
15 2500 12 4.9 1 0
16 11500 91 8.0 11 1
What just happened?
Even just scanning the table, a pattern jumps out: users who upgraded (1) consistently have higher step counts, more active minutes, more sleep, and more sessions. Let's make that pattern impossible to miss with charts.
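Before drawing anything, you can check that "consistently" with a few lines of plain Python. This is a minimal sketch, not the course's own code: the helper `group_summary` is a hypothetical name, and only two of the four feature columns are copied from the table to keep it short.

```python
# Two columns copied from the lesson's table, plus the upgrade flag
daily_steps = [4200, 9800, 3100, 11200, 5400, 10500, 2800, 9200,
               4800, 8900, 3500, 10800, 6100, 9500, 2500, 11500]
sleep_hours = [5.2, 7.8, 5.5, 8.1, 6.2, 7.5, 5.1, 7.2,
               5.8, 7.0, 5.3, 8.3, 6.8, 7.6, 4.9, 8.0]
upgraded = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]


def group_summary(values, flags):
    """Return {flag: (mean, min, max)} of `values` for flag 0 and flag 1."""
    summary = {}
    for flag in (0, 1):
        group = [v for v, f in zip(values, flags) if f == flag]
        summary[flag] = (sum(group) / len(group), min(group), max(group))
    return summary


steps_summary = group_summary(daily_steps, upgraded)
sleep_summary = group_summary(sleep_hours, upgraded)

for name, summary in [("daily_steps", steps_summary), ("sleep_hours", sleep_summary)]:
    (mean0, _, max0), (mean1, min1, _) = summary[0], summary[1]
    # No overlap: the lowest upgrader still beats the highest non-upgrader
    no_overlap = min1 > max0
    print(f"{name}: mean {mean0:.2f} (not upgraded) vs {mean1:.2f} (upgraded), "
          f"no overlap: {no_overlap}")
```

For both columns the lowest value among upgraders is still above the highest value among non-upgraders, which is what "consistently" means here.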
Chart Type 1 — The Text Scatter Plot
A scatter plot puts one variable on the horizontal axis and another on the vertical axis, and draws a dot for each row. If the dots slope upward from left to right — positive relationship. Downward — negative. Random cloud — no relationship. Let's build a text version to see the steps vs active minutes pattern.
def text_scatter(df, x_col, y_col, width=50, height=15):
    """Draws a simple text scatter plot in the terminal."""
    x = df[x_col]
    y = df[y_col]
    # Normalise both axes to fit within the grid size
    x_norm = ((x - x.min()) / (x.max() - x.min()) * (width - 1)).round().astype(int)
    y_norm = ((y - y.min()) / (y.max() - y.min()) * (height - 1)).round().astype(int)
    # Build an empty grid of dots
    grid = [['·'] * width for _ in range(height)]
    # Place each data point onto the grid
    # Use '●' for upgraded users, '○' for non-upgraded
    # enumerate() supplies the row position, so .iloc stays correct for any index
    for pos, (_, row) in enumerate(df.iterrows()):
        xi = x_norm.iloc[pos]
        yi = height - 1 - y_norm.iloc[pos]  # flip y axis so higher = higher on screen
        marker = '●' if row['upgraded'] == 1 else '○'
        grid[yi][xi] = marker
    print(f"{y_col} ↑")
    for row in grid:
        print(' ' + ''.join(row))
    print(f" {'─' * width}→ {x_col}")
    print(f" ○ = not upgraded ● = upgraded\n")

text_scatter(df, 'daily_steps', 'active_mins')
active_mins ↑
 ·················································●
 ·············································●·●··
 ············································●·····
 ······································●·●·········
 ····································●·············
 ···································●··············
 ··················································
 ··················································
 ··················································
 ····················○·····························
 ················○·································
 ·········○···○····································
 ·····○············································
 ··○○··············································
 ○·················································
 ──────────────────────────────────────────────────→ daily_steps
 ○ = not upgraded ● = upgraded
What just happened?
The function normalises both columns onto a grid using simple maths: subtract the minimum, divide by the range, scale to grid size. pandas' .iterrows() loops through each row to place its dot on the grid.
The separation is crystal clear: all the filled circles (●, upgraded users) are in the top-right, with high steps and high active minutes. All the empty circles (○, non-upgraders) are in the bottom-left. This is a strong positive relationship with perfect group separation. A correlation close to 1 would tell you the relationship is strong, but it says nothing about the groups. Seeing the two groups occupy opposite corners of the chart makes the separation obvious.
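To make the "subtract the minimum, divide by the range, scale to grid size" step concrete, here is a small worked sketch for a single user. The helper name `to_grid` is our own. The minimum and maximum values are read off the dataset table, and the grid size matches the function's defaults.

```python
def to_grid(value, lo, hi, size):
    """Map `value` from the range [lo, hi] to an integer cell in 0..size-1."""
    fraction = (value - lo) / (hi - lo)   # 0.0 at the minimum, 1.0 at the maximum
    return round(fraction * (size - 1))


WIDTH, HEIGHT = 50, 15   # same grid size as the text scatter defaults

# User 2 from the table: 9,800 steps and 72 active minutes
col = to_grid(9800, lo=2500, hi=11500, size=WIDTH)
row_from_bottom = to_grid(72, lo=12, hi=91, size=HEIGHT)
row_on_screen = HEIGHT - 1 - row_from_bottom   # printing starts at the top row

print(col, row_on_screen)   # prints "40 3"
```

So user 2 lands in column 40, three rows down from the top of the chart. The y flip is needed because the terminal prints the first row at the top, while a chart's largest values belong there.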
The Scatter Plot — Visual Version
Here's what the same scatter plot looks like as a proper visual chart — daily steps on the horizontal axis, active minutes on the vertical, coloured by upgrade status:
[Figure: Daily Steps vs Active Minutes, by Upgrade Status. Horizontal axis: Daily Steps; vertical axis: Active Minutes.]
The two groups occupy completely separate zones — perfect separation along both axes.
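If you want to draw a chart like this yourself, the following is one possible sketch. It assumes matplotlib is installed. The helper names (`split_by_flag`, `plot_upgrade_scatter`) and the colours are our own choices rather than the course's code, and only the first six users are included to keep it short.

```python
def split_by_flag(xs, ys, flags):
    """Group (x, y) points by their 0/1 flag so each group gets its own colour."""
    groups = {0: ([], []), 1: ([], [])}
    for x, y, flag in zip(xs, ys, flags):
        groups[flag][0].append(x)
        groups[flag][1].append(y)
    return groups


def plot_upgrade_scatter(xs, ys, flags):
    """Draw the coloured scatter plot. Assumes matplotlib is installed."""
    import matplotlib.pyplot as plt   # imported lazily; the helper above needs no libraries

    groups = split_by_flag(xs, ys, flags)
    fig, ax = plt.subplots()
    # Hollow markers for non-upgraders, filled markers for upgraders
    ax.scatter(*groups[0], label="not upgraded", facecolors="none", edgecolors="tab:blue")
    ax.scatter(*groups[1], label="upgraded", color="tab:orange")
    ax.set_xlabel("Daily Steps")
    ax.set_ylabel("Active Minutes")
    ax.set_title("Daily Steps vs Active Minutes, by Upgrade Status")
    ax.legend()
    plt.show()


# First six users from the lesson's table, enough to show the idea
steps = [4200, 9800, 3100, 11200, 5400, 10500]
mins = [28, 72, 18, 88, 35, 80]
upgraded = [0, 1, 0, 1, 0, 1]

groups = split_by_flag(steps, mins, upgraded)
# plot_upgrade_scatter(steps, mins, upgraded)   # uncomment to open the chart window
```

Plotting each group with its own `scatter` call is what gives you one legend entry per group.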
Chart Type 2 — Relationship Strength Side by Side
The scatter plot showed us one relationship. But we have four features. Which one separates upgraders from non-upgraders most cleanly? Let's compute the correlation of each feature with the upgrade outcome and rank them.
from scipy import stats

features = ['daily_steps', 'active_mins', 'sleep_hours', 'sessions_pw']
print("How strongly does each feature relate to upgrading?\n")
print(f"{'Feature':<15} {'Correlation':>12} {'Strength':>12} Visual")
print("─" * 60)
results = []
for feat in features:
    r, p = stats.pearsonr(df[feat], df['upgraded'])
    results.append((feat, r))
# Rank features from strongest to weakest relationship
results.sort(key=lambda item: abs(item[1]), reverse=True)
for feat, r in results:
    # Plain-English strength label
    if abs(r) >= 0.8:
        strength = "Very strong"
    elif abs(r) >= 0.6:
        strength = "Strong"
    elif abs(r) >= 0.4:
        strength = "Moderate"
    else:
        strength = "Weak"
    # A simple visual bar scaled to the correlation value
    bar = '█' * int(abs(r) * 20)
    print(f" {feat:<13} {r:>+12.3f} {strength:>12} {bar}")
How strongly does each feature relate to upgrading?

Feature          Correlation     Strength Visual
────────────────────────────────────────────────────────────
 active_mins         +0.945  Very strong ██████████████████
 daily_steps         +0.944  Very strong ██████████████████
 sessions_pw         +0.916  Very strong ██████████████████
 sleep_hours         +0.897  Very strong █████████████████
What just happened?
scipy's stats.pearsonr() gives us both the correlation and its p-value in one call. We loop over all four features and rank them by relationship strength with the target (upgraded).
All four features are very strongly correlated with upgrading, with r of about 0.9 or higher for every one of them. That's excellent news for the modelling team: any of these features would be a powerful predictor. But it also raises the multicollinearity question from Lesson 25: if they're all this correlated with upgrading, they're probably also correlated with each other. Let's check that next.
Chart Type 3 — The Pair Plot (All vs All)
A pair plot shows the relationship between every pair of features at once: a grid where each cell is a mini scatter plot or histogram. In Python you'd use seaborn.pairplot(). Here we'll build the correlation grid version, which condenses each cell of the pair plot into a single number and needs no chart library.
# Build a feature-vs-feature correlation matrix
# This is the "numbers behind the pair plot"
pair_corr = df[features].corr(method='pearson').round(2)
print("Feature pair correlations (the numbers behind a pair plot):\n")
print(pair_corr)
print()
# Find the two features that are most correlated WITH EACH OTHER
# (potential multicollinearity, see Lesson 25!)
max_r = 0
max_pair = ('', '')
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        r = abs(pair_corr.iloc[i, j])
        if r > max_r:
            max_r = r
            max_pair = (features[i], features[j])
print(f"Most correlated pair: {max_pair[0]} × {max_pair[1]} (r = {max_r:.2f})")
print("→ If building a linear model, consider dropping one of these two.")
Feature pair correlations (the numbers behind a pair plot):

             daily_steps  active_mins  sleep_hours  sessions_pw
daily_steps         1.00         1.00         0.98         0.99
active_mins         1.00         1.00         0.97         0.99
sleep_hours         0.98         0.97         1.00         0.97
sessions_pw         0.99         0.99         0.97         1.00

Most correlated pair: daily_steps × active_mins (r = 1.00)
→ If building a linear model, consider dropping one of these two.
What just happened?
pandas' .corr() builds the full matrix in one call. We then loop over the upper triangle to find the most correlated pair automatically, connecting back to the multicollinearity lesson. The 1.00 for daily_steps × active_mins is a rounding effect: the unrounded value is about 0.998, so the two features are nearly, not exactly, interchangeable.
All four features are heavily correlated with each other (0.97 or higher for every pair). They're essentially four different ways of measuring the same thing: "how active is this user?" For a tree-based model this is fine. For a linear model, we'd want to keep just one or two. The pair plot correlation grid gives you this picture in seconds.
The Pair Plot Visual Grid
Here's what the pair plot correlation grid looks like visually — each cell shaded by correlation strength. Darker = stronger relationship.
[Figure: Pair Plot, Feature Correlation Grid]
The Relationship Visualisation Toolkit
| Chart Type | Best for | Python code |
|---|---|---|
| Scatter plot | Seeing the relationship between two numeric variables | seaborn.scatterplot() |
| Correlation bar chart | Ranking all features by their relationship with the target | df.corr()['target'].plot.barh() |
| Pair plot | All feature combinations at once — spotting patterns and redundancy | seaborn.pairplot() |
| Heatmap | Colour-coded correlation matrix — instant pattern recognition | seaborn.heatmap(df.corr()) |
Teacher's Note
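Always look at the scatter plot before trusting the correlation number. A correlation of 0.7 could mean a noisy but genuinely straight-line trend, a curve that a straight line only partly captures, or a tight pattern pulled off course by a few outliers. A perfect U-shaped curve can even score close to zero, because Pearson only measures straight-line association. The number alone can't tell you which one it is.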
And coloured scatter plots — where the dots are coloured by a category like "upgraded" vs "not upgraded" — are one of the most powerful EDA tools you have. When two groups separate cleanly in a scatter plot, you've found a signal your model will love. When they overlap completely, that feature probably won't help much.
Practice Questions
1. Which chart type places one variable on the horizontal axis and another on the vertical, drawing a dot for each data point?
2. Which chart shows the relationship between every pair of features simultaneously in a grid of mini-charts?
3. In our fitness dataset, all four features correlate with each other at 0.95 or higher. What problem does this signal for a linear model?
Quiz
1. A Pearson correlation of 0.35 between two variables seems low. What should you do before concluding there is no relationship?
2. A scatter plot of daily_steps vs active_mins shows all upgraded users in the top-right and all non-upgraded in the bottom-left with no overlap. What does this tell you?
3. Which seaborn function creates a grid of scatter plots showing every feature combination at once?
Up Next · Lesson 28
Categorical Visuals
Bar charts, count plots, and stacked visuals — how to show categorical data in ways that make patterns immediately obvious to anyone in the room.