EDA Lesson 27 – Visualizing Relationships | Dataplexa

Intermediate Level · Lesson 27

Visualising Relationships

A correlation number tells you how strong a relationship is. A chart shows you what that relationship actually looks like. Is it a clean straight line? Does it curve? Are there clusters? Do outliers break the pattern? You can't answer those questions with a single number — you need to see it.

The Three Questions Relationship Charts Answer

When you put two variables on a chart together, you're really asking three questions at once:

1. What direction does the relationship go? When one goes up, does the other go up too (positive)? Or down (negative)? Or does it not move at all?

2. How strong is it? Are points tightly clustered around a line (strong) or scattered loosely (weak)?

3. Are there exceptions? Are there outliers that break the pattern? Clusters that suggest subgroups? A curve instead of a line?

The Dataset We'll Use

The scenario: You're a data analyst at a fitness app company. Your growth team wants to understand what drives users to upgrade to premium. You have data on 16 users — their daily steps, active minutes, sleep hours, app sessions per week, and whether they upgraded (1 = yes, 0 = no). Let's visualise the relationships between these features to find out what actually separates upgraders from non-upgraders.

import pandas as pd
import numpy as np

# Fitness app users: 16 users, user_id + 4 features + upgrade outcome
df = pd.DataFrame({
    'user_id':      range(1, 17),
    'daily_steps':  [4200, 9800, 3100, 11200, 5400, 10500, 2800, 9200,
                     4800, 8900, 3500, 10800, 6100, 9500, 2500, 11500],
    'active_mins':  [28,   72,   18,   88,    35,   80,    15,   68,
                     30,   65,   22,   85,    40,   74,    12,   91  ],
    'sleep_hours':  [5.2,  7.8,  5.5,  8.1,   6.2,  7.5,   5.1,  7.2,
                     5.8,  7.0,  5.3,  8.3,   6.8,  7.6,   4.9,  8.0 ],
    'sessions_pw':  [3,    8,    2,    10,    4,    9,    1,    7,
                     3,    7,    2,    9,     5,    8,    1,    11  ],
    'upgraded':     [0,    1,    0,    1,     0,    1,    0,    1,
                     0,    1,    0,    1,     0,    1,    0,    1   ]
})

print(df.to_string(index=False))

 user_id  daily_steps  active_mins  sleep_hours  sessions_pw  upgraded
       1         4200           28          5.2            3         0
       2         9800           72          7.8            8         1
       3         3100           18          5.5            2         0
       4        11200           88          8.1           10         1
       5         5400           35          6.2            4         0
       6        10500           80          7.5            9         1
       7         2800           15          5.1            1         0
       8         9200           68          7.2            7         1
       9         4800           30          5.8            3         0
      10         8900           65          7.0            7         1
      11         3500           22          5.3            2         0
      12        10800           85          8.3            9         1
      13         6100           40          6.8            5         0
      14         9500           74          7.6            8         1
      15         2500           12          4.9            1         0
      16        11500           91          8.0           11         1

What just happened?

Even just scanning the table, a pattern jumps out: users who upgraded (1) consistently have higher step counts, more active minutes, more sleep, and more sessions per week. Let's make that pattern impossible to miss with charts.
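The pattern spotted by eye can also be checked with a quick group comparison. This is a minimal sketch, assuming the same 16-user data as above (rebuilt here so the snippet runs on its own); the name `group_means` is just for illustration.

```python
import pandas as pd

# Same 16-user dataset as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'daily_steps': [4200, 9800, 3100, 11200, 5400, 10500, 2800, 9200,
                    4800, 8900, 3500, 10800, 6100, 9500, 2500, 11500],
    'active_mins': [28, 72, 18, 88, 35, 80, 15, 68,
                    30, 65, 22, 85, 40, 74, 12, 91],
    'sleep_hours': [5.2, 7.8, 5.5, 8.1, 6.2, 7.5, 5.1, 7.2,
                    5.8, 7.0, 5.3, 8.3, 6.8, 7.6, 4.9, 8.0],
    'sessions_pw': [3, 8, 2, 10, 4, 9, 1, 7,
                    3, 7, 2, 9, 5, 8, 1, 11],
    'upgraded':    [0, 1] * 8,   # alternating 0/1, exactly as in the table
})

# Average of every feature within each group (0 = not upgraded, 1 = upgraded)
group_means = df.groupby('upgraded').mean()
print(group_means.round(2))
```

On this data the upgraded group averages 10,175 daily steps against 4,050 for the non-upgraded group, and the other three features show the same direction of gap.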

Chart Type 1 — The Text Scatter Plot

A scatter plot puts one variable on the horizontal axis and another on the vertical axis, and draws a dot for each row. If the dots slope upward from left to right — positive relationship. Downward — negative. Random cloud — no relationship. Let's build a text version to see the steps vs active minutes pattern.

def text_scatter(df, x_col, y_col, width=50, height=15):
    """Draws a simple text scatter plot in the terminal."""
    x = df[x_col]
    y = df[y_col]

    # Normalise both axes to fit within the grid size
    # (assumes each column has at least two distinct values, so the range is non-zero)
    x_norm = ((x - x.min()) / (x.max() - x.min()) * (width - 1)).round().astype(int)
    y_norm = ((y - y.min()) / (y.max() - y.min()) * (height - 1)).round().astype(int)

    # Build an empty grid of background dots
    grid = [['·'] * width for _ in range(height)]

    # Place each data point onto the grid
    # Use '●' for upgraded users, '○' for non-upgraded
    # enumerate() supplies the row position, so this also works when the index isn't 0..n-1
    for pos, (_, row) in enumerate(df.iterrows()):
        xi = x_norm.iloc[pos]
        yi = height - 1 - y_norm.iloc[pos]   # flip y axis so higher = higher on screen
        marker = '●' if row['upgraded'] == 1 else '○'
        grid[yi][xi] = marker

    print(f"{y_col} ↑")
    for line in grid:
        print('  ' + ''.join(line))
    print(f"  {'─' * width}→ {x_col}")
    print("  ○ = not upgraded   ● = upgraded\n")

text_scatter(df, 'daily_steps', 'active_mins')

active_mins ↑
  ·················································●
  ·············································●·●··
  ············································●·····
  ······································●·●·········
  ····································●·············
  ···································●··············
  ··················································
  ··················································
  ··················································
  ····················○·····························
  ················○·································
  ·········○···○····································
  ·····○············································
  ··○○··············································
  ○·················································
  ──────────────────────────────────────────────────→ daily_steps
  ○ = not upgraded   ● = upgraded

What just happened?

The function normalises both columns onto a grid using simple maths — subtract the minimum, divide by the range, scale to grid size. pandas' .iterrows() loops through each row to place its dot on the grid.
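The normalisation step is easier to follow on a handful of numbers. Below is a small sketch of the same min-max arithmetic; `to_grid` is a hypothetical helper name, not part of the lesson's code, and the guard for a constant column is an added assumption.

```python
import numpy as np

def to_grid(values, size):
    """Map values onto integer grid positions 0 .. size-1 (min-max scaling)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # Constant column: every point lands in position 0 instead of dividing by zero
        return np.zeros(len(values), dtype=int)
    scaled = (values - values.min()) / span * (size - 1)
    return scaled.round().astype(int)

# Four step counts mapped onto a 50-column grid
steps = [2500, 4200, 6100, 11500]
cols = to_grid(steps, 50)
print(cols.tolist())
```

The smallest value lands in column 0 and the largest in column 49; everything else is placed proportionally in between and rounded to the nearest column.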

The separation is crystal clear: all the filled circles (●, upgraded users) are in the top-right, with high steps and high active minutes. All the empty circles (○, non-upgraders) are in the bottom-left. This is a strong positive relationship with clean group separation. A correlation close to 1 would confirm how tight the linear trend is, but seeing the two groups literally occupy different corners of the chart makes it viscerally obvious.

The Scatter Plot — Visual Version

Here's what the same scatter plot looks like as a proper visual chart — daily steps on the horizontal axis, active minutes on the vertical, coloured by upgrade status:

[Figure: Daily Steps vs Active Minutes, by Upgrade Status. Horizontal axis: Daily Steps (2,500 to 12,000). Vertical axis: Active Mins. Legend: Not upgraded, Upgraded.]

The two groups occupy completely separate zones — perfect separation along both axes.
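A chart like this can be drawn with matplotlib. The following is a sketch rather than the code behind the figure above: the colours, figure size and output filename are assumptions, and the data is rebuilt so the snippet runs on its own.

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'daily_steps': [4200, 9800, 3100, 11200, 5400, 10500, 2800, 9200,
                    4800, 8900, 3500, 10800, 6100, 9500, 2500, 11500],
    'active_mins': [28, 72, 18, 88, 35, 80, 15, 68,
                    30, 65, 22, 85, 40, 74, 12, 91],
    'upgraded':    [0, 1] * 8,
})

fig, ax = plt.subplots(figsize=(6, 4))

# One scatter call per group so each gets its own colour and legend entry
for flag, label, colour in [(0, 'Not upgraded', 'tab:gray'),
                            (1, 'Upgraded', 'tab:green')]:
    group = df[df['upgraded'] == flag]
    ax.scatter(group['daily_steps'], group['active_mins'],
               label=label, color=colour)

ax.set_xlabel('Daily Steps')
ax.set_ylabel('Active Mins')
ax.set_title('Daily Steps vs Active Minutes, by Upgrade Status')
ax.legend()
fig.savefig('steps_vs_active.png', dpi=100)   # hypothetical output filename
```

Plotting each group separately is a deliberate choice here: it keeps the legend labels readable and makes it easy to restyle one group without touching the other.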

Chart Type 2 — Relationship Strength Side by Side

The scatter plot showed us one relationship. But we have four features. Which one separates upgraders from non-upgraders most cleanly? Let's compute the correlation of each feature with the upgrade outcome and rank them.

from scipy import stats

features = ['daily_steps', 'active_mins', 'sleep_hours', 'sessions_pw']

print("How strongly does each feature relate to upgrading?\n")
print(f"{'Feature':<15} {'Correlation':>12}  {'Strength':>12}  Visual")
print("─" * 60)

# Pearson r between each feature and the 0/1 upgrade flag
results = []
for feat in features:
    r, p = stats.pearsonr(df[feat], df['upgraded'])   # p-value is returned too, unused here
    results.append((feat, r))

# Rank from strongest to weakest relationship
results.sort(key=lambda item: abs(item[1]), reverse=True)

for feat, r in results:
    # Plain-English strength label
    if abs(r) >= 0.8:   strength = "Very strong"
    elif abs(r) >= 0.6: strength = "Strong"
    elif abs(r) >= 0.4: strength = "Moderate"
    else:               strength = "Weak"

    # A simple visual bar scaled to the correlation value
    bar = '█' * int(abs(r) * 20)

    print(f"  {feat:<13} {r:>+12.3f}  {strength:>12}  {bar}")

How strongly does each feature relate to upgrading?

Feature          Correlation      Strength  Visual
────────────────────────────────────────────────────────────
  active_mins         +0.945   Very strong  ██████████████████
  daily_steps         +0.944   Very strong  ██████████████████
  sessions_pw         +0.916   Very strong  ██████████████████
  sleep_hours         +0.897   Very strong  █████████████████

What just happened?

scipy's stats.pearsonr() gives us both the correlation and its p-value in one call. We loop over all four features and rank them by relationship strength with the target (upgraded). Because the target is a 0/1 flag, this Pearson value is what statisticians call a point-biserial correlation.

All four features are very strongly correlated with upgrading, with r between roughly 0.90 and 0.95. That's encouraging news for the modelling team: in this small sample, any of these features looks like a powerful predictor. But it also raises the multicollinearity question from Lesson 25: if they're all this correlated with upgrading, they're probably also correlated with each other. Let's check that next.
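The same ranking can be produced without an explicit loop. This is a sketch using pandas' `corrwith()`, offered as an alternative rather than the lesson's method; it gives Pearson r only, with no p-values, and the data is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'daily_steps': [4200, 9800, 3100, 11200, 5400, 10500, 2800, 9200,
                    4800, 8900, 3500, 10800, 6100, 9500, 2500, 11500],
    'active_mins': [28, 72, 18, 88, 35, 80, 15, 68,
                    30, 65, 22, 85, 40, 74, 12, 91],
    'sleep_hours': [5.2, 7.8, 5.5, 8.1, 6.2, 7.5, 5.1, 7.2,
                    5.8, 7.0, 5.3, 8.3, 6.8, 7.6, 4.9, 8.0],
    'sessions_pw': [3, 8, 2, 10, 4, 9, 1, 7,
                    3, 7, 2, 9, 5, 8, 1, 11],
    'upgraded':    [0, 1] * 8,
})
features = ['daily_steps', 'active_mins', 'sleep_hours', 'sessions_pw']

# Pearson r of every feature column with the 0/1 target, strongest first
ranking = (df[features]
           .corrwith(df['upgraded'])
           .sort_values(ascending=False))
print(ranking.round(3))
```

From here, `ranking.plot.barh()` would turn the result into the correlation bar chart, provided matplotlib is installed.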

Chart Type 3 — The Pair Plot (All vs All)

A pair plot shows the relationship between every pair of features at once — a grid where each cell is a mini scatter plot or histogram. In Python you'd use seaborn.pairplot(). Here we'll build the correlation grid version — same information, no chart library needed.

# Build a feature-vs-feature correlation matrix
# This is the "numbers behind the pair plot"
pair_corr = df[features].corr(method='pearson').round(2)

print("Feature pair correlations (the numbers behind a pair plot):\n")
print(pair_corr)
print()

# Find the two features that are most correlated WITH EACH OTHER
# (potential multicollinearity — Lesson 25!)
max_r = 0
max_pair = ('', '')
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        r = abs(pair_corr.iloc[i, j])
        if r > max_r:
            max_r   = r
            max_pair = (features[i], features[j])

print(f"Most correlated pair: {max_pair[0]}  ×  {max_pair[1]}  (r = {max_r:.2f})")
print("→ If building a linear model, consider dropping one of these two.")

Feature pair correlations (the numbers behind a pair plot):

             daily_steps  active_mins  sleep_hours  sessions_pw
daily_steps         1.00         1.00         0.98         0.99
active_mins         1.00         1.00         0.97         0.99
sleep_hours         0.98         0.97         1.00         0.97
sessions_pw         0.99         0.99         0.97         1.00

Most correlated pair: daily_steps  ×  active_mins  (r = 1.00)
→ If building a linear model, consider dropping one of these two.

What just happened?

pandas' .corr() builds the full matrix in one call. We then loop the upper triangle to find the most correlated pair automatically, connecting back to the multicollinearity lesson.

All four features are heavily correlated with each other (0.97–1.00 after rounding to two decimals; the daily_steps × active_mins value is about 0.998, not a perfect 1). They're essentially four different ways of measuring the same thing: "how active is this user?" For a tree-based model this is usually fine for prediction. For a linear model, we'd want to keep just one or two. The pair plot correlation grid gives you this picture in seconds.
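The upper-triangle search can also be written without nested loops. The sketch below uses NumPy's `triu_indices` on the unrounded matrix and lists every pair rather than only the top one; the 0.9 threshold is an illustrative cut-off, not a rule from this lesson.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'daily_steps': [4200, 9800, 3100, 11200, 5400, 10500, 2800, 9200,
                    4800, 8900, 3500, 10800, 6100, 9500, 2500, 11500],
    'active_mins': [28, 72, 18, 88, 35, 80, 15, 68,
                    30, 65, 22, 85, 40, 74, 12, 91],
    'sleep_hours': [5.2, 7.8, 5.5, 8.1, 6.2, 7.5, 5.1, 7.2,
                    5.8, 7.0, 5.3, 8.3, 6.8, 7.6, 4.9, 8.0],
    'sessions_pw': [3, 8, 2, 10, 4, 9, 1, 7,
                    3, 7, 2, 9, 5, 8, 1, 11],
})

corr = df.corr()                                  # unrounded Pearson matrix
rows, cols = np.triu_indices(len(corr), k=1)      # positions above the diagonal

# One row per feature pair, strongest correlation first
pairs = pd.DataFrame({
    'feature_a': corr.index[rows],
    'feature_b': corr.columns[cols],
    'r': corr.to_numpy()[rows, cols],
}).sort_values('r', ascending=False, ignore_index=True)

threshold = 0.9   # illustrative cut-off for "worth a second look"
print(pairs[pairs['r'].abs() >= threshold].round(3))
```

Working from the unrounded matrix avoids ties created by rounding, and keeping all pairs in one table makes it easier to decide which features to drop together.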

The Pair Plot Visual Grid

Here's what the pair plot correlation grid looks like visually — each cell shaded by correlation strength. Darker = stronger relationship.

[Figure: Pair Plot, Feature Correlation Grid. A 4 × 4 grid of daily_steps, active_mins, sleep_hours and sessions_pw, each cell showing the pairwise correlation from the matrix above. Legend: Moderate (0.90–0.96), Strong (0.97–0.99), Perfect (1.00).]

The Relationship Visualisation Toolkit

Chart Type	Best for	Python code
Scatter plot	Seeing the relationship between two numeric variables	seaborn.scatterplot()
Correlation bar chart	Ranking all features by their relationship with the target	df.corr()['target'].plot.barh()
Pair plot	All feature combinations at once — spotting patterns and redundancy	seaborn.pairplot()
Heatmap	Colour-coded correlation matrix — instant pattern recognition	seaborn.heatmap(df.corr())

Teacher's Note

Always look at the scatter plot before trusting the correlation number. A correlation of 0.7 could come from a noisy straight-line trend, from a curve that Pearson only partly captures, or from a few outliers pulling the number around. A perfectly symmetric U-shaped curve can even score close to 0 despite an obvious relationship. The number alone can't tell you which case you have.
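The U-shape problem is easy to reproduce. This sketch uses made-up points, not the fitness data: a perfect straight line and a perfect symmetric curve, both with no noise at all.

```python
import numpy as np

x = np.arange(-5, 6)        # -5, -4, ..., 5
y_line = 2 * x + 1          # perfect straight line
y_curve = x ** 2            # perfect symmetric U-shape

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson r
r_line = np.corrcoef(x, y_line)[0, 1]
r_curve = np.corrcoef(x, y_curve)[0, 1]

print(f"straight line: r = {r_line:.2f}")
print(f"U-shape:       |r| = {abs(r_curve):.2f}")
```

Both relationships are perfectly predictable, yet Pearson rates the line at 1 and the curve at essentially 0, because it only measures how well a straight line fits.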

And coloured scatter plots — where the dots are coloured by a category like "upgraded" vs "not upgraded" — are one of the most powerful EDA tools you have. When two groups separate cleanly in a scatter plot, you've found a signal your model will love. When they overlap completely, that feature probably won't help much.

Practice Questions

1. Which chart type places one variable on the horizontal axis and another on the vertical, drawing a dot for each data point?

2. Which chart shows the relationship between every pair of features simultaneously in a grid of mini-charts?

3. In our fitness dataset, all four features correlate with each other at 0.95 or higher. What problem does this signal for a linear model?

Up Next · Lesson 28

Categorical Visuals

Bar charts, count plots, and stacked visuals — how to show categorical data in ways that make patterns immediately obvious to anyone in the room.
