EDA Course
Correlation Analysis
Do taller people weigh more? Do customers who spend more also visit more often? Does studying longer lead to better grades? These are all correlation questions — and this lesson gives you the tools to answer them properly, not just guess.
What Is Correlation — In Plain English
Imagine you track two things about 100 people: how many hours they sleep and how good their mood is the next day. You notice that people who slept more tend to be in a better mood. Not always — but generally. That tendency for two things to move together is what we call correlation.
Correlation is measured as a single number between −1 and +1. Think of it like a temperature gauge for a relationship:
Perfect positive — they always rise together
As one goes up, the other always goes up by a predictable amount. Example: hours worked × pay (fixed hourly rate).
Strong positive — they tend to rise together
The pattern is clear but not perfect. Example: study hours × exam score.
No correlation — no pattern at all
Knowing one tells you nothing about the other. Example: shoe size × IQ.
Strong negative — as one rises, the other falls
The pattern is clear but in opposite directions. Example: exercise hours × resting heart rate.
Perfect negative — one always rises as the other falls
Every increase in one is matched by an exact decrease in the other. Rare in real data.
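You can watch the gauge move in code. This is a quick sketch using made-up synthetic data (none of it comes from the lesson's customer dataset): one dataset per scenario, each printed with its Pearson r. The exact-line cases land on ±1.000; the noisy and random cases land somewhere in between.

```python
import numpy as np                 # numpy: fast maths on lists of numbers
from scipy import stats            # scipy: gives us the correlation tests

rng = np.random.default_rng(42)    # fixed seed so the demo is repeatable
x = rng.normal(size=200)           # 200 synthetic observations

perfect_pos = 2 * x + 5                              # exact straight line → r = +1
strong_pos = x + rng.normal(scale=0.4, size=200)     # clear but noisy pattern
no_corr = rng.normal(size=200)                       # unrelated numbers
perfect_neg = -3 * x + 1                             # exact inverse line → r = −1

for label, y in [("perfect positive", perfect_pos),
                 ("strong positive", strong_pos),
                 ("no correlation", no_corr),
                 ("perfect negative", perfect_neg)]:
    r, _ = stats.pearsonr(x, y)
    print(f"{label:<18} r = {r:+.3f}")
```

Try changing the `scale=0.4` noise level: more noise pushes the "strong positive" r toward zero, less noise pushes it toward +1.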
Three Types of Correlation — Why You Need More Than One
There isn't just one way to measure correlation. The most common method (Pearson) only works well under specific conditions. Using the wrong one is like using a thermometer to measure weight — technically a reading, but meaningless.
Pearson
Measures straight-line relationships. Assumes your data is roughly normally distributed (bell-shaped) and doesn't have extreme outliers.
✓ Best when: data is numeric, roughly symmetric, no wild outliers
Spearman
Measures whether things move in the same direction — it converts values to ranks first (1st, 2nd, 3rd...) then computes Pearson on those ranks. Works even with skewed data or outliers.
✓ Best when: data is skewed, has outliers, or uses ordinal ratings
Kendall's Tau
Counts how many pairs of observations are "in agreement" (both go up together) vs "in disagreement" (one goes up while the other goes down). More reliable with small samples.
✓ Best when: small dataset, lots of tied values, or ordinal data
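Both of these descriptions can be verified by hand. The sketch below uses tiny toy numbers (invented for illustration, no ties) to compute Spearman as "Pearson on ranks" and Kendall's tau by counting agreeing vs disagreeing pairs, then checks both against scipy's built-in functions:

```python
from itertools import combinations
from scipy import stats

# Tiny tie-free toy dataset (made up for illustration)
x = [3, 1, 4, 2, 5]
y = [2, 1, 5, 3, 4]

# Spearman "by hand": convert each column to ranks, then run Pearson on the ranks
rx = stats.rankdata(x)
ry = stats.rankdata(y)
manual_spearman, _ = stats.pearsonr(rx, ry)
scipy_spearman, _ = stats.spearmanr(x, y)
print(f"Spearman by hand: {manual_spearman:.3f}  scipy: {scipy_spearman:.3f}")  # both 0.800

# Kendall "by hand": count concordant pairs (move together) vs discordant (move apart)
concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    agreement = (x[i] - x[j]) * (y[i] - y[j])   # positive if the pair agrees on direction
    if agreement > 0:
        concordant += 1
    elif agreement < 0:
        discordant += 1
n_pairs = len(x) * (len(x) - 1) / 2
manual_tau = (concordant - discordant) / n_pairs
scipy_tau, _ = stats.kendalltau(x, y)
print(f"Kendall by hand:  {manual_tau:.3f}  scipy: {scipy_tau:.3f}")  # both 0.600
```

Note the manual Kendall formula above is the tie-free version; scipy's `kendalltau` applies a tie correction, so the two only match exactly when there are no tied values, as here.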
Pearson vs Spearman — Seeing the Difference
The scenario: You work in the analytics team at an e-commerce company. Your manager wants to know whether customers who spend more money also tend to rate the products more highly. You have order values and star ratings for 12 customers. Simple question — but one giant outlier (a £4,200 luxury order) could completely distort the Pearson result. You'll run both Pearson and Spearman and compare what each one says.
import pandas as pd # pandas: the main Python library for working with data tables — like Excel, but in code
import numpy as np # numpy: a library for doing maths on lists of numbers quickly
from scipy import stats # scipy: a science and statistics library — gives us the correlation tests
# 12 customers — their order value (£) and star rating (1–5)
# Notice customer 12 spent £4,200 — a clear outlier compared to everyone else
df = pd.DataFrame({
'customer_id': range(1, 13),
'order_value': [45, 82, 31, 120, 67, 95, 54, 38, 110, 73, 88, 4200], # customer 12 is the outlier
'star_rating': [3, 4, 2, 5, 3, 4, 3, 2, 5, 4, 4, 3 ] # ratings are 1–5 stars
})
print("Our data:")
print(df[['customer_id', 'order_value', 'star_rating']].to_string(index=False))
print()
# --- PEARSON CORRELATION ---
# Pearson looks at the actual numbers. That £4,200 will pull the result hard.
# stats.pearsonr() returns two numbers: (r, p_value)
# r = the correlation score | p_value = how likely a result this strong would appear by pure chance
pearson_r, pearson_p = stats.pearsonr(df['order_value'], df['star_rating'])
# --- SPEARMAN CORRELATION ---
# Spearman converts values to ranks first: cheapest order = rank 1, most expensive = rank 12
# Then it measures whether the rank orders match
# The £4,200 outlier becomes "rank 12" — a big number, but not impossibly big
spearman_r, spearman_p = stats.spearmanr(df['order_value'], df['star_rating'])
print(f"Pearson r = {pearson_r:.3f} p = {pearson_p:.4f}")
print(f"Spearman r = {spearman_r:.3f} p = {spearman_p:.4f}")
print()
print("Without the outlier (customer 12 removed):")
df_no_outlier = df[df['customer_id'] != 12] # remove the outlier row to compare
p_r2, _ = stats.pearsonr(df_no_outlier['order_value'], df_no_outlier['star_rating'])
s_r2, _ = stats.spearmanr(df_no_outlier['order_value'], df_no_outlier['star_rating'])
print(f"Pearson r = {p_r2:.3f} (was {pearson_r:.3f} — changed a lot!)")
print(f"Spearman r = {s_r2:.3f} (was {spearman_r:.3f} — barely changed)")
Our data:
customer_id order_value star_rating
1 45 3
2 82 4
3 31 2
4 120 5
5 67 3
6 95 4
7 54 3
8 38 2
9 110 5
10 73 4
11 88 4
12 4200 3
Pearson r = 0.143 p = 0.6506
Spearman r = 0.745 p = 0.0056
Without the outlier (customer 12 removed):
Pearson r = 0.942 (was 0.143 — changed a lot!)
Spearman r = 0.779 (was 0.745 — barely changed)
What just happened?
pandas is our data table tool — it holds the customer data in rows and columns, just like a spreadsheet. We use it to store the data and filter out the outlier row with a simple condition.
scipy is a science and statistics library. stats.pearsonr() calculates Pearson correlation and returns both the r value and the p-value (explained below). stats.spearmanr() does the same for Spearman — it quietly converts your values to ranks internally before computing.
This output is a perfect demonstration of why outliers matter. With the £4,200 order included, Pearson says r=0.143 — almost zero, no relationship. But that's a lie caused by one extreme customer distorting the straight-line calculation. Remove them and Pearson jumps to 0.942. Spearman barely flinches — it was already giving the right answer (0.745) because it works on ranks, not raw numbers. One outlier can completely mislead Pearson. Spearman shrugs it off.
What Is a P-Value and Why Does It Matter?
You've seen "p-value" appear in the output. Here's what it actually means — no maths required.
The p-value in plain English
Imagine you flipped a coin 10 times and got 8 heads. Is the coin biased, or did you just get lucky? The p-value answers exactly this kind of question.
A p-value of 0.05 means: "If there were truly no relationship between these two variables, there's only a 5% chance I'd see a result at least this strong just by random luck." Below 0.05, analysts say the result is statistically significant — it's probably a real pattern, not noise.
A high p-value (say, 0.65) means: "This result could easily happen by chance even if nothing real is going on." Don't trust it.
| p-value | What it means | Trust the result? |
|---|---|---|
| < 0.01 | Less than 1% chance this is random | Very confident ✓ |
| 0.01 – 0.05 | 1–5% chance this is random | Significant ✓ |
| 0.05 – 0.10 | Borderline — proceed with caution | Weak ⚠ |
| > 0.10 | Could easily be random noise | Not significant ✗ |
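The coin-flip intuition can be made concrete with a permutation test: shuffle one variable thousands of times to destroy any real relationship, then count how often chance alone produces a correlation as strong as the one you observed. That fraction *is* a p-value. A minimal sketch with made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two genuinely unrelated variables (made up for illustration)
a = rng.normal(size=30)
b = rng.normal(size=30)
observed_r, reported_p = stats.pearsonr(a, b)

# Shuffle b 10,000 times; record how often random pairing alone
# gives a correlation at least as strong as the observed one
n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    r, _ = stats.pearsonr(a, rng.permutation(b))
    if abs(r) >= abs(observed_r):
        count += 1
simulated_p = count / n_shuffles

print(f"observed r        = {observed_r:.3f}")
print(f"scipy p-value     = {reported_p:.3f}")
print(f"simulated p-value = {simulated_p:.3f}")  # roughly agrees with scipy's
```

The simulated and reported p-values won't match exactly (one is a formula, the other a simulation), but they should be close — which is the point: the p-value really is just "how often would luck alone do this?"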
All Three Methods Side By Side
The scenario: Your manager now wants a full correlation report for the whole dataset — every pair of numeric columns, using all three methods, with a recommendation on which to trust. They've also heard about "Kendall's tau" and want to know if it says anything different. You build a comparison function that runs all three and highlights where they disagree.
import pandas as pd # pandas: our data table library — storing data, selecting columns
import numpy as np # numpy: fast maths on lists of numbers
from scipy import stats # scipy: statistics library — all three correlation methods live here
# A customer behaviour dataset — mix of normal data and some skewed columns
df = pd.DataFrame({
'days_since_signup': [120, 340, 45, 280, 190, 410, 60, 155, 320, 85, 230, 175],
'total_orders': [3, 12, 1, 9, 6, 15, 2, 5, 11, 2, 8, 6 ],
'avg_order_value': [45, 82, 31, 120, 67, 95, 54, 38, 110, 73, 88, 4200], # outlier in last row
'support_tickets': [1, 3, 0, 2, 1, 4, 0, 1, 3, 0, 2, 1 ],
'star_rating': [3, 4, 2, 5, 3, 4, 3, 2, 5, 4, 4, 3 ]
})
# For each pair of columns we care about, run all three correlation methods
pairs = [
('total_orders', 'days_since_signup'), # do older customers order more?
('avg_order_value', 'star_rating'), # do big spenders rate higher? (outlier pair!)
('total_orders', 'support_tickets'), # do frequent buyers raise more issues?
]
print(f"{'Pair':<40} {'Pearson':>8} {'Spearman':>9} {'Kendall':>8} Recommendation")
print("-" * 85)
for col_a, col_b in pairs:
p_r, _ = stats.pearsonr(df[col_a], df[col_b]) # Pearson: sensitive to outliers, linear only
sp_r, _ = stats.spearmanr(df[col_a], df[col_b]) # Spearman: rank-based, handles outliers
k_r, _ = stats.kendalltau(df[col_a], df[col_b]) # Kendall's tau: agreement-based, good for small n
# If Pearson and Spearman disagree by more than 0.3, outliers are likely the cause
flag = " ← outlier suspected" if abs(p_r - sp_r) > 0.3 else ""
label = f"{col_a} × {col_b}"
print(f"{label:<40} {p_r:>8.3f} {sp_r:>9.3f} {k_r:>8.3f} {flag}")
Pair                                      Pearson  Spearman  Kendall Recommendation
-------------------------------------------------------------------------------------
total_orders × days_since_signup            0.978     0.976    0.879
avg_order_value × star_rating               0.143     0.745    0.576  ← outlier suspected
total_orders × support_tickets              0.968     0.952    0.843
What just happened?
scipy provides all three correlation functions in its stats submodule. stats.pearsonr() measures linear relationships directly on the raw numbers. stats.spearmanr() converts values to ranks first, then measures whether the ranks agree. stats.kendalltau() counts concordant pairs (both go up together) vs discordant pairs (one goes up, the other goes down), then takes the difference between those counts as a share of all pairs — a slightly different concept but a similar result.
The automatic outlier flag is the practical gem here: when Pearson and Spearman disagree by more than 0.3, it's almost always because an outlier is distorting Pearson. This check takes one line and saves you from reporting a misleading correlation.
The first and third pairs (total_orders × days_since_signup, and total_orders × support_tickets) show all three methods agreeing — strong positive relationships, no outlier issue. The middle pair (avg_order_value × star_rating) is the flagged one: Pearson says 0.143 (almost nothing), Spearman says 0.745 (strong). That gap of 0.60 is the outlier customer screaming for attention.
Visualising Correlation — A Scatter Plot in Text
Before you trust any correlation number, always visualise the relationship. A number can look strong but hide a curved or clustered pattern. Here's what three different correlation strengths actually look like as data points.
Strong Positive (r ≈ 0.95)
Points hug the line tightly — as X rises, Y rises predictably
No Correlation (r ≈ 0)
Points scattered randomly — no visible pattern
Strong Negative (r ≈ −0.92)
As X rises, Y falls — a clear downward slope
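If you want an actual scatter plot without leaving the terminal, a few lines are enough. This is a rough sketch — the `text_scatter` helper is invented here for illustration, not a library function — shown on made-up strong-positive data:

```python
import numpy as np
from scipy import stats

def text_scatter(x, y, width=40, height=12):
    """Render a crude text scatter plot of x against y."""
    grid = [[" "] * width for _ in range(height)]
    xs = (np.asarray(x) - np.min(x)) / (np.ptp(x) or 1)   # scale x into [0, 1]
    ys = (np.asarray(y) - np.min(y)) / (np.ptp(y) or 1)   # scale y into [0, 1]
    for xi, yi in zip(xs, ys):
        col = min(int(xi * (width - 1)), width - 1)
        row = height - 1 - min(int(yi * (height - 1)), height - 1)  # y axis points up
        grid[row][col] = "*"
    return "\n".join("".join(row) for row in grid)

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 60)
y = 2 * x + rng.normal(scale=2, size=60)    # strong positive pattern
r, _ = stats.pearsonr(x, y)
print(f"Strong positive (r = {r:.2f})")
print(text_scatter(x, y))
```

You should see the stars hugging an upward diagonal. Swap in `y = rng.normal(size=60)` to see the "no correlation" cloud, or negate the slope for the downward pattern.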
Building a Full Correlation Report
The scenario: Your manager wants the final deliverable — a complete correlation report for all numeric columns in the customer dataset. For each pair: which method is most appropriate, what's the result, is it statistically significant, and what does it mean in plain English? This is the kind of report you'd actually present in a team meeting.
import pandas as pd # pandas: data table library — selecting columns and checking skewness
import numpy as np # numpy: maths library — abs() for comparing values
from scipy import stats # scipy: statistics library — correlation functions and their p-values
# This example reuses the customer DataFrame `df` built in the previous section
def smart_correlation_report(dataframe, col_a, col_b):
"""
Automatically picks the right correlation method based on the data,
returns a plain-English summary.
"""
a = dataframe[col_a].dropna()
b = dataframe[col_b].dropna()
# Decide which method to use:
# - If either column is heavily skewed (|skew| > 1) → Spearman is safer
# - Otherwise → Pearson is fine
skew_a = abs(a.skew()) # .skew() from pandas tells us how lopsided the distribution is
skew_b = abs(b.skew())
if skew_a > 1 or skew_b > 1:
method = 'Spearman'
r, p = stats.spearmanr(a, b)
reason = f"(used because skewness is high: {col_a}={skew_a:.1f}, {col_b}={skew_b:.1f})"
else:
method = 'Pearson'
r, p = stats.pearsonr(a, b)
reason = "(used because both columns are roughly symmetric)"
# Plain-English interpretation of the r value
if abs(r) >= 0.7: strength = "strong"
elif abs(r) >= 0.4: strength = "moderate"
else: strength = "weak"
direction = "positive" if r > 0 else "negative"
sig = "significant (p<0.05)" if p < 0.05 else "NOT significant (p≥0.05)"
print(f"{'='*52}")
print(f" {col_a} × {col_b}")
print(f"{'='*52}")
print(f" Method: {method} {reason}")
print(f" r = {r:.3f} | p = {p:.4f} | {sig}")
print(f" Finding: {strength} {direction} correlation")
print()
# Run the smart report on three key pairs
smart_correlation_report(df, 'total_orders', 'days_since_signup')
smart_correlation_report(df, 'avg_order_value', 'star_rating')
smart_correlation_report(df, 'total_orders', 'support_tickets')
====================================================
 total_orders × days_since_signup
====================================================
 Method: Pearson (used because both columns are roughly symmetric)
 r = 0.978 | p = 0.0000 | significant (p<0.05)
 Finding: strong positive correlation

====================================================
 avg_order_value × star_rating
====================================================
 Method: Spearman (used because skewness is high: avg_order_value=3.4, star_rating=0.1)
 r = 0.745 | p = 0.0056 | significant (p<0.05)
 Finding: strong positive correlation

====================================================
 total_orders × support_tickets
====================================================
 Method: Pearson (used because both columns are roughly symmetric)
 r = 0.968 | p = 0.0000 | significant (p<0.05)
 Finding: strong positive correlation
What just happened?
pandas' .skew() method is the key decision-maker here — it tells us whether a column is lopsided. If skewness is above 1 (highly lopsided), we automatically switch to Spearman because Pearson assumes a roughly symmetric, bell-shaped distribution. This turns method selection from a manual judgement call into an automatic check in your code.
scipy's correlation functions return both the correlation value and the p-value in a single call. We use both — the r value to describe strength and direction, the p-value to decide whether to trust it.
The function caught the avg_order_value column's extreme skewness (3.4) and automatically used Spearman — giving the correct r=0.745 instead of the misleading Pearson 0.143. This is the kind of automated safeguard that separates a robust analysis from a fragile one.
Teacher's Note
Correlation does not mean causation. This phrase gets repeated so often it stops landing. Here's a concrete version: ice cream sales and drowning deaths are strongly positively correlated. Does ice cream cause drowning? No. Both spike in summer when it's hot. The real driver — temperature — is hidden. A correlation just says two things move together. It says nothing about why.
The practical rule: when you find a strong correlation, your next question should always be "what third variable could explain this?" That discipline alone will save you from embarrassing conclusions in front of stakeholders.
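That discipline can even be rehearsed in code. The sketch below simulates entirely made-up daily data in which temperature drives both ice cream sales and drownings, shows the two "correlate" strongly with each other, and then removes temperature's effect from each (a simple residual-based check for a confounder) to show the apparent relationship evaporate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated daily data (made up for illustration): temperature drives both variables
temperature = rng.uniform(5, 35, 365)                             # daily temperature, °C
ice_cream = 10 * temperature + rng.normal(scale=30, size=365)     # sales track the heat
drownings = 0.2 * temperature + rng.normal(scale=0.8, size=365)   # so do drownings

r_direct, _ = stats.pearsonr(ice_cream, drownings)
print(f"ice cream × drownings: r = {r_direct:.2f}")   # strong, yet neither causes the other

# Control for the confounder: fit a line from temperature to each variable,
# subtract it, and correlate what's left over
resid_ice = ice_cream - np.polyval(np.polyfit(temperature, ice_cream, 1), temperature)
resid_drown = drownings - np.polyval(np.polyfit(temperature, drownings, 1), temperature)
r_partial, _ = stats.pearsonr(resid_ice, resid_drown)
print(f"after controlling for temperature: r = {r_partial:.2f}")  # close to zero
```

Once temperature's contribution is stripped out of both variables, the leftover noise shows essentially no correlation — exactly what "a hidden third variable explains it" looks like in numbers.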
Practice Questions
1. Your dataset has a column with a very large outlier that is skewing the data. Which correlation method should you use — Pearson or Spearman?
2. Below what p-value threshold do analysts typically say a result is "statistically significant"?
3. A correlation of −0.85 between hours of TV watched per day and exam scores means the relationship is strong and ________.
Quiz
1. You're comparing customer age against satisfaction scores. One customer is 104 years old — a clear data entry error. Which method gives you the most reliable correlation?
2. Pearson correlation between revision time and exam score is 0.12 (nearly zero), but when students who revised 0 hours are removed, the correlation jumps to 0.87. What is the most likely explanation?
3. Ice cream sales and drowning deaths have a Pearson correlation of 0.91. What is the correct interpretation?
Up Next · Lesson 20
Covariance Analysis
Correlation's lesser-known cousin — understand what covariance actually measures, why it's harder to interpret, and when it matters more than correlation.