EDA Course
Covariance Analysis
Correlation gets all the glory. But correlation is actually built on top of something more fundamental — covariance. If you've ever wondered what's happening under the hood when you run a correlation, this lesson explains it. And covariance has its own unique uses that correlation simply can't replace.
What Is Covariance — A Simple Analogy
Imagine you and a friend both go on a diet at the same time. Some weeks you both lose weight. Some weeks you both gain a bit. Occasionally one of you loses while the other gains. Covariance measures how consistently your weight changes move in the same direction — do they tend to go up together and down together?
More precisely: covariance measures how much two variables deviate from their own averages at the same time. If when one is above its average the other tends to also be above its average — that's positive covariance. If one tends to be above average while the other is below — that's negative covariance.
Positive Covariance
When one variable is above its average, the other tends to be above its average too. They move together. Example: temperature and ice cream sales both above average in July.
Negative Covariance
When one variable is above its average, the other tends to be below. They move in opposite directions. Example: hours of rain and hours of sunshine — when one is high, the other is low.
Zero Covariance
No consistent pattern. Sometimes they move together, sometimes apart. Knowing one tells you nothing useful about the other. Example: shoe size and salary.
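The three cases are easy to see in code. A minimal sketch with made-up toy numbers (the variable names and values here are invented for illustration): numpy's `np.cov(x, y)` returns a 2×2 matrix whose off-diagonal cell `[0, 1]` is the covariance of the pair.

```python
import numpy as np

# Made-up toy data, purely for illustration
temp  = np.array([20, 25, 30, 35, 40])        # rises steadily
sales = np.array([100, 120, 150, 170, 200])   # rises with temp → positive covariance
rain  = np.array([10, 8, 6, 4, 2])            # falls as temp rises → negative covariance

# np.cov(x, y) returns a 2×2 matrix; cell [0, 1] is the covariance of x with y
print(np.cov(temp, sales)[0, 1])   # 312.5  — positive: they move together
print(np.cov(temp, rain)[0, 1])    # -25.0  — negative: they move oppositely
```

The sign tells you the direction of the relationship; the magnitude depends on the units, which is the theme of the next section.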
Covariance vs Correlation — The Key Difference
Here's the thing that trips people up: covariance and correlation measure the same basic idea — do two things move together? But they express it differently.
Covariance
- No fixed range — can be any number
- Value depends on the units of your data
- Hard to interpret on its own: is 5,000 large or small?
- Useful for comparing datasets in the same units
- The raw ingredient inside correlation
Correlation
- Always between −1 and +1
- Unitless — works regardless of scale
- Instantly interpretable: 0.85 is always "strong positive"
- Useful for comparing across different types of data
- Covariance divided by both standard deviations
The formula in plain words: Correlation = Covariance ÷ (standard deviation of A × standard deviation of B). Dividing by the standard deviations "standardises" the covariance — it removes the effect of scale and gives you a number that always lands between −1 and +1. That's the only difference.
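You can verify that plain-words formula in a few lines. This is a sketch on synthetic data (`a` and `b` are invented here): dividing the covariance by both standard deviations reproduces numpy's own Pearson correlation, and rescaling a variable doesn't move r at all.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(size=200)   # b tracks a, plus some noise

cov_ab = np.cov(a, b)[0, 1]
r = cov_ab / (a.std(ddof=1) * b.std(ddof=1))   # the plain-words formula

# numpy's built-in correlation agrees with the hand-built version
assert np.isclose(r, np.corrcoef(a, b)[0, 1])

# Rescaling b changes its covariance with a, but leaves r untouched
assert np.isclose(r, np.corrcoef(a, b * 100)[0, 1])
print(f"r = {r:.3f}")
```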
Calculating Covariance — Your First Look
The scenario: You work at a gym chain. Your manager wants to understand how different membership metrics relate to each other — do members who visit more also tend to spend more on personal training? Do older members visit less often? You've been handed data for 10 members. Start with covariance to get the raw relationship numbers, then you'll build up to the full covariance matrix.
import pandas as pd # pandas: our main data table tool — like a spreadsheet in Python
import numpy as np # numpy: fast maths library — we'll use it to manually compute covariance too
# Gym membership data — 10 members
df = pd.DataFrame({
'member_id': range(1, 11),
'age': [24, 35, 42, 28, 51, 33, 47, 29, 38, 55],
'visits_per_month': [18, 12, 8, 15, 5, 14, 7, 16, 10, 4],
'pt_spend_monthly': [0, 40, 80, 20, 120, 30, 90, 10, 50, 150] # personal training spend in £
})
print("Our gym member data:")
print(df[['member_id','age','visits_per_month','pt_spend_monthly']].to_string(index=False))
print()
# pandas .cov() computes the full covariance matrix — every column vs every other column
# Think of it like a correlation matrix, but the numbers are in "raw" units rather than −1 to +1
cov_matrix = df[['age','visits_per_month','pt_spend_monthly']].cov()
print("=== COVARIANCE MATRIX ===")
print(cov_matrix.round(2))
print()
# Manual covariance between visits and pt_spend — to show exactly what's being calculated
# Covariance = average of (how far each A is from A's mean) × (how far each B is from B's mean)
visits_mean = df['visits_per_month'].mean()
pt_spend_mean = df['pt_spend_monthly'].mean()
# For each member, multiply their deviation from the visits mean by their deviation from the spend mean
deviations = (df['visits_per_month'] - visits_mean) * (df['pt_spend_monthly'] - pt_spend_mean)
# The covariance is the average of all those products
# We divide by (n-1), not n — the standard correction for sample data (pandas' .cov() does the same, via its ddof=1 default)
manual_cov = deviations.sum() / (len(df) - 1)
print(f"Manual covariance (visits × pt_spend): {manual_cov:.2f}")
print(f"pandas .cov() result: {df['visits_per_month'].cov(df['pt_spend_monthly']):.2f}")
Our gym member data:
member_id age visits_per_month pt_spend_monthly
1 24 18 0
2 35 12 40
3 42 8 80
4 28 15 20
5 51 5 120
6 33 14 30
7 47 7 90
8 29 16 10
9 38 10 50
10 55 4 150
=== COVARIANCE MATRIX ===
age visits_per_month pt_spend_monthly
age 107.29 -49.53 506.89
visits_per_month -49.53 23.43 -233.44
pt_spend_monthly 506.89 -233.44 2454.44
Manual covariance (visits × pt_spend): -233.44
pandas .cov() result: -233.44
What just happened?
pandas is our data table library. The .cov() method computes a covariance matrix — a table where every cell shows how two columns vary together. The diagonal cells (age with age, visits with visits) are just the variance of each column — how spread out it is on its own. The off-diagonal cells tell us how pairs of columns move together.
numpy is our fast maths library. We used it here indirectly — pandas' .cov() uses numpy under the hood for the actual calculations.
Now let's read the output. The covariance between visits_per_month and pt_spend_monthly is −233.44. That's negative — members who visit more tend to spend less on personal training. Makes sense: frequent visitors are gym regulars who don't need a PT. The covariance between age and pt_spend_monthly is +506.89 — older members spend more on personal training. But notice: is 506.89 a "big" number? Without knowing the units, it's hard to say. That's the limitation of covariance — which is exactly why we convert to correlation next.
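One quick way to convince yourself about the diagonal cells: rebuild the gym columns (copied from the table above, so this snippet runs standalone) and check each diagonal entry of `.cov()` against the column's own `.var()`.

```python
import pandas as pd
import numpy as np

# Same gym columns as above, rebuilt so this check runs standalone
df = pd.DataFrame({
    'age': [24, 35, 42, 28, 51, 33, 47, 29, 38, 55],
    'visits_per_month': [18, 12, 8, 15, 5, 14, 7, 16, 10, 4],
    'pt_spend_monthly': [0, 40, 80, 20, 120, 30, 90, 10, 50, 150],
})

cov = df.cov()
# Each diagonal cell of the covariance matrix is just that column's variance
for col in df.columns:
    assert np.isclose(cov.loc[col, col], df[col].var())
print("diagonal == per-column variance ✓")
```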
Why the Scale Problem Matters
The scenario: Your manager says: "The covariance between age and PT spend is about 507. Is that strong or weak?" You can't answer without context. Now imagine you get a second dataset where PT spend is recorded in pence instead of pounds. The relationship is identical — but the covariance would be 100× bigger (about 50,689). The number has changed; the relationship hasn't. This is the scale problem. Converting to correlation fixes it instantly.
import pandas as pd # pandas: data table library — .cov(), .corr(), and column operations
import numpy as np # numpy: maths library — manual formula demonstration
# Show how covariance changes with scale but correlation doesn't
# Original data: PT spend in pounds (£)
spend_pounds = df['pt_spend_monthly']
# Same data but converted to pence (×100) — same real-world relationship, different units
spend_pence = df['pt_spend_monthly'] * 100 # multiply every value by 100
# --- COVARIANCE: changes when units change ---
cov_pounds = df['age'].cov(spend_pounds) # covariance with spend in £
cov_pence = df['age'].cov(spend_pence) # covariance with spend in pence
print("=== THE SCALE PROBLEM ===")
print(f"Covariance (age × PT spend in £): {cov_pounds:.2f}")
print(f"Covariance (age × PT spend in pence): {cov_pence:.2f} ← 100× bigger!")
print(f"Ratio: {cov_pence/cov_pounds:.0f}× (same relationship, wildly different numbers)")
print()
# --- CORRELATION: stays exactly the same regardless of units ---
from scipy import stats # scipy: statistics library — pearsonr for correlation with p-value
corr_pounds, p_pounds = stats.pearsonr(df['age'], spend_pounds)
corr_pence, p_pence = stats.pearsonr(df['age'], spend_pence)
print("=== CORRELATION IS SCALE-FREE ===")
print(f"Correlation (age × PT spend in £): r = {corr_pounds:.4f}")
print(f"Correlation (age × PT spend in pence): r = {corr_pence:.4f} ← identical!")
print()
# Now derive correlation from covariance manually — so you can see the relationship
# Formula: r = cov(A,B) / (std(A) × std(B))
std_age = df['age'].std()
std_spend = spend_pounds.std()
r_manual = cov_pounds / (std_age * std_spend) # divide by both standard deviations
print(f"Manual derivation: {cov_pounds:.2f} / ({std_age:.2f} × {std_spend:.2f}) = {r_manual:.4f}")
print(f"This matches pearsonr: {corr_pounds:.4f} ✓")
=== THE SCALE PROBLEM ===
Covariance (age × PT spend in £): 506.89
Covariance (age × PT spend in pence): 50688.89 ← 100× bigger!
Ratio: 100× (same relationship, wildly different numbers)

=== CORRELATION IS SCALE-FREE ===
Correlation (age × PT spend in £): r = 0.9878
Correlation (age × PT spend in pence): r = 0.9878 ← identical!

Manual derivation: 506.89 / (10.36 × 49.54) = 0.9878
This matches pearsonr: 0.9878 ✓
What just happened?
pandas is doing the covariance calculations via .cov(). When we multiply PT spend by 100 (converting pounds to pence), the covariance jumps from about 507 to about 50,689 — 100 times bigger. The data is identical; only the label on the ruler changed.
scipy's stats.pearsonr() shows the correlation stays at exactly 0.9878 in both cases. Because dividing by the standard deviations cancels out the unit change.
The manual derivation at the bottom is the most important part: correlation = covariance ÷ (std_A × std_B). This isn't magic — it's simple arithmetic. Seeing this once makes the relationship between covariance and correlation unforgettable. Covariance gives you the direction and raw magnitude. Dividing by the standard deviations standardises it into a −1 to +1 scale you can always interpret.
Reading a Covariance Matrix — The Diagonal Trick
A covariance matrix looks intimidating at first. Here's all you need to know to read one:
Covariance Matrix — Gym Data
| | age | visits | pt_spend |
| age | 107.29 | −49.53 | 506.89 |
| visits | −49.53 | 23.43 | −233.44 |
| pt_spend | 506.89 | −233.44 | 2454.44 |
How to read this table:
The diagonal = each column's own variance (how spread out it is). PT spend (2454) is very spread out; visits (23) is tightly packed.
Positive off-diagonal = the two variables tend to move in the same direction. Age and PT spend (+507) — older members spend more.
Negative off-diagonal = they move in opposite directions. Visits and PT spend (−233) — frequent visitors spend less on PT.
Symmetry — the table mirrors itself. The top-right is identical to the bottom-left. You only need to read one triangle.
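Both of those reading tricks are easy to verify in code. The sketch below rebuilds the gym columns so it runs standalone, confirms the matrix equals its own transpose, and masks off the redundant upper triangle:

```python
import pandas as pd
import numpy as np

# Gym columns copied from the lesson's table, so this runs standalone
df = pd.DataFrame({
    'age': [24, 35, 42, 28, 51, 33, 47, 29, 38, 55],
    'visits_per_month': [18, 12, 8, 15, 5, 14, 7, 16, 10, 4],
    'pt_spend_monthly': [0, 40, 80, 20, 120, 30, 90, 10, 50, 150],
})

cov = df.cov()
# cov(A, B) == cov(B, A), so the matrix equals its own transpose
assert np.allclose(cov, cov.T)

# Keep only the lower triangle — the upper one carries no new information
lower = cov.where(np.tril(np.ones(cov.shape, dtype=bool)))
print(lower.round(2))   # upper-triangle cells show as NaN
```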
When Covariance Is More Useful Than Correlation
The scenario: Your data team is building a financial model for the gym chain's investment portfolio. They want to combine two revenue streams — gym membership fees and personal training — to reduce overall risk. In finance, the rule is: if two revenue streams go up and down together, combining them doesn't reduce risk much. But if they move in opposite directions, combining them smooths out the volatility. For this calculation — called portfolio optimisation — you need the raw covariance numbers, not correlation. Correlation alone isn't enough.
import pandas as pd # pandas: data table library — .cov(), .var(), and column operations
import numpy as np # numpy: maths library — matrix operations for portfolio variance
# Monthly revenue data (£000s) for two gym revenue streams over 12 months
# Membership fees are stable; PT revenue is more volatile
revenue = pd.DataFrame({
'month': range(1, 13),
'membership': [42, 43, 41, 44, 45, 43, 42, 44, 46, 45, 43, 44], # fairly steady
'pt_revenue': [18, 22, 14, 28, 32, 19, 15, 25, 35, 30, 17, 21] # more volatile
})
# Variance of each revenue stream individually
# Variance = average of squared deviations from the mean — a measure of "bumpiness"
var_membership = revenue['membership'].var() # pandas .var() — sample variance (divides by n-1)
var_pt = revenue['pt_revenue'].var()
print(f"Variance — membership fees: {var_membership:.2f} (lower = more stable)")
print(f"Variance — PT revenue: {var_pt:.2f} (higher = more volatile)")
print()
# Covariance between the two streams
cov_streams = revenue['membership'].cov(revenue['pt_revenue'])
print(f"Covariance between streams: {cov_streams:.2f}")
print()
# Portfolio variance formula:
# If you combine two revenue streams (50/50 split), the combined variance is:
# Var_portfolio = w1² × Var1 + w2² × Var2 + 2 × w1 × w2 × Cov(1,2)
# where w1 and w2 are the weights (both 0.5 for a 50/50 split)
w1, w2 = 0.5, 0.5
var_portfolio = (w1**2 * var_membership) + (w2**2 * var_pt) + (2 * w1 * w2 * cov_streams)
print(f"Portfolio variance (50/50 mix): {var_portfolio:.2f}")
print(f"Portfolio std dev (volatility): £{var_portfolio**0.5:.2f}k per month")
print()
# Compare: what if PT revenue was negatively correlated with membership?
# Simulate an inverse PT stream
revenue['pt_inverse'] = 42 - (revenue['pt_revenue'] - revenue['pt_revenue'].mean())
cov_inverse = revenue['membership'].cov(revenue['pt_inverse'])
var_portfolio_inverse = (w1**2 * var_membership) + (w2**2 * revenue['pt_inverse'].var()) + (2 * w1 * w2 * cov_inverse)
print(f"If PT revenue moved OPPOSITE to membership:")
print(f"Portfolio variance would be: {var_portfolio_inverse:.2f} ← much lower!")
print(f"Portfolio std dev: £{var_portfolio_inverse**0.5:.2f}k per month (less risk)")
Variance — membership fees: 2.09 (lower = more stable)
Variance — PT revenue: 48.18 (higher = more volatile)

Covariance between streams: 9.45

Portfolio variance (50/50 mix): 17.30
Portfolio std dev (volatility): £4.16k per month

If PT revenue moved OPPOSITE to membership:
Portfolio variance would be: 7.84 ← much lower!
Portfolio std dev: £2.80k per month (less risk)
What just happened?
pandas' .var() computes the variance of a column — how spread out the values are. PT revenue (variance 48.18) is roughly 23× more "bumpy" than membership fees (2.09). That's visible just from glancing at the numbers in the dataset.
numpy powers the arithmetic behind the portfolio variance formula. The formula itself — w1² × Var1 + w2² × Var2 + 2 × w1 × w2 × Cov — is the reason covariance can't be replaced by correlation here. The formula needs the raw covariance number, not a −1 to +1 score. You need the actual size of how the two streams move together to compute a real variance in £ terms.
The key takeaway: when PT revenue moves in the same direction as membership, combining them gives a portfolio volatility of £4.16k/month. When they move in opposite directions, volatility drops to £2.80k/month — about a third less risky. This is diversification in a formula, and covariance is what makes it computable.
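If the portfolio variance formula feels like magic, there's a direct sanity check: actually build the 50/50 combined stream and take its variance. The two routes must agree, because Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y) is an identity, not an approximation. A sketch using the same revenue numbers, rebuilt so it runs standalone:

```python
import pandas as pd
import numpy as np

# Same monthly revenue data (£000s) as above
revenue = pd.DataFrame({
    'membership': [42, 43, 41, 44, 45, 43, 42, 44, 46, 45, 43, 44],
    'pt_revenue': [18, 22, 14, 28, 32, 19, 15, 25, 35, 30, 17, 21],
})

w1 = w2 = 0.5
# Route 1: the portfolio variance formula
var_formula = (w1**2 * revenue['membership'].var()
               + w2**2 * revenue['pt_revenue'].var()
               + 2 * w1 * w2 * revenue['membership'].cov(revenue['pt_revenue']))
# Route 2: build the combined stream and measure its variance directly
combined = w1 * revenue['membership'] + w2 * revenue['pt_revenue']
assert np.isclose(var_formula, combined.var())
print(f"Both routes agree: {var_formula:.2f}")
```

The direct route is a useful habit: whenever a variance formula looks suspicious, compute the combined series and check.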
Covariance to Correlation — The Complete Comparison
The scenario: You wrap up the gym analysis with a side-by-side comparison — covariance matrix and correlation matrix for the same data. Your manager can see both, understand what each is saying, and know which one to quote in different situations.
import pandas as pd # pandas: data library — .cov() and .corr() for the two matrices
import numpy as np # numpy: maths library — standard import
numeric_cols = ['age', 'visits_per_month', 'pt_spend_monthly']
# The covariance matrix — raw, units-dependent, hard to compare across pairs
cov_mat = df[numeric_cols].cov()
# The correlation matrix — standardised, always −1 to +1, easy to compare
corr_mat = df[numeric_cols].corr() # pandas .corr() uses Pearson by default
print("=== COVARIANCE MATRIX (raw — unit-dependent) ===")
print(cov_mat.round(2))
print()
print("=== CORRELATION MATRIX (standardised — always −1 to +1) ===")
print(corr_mat.round(3))
print()
# Plain-English interpretation of each relationship
print("=== PLAIN-ENGLISH SUMMARY ===")
pairs = [
('age', 'visits_per_month'),
('age', 'pt_spend_monthly'),
('visits_per_month', 'pt_spend_monthly')
]
for a, b in pairs:
cov_val = cov_mat.loc[a, b]
corr_val = corr_mat.loc[a, b]
direction = "positive" if corr_val > 0 else "negative"
strength = "strong" if abs(corr_val) > 0.7 else "moderate" if abs(corr_val) > 0.4 else "weak"
print(f" {a} × {b}:")
print(f" Covariance = {cov_val:.2f} | Correlation = {corr_val:.3f}")
print(f" → {strength} {direction} relationship")
print()
=== COVARIANCE MATRIX (raw — unit-dependent) ===
age visits_per_month pt_spend_monthly
age 107.29 -49.53 506.89
visits_per_month -49.53 23.43 -233.44
pt_spend_monthly 506.89 -233.44 2454.44
=== CORRELATION MATRIX (standardised — always −1 to +1) ===
age visits_per_month pt_spend_monthly
age 1.000 -0.988 0.988
visits_per_month -0.988 1.000 -0.973
pt_spend_monthly 0.988 -0.973 1.000
=== PLAIN-ENGLISH SUMMARY ===
age × visits_per_month:
Covariance = -49.53 | Correlation = -0.988
→ strong negative relationship
age × pt_spend_monthly:
Covariance = 506.89 | Correlation = 0.988
→ strong positive relationship
visits_per_month × pt_spend_monthly:
Covariance = -233.44 | Correlation = -0.973
→ strong negative relationship
What just happened?
pandas makes both matrices trivial: .cov() and .corr() are called identically — the only difference is which number lands in each cell. The covariance matrix values range from about 23 to 2,454 — impossible to compare at a glance. The correlation matrix values are all between −1 and +1, making every cell instantly readable.
The correlations here are remarkably strong — age vs visits is −0.988 (almost perfectly negative: the older the member, the fewer times they visit). That pattern was in the covariance matrix too (−49.53), but you couldn't tell it was nearly perfect without doing the mental maths to compare it against the other cells. Correlation makes that comparison instant.
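In fact the whole correlation matrix can be recovered from the covariance matrix in one step: divide every cell by the product of the two relevant standard deviations. A quick verification sketch (rebuilding the gym data so it runs standalone):

```python
import pandas as pd
import numpy as np

# Gym columns copied from the lesson's table
df = pd.DataFrame({
    'age': [24, 35, 42, 28, 51, 33, 47, 29, 38, 55],
    'visits_per_month': [18, 12, 8, 15, 5, 14, 7, 16, 10, 4],
    'pt_spend_monthly': [0, 40, 80, 20, 120, 30, 90, 10, 50, 150],
})

cov = df.cov()
stds = df.std()
# Standardise every cell at once: divide by std(row) × std(column)
corr_from_cov = cov / np.outer(stds, stds)
assert np.allclose(corr_from_cov, df.corr())
print("correlation matrix recovered from the covariance matrix ✓")
```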
Teacher's Note
In day-to-day EDA, you will use correlation far more than covariance. Correlation is easier to read, easier to explain to a non-technical stakeholder, and doesn't depend on units. If your manager asks "how related are these two things?", the answer is a correlation coefficient.
But covariance isn't just a stepping stone to correlation. It's the right tool whenever you need the size of how two things move together in real units — portfolio risk calculations, principal component analysis (coming in Lesson 39), and multivariate statistical models all need covariance, not just correlation. Learn both; know when to reach for each.
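As a small preview of the PCA connection (using synthetic data invented for illustration): one standard derivation of PCA eigen-decomposes the covariance matrix — the eigenvectors are the principal directions, and the eigenvalues report how much raw variance each direction carries, which a −1 to +1 correlation score could not supply.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two correlated features — synthetic, purely for illustration
x = rng.normal(size=300)
y = 0.8 * x + 0.3 * rng.normal(size=300)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)                 # 2×2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # sorted ascending

# PCA's principal components are these eigenvectors; the eigenvalues
# measure the variance along each direction, in the data's own units
print(eigenvalues)
assert eigenvalues[-1] > eigenvalues[0]          # one dominant direction
```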
Practice Questions
1. You have covariance values of 450 and 12 for two different pairs of variables. You can't tell which relationship is stronger. Which measure would make them instantly comparable?
2. What value appears on the diagonal of a covariance matrix?
3. The covariance between gym visits per month and personal training spend is −233. Does this mean members who visit more tend to spend more or less on personal training?
Quiz
1. You convert a column from metres to centimetres (multiply by 100) and recompute covariance and correlation with another column. What happens?
2. A finance analyst is calculating the combined risk of two revenue streams using the portfolio variance formula. Should they use covariance or correlation?
3. What is the mathematical relationship between covariance and Pearson correlation?
Up Next · Lesson 21
Feature Relationships
Go beyond pairwise numbers — learn to map out the full web of relationships between all your features and spot which ones are genuinely useful for prediction.