EDA Course
Covariance Analysis
Correlation gets all the glory. But correlation is actually built on top of something more fundamental — covariance. If you've ever wondered what's happening under the hood when you run a correlation, this lesson explains it. And covariance has its own unique uses that correlation simply can't replace.
What Is Covariance — A Simple Analogy
Imagine you and a friend both go on a diet at the same time. Some weeks you both lose weight. Some weeks you both gain a bit. Occasionally one of you loses while the other gains. Covariance measures how consistently your weight changes move in the same direction — do they tend to go up together and down together?
More precisely: covariance measures how much two variables deviate from their own averages at the same time. If when one is above its average the other tends to also be above its average — that's positive covariance. If one tends to be above average while the other is below — that's negative covariance.
Positive Covariance
When one variable is above its average, the other tends to be above its average too. They move together. Example: temperature and ice cream sales both above average in July.
Negative Covariance
When one variable is above its average, the other tends to be below. They move in opposite directions. Example: hours of rain and hours of sunshine — when one is high, the other is low.
Zero Covariance
No consistent pattern. Sometimes they move together, sometimes apart. Knowing one tells you nothing useful about the other. Example: shoe size and salary.
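The three cases are easy to see in code. A minimal sketch with made-up toy numbers (the variable names and values here are invented for illustration): numpy's `np.cov(x, y)` returns a 2×2 matrix whose off-diagonal cell `[0, 1]` is the covariance of the pair.

```python
import numpy as np

# Made-up toy data, purely for illustration
temp  = np.array([20, 25, 30, 35, 40])        # rises steadily
sales = np.array([100, 120, 150, 170, 200])   # rises with temp → positive covariance
rain  = np.array([10, 8, 6, 4, 2])            # falls as temp rises → negative covariance

# np.cov(x, y) returns a 2×2 matrix; cell [0, 1] is the covariance of x with y
print(np.cov(temp, sales)[0, 1])   # 312.5  — positive: they move together
print(np.cov(temp, rain)[0, 1])    # -25.0  — negative: they move oppositely
```

The sign tells you the direction of the relationship; the magnitude depends on the units, which is the theme of the next section.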
Covariance vs Correlation — The Key Difference
Here's the thing that trips people up: covariance and correlation measure the same basic idea — do two things move together? But they express it differently.
Covariance
- No fixed range — can be any number
- Value depends on the units of your data
- Hard to interpret on its own: is 5,000 large or small?
- Useful for comparing datasets in the same units
- The raw ingredient inside correlation
Correlation
- Always between −1 and +1
- Unitless — works regardless of scale
- Instantly interpretable: 0.85 is always "strong positive"
- Useful for comparing across different types of data
- Covariance divided by both standard deviations
The formula in plain words: Correlation = Covariance ÷ (standard deviation of A × standard deviation of B). Dividing by the standard deviations "standardises" the covariance — it removes the effect of scale and gives you a number that always lands between −1 and +1. That's the only difference.
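You can verify that plain-words formula in a few lines. This is a sketch on synthetic data (`a` and `b` are invented here): dividing the covariance by both standard deviations reproduces numpy's own Pearson correlation, and rescaling a variable doesn't move r at all.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(size=200)   # b tracks a, plus some noise

cov_ab = np.cov(a, b)[0, 1]
r = cov_ab / (a.std(ddof=1) * b.std(ddof=1))   # the plain-words formula

# numpy's built-in correlation agrees with the hand-built version
assert np.isclose(r, np.corrcoef(a, b)[0, 1])

# Rescaling b changes its covariance with a, but leaves r untouched
assert np.isclose(r, np.corrcoef(a, b * 100)[0, 1])
print(f"r = {r:.3f}")
```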
Calculating Covariance — Your First Look
The scenario: You work at a gym chain. Your manager wants to understand how different membership metrics relate to each other — do members who visit more also tend to spend more on personal training? Do older members visit less often? You've been handed data for 10 members. Start with covariance to get the raw relationship numbers, then you'll build up to the full covariance matrix.
import pandas as pd # pandas: our main data table tool — like a spreadsheet in Python
import numpy as np # numpy: fast maths library — we'll use it to manually compute covariance too
# Gym membership data — 10 members
df = pd.DataFrame({
'member_id': range(1, 11),
'age': [24, 35, 42, 28, 51, 33, 47, 29, 38, 55],
'visits_per_month': [18, 12, 8, 15, 5, 14, 7, 16, 10, 4],
'pt_spend_monthly': [0, 40, 80, 20, 120, 30, 90, 10, 50, 150] # personal training spend in £
})
print("Our gym member data:")
print(df[['member_id','age','visits_per_month','pt_spend_monthly']].to_string(index=False))
print()
# pandas .cov() computes the full covariance matrix — every column vs every other column
# Think of it like a correlation matrix, but the numbers are in "raw" units rather than −1 to +1
cov_matrix = df[['age','visits_per_month','pt_spend_monthly']].cov()
print("=== COVARIANCE MATRIX ===")
print(cov_matrix.round(2))
print()
# Manual covariance between visits and pt_spend — to show exactly what's being calculated
# Covariance = average of (how far each A is from A's mean) × (how far each B is from B's mean)
visits_mean = df['visits_per_month'].mean()
pt_spend_mean = df['pt_spend_monthly'].mean()
# For each member, multiply their deviation from the visits mean by their deviation from the spend mean
deviations = (df['visits_per_month'] - visits_mean) * (df['pt_spend_monthly'] - pt_spend_mean)
# The covariance is the average of all those products
# We divide by (n-1), not n — the standard correction for sample data (pandas' .cov() does the same, via its ddof=1 default)
manual_cov = deviations.sum() / (len(df) - 1)
print(f"Manual covariance (visits × pt_spend): {manual_cov:.2f}")
print(f"pandas .cov() result: {df['visits_per_month'].cov(df['pt_spend_monthly']):.2f}")
Our gym member data:
member_id age visits_per_month pt_spend_monthly
1 24 18 0
2 35 12 40
3 42 8 80
4 28 15 20
5 51 5 120
6 33 14 30
7 47 7 90
8 29 16 10
9 38 10 50
10 55 4 150
=== COVARIANCE MATRIX ===
age visits_per_month pt_spend_monthly
age 107.29 -49.53 506.89
visits_per_month -49.53 23.43 -233.44
pt_spend_monthly 506.89 -233.44 2454.44
Manual covariance (visits × pt_spend): -233.44
pandas .cov() result: -233.44
What just happened?
pandas is our data table library. The .cov() method computes a covariance matrix — a table where every cell shows how two columns vary together. The diagonal cells (age with age, visits with visits) are just the variance of each column — how spread out it is on its own. The off-diagonal cells tell us how pairs of columns move together.
numpy is our fast maths library. We used it here indirectly — pandas' .cov() uses numpy under the hood for the actual calculations.
Now let's read the output. The covariance between visits_per_month and pt_spend_monthly is −233.44. That's negative — members who visit more tend to spend less on personal training. Makes sense: frequent visitors are gym regulars who don't need a PT. The covariance between age and pt_spend_monthly is +506.89 — older members spend more on personal training. But notice: is 506.89 a "big" number? Without knowing the units, it's hard to say. That's the limitation of covariance — which is exactly why we convert to correlation next.
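One quick way to convince yourself about the diagonal cells: rebuild the gym columns (copied from the table above, so this snippet runs standalone) and check each diagonal entry of `.cov()` against the column's own `.var()`.

```python
import pandas as pd
import numpy as np

# Same gym columns as above, rebuilt so this check runs standalone
df = pd.DataFrame({
    'age': [24, 35, 42, 28, 51, 33, 47, 29, 38, 55],
    'visits_per_month': [18, 12, 8, 15, 5, 14, 7, 16, 10, 4],
    'pt_spend_monthly': [0, 40, 80, 20, 120, 30, 90, 10, 50, 150],
})

cov = df.cov()
# Each diagonal cell of the covariance matrix is just that column's variance
for col in df.columns:
    assert np.isclose(cov.loc[col, col], df[col].var())
print("diagonal == per-column variance ✓")
```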
Why the Scale Problem Matters
The scenario: Your manager says: "The covariance between age and PT spend is about 507. Is that strong or weak?" You can't answer without context. Now imagine you get a second dataset where PT spend is recorded in pence instead of pounds. The relationship is identical — but the covariance would be 100× bigger (about 50,689). The number has changed; the relationship hasn't. This is the scale problem. Converting to correlation fixes it instantly.
import pandas as pd # pandas: data table library — .cov(), .corr(), and column operations
import numpy as np # numpy: maths library — manual formula demonstration
# Show how covariance changes with scale but correlation doesn't
# Original data: PT spend in pounds (£)
spend_pounds = df['pt_spend_monthly']
# Same data but converted to pence (×100) — same real-world relationship, different units
spend_pence = df['pt_spend_monthly'] * 100 # multiply every value by 100
# --- COVARIANCE: changes when units change ---
cov_pounds = df['age'].cov(spend_pounds) # covariance with spend in £
cov_pence = df['age'].cov(spend_pence) # covariance with spend in pence
print("=== THE SCALE PROBLEM ===")
print(f"Covariance (age × PT spend in £): {cov_pounds:.2f}")
print(f"Covariance (age × PT spend in pence): {cov_pence:.2f} ← 100× bigger!")
print(f"Ratio: {cov_pence/cov_pounds:.0f}× (same relationship, wildly different numbers)")
print()
# --- CORRELATION: stays exactly the same regardless of units ---
from scipy import stats # scipy: statistics library — pearsonr for correlation with p-value
corr_pounds, p_pounds = stats.pearsonr(df['age'], spend_pounds)
corr_pence, p_pence = stats.pearsonr(df['age'], spend_pence)
print("=== CORRELATION IS SCALE-FREE ===")
print(f"Correlation (age × PT spend in £): r = {corr_pounds:.4f}")
print(f"Correlation (age × PT spend in pence): r = {corr_pence:.4f} ← identical!")
print()
# Now derive correlation from covariance manually — so you can see the relationship
# Formula: r = cov(A,B) / (std(A) × std(B))
std_age = df['age'].std()
std_spend = spend_pounds.std()
r_manual = cov_pounds / (std_age * std_spend) # divide by both standard deviations
print(f"Manual derivation: {cov_pounds:.2f} / ({std_age:.2f} × {std_spend:.2f}) = {r_manual:.4f}")
print(f"This matches pearsonr: {corr_pounds:.4f} ✓")
=== THE SCALE PROBLEM ===
Covariance (age × PT spend in £): 506.89
Covariance (age × PT spend in pence): 50688.89 ← 100× bigger!
Ratio: 100× (same relationship, wildly different numbers)

=== CORRELATION IS SCALE-FREE ===
Correlation (age × PT spend in £): r = 0.9878
Correlation (age × PT spend in pence): r = 0.9878 ← identical!

Manual derivation: 506.89 / (10.36 × 49.54) = 0.9878
This matches pearsonr: 0.9878 ✓
What just happened?
pandas is doing the covariance calculations via .cov(). When we multiply PT spend by 100 (converting pounds to pence), the covariance jumps from about 507 to about 50,689 — 100 times bigger. The data is identical; only the label on the ruler changed.
scipy's stats.pearsonr() shows the correlation stays at exactly 0.9878 in both cases. Because dividing by the standard deviations cancels out the unit change.
The manual derivation at the bottom is the most important part: correlation = covariance ÷ (std_A × std_B). This isn't magic — it's simple arithmetic. Seeing this once makes the relationship between covariance and correlation unforgettable. Covariance gives you the direction and raw magnitude. Dividing by the standard deviations standardises it into a −1 to +1 scale you can always interpret.
Reading a Covariance Matrix — The Diagonal Trick
A covariance matrix looks intimidating at first. Here's all you need to know to read one:
Covariance Matrix — Gym Data
| | age | visits | pt_spend |
| age | 107.29 | −49.53 | 506.89 |
| visits | −49.53 | 23.43 | −233.44 |
| pt_spend | 506.89 | −233.44 | 2454.44 |
How to read this table:
The diagonal = each column's own variance (how spread out it is). PT spend (2454) is very spread out; visits (23) is tightly packed.
Positive off-diagonal = the two variables tend to move in the same direction. Age and PT spend (+507) — older members spend more.
Negative off-diagonal = they move in opposite directions. Visits and PT spend (−233) — frequent visitors spend less on PT.
Symmetry — the table mirrors itself. The top-right is identical to the bottom-left. You only need to read one triangle.
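Both of those reading tricks are easy to verify in code. The sketch below rebuilds the gym columns so it runs standalone, confirms the matrix equals its own transpose, and masks off the redundant upper triangle:

```python
import pandas as pd
import numpy as np

# Gym columns copied from the lesson's table, so this runs standalone
df = pd.DataFrame({
    'age': [24, 35, 42, 28, 51, 33, 47, 29, 38, 55],
    'visits_per_month': [18, 12, 8, 15, 5, 14, 7, 16, 10, 4],
    'pt_spend_monthly': [0, 40, 80, 20, 120, 30, 90, 10, 50, 150],
})

cov = df.cov()
# cov(A, B) == cov(B, A), so the matrix equals its own transpose
assert np.allclose(cov, cov.T)

# Keep only the lower triangle — the upper one carries no new information
lower = cov.where(np.tril(np.ones(cov.shape, dtype=bool)))
print(lower.round(2))   # upper-triangle cells show as NaN
```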
When Covariance Is More Useful Than Correlation
The scenario: Your data team is building a financial model for the gym chain's investment portfolio. They want to combine two revenue streams — gym membership fees and personal training — to reduce overall risk. In finance, the rule is: if two revenue streams go up and down together, combining them doesn't reduce risk much. But if they move in opposite directions, combining them smooths out the volatility. For this calculation — called portfolio optimisation — you need the raw covariance numbers, not correlation. Correlation alone isn't enough.
import pandas as pd # pandas: data table library — .cov(), .var(), and column operations
import numpy as np # numpy: maths library — matrix operations for portfolio variance
# Monthly revenue data (£000s) for two gym revenue streams over 12 months
# Membership fees are stable; PT revenue is more volatile
revenue = pd.DataFrame({
'month': range(1, 13),
'membership': [42, 43, 41, 44, 45, 43, 42, 44, 46, 45, 43, 44], # fairly steady
'pt_revenue': [18, 22, 14, 28, 32, 19, 15, 25, 35, 30, 17, 21] # more volatile
})
# Variance of each revenue stream individually
# Variance = average of squared deviations from the mean — a measure of "bumpiness"
var_membership = revenue['membership'].var() # pandas .var() — sample variance (divides by n-1)
var_pt = revenue['pt_revenue'].var()
print(f"Variance — membership fees: {var_membership:.2f} (lower = more stable)")
print(f"Variance — PT revenue: {var_pt:.2f} (higher = more volatile)")
print()
# Covariance between the two streams
cov_streams = revenue['membership'].cov(revenue['pt_revenue'])
print(f"Covariance between streams: {cov_streams:.2f}")
print()
# Portfolio variance formula:
# If you combine two revenue streams (50/50 split), the combined variance is:
# Var_portfolio = w1² × Var1 + w2² × Var2 + 2 × w1 × w2 × Cov(1,2)
# where w1 and w2 are the weights (both 0.5 for a 50/50 split)
w1, w2 = 0.5, 0.5
var_portfolio = (w1**2 * var_membership) + (w2**2 * var_pt) + (2 * w1 * w2 * cov_streams)
print(f"Portfolio variance (50/50 mix): {var_portfolio:.2f}")
print(f"Portfolio std dev (volatility): £{var_portfolio**0.5:.2f}k per month")
print()
# Compare: what if PT revenue was negatively correlated with membership?
# Simulate an inverse PT stream
revenue['pt_inverse'] = 42 - (revenue['pt_revenue'] - revenue['pt_revenue'].mean())
cov_inverse = revenue['membership'].cov(revenue['pt_inverse'])
var_portfolio_inverse = (w1**2 * var_membership) + (w2**2 * revenue['pt_inverse'].var()) + (2 * w1 * w2 * cov_inverse)
print(f"If PT revenue moved OPPOSITE to membership:")
print(f"Portfolio variance would be: {var_portfolio_inverse:.2f} ← much lower!")
print(f"Portfolio std dev: £{var_portfolio_inverse**0.5:.2f}k per month (less risk)")
Variance — membership fees: 2.09 (lower = more stable)
Variance — PT revenue: 48.18 (higher = more volatile)

Covariance between streams: 9.45

Portfolio variance (50/50 mix): 17.30
Portfolio std dev (volatility): £4.16k per month

If PT revenue moved OPPOSITE to membership:
Portfolio variance would be: 7.84 ← much lower!
Portfolio std dev: £2.80k per month (less risk)
What just happened?
pandas' .var() computes the variance of a column — how spread out the values are. PT revenue (variance 48.18) is roughly 23× more "bumpy" than membership fees (2.09). That's visible just from glancing at the numbers in the dataset.
numpy powers the arithmetic behind the portfolio variance formula. The formula itself — w1² × Var1 + w2² × Var2 + 2 × w1 × w2 × Cov — is the reason covariance can't be replaced by correlation here. The formula needs the raw covariance number, not a −1 to +1 score. You need the actual size of how the two streams move together to compute a real variance in £ terms.
The key takeaway: when PT revenue moves in the same direction as membership, combining them gives a portfolio volatility of £4.16k/month. When they move in opposite directions, volatility drops to £2.80k/month — about a third less risky. This is diversification in a formula, and covariance is what makes it computable.
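If the portfolio variance formula feels like magic, there's a direct sanity check: actually build the 50/50 combined stream and take its variance. The two routes must agree, because Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y) is an identity, not an approximation. A sketch using the same revenue numbers, rebuilt so it runs standalone:

```python
import pandas as pd
import numpy as np

# Same monthly revenue data (£000s) as above
revenue = pd.DataFrame({
    'membership': [42, 43, 41, 44, 45, 43, 42, 44, 46, 45, 43, 44],
    'pt_revenue': [18, 22, 14, 28, 32, 19, 15, 25, 35, 30, 17, 21],
})

w1 = w2 = 0.5
# Route 1: the portfolio variance formula
var_formula = (w1**2 * revenue['membership'].var()
               + w2**2 * revenue['pt_revenue'].var()
               + 2 * w1 * w2 * revenue['membership'].cov(revenue['pt_revenue']))
# Route 2: build the combined stream and measure its variance directly
combined = w1 * revenue['membership'] + w2 * revenue['pt_revenue']
assert np.isclose(var_formula, combined.var())
print(f"Both routes agree: {var_formula:.2f}")
```

The direct route is a useful habit: whenever a variance formula looks suspicious, compute the combined series and check.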
Covariance to Correlation — The Complete Comparison
The scenario: You wrap up the gym analysis with a side-by-side comparison — covariance matrix and correlation matrix for the same data. Your manager can see both, understand what each is saying, and know which one to quote in different situations.
import pandas as pd # pandas: data library — .cov() and .corr() for the two matrices
import numpy as np # numpy: maths library — standard import
numeric_cols = ['age', 'visits_per_month', 'pt_spend_monthly']
# The covariance matrix — raw, units-dependent, hard to compare across pairs
cov_mat = df[numeric_cols].cov()
# The correlation matrix — standardised, always −1 to +1, easy to compare
corr_mat = df[numeric_cols].corr() # pandas .corr() uses Pearson by default
print("=== COVARIANCE MATRIX (raw — unit-dependent) ===")
print(cov_mat.round(2))
print()
print("=== CORRELATION MATRIX (standardised — always −1 to +1) ===")
print(corr_mat.round(3))
print()
# Plain-English interpretation of each relationship
print("=== PLAIN-ENGLISH SUMMARY ===")
pairs = [
('age', 'visits_per_month'),
('age', 'pt_spend_monthly'),
('visits_per_month', 'pt_spend_monthly')
]
for a, b in pairs:
cov_val = cov_mat.loc[a, b]
corr_val = corr_mat.loc[a, b]
direction = "positive" if corr_val > 0 else "negative"
strength = "strong" if abs(corr_val) > 0.7 else "moderate" if abs(corr_val) > 0.4 else "weak"
print(f" {a} × {b}:")
print(f" Covariance = {cov_val:.2f} | Correlation = {corr_val:.3f}")
print(f" → {strength} {direction} relationship")
print()
=== COVARIANCE MATRIX (raw — unit-dependent) ===
age visits_per_month pt_spend_monthly
age 107.29 -49.53 506.89
visits_per_month -49.53 23.43 -233.44
pt_spend_monthly 506.89 -233.44 2454.44
=== CORRELATION MATRIX (standardised — always −1 to +1) ===
age visits_per_month pt_spend_monthly
age 1.000 -0.988 0.988
visits_per_month -0.988 1.000 -0.973
pt_spend_monthly 0.988 -0.973 1.000
=== PLAIN-ENGLISH SUMMARY ===
age × visits_per_month:
Covariance = -49.53 | Correlation = -0.988
→ strong negative relationship
age × pt_spend_monthly:
Covariance = 506.89 | Correlation = 0.988
→ strong positive relationship
visits_per_month × pt_spend_monthly:
Covariance = -233.44 | Correlation = -0.973
→ strong negative relationship
What just happened?
pandas makes both matrices trivial: .cov() and .corr() are called identically — the only difference is which number lands in each cell. The covariance matrix values range from about 23 to 2,454 — impossible to compare at a glance. The correlation matrix values are all between −1 and +1, making every cell instantly readable.
The correlations here are remarkably strong — age vs visits is −0.988 (almost perfectly negative: the older the member, the fewer times they visit). That pattern was in the covariance matrix too (−49.53), but you couldn't tell it was nearly perfect without doing the mental maths to compare it against the other cells. Correlation makes that comparison instant.
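In fact the whole correlation matrix can be recovered from the covariance matrix in one step: divide every cell by the product of the two relevant standard deviations. A quick verification sketch (rebuilding the gym data so it runs standalone):

```python
import pandas as pd
import numpy as np

# Gym columns copied from the lesson's table
df = pd.DataFrame({
    'age': [24, 35, 42, 28, 51, 33, 47, 29, 38, 55],
    'visits_per_month': [18, 12, 8, 15, 5, 14, 7, 16, 10, 4],
    'pt_spend_monthly': [0, 40, 80, 20, 120, 30, 90, 10, 50, 150],
})

cov = df.cov()
stds = df.std()
# Standardise every cell at once: divide by std(row) × std(column)
corr_from_cov = cov / np.outer(stds, stds)
assert np.allclose(corr_from_cov, df.corr())
print("correlation matrix recovered from the covariance matrix ✓")
```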
Teacher's Note
In day-to-day EDA, you will use correlation far more than covariance. Correlation is easier to read, easier to explain to a non-technical stakeholder, and doesn't depend on units. If your manager asks "how related are these two things?", the answer is a correlation coefficient.
But covariance isn't just a stepping stone to correlation. It's the right tool whenever you need the size of how two things move together in real units — portfolio risk calculations, principal component analysis (coming in Lesson 39), and multivariate statistical models all need covariance, not just correlation. Learn both; know when to reach for each.
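As a small preview of the PCA connection (using synthetic data invented for illustration): one standard derivation of PCA eigen-decomposes the covariance matrix — the eigenvectors are the principal directions, and the eigenvalues report how much raw variance each direction carries, which a −1 to +1 correlation score could not supply.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two correlated features — synthetic, purely for illustration
x = rng.normal(size=300)
y = 0.8 * x + 0.3 * rng.normal(size=300)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)                 # 2×2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # sorted ascending

# PCA's principal components are these eigenvectors; the eigenvalues
# measure the variance along each direction, in the data's own units
print(eigenvalues)
assert eigenvalues[-1] > eigenvalues[0]          # one dominant direction
```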
Practice Questions
1. You have covariance values of 450 and 12 for two different pairs of variables. You can't tell which relationship is stronger. Which measure would make them instantly comparable?
2. What value appears on the diagonal of a covariance matrix?
3. The covariance between gym visits per month and personal training spend is −233. Does this mean members who visit more tend to spend more or less on personal training?
Quiz
1. You convert a column from metres to centimetres (multiply by 100) and recompute covariance and correlation with another column. What happens?
2. A finance analyst is calculating the combined risk of two revenue streams using the portfolio variance formula. Should they use covariance or correlation?
3. What is the mathematical relationship between covariance and Pearson correlation?
Up Next · Lesson 21
Feature Relationships
Go beyond pairwise numbers — learn to map out the full web of relationships between all your features and spot which ones are genuinely useful for prediction.