EDA Course
Multivariate Analysis
Bivariate analysis looks at two variables at a time. But real-world patterns involve three, four, five variables interacting simultaneously. Multivariate analysis is where you stop examining columns in pairs and start asking: how do multiple things work together?
Why Two Variables Are Never Enough
Here's a classic trap. You find that ice cream sales and drowning deaths are strongly correlated. Bivariate analysis flags it as a significant relationship. But add a third variable — temperature — and the entire picture changes. Both ice cream sales and outdoor swimming increase in hot weather. The correlation between the first two isn't a relationship at all; it's a shared response to a third variable.
This is called a confounding variable — and you can only find it by looking at multiple variables together. Multivariate analysis is the tool that reveals these hidden structures in data before your model learns spurious relationships that don't generalise.
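You can watch this mechanism in a few lines of numpy. The numbers below are synthetic — invented purely to mimic the ice-cream story — but the structure is the real one: both series are driven by temperature, and the "relationship" between them evaporates once temperature's influence is removed.

```python
import numpy as np  # numpy: numerical library — random data, line fits, correlations

rng = np.random.default_rng(0)            # seeded so the sketch is reproducible
temp = rng.uniform(10, 35, 200)           # the hidden third variable: temperature

# Both series are driven by temperature, not by each other
ice_cream = 20 * temp + rng.normal(0, 40, 200)
drownings = 0.5 * temp + rng.normal(0, 2, 200)

# Raw bivariate correlation looks like a real relationship
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

# Remove temperature's influence from each series (fit a line, keep the residuals)
resid_ice = ice_cream - np.polyval(np.polyfit(temp, ice_cream, 1), temp)
resid_drown = drownings - np.polyval(np.polyfit(temp, drownings, 1), temp)
r_controlled = np.corrcoef(resid_ice, resid_drown)[0, 1]

print(f"raw r = {r_raw:.2f}, controlling for temperature r = {r_controlled:.2f}")
```

The raw correlation comes out strongly positive; the controlled correlation sits near zero. This residual trick is exactly the partial correlation technique developed later in this lesson.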
Confounding Variables
A third variable that influences both of the variables you're studying, creating a false impression of a direct relationship between them.
Interaction Effects
The effect of one variable on an outcome depends on the value of another variable. Example: a drug works well in young patients but poorly in older ones.
Multicollinearity
When multiple features are so correlated with each other that a model can't reliably estimate their individual contributions. Identified in multivariate analysis, disastrous in linear models.
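A quick synthetic sketch of why multicollinearity is "disastrous in linear models": feed two near-duplicate features to a least-squares fit. The data and coefficients below are invented for illustration only.

```python
import numpy as np  # numpy: numerical library — least squares via np.linalg.lstsq

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1: r(x1, x2) ≈ 1.0

print(f"r(x1, x2) = {np.corrcoef(x1, x2)[0, 1]:.4f}")

# Fit y = b1*x1 + b2*x2 twice, with two independent noise draws.
# The individual coefficients are typically unstable across draws,
# but their SUM stays near the true total effect (1 + 1 = 2.0).
for trial in range(2):
    y = x1 + x2 + rng.normal(scale=0.5, size=n)
    X = np.column_stack([x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"trial {trial}: b1 = {b[0]:7.2f}  b2 = {b[1]:7.2f}  b1 + b2 = {b[0] + b[1]:.2f}")
```

The model can estimate the combined contribution just fine — it simply cannot decide how to split it between two features that carry the same information. That's why multicollinearity is a feature-selection problem, not a prediction problem.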
Grouped Multivariate Summaries
The scenario: You're a senior analyst at the same healthcare company. The readmission model team now wants to understand patterns across three dimensions at once: admission type, primary diagnosis, and average length of stay. Do cardiac emergency patients stay longer than cardiac elective patients? Does the diagnosis matter more than the admission type? A simple groupby on multiple columns reveals the interaction.
import pandas as pd  # pandas: Python's core data table library — multi-column groupby and pivot tables
import numpy as np   # numpy: numerical library — standard import alongside pandas for any EDA script

# Full patient dataset continuing from Lessons 16 & 17
df = pd.DataFrame({
    'patient_id': range(1001, 1013),
    'age': [45, 62, 38, 71, 55, 48, 67, 29, 83, 52, 44, 61],
    'bmi': [24.1, 31.8, 22.4, 28.6, 27.3, 35.2, 27.9, 21.3, 33.1, 26.5, 23.8, 29.4],
    'systolic_bp': [118, 145, 122, 158, 134, 142, 128, 115, 172, 136, 119, 141],
    'num_admissions': [1, 4, 1, 6, 2, 3, 5, 1, 8, 2, 1, 3],
    'los_days': [3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6],
    'admission_type': ['Emergency', 'Elective', 'Emergency', 'Emergency', 'Urgent',
                       'Elective', 'Emergency', 'Elective', 'Emergency', 'Urgent',
                       'Emergency', 'Elective'],
    'primary_diagnosis': ['Cardiac', 'Respiratory', 'Orthopaedic', 'Cardiac', 'Diabetes',
                          'Respiratory', 'Cardiac', 'Orthopaedic', 'Cardiac', 'Diabetes',
                          'Respiratory', 'Cardiac'],
    'readmitted': [0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
})
# Multi-column groupby: group by TWO categorical columns simultaneously
# This reveals how the combination of admission type + diagnosis affects length of stay
multi_group = (
    df.groupby(['admission_type', 'primary_diagnosis'])['los_days']
      .agg(n='count', mean_los='mean', median_los='median')  # named aggregations — clean column names
      .round(2)
)
print("=== LOS BY ADMISSION TYPE × DIAGNOSIS ===")
print(multi_group.to_string())
print()
# Pivot table: same data, reshaped so admission type = rows, diagnosis = columns
# Much easier to scan for patterns than a long multi-index Series
pivot = pd.pivot_table(
    df,
    values='los_days',             # the numeric value to aggregate
    index='admission_type',        # rows
    columns='primary_diagnosis',   # columns
    aggfunc='mean',                # aggregation function
    fill_value='-'                 # fill empty cells (no patients in that combination) with '-'
)
# Caveat: a string fill_value makes affected columns object dtype, so .round(2)
# silently skips them — harmless here, but prefer NaN + .fillna('-') after rounding
print("=== PIVOT TABLE: Mean LOS (rows=admission, cols=diagnosis) ===")
print(pivot.round(2))
=== LOS BY ADMISSION TYPE × DIAGNOSIS ===
                                  n  mean_los  median_los
admission_type primary_diagnosis
Elective       Cardiac            1      6.00        6.00
               Orthopaedic        1      2.00        2.00
               Respiratory        2      6.50        6.50
Emergency      Cardiac            4      9.75       10.50
               Orthopaedic        1      2.00        2.00
               Respiratory        1      3.00        3.00
Urgent         Diabetes           2      4.50        4.50
=== PIVOT TABLE: Mean LOS (rows=admission, cols=diagnosis) ===
primary_diagnosis Cardiac Diabetes Orthopaedic Respiratory
admission_type
Elective              6.0        -         2.0         6.5
Emergency            9.75        -         2.0         3.0
Urgent                  -      4.5           -           -
What just happened?
pandas is Python's core data table library. .groupby() accepts a list of column names to create a multi-level index — splitting the data by every unique combination of the listed variables. This is how you explore interactions: instead of "what's the average LOS for emergency patients?" you ask "what's the average LOS for emergency cardiac patients specifically?"
pd.pivot_table() reshapes the same grouped data into a 2D grid that's far easier to scan. Rows are one categorical variable, columns are another, and each cell holds the aggregated value — here the mean LOS. The fill_value='-' argument makes empty combinations explicit rather than just absent.
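A variant worth knowing — shown here on a small invented frame, not the patient data — is margins=True, which appends an 'All' row and column of overall aggregates so every cell can be read against the grand mean at a glance.

```python
import pandas as pd  # pandas: pivot_table with margins for row/column totals

# Small illustrative frame (same shape of question as the patient data)
demo = pd.DataFrame({
    'admission_type': ['Emergency', 'Emergency', 'Elective', 'Elective', 'Urgent'],
    'primary_diagnosis': ['Cardiac', 'Cardiac', 'Respiratory', 'Cardiac', 'Diabetes'],
    'los_days': [12, 9, 7, 6, 4],
})

# margins=True adds an 'All' row and column holding the overall means,
# computed ignoring the empty (NaN) combinations
pivot = pd.pivot_table(
    demo,
    values='los_days',
    index='admission_type',
    columns='primary_diagnosis',
    aggfunc='mean',
    margins=True,        # append the 'All' totals
    margins_name='All'
)
print(pivot.round(2))
```

The bottom-right cell is the grand mean of los_days, which makes "how far above average is this combination?" a one-glance question.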
The pivot table reveals what bivariate analysis hid: Emergency Cardiac patients average 9.75 days — well above the overall mean of about 6.2 days — while Orthopaedic patients average just 2 days regardless of admission type. The diagnosis and admission type interact: cardiac stays are long in any context (even Elective Cardiac averages 6 days), but cardiac + emergency is the highest-risk combination. That's a multivariate insight.
Partial Correlation — Controlling for a Third Variable
The scenario: In Lesson 17, you found that age and systolic blood pressure correlate at 0.931. But your clinical consultant raises a concern: both age and blood pressure increase with the number of prior admissions — patients who've been admitted many times are typically older and have worse cardiovascular health. Is the age–blood pressure correlation real, or is it driven by num_admissions acting as a confounder? Partial correlation answers this.
import pandas as pd # pandas: data library — .corr() and DataFrame operations
import numpy as np # numpy: numerical library — linear algebra via np.linalg for partial correlation
from scipy import stats # scipy.stats: statistical functions — used here for Pearson r confirmation
# Partial correlation: correlation between X and Y after removing the effect of Z
# Method: regress X on Z, regress Y on Z, correlate the two sets of residuals
# The residuals represent what's "left over" in X and Y after accounting for Z
def partial_correlation(df, x, y, z):
    """
    Compute partial correlation between x and y, controlling for z.
    Returns the partial correlation coefficient and its p-value.
    """
    # Regress x on z: fit a linear model x ~ z, get residuals
    # Residuals = actual x - predicted x from z
    # np.polyfit(z, x, 1) fits a degree-1 polynomial (straight line): returns [slope, intercept]
    slope_xz, intercept_xz = np.polyfit(df[z], df[x], 1)     # fit line z → x
    residuals_x = df[x] - (slope_xz * df[z] + intercept_xz)  # x residuals after removing z's influence

    # Regress y on z: same process for y
    slope_yz, intercept_yz = np.polyfit(df[z], df[y], 1)     # fit line z → y
    residuals_y = df[y] - (slope_yz * df[z] + intercept_yz)  # y residuals after removing z's influence

    # Correlate the residuals: this is the partial correlation
    # stats.pearsonr() returns (correlation_coefficient, p_value)
    r, p = stats.pearsonr(residuals_x, residuals_y)
    return round(r, 3), round(p, 4)
# Raw (bivariate) correlation between age and systolic_bp
r_raw, p_raw = stats.pearsonr(df['age'], df['systolic_bp'])
print(f"Raw correlation (age × systolic_bp): r = {r_raw:.3f} (p={p_raw:.4f})")
# Partial correlation controlling for num_admissions
r_partial, p_partial = partial_correlation(df, 'age', 'systolic_bp', 'num_admissions')
print(f"Partial correlation (controlling for num_admissions): r = {r_partial} (p={p_partial})")
print()
print("Interpretation:")
if abs(r_partial) < abs(r_raw) * 0.6:
    print("  Correlation dropped substantially — num_admissions is a partial confounder.")
else:
    print("  Correlation held — the relationship between age and systolic_bp is likely direct.")
Raw correlation (age × systolic_bp): r = 0.931 (p=0.0000)
Partial correlation (controlling for num_admissions): r = 0.778 (p=0.0032)

Interpretation:
  Correlation held — the relationship between age and systolic_bp is likely direct.
What just happened?
numpy is Python's numerical computing library. Here we use np.polyfit(z, x, 1) to fit a first-degree polynomial (a straight line) from the control variable z to the target variable x. The 1 means degree-1 — just slope and intercept. Subtracting the predicted values from the actual values gives us the residuals — the part of x that can't be explained by z.
scipy's stats.pearsonr() computes the Pearson correlation between two arrays and returns both the coefficient and a p-value. We use it here to correlate the two sets of residuals — giving us the partial correlation.
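If you prefer algebra to residuals: the same quantity comes from the textbook formula r_xy·z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)), built from the three pairwise correlations. A quick sketch on synthetic data (invented coefficients, seeded noise) confirms the two routes agree:

```python
import numpy as np  # numpy: fits, correlations, synthetic data

rng = np.random.default_rng(7)
z = rng.normal(size=500)                                  # the control variable
x = 0.8 * z + rng.normal(scale=0.6, size=500)             # x depends on z
y = 0.5 * z + 0.3 * x + rng.normal(scale=0.7, size=500)   # y depends on z AND directly on x

def r(a, b):
    """Plain Pearson correlation via the correlation matrix."""
    return np.corrcoef(a, b)[0, 1]

# Route 1 — residual method (regress on z, correlate the leftovers):
rx = x - np.polyval(np.polyfit(z, x, 1), z)   # x with z's influence removed
ry = y - np.polyval(np.polyfit(z, y, 1), z)   # y with z's influence removed
r_resid = r(rx, ry)

# Route 2 — closed-form formula from the three pairwise correlations:
rxy, rxz, ryz = r(x, y), r(x, z), r(y, z)
r_formula = (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

print(f"residual method: {r_resid:.6f}   formula: {r_formula:.6f}")
```

The two values match to floating-point precision — the residual method and the formula are the same mathematics in two costumes.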
The raw correlation was 0.931. After controlling for num_admissions, it drops to 0.778 — still strong and still significant (p=0.003). The relationship between age and blood pressure is largely a direct biological one, not a confound. But the drop from 0.931 to 0.778 tells us num_admissions does explain some of the shared variance. This is the kind of nuance you only get with multivariate thinking.
Conditional Distributions — Splitting by a Third Variable
The scenario: Your lead asks a sharper question: does the relationship between age and length of stay hold for both readmitted and non-readmitted patients, or is it driven entirely by one group? If the correlation only exists among readmitted patients, it's not a general feature — it's a characteristic of the high-risk subgroup. This is conditional analysis: examine a relationship within each level of a third variable.
import pandas as pd # pandas: data library — boolean indexing for subsetting by group
import numpy as np # numpy: numerical library — standard import
from scipy import stats # scipy.stats: Pearson correlation within subgroups
# Conditional correlation: compute age × los_days correlation separately for each readmission class
print("=== CONDITIONAL CORRELATION: age × los_days ===")
print()
for label, group_df in df.groupby('readmitted'):  # loop over each unique value of the conditioning variable
    group_label = "Readmitted" if label == 1 else "Not readmitted"
    r, p = stats.pearsonr(group_df['age'], group_df['los_days'])
    n = len(group_df)
    print(f"  {group_label} (n={n}): r = {r:.3f} (p = {p:.4f})")
print()
# Conditional group statistics: mean of multiple features, split by readmission AND admission type
print("=== MEAN FEATURES BY READMISSION × ADMISSION TYPE ===")
cond_summary = (
    df.groupby(['readmitted', 'admission_type'])     # two conditioning variables
      [['age', 'num_admissions', 'los_days']]        # subset of features to summarise
      .mean()
      .round(1)
)
print(cond_summary.to_string())
print()
# Simpson's Paradox check: does the overall trend hold within each admission type?
print("=== CORRELATION: num_admissions × los_days BY ADMISSION TYPE ===")
for adm_type, grp in df.groupby('admission_type'):
    if len(grp) > 2:  # need at least 3 rows for meaningful correlation
        r, p = stats.pearsonr(grp['num_admissions'], grp['los_days'])
        print(f"  {adm_type} (n={len(grp)}): r = {r:.3f} (p = {p:.4f})")
=== CONDITIONAL CORRELATION: age × los_days ===
Not readmitted (n=6): r = 0.796 (p = 0.0594)
Readmitted (n=6): r = 0.939 (p = 0.0056)
=== MEAN FEATURES BY READMISSION × ADMISSION TYPE ===
                            age  num_admissions  los_days
readmitted admission_type
0          Elective        29.0             1.0       2.0
           Emergency       42.3             1.0       2.7
           Urgent          53.5             2.0       4.5
1          Elective        57.0             3.3       6.3
           Emergency       73.7             6.3      12.0
=== CORRELATION: num_admissions × los_days BY ADMISSION TYPE ===
Elective (n=4): r = 0.866 (p = 0.1340)
Emergency (n=6): r = 0.980 (p = 0.0007)
What just happened?
pandas' .groupby() used as an iterator yields (group_label, group_dataframe) pairs — one per unique value of the grouping column. This lets you run any analysis inside the loop, applied to each subgroup separately. Two-level groupby (['readmitted', 'admission_type']) creates every combination as its own row in the output.
scipy's stats.pearsonr() runs per subgroup, giving a separate r and p-value for each conditional slice. This is more informative than a single overall correlation.
The conditional results are telling: the age × los_days correlation is stronger among readmitted patients (r=0.939) than non-readmitted (r=0.796). The combined summary table shows readmitted emergency patients average age 73.7, num_admissions 6.3, and los_days 12.0 — all dramatically higher than their non-readmitted counterparts. The correlation between num_admissions and los_days is r=0.980 for emergency patients specifically — almost perfectly linear within that subgroup.
Simpson's Paradox — When Aggregates Lie
Simpson's Paradox is one of the most important concepts in multivariate analysis. It happens when a trend that appears in several different groups of data reverses or disappears when the groups are combined. It's not a statistical anomaly — it's a symptom of ignoring a confounding variable.
A Classic Example
A hospital reports that Treatment A has a higher overall survival rate than Treatment B. Sounds clear. But when you split by patient severity — mild cases and severe cases — Treatment B has a higher survival rate in both groups separately.
How? Treatment A was disproportionately given to mild cases (who survive anyway), inflating its overall rate. Treatment B was used for severe cases. The aggregate comparison is comparing apples to oranges. The third variable — patient severity — was hidden.
The defence against Simpson's Paradox is always the same: never trust an aggregate trend without checking whether it holds within subgroups. That's what multivariate analysis is for.
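Here is the hospital example as a minimal numeric sketch — the counts are invented to make the reversal obvious:

```python
import pandas as pd  # pandas: groupby sums to aggregate the subgroup counts

# Synthetic counts: survivors / patients per (treatment, severity) cell.
# Treatment A is mostly given to mild cases; B is mostly given to severe cases.
data = pd.DataFrame({
    'treatment': ['A', 'A', 'B', 'B'],
    'severity':  ['Mild', 'Severe', 'Mild', 'Severe'],
    'patients':  [900, 100, 100, 900],
    'survived':  [810,  20,  95, 400],
})

# Aggregate rates: A looks better overall (0.830 vs 0.495)
overall = data.groupby('treatment')[['survived', 'patients']].sum()
overall['rate'] = (overall['survived'] / overall['patients']).round(3)
print(overall[['rate']])

# Subgroup rates: B is better in BOTH severity groups (0.95 vs 0.90, 0.444 vs 0.20)
by_sev = data.set_index(['treatment', 'severity'])
by_sev['rate'] = (by_sev['survived'] / by_sev['patients']).round(3)
print(by_sev[['rate']])
```

The aggregate and the subgroups genuinely disagree — and the subgroup view is the honest one, because severity drives both the treatment assignment and the outcome.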
Feature Interaction Matrix
The scenario: Before handing the dataset to the modelling team, you want to produce a complete interaction report — not just pairwise correlations, but a systematic check of how each numeric feature relates to the target across different categorical subgroups. This is the multivariate equivalent of the univariate profile function from Lesson 16.
import pandas as pd # pandas: data library — groupby, pivot_table, and corr for multi-feature summaries
import numpy as np # numpy: numerical library — np.abs for sorting, array ops
from scipy import stats # scipy.stats: pearsonr for within-group correlations
def interaction_report(dataframe, numeric_features, group_col, target_col):
    """
    For each numeric feature, shows:
      1. Correlation with target overall
      2. Correlation with target within each group
      3. Group means — to spot where the target is highest
    """
    print(f"INTERACTION REPORT: features × '{target_col}', grouped by '{group_col}'")
    print("=" * 62)
    for feat in numeric_features:
        # Overall correlation with target: drop NaN pairs, keeping the columns aligned
        valid_all = dataframe[[feat, target_col]].dropna()
        r_all, _ = stats.pearsonr(valid_all[feat], valid_all[target_col])
        print(f"\n  Feature: {feat} (overall r with {target_col} = {r_all:.3f})")
        # Within-group correlation + group mean comparison
        for grp_label, grp_df in dataframe.groupby(group_col):
            valid = grp_df[[feat, target_col]].dropna()
            grp_mean_feat = round(valid[feat].mean(), 2)
            grp_mean_target = round(valid[target_col].mean(), 2)
            if len(valid) > 2:  # need n > 2 for pearsonr
                r_grp, p_grp = stats.pearsonr(valid[feat], valid[target_col])
                print(f"    [{grp_label}] mean {feat}={grp_mean_feat} "
                      f"mean {target_col}={grp_mean_target} "
                      f"r={r_grp:.3f} (p={p_grp:.3f})")
            else:
                print(f"    [{grp_label}] mean {feat}={grp_mean_feat} "
                      f"mean {target_col}={grp_mean_target} "
                      f"(n too small for corr)")

interaction_report(
    df,
    numeric_features=['age', 'num_admissions', 'systolic_bp'],
    group_col='admission_type',
    target_col='los_days'
)
INTERACTION REPORT: features × 'los_days', grouped by 'admission_type'
==============================================================
Feature: age (overall r with los_days = 0.881)
[Elective] mean age=48.50 mean los_days=4.50 r=0.799 (p=0.201)
[Emergency] mean age=63.67 mean los_days=7.83 r=0.976 (p=0.001)
[Urgent] (n too small for corr)
Feature: num_admissions (overall r with los_days = 0.979)
[Elective] mean num_admissions=2.25 mean los_days=4.50 r=0.866 (p=0.134)
[Emergency] mean num_admissions=3.83 mean los_days=7.83 r=0.980 (p=0.001)
[Urgent] (n too small for corr)
Feature: systolic_bp (overall r with los_days = 0.836)
[Elective] mean systolic_bp=129.50 mean los_days=4.50 r=0.955 (p=0.045)
[Emergency] mean systolic_bp=140.50 mean los_days=7.83 r=0.915 (p=0.010)
[Urgent] (n too small for corr)
What just happened?
pandas provides the .groupby() iterator pattern again — looping over group labels and their DataFrames. Combined with .dropna() for clean pairs and index alignment to keep the target column in sync, this builds a per-feature, per-group summary table.
scipy's stats.pearsonr() runs once per feature-group combination, giving us the conditional correlations that single-column analysis misses entirely.
The age × los_days relationship is much stronger within Emergency (r=0.976) than Elective (r=0.799) — the feature interacts with admission type. For systolic_bp, both groups show strong correlations and both are significant. This interaction report is the kind of output that goes directly into a modelling team's feature selection meeting.
The Multivariate Analysis Toolkit at a Glance
| Technique | What it answers | Key method |
|---|---|---|
| Multi-column groupby | How does a numeric outcome vary across combinations of two categories? | .groupby([col1, col2]) |
| Pivot table | Same as above, displayed as a readable 2D grid | pd.pivot_table() |
| Partial correlation | Does the correlation between X and Y survive after removing Z's influence? | np.polyfit() + residuals |
| Conditional distribution | Does a relationship hold within each level of a third variable? | .groupby() iterator |
| Interaction report | Systematically checks whether feature–target correlations differ across groups | Custom loop function |
Teacher's Note
Multivariate analysis doesn't require a machine learning model — it just requires the discipline to look at more than two things at once. The tools are still groupby, corr, and boolean indexing. What changes is the question you're asking.
A good habit: whenever you find a strong bivariate relationship, immediately ask "what third variable might explain this?" Run the partial correlation. Check whether the relationship holds across subgroups. That one extra step separates a junior analyst who reports correlations from a senior analyst who understands what they mean.
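That habit can be packaged as a small helper — a sketch only, with an invented name (confound_check) and synthetic demo data — that prints the raw, partial, and conditional correlations in one call:

```python
import numpy as np       # numpy: residual fits and random demo data
import pandas as pd      # pandas: DataFrame and groupby
from scipy import stats  # scipy.stats: pearsonr

def confound_check(data, x, y, control, group_col):
    """Sketch: raw r, partial r controlling for `control`, and conditional r
    within each level of `group_col`. Helper and column names are illustrative."""
    r_raw, _ = stats.pearsonr(data[x], data[y])
    # Partial correlation via residuals, same method as earlier in the lesson
    rx = data[x] - np.polyval(np.polyfit(data[control], data[x], 1), data[control])
    ry = data[y] - np.polyval(np.polyfit(data[control], data[y], 1), data[control])
    r_partial, _ = stats.pearsonr(rx, ry)
    print(f"raw r = {r_raw:.3f}   partial r (controlling for {control}) = {r_partial:.3f}")
    for label, grp in data.groupby(group_col):
        if len(grp) > 2:  # pearsonr needs at least 3 points
            r_g, _ = stats.pearsonr(grp[x], grp[y])
            print(f"  within {group_col}={label} (n={len(grp)}): r = {r_g:.3f}")
    return round(r_raw, 3), round(r_partial, 3)

# Synthetic demo: x and y share a hidden driver z, so the raw correlation is
# strong but the partial correlation collapses towards zero
rng = np.random.default_rng(1)
n = 60
z = rng.normal(size=n)
demo = pd.DataFrame({
    'x': z + rng.normal(scale=0.5, size=n),
    'y': z + rng.normal(scale=0.5, size=n),
    'z': z,
    'grp': np.where(rng.random(n) < 0.5, 'A', 'B'),
})
r_raw_demo, r_partial_demo = confound_check(demo, 'x', 'y', control='z', group_col='grp')
```

One function call, three views of the same relationship — exactly the extra step described above.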
Practice Questions
1. Which pandas function reshapes grouped data into a 2D grid with one categorical variable as rows, another as columns, and an aggregated numeric value in each cell?
2. What is the term for a third variable that influences both of the variables being studied, creating a false impression of a direct relationship between them?
3. What technique measures the correlation between two variables after statistically removing the influence of a third variable?
Quiz
1. Treatment A shows a higher overall survival rate than Treatment B, but Treatment B has a higher survival rate in both mild and severe patient groups separately. What phenomenon is this?
2. The overall correlation between age and los_days is r=0.881. Within Emergency patients it is r=0.976, within Elective patients it is r=0.799. What does this tell you?
3. What is the correct method for computing a partial correlation between X and Y, controlling for Z?
Up Next · Lesson 19
Correlation Analysis
Go deeper on correlation — Pearson vs Spearman vs Kendall, when each applies, and how to build a correlation analysis that won't mislead you.