EDA Lesson 17 – Bivariate Analysis | Dataplexa
Intermediate Level · Lesson 17

Bivariate Analysis

Univariate analysis tells you what each column looks like in isolation. Bivariate analysis is the moment things get interesting — you start asking whether two variables move together, whether one predicts the other, and whether the relationships you assumed actually exist in the data.

The Four Bivariate Combinations

Every bivariate analysis involves two columns. The combination of types determines which technique you use — there are four possible pairings and each has its own toolkit:

Numeric × Numeric

Correlation, scatter patterns, linear relationships. Does one variable rise as the other rises?

→ .corr(), scatter plot

Numeric × Categorical

Group comparisons. Does the numeric variable differ meaningfully across categories?

→ .groupby(), box plot

Categorical × Categorical

Cross-tabulation. How do two categorical variables co-occur? Are they independent?

→ pd.crosstab(), chi-square

Numeric × Binary Target

How does a feature distribute differently between the two outcome classes? Essential for classification EDA.

→ .groupby(), point-biserial corr
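The pairing logic above can be sketched as a small dispatch helper. `bivariate_kind` is a hypothetical name, not a pandas function — it simply classifies a column pair by dtype so you know which toolkit applies:

```python
import pandas as pd

def bivariate_kind(df: pd.DataFrame, col_a: str, col_b: str) -> str:
    """Classify a column pair into one of the bivariate combinations."""
    a_num = pd.api.types.is_numeric_dtype(df[col_a])
    b_num = pd.api.types.is_numeric_dtype(df[col_b])
    if a_num and b_num:
        return "numeric x numeric"          # -> .corr(), scatter plot
    if not a_num and not b_num:
        return "categorical x categorical"  # -> pd.crosstab(), chi-square
    return "numeric x categorical"          # -> .groupby(), box plot

# Note: a 0/1 binary target stored as integers counts as numeric here —
# the numeric x binary case is a special case of numeric x numeric
df = pd.DataFrame({'age': [45, 62], 'sex': ['F', 'M'], 'bmi': [24.1, 31.8]})
print(bivariate_kind(df, 'age', 'bmi'))   # numeric x numeric
print(bivariate_kind(df, 'age', 'sex'))   # numeric x categorical
```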

Numeric × Numeric — Correlation Deep Dive

The scenario: You're back at the healthcare analytics team. The lead data scientist is building a readmission risk model and wants to know which numeric features are most correlated with length of stay — and whether any features are so correlated with each other that one becomes redundant. You need a full correlation matrix plus a function that flags the strongest and most suspicious pairs.

import pandas as pd      # pandas: Python's core data table library — .corr() and DataFrame manipulation
import numpy as np       # numpy: numerical library — np.nan for missing values, np.abs for absolute values

# Patient dataset with numeric features
df = pd.DataFrame({
    'age':           [45, 62, 38, 71, 55, 48, 67, 29, 83, 52, 44, 61],
    'bmi':           [24.1, 31.8, 22.4, 28.6, 27.3, 35.2, 27.9, 21.3, 33.1, 26.5, 23.8, 29.4],
    'systolic_bp':   [118, 145, 122, 158, 134, 142, 128, 115, 172, 136, 119, 141],
    'num_admissions':[1, 4, 1, 6, 2, 3, 5, 1, 8, 2, 1, 3],
    'los_days':      [3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6]
})

# Pearson correlation matrix — the default for .corr()
# Pearson measures LINEAR relationships between numeric columns
# Values range from -1 (perfect negative) through 0 (none) to +1 (perfect positive)
corr_matrix = df.corr(method='pearson')    # method='pearson' is the default — shown explicitly for clarity
print("=== PEARSON CORRELATION MATRIX ===")
print(corr_matrix.round(3))
print()

# Extract only the correlations WITH the target variable (los_days)
# Drop the self-correlation (los_days vs los_days = 1.0) — not useful
target_corr = (
    corr_matrix['los_days']
    .drop('los_days')                              # remove the 1.0 self-correlation
    .sort_values(key=np.abs, ascending=False)      # sort by absolute value — negative corrs matter too
)
print("=== CORRELATIONS WITH TARGET (los_days) ===")
print(target_corr.round(3))
print()

# Flag highly correlated feature PAIRS (potential multicollinearity)
# We look at the upper triangle only — the matrix is symmetric, lower half is redundant
threshold = 0.7                                    # correlation above 0.7 is considered high
print(f"=== HIGHLY CORRELATED PAIRS (|r| > {threshold}) ===")
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):       # upper triangle only — j starts after i
        r = corr_matrix.iloc[i, j]
        if abs(r) > threshold:
            col_a = corr_matrix.columns[i]
            col_b = corr_matrix.columns[j]
            print(f"  {col_a}  ×  {col_b}:  r = {r:.3f}")

What just happened?

pandas is Python's core data table library. .corr(method='pearson') computes the full Pearson correlation matrix — every numeric column versus every other numeric column — and returns it as a square DataFrame. Pearson measures linear relationships: it tells you how closely two variables move together in a straight line. It ignores NaN values automatically by using pairwise complete observations.
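A quick sketch of that pairwise-complete behaviour with toy data — the row containing a NaN is dropped only for pairs that involve it, so the other correlations still use all rows:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],      # perfectly linear with a
    'c': [1.0, np.nan, 2.0, 5.0],   # one missing value
})
m = demo.corr()                      # Pearson by default; NaNs handled pairwise
print(m.round(3))
# a vs b still uses all 4 rows (r = 1.0); pairs involving c use the 3 complete rows
```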

numpy supplies np.abs as the sort key — this lets us rank correlations by their magnitude regardless of sign, so a strong negative correlation (−0.9) ranks above a weak positive one (0.3).
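For example, a hypothetical series of correlations sorts by magnitude like this:

```python
import numpy as np
import pandas as pd

corrs = pd.Series({'x': -0.9, 'y': 0.3, 'z': 0.7})
ranked = corrs.sort_values(key=np.abs, ascending=False)
print(ranked)   # x (-0.9) first, then z (0.7), then y (0.3)
```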

The findings are striking: num_admissions correlates with los_days at 0.979 — almost perfectly. That's the strongest predictor in the dataset. But age and systolic_bp correlate with each other at 0.931 — those two features are nearly redundant. Including both in a linear model will cause multicollinearity. We'll tackle that in Lesson 25.

Correlation Heatmap — Visual Reference

A correlation matrix as numbers is dense. As a colour grid it becomes instantly readable — dark colours mean strong relationships, light means weak.

Pearson Correlation Heatmap — Patient Dataset

                   age    bmi   sys_bp   n_adm    los
age               1.00   0.53    0.93    0.87   0.88
bmi               0.53   1.00    0.44    0.44   0.41
systolic_bp       0.93   0.44    1.00    0.83   0.84
num_admissions    0.87   0.44    0.83    1.00   0.98
los_days          0.88   0.41    0.84    0.98   1.00

Legend: Weak (|r| 0.0–0.5) · Moderate (0.5–0.7) · Strong (>0.7)
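A grid like this can be rendered directly from the correlation matrix. The sketch below re-creates the patient DataFrame so it runs standalone, and uses plain matplotlib imshow — seaborn's heatmap() is the usual one-liner shortcut, but isn't required:

```python
import matplotlib
matplotlib.use('Agg')            # headless backend — render to a file, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'age':            [45, 62, 38, 71, 55, 48, 67, 29, 83, 52, 44, 61],
    'bmi':            [24.1, 31.8, 22.4, 28.6, 27.3, 35.2, 27.9, 21.3, 33.1, 26.5, 23.8, 29.4],
    'systolic_bp':    [118, 145, 122, 158, 134, 142, 128, 115, 172, 136, 119, 141],
    'num_admissions': [1, 4, 1, 6, 2, 3, 5, 1, 8, 2, 1, 3],
    'los_days':       [3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6],
})
corr = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr.values, vmin=-1, vmax=1, cmap='RdBu_r')   # diverging map: -1 blue, +1 red
ax.set_xticks(range(len(corr)), labels=corr.columns, rotation=45, ha='right')
ax.set_yticks(range(len(corr)), labels=corr.columns)
for i in range(len(corr)):
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha='center', va='center')
fig.colorbar(im, ax=ax, label='Pearson r')
ax.set_title('Pearson correlation — patient features')
fig.tight_layout()
fig.savefig('corr_heatmap.png')
```

Annotating each cell with its rounded value keeps the grid readable even for colour-blind readers.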

Numeric × Categorical — Group Comparisons

The scenario: The clinical director wants to know whether length of stay differs meaningfully between admission types — Emergency, Elective, and Urgent. If emergency patients are staying significantly longer, that has staffing and bed-planning implications. You need a grouped comparison with summary statistics and a statistical test to confirm whether the differences are real or just noise from a small sample.

import pandas as pd      # pandas: data library — .groupby(), .agg(), and group-based statistics
import numpy as np       # numpy: numerical library — array extraction for scipy tests
from scipy import stats  # scipy.stats: scientific statistics module — Kruskal-Wallis test for group differences

# Extend the patient DataFrame with admission type
df['admission_type'] = ['Emergency','Elective','Emergency','Emergency','Urgent',
                         'Elective','Emergency','Elective','Emergency','Urgent',
                         'Emergency','Elective']

# Step 1: Summary statistics per group
group_stats = (
    df.groupby('admission_type')['los_days']
    .agg(
        count='count',
        mean='mean',
        median='median',
        std='std',
        min='min',
        max='max'
    )
    .round(2)
)
print("=== LENGTH OF STAY BY ADMISSION TYPE ===")
print(group_stats)
print()

# Step 2: Effect size — eta-squared: the share of total variance in los_days
# explained by group membership (0 = none, 1 = all of it)
overall_mean = df['los_days'].mean()
ss_total = ((df['los_days'] - overall_mean) ** 2).sum()
ss_between = (
    df.groupby('admission_type')['los_days']
    .apply(lambda g: len(g) * (g.mean() - overall_mean) ** 2)
    .sum()
)
eta_squared = ss_between / ss_total
print("Overall mean los_days:", round(overall_mean, 2))
print("Eta-squared (variance explained by admission type):", round(eta_squared, 3))
print()

# Step 3: Kruskal-Wallis test — non-parametric test for differences between 3+ groups
# Use this instead of ANOVA when you can't assume normal distributions (which we can't with n=12)
# Null hypothesis: all groups come from the same distribution
# If p < 0.05 we reject H0 and conclude at least one group differs
groups = [grp['los_days'].values for _, grp in df.groupby('admission_type')]  # list of arrays, one per group
stat, p_value = stats.kruskal(*groups)   # *groups unpacks the list as separate arguments

print(f"Kruskal-Wallis statistic: {stat:.3f}")
print(f"p-value:                  {p_value:.4f}")
if p_value < 0.05:
    print("Result: Statistically significant difference between groups (p < 0.05)")
else:
    print("Result: No statistically significant difference detected at p=0.05")

What just happened?

pandas is driving the group summary. .groupby('admission_type')['los_days'].agg() splits the DataFrame by admission type, selects the los_days column, and computes multiple named statistics in one call. The named aggregation syntax (count='count') gives clean column names directly — no renaming step needed.

scipy is Python's scientific computing library. stats.kruskal() runs the Kruskal-Wallis H-test — a non-parametric test that checks whether multiple groups come from the same distribution, without assuming normality. It's the right choice when groups are small (as here) or skewed. The *groups syntax unpacks a list of arrays as separate positional arguments.

Emergency patients stay about 7.3 days on average versus 5.3 for Elective and 4.5 for Urgent — a meaningful-looking gap. But the Kruskal-Wallis p-value comes out well above 0.05, so the difference isn't statistically significant in a 12-row sample. The clinical director should note the trend but not act on it without more data. This is exactly what the test is for: stopping people from drawing conclusions from patterns that could easily be noise.
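If the omnibus test had come back significant, the natural follow-up is pairwise comparisons — a sketch using scipy's Mann-Whitney U test on each pair of groups, with a Bonferroni-adjusted threshold since we run three tests:

```python
from itertools import combinations

from scipy import stats

# Length-of-stay values per admission type, as in the lesson's dataset
los = {
    'Emergency': [3, 2, 12, 9, 15, 3],
    'Elective':  [7, 6, 2, 6],
    'Urgent':    [4, 5],
}
pairs = list(combinations(los, 2))
alpha = 0.05 / len(pairs)          # Bonferroni correction for 3 comparisons
for a, b in pairs:
    stat, p = stats.mannwhitneyu(los[a], los[b], alternative='two-sided')
    flag = 'significant' if p < alpha else 'not significant'
    print(f"{a} vs {b}: U = {stat:.1f}, p = {p:.3f} ({flag})")
```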

Categorical × Categorical — Cross-Tabulation

The scenario: The clinical director has a second question: is there a relationship between admission type and primary diagnosis? Are cardiac patients more likely to come in as emergencies? Is orthopaedic predominantly elective? A cross-tabulation reveals how two categorical variables co-occur — and a chi-square test checks whether the pattern is statistically significant or just chance.

import pandas as pd      # pandas: data library — pd.crosstab() is the cross-tabulation function
import numpy as np       # numpy: standard numerical import
from scipy import stats  # scipy.stats: chi-square test for independence between categorical variables

df['primary_diagnosis'] = ['Cardiac','Respiratory','Orthopaedic','Cardiac','Diabetes',
                            'Respiratory','Cardiac','Orthopaedic','Cardiac','Diabetes',
                            'Respiratory','Cardiac']

# pd.crosstab() — counts how many rows share each combination of two categorical values
# index = rows, columns = columns of the resulting table
crosstab = pd.crosstab(
    df['admission_type'],      # rows
    df['primary_diagnosis'],   # columns
    margins=True,              # adds a Total row and column — very useful for reading the table
    margins_name='Total'       # label for the totals row/column
)
print("=== CROSS-TABULATION: Admission Type × Diagnosis ===")
print(crosstab)
print()

# Normalised version: proportions within each admission type (row percentages)
# normalize='index' divides each row by its row total — shows composition per group
crosstab_pct = pd.crosstab(
    df['admission_type'],
    df['primary_diagnosis'],
    normalize='index'          # row proportions — each row sums to 1.0
).round(2)
print("=== ROW PROPORTIONS (% within each admission type) ===")
print(crosstab_pct)
print()

# Chi-square test of independence
# Tests whether admission_type and primary_diagnosis are statistically independent
# Null hypothesis: the two variables are independent — knowing one tells you nothing about the other
# We use the raw crosstab WITHOUT margins for the test
crosstab_raw = pd.crosstab(df['admission_type'], df['primary_diagnosis'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab_raw)   # returns statistic, p-value, degrees of freedom, expected frequencies

print(f"Chi-square statistic: {chi2:.3f}")
print(f"Degrees of freedom:   {dof}")
print(f"p-value:              {p:.4f}")
if p < 0.05:
    print("Result: Significant association between admission type and diagnosis (p < 0.05)")
else:
    print("Result: No significant association detected at p=0.05")

What just happened?

pandas provides pd.crosstab() — it counts co-occurrences of two categorical variables and presents them as a contingency table. The margins=True parameter adds row and column totals automatically. The normalize='index' argument in the second call converts each row to proportions — so you can see that two-thirds of emergency admissions are cardiac, half of elective admissions are respiratory, and both urgent admissions are diabetes-related. Those are meaningful clinical patterns, though at this sample size any test result should be read cautiously.
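normalize accepts other values too — 'columns' for column proportions and 'all' for overall proportions. A minimal sketch with toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'admission': ['Emergency', 'Emergency', 'Elective', 'Elective', 'Elective'],
    'diagnosis': ['Cardiac',   'Cardiac',   'Cardiac',  'Ortho',    'Ortho'],
})
by_row = pd.crosstab(toy['admission'], toy['diagnosis'], normalize='index')    # each row sums to 1
by_col = pd.crosstab(toy['admission'], toy['diagnosis'], normalize='columns')  # each column sums to 1
by_all = pd.crosstab(toy['admission'], toy['diagnosis'], normalize='all')      # whole table sums to 1
print(by_row, by_col, by_all, sep='\n\n')
```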

scipy's stats.chi2_contingency() runs the Pearson chi-square test of independence. It takes a contingency table (without margins), computes the chi-square statistic, and returns the p-value and degrees of freedom. It also returns the expected frequencies — what the cell counts would look like if the two variables were truly independent.
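Those expected frequencies are worth inspecting before trusting the p-value — a common rule of thumb says the chi-square approximation is shaky when expected counts fall below 5. A sketch of that check, using the contingency table the lesson's 12 patients produce:

```python
import pandas as pd
from scipy import stats

# Observed counts: admission_type (rows) x primary_diagnosis (columns)
observed = pd.DataFrame(
    {'Cardiac': [1, 4, 0], 'Diabetes': [0, 0, 2],
     'Orthopaedic': [1, 1, 0], 'Respiratory': [2, 1, 0]},
    index=['Elective', 'Emergency', 'Urgent'],
)
chi2, p, dof, expected = stats.chi2_contingency(observed)
n_small = (expected < 5).sum()        # cells where the approximation is doubtful
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
print(f"{n_small} of {expected.size} expected counts are below 5 — "
      "interpret the p-value with caution")
```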

With only 12 patients, every expected cell count in this table falls below 5 — the usual rule of thumb for when the chi-square approximation stops being trustworthy. So whatever p-value scipy reports here, treat it as indicative at best: the urgent/diabetes and cardiac/emergency patterns look interesting, but they aren't reliably testable at this sample size. A real analysis would need hundreds of rows. Always report both the descriptive pattern and the test result — they tell different parts of the story.

Numeric × Binary Target — Feature Discrimination

The scenario: The model team has added a binary readmission label: 1 = readmitted within 30 days, 0 = not. Before training, you need to know which numeric features best discriminate between the two classes — i.e., which features look noticeably different between readmitted and non-readmitted patients. Features that look identical in both groups are useless for prediction.

import pandas as pd      # pandas: data library — .groupby(), boolean indexing, group means and stds
import numpy as np       # numpy: numerical library — np.abs() for sorting effect sizes
from scipy import stats  # scipy.stats: point-biserial correlation — correlation between numeric and binary variable

# Add binary readmission target
df['readmitted'] = [0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]   # 1 = readmitted within 30 days

numeric_features = ['age', 'bmi', 'systolic_bp', 'num_admissions', 'los_days']

print("=== FEATURE MEANS BY READMISSION CLASS ===")
group_means = df.groupby('readmitted')[numeric_features].mean().round(2)
print(group_means)
print()

# Point-biserial correlation: measures correlation between a continuous variable and a binary (0/1) variable
# Equivalent to Pearson correlation when one variable is binary
# Ranges from -1 to +1; higher absolute value = stronger relationship with the target
print("=== POINT-BISERIAL CORRELATION WITH READMITTED ===")
results = []
for col in numeric_features:
    valid = df[[col, 'readmitted']].dropna()                 # drop NaN pairs for clean computation
    r, p  = stats.pointbiserialr(valid['readmitted'], valid[col])   # (binary, continuous) order
    results.append({'feature': col, 'r': round(r, 3), 'p_value': round(p, 4)})

pb_df = (
    pd.DataFrame(results)
    .set_index('feature')
    .sort_values('r', key=np.abs, ascending=False)           # sort by absolute correlation strength
)
print(pb_df)

What just happened?

scipy's stats.pointbiserialr() computes the point-biserial correlation — the standard measure of association between a continuous variable and a binary (0/1) one. Mathematically it's identical to Pearson correlation when one variable is binary, but the function name signals the intent clearly. It returns both a correlation coefficient and a p-value.
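That equivalence is easy to verify — on the same toy data, pointbiserialr and pearsonr return the same coefficient:

```python
import numpy as np
from scipy import stats

binary = np.array([0, 1, 0, 1, 1, 0, 1, 0])
values = np.array([2.1, 5.3, 1.8, 6.0, 4.9, 2.5, 5.5, 3.0])

r_pb, p_pb = stats.pointbiserialr(binary, values)   # (binary, continuous)
r_pe, p_pe = stats.pearsonr(binary, values)
print(f"point-biserial r = {r_pb:.6f}")
print(f"Pearson r        = {r_pe:.6f}")             # identical — same formula underneath
```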

pandas is used to build and display the group means table. The .groupby('readmitted')[numeric_features].mean() pattern selects multiple columns before aggregating — a compact way to get class-conditional means for all features at once.

The group means tell the story before any statistics: readmitted patients are on average about 21 years older, have nearly four times as many prior admissions, and stay roughly three times longer. In a hand-built sample this cleanly separated, every feature ends up with a strong point-biserial correlation — what matters is the ranking by absolute r. On real data the spread is usually much wider, and the features at the bottom of the ranking — those whose class-conditional distributions barely differ — are the ones to deprioritise.

Teacher's Note

Correlation is not causation — but that doesn't make it useless. A Pearson r of 0.98 between num_admissions and los_days doesn't mean one causes the other. Both might be driven by underlying patient severity. What correlation tells you is which features move together — and that's exactly what a model needs to make predictions.

Also worth noting: Pearson only catches linear relationships. Two variables can have a strong curved or threshold relationship and still return r ≈ 0. In Lesson 19 we'll cover Spearman correlation, which handles monotonic (but not necessarily linear) relationships, and is more robust to outliers.
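A quick demonstration of that blind spot — a perfect U-shaped (quadratic) relationship where y is fully determined by x, yet Pearson sees essentially nothing:

```python
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2                     # deterministic U-shape: y depends entirely on x
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.4f}")  # ~0.0 — the linear measure misses the relationship
```

The symmetry is what kills it: for every point where x and y rise together, a mirror point falls, and the linear contributions cancel out.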

Practice Questions

1. Which pandas function creates a frequency table showing how two categorical variables co-occur?



2. Which scipy function measures the correlation between a continuous numeric variable and a binary (0/1) variable?



3. Which non-parametric statistical test checks whether three or more groups come from the same distribution — without assuming normality?



Quiz

1. Two variables show a clear U-shaped relationship in a scatter plot, but their Pearson correlation is 0.03. What is the most likely explanation?


2. In pd.crosstab(), what does normalize='index' do?


3. You want to know whether customer spending differs between subscription tiers (Free, Basic, Premium). What type of bivariate analysis is this?


Up Next · Lesson 18

Multivariate Analysis

Move beyond pairs — explore how three or more variables interact simultaneously, and start building intuition for the patterns your model will need to capture.