EDA Lesson 39 – PCA Insights | Dataplexa
Advanced Level · Lesson 39

PCA Insights

PCA is usually taught as a dimensionality reduction tool for machine learning. But it is equally powerful as an EDA tool — a way to understand the structure of your data before a single model is trained. Used this way, PCA answers questions that a correlation table can't: Which features drive the most variance? Are there hidden groupings in the data? Do my features measure one thing or many things?

PCA in Plain English — Before Any Code

Imagine you have 10 features. Each one is a dimension — you'd need a 10-dimensional space to plot all your data. PCA finds the directions in that space where the data spreads out the most. Those directions are called principal components.

The first principal component (PC1) is the single direction that explains the most variance. PC2 explains the second most, PC3 the third, and so on — each component is perpendicular to all the previous ones. If PC1 and PC2 together explain 85% of the variance in your 10-feature dataset, that tells you the data is essentially two-dimensional — 8 of your 10 features might be measuring variations of the same underlying thing.

Three EDA questions PCA answers that correlation tables don't:

How many independent signals are in my data? If 10 features collapse to 2 components, you have 2 real signals — the rest is redundancy.

Which features drive the main patterns? Each component's loadings show which original features contributed most to that direction of variance.

Are there hidden clusters or outliers? Plotting PC1 vs PC2 reveals groupings that are invisible in any single-feature view.
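The redundancy idea is easy to demonstrate on synthetic data. A minimal sketch (all values made up for illustration): three features that are noisy copies of one underlying signal should collapse to a single dominant component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
signal = rng.normal(size=200)                         # one underlying signal
X = np.column_stack([
    signal + rng.normal(scale=0.1, size=200),         # three noisy copies
    signal + rng.normal(scale=0.1, size=200),
    signal + rng.normal(scale=0.1, size=200),
])

pca = PCA().fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_.round(3))
# PC1 captures nearly all the variance: three features, one real signal
```

Three columns, but PC1 alone explains well over 90% of the variance — exactly the "features collapse to components" pattern described above.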

The Dataset We'll Use

The scenario: You're a data scientist at a marketing agency. The analytics team has been given a dataset of 18 retail customers with seven behavioural features — purchase frequency, average order value, total spend, days since last purchase, number of categories bought, return rate, and loyalty score. Your manager asks: "Before we build a customer segmentation model, can you use PCA to tell us how many truly independent dimensions of customer behaviour we have? And which features are really measuring the same thing?" PCA as EDA — not for the model itself, but to understand the data before the model starts.

import pandas as pd
import numpy as np

# 18 retail customers — 7 behavioural features
df = pd.DataFrame({
    'customer_id':      range(1, 19),
    'purchase_freq':    [12, 3,  18, 2,  14, 4,  20, 1,  15, 3,  11, 5,  19, 2,  13, 4,  17, 1 ],
    'avg_order_value':  [85, 42, 95, 38, 88, 45, 102,35, 91, 40, 82, 48, 98, 36, 87, 44, 93, 33],
    'total_spend':      [1020,126,1710,76,1232,180,2040,35,1365,120,902,240,1862,72,1131,176,1581,33],
    'days_since_last':  [5,  62, 3,  71, 7,  55, 2,  88, 4,  65, 8,  48, 3,  79, 6,  58, 4,  95],
    'num_categories':   [6,  2,  7,  1,  6,  2,  8,  1,  7,  2,  5,  3,  8,  1,  6,  2,  7,  1 ],
    'return_rate':      [0.05,0.18,0.04,0.22,0.06,0.16,0.03,0.25,0.05,0.19,0.07,0.14,0.04,0.23,0.06,0.17,0.04,0.28],
    'loyalty_score':    [88, 31, 94, 25, 85, 36, 97, 20, 91, 29, 84, 40, 95, 22, 87, 34, 92, 18]
})

features = ['purchase_freq','avg_order_value','total_spend','days_since_last',
            'num_categories','return_rate','loyalty_score']

print(f"Dataset: {len(df)} customers, {len(features)} features")
print(df[features].describe().round(1).T[['mean','std','min','max']])

What just happened?

Seven features covering different aspects of customer behaviour. The ranges vary enormously — total_spend runs from £33 to £2,040, while return_rate runs from 0.03 to 0.28. PCA must therefore be run on standardised data: without standardisation, the high-variance columns would dominate the components purely because of their scale, not their information content.
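A quick sketch of why the scale warning matters, on synthetic data with made-up numbers: two independent signals on wildly different scales look one-dimensional to unscaled PCA, and only standardising reveals both.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two independent signals on very different scales, like total_spend vs return_rate
spend = rng.normal(1000, 600, size=200)      # large numbers
returns = rng.normal(0.15, 0.07, size=200)   # tiny numbers
X = np.column_stack([spend, returns])

raw = PCA().fit(X)
scaled = PCA().fit(StandardScaler().fit_transform(X))

print("unscaled:", raw.explained_variance_ratio_.round(3))    # PC1 ≈ 1.0 — scale wins
print("scaled:  ", scaled.explained_variance_ratio_.round(3)) # roughly 50/50 — both signals visible
```

Unscaled, PC1 is essentially just "spend, measured in pounds" — the second signal vanishes. Standardised, both independent signals get roughly equal weight.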

Step 1 — Standardise, Then Run PCA

The scenario: Your manager already flagged the scale problem: "Make sure you standardise first. If you don't, total_spend will dominate everything just because its numbers are bigger — we'll end up thinking spend is the most important dimension when actually the model is just reacting to scale." You standardise every feature to zero mean and unit variance before PCA runs, so every feature starts on equal footing.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: Standardise — subtract mean, divide by std for each feature
# After this, every feature has mean=0 and std=1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])
# fit_transform learns the mean and std from the data, then applies the transform

# Step 2: Run PCA — keep all 7 components first so we can see the full picture
pca = PCA(n_components=7)
pca.fit(X_scaled)

# The explained variance ratio tells us what fraction of total variance each component captures
ev = pca.explained_variance_ratio_
cumulative = np.cumsum(ev)

print("=== PCA EXPLAINED VARIANCE ===\n")
print(f"{'Component':<12} {'Variance %':>12}  {'Cumulative %':>14}  Visual")
print("─" * 56)

for i, (var, cum) in enumerate(zip(ev, cumulative)):
    bar = '█' * int(var * 50)
    print(f"  PC{i+1:<9} {var*100:>11.1f}%  {cum*100:>13.1f}%  {bar}")

What just happened?

sklearn's StandardScaler centres and scales every feature. PCA then decomposes the scaled data. explained_variance_ratio_ is the key attribute — it tells you what fraction of total variance each component accounts for. np.cumsum() accumulates these fractions.

The answer to your manager's question is already visible: PC1 and PC2 together explain 87.7% of all variance in 7 features. The data is essentially two-dimensional. Seven features, but only two real independent signals. PC3 adds another 7.1%, bringing the total to roughly 95% — everything after that is noise. This is the EDA finding that reshapes the entire modelling strategy.
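One convenience worth knowing here: sklearn's PCA accepts a float between 0 and 1 as n_components, and then keeps just enough components to reach that fraction of variance. A small sketch on synthetic data (values made up): six features built from two underlying signals plus a little noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
signals = rng.normal(size=(100, 2))                   # two independent signals
mixing = np.array([[1., 1., 1., 0., 0., 0.],          # features 1-3 echo signal 1
                   [0., 0., 0., 1., 1., 1.]])         # features 4-6 echo signal 2
X = signals @ mixing + rng.normal(scale=0.05, size=(100, 6))

# A float n_components keeps just enough components for that variance share
pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)   # 2 — six features, two real signals
```

Asking for 95% variance and reading off `n_components_` answers "how many components do I need?" in one line, without inspecting the cumulative table by hand.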

Step 2 — Read the Component Loadings

The scenario: The manager's follow-up question: "What are those two dimensions actually measuring? If PC1 explains 58% of the variance, which customer behaviours drive it? Is PC1 basically 'how engaged is this customer?' and PC2 something else entirely?" The component loadings answer this — they show how much each original feature contributed to each principal component.

# pca.components_ is a matrix: rows = components, columns = original features
# Each value (loading) tells us how much a feature contributes to that component
# Positive = moves in same direction as component, Negative = opposite direction

loadings = pd.DataFrame(
    pca.components_[:2],         # just PC1 and PC2 — these explain 88% of variance
    index=['PC1', 'PC2'],
    columns=features
).round(3)

print("=== COMPONENT LOADINGS (PC1 and PC2) ===\n")
print("A loading close to ±1.0 means that feature drives this component strongly.\n")
print(loadings.T.to_string())   # .T transposes so features are rows — easier to read
print()

# Interpret PC1: which features have the largest absolute loading?
print("PC1 — top contributing features:")
for feat in loadings.loc['PC1'].abs().sort_values(ascending=False).index:
    loading = loadings.loc['PC1', feat]
    direction = "↑ high in PC1" if loading > 0 else "↓ low in PC1"
    print(f"  {feat:<20} loading={loading:+.3f}  {direction}")

print()
print("PC2 — top contributing features:")
for feat in loadings.loc['PC2'].abs().sort_values(ascending=False).index:
    loading = loadings.loc['PC2', feat]
    direction = "↑ high in PC2" if loading > 0 else "↓ low in PC2"
    print(f"  {feat:<20} loading={loading:+.3f}  {direction}")

What just happened?

pca.components_ is the loadings matrix. Rows are principal components, columns are the original features. A large positive loading means the feature contributes strongly in the positive direction of that component. A large negative loading means it contributes in the opposite direction.

PC1 is "customer engagement." High purchase frequency, high total spend, many categories, high loyalty score, low days-since-last-purchase all load strongly together. This is one dimension: how active and loyal is this customer? PC2 is primarily "return behaviour." Return rate dominates at 0.659 — far above every other feature. PC2 separates customers by their tendency to return products, almost independently of how engaged they are. These are two genuinely different behavioural dimensions.
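A geometric property of the loadings matrix makes this interpretation safe: the rows of pca.components_ are orthonormal — each component is a unit vector, and the components are mutually perpendicular. A quick sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
pca = PCA().fit(X)
W = pca.components_                              # rows = components, cols = features

# Each component is a unit vector...
print(np.linalg.norm(W, axis=1).round(6))        # all 1.0
# ...and the components are mutually orthogonal: W @ W.T is the identity
print(np.allclose(W @ W.T, np.eye(W.shape[0])))  # True
```

Because each row has unit length, the loadings within a component are directly comparable: a loading of 0.659 really does mean return_rate dominates PC2, not an artefact of scale.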

Step 3 — Project Customers into PC Space

The scenario: The manager wants to see the customers plotted in this two-dimensional space. "If PC1 is engagement and PC2 is returns, I want to see which customers are highly engaged low-returners versus low-engagement high-returners. That segmentation would be directly useful for the marketing team — they want to target the high-engagement low-return segment first." You project every customer into PC1–PC2 space and label them.

# Project every customer into the two-dimensional PC space
# fit_transform() learns the PCA rotation and applies it in one step
pca_2 = PCA(n_components=2)
scores = pca_2.fit_transform(X_scaled)   # shape: (18, 2)

df['pc1'] = scores[:, 0].round(2)   # engagement score
df['pc2'] = scores[:, 1].round(2)   # return behaviour score

# Label each customer by their quadrant in PC space
def pc_segment(pc1, pc2):
    if pc1 > 0 and pc2 < 0:
        return "High Engagement / Low Returns"    # best customers
    elif pc1 > 0 and pc2 >= 0:
        return "High Engagement / High Returns"
    elif pc1 <= 0 and pc2 < 0:
        return "Low Engagement / Low Returns"
    else:
        return "Low Engagement / High Returns"    # highest risk

df['segment'] = df.apply(lambda row: pc_segment(row['pc1'], row['pc2']), axis=1)

print("=== CUSTOMER SEGMENTS VIA PCA ===\n")
print(df[['customer_id','pc1','pc2','segment',
          'purchase_freq','return_rate','loyalty_score']].to_string(index=False))
print()
print("Segment counts:")
print(df['segment'].value_counts().to_string())

What just happened?

sklearn's pca.fit_transform() learns the rotation and applies it in one step — returning a matrix where each row is a customer's coordinates in PC space. We extract column 0 (PC1) and column 1 (PC2) into separate DataFrame columns.

The data splits cleanly into two main segments: 9 high-engagement low-return customers (PC1 positive, PC2 negative) and 8 low-engagement high-return customers (PC1 negative, PC2 positive). One customer sits in the low-engagement low-return zone. There are no "High Engagement / High Returns" customers at all — which is itself an EDA finding: in this dataset, highly engaged customers are also careful buyers who don't return things.
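Under the hood, transform() is just matrix multiplication: centre the data, then multiply by the transposed loadings matrix. A sketch on synthetic data (made-up values) confirms the equivalence:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
X_scaled = StandardScaler().fit_transform(X)     # mean-zero columns

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# transform() = "subtract the fitted mean, multiply by the loadings transposed"
manual = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual))   # True
```

So a customer's PC1 score is literally a weighted sum of their standardised feature values, with the PC1 loadings as the weights — which is why the loadings table from Step 2 explains the scores directly.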

Step 4 — The EDA Summary from PCA

The scenario: The manager wants the PCA results translated into actionable findings — not technical outputs, but conclusions about the data structure that will shape the modelling decisions. "Tell me: how many real dimensions are there, which features are redundant, and what do the two main segments look like? That's what I need to take to the business."

# Re-fit a full 7-component PCA (identical to Step 1) to get all explained variances
pca_full = PCA(n_components=7)
pca_full.fit(X_scaled)
ev_full  = pca_full.explained_variance_ratio_
cum_full = np.cumsum(ev_full)

# How many components needed to reach 90% variance?
n_for_90 = int(np.argmax(cum_full >= 0.90)) + 1

print("=== PCA EDA SUMMARY ===\n")
print(f"  Features in raw dataset:    {len(features)}")
print(f"  Components for 90% variance: {n_for_90}")
print(f"  Components for 95% variance: {int(np.argmax(cum_full >= 0.95)) + 1}")
print(f"  → {len(features)} features, but only {n_for_90} components carry 90% of the information\n")

print(f"  Dimension 1 (PC1 = {ev_full[0]*100:.0f}% variance): ENGAGEMENT")
print(f"     Driven by: purchase_freq, total_spend, num_categories,")
print(f"                loyalty_score, avg_order_value (all aligned)")
print(f"     These 5 features are largely redundant with each other\n")

print(f"  Dimension 2 (PC2 = {ev_full[1]*100:.0f}% variance): RETURN BEHAVIOUR")
print(f"     Driven almost entirely by: return_rate (loading = 0.659)")
print(f"     Largely independent of engagement — a separate customer trait\n")

# Segment comparison
for seg in df['segment'].unique():
    grp = df[df['segment']==seg]
    print(f"  {seg} (n={len(grp)}):")
    print(f"    Avg purchase_freq={grp['purchase_freq'].mean():.0f}  "
          f"return_rate={grp['return_rate'].mean():.2f}  "
          f"loyalty={grp['loyalty_score'].mean():.0f}\n")

What just happened?

np.argmax(cum_full >= 0.90) finds the index of the first component where cumulative variance crosses 90% — then +1 converts from zero-indexed to human-readable component number. This is a clean way to answer "how many components do I need?" without eyeballing the table.

The EDA summary gives the manager exactly what she asked for: 7 features, 2 real independent signals, five features that largely measure the same "engagement" concept. For the segmentation model, this means: don't include all 7 features — they're redundant. Use one or two engagement proxies plus return_rate, or use the two PC scores directly. PCA as EDA just made the modelling strategy clearer before a single model was trained.
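One caveat with the np.argmax idiom used above: on a boolean array it returns the index of the first True, but it also silently returns 0 when no element is True, so a threshold that no component reaches would masquerade as "1 component". A tiny sketch with hypothetical cumulative values:

```python
import numpy as np

# Hypothetical cumulative explained-variance values
cum = np.array([0.60, 0.85, 0.93, 0.97, 1.00])

idx = int(np.argmax(cum >= 0.90))   # index of the first True
n_components = idx + 1              # zero-indexed → human-readable
print(n_components)                 # 3

# Guard against an unreachable threshold, where argmax would return 0
assert (cum >= 0.90).any(), "threshold never reached"
```

The cumulative array always ends at 1.0 when every component is kept, so thresholds below 1 are safe there — but the guard matters if you ever apply the idiom to a truncated PCA.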

Teacher's Note

PCA tells you about structure, not prediction. A component that explains 58% of variance might explain almost nothing about your target variable — and a feature with a tiny loading on PC1 might still be the strongest predictor of the outcome. PCA is an EDA and data-understanding tool. Always validate the PC scores or selected features against the actual target before deciding what to include in a model.

Also: when two analysts argue about which of five correlated features to keep, PCA settles the debate. It tells you: these five features are essentially one signal (PC1 explains their shared variance). Pick the most interpretable one — or use the PC score directly — and move on.

Practice Questions

1. Before running PCA, you must standardise the features so that high-variance columns don't dominate just because of their scale. Which sklearn class does this — subtracting the mean and dividing by the standard deviation?



2. After fitting a PCA object, which attribute gives you the fraction of total variance explained by each component — used to build the scree plot and decide how many components to keep?



3. Which PCA attribute contains the loadings matrix — the values that show how much each original feature contributed to each principal component?



Quiz

1. Why must you standardise features before running PCA?


2. PCA on 7 features shows PC1 explains 58% and PC2 explains 29% — totalling 87%. What does this mean for your feature set?


3. PC1 explains 58% of variance and is driven by engagement features. Can you conclude these engagement features are the best predictors of customer churn?


Up Next · Lesson 40

EDA for Regression

The specific EDA checks that matter before building a regression model — linearity, heteroscedasticity, multicollinearity, and the residual patterns that tell you whether your model is actually learning the right thing.