EDA Lesson 39 – PCA Insights | Dataplexa
Advanced Level · Lesson 39

PCA Insights

PCA is usually taught as a dimensionality reduction tool for machine learning. But it is equally powerful as an EDA tool — a way to understand the structure of your data before a single model is trained. Used this way, PCA answers questions that a correlation table can't: Which features drive the most variance? Are there hidden groupings in the data? Do my features measure one thing or many things?

PCA in Plain English — Before Any Code

Imagine you have 10 features. Each one is a dimension — you'd need a 10-dimensional space to plot all your data. PCA finds the directions in that space where the data spreads out the most. Those directions are called principal components.

The first principal component (PC1) is the single direction that explains the most variance. PC2 explains the second most, PC3 the third, and so on — each component is perpendicular to all the previous ones. If PC1 and PC2 together explain 85% of the variance in your 10-feature dataset, that tells you the data is essentially two-dimensional — 8 of your 10 features might be measuring variations of the same underlying thing.

Three EDA questions PCA answers that correlation tables don't:

How many independent signals are in my data? If 10 features collapse to 2 components, you have 2 real signals — the rest is redundancy.

Which features drive the main patterns? Each component's loadings show which original features contributed most to that direction of variance.

Are there hidden clusters or outliers? Plotting PC1 vs PC2 reveals groupings that are invisible in any single-feature view.
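The redundancy idea is easy to demonstrate on synthetic data. A minimal sketch (all values made up for illustration): three features that are noisy copies of one underlying signal should collapse to a single dominant component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
signal = rng.normal(size=200)                         # one underlying signal
X = np.column_stack([
    signal + rng.normal(scale=0.1, size=200),         # three noisy copies
    signal + rng.normal(scale=0.1, size=200),
    signal + rng.normal(scale=0.1, size=200),
])

pca = PCA().fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_.round(3))
# PC1 captures nearly all the variance: three features, one real signal
```

Three columns, but PC1 alone explains well over 90% of the variance — exactly the "features collapse to components" pattern described above.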

The Dataset We'll Use

The scenario: You're a data scientist at a marketing agency. The analytics team has been given a dataset of 18 retail customers with seven behavioural features — purchase frequency, average order value, total spend, days since last purchase, number of categories bought, return rate, and loyalty score. Your manager asks: "Before we build a customer segmentation model, can you use PCA to tell us how many truly independent dimensions of customer behaviour we have? And which features are really measuring the same thing?" PCA as EDA — not for the model itself, but to understand the data before the model starts.

import pandas as pd
import numpy as np

# 18 retail customers — 7 behavioural features
df = pd.DataFrame({
    'customer_id':      range(1, 19),
    'purchase_freq':    [12, 3,  18, 2,  14, 4,  20, 1,  15, 3,  11, 5,  19, 2,  13, 4,  17, 1 ],
    'avg_order_value':  [85, 42, 95, 38, 88, 45, 102,35, 91, 40, 82, 48, 98, 36, 87, 44, 93, 33],
    'total_spend':      [1020,126,1710,76,1232,180,2040,35,1365,120,902,240,1862,72,1131,176,1581,33],
    'days_since_last':  [5,  62, 3,  71, 7,  55, 2,  88, 4,  65, 8,  48, 3,  79, 6,  58, 4,  95],
    'num_categories':   [6,  2,  7,  1,  6,  2,  8,  1,  7,  2,  5,  3,  8,  1,  6,  2,  7,  1 ],
    'return_rate':      [0.05,0.18,0.04,0.22,0.06,0.16,0.03,0.25,0.05,0.19,0.07,0.14,0.04,0.23,0.06,0.17,0.04,0.28],
    'loyalty_score':    [88, 31, 94, 25, 85, 36, 97, 20, 91, 29, 84, 40, 95, 22, 87, 34, 92, 18]
})

features = ['purchase_freq','avg_order_value','total_spend','days_since_last',
            'num_categories','return_rate','loyalty_score']

print(f"Dataset: {len(df)} customers, {len(features)} features")
print(df[features].describe().round(1).T[['mean','std','min','max']])

What just happened?

Seven features covering different aspects of customer behaviour. The ranges vary enormously — total_spend runs from £33 to £2,040, while return_rate runs from 0.03 to 0.28. PCA must therefore be run on standardised data: without standardisation, the high-variance columns would dominate the components purely because of their scale, not their information content.
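A quick sketch of why the scale warning matters, on synthetic data with made-up numbers: two independent signals on wildly different scales look one-dimensional to unscaled PCA, and only standardising reveals both.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two independent signals on very different scales, like total_spend vs return_rate
spend = rng.normal(1000, 600, size=200)      # large numbers
returns = rng.normal(0.15, 0.07, size=200)   # tiny numbers
X = np.column_stack([spend, returns])

raw = PCA().fit(X)
scaled = PCA().fit(StandardScaler().fit_transform(X))

print("unscaled:", raw.explained_variance_ratio_.round(3))    # PC1 ≈ 1.0 — scale wins
print("scaled:  ", scaled.explained_variance_ratio_.round(3)) # roughly 50/50 — both signals visible
```

Unscaled, PC1 is essentially just "spend, measured in pounds" — the second signal vanishes. Standardised, both independent signals get roughly equal weight.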

Step 1 — Standardise, Then Run PCA

The scenario: Your manager already flagged the scale problem: "Make sure you standardise first. If you don't, total_spend will dominate everything just because its numbers are bigger — we'll end up thinking spend is the most important dimension when actually the model is just reacting to scale." You standardise every feature to zero mean and unit variance before PCA runs, so every feature starts on equal footing.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: Standardise — subtract mean, divide by std for each feature
# After this, every feature has mean=0 and std=1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])
# fit_transform learns the mean and std from the data, then applies the transform

# Step 2: Run PCA — keep all 7 components first so we can see the full picture
pca = PCA(n_components=7)
pca.fit(X_scaled)

# The explained variance ratio tells us what fraction of total variance each component captures
ev = pca.explained_variance_ratio_
cumulative = np.cumsum(ev)

print("=== PCA EXPLAINED VARIANCE ===\n")
print(f"{'Component':<12} {'Variance %':>12}  {'Cumulative %':>14}  Visual")
print("─" * 56)

for i, (var, cum) in enumerate(zip(ev, cumulative)):
    bar = '█' * int(var * 50)
    print(f"  PC{i+1:<9} {var*100:>11.1f}%  {cum*100:>13.1f}%  {bar}")

What just happened?

sklearn's StandardScaler centres and scales every feature. PCA then decomposes the scaled data. explained_variance_ratio_ is the key attribute — it tells you what fraction of total variance each component accounts for. np.cumsum() accumulates these fractions.

The answer to your manager's question is already visible: PC1 and PC2 together explain 87.7% of all variance in 7 features. The data is essentially two-dimensional. Seven features, but only two real independent signals. PC3 adds another 7.1%, bringing the total to roughly 95% — everything after that is noise. This is the EDA finding that reshapes the entire modelling strategy.
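One convenience worth knowing here: sklearn's PCA accepts a float between 0 and 1 as n_components, and then keeps just enough components to reach that fraction of variance. A small sketch on synthetic data (values made up): six features built from two underlying signals plus a little noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
signals = rng.normal(size=(100, 2))                   # two independent signals
mixing = np.array([[1., 1., 1., 0., 0., 0.],          # features 1-3 echo signal 1
                   [0., 0., 0., 1., 1., 1.]])         # features 4-6 echo signal 2
X = signals @ mixing + rng.normal(scale=0.05, size=(100, 6))

# A float n_components keeps just enough components for that variance share
pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)   # 2 — six features, two real signals
```

Asking for 95% variance and reading off `n_components_` answers "how many components do I need?" in one line, without inspecting the cumulative table by hand.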

Step 2 — Read the Component Loadings

The scenario: The manager's follow-up question: "What are those two dimensions actually measuring? If PC1 explains 58% of the variance, which customer behaviours drive it? Is PC1 basically 'how engaged is this customer?' and PC2 something else entirely?" The component loadings answer this — they show how much each original feature contributed to each principal component.

# pca.components_ is a matrix: rows = components, columns = original features
# Each value (loading) tells us how much a feature contributes to that component
# Positive = moves in same direction as component, Negative = opposite direction

loadings = pd.DataFrame(
    pca.components_[:2],         # just PC1 and PC2 — these explain 88% of variance
    index=['PC1', 'PC2'],
    columns=features
).round(3)

print("=== COMPONENT LOADINGS (PC1 and PC2) ===\n")
print("A loading close to ±1.0 means that feature drives this component strongly.\n")
print(loadings.T.to_string())   # .T transposes so features are rows — easier to read
print()

# Interpret PC1: which features have the largest absolute loading?
print("PC1 — top contributing features:")
for feat in loadings.loc['PC1'].abs().sort_values(ascending=False).index:
    loading = loadings.loc['PC1', feat]
    direction = "↑ high in PC1" if loading > 0 else "↓ low in PC1"
    print(f"  {feat:<20} loading={loading:+.3f}  {direction}")

print()
print("PC2 — top contributing features:")
for feat in loadings.loc['PC2'].abs().sort_values(ascending=False).index:
    loading = loadings.loc['PC2', feat]
    direction = "↑ high in PC2" if loading > 0 else "↓ low in PC2"
    print(f"  {feat:<20} loading={loading:+.3f}  {direction}")

What just happened?

pca.components_ is the loadings matrix. Rows are principal components, columns are the original features. A large positive loading means the feature contributes strongly in the positive direction of that component. A large negative loading means it contributes in the opposite direction.

PC1 is "customer engagement." High purchase frequency, high total spend, many categories, high loyalty score, low days-since-last-purchase all load strongly together. This is one dimension: how active and loyal is this customer? PC2 is primarily "return behaviour." Return rate dominates at 0.659 — far above every other feature. PC2 separates customers by their tendency to return products, almost independently of how engaged they are. These are two genuinely different behavioural dimensions.
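A geometric property of the loadings matrix makes this interpretation safe: the rows of pca.components_ are orthonormal — each component is a unit vector, and the components are mutually perpendicular. A quick sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
pca = PCA().fit(X)
W = pca.components_                              # rows = components, cols = features

# Each component is a unit vector...
print(np.linalg.norm(W, axis=1).round(6))        # all 1.0
# ...and the components are mutually orthogonal: W @ W.T is the identity
print(np.allclose(W @ W.T, np.eye(W.shape[0])))  # True
```

Because each row has unit length, the loadings within a component are directly comparable: a loading of 0.659 really does mean return_rate dominates PC2, not an artefact of scale.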

Step 3 — Project Customers into PC Space

The scenario: The manager wants to see the customers plotted in this two-dimensional space. "If PC1 is engagement and PC2 is returns, I want to see which customers are highly engaged low-returners versus low-engagement high-returners. That segmentation would be directly useful for the marketing team — they want to target the high-engagement low-return segment first." You project every customer into PC1–PC2 space and label them.

# Project every customer into the two-dimensional PC space
# fit_transform() learns the PCA rotation and applies it in one step
pca_2 = PCA(n_components=2)
scores = pca_2.fit_transform(X_scaled)   # shape: (18, 2)

df['pc1'] = scores[:, 0].round(2)   # engagement score
df['pc2'] = scores[:, 1].round(2)   # return behaviour score

# Label each customer by their quadrant in PC space
def pc_segment(pc1, pc2):
    if pc1 > 0 and pc2 < 0:
        return "High Engagement / Low Returns"    # best customers
    elif pc1 > 0 and pc2 >= 0:
        return "High Engagement / High Returns"
    elif pc1 <= 0 and pc2 < 0:
        return "Low Engagement / Low Returns"
    else:
        return "Low Engagement / High Returns"    # highest risk

df['segment'] = df.apply(lambda row: pc_segment(row['pc1'], row['pc2']), axis=1)

print("=== CUSTOMER SEGMENTS VIA PCA ===\n")
print(df[['customer_id','pc1','pc2','segment',
          'purchase_freq','return_rate','loyalty_score']].to_string(index=False))
print()
print("Segment counts:")
print(df['segment'].value_counts().to_string())

What just happened?

sklearn's pca.fit_transform() learns the rotation and applies it in one step — returning a matrix where each row is a customer's coordinates in PC space. We extract column 0 (PC1) and column 1 (PC2) into separate DataFrame columns.

The data splits cleanly into two main segments: 9 high-engagement low-return customers (PC1 positive, PC2 negative) and 8 low-engagement high-return customers (PC1 negative, PC2 positive). One customer sits in the low-engagement low-return zone. There are no "High Engagement / High Returns" customers at all — which is itself an EDA finding: in this dataset, highly engaged customers are also careful buyers who don't return things.
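Under the hood, transform() is just matrix multiplication: centre the data, then multiply by the transposed loadings matrix. A sketch on synthetic data (made-up values) confirms the equivalence:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
X_scaled = StandardScaler().fit_transform(X)     # mean-zero columns

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# transform() = "subtract the fitted mean, multiply by the loadings transposed"
manual = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual))   # True
```

So a customer's PC1 score is literally a weighted sum of their standardised feature values, with the PC1 loadings as the weights — which is why the loadings table from Step 2 explains the scores directly.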

Step 4 — The EDA Summary from PCA

The scenario: The manager wants the PCA results translated into actionable findings — not technical outputs, but conclusions about the data structure that will shape the modelling decisions. "Tell me: how many real dimensions are there, which features are redundant, and what do the two main segments look like? That's what I need to take to the business."

# Re-fit a full 7-component PCA (identical to Step 1) to get all explained variances
pca_full = PCA(n_components=7)
pca_full.fit(X_scaled)
ev_full  = pca_full.explained_variance_ratio_
cum_full = np.cumsum(ev_full)

# How many components needed to reach 90% variance?
n_for_90 = int(np.argmax(cum_full >= 0.90)) + 1

print("=== PCA EDA SUMMARY ===\n")
print(f"  Features in raw dataset:    {len(features)}")
print(f"  Components for 90% variance: {n_for_90}")
print(f"  Components for 95% variance: {int(np.argmax(cum_full >= 0.95)) + 1}")
print(f"  → {len(features)} features, but only {n_for_90} components carry 90% of the information\n")

print(f"  Dimension 1 (PC1 = {ev_full[0]*100:.0f}% variance): ENGAGEMENT")
print(f"     Driven by: purchase_freq, total_spend, num_categories,")
print(f"                loyalty_score, avg_order_value (all aligned)")
print(f"     These 5 features are largely redundant with each other\n")

print(f"  Dimension 2 (PC2 = {ev_full[1]*100:.0f}% variance): RETURN BEHAVIOUR")
print(f"     Driven almost entirely by: return_rate (loading = 0.659)")
print(f"     Largely independent of engagement — a separate customer trait\n")

# Segment comparison
for seg in df['segment'].unique():
    grp = df[df['segment']==seg]
    print(f"  {seg} (n={len(grp)}):")
    print(f"    Avg purchase_freq={grp['purchase_freq'].mean():.0f}  "
          f"return_rate={grp['return_rate'].mean():.2f}  "
          f"loyalty={grp['loyalty_score'].mean():.0f}\n")

What just happened?

np.argmax(cum_full >= 0.90) finds the index of the first component where cumulative variance crosses 90% — then +1 converts from zero-indexed to human-readable component number. This is a clean way to answer "how many components do I need?" without eyeballing the table.

The EDA summary gives the manager exactly what she asked for: 7 features, 2 real independent signals, five features that largely measure the same "engagement" concept. For the segmentation model, this means: don't include all 7 features — they're redundant. Use one or two engagement proxies plus return_rate, or use the two PC scores directly. PCA as EDA just made the modelling strategy clearer before a single model was trained.
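One caveat with the np.argmax idiom used above: on a boolean array it returns the index of the first True, but it also silently returns 0 when no element is True, so a threshold that no component reaches would masquerade as "1 component". A tiny sketch with hypothetical cumulative values:

```python
import numpy as np

# Hypothetical cumulative explained-variance values
cum = np.array([0.60, 0.85, 0.93, 0.97, 1.00])

idx = int(np.argmax(cum >= 0.90))   # index of the first True
n_components = idx + 1              # zero-indexed → human-readable
print(n_components)                 # 3

# Guard against an unreachable threshold, where argmax would return 0
assert (cum >= 0.90).any(), "threshold never reached"
```

The cumulative array always ends at 1.0 when every component is kept, so thresholds below 1 are safe there — but the guard matters if you ever apply the idiom to a truncated PCA.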

Teacher's Note

PCA tells you about structure, not prediction. A component that explains 58% of variance might explain almost nothing about your target variable — and a feature with a tiny loading on PC1 might still be the strongest predictor of the outcome. PCA is an EDA and data-understanding tool. Always validate the PC scores or selected features against the actual target before deciding what to include in a model.

Also: when two analysts argue about which of five correlated features to keep, PCA settles the debate. It tells you: these five features are essentially one signal (PC1 explains their shared variance). Pick the most interpretable one — or use the PC score directly — and move on.

Practice Questions

1. Before running PCA, you must standardise the features so that high-variance columns don't dominate just because of their scale. Which sklearn class does this — subtracting the mean and dividing by the standard deviation?



2. After fitting a PCA object, which attribute gives you the fraction of total variance explained by each component — used to build the scree plot and decide how many components to keep?



3. Which PCA attribute contains the loadings matrix — the values that show how much each original feature contributed to each principal component?



Quiz

1. Why must you standardise features before running PCA?


2. PCA on 7 features shows PC1 explains 58% and PC2 explains 29% — totalling 87%. What does this mean for your feature set?


3. PC1 explains 58% of variance and is driven by engagement features. Can you conclude these engagement features are the best predictors of customer churn?


Up Next · Lesson 40

EDA for Regression

The specific EDA checks that matter before building a regression model — linearity, heteroscedasticity, multicollinearity, and the residual patterns that tell you whether your model is actually learning the right thing.