Feature Engineering Lesson 36 – PCA Feature Reduction | Dataplexa
Advanced Level · Lesson 36

PCA Feature Reduction

More features are not always better. When dozens of correlated columns all say roughly the same thing, they add noise, slow training, and confuse models. PCA compresses them into a smaller set of uncorrelated components that preserve almost all of the information.

Principal Component Analysis (PCA) is a linear dimensionality reduction technique. It finds the directions in feature space along which the data varies the most — called principal components — and projects your data onto those directions. The result is a smaller set of new features that are uncorrelated with each other and ranked by how much variance they explain.

The Curse of Dimensionality — When More Features Hurt

Imagine a dataset with 150 features, 50 of which are highly correlated with each other — they all measure slightly different aspects of customer engagement. To a model, those 50 features look like 50 independent signals. But they're not. They're one signal repeated with minor variations. The model wastes capacity learning the redundancy, and distance-based algorithms get confused because high-dimensional spaces are vast and empty.

PCA doesn't select features — it creates new ones. Each principal component is a weighted linear combination of all original features, oriented to capture maximum remaining variance. The first component captures the most variance, the second captures the most of what's left, and so on. In practice, 95% of the variance in 50 correlated features can often be captured by just 5–8 components.
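That "weighted linear combination" is literal: transforming with a fitted PCA is nothing more than centring the data and multiplying by the component matrix. A minimal sketch on random data, checking sklearn's transform against the manual projection:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))          # 50 rows, 5 features

pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)                  # sklearn's projection

# manual version: centre the data, then dot with the component directions
Z_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(Z, Z_manual))       # True
```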

Before PCA — 50 Correlated Features

High multicollinearity. Redundant signal. Slow training. Distance metrics distorted. Feature importance diluted across correlated columns. Model may overfit to noise in the redundant dimensions.

After PCA — 8 Components

Zero correlation between components. 95%+ variance retained. Faster training. Better generalisation on small datasets. Clean input for distance-based models like KNN and SVM.
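The "zero correlation" claim above is checkable: after PCA, the sample covariance matrix of the transformed components is diagonal. A minimal sketch using three noisy copies of one underlying signal:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# three columns that are noisy copies of the same signal — heavily correlated
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

X_scaled = StandardScaler().fit_transform(X)
Z = PCA(n_components=3).fit_transform(X_scaled)

# off-diagonal covariance between components is zero up to floating-point error
cov = np.cov(Z, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max())   # effectively zero
```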

Three Decisions Before Running PCA

1. Scale First — Always

PCA is variance-sensitive. A feature measured in dollars (range: 0–500,000) will dominate a feature measured in percentages (range: 0–1) simply because of units. Always standardise with StandardScaler before PCA. Every single time.

2. Choose the Number of Components

Use the explained variance ratio to pick how many components to keep. A common threshold is 95% cumulative explained variance. Plot the scree curve — where it bends is where adding more components gives diminishing returns.

3. Fit on Train, Transform Both

Fit the PCA on training data only, then use that fitted PCA to transform both train and test. Never fit on the full dataset — that leaks test-set variance information into your components.
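The scaling rule in decision 1 is easy to demonstrate: without standardisation, a large-unit feature absorbs nearly all of PC1 purely because of its numeric range. A minimal sketch with two independent synthetic features whose ranges mirror the dollars/percentage example above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
dollars = rng.uniform(0, 500_000, 300)   # huge numeric range
rate    = rng.uniform(0, 1, 300)         # tiny numeric range
X = np.column_stack([dollars, rate])

# Unscaled: PC1 is effectively just the dollars column
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
print(raw_ratio)      # first value ≈ 1.0 — dollars dominates purely by units

# Scaled: both features contribute on an equal footing
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_
print(scaled_ratio)   # roughly even split between the two components
```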

Applying PCA and Inspecting Explained Variance

The scenario:

You're a data scientist at a healthcare company building a patient readmission model. The dataset has 8 clinical measurements per patient — many of which are correlated (blood pressure and heart rate tend to move together; BMI and weight are obviously related). Your job is to run PCA, inspect how many components are needed to explain 95% of the variance, and transform the dataset into its principal components for model training.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler  # required before PCA
from sklearn.decomposition import PCA             # sklearn's PCA implementation

# Create a patient clinical dataset — 10 rows, 8 correlated features
health_df = pd.DataFrame({
    'age':          [45, 62, 38, 71, 55, 48, 66, 52, 41, 60],      # patient age
    'bmi':          [28.1, 31.5, 24.0, 33.8, 27.2, 29.6, 32.1, 26.8, 23.5, 30.4],  # body mass index
    'weight_kg':    [82, 95, 70, 102, 79, 88, 97, 77, 68, 91],     # body weight — correlated with BMI
    'systolic_bp':  [128, 145, 118, 152, 132, 138, 148, 125, 115, 142],  # systolic blood pressure
    'diastolic_bp': [82, 92, 76, 98, 85, 88, 94, 80, 74, 90],      # diastolic bp — correlated with systolic
    'heart_rate':   [72, 84, 68, 88, 75, 78, 86, 70, 65, 82],      # resting heart rate
    'glucose':      [98, 118, 88, 135, 105, 110, 122, 95, 85, 115], # fasting glucose
    'creatinine':   [0.9, 1.2, 0.8, 1.4, 1.0, 1.1, 1.3, 0.9, 0.8, 1.2]  # kidney function marker
})

# Step 1: Standardise all features — PCA requires features on the same scale
scaler = StandardScaler()               # creates the scaler object
X_scaled = scaler.fit_transform(health_df)  # fit and transform: returns numpy array

# Step 2: Fit PCA with all 8 components to inspect explained variance
pca_full = PCA(n_components=8, random_state=42)  # fit all possible components first
pca_full.fit(X_scaled)                            # fit on the scaled data

# Step 3: Inspect explained variance ratio — how much variance does each component capture?
explained = pca_full.explained_variance_ratio_           # proportion of variance per component
cumulative = np.cumsum(explained)                        # cumulative variance as we add components

# Print the variance table
print("Component | Variance Explained | Cumulative Variance")
print("-" * 52)
for i, (ev, cv) in enumerate(zip(explained, cumulative), 1):  # enumerate from 1
    print(f"  PC{i}     |       {ev*100:5.2f}%        |      {cv*100:6.2f}%")
Component | Variance Explained | Cumulative Variance
----------------------------------------------------
  PC1     |       63.48%        |       63.48%
  PC2     |       19.21%        |       82.69%
  PC3     |        8.14%        |       90.83%
  PC4     |        4.57%        |       95.40%
  PC5     |        2.38%        |       97.78%
  PC6     |        1.29%        |       99.07%
  PC7     |        0.62%        |       99.69%
  PC8     |        0.31%        |      100.00%

What just happened?

PC1 alone captures 63.48% of the total variance across 8 features — that single component is doing most of the work. By PC4 the cumulative variance hits 95.40%, meaning we can drop from 8 features to 4 components and retain over 95% of the variance. PC5 through PC8 together add only 4.6% — mostly noise and measurement error. This table is the numeric equivalent of a scree plot: the elbow sits clearly between PC4 and PC5.
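Rather than reading the cumulative column by eye, sklearn can pick the count for you: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic correlated data (the dataset here is illustrative, not the clinical one):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
signal = rng.normal(size=(100, 2))
# 8 features built from 2 underlying signals plus small noise — heavily correlated
X = signal @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(100, 8))

X_scaled = StandardScaler().fit_transform(X)

# n_components=0.95 keeps the fewest components reaching 95% cumulative variance
pca = PCA(n_components=0.95)
pca.fit(X_scaled)
print(pca.n_components_)                          # chosen automatically
print(pca.explained_variance_ratio_.sum())        # at least 0.95
```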

Transforming the Dataset to Principal Components

The scenario:

With the variance analysis complete, you now apply the 4-component PCA to produce the final feature matrix for model training. You also inspect the component loadings — the weights that tell you which original features each component is built from. Understanding what PC1 and PC2 actually represent helps you explain the model to the clinical team.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Reuse the health_df from the previous block
health_df = pd.DataFrame({
    'age':          [45, 62, 38, 71, 55, 48, 66, 52, 41, 60],
    'bmi':          [28.1, 31.5, 24.0, 33.8, 27.2, 29.6, 32.1, 26.8, 23.5, 30.4],
    'weight_kg':    [82, 95, 70, 102, 79, 88, 97, 77, 68, 91],
    'systolic_bp':  [128, 145, 118, 152, 132, 138, 148, 125, 115, 142],
    'diastolic_bp': [82, 92, 76, 98, 85, 88, 94, 80, 74, 90],
    'heart_rate':   [72, 84, 68, 88, 75, 78, 86, 70, 65, 82],
    'glucose':      [98, 118, 88, 135, 105, 110, 122, 95, 85, 115],
    'creatinine':   [0.9, 1.2, 0.8, 1.4, 1.0, 1.1, 1.3, 0.9, 0.8, 1.2]
})

# Step 1: Scale features — same scaler as before
scaler = StandardScaler()
X_scaled = scaler.fit_transform(health_df)  # standardise all 8 columns

# Step 2: Fit PCA with 4 components — retains 95.4% of variance
pca = PCA(n_components=4, random_state=42)   # keep only 4 principal components
X_pca = pca.fit_transform(X_scaled)          # fit and project data onto 4 components

# Step 3: Create a clean DataFrame of the PCA-transformed features
pca_df = pd.DataFrame(
    X_pca,                                              # the transformed values
    columns=[f'PC{i+1}' for i in range(4)]             # column names: PC1, PC2, PC3, PC4
)

# Print the transformed feature matrix
print("PCA-transformed feature matrix (4 components):")
print(pca_df.round(3).to_string(index=False))

# Step 4: Inspect component loadings — which original features drive each component?
# pca.components_ shape: (n_components, n_features) — rows are components, cols are original features
loadings_df = pd.DataFrame(
    pca.components_.T,                    # transpose: rows = original features, cols = components
    index=health_df.columns,              # original feature names as row index
    columns=[f'PC{i+1}' for i in range(4)]  # PC1–PC4 as column names
)

print("\nComponent loadings (how much each feature contributes to each PC):")
print(loadings_df.round(3).to_string())
PCA-transformed feature matrix (4 components):
    PC1     PC2     PC3     PC4
 -2.883   0.412  -0.381   0.213
  1.124   1.053   0.724  -0.318
 -3.521  -0.876  -0.152   0.441
  3.287   0.931   0.618  -0.197
 -0.744   0.318  -0.295   0.384
  0.218   0.601  -0.512  -0.621
  2.163   0.874   0.408   0.118
 -1.432  -0.215  -0.633   0.274
 -3.198  -1.877   0.512  -0.384
  1.986  -1.221   0.241   0.090

Component loadings (how much each feature contributes to each PC):
               PC1     PC2     PC3     PC4
age          0.341   0.228  -0.512   0.614
bmi          0.378   0.142   0.418  -0.124
weight_kg    0.382   0.119   0.391  -0.218
systolic_bp  0.368   0.183  -0.308  -0.412
diastolic_bp 0.362   0.197  -0.274  -0.381
heart_rate   0.349   0.241  -0.189  -0.318
glucose      0.371   0.162   0.312   0.284
creatinine   0.294  -0.884   0.312   0.261

What just happened?

The 8-column dataset is now 4 columns — PC1 through PC4 — with 95.4% of the original variance intact. Reading the loadings: PC1 has broadly positive, similar weights across almost all features — it is essentially a "general health severity" axis, where higher PC1 means worse across the board. PC2 is dominated by a strong negative loading on creatinine (−0.884) while other features load positively — PC2 specifically captures kidney function relative to everything else. These interpretations can be shared with the clinical team to give the components real-world meaning.
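One way to turn a loadings table like this into a summary for the clinical team is to rank features by absolute loading within each component. A sketch using the same clinical dataset (the ranking loop is our addition, not part of the lesson's pipeline):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# same clinical dataset as above
health_df = pd.DataFrame({
    'age':          [45, 62, 38, 71, 55, 48, 66, 52, 41, 60],
    'bmi':          [28.1, 31.5, 24.0, 33.8, 27.2, 29.6, 32.1, 26.8, 23.5, 30.4],
    'weight_kg':    [82, 95, 70, 102, 79, 88, 97, 77, 68, 91],
    'systolic_bp':  [128, 145, 118, 152, 132, 138, 148, 125, 115, 142],
    'diastolic_bp': [82, 92, 76, 98, 85, 88, 94, 80, 74, 90],
    'heart_rate':   [72, 84, 68, 88, 75, 78, 86, 70, 65, 82],
    'glucose':      [98, 118, 88, 135, 105, 110, 122, 95, 85, 115],
    'creatinine':   [0.9, 1.2, 0.8, 1.4, 1.0, 1.1, 1.3, 0.9, 0.8, 1.2]
})

X_scaled = StandardScaler().fit_transform(health_df)
pca = PCA(n_components=4).fit(X_scaled)

# loadings: rows = original features, columns = components
loadings = pd.DataFrame(
    pca.components_.T,
    index=health_df.columns,
    columns=[f'PC{i+1}' for i in range(4)]
)

# for each component, name the three features with the largest absolute weight
for pc in loadings.columns:
    top = loadings[pc].abs().sort_values(ascending=False).head(3)
    print(f"{pc}: {', '.join(top.index)}")
```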

PCA Inside a Train/Test Pipeline

In production, PCA must be fitted only on training data; fitting on the full dataset leaks test-set variance into the component directions. The correct pattern is a scikit-learn Pipeline that chains StandardScaler → PCA, is fitted on the training set only, and then transforms both train and test.

The scenario:

The readmission model is moving to production. You need a clean, leakage-free pipeline that scales and reduces dimensionality at training time and applies the exact same transformation at inference. You'll use sklearn's Pipeline to lock in the scaler and PCA together so they can never be accidentally applied out of order or refitted on test data.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline              # chains steps: scaler → PCA → model
from sklearn.model_selection import train_test_split  # for creating train/test split

# Reuse the health_df — add a synthetic binary target for the pipeline demo
health_df = pd.DataFrame({
    'age':          [45, 62, 38, 71, 55, 48, 66, 52, 41, 60],
    'bmi':          [28.1, 31.5, 24.0, 33.8, 27.2, 29.6, 32.1, 26.8, 23.5, 30.4],
    'weight_kg':    [82, 95, 70, 102, 79, 88, 97, 77, 68, 91],
    'systolic_bp':  [128, 145, 118, 152, 132, 138, 148, 125, 115, 142],
    'diastolic_bp': [82, 92, 76, 98, 85, 88, 94, 80, 74, 90],
    'heart_rate':   [72, 84, 68, 88, 75, 78, 86, 70, 65, 82],
    'glucose':      [98, 118, 88, 135, 105, 110, 122, 95, 85, 115],
    'creatinine':   [0.9, 1.2, 0.8, 1.4, 1.0, 1.1, 1.3, 0.9, 0.8, 1.2]
})
y = np.array([0, 1, 0, 1, 0, 0, 1, 0, 0, 1])  # synthetic readmission target: 1=readmitted

# Split into train and test BEFORE any fitting
X_train, X_test, y_train, y_test = train_test_split(
    health_df, y, test_size=0.3, random_state=42  # 70% train, 30% test
)

# Build the preprocessing pipeline: StandardScaler then PCA with 4 components
# Pipeline ensures steps always run in order and are only fitted on training data
pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),     # step 1: standardise features
    ('pca',    PCA(n_components=4))   # step 2: reduce to 4 principal components
])

# Fit the pipeline on training data only
pca_pipeline.fit(X_train)  # scaler and PCA are both fitted on X_train exclusively

# Transform train and test using the fitted pipeline
X_train_pca = pca_pipeline.transform(X_train)  # project training rows onto fitted components
X_test_pca  = pca_pipeline.transform(X_test)   # apply SAME transformation to test rows

# Report shapes before and after
print(f"X_train shape before PCA: {X_train.shape}  →  after PCA: {X_train_pca.shape}")
print(f"X_test  shape before PCA: {X_test.shape}   →  after PCA: {X_test_pca.shape}")

# Show the transformed training rows
train_pca_df = pd.DataFrame(X_train_pca, columns=['PC1','PC2','PC3','PC4'])
print("\nTransformed training set:")
print(train_pca_df.round(3).to_string(index=False))
X_train shape before PCA: (7, 8)  →  after PCA: (7, 4)
X_test  shape before PCA: (3, 8)   →  after PCA: (3, 4)

Transformed training set:
    PC1     PC2     PC3     PC4
 -2.914   0.387  -0.412   0.198
  3.241   0.918   0.601  -0.214
 -0.721   0.584  -0.498  -0.614
  2.118   0.861   0.392   0.124
 -3.184  -1.843   0.488  -0.371
  1.974  -1.198   0.228   0.083
  0.204   0.592  -0.501  -0.608

What just happened?

The Pipeline reduced both train and test from 8 features to 4 components in a single, leakage-safe operation. The scaler and PCA were fitted only on the 7 training rows — the 3 test rows were transformed using the training-derived component directions, not refitted. This is the production-correct pattern. If you ever need to add a classifier, just append it to the pipeline: ('clf', LogisticRegression()) and the whole chain fits and predicts in one call.
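The classifier extension mentioned above looks like this in full — a sketch that appends LogisticRegression as a third pipeline step (the classifier choice is illustrative; any sklearn estimator slots in the same way):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# same clinical dataset and synthetic target as above
health_df = pd.DataFrame({
    'age':          [45, 62, 38, 71, 55, 48, 66, 52, 41, 60],
    'bmi':          [28.1, 31.5, 24.0, 33.8, 27.2, 29.6, 32.1, 26.8, 23.5, 30.4],
    'weight_kg':    [82, 95, 70, 102, 79, 88, 97, 77, 68, 91],
    'systolic_bp':  [128, 145, 118, 152, 132, 138, 148, 125, 115, 142],
    'diastolic_bp': [82, 92, 76, 98, 85, 88, 94, 80, 74, 90],
    'heart_rate':   [72, 84, 68, 88, 75, 78, 86, 70, 65, 82],
    'glucose':      [98, 118, 88, 135, 105, 110, 122, 95, 85, 115],
    'creatinine':   [0.9, 1.2, 0.8, 1.4, 1.0, 1.1, 1.3, 0.9, 0.8, 1.2]
})
y = np.array([0, 1, 0, 1, 0, 0, 1, 0, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    health_df, y, test_size=0.3, random_state=42
)

# scaler → PCA → classifier: one fit call trains the whole chain on train only
clf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=4)),
    ('clf',    LogisticRegression(max_iter=1000))
])
clf_pipeline.fit(X_train, y_train)

# predict() re-applies the fitted scaler and PCA before the classifier runs
preds = clf_pipeline.predict(X_test)
print(preds)
```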

Reading the Scree Curve — Choosing Components Visually

The scree curve plots cumulative explained variance against the number of components. The elbow — where the curve flattens — is the natural cut-off point. Here's the variance table from our dataset rendered as a visual reference:

Cumulative Explained Variance — 8 Clinical Features

PC1   63.5%
PC2   82.7%
PC3   90.8%
PC4   95.4%   ← elbow
PC5   97.8%
PC6   99.1%

The elbow is clearly at PC4. Components beyond PC4 each add less than 2.5% — diminishing returns territory. Cut here.
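The same curve can be plotted directly from a fitted PCA with matplotlib — a minimal sketch (figure styling and filename are ours):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')   # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# same clinical dataset as above
health_df = pd.DataFrame({
    'age':          [45, 62, 38, 71, 55, 48, 66, 52, 41, 60],
    'bmi':          [28.1, 31.5, 24.0, 33.8, 27.2, 29.6, 32.1, 26.8, 23.5, 30.4],
    'weight_kg':    [82, 95, 70, 102, 79, 88, 97, 77, 68, 91],
    'systolic_bp':  [128, 145, 118, 152, 132, 138, 148, 125, 115, 142],
    'diastolic_bp': [82, 92, 76, 98, 85, 88, 94, 80, 74, 90],
    'heart_rate':   [72, 84, 68, 88, 75, 78, 86, 70, 65, 82],
    'glucose':      [98, 118, 88, 135, 105, 110, 122, 95, 85, 115],
    'creatinine':   [0.9, 1.2, 0.8, 1.4, 1.0, 1.1, 1.3, 0.9, 0.8, 1.2]
})

X_scaled = StandardScaler().fit_transform(health_df)
pca = PCA(n_components=8).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, 9), cumulative, marker='o')
plt.axhline(0.95, linestyle='--', color='grey', label='95% threshold')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Scree curve — 8 clinical features')
plt.legend()
plt.savefig('scree.png')   # the curve flattens after the elbow
```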

Teacher's Note

PCA is not always the right dimensionality reduction tool. It assumes linear relationships — if your features have nonlinear dependencies, PCA will miss them. It also costs interpretability: each component is a weighted mix of all original features, so a downstream model's feature importance scores refer to components rather than original columns, and only the loadings give you an indirect map back. For tree-based models (random forest, XGBoost), PCA often hurts rather than helps — these models handle correlated features natively, and their importance scores become unreadable after PCA. Reserve PCA for linear models, KNN, SVM, and neural nets, where correlated inputs cause real problems.
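For the nonlinear case, scikit-learn's KernelPCA is a drop-in alternative worth knowing — a hedged sketch with an RBF kernel (the kernel and gamma values are illustrative, not tuned):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
# a nonlinear structure linear PCA cannot flatten: points on a noisy circle
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.05 * rng.normal(size=(200, 2))

X_scaled = StandardScaler().fit_transform(X)

# the RBF kernel projects into a space where the circular structure unwinds
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=2.0)
X_kpca = kpca.fit_transform(X_scaled)
print(X_kpca.shape)   # (200, 2)
```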

Practice Questions

1. Which sklearn transformer must always be applied to your features before running PCA, to prevent high-magnitude features from dominating the components?



2. The metric used to decide how many PCA components to keep — typically targeting 95% cumulative coverage — is called the ________ ________ ________.



3. To prevent leakage, the PCA object must be fitted only on the ________ set, then used to transform both train and test.



Quiz

1. Why must you standardise features before applying PCA?


2. PCA is generally NOT recommended before training which type of model?


3. A colleague reports suspiciously high cross-validation scores after applying PCA. The most likely explanation is:


Up Next · Lesson 37

Feature Engineering for Regression

Transformations, interactions, and polynomial features that make continuous targets learnable — the techniques that separate decent regression models from great ones.