Principal Component Analysis (PCA)
In the previous lesson, we introduced the idea of dimensionality reduction and saw why reducing the number of features can make machine learning models simpler, faster, and more reliable.
In this lesson, we take that idea and apply it using a concrete mathematical technique called Principal Component Analysis, commonly known as PCA.
PCA is one of the most important tools in machine learning and data science. It helps us transform complex, high-dimensional data into a smaller and more meaningful representation.
The Intuition Behind PCA
Imagine you are analyzing customer data with many features such as income, age, credit score, savings, and expenses.
Some of these features are closely related. For example, income and savings often move together.
PCA looks at the data as a whole and tries to find new directions that capture the maximum variation present in the dataset.
These new directions are called principal components. They are not original features but weighted combinations of them; for example, a component might look something like 0.7 × income + 0.7 × savings.
What PCA Actually Does
PCA rotates the coordinate system of the data so that the first axis captures the largest possible variance.
The second axis captures the next largest variance while remaining uncorrelated with (orthogonal to) the first.
By keeping only the top few principal components, we reduce dimensionality while retaining most of the information.
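To make the rotation idea concrete, here is a minimal NumPy sketch, using synthetic, made-up data, of what PCA computes under the hood: the eigenvectors of the covariance matrix are the principal components, and the eigenvalues measure the variance captured along each one.
import numpy as np

# Synthetic example: two correlated features, e.g. income and savings
rng = np.random.default_rng(0)
income = rng.normal(50, 10, 200)
savings = 0.8 * income + rng.normal(0, 3, 200)
X = np.column_stack([income, savings])

# Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components;
# eigenvalues give the variance captured along each direction
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # largest variance first
print(eigenvectors[:, order])                    # the new axes (directions)
print(eigenvalues[order] / eigenvalues.sum())    # share of variance per axis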
Using Our Dataset
We continue with the same dataset used throughout the module:
Dataplexa ML Housing & Customer Dataset
For PCA, we focus only on the input features and ignore the loan approval label, because PCA is an unsupervised technique that never looks at the target.
Preparing the Data
PCA is sensitive to feature scale. If features are not standardized, those measured on larger numeric ranges (such as income in dollars) dominate the variance, and PCA will produce misleading results.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset and keep only the input features
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
X = df.drop("loan_approved", axis=1)

# Standardize every feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
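As a quick sanity check, a small sketch continuing from the code above: every standardized feature should now have a mean of roughly 0 and a standard deviation of roughly 1.
import numpy as np

print(np.round(X_scaled.mean(axis=0), 2))  # approximately 0 for every feature
print(np.round(X_scaled.std(axis=0), 2))   # approximately 1 for every feature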
Applying PCA
Now we apply PCA and reduce the dataset to two principal components. This allows us to visualize complex data in two dimensions.
# Project the standardized data onto the top two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
Each row in X_pca now represents a customer in a reduced two-dimensional space.
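To actually see the reduced data, we can plot the two components against each other. The sketch below assumes matplotlib is installed and that the loan_approved column is encoded numerically (for example 0/1), which lets us color each customer by the label purely for illustration:
import matplotlib.pyplot as plt

# Scatter plot of customers in the two-dimensional PCA space
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df["loan_approved"], cmap="coolwarm", s=15)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("Customers projected onto two principal components")
plt.show()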
Understanding Explained Variance
PCA provides information about how much variance each principal component captures.
# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
If the first two components capture most of the variance, the two-dimensional representation retains most of the information in the data. If not, more components may be required.
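In practice, the easiest way to judge this is the cumulative sum, which shows how much total variance the first k components retain together (a small sketch continuing from the code above):
import numpy as np

# Running total: variance retained by the first k components combined
print(np.cumsum(pca.explained_variance_ratio_))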
Why PCA Is Useful in Practice
PCA is commonly used before applying machine learning algorithms to reduce noise and computation cost.
It also helps with visualization and understanding hidden patterns inside the data.
In finance, PCA is used to detect risk factors. In marketing, it helps analyze customer behavior.
Mini Practice
Try changing the number of components from 2 to 3 and observe how the explained variance changes.
This experiment helps you understand the trade-off between simplicity and information retention.
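If you want a starting point, here is one way to run the experiment, reusing the X_scaled array prepared earlier:
# Compare how much variance is retained with 2 versus 3 components
for k in (2, 3):
    pca_k = PCA(n_components=k)
    pca_k.fit(X_scaled)
    total = pca_k.explained_variance_ratio_.sum()
    print(f"{k} components retain {total:.1%} of the total variance")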
Exercises
Exercise 1:
Why must data be standardized before PCA?
Exercise 2:
Does PCA use the target variable?
Quick Quiz
Q1. Does PCA always improve model accuracy?
In the next lesson, we will study Feature Selection and learn how to choose the most important features directly.