ML Lesson 28 – PCA (Principal Component Analysis) | Dataplexa

Principal Component Analysis (PCA)

In the previous lesson, we introduced the idea of dimensionality reduction and saw why reducing the number of features can make machine learning models simpler, faster, and more reliable.

In this lesson, we take that idea and apply it using a concrete mathematical technique called Principal Component Analysis, commonly known as PCA.

PCA is one of the most important tools in machine learning and data science. It helps us transform complex, high-dimensional data into a smaller and more meaningful representation.


The Intuition Behind PCA

Imagine you are analyzing customer data with many features such as income, age, credit score, savings, and expenses.

Some of these features are closely related. For example, income and savings often move together.

PCA looks at the data as a whole and tries to find new directions that capture the maximum variation present in the dataset.

These new directions are called principal components. They are not original features, but combinations of them.
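For example, a first principal component might be a weighted sum such as 0.6 × income + 0.6 × savings + 0.4 × credit score. The weights here are invented purely to show the shape of a component; PCA computes the actual weights from the data.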


What PCA Actually Does

PCA rotates the coordinate system of the data so that the first axis captures the largest possible variance.

The second axis captures the next largest variance while staying orthogonal to (uncorrelated with) the first.

By keeping only the top few principal components, we reduce dimensionality while retaining most of the information.
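To make the rotation idea concrete, here is a minimal NumPy sketch on a tiny made-up matrix with two correlated features. The numbers are invented purely for illustration; scikit-learn performs the same computation for us in the sections below.

import numpy as np

# Five rows, two correlated features (think income and savings)
X_demo = np.array([[50, 20],
                   [60, 25],
                   [80, 34],
                   [90, 38],
                   [100, 43]], dtype=float)

# Step 1: center each feature at zero
X_centered = X_demo - X_demo.mean(axis=0)

# Step 2: covariance matrix of the centered data
cov = np.cov(X_centered, rowvar=False)

# Step 3: eigenvectors of the covariance matrix are the principal
# components; eigenvalues measure the variance along each one
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("variance along each component:", eigenvalues)
print("first principal component (direction):", eigenvectors[:, 0])

# Projecting onto the first component rotates the data so that a
# single axis captures almost all of the variation
X_projected = X_centered @ eigenvectors[:, :1]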


Using Our Dataset

We continue working with the same dataset used throughout this module.

Dataplexa ML Housing & Customer Dataset

For PCA, we focus only on the input features and ignore the loan approval label.


Preparing the Data

PCA is sensitive to feature scale. Because it maximizes variance, features measured on larger numeric ranges would dominate the components, so we standardize the data first.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# PCA is unsupervised, so we drop the loan approval label
X = df.drop("loan_approved", axis=1)

# Standardize every feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Applying PCA

Now we apply PCA and reduce the dataset to two principal components. This allows us to visualize complex data in two dimensions.

# Project the standardized features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Each row in X_pca now represents a customer in a reduced two-dimensional space.
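To actually see this two-dimensional representation, a quick scatter plot works. This snippet assumes matplotlib is installed.

import matplotlib.pyplot as plt

plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.6)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("Customers projected onto the first two principal components")
plt.show()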


Understanding Explained Variance

PCA provides information about how much variance each principal component captures.

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)

If the first two components capture most of the variance, then reducing dimensions is safe.

If not, more components may be required.
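A common way to decide how many components to keep is to look at the cumulative explained variance. The 95% threshold below is only a conventional choice, not a rule.

import numpy as np

# Fit PCA with all components and accumulate the variance ratios
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print("components needed for 95% of the variance:", n_components_95)

# scikit-learn can also do this directly: PCA(n_components=0.95)
# keeps just enough components to reach that fraction of variance.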


Why PCA Is Useful in Practice

PCA is commonly used before applying machine learning algorithms to reduce noise and computation cost.

It also helps with visualization and understanding hidden patterns inside the data.

In finance, PCA is used to detect risk factors. In marketing, it helps analyze customer behavior.
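As a sketch of the first use case, PCA often sits inside a preprocessing pipeline just before a model. The logistic regression below is only a placeholder, and we assume loan_approved from our dataset is the prediction target.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

y = df["loan_approved"]

model = make_pipeline(
    StandardScaler(),          # scale first, since PCA needs it
    PCA(n_components=2),       # then reduce the dimensionality
    LogisticRegression()       # finally fit the classifier
)
model.fit(X, y)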


Mini Practice

Try changing the number of components from 2 to 3 and observe how the explained variance changes.

This experiment helps you understand the trade-off between simplicity and information retention.
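One way to run this experiment, reusing X_scaled from earlier:

for k in (2, 3):
    pca_k = PCA(n_components=k)
    pca_k.fit(X_scaled)
    retained = pca_k.explained_variance_ratio_.sum()
    print(f"{k} components retain {retained:.1%} of the total variance")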


Exercises

Exercise 1:
Why must data be standardized before PCA?

Because PCA is based on variance, features with larger numeric ranges would dominate the components and distort the results if left unscaled.
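You can check this yourself with a small comparison, reusing X and X_scaled from earlier:

pca_raw = PCA(n_components=2).fit(X)          # unscaled features
pca_std = PCA(n_components=2).fit(X_scaled)   # standardized features

print(pca_raw.explained_variance_ratio_)      # typically dominated by the largest-scale feature
print(pca_std.explained_variance_ratio_)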

Exercise 2:
Does PCA use the target variable?

No. PCA is an unsupervised technique and ignores labels.

Quick Quiz

Q1. Does PCA always improve model accuracy?

No. PCA reduces dimensions but may remove useful information if used improperly.

In the next lesson, we will study Feature Selection and learn how to choose the most important features directly.