AI Lesson 35 – Dimensionality Reduction (PCA, LDA) | Dataplexa

Dimensionality Reduction (PCA & LDA)

Dimensionality Reduction is the process of reducing the number of input features while preserving as much important information as possible. It is widely used to simplify models, improve performance, and make data easier to visualize.

As datasets grow, so does the number of features. Beyond a certain point, extra features tend to hurt rather than help: models train more slowly, pick up noise, and become harder to interpret. Dimensionality Reduction addresses this problem.

Why Do We Need Dimensionality Reduction?

High-dimensional data creates several challenges:

  • Slower model training
  • Higher memory usage
  • Overfitting due to noise
  • Difficult visualization and interpretation

Dimensionality Reduction helps by keeping only the most informative parts of the data.

Real-World Example

Imagine judging a student based on 100 test scores. Many tests may measure similar skills. Instead of evaluating all 100 scores, you summarize them into key abilities like math, logic, and communication.

This summarization is exactly what dimensionality reduction does to data.

Types of Dimensionality Reduction

There are two main approaches:

  • Feature Selection: Choosing the most important existing features
  • Feature Extraction: Creating new features that summarize existing ones

PCA and LDA are popular feature extraction techniques.
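
To make the two approaches concrete, here is a minimal sketch on the Iris dataset (assuming scikit-learn is installed). SelectKBest stands in for feature selection, keeping two of the original columns, while PCA stands in for feature extraction, creating two new columns that summarize all four.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Load data
X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 most informative original features
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features that summarize all 4 originals
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)

(150, 2) (150, 2)

Both results have two columns, but the selected features are a subset of the originals, while the extracted features are new combinations of them.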

Principal Component Analysis (PCA)

PCA is an unsupervised technique that transforms data into a new coordinate system where:

  • The first component captures the most variance
  • Each subsequent component captures as much of the remaining variance as possible
  • Components are uncorrelated with one another

PCA focuses on preserving as much variance (information) as possible; it does not use class labels.

PCA Example


from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)

# Apply PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
  
(150, 2)

The original dataset had 4 features. PCA reduced it to 2 while preserving most of the variance.
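
How much variance is actually preserved? The fitted PCA object reports this through its explained_variance_ratio_ attribute. Continuing from the example above:

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)

# Total fraction of variance retained by the 2 components
print(pca.explained_variance_ratio_.sum())

For the Iris data, the first two components typically retain well over 90% of the total variance, so very little information is lost.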

Understanding PCA Output

PCA does not care about class labels. It only tries to capture directions where data varies the most.

This makes PCA excellent for visualization and noise reduction.
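
To illustrate the visualization use case, the two components computed above can be plotted directly. This sketch assumes matplotlib is available; the class labels are used only to color the points, not to compute the projection.

import matplotlib.pyplot as plt

# Scatter plot of the 2D projection, colored by class for readability
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Iris projected onto 2 principal components")
plt.show()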

Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality reduction technique. Unlike PCA, it uses class labels to maximize separation between classes.

LDA focuses on finding feature combinations that best distinguish different categories.

LDA Example


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)

# Apply LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)
  
(150, 2)

LDA reduces features while maximizing class separation, making it useful for classification tasks.
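
In practice, LDA is usually fitted on the training data only, and the reduced features are then passed to a classifier. The following is a quick sketch of that workflow; LogisticRegression is just an illustrative choice and any classifier could replace it.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split first so the test data never influences the LDA fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit LDA on the training set, then apply the same projection to the test set
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)

# Train and evaluate a classifier on the 2 LDA features
clf = LogisticRegression(max_iter=200)
clf.fit(X_train_lda, y_train)
print(clf.score(X_test_lda, y_test))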

PCA vs LDA

  • PCA: Unsupervised; maximizes variance; typically used for visualization and noise reduction
  • LDA: Supervised; maximizes class separation; typically used to improve classification (see the comparison sketch after this list)
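
The practical difference can be seen by training the same classifier on each type of reduced features. The sketch below wraps each reduction in a pipeline so it is refit on every cross-validation fold; the exact scores will vary, and the point is only to show how the comparison is set up.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Classifier on 2 PCA components (labels ignored during reduction)
pca_pipe = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=200))
print(cross_val_score(pca_pipe, X, y, cv=5).mean())

# Classifier on 2 LDA components (labels used during reduction)
lda_pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=2), LogisticRegression(max_iter=200))
print(cross_val_score(lda_pipe, X, y, cv=5).mean())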

When Should You Use Dimensionality Reduction?

  • Before clustering or classification (see the pipeline sketch after this list)
  • When features are highly correlated
  • When training time is high
  • When visualizing high-dimensional data
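
As an example of the first point above, a typical workflow is to standardize the features, reduce them with PCA, and only then run a clustering algorithm. This is a sketch that assumes k-means with 3 clusters as the downstream task; the individual steps can be swapped freely.

from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize -> reduce to 2 dimensions -> cluster
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)

labels = pipeline.fit_predict(X)
print(labels[:10])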

Practice Questions

Practice 1: What is the main goal of dimensionality reduction?



Practice 2: PCA belongs to which learning type?



Practice 3: What does LDA try to maximize?



Quick Quiz

Quiz 1: Which technique ignores class labels?





Quiz 2: Which technique is supervised?





Quiz 3: What does dimensionality reduction primarily help reduce?





Coming up next: Feature Engineering — transforming raw data into meaningful features for AI models.