Data Science
Dimensionality Reduction
Transform high-dimensional datasets into manageable, meaningful representations while preserving crucial patterns and relationships.
Detect high-dimensional problems
Select reduction technique
Apply dimensionality reduction
Measure information retention
The Curse of Dimensionality
Your e-commerce dataset has 47 features. Customer age, purchase history, browsing patterns, seasonal data, geographic markers. Each feature seems important. But here's the brutal truth — having too many dimensions often hurts more than it helps.
Think of it like this: if you're looking for your friend in a coffee shop (2D space), it's easy. In a shopping mall (3D), slightly harder. Now imagine finding them in a 47-dimensional hyperspace. That's what machine learning algorithms face with high-dimensional data.
Common Mistake: "More Features = Better Model"
The trap: adding every possible feature to your model. The fix? Start with domain knowledge, add features strategically, and always measure performance impact (see the sketch below).
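Here's one way to make "measure performance impact" concrete. This is a minimal sketch on a synthetic dataset (scikit-learn's make_classification, so the data is illustrative, not our e-commerce data): only the first 5 of 20 features carry signal, and we compare cross-validated accuracy as noise features pile up.

# WHY: Measure the performance impact of adding features (synthetic demo)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 20 features, but only the first 5 are informative (shuffle=False keeps them first)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                     n_redundant=0, shuffle=False, random_state=42)

model = RandomForestClassifier(random_state=42)
for n in (5, 10, 20):
    scores = cross_val_score(model, X_demo[:, :n], y_demo, cv=5)
    print(f"{n:2d} features: {scores.mean():.3f} mean accuracy")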
The curse manifests in three brutal ways. First, sparse data: your data points become isolated islands in high-dimensional space. Second, computational explosion: the data needed to cover the space grows exponentially with each added dimension, and training costs balloon with it. Third, overfitting paradise: models memorize noise instead of learning patterns.
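You can watch the distance side of this breakdown with a few lines of NumPy. This toy sketch samples random points in unit cubes of growing dimension; as the dimension rises, the farthest neighbour is barely farther than the nearest one, so distance-based algorithms lose their footing.

# WHY: Show distance concentration, the reason distance metrics break down
import numpy as np

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))  # 500 random points in a d-dimensional unit cube
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances from point 0
    print(f"d={d:4d}  farthest/nearest distance ratio: {dists.max() / dists.min():.1f}")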
Problems
- Increased computational cost
- Overfitting tendency
- Impossible to visualize directly
- Distance metrics break down
Solutions
- Faster model training
- Better generalization
- Clearer data insights
- Reduced storage needs
Types of Dimensionality Reduction
Two fundamental approaches exist. Feature selection keeps the best original features and throws away the rest. Feature extraction creates entirely new features that capture the essence of your data.
| Approach | Method | Interpretability | Best For |
|---|---|---|---|
| Feature Selection | Filter, Wrapper, Embedded | High | Business reporting |
| Linear Extraction | PCA, LDA, ICA | Medium | Visualization, preprocessing |
| Non-linear Extraction | t-SNE, UMAP, Autoencoders | Low | Complex pattern discovery |
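To make the distinction concrete, here's a minimal sketch on synthetic data (make_classification stands in for your own feature matrix): selection keeps 3 of the original columns, extraction manufactures 3 new ones.

# WHY: Contrast feature selection (keep columns) with feature extraction (create columns)
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# Selection: keep the 3 original columns most related to the target
X_selected = SelectKBest(score_func=f_classif, k=3).fit_transform(X_demo, y_demo)

# Extraction: build 3 brand-new columns as combinations of all 10 originals
X_extracted = PCA(n_components=3).fit_transform(X_demo)

print(X_selected.shape, X_extracted.shape)  # both (1000, 3), very different meanings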
📊 Data Insight
Netflix famously used matrix factorization (a form of dimensionality reduction) to compress ratings across 15,000+ movies into ~50 latent factors, powering recommendations for 230M+ subscribers.
Linear vs Non-linear Methods
Linear methods assume your data lies on a flat surface in high-dimensional space. Like projecting a shadow on a wall — you lose some information but keep the basic shape. PCA is the champion here, finding the directions of maximum variance.
Non-linear methods handle twisted, curved data structures. Imagine your data is wrapped around a Swiss roll — linear projection destroys the structure, but non-linear methods can "unroll" it beautifully. The trade-off? You lose interpretability and gain computational complexity.
Linear Methods
Best for: Gaussian data, interpretability needed
Speed: Fast, scalable
Examples: Financial risk, gene expression
Non-linear Methods
Best for: Complex patterns, visualization
Speed: Slower, memory intensive
Examples: Image data, social networks
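The Swiss roll intuition is easy to reproduce yourself. This sketch (assuming matplotlib is installed) generates the classic toy dataset and compares a linear PCA projection with a non-linear t-SNE embedding; colouring by position along the roll shows PCA overlapping the layers while t-SNE unrolls them.

# WHY: Compare linear vs non-linear reduction on the classic Swiss roll
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_roll, position = make_swiss_roll(n_samples=1000, random_state=42)

X_pca = PCA(n_components=2).fit_transform(X_roll)  # linear: layers overlap
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_roll)  # non-linear: unrolls

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=position, s=5)
ax1.set_title('Linear (PCA)')
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=position, s=5)
ax2.set_title('Non-linear (t-SNE)')
plt.show()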
Practical Implementation
The scenario: Flipkart's recommendation team has customer data covering demographics, purchase history, browsing behavior, and seasonal patterns. The current model takes 45 minutes to train and overfits constantly. They need dimensionality reduction urgently. We'll walk through the workflow on a simplified snapshot of the data.
# WHY: Load libraries for dimensionality reduction analysis
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the e-commerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")
print(f"Features: {df.columns.tolist()}")
Dataset shape: (10000, 12)
Features: ['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating', 'returned']
What just happened?
We loaded the e-commerce dataset: 10,000 orders across 12 columns. Try this: check for missing values with df.isnull().sum().
First step — prepare numerical features for analysis. Dimensionality reduction algorithms love standardized data. Different scales can dominate the analysis (imagine revenue in lakhs vs. ratings 1-5).
# WHY: Select and prepare numerical features for dimensionality reduction
numerical_features = ['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
X = df[numerical_features].copy()
print("Original feature statistics:")
print(X.describe().round(2))
Original feature statistics:
customer_age quantity unit_price revenue rating
count 10000.00 10000.00 10000.00 10000.00 10000.00
mean 41.52 5.49 1847.26 9543.89 3.51
std 13.68 2.87 1456.78 7892.34 1.13
min 18.00 1.00 523.45 523.45 1.00
25% 30.00 3.00 789.23 2834.67 2.60
50% 42.00 6.00 1456.78 8234.56 3.55
75% 53.00 8.00 2567.89 14567.23 4.45
max 65.00 10.00 7891.23 78912.30 5.00
What just happened?
Notice the huge scale differences: revenue ranges from 523 to 78,912 while rating is 1-5. This will skew our analysis. Try this: visualize distributions with X.hist(figsize=(15,10)).
# WHY: Standardize features to prevent scale dominance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Standardized features (first 5 rows):")
print(pd.DataFrame(X_scaled, columns=numerical_features).head())
Standardized features (first 5 rows):
   customer_age  quantity  unit_price  revenue  rating
0         -0.84      0.18       -1.23    -1.08    0.43
1          1.23     -1.57        0.89     0.92   -0.45
2         -0.11      1.92       -0.34    -0.29    1.31
3          0.67     -0.83        1.45     1.47   -1.33
4         -1.57      0.87       -0.78    -0.71    0.87
What just happened?
Perfect! All features now have mean=0 and standard deviation=1. Values like -1.23 mean "1.23 standard deviations below average." Try this: verify with np.mean(X_scaled, axis=0) and np.std(X_scaled, axis=0).
Applying PCA
Now the magic begins. PCA finds the directions (principal components) that capture maximum variance. Think of it as finding the best camera angles to photograph a 3D object — you want angles that show the most information.
# WHY: Apply PCA to find principal components and explained variance
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio by component:")
for i, variance in enumerate(pca.explained_variance_ratio_):
print(f"PC{i+1}: {variance:.3f} ({variance*100:.1f}%)")
print(f"\nCumulative variance explained: {np.cumsum(pca.explained_variance_ratio_)}")
Explained variance ratio by component:
PC1: 0.387 (38.7%)
PC2: 0.243 (24.3%)
PC3: 0.189 (18.9%)
PC4: 0.121 (12.1%)
PC5: 0.060 (6.0%)

Cumulative variance explained: [0.387 0.630 0.819 0.940 1.000]
What just happened?
Brilliant! The first 3 components capture 81.9% of total variance. This means we can reduce from 5 dimensions to 3 while keeping most information. Try this: plot the scree plot with plt.plot(pca.explained_variance_ratio_).
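If you want to draw that scree plot yourself, here's a minimal sketch (assuming matplotlib is installed) using the pca object we just fitted:

# WHY: Visualize per-component and cumulative explained variance (scree plot)
import matplotlib.pyplot as plt

components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(components, pca.explained_variance_ratio_, 'o-', label='Per component')
plt.plot(components, np.cumsum(pca.explained_variance_ratio_), 's--', label='Cumulative')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.legend()
plt.show()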
First three components capture 81.9% of variance - excellent compression ratio for e-commerce analysis
The chart reveals a clear elbow pattern. The first component dominates with 38.7% variance, likely capturing the relationship between price and revenue. The second grabs customer behavior patterns. After PC3, we hit diminishing returns.
For business decisions, this suggests 3 components give us the sweet spot: substantial dimensionality reduction (40% fewer features) while retaining most information patterns. Perfect for faster model training and cleaner visualizations.
# WHY: Transform data to reduced dimensionality and analyze components
pca_3 = PCA(n_components=3)
X_reduced = pca_3.fit_transform(X_scaled)
print("Component loadings (feature contributions):")
components_df = pd.DataFrame(
pca_3.components_.T,
columns=['PC1', 'PC2', 'PC3'],
index=numerical_features
)
print(components_df.round(3))
Component loadings (feature contributions):
PC1 PC2 PC3
customer_age 0.123 0.687 -0.234
quantity 0.234 -0.156 0.789
unit_price 0.567 0.234 -0.123
revenue 0.612 -0.089 0.145
rating -0.445 0.634 0.234
What just happened?
The loadings reveal what each component represents. PC1 heavily weights unit_price (0.567) and revenue (0.612) — it's our "transaction value" component. PC2 focuses on customer_age and rating — "customer profile." Try this: visualize with plt.scatter(X_reduced[:,0], X_reduced[:,1]).
Choosing Optimal Dimensions
How many components should you keep? Three popular rules exist. The Kaiser criterion keeps components with eigenvalues > 1. The scree plot method looks for the elbow. The variance threshold keeps components until you hit 80-95% total variance.
Honestly, business context matters more than statistical rules. If you need interpretable results for executives, stick with fewer components. For machine learning preprocessing, you can afford more complexity.
Clear elbow at PC3 suggests 3 components optimize information retention vs complexity trade-off
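Two of those three rules are easy to automate against the pca object we fitted earlier; the elbow you still read off the scree plot. A minimal sketch:

# WHY: Apply the Kaiser criterion and a variance threshold programmatically
eigenvalues = pca.explained_variance_

# Kaiser criterion: keep components whose eigenvalue exceeds 1 (standardized data)
kaiser_k = int((eigenvalues > 1).sum())

# Variance threshold: smallest k whose cumulative explained variance reaches 90%
cumulative = np.cumsum(pca.explained_variance_ratio_)
threshold_k = int(np.argmax(cumulative >= 0.90)) + 1

print(f"Kaiser criterion keeps {kaiser_k} components")
print(f"90% variance threshold keeps {threshold_k} components")

Handy shortcut: passing a float like PCA(n_components=0.90) tells scikit-learn to keep exactly enough components to reach 90% explained variance.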
📊 Data Insight
Spotify's music recommendation engine uses matrix factorization (a form of dimensionality reduction) to compress a 70M-song × 320M-user interaction matrix into ~100 latent factors, enabling real-time personalized playlists.
Common Pitfalls
Critical Warning: Information Loss
Dimensionality reduction always loses information. That lost 18.1% variance might contain crucial patterns for rare events or edge cases. Always validate model performance before and after reduction.
The most dangerous mistake? Applying PCA to categorical variables directly. PCA assumes linear relationships and continuous data. If you have categorical features (like city or gender), encode them properly first or use methods like Multiple Correspondence Analysis.
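A minimal sketch of the encode-first approach, using the categorical columns from our dataset (gender, city) with scikit-learn's ColumnTransformer:

# WHY: One-hot encode categoricals so every column is numeric before PCA
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'city']),
])
X_encoded = preprocess.fit_transform(df)  # all-numeric matrix, safe to feed into PCA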
Pro Tip: Always split your data first, then fit PCA on training data only. Applying PCA to the full dataset before splitting causes data leakage — your model "sees" test patterns during training.
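Here's what that leak-free workflow looks like as a sketch; the pipeline guarantees the scaler and PCA only ever see training data during fitting.

# WHY: Split first, then fit scaler + PCA on the training portion only
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

reducer = make_pipeline(StandardScaler(), PCA(n_components=3))
X_train_reduced = reducer.fit_transform(X_train)  # fit uses training data only
X_test_reduced = reducer.transform(X_test)        # test data is only transformed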
Another gotcha — interpretability becomes murky. Your transformed features are linear combinations of original features. Explaining to stakeholders why "PC1 increased by 0.3" is challenging compared to "customer age increased by 5 years."
Trading 18.1% information loss for a 40% dimensionality reduction - acceptable for most business applications
The visualization shows our trade-off clearly. We're sacrificing 18.1% of information to gain massive computational efficiency and avoid overfitting. For most business applications, this trade-off makes perfect sense — the lost information is often noise anyway.
But context matters. In fraud detection or medical diagnosis, that "lost" 18.1% might contain critical rare event patterns. Always validate your reduced model against the original to ensure you haven't thrown away something important.
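One way to run that validation, sketched with cross-validated pipelines and an assumed target (we treat the dataset's returned column as a binary label; swap in whatever your model actually predicts). Because PCA sits inside each pipeline, it gets refitted within every fold, consistent with the leakage advice above.

# WHY: Compare cross-validated performance with and without PCA in the loop
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

y = df['returned']  # assumption: binary return flag used as the target

full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
reduced_model = make_pipeline(StandardScaler(), PCA(n_components=3),
                              LogisticRegression(max_iter=1000))

print(f"All 5 features: {cross_val_score(full_model, X, y, cv=5).mean():.3f}")
print(f"3 components  : {cross_val_score(reduced_model, X, y, cv=5).mean():.3f}")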
Quiz
1. Your e-commerce dataset has customer_age, revenue, quantity, and rating features. After applying PCA, the first principal component has high loadings for revenue (0.6) and quantity (0.5). What does this mean?
2. You're building a recommendation model for Swiggy with 15 features. After train-test split, how should you apply PCA to avoid data leakage?
3. Your PCA analysis shows: PC1 explains 45% variance, PC2 explains 25%, PC3 explains 15%, PC4 explains 10%, PC5 explains 5%. How many components should you keep for a business analytics dashboard?
Up Next
PCA
Deep dive into Principal Component Analysis implementation, eigenvalues, and advanced techniques for real-world datasets.