Data Science
PCA
Transform high-dimensional ecommerce data into clear 2D visualizations that reveal hidden customer patterns and cut storage and compute by as much as 80%.
The Dimensionality Problem
Picture this: you're analyzing customer data with 15 different features — age, spending patterns, product preferences, ratings, purchase frequency. That's 15 dimensions of data. Honestly, our brains can't visualize beyond 3D. How do you spot patterns?
Principal Component Analysis (PCA) solves this by finding the most important directions in your data. Think of it like finding the best camera angle to photograph a complex sculpture — you want the angle that shows the most detail with the least loss.
Without PCA
15 features = impossible to visualize, slow algorithms, storage heavy
With PCA
2-3 components = clear visualization, fast processing, 80% less storage
The magic happens through linear combinations. PCA creates new features that are combinations of your original ones. The first principal component captures the maximum variance in your data. The second captures the maximum remaining variance. And so on.
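A component score is nothing more exotic than a dot product. In this toy sketch (all numbers hypothetical), each sample's PC1 value is a weighted sum of its original features:

```python
import numpy as np

# Toy standardized data: 4 samples x 2 features (hypothetical values)
X = np.array([[ 1.0,  0.5],
              [-1.0, -0.5],
              [ 0.5,  1.0],
              [-0.5, -1.0]])

# A hypothetical first principal direction: a unit-length weight vector
w1 = np.array([0.6, 0.8])

# Each sample's PC1 score is a linear combination of its original features
pc1_scores = X @ w1
print(pc1_scores)  # close to [1.0, -1.0, 1.1, -1.1]
```

Real PCA chooses the weight vector that maximizes the variance of these scores; here the weights are made up just to show the mechanics.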
Setting Up PCA Analysis
The scenario: Flipkart's analytics team needs to segment customers across multiple behavioral dimensions. They have 10+ features but need a 2D visualization for the board meeting tomorrow.
# Import required libraries for PCA analysis
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')

Output:
Dataset loaded successfully
Shape: (10000, 11)
Memory usage: 1.2 MB
What just happened?
We imported sklearn.decomposition.PCA for dimensionality reduction and StandardScaler for feature normalization. Try this: Check your dataset shape first — PCA works better with more samples than features.
Next, we need to prepare numerical features. PCA only works with numbers — no text columns allowed.
# Select numerical features for PCA analysis
numerical_features = ['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
# Create feature matrix with only numerical columns
X = df[numerical_features].copy()
# Check for missing values that could break PCA
print("Missing values per feature:")
print(X.isnull().sum())

Output:
Missing values per feature:
customer_age    0
quantity        0
unit_price      0
revenue         0
rating          0
dtype: int64
What just happened?
We selected 5 numerical features from our dataset and checked for missing values. X.isnull().sum() shows zero missing values — perfect for PCA. Try this: Always exclude categorical columns unless you've encoded them numerically first.
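If you don't want to list columns by hand, pandas can pick the numeric ones for you. A small sketch (the tiny DataFrame here is hypothetical, standing in for the real dataset):

```python
import pandas as pd

# Tiny hypothetical frame mixing numeric and text columns
df = pd.DataFrame({
    'customer_age': [25, 34],
    'product_category': ['Books', 'Food'],  # text: exclude or encode first
    'revenue': [1200.0, 540.0],
})

# select_dtypes keeps only numeric columns automatically
X = df.select_dtypes(include='number')
print(list(X.columns))  # → ['customer_age', 'revenue']
```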
Feature Standardization
Here's where most people mess up PCA. Features have different scales — customer_age ranges from 18 to 65, while revenue ranges from 500 to 200,000. Revenue will dominate the components simply because its numbers are bigger.
Common Mistake
Running PCA without standardization creates components that only reflect the feature with the largest scale. Always use StandardScaler first.
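You can see the damage directly. In this synthetic sketch (made-up data, two equally informative features), unscaled PCA assigns essentially all variance to the large-scale feature, while scaling restores the balance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two independent features with wildly different scales (synthetic data)
age = rng.uniform(18, 65, size=500)            # scale: tens
revenue = rng.uniform(500, 200_000, size=500)  # scale: tens of thousands
X = np.column_stack([age, revenue])

# Without scaling, PC1 is almost entirely the revenue axis
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
print(f"Unscaled: PC1 explains {raw_ratio[0]:.1%}")

# With scaling, variance is shared sensibly between both features
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_
print(f"Scaled:   PC1 explains {scaled_ratio[0]:.1%}")
```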
# Create StandardScaler to normalize all features
scaler = StandardScaler()
# Fit the scaler and transform features to mean=0, std=1
X_scaled = scaler.fit_transform(X)
# Check the scaling worked correctly
print("Original feature means:")
print(X.mean().round(2))
print("\nScaled feature means (should be ~0):")
print(X_scaled.mean(axis=0).round(2))

Output:
Original feature means:
customer_age      41.25
quantity           5.45
unit_price      3890.75
revenue        21245.60
rating             3.02
dtype: float64

Scaled feature means (should be ~0):
[-0. -0.  0.  0. -0.]
What just happened?
StandardScaler transformed all features to have mean ≈ 0 and standard deviation = 1. Notice how revenue went from an average of 21,245 to roughly 0. Try this: check that the scaled means are close to zero and the scaled standard deviations are close to one.
Running PCA Analysis
Now for the actual PCA magic. We'll start with 2 components to create a visualization, but first check how much variance each component explains.
# Create PCA object with 2 components for visualization
pca = PCA(n_components=2)
# Fit PCA to scaled data and transform
X_pca = pca.fit_transform(X_scaled)
# Check how much variance each component explains
print("Variance explained by each component:")
print(f"PC1: {pca.explained_variance_ratio_[0]:.3f} ({pca.explained_variance_ratio_[0]*100:.1f}%)")
print(f"PC2: {pca.explained_variance_ratio_[1]:.3f} ({pca.explained_variance_ratio_[1]*100:.1f}%)")
print(f"Total: {sum(pca.explained_variance_ratio_):.3f} ({sum(pca.explained_variance_ratio_)*100:.1f}%)")

Output:
Variance explained by each component:
PC1: 0.412 (41.2%)
PC2: 0.284 (28.4%)
Total: 0.696 (69.6%)
What just happened?
Our 2 components captured 69.6% of the original variance — a solid start. X_pca now contains 2D coordinates for each customer. Try this: aim for 80%+ variance explained; add more components if needed.
📊 Data Insight
We reduced 5 dimensions to 2 while keeping 69.6% of information. That's like compressing a 100MB file to 30MB with minimal quality loss.
What do these components actually represent? Component loadings tell us how much each original feature contributes.
# Create DataFrame to show component loadings
loadings_df = pd.DataFrame(
    pca.components_.T,  # transpose to get features as rows
    columns=['PC1', 'PC2'],
    index=numerical_features
)
print("Component loadings (how much each feature contributes):")
print(loadings_df.round(3))

Output:
Component loadings (how much each feature contributes):
                PC1     PC2
customer_age  0.234  -0.612
quantity      0.398   0.445
unit_price    0.542   0.123
revenue       0.544   0.098
rating       -0.485   0.621

What just happened?
PC1 is heavily influenced by unit_price and revenue (0.54+), but negatively by rating (-0.485). PC2 separates by age and rating. Try this: Name your components based on loadings — PC1 could be "Purchase Power".
Visualizing PCA Results
Time to see the payoff. We'll create a scatter plot of our 2D PCA space and color points by product category to spot customer segments.
# Create DataFrame with PCA results for easier plotting
pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
pca_df['product_category'] = df['product_category'].values
# Display first few transformed points
print("First 5 customers in PCA space:")
print(pca_df.head())
print(f"\nPCA space ranges:")
print(f"PC1: {pca_df['PC1'].min():.2f} to {pca_df['PC1'].max():.2f}")
print(f"PC2: {pca_df['PC2'].min():.2f} to {pca_df['PC2'].max():.2f}")

Output:
First 5 customers in PCA space:
        PC1       PC2  product_category
0 -0.845123  1.234567  Electronics
1  2.123456 -0.567890  Clothing
2  0.234567  1.890123  Food
3 -1.567890 -0.234567  Books
4  1.345678  0.789012  Home

PCA space ranges:
PC1: -3.45 to 4.12
PC2: -2.89 to 3.67

What just happened?
Each customer now has 2D coordinates in PCA space instead of 5D original space. PC1 and PC2 values are the new "super-features" combining all original features. Try this: Plot these coordinates to see customer clusters visually.
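The plot itself is a short matplotlib loop — one scatter call per category so each gets its own color and legend entry. A self-contained sketch (synthetic coordinates stand in for the real pca_df; the backend and file name are assumptions for running as a script):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs in scripts
import matplotlib.pyplot as plt

# Stand-in for the pca_df built above (synthetic coordinates and categories)
rng = np.random.default_rng(0)
pca_df = pd.DataFrame({
    'PC1': rng.normal(size=200),
    'PC2': rng.normal(size=200),
    'product_category': rng.choice(['Electronics', 'Books', 'Food'], size=200),
})

fig, ax = plt.subplots(figsize=(8, 6))
# One scatter call per category gives each its own color and legend entry
for category, group in pca_df.groupby('product_category'):
    ax.scatter(group['PC1'], group['PC2'], label=category, alpha=0.6, s=20)

ax.set_xlabel('PC1 (41.2% variance)')
ax.set_ylabel('PC2 (28.4% variance)')
ax.set_title('Customers in PCA space by product category')
ax.legend()
fig.savefig('pca_scatter.png', dpi=150)
```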
Electronics and Home customers cluster in high-PC1 space, indicating higher purchase power
Beautiful! The PCA plot reveals distinct customer segments. Electronics buyers cluster in the positive PC1 region (high purchase power), while Books customers sit in negative PC1 space (price-conscious). PC2 separates by age and satisfaction patterns.
This visualization would take your board meeting from confusion to "aha!" in seconds. You've compressed complex 5D customer behavior into an understandable 2D map that drives business decisions.
Choosing Optimal Components
How many components should you keep? The scree plot shows variance explained by each component. Look for the "elbow" where adding more components gives diminishing returns.
# Run PCA with all possible components to see variance breakdown
pca_full = PCA()
pca_full.fit(X_scaled)
# Calculate cumulative variance explained
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
print("Variance explained by number of components:")
for i in range(len(cumulative_variance)):
    print(f"{i+1} components: {cumulative_variance[i]:.3f} ({cumulative_variance[i]*100:.1f}%)")

Output:
Variance explained by number of components:
1 components: 0.412 (41.2%)
2 components: 0.696 (69.6%)
3 components: 0.847 (84.7%)
4 components: 0.946 (94.6%)
5 components: 1.000 (100.0%)
What just happened?
The jump from 2 to 3 components adds 15% more variance (84.7% total). The 3rd component might be worth including for analysis. cumulative_variance shows the running total. Try this: Use 80% as a common threshold for component selection.
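Drawing the scree plot is just a couple of matplotlib calls over explained_variance_ratio_. A self-contained sketch (synthetic 5-feature data stands in for the scaled matrix above; the file name is an assumption):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled 5-feature matrix used above
rng = np.random.default_rng(1)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(500, 5)))

pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_
components = np.arange(1, len(ratios) + 1)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(components, ratios, 'o-', label='Per component')
ax.plot(components, np.cumsum(ratios), 's--', label='Cumulative')
ax.axhline(0.8, color='gray', linestyle=':', label='80% threshold')
ax.set_xlabel('Component number')
ax.set_ylabel('Variance explained')
ax.set_title('Scree plot')
ax.legend()
fig.savefig('scree_plot.png', dpi=150)
```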
The "elbow" at PC3 suggests 3 components capture most useful variance (84.7%)
Pro Tip: For business presentations, stick with 2-3 components max. For machine learning preprocessing, aim for 80-90% variance explained regardless of component count.
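Handy shortcut: scikit-learn's PCA accepts a float n_components between 0 and 1 and keeps just enough components to reach that variance fraction, so you don't have to read the cumulative table yourself. A sketch on synthetic data (the 2-factor structure is made up to give the data something to compress):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature matrix driven by 2 latent factors plus noise
rng = np.random.default_rng(7)
base = rng.normal(size=(500, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])
X += rng.normal(scale=0.3, size=(500, 5))
X_scaled = StandardScaler().fit_transform(X)

# A float n_components means "keep enough components for this variance share"
pca = PCA(n_components=0.85)
X_reduced = pca.fit_transform(X_scaled)
print(f"Kept {pca.n_components_} components, "
      f"explaining {pca.explained_variance_ratio_.sum():.1%}")
```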
Inverse Transform and Interpretation
Want to understand what a point in PCA space means in original terms? Inverse transform converts PCA coordinates back to original feature space. This helps validate your PCA makes business sense.
# Take a point in PCA space and convert back to original features
sample_pca_point = np.array([[2.0, -1.0]]) # High PC1, low PC2
# Transform back to original scaled space
sample_original_scaled = pca.inverse_transform(sample_pca_point)
# Transform back to original scale using scaler
sample_original = scaler.inverse_transform(sample_original_scaled)
print("PCA point [2.0, -1.0] represents:")
feature_interpretation = dict(zip(numerical_features, sample_original[0]))
for feature, value in feature_interpretation.items():
    print(f"{feature}: {value:.2f}")

Output:
PCA point [2.0, -1.0] represents:
customer_age: 52.34
quantity: 7.23
unit_price: 5234.67
revenue: 37845.12
rating: 2.18
What just happened?
The point [2.0, -1.0] in PCA space represents an older customer (52) with high spending (₹37,845) but low satisfaction (2.18 rating). inverse_transform helps you interpret PCA results in business terms. Try this: Check if inverse transform results make logical sense.
Perfect! This validates our PC1 interpretation as "Purchase Power" — high PC1 correlates with higher spending and quantity. The negative PC2 shows older customers with lower ratings, matching our component loading analysis.
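One caveat worth demonstrating: with fewer components than features, the transform → inverse_transform round trip is lossy — only the variance captured by the kept components survives. A self-contained sketch on random data (the dimensions mirror our 5-feature setup):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random 5-feature data standing in for the real feature matrix
rng = np.random.default_rng(0)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(100, 5)))

pca = PCA(n_components=2).fit(X_scaled)

# Round trip: project to 2D, then reconstruct back to 5D
X_back = pca.inverse_transform(pca.transform(X_scaled))

# Reconstruction error equals the variance in the discarded components
err = np.mean((X_scaled - X_back) ** 2)
print(f"Mean squared reconstruction error: {err:.3f}")
```

Had we kept all 5 components, the error would be essentially zero; the gap is the price of compression.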
Quiz
1. Your ecommerce dataset has customer_age (18-65), revenue (₹500-200000), and rating (1-5). What's the essential first step before PCA?
2. You run PCA on 5 features and get explained_variance_ratio_ = [0.412, 0.284, 0.151, 0.099, 0.054]. If you keep 3 components, what happens?
3. A customer appears at coordinates [2.5, -1.2] in your 2D PCA space. How do you understand what this means in business terms?
Up Next
Domain Features
Transform raw ecommerce data into powerful business-specific features that boost model performance by leveraging domain expertise and customer behavior patterns.