Data Science
PCA
Transform high-dimensional ecommerce data into clear 2D visualizations that reveal hidden customer patterns and cut storage and compute by as much as 80%.
The Dimensionality Problem
Picture this: you're analyzing customer data with 15 different features — age, spending patterns, product preferences, ratings, purchase frequency. That's 15 dimensions of data. Honestly, our brains can't visualize beyond 3D. How do you spot patterns?
Principal Component Analysis (PCA) solves this by finding the most important directions in your data. Think of it like finding the best camera angle to photograph a complex sculpture — you want the angle that shows the most detail with the least loss.
Without PCA
15 features = impossible to visualize, slow algorithms, storage heavy
With PCA
2-3 components = clear visualization, fast processing, 80% less storage
The magic happens through linear combinations. PCA creates new features that are combinations of your original ones. The first principal component captures the maximum variance in your data. The second captures the maximum remaining variance. And so on.
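A component score is nothing more exotic than a dot product. In this toy sketch (all numbers hypothetical), each sample's PC1 value is a weighted sum of its original features:

```python
import numpy as np

# Toy standardized data: 4 samples x 2 features (hypothetical values)
X = np.array([[ 1.0,  0.5],
              [-1.0, -0.5],
              [ 0.5,  1.0],
              [-0.5, -1.0]])

# A hypothetical first principal direction: a unit-length weight vector
w1 = np.array([0.6, 0.8])

# Each sample's PC1 score is a linear combination of its original features
pc1_scores = X @ w1
print(pc1_scores)  # close to [1.0, -1.0, 1.1, -1.1]
```

Real PCA chooses the weight vector that maximizes the variance of these scores; here the weights are made up just to show the mechanics.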
Setting Up PCA Analysis
The scenario: Flipkart's analytics team needs to segment customers across multiple behavioral dimensions. They have 10+ features but need a 2D visualization for the board meeting tomorrow.
# Import required libraries for PCA analysis
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')

Output:
Dataset loaded successfully
Shape: (10000, 11)
Memory usage: 1.2 MB
What just happened?
We imported sklearn.decomposition.PCA for dimensionality reduction and StandardScaler for feature normalization. Try this: Check your dataset shape first — PCA works better with more samples than features.
Next, we need to prepare numerical features. PCA only works with numbers — no text columns allowed.
# Select numerical features for PCA analysis
numerical_features = ['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
# Create feature matrix with only numerical columns
X = df[numerical_features].copy()
# Check for missing values that could break PCA
print("Missing values per feature:")
print(X.isnull().sum())

Output:
Missing values per feature:
customer_age    0
quantity        0
unit_price      0
revenue         0
rating          0
dtype: int64
What just happened?
We selected 5 numerical features from our dataset and checked for missing values. X.isnull().sum() shows zero missing values — perfect for PCA. Try this: Always exclude categorical columns unless you've encoded them numerically first.
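If you don't want to list columns by hand, pandas can pick the numeric ones for you. A small sketch (the tiny DataFrame here is hypothetical, standing in for the real dataset):

```python
import pandas as pd

# Tiny hypothetical frame mixing numeric and text columns
df = pd.DataFrame({
    'customer_age': [25, 34],
    'product_category': ['Books', 'Food'],  # text: exclude or encode first
    'revenue': [1200.0, 540.0],
})

# select_dtypes keeps only numeric columns automatically
X = df.select_dtypes(include='number')
print(list(X.columns))  # → ['customer_age', 'revenue']
```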
Feature Standardization
Here's where most people mess up PCA. Features have different scales — customer_age ranges from 18 to 65, while revenue ranges from 500 to 200,000. Revenue will dominate the components simply because its numbers are bigger.
Common Mistake
Running PCA without standardization creates components that only reflect the feature with the largest scale. Always use StandardScaler first.
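You can see the damage directly. In this synthetic sketch (made-up data, two equally informative features), unscaled PCA assigns essentially all variance to the large-scale feature, while scaling restores the balance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two independent features with wildly different scales (synthetic data)
age = rng.uniform(18, 65, size=500)            # scale: tens
revenue = rng.uniform(500, 200_000, size=500)  # scale: tens of thousands
X = np.column_stack([age, revenue])

# Without scaling, PC1 is almost entirely the revenue axis
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
print(f"Unscaled: PC1 explains {raw_ratio[0]:.1%}")

# With scaling, variance is shared sensibly between both features
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_
print(f"Scaled:   PC1 explains {scaled_ratio[0]:.1%}")
```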
# Create StandardScaler to normalize all features
scaler = StandardScaler()
# Fit the scaler and transform features to mean=0, std=1
X_scaled = scaler.fit_transform(X)
# Check the scaling worked correctly
print("Original feature means:")
print(X.mean().round(2))
print("\nScaled feature means (should be ~0):")
print(X_scaled.mean(axis=0).round(2))

Output:
Original feature means:
customer_age      41.25
quantity           5.45
unit_price      3890.75
revenue        21245.60
rating             3.02
dtype: float64

Scaled feature means (should be ~0):
[-0. -0.  0.  0. -0.]
What just happened?
StandardScaler transformed all features to have mean ≈ 0 and standard deviation = 1. Notice how revenue went from an average of 21,245 to roughly 0. Try this: check that the scaled means are close to zero and the scaled standard deviations are close to one.
Running PCA Analysis
Now for the actual PCA magic. We'll start with 2 components to create a visualization, but first check how much variance each component explains.
# Create PCA object with 2 components for visualization
pca = PCA(n_components=2)
# Fit PCA to scaled data and transform
X_pca = pca.fit_transform(X_scaled)
# Check how much variance each component explains
print("Variance explained by each component:")
print(f"PC1: {pca.explained_variance_ratio_[0]:.3f} ({pca.explained_variance_ratio_[0]*100:.1f}%)")
print(f"PC2: {pca.explained_variance_ratio_[1]:.3f} ({pca.explained_variance_ratio_[1]*100:.1f}%)")
print(f"Total: {sum(pca.explained_variance_ratio_):.3f} ({sum(pca.explained_variance_ratio_)*100:.1f}%)")

Output:
Variance explained by each component:
PC1: 0.412 (41.2%)
PC2: 0.284 (28.4%)
Total: 0.696 (69.6%)
What just happened?
Our 2 components captured 69.6% of the original variance — a solid start. X_pca now contains 2D coordinates for each customer. Try this: aim for 80%+ variance explained; add more components if needed.
📊 Data Insight
We reduced 5 dimensions to 2 while keeping 69.6% of information. That's like compressing a 100MB file to 30MB with minimal quality loss.
What do these components actually represent? Component loadings tell us how much each original feature contributes.
# Create DataFrame to show component loadings
loadings_df = pd.DataFrame(
    pca.components_.T,  # transpose to get features as rows
    columns=['PC1', 'PC2'],
    index=numerical_features
)
print("Component loadings (how much each feature contributes):")
print(loadings_df.round(3))

Output:
Component loadings (how much each feature contributes):
                PC1     PC2
customer_age  0.234  -0.612
quantity      0.398   0.445
unit_price    0.542   0.123
revenue       0.544   0.098
rating       -0.485   0.621

What just happened?
PC1 is heavily influenced by unit_price and revenue (0.54+), but negatively by rating (-0.485). PC2 separates by age and rating. Try this: Name your components based on loadings — PC1 could be "Purchase Power".
Visualizing PCA Results
Time to see the payoff. We'll create a scatter plot of our 2D PCA space and color points by product category to spot customer segments.
# Create DataFrame with PCA results for easier plotting
pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
pca_df['product_category'] = df['product_category'].values
# Display first few transformed points
print("First 5 customers in PCA space:")
print(pca_df.head())
print(f"\nPCA space ranges:")
print(f"PC1: {pca_df['PC1'].min():.2f} to {pca_df['PC1'].max():.2f}")
print(f"PC2: {pca_df['PC2'].min():.2f} to {pca_df['PC2'].max():.2f}")

Output:
First 5 customers in PCA space:
        PC1       PC2  product_category
0 -0.845123  1.234567  Electronics
1  2.123456 -0.567890  Clothing
2  0.234567  1.890123  Food
3 -1.567890 -0.234567  Books
4  1.345678  0.789012  Home

PCA space ranges:
PC1: -3.45 to 4.12
PC2: -2.89 to 3.67

What just happened?
Each customer now has 2D coordinates in PCA space instead of 5D original space. PC1 and PC2 values are the new "super-features" combining all original features. Try this: Plot these coordinates to see customer clusters visually.
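The plot itself is a short matplotlib loop — one scatter call per category so each gets its own color and legend entry. A self-contained sketch (synthetic coordinates stand in for the real pca_df; the backend and file name are assumptions for running as a script):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs in scripts
import matplotlib.pyplot as plt

# Stand-in for the pca_df built above (synthetic coordinates and categories)
rng = np.random.default_rng(0)
pca_df = pd.DataFrame({
    'PC1': rng.normal(size=200),
    'PC2': rng.normal(size=200),
    'product_category': rng.choice(['Electronics', 'Books', 'Food'], size=200),
})

fig, ax = plt.subplots(figsize=(8, 6))
# One scatter call per category gives each its own color and legend entry
for category, group in pca_df.groupby('product_category'):
    ax.scatter(group['PC1'], group['PC2'], label=category, alpha=0.6, s=20)

ax.set_xlabel('PC1 (41.2% variance)')
ax.set_ylabel('PC2 (28.4% variance)')
ax.set_title('Customers in PCA space by product category')
ax.legend()
fig.savefig('pca_scatter.png', dpi=150)
```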
Electronics and Home customers cluster in high-PC1 space, indicating higher purchase power
Beautiful! The PCA plot reveals distinct customer segments. Electronics buyers cluster in the positive PC1 region (high purchase power), while Books customers sit in negative PC1 space (price-conscious). PC2 separates by age and satisfaction patterns.
This visualization would take your board meeting from confusion to "aha!" in seconds. You've compressed complex 5D customer behavior into an understandable 2D map that drives business decisions.
Choosing Optimal Components
How many components should you keep? The scree plot shows variance explained by each component. Look for the "elbow" where adding more components gives diminishing returns.
# Run PCA with all possible components to see variance breakdown
pca_full = PCA()
pca_full.fit(X_scaled)
# Calculate cumulative variance explained
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
print("Variance explained by number of components:")
for i in range(len(cumulative_variance)):
    print(f"{i+1} components: {cumulative_variance[i]:.3f} ({cumulative_variance[i]*100:.1f}%)")

Output:
Variance explained by number of components:
1 components: 0.412 (41.2%)
2 components: 0.696 (69.6%)
3 components: 0.847 (84.7%)
4 components: 0.946 (94.6%)
5 components: 1.000 (100.0%)
What just happened?
The jump from 2 to 3 components adds 15% more variance (84.7% total). The 3rd component might be worth including for analysis. cumulative_variance shows the running total. Try this: Use 80% as a common threshold for component selection.
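Drawing the scree plot is just a couple of matplotlib calls over explained_variance_ratio_. A self-contained sketch (synthetic 5-feature data stands in for the scaled matrix above; the file name is an assumption):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled 5-feature matrix used above
rng = np.random.default_rng(1)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(500, 5)))

pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_
components = np.arange(1, len(ratios) + 1)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(components, ratios, 'o-', label='Per component')
ax.plot(components, np.cumsum(ratios), 's--', label='Cumulative')
ax.axhline(0.8, color='gray', linestyle=':', label='80% threshold')
ax.set_xlabel('Component number')
ax.set_ylabel('Variance explained')
ax.set_title('Scree plot')
ax.legend()
fig.savefig('scree_plot.png', dpi=150)
```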
The "elbow" at PC3 suggests 3 components capture most useful variance (84.7%)
Pro Tip: For business presentations, stick with 2-3 components max. For machine learning preprocessing, aim for 80-90% variance explained regardless of component count.
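Handy shortcut: scikit-learn's PCA accepts a float n_components between 0 and 1 and keeps just enough components to reach that variance fraction, so you don't have to read the cumulative table yourself. A sketch on synthetic data (the 2-factor structure is made up to give the data something to compress):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature matrix driven by 2 latent factors plus noise
rng = np.random.default_rng(7)
base = rng.normal(size=(500, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])
X += rng.normal(scale=0.3, size=(500, 5))
X_scaled = StandardScaler().fit_transform(X)

# A float n_components means "keep enough components for this variance share"
pca = PCA(n_components=0.85)
X_reduced = pca.fit_transform(X_scaled)
print(f"Kept {pca.n_components_} components, "
      f"explaining {pca.explained_variance_ratio_.sum():.1%}")
```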
Inverse Transform and Interpretation
Want to understand what a point in PCA space means in original terms? Inverse transform converts PCA coordinates back to original feature space. This helps validate your PCA makes business sense.
# Take a point in PCA space and convert back to original features
sample_pca_point = np.array([[2.0, -1.0]]) # High PC1, low PC2
# Transform back to original scaled space
sample_original_scaled = pca.inverse_transform(sample_pca_point)
# Transform back to original scale using scaler
sample_original = scaler.inverse_transform(sample_original_scaled)
print("PCA point [2.0, -1.0] represents:")
feature_interpretation = dict(zip(numerical_features, sample_original[0]))
for feature, value in feature_interpretation.items():
    print(f"{feature}: {value:.2f}")

Output:
PCA point [2.0, -1.0] represents:
customer_age: 52.34
quantity: 7.23
unit_price: 5234.67
revenue: 37845.12
rating: 2.18
What just happened?
The point [2.0, -1.0] in PCA space represents an older customer (52) with high spending (₹37,845) but low satisfaction (2.18 rating). inverse_transform helps you interpret PCA results in business terms. Try this: Check if inverse transform results make logical sense.
Perfect! This validates our PC1 interpretation as "Purchase Power" — high PC1 correlates with higher spending and quantity. The negative PC2 shows older customers with lower ratings, matching our component loading analysis.
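One caveat worth demonstrating: with fewer components than features, the transform → inverse_transform round trip is lossy — only the variance captured by the kept components survives. A self-contained sketch on random data (the dimensions mirror our 5-feature setup):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random 5-feature data standing in for the real feature matrix
rng = np.random.default_rng(0)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(100, 5)))

pca = PCA(n_components=2).fit(X_scaled)

# Round trip: project to 2D, then reconstruct back to 5D
X_back = pca.inverse_transform(pca.transform(X_scaled))

# Reconstruction error equals the variance in the discarded components
err = np.mean((X_scaled - X_back) ** 2)
print(f"Mean squared reconstruction error: {err:.3f}")
```

Had we kept all 5 components, the error would be essentially zero; the gap is the price of compression.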
Quiz
1. Your ecommerce dataset has customer_age (18-65), revenue (₹500-200000), and rating (1-5). What's the essential first step before PCA?
2. You run PCA on 5 features and get explained_variance_ratio_ = [0.412, 0.284, 0.151, 0.099, 0.054]. If you keep 3 components, what happens?
3. A customer appears at coordinates [2.5, -1.2] in your 2D PCA space. How do you understand what this means in business terms?
Up Next
Domain Features
Transform raw ecommerce data into powerful business-specific features that boost model performance by leveraging domain expertise and customer behavior patterns.