Data Science
Multivariate Analysis
Discover hidden patterns by analyzing multiple variables simultaneously to uncover relationships that single-variable analysis completely misses.
The Multiple Variable Reality
Bivariate analysis showed you two-variable relationships. But business decisions rarely depend on just two factors. When Flipkart wants to predict customer lifetime value, they consider age, purchase history, city, product preferences, and seasonal behavior all together. That's where multivariate analysis becomes your power tool.
Think of it like understanding traffic patterns. Single variable: "Rush hour has more cars." Bivariate: "Rain increases traffic time." Multivariate: "Rush hour + rain + Friday + cricket match = complete gridlock." Each additional variable changes the story.
Univariate
Revenue distribution shows typical range: ₹500-₹50,000
Bivariate
Electronics buyers spend more: ₹15k average vs ₹8k clothing
Multivariate
Male, 35+, Mumbai, Electronics, 4+ rating: ₹28k average
The Gap
Skipping multivariate analysis means missing the segment-level insights that no single- or two-variable view can reveal
Correlation Matrix: The Relationship Map
Your first multivariate tool is the correlation matrix. This shows how every numeric variable relates to every other numeric variable. Honestly, this is underrated — most analysts skip straight to modeling without understanding their data relationships first.
# Import libraries for multivariate analysis
import pandas as pd
import numpy as np
import seaborn as sns
print("Libraries imported successfully")

Libraries imported successfully
What just happened?
We loaded the essential libraries: pandas for data handling, numpy for mathematical operations, and seaborn for advanced correlation visualization. Try this: Always import these three together for multivariate work.
The scenario: Myntra's analytics team needs to understand what drives high-value orders across their customer segments. The CMO wants insights within 2 hours for tomorrow's strategy meeting.
# Load and explore the dataset structure
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
print("\nNumeric columns for correlation:")
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(numeric_cols)

Dataset shape: (10000, 12)

Numeric columns for correlation:
['order_id', 'customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
What just happened?
We identified 6 numeric columns from our 10,000 orders. The select_dtypes(include=[np.number]) call automatically keeps only the numeric columns suitable for correlation analysis. Note that order_id is numeric but is just an identifier, so expect near-zero correlations from it. Try this: Always check data types before correlation — text columns will break the process.
Now we create the correlation matrix. This will show values from -1 to +1, where -1 means perfect negative correlation, +1 means perfect positive correlation, and 0 means no linear relationship.
# Calculate correlation matrix
correlation_matrix = df[numeric_cols].corr()
print("Correlation Matrix:")
print(correlation_matrix.round(3))
# Focus on revenue correlations (most business-relevant)
print("\nRevenue correlations (sorted by strength):")
revenue_corr = correlation_matrix['revenue'].abs().sort_values(ascending=False)
print(revenue_corr.round(3))

Correlation Matrix:
              order_id  customer_age  quantity  unit_price  revenue  rating
order_id         1.000        -0.003     0.001       0.001    0.002   0.001
customer_age    -0.003         1.000     0.156       0.234    0.312   0.189
quantity         0.001         0.156     1.000       0.089    0.687   0.143
unit_price       0.001         0.234     0.089       1.000    0.756   0.201
revenue          0.002         0.312     0.687       0.756    1.000   0.245
rating           0.001         0.189     0.143       0.201    0.245   1.000

Revenue correlations (sorted by strength):
revenue         1.000
unit_price      0.756
quantity        0.687
customer_age    0.312
rating          0.245
order_id        0.002

What just happened?
Revenue correlations reveal the story: unit_price (0.756) and quantity (0.687) are the strongest predictors. Customer age shows moderate correlation (0.312), meaning older customers tend to spend more. Try this: Focus on correlations above 0.3 for business decisions.
📊 Data Insight
Unit price explains about 57% of revenue variance (0.756² ≈ 0.57) and quantity about 47% (0.687² ≈ 0.47). Because the two are nearly uncorrelated with each other (0.089), together they account for most high-value order patterns, the foundation for targeted marketing campaigns.
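A correlation matrix is much easier to scan as a seaborn heatmap, the "advanced correlation visualization" mentioned above. The course CSV isn't bundled here, so this sketch simulates a small frame with the same quantity/unit_price/revenue relationship; the plotting call is the part to reuse on your own df.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated stand-in for the e-commerce data (revenue = quantity * unit_price)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "quantity": rng.integers(1, 10, 200),
    "unit_price": rng.uniform(500, 10000, 200),
})
demo["revenue"] = demo["quantity"] * demo["unit_price"]

corr = demo.corr()
# annot=True writes each coefficient into its cell; vmin/vmax pin the
# colour scale to the full -1..+1 correlation range
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap")
plt.tight_layout()
```

On the real dataset you would pass `df[numeric_cols].corr()` instead of the simulated `corr`.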
Principal Component Analysis (PCA)
When you have many variables, Principal Component Analysis reduces complexity without losing important information. Think of it as creating "super-variables" that capture the essence of multiple original variables.
The scenario: Zomato wants to segment customers but has 15+ variables per customer. The data science team needs to reduce this to 3-4 key dimensions for the clustering algorithm to work effectively.
# Prepare data for PCA (standardization is crucial)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Select relevant features for customer analysis
features = ['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
X = df[features].copy()
print("Original data statistics:")
print(X.describe().round(2))

Original data statistics:
       customer_age  quantity  unit_price   revenue    rating
count      10000.00  10000.00    10000.00  10000.00  10000.00
mean          41.52      3.45     3248.75  11183.94      3.50
std           13.86      2.84     2156.43  15647.82      1.13
min           18.00      1.00      500.00    500.00      1.00
25%           30.00      1.00     1248.75   2475.00      2.40
50%           42.00      3.00     2749.00   6847.50      3.50
75%           53.00      5.00     4748.75  17475.00      4.60
max           65.00     10.00     9995.00  99950.00      5.00

What just happened?
Notice the scale differences: revenue ranges from 500 to 99,950, while rating goes from 1 to 5. PCA needs standardized data or the high-scale variables dominate. Try this: Always check data scales before PCA.
# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Check explained variance ratio
print("Variance explained by each component:")
for i, variance in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {variance:.3f} ({variance*100:.1f}%)")
print(f"\nCumulative variance explained by first 3 components: {sum(pca.explained_variance_ratio_[:3]):.3f}")

Variance explained by each component:
PC1: 0.442 (44.2%)
PC2: 0.194 (19.4%)
PC3: 0.166 (16.6%)
PC4: 0.134 (13.4%)
PC5: 0.065 (6.5%)

Cumulative variance explained by first 3 components: 0.802
What just happened?
The first 3 components capture 80.2% of total variance — meaning we can reduce 5 variables to 3 while keeping 80% of information. PC1 (44.2%) is the most important dimension. Try this: Aim for 80%+ cumulative variance in business applications.
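You don't have to count components by hand: scikit-learn's PCA accepts a float between 0 and 1 as n_components and keeps just enough components to reach that variance share. A minimal sketch on simulated standardized data (the course CSV isn't available here, so two latent factors stand in for the real features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Five features driven by two latent factors plus noise (simulated stand-in)
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=1000),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=1000),
    rng.normal(size=1000),
])
X_scaled = StandardScaler().fit_transform(X)

# A float target makes PCA keep the smallest number of components
# whose cumulative explained variance reaches 80%
pca = PCA(n_components=0.80)
X_reduced = pca.fit_transform(X_scaled)
print("Components kept:", pca.n_components_)
print("Variance captured:", pca.explained_variance_ratio_.sum().round(3))
```

On the course data, PCA(n_components=0.80) would land on the same 3 components found above.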
Multivariate Visualization
Charts become powerful when they show multiple variables simultaneously. Here's how to visualize relationships that correlation numbers can't fully capture:
Three variables in one chart: Electronics customers show stronger age-revenue correlation with higher satisfaction ratings
This bubble chart reveals what correlation matrices miss: Electronics buyers show a clear age-revenue trend (older customers = higher spending), while Clothing purchases stay relatively flat across ages. The bubble sizes (ratings) show Electronics customers are also more satisfied with expensive purchases.
For business decisions: Target Electronics ads to 35+ demographics, but keep Clothing campaigns age-neutral. The rating correlation suggests premium Electronics positioning works, while Clothing needs volume-based strategies.
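The bubble chart described above can be sketched with a plain matplotlib scatter: x = age, y = revenue, colour = category, bubble size = rating. Column names mirror the course dataset, but the data here is simulated (with a deliberate age trend for Electronics) since the CSV isn't bundled.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Simulated stand-in: Electronics revenue rises with age, Clothing stays flat
rng = np.random.default_rng(7)
n = 300
demo = pd.DataFrame({
    "customer_age": rng.integers(18, 66, n),
    "product_category": rng.choice(["Electronics", "Clothing"], n),
    "rating": rng.uniform(1, 5, n).round(1),
})
demo["revenue"] = np.where(demo["product_category"] == "Electronics",
                           demo["customer_age"] * 400, 8000) + rng.normal(0, 1500, n)

fig, ax = plt.subplots(figsize=(8, 5))
for category, colour in [("Electronics", "tab:blue"), ("Clothing", "tab:orange")]:
    sub = demo[demo["product_category"] == category]
    ax.scatter(sub["customer_age"], sub["revenue"],
               s=sub["rating"] * 30,  # bubble size encodes satisfaction rating
               alpha=0.5, label=category, color=colour)
ax.set_xlabel("Customer age")
ax.set_ylabel("Revenue (₹)")
ax.legend(title="Category")
```

Three variables on the axes plus a fourth in the bubble size: exactly the pattern a correlation matrix flattens away.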
Cluster Analysis Basics
When variables work together, they create natural customer groups. Cluster analysis finds these hidden segments automatically by looking at multiple variables simultaneously.
# Simple customer segmentation using K-Means clustering
from sklearn.cluster import KMeans
# Use our PCA results for clustering (3 components capture 80% variance)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_pca[:, :3])
# Add cluster labels to original data
df_cluster = df.copy()
df_cluster['cluster'] = clusters
print("Cluster distribution:")
print(df_cluster['cluster'].value_counts().sort_index())

Cluster distribution:
0    2847
1    2456
2    2298
3    2399
Name: cluster, dtype: int64
# Analyze cluster characteristics
print("Cluster profiles (mean values):")
cluster_profiles = df_cluster.groupby('cluster')[features].mean().round(2)
print(cluster_profiles)
print("\nRevenue comparison across clusters:")
revenue_by_cluster = df_cluster.groupby('cluster')['revenue'].agg(['mean', 'median', 'count']).round(2)
print(revenue_by_cluster)

Cluster profiles (mean values):
         customer_age  quantity  unit_price   revenue  rating
cluster
0               29.45      2.12     1875.32   4187.23    2.85
1               45.78      5.89     4582.45  26918.74    4.12
2               38.92      3.28     2654.78   8896.45    3.48
3               52.34      2.45     4789.23  11734.89    3.65

Cluster profiles interpretation:
0 - Young Budget Shoppers (low age, quantity, price, revenue)
1 - Premium Family Buyers (high age, quantity, price, revenue, satisfaction)
2 - Average Customers (middle values across all dimensions)
3 - Mature Premium Buyers (older, fewer items, high unit price)

Revenue comparison across clusters:
             mean    median  count
cluster
0         4187.23   3450.00   2847
1        26918.74  24500.00   2456
2         8896.45   7200.00   2298
3        11734.89  10100.00   2399

What just happened?
Clustering revealed 4 distinct customer segments: Cluster 1 generates 6x more revenue than Cluster 0. The Premium Family Buyers (Cluster 1) combine high volume AND high price — your most valuable segment. Try this: Focus marketing spend on similar profiles to Cluster 1.
Premium Family Buyers drive 6.4x more revenue than budget shoppers — the power of multivariate segmentation
Common Mistake
Using too many clusters without business logic. Exact fix: Start with 3-5 clusters based on business strategy (budget/mid/premium), then validate with data. More clusters ≠ better insights.
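One quick way to validate a business-chosen cluster count is the silhouette score, which rewards tight, well-separated clusters. A sketch on simulated data (make_blobs stands in for the PCA scores used above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Simulated stand-in for X_pca: 600 points drawn around 4 centres
X_demo, _ = make_blobs(n_samples=600, centers=4, random_state=42)

# Score each candidate k; higher silhouette = tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

Use the score to confirm, not replace, the business-driven choice: if the data says 6 but operations can only act on 3 segments, 3 still wins.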
Interaction Effects and Feature Engineering
Sometimes variables are weak alone but powerful together. Interaction effects capture these combinations. Age alone might not predict spending, but age × category creates strong patterns.
# Create interaction features
df_interact = df.copy()
# Age groups for easier interpretation
df_interact['age_group'] = pd.cut(df['customer_age'],
                                  bins=[0, 30, 45, 100],
                                  labels=['Young', 'Middle', 'Senior'])
# Calculate interaction patterns
interaction_analysis = df_interact.groupby(['age_group', 'product_category'])['revenue'].agg(['mean', 'count']).round(2)
print("Revenue by Age Group × Product Category:")
print(interaction_analysis)

Revenue by Age Group × Product Category:
                               mean  count
age_group product_category
Young     Books                3245    142
          Clothing             6789    358
          Electronics         12456    234
          Food                 2134    287
          Home                 4567    189
Middle    Books                4123    198
          Clothing             8934    445
          Electronics         18923    312
          Food                 2876    334
          Home                 6789    298
Senior    Books                5678    156
          Clothing            11234    389
          Electronics         28445    267
          Food                 3456    298
          Home                 9876    234

What just happened?
The interaction reveals powerful patterns: Senior Electronics buyers spend ₹28,445 vs ₹12,456 for Young buyers, a 128% increase. But Food spending stays low across all ages. Try this: Look for categories where age creates 2x+ revenue differences.
📊 Data Insight
Electronics shows the strongest age interaction effect (128% revenue increase from Young to Senior), while Food remains flat across all age groups. This suggests premium Electronics positioning targets older demographics, but Food appeals equally to all ages.
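The stacked groupby above is easier to scan, and to feed into a heatmap, as an age-group × category grid via pivot_table. A sketch with simulated data mirroring the course columns (the CSV isn't bundled here):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the course orders
rng = np.random.default_rng(1)
n = 2000
demo = pd.DataFrame({
    "customer_age": rng.integers(18, 66, n),
    "product_category": rng.choice(
        ["Books", "Clothing", "Electronics", "Food", "Home"], n),
    "revenue": rng.uniform(500, 30000, n),
})
demo["age_group"] = pd.cut(demo["customer_age"], bins=[0, 30, 45, 100],
                           labels=["Young", "Middle", "Senior"])

# Rows = age groups, columns = categories, cells = mean revenue
pivot = demo.pivot_table(values="revenue", index="age_group",
                         columns="product_category", aggfunc="mean",
                         observed=False)
print(pivot.round(0))
```

The same pivot on the real df_interact would show the Electronics column climbing from Young to Senior while Food stays flat, and `sns.heatmap(pivot)` turns it into one glance.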
Practical Implementation Strategy
Here's how to implement multivariate analysis in your projects. This workflow covers the vast majority of cases; the projects that go wrong are usually the ones that skip the correlation step first.
Start with Correlation Matrix
Identify Strong Relationships (>0.3)
Apply PCA if 8+ Variables
Create Multivariate Visualizations
Test Business Hypotheses with Clusters
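The five steps above can be sketched end to end. Data is simulated here (revenue built from quantity × unit price), and the 0.3 correlation threshold, 80% variance target, and 4 clusters follow the text; treat this as a template, not the course's exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the course orders
rng = np.random.default_rng(3)
n = 1000
demo = pd.DataFrame({
    "customer_age": rng.integers(18, 66, n),
    "quantity": rng.integers(1, 11, n),
    "unit_price": rng.uniform(500, 10000, n),
    "rating": rng.uniform(1, 5, n),
})
demo["revenue"] = demo["quantity"] * demo["unit_price"]

# Steps 1-2: correlation matrix, keep predictors with |r| > 0.3 vs revenue
corr = demo.corr()["revenue"].drop("revenue")
strong = corr[corr.abs() > 0.3].index.tolist()
print("Strong predictors:", strong)

# Step 3: PCA (fewer than 8 variables here, shown for completeness)
X_scaled = StandardScaler().fit_transform(demo)
X_pca = PCA(n_components=3).fit_transform(X_scaled)

# Step 5: cluster on the reduced space and profile the segments
demo["cluster"] = KMeans(n_clusters=4, random_state=42,
                         n_init=10).fit_predict(X_pca)
print(demo.groupby("cluster")["revenue"].mean().round(0))
```

Step 4 (visualization) slots in between PCA and clustering; the bubble-chart pattern from earlier works directly on the cluster labels.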
Pro tip: Always validate multivariate insights with business stakeholders before building models. A statistically significant pattern might not be operationally useful — but when it aligns with business intuition, you've found gold.
Quiz
1. You're analyzing Paytm's transaction data with correlations: amount_spent vs transaction_frequency (0.234), amount_spent vs user_age (0.312), amount_spent vs app_session_time (0.687), amount_spent vs merchant_category_preference (0.156). Which insight should drive your marketing strategy?
2. OYO's data science team wants to reduce 12 customer variables (age, booking_frequency, average_stay_duration, revenue_per_booking, rating, cancellation_rate, etc.) to 3-4 key dimensions for clustering. What's the correct first step?
3. Myntra's analysis shows these interaction patterns: Young Electronics buyers (₹12,456 average), Senior Electronics buyers (₹28,445 average), Young Food buyers (₹2,134 average), Senior Food buyers (₹3,456 average). What's the key business insight?
Up Next
Summary Statistics
Master the essential statistical measures that transform raw multivariate insights into clear, actionable business metrics and KPIs.