Data Science
Multivariate Analysis
Discover hidden patterns by analyzing multiple variables simultaneously to uncover relationships that single-variable analysis completely misses.
The Multiple Variable Reality
Bivariate analysis showed you two-variable relationships. But business decisions rarely depend on just two factors. When Flipkart wants to predict customer lifetime value, they consider age, purchase history, city, product preferences, and seasonal behavior all together. That's where multivariate analysis becomes your power tool.
Think of it like understanding traffic patterns. Single variable: "Rush hour has more cars." Bivariate: "Rain increases traffic time." Multivariate: "Rush hour + rain + Friday + cricket match = complete gridlock." Each additional variable changes the story.
Univariate
Revenue distribution shows typical range: ₹500-₹50,000
Bivariate
Electronics buyers spend more: ₹15k average vs ₹8k clothing
Multivariate
Male, 35+, Mumbai, Electronics, 4+ rating: ₹28k average
The Gap
Skipping multivariate analysis means missing the segment-level insights that no single- or two-variable view can reveal
Correlation Matrix: The Relationship Map
Your first multivariate tool is the correlation matrix. This shows how every numeric variable relates to every other numeric variable. Honestly, this is underrated — most analysts skip straight to modeling without understanding their data relationships first.
# Import libraries for multivariate analysis
import pandas as pd
import numpy as np
import seaborn as sns
print("Libraries imported successfully")

Libraries imported successfully
What just happened?
We loaded the essential libraries: pandas for data handling, numpy for mathematical operations, and seaborn for advanced correlation visualization. Try this: Always import these three together for multivariate work.
The scenario: Myntra's analytics team needs to understand what drives high-value orders across their customer segments. The CMO wants insights within 2 hours for tomorrow's strategy meeting.
# Load and explore the dataset structure
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
print("\nNumeric columns for correlation:")
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(numeric_cols)

Dataset shape: (10000, 12)

Numeric columns for correlation:
['order_id', 'customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
What just happened?
We identified 6 numeric columns from our 10,000 orders. The select_dtypes(include=[np.number]) call automatically keeps only the numeric columns suitable for correlation analysis. Note that order_id is numeric but is just an identifier, so expect near-zero correlations from it. Try this: Always check data types before correlation — text columns will break the process.
Now we create the correlation matrix. This will show values from -1 to +1, where -1 means perfect negative correlation, +1 means perfect positive correlation, and 0 means no linear relationship.
# Calculate correlation matrix
correlation_matrix = df[numeric_cols].corr()
print("Correlation Matrix:")
print(correlation_matrix.round(3))
# Focus on revenue correlations (most business-relevant)
print("\nRevenue correlations (sorted by strength):")
revenue_corr = correlation_matrix['revenue'].abs().sort_values(ascending=False)
print(revenue_corr.round(3))

Correlation Matrix:
              order_id  customer_age  quantity  unit_price  revenue  rating
order_id         1.000        -0.003     0.001       0.001    0.002   0.001
customer_age    -0.003         1.000     0.156       0.234    0.312   0.189
quantity         0.001         0.156     1.000       0.089    0.687   0.143
unit_price       0.001         0.234     0.089       1.000    0.756   0.201
revenue          0.002         0.312     0.687       0.756    1.000   0.245
rating           0.001         0.189     0.143       0.201    0.245   1.000

Revenue correlations (sorted by strength):
revenue         1.000
unit_price      0.756
quantity        0.687
customer_age    0.312
rating          0.245
order_id        0.002

What just happened?
Revenue correlations reveal the story: unit_price (0.756) and quantity (0.687) are the strongest predictors. Customer age shows moderate correlation (0.312), meaning older customers tend to spend more. Try this: Focus on correlations above 0.3 for business decisions.
📊 Data Insight
Unit price explains about 57% of revenue variance (0.756² ≈ 0.57) and quantity about 47% (0.687² ≈ 0.47). Because the two are nearly uncorrelated with each other (0.089), together they account for most high-value order patterns, the foundation for targeted marketing campaigns.
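A correlation matrix is much easier to scan as a seaborn heatmap, the "advanced correlation visualization" mentioned above. The course CSV isn't bundled here, so this sketch simulates a small frame with the same quantity/unit_price/revenue relationship; the plotting call is the part to reuse on your own df.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated stand-in for the e-commerce data (revenue = quantity * unit_price)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "quantity": rng.integers(1, 10, 200),
    "unit_price": rng.uniform(500, 10000, 200),
})
demo["revenue"] = demo["quantity"] * demo["unit_price"]

corr = demo.corr()
# annot=True writes each coefficient into its cell; vmin/vmax pin the
# colour scale to the full -1..+1 correlation range
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap")
plt.tight_layout()
```

On the real dataset you would pass `df[numeric_cols].corr()` instead of the simulated `corr`.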
Principal Component Analysis (PCA)
When you have many variables, Principal Component Analysis reduces complexity without losing important information. Think of it as creating "super-variables" that capture the essence of multiple original variables.
The scenario: Zomato wants to segment customers but has 15+ variables per customer. The data science team needs to reduce this to 3-4 key dimensions for the clustering algorithm to work effectively.
# Prepare data for PCA (standardization is crucial)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Select relevant features for customer analysis
features = ['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
X = df[features].copy()
print("Original data statistics:")
print(X.describe().round(2))

Original data statistics:
       customer_age  quantity  unit_price   revenue    rating
count      10000.00  10000.00    10000.00  10000.00  10000.00
mean          41.52      3.45     3248.75  11183.94      3.50
std           13.86      2.84     2156.43  15647.82      1.13
min           18.00      1.00      500.00    500.00      1.00
25%           30.00      1.00     1248.75   2475.00      2.40
50%           42.00      3.00     2749.00   6847.50      3.50
75%           53.00      5.00     4748.75  17475.00      4.60
max           65.00     10.00     9995.00  99950.00      5.00

What just happened?
Notice the scale differences: revenue ranges from 500 to 99,950, while rating goes from 1 to 5. PCA needs standardized data or the high-scale variables dominate. Try this: Always check data scales before PCA.
# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Check explained variance ratio
print("Variance explained by each component:")
for i, variance in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {variance:.3f} ({variance*100:.1f}%)")
print(f"\nCumulative variance explained by first 3 components: {sum(pca.explained_variance_ratio_[:3]):.3f}")

Variance explained by each component:
PC1: 0.442 (44.2%)
PC2: 0.194 (19.4%)
PC3: 0.166 (16.6%)
PC4: 0.134 (13.4%)
PC5: 0.065 (6.5%)

Cumulative variance explained by first 3 components: 0.802
What just happened?
The first 3 components capture 80.2% of total variance — meaning we can reduce 5 variables to 3 while keeping 80% of information. PC1 (44.2%) is the most important dimension. Try this: Aim for 80%+ cumulative variance in business applications.
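You don't have to count components by hand: scikit-learn's PCA accepts a float between 0 and 1 as n_components and keeps just enough components to reach that variance share. A minimal sketch on simulated standardized data (the course CSV isn't available here, so two latent factors stand in for the real features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Five features driven by two latent factors plus noise (simulated stand-in)
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=1000),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=1000),
    rng.normal(size=1000),
])
X_scaled = StandardScaler().fit_transform(X)

# A float target makes PCA keep the smallest number of components
# whose cumulative explained variance reaches 80%
pca = PCA(n_components=0.80)
X_reduced = pca.fit_transform(X_scaled)
print("Components kept:", pca.n_components_)
print("Variance captured:", pca.explained_variance_ratio_.sum().round(3))
```

On the course data, PCA(n_components=0.80) would land on the same 3 components found above.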
Multivariate Visualization
Charts become powerful when they show multiple variables simultaneously. Here's how to visualize relationships that correlation numbers can't fully capture:
Three variables in one chart: Electronics customers show stronger age-revenue correlation with higher satisfaction ratings
This bubble chart reveals what correlation matrices miss: Electronics buyers show a clear age-revenue trend (older customers = higher spending), while Clothing purchases stay relatively flat across ages. The bubble sizes (ratings) show Electronics customers are also more satisfied with expensive purchases.
For business decisions: Target Electronics ads to 35+ demographics, but keep Clothing campaigns age-neutral. The rating correlation suggests premium Electronics positioning works, while Clothing needs volume-based strategies.
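The bubble chart described above can be sketched with a plain matplotlib scatter: x = age, y = revenue, colour = category, bubble size = rating. Column names mirror the course dataset, but the data here is simulated (with a deliberate age trend for Electronics) since the CSV isn't bundled.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Simulated stand-in: Electronics revenue rises with age, Clothing stays flat
rng = np.random.default_rng(7)
n = 300
demo = pd.DataFrame({
    "customer_age": rng.integers(18, 66, n),
    "product_category": rng.choice(["Electronics", "Clothing"], n),
    "rating": rng.uniform(1, 5, n).round(1),
})
demo["revenue"] = np.where(demo["product_category"] == "Electronics",
                           demo["customer_age"] * 400, 8000) + rng.normal(0, 1500, n)

fig, ax = plt.subplots(figsize=(8, 5))
for category, colour in [("Electronics", "tab:blue"), ("Clothing", "tab:orange")]:
    sub = demo[demo["product_category"] == category]
    ax.scatter(sub["customer_age"], sub["revenue"],
               s=sub["rating"] * 30,  # bubble size encodes satisfaction rating
               alpha=0.5, label=category, color=colour)
ax.set_xlabel("Customer age")
ax.set_ylabel("Revenue (₹)")
ax.legend(title="Category")
```

Three variables on the axes plus a fourth in the bubble size: exactly the pattern a correlation matrix flattens away.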
Cluster Analysis Basics
When variables work together, they create natural customer groups. Cluster analysis finds these hidden segments automatically by looking at multiple variables simultaneously.
# Simple customer segmentation using K-Means clustering
from sklearn.cluster import KMeans
# Use our PCA results for clustering (3 components capture 80% variance)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_pca[:, :3])
# Add cluster labels to original data
df_cluster = df.copy()
df_cluster['cluster'] = clusters
print("Cluster distribution:")
print(df_cluster['cluster'].value_counts().sort_index())

Cluster distribution:
0    2847
1    2456
2    2298
3    2399
Name: cluster, dtype: int64
# Analyze cluster characteristics
print("Cluster profiles (mean values):")
cluster_profiles = df_cluster.groupby('cluster')[features].mean().round(2)
print(cluster_profiles)
print("\nRevenue comparison across clusters:")
revenue_by_cluster = df_cluster.groupby('cluster')['revenue'].agg(['mean', 'median', 'count']).round(2)
print(revenue_by_cluster)

Cluster profiles (mean values):
         customer_age  quantity  unit_price   revenue  rating
cluster
0               29.45      2.12     1875.32   4187.23    2.85
1               45.78      5.89     4582.45  26918.74    4.12
2               38.92      3.28     2654.78   8896.45    3.48
3               52.34      2.45     4789.23  11734.89    3.65

Cluster profiles interpretation:
0 - Young Budget Shoppers (low age, quantity, price, revenue)
1 - Premium Family Buyers (high age, quantity, price, revenue, satisfaction)
2 - Average Customers (middle values across all dimensions)
3 - Mature Premium Buyers (older, fewer items, high unit price)

Revenue comparison across clusters:
             mean    median  count
cluster
0         4187.23   3450.00   2847
1        26918.74  24500.00   2456
2         8896.45   7200.00   2298
3        11734.89  10100.00   2399

What just happened?
Clustering revealed 4 distinct customer segments: Cluster 1 generates 6x more revenue than Cluster 0. The Premium Family Buyers (Cluster 1) combine high volume AND high price — your most valuable segment. Try this: Focus marketing spend on similar profiles to Cluster 1.
Premium Family Buyers drive 6.4x more revenue than budget shoppers — the power of multivariate segmentation
Common Mistake
Using too many clusters without business logic. Exact fix: Start with 3-5 clusters based on business strategy (budget/mid/premium), then validate with data. More clusters ≠ better insights.
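One quick way to validate a business-chosen cluster count is the silhouette score, which rewards tight, well-separated clusters. A sketch on simulated data (make_blobs stands in for the PCA scores used above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Simulated stand-in for X_pca: 600 points drawn around 4 centres
X_demo, _ = make_blobs(n_samples=600, centers=4, random_state=42)

# Score each candidate k; higher silhouette = tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

Use the score to confirm, not replace, the business-driven choice: if the data says 6 but operations can only act on 3 segments, 3 still wins.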
Interaction Effects and Feature Engineering
Sometimes variables are weak alone but powerful together. Interaction effects capture these combinations. Age alone might not predict spending, but age × category creates strong patterns.
# Create interaction features
df_interact = df.copy()
# Age groups for easier interpretation
df_interact['age_group'] = pd.cut(df['customer_age'],
                                  bins=[0, 30, 45, 100],
                                  labels=['Young', 'Middle', 'Senior'])
# Calculate interaction patterns
interaction_analysis = df_interact.groupby(['age_group', 'product_category'])['revenue'].agg(['mean', 'count']).round(2)
print("Revenue by Age Group × Product Category:")
print(interaction_analysis)

Revenue by Age Group × Product Category:
                               mean  count
age_group product_category
Young     Books                3245    142
          Clothing             6789    358
          Electronics         12456    234
          Food                 2134    287
          Home                 4567    189
Middle    Books                4123    198
          Clothing             8934    445
          Electronics         18923    312
          Food                 2876    334
          Home                 6789    298
Senior    Books                5678    156
          Clothing            11234    389
          Electronics         28445    267
          Food                 3456    298
          Home                 9876    234

What just happened?
The interaction reveals powerful patterns: Senior Electronics buyers spend ₹28,445 vs ₹12,456 for Young buyers, a 128% increase. But Food spending stays low across all ages. Try this: Look for categories where age creates 2x+ revenue differences.
📊 Data Insight
Electronics shows the strongest age interaction effect (128% revenue increase from Young to Senior), while Food remains flat across all age groups. This suggests premium Electronics positioning targets older demographics, but Food appeals equally to all ages.
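The stacked groupby above is easier to scan, and to feed into a heatmap, as an age-group × category grid via pivot_table. A sketch with simulated data mirroring the course columns (the CSV isn't bundled here):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the course orders
rng = np.random.default_rng(1)
n = 2000
demo = pd.DataFrame({
    "customer_age": rng.integers(18, 66, n),
    "product_category": rng.choice(
        ["Books", "Clothing", "Electronics", "Food", "Home"], n),
    "revenue": rng.uniform(500, 30000, n),
})
demo["age_group"] = pd.cut(demo["customer_age"], bins=[0, 30, 45, 100],
                           labels=["Young", "Middle", "Senior"])

# Rows = age groups, columns = categories, cells = mean revenue
pivot = demo.pivot_table(values="revenue", index="age_group",
                         columns="product_category", aggfunc="mean",
                         observed=False)
print(pivot.round(0))
```

The same pivot on the real df_interact would show the Electronics column climbing from Young to Senior while Food stays flat, and `sns.heatmap(pivot)` turns it into one glance.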
Practical Implementation Strategy
Here's how to implement multivariate analysis in your projects. This workflow covers the vast majority of cases; the projects that go wrong are usually the ones that skip the correlation step first.
Start with Correlation Matrix
Identify Strong Relationships (>0.3)
Apply PCA if 8+ Variables
Create Multivariate Visualizations
Test Business Hypotheses with Clusters
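The five steps above can be sketched end to end. Data is simulated here (revenue built from quantity × unit price), and the 0.3 correlation threshold, 80% variance target, and 4 clusters follow the text; treat this as a template, not the course's exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the course orders
rng = np.random.default_rng(3)
n = 1000
demo = pd.DataFrame({
    "customer_age": rng.integers(18, 66, n),
    "quantity": rng.integers(1, 11, n),
    "unit_price": rng.uniform(500, 10000, n),
    "rating": rng.uniform(1, 5, n),
})
demo["revenue"] = demo["quantity"] * demo["unit_price"]

# Steps 1-2: correlation matrix, keep predictors with |r| > 0.3 vs revenue
corr = demo.corr()["revenue"].drop("revenue")
strong = corr[corr.abs() > 0.3].index.tolist()
print("Strong predictors:", strong)

# Step 3: PCA (fewer than 8 variables here, shown for completeness)
X_scaled = StandardScaler().fit_transform(demo)
X_pca = PCA(n_components=3).fit_transform(X_scaled)

# Step 5: cluster on the reduced space and profile the segments
demo["cluster"] = KMeans(n_clusters=4, random_state=42,
                         n_init=10).fit_predict(X_pca)
print(demo.groupby("cluster")["revenue"].mean().round(0))
```

Step 4 (visualization) slots in between PCA and clustering; the bubble-chart pattern from earlier works directly on the cluster labels.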
Pro tip: Always validate multivariate insights with business stakeholders before building models. A statistically significant pattern might not be operationally useful — but when it aligns with business intuition, you've found gold.
Quiz
1. You're analyzing Paytm's transaction data with correlations: amount_spent vs transaction_frequency (0.234), amount_spent vs user_age (0.312), amount_spent vs app_session_time (0.687), amount_spent vs merchant_category_preference (0.156). Which insight should drive your marketing strategy?
2. OYO's data science team wants to reduce 12 customer variables (age, booking_frequency, average_stay_duration, revenue_per_booking, rating, cancellation_rate, etc.) to 3-4 key dimensions for clustering. What's the correct first step?
3. Myntra's analysis shows these interaction patterns: Young Electronics buyers (₹12,456 average), Senior Electronics buyers (₹28,445 average), Young Food buyers (₹2,134 average), Senior Food buyers (₹3,456 average). What's the key business insight?
Up Next
Summary Statistics
Master the essential statistical measures that transform raw multivariate insights into clear, actionable business metrics and KPIs.