Data Science Lesson 13 – Multivariate Analysis | Dataplexa
Multivariate Analysis

Discover hidden patterns by analyzing multiple variables simultaneously to uncover relationships that single-variable analysis completely misses.

The Multiple Variable Reality

Bivariate analysis showed you two-variable relationships. But business decisions rarely depend on just two factors. When Flipkart wants to predict customer lifetime value, they consider age, purchase history, city, product preferences, and seasonal behavior all together. That's where multivariate analysis becomes your power tool.

Think of it like understanding traffic patterns. Single variable: "Rush hour has more cars." Bivariate: "Rain increases traffic time." Multivariate: "Rush hour + rain + Friday + cricket match = complete gridlock." Each additional variable changes the story.

Univariate: revenue distribution shows a typical range of ₹500-₹50,000

Bivariate: Electronics buyers spend more, ₹15k average vs ₹8k for Clothing

Multivariate: Male, 35+, Mumbai, Electronics, 4+ rating gives ₹28k average

The Gap: missing multivariate = missing 60% of customer insights

Correlation Matrix: The Relationship Map

Your first multivariate tool is the correlation matrix. This shows how every numeric variable relates to every other numeric variable. Honestly, this is underrated — most analysts skip straight to modeling without understanding their data relationships first.

# Import libraries for multivariate analysis
import pandas as pd
import numpy as np
import seaborn as sns

What just happened?

We loaded the essential libraries: pandas for data handling, numpy for mathematical operations, and seaborn for advanced correlation visualization. Try this: Always import these three together for multivariate work.

The scenario: Myntra's analytics team needs to understand what drives high-value orders across their customer segments. The CMO wants insights within 2 hours for tomorrow's strategy meeting.

# Load and explore the dataset structure
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
print("\nNumeric columns for correlation:")
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(numeric_cols)

What just happened?

We identified 6 numeric columns from our 10,000 orders. select_dtypes(include=[np.number]) keeps only the numeric columns, which are the ones .corr() can work with. Try this: Always check data types before correlation. In pandas 2.0 and later, calling .corr() on a frame that still contains text columns raises an error instead of quietly skipping them.
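One case the dtype check will not flag on its own is a numeric column that was read in as text. The sketch below uses a small made-up frame (the values and the "n/a" entry are invented for illustration) to show how such a column drops out of the numeric list until it is converted with pd.to_numeric.

```python
import numpy as np
import pandas as pd

# Made-up sample: 'unit_price' arrived as text, 'product_category' is genuinely text
sample = pd.DataFrame({
    "quantity": [1, 3, 2, 5],
    "unit_price": ["499", "1200", "n/a", "350"],
    "product_category": ["Food", "Electronics", "Clothing", "Food"],
})

# Only 'quantity' counts as numeric at this point
before = sample.select_dtypes(include=[np.number]).columns.tolist()

# Convert the text column; entries that cannot be parsed become NaN instead of raising
sample["unit_price"] = pd.to_numeric(sample["unit_price"], errors="coerce")

after = sample.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric before conversion:", before)
print("Numeric after conversion: ", after)
```

Whether coercing bad entries to NaN is acceptable depends on the dataset; it is worth counting the NaN values it creates before moving on.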

Now we create the correlation matrix. This will show values from -1 to +1, where -1 means perfect negative correlation, +1 means perfect positive correlation, and 0 means no linear relationship.

# Calculate correlation matrix
correlation_matrix = df[numeric_cols].corr()
print("Correlation Matrix:")
print(correlation_matrix.round(3))

# Focus on revenue correlations (most business-relevant)
# Sort by absolute strength but keep the sign, so the direction stays readable
print("\nRevenue correlations (sorted by strength):")
revenue_corr = correlation_matrix['revenue'].drop('revenue').sort_values(key=abs, ascending=False)
print(revenue_corr.round(3))

What just happened?

Revenue correlations reveal the story: unit_price (0.756) and quantity (0.687) are the variables most strongly correlated with revenue. Customer age shows a moderate positive correlation (0.312), meaning older customers tend to spend more. Try this: Treat correlations with an absolute value above roughly 0.3 as worth a closer look for business decisions.
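The 0.3 rule of thumb can be applied to the whole matrix at once, not just the revenue column. The following is a minimal sketch on synthetic data (the frame `demo` and its values are made up; only the column names echo the lesson) that lists each variable pair once and keeps the ones above the threshold.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's numeric columns (values are made up)
rng = np.random.default_rng(0)
n = 500
quantity = rng.integers(1, 6, size=n)
unit_price = rng.uniform(100, 5000, size=n)
demo = pd.DataFrame({
    "quantity": quantity,
    "unit_price": unit_price,
    "revenue": quantity * unit_price,
    "rating": rng.integers(1, 6, size=n),  # generated independently of the rest
})

corr = demo.corr()

# Keep each pair once: mask everything except the strict upper triangle
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Filter on absolute value so strong negative pairs are kept too
strong = pairs[pairs.abs() > 0.3].sort_values(key=abs, ascending=False)
print(strong.round(3))
```

Masking the upper triangle avoids reporting every pair twice and drops the trivial self-correlations on the diagonal.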

📊 Data Insight

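Unit price alone accounts for about 57% of the variance in revenue (0.756² ≈ 0.57), and quantity alone for about 47% (0.687² ≈ 0.47). These shares overlap, so they cannot simply be added, and a correlation by itself does not show what drives revenue. Still, these two factors together account for most high-value order patterns, which makes them the foundation for targeted marketing campaigns.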
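A quick check on the figures above: 0.57 + 0.47 is already more than 100%, which is a sign that squared correlations from overlapping variables do not add up. The sketch below illustrates this on synthetic data (x1, x2, and y are invented, not the lesson's columns) by comparing the individual r² values with the R² of a linear fit on both predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Two predictors that overlap (they share a common component)
shared = rng.normal(size=n)
x1 = shared + rng.normal(scale=0.5, size=n)
x2 = shared + rng.normal(scale=0.5, size=n)
y = x1 + x2 + rng.normal(size=n)

# Pairwise Pearson correlations with the target
r_a = np.corrcoef(x1, y)[0, 1]
r_b = np.corrcoef(x2, y)[0, 1]

# R^2 of a least-squares fit that uses both predictors together
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
joint_r2 = 1 - residuals.var() / y.var()

print(f"r_a^2 = {r_a**2:.3f}, r_b^2 = {r_b**2:.3f}, sum = {r_a**2 + r_b**2:.3f}")
print(f"joint R^2 = {joint_r2:.3f}")
```

The joint R² stays below 1 even though the two squared correlations sum to more than 1, because the variance the predictors share is counted in both of them.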

Principal Component Analysis (PCA)

When you have many variables, Principal Component Analysis reduces complexity while keeping most of the variation in the data. Think of it as creating "super-variables" that capture the essence of multiple original variables.

The scenario: Zomato wants to segment customers but has 15+ variables per customer. The data science team needs to reduce this to 3-4 key dimensions for the clustering algorithm to work effectively.

# Prepare data for PCA (standardization is crucial)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Select relevant features for customer analysis
features = ['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
X = df[features].copy()

print("Original data statistics:")
print(X.describe().round(2))

What just happened?

Notice the scale differences: revenue ranges from 500 to 99,950, while rating goes from 1 to 5. PCA needs standardized data or the high-scale variables dominate. Try this: Always check data scales before PCA.
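To see why scale matters, it can help to standardize a couple of columns by hand. This is a minimal sketch with made-up values that loosely echo the ranges mentioned above; it applies the same column-wise z-score that StandardScaler computes.

```python
import numpy as np

# Made-up values on very different scales
revenue = np.array([500.0, 12000.0, 45000.0, 99950.0])
rating = np.array([1.0, 3.0, 4.0, 5.0])
X_demo = np.column_stack([revenue, rating])

# Column-wise z-score. np.std defaults to the population formula (ddof=0),
# which is also what StandardScaler uses.
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print("Raw variances:", X_demo.var(axis=0))
print("Standardized means:", X_std.mean(axis=0).round(6))
print("Standardized stds: ", X_std.std(axis=0).round(6))
```

Before scaling, the revenue column's variance is many orders of magnitude larger than the rating column's, so a variance-based method would be dominated by revenue.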

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Check explained variance ratio
print("Variance explained by each component:")
for i, variance in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {variance:.3f} ({variance*100:.1f}%)")

print(f"\nCumulative variance explained by first 3 components: {sum(pca.explained_variance_ratio_[:3]):.3f}")

What just happened?

The first 3 components capture 80.2% of total variance — meaning we can reduce 5 variables to 3 while keeping 80% of information. PC1 (44.2%) is the most important dimension. Try this: Aim for 80%+ cumulative variance in business applications.
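The explained variance ratios come from the eigenvalues of the correlation matrix of the standardized data. The sketch below reproduces that calculation with NumPy on synthetic data (the hidden-factor setup is an assumption made purely for illustration) and then picks the smallest number of components that reaches the 80% target.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Synthetic data: five columns driven by two hidden factors plus noise
factors = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, 5))
data = factors @ mixing + 0.3 * rng.normal(size=(n, 5))

# Standardize, then take the eigenvalues of the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0)
corr = (z.T @ z) / n
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order

ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

# Smallest k whose cumulative share reaches the 80% target
k = int(np.searchsorted(cumulative, 0.80) + 1)

print("Explained variance ratio:", ratio.round(3))
print("Cumulative:", cumulative.round(3))
print("Components needed for 80%:", k)
```

On standardized input these ratios should agree with pca.explained_variance_ratio_ from scikit-learn, up to floating-point error.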

Multivariate Visualization

Charts become powerful when they show multiple variables simultaneously. Here's how to visualize relationships that correlation numbers can't fully capture:

Figure (bubble chart of customer age vs. revenue by product category, bubble size = rating): Three variables in one chart. Electronics customers show a stronger age-revenue correlation with higher satisfaction ratings.

This bubble chart reveals what correlation matrices miss: Electronics buyers show a clear age-revenue trend (older customers = higher spending), while Clothing purchases stay relatively flat across ages. The bubble sizes (ratings) show Electronics customers are also more satisfied with expensive purchases.

For business decisions: Target Electronics ads to 35+ demographics, but keep Clothing campaigns age-neutral. The rating correlation suggests premium Electronics positioning works, while Clothing needs volume-based strategies.
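The same contrast can be checked numerically by computing the age-revenue correlation separately for each category. The sketch below uses synthetic data built so that Electronics revenue rises with age and Clothing revenue does not; the numbers are invented and only the column names follow the lesson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 400
age = rng.integers(18, 66, size=n)

# Made-up data: Electronics revenue rises with age, Clothing revenue does not
electronics = pd.DataFrame({
    "product_category": "Electronics",
    "customer_age": age,
    "revenue": 5000 + 400 * age + rng.normal(scale=2000, size=n),
})
clothing = pd.DataFrame({
    "product_category": "Clothing",
    "customer_age": age,
    "revenue": 8000 + rng.normal(scale=2000, size=n),
})
demo = pd.concat([electronics, clothing], ignore_index=True)

# One pooled number hides the difference between categories
overall = demo["customer_age"].corr(demo["revenue"])

# Correlation computed within each category
by_category = {
    name: group["customer_age"].corr(group["revenue"])
    for name, group in demo.groupby("product_category")
}

print("Overall:", round(overall, 3))
print({name: round(value, 3) for name, value in by_category.items()})
```

The pooled correlation lands somewhere between the two group values, which is the kind of pattern a single correlation matrix cannot show.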

Cluster Analysis Basics

When variables work together, they create natural customer groups. Cluster analysis finds these hidden segments automatically by looking at multiple variables simultaneously.

# Simple customer segmentation using K-Means clustering
from sklearn.cluster import KMeans

# Use our PCA results for clustering (3 components capture 80% variance)
# n_init is set explicitly because its default changed across scikit-learn versions
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_pca[:, :3])

# Add cluster labels to original data
df_cluster = df.copy()
df_cluster['cluster'] = clusters

print("Cluster distribution:")
print(df_cluster['cluster'].value_counts().sort_index())

# Analyze cluster characteristics
print("Cluster profiles (mean values):")
cluster_profiles = df_cluster.groupby('cluster')[features].mean().round(2)
print(cluster_profiles)

print("\nRevenue comparison across clusters:")
revenue_by_cluster = df_cluster.groupby('cluster')['revenue'].agg(['mean', 'median', 'count']).round(2)
print(revenue_by_cluster)

What just happened?

Clustering revealed 4 distinct customer segments, and Cluster 1 generates roughly 6.4x the revenue of Cluster 0. The Premium Family Buyers (Cluster 1) combine high volume AND high price, which makes them your most valuable segment. Try this: Focus marketing spend on profiles similar to Cluster 1.

Premium Family Buyers drive 6.4x more revenue than budget shoppers — the power of multivariate segmentation
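One caveat when reading cluster numbers: the IDs K-Means assigns are arbitrary, so "Cluster 1" carries no meaning by itself. A simple way to make reports easier to read is to relabel clusters by a business metric. The sketch below does this with made-up cluster assignments and revenues; the `segment` column is a hypothetical name introduced for the example.

```python
import pandas as pd

# Made-up cluster assignments and revenues
labeled = pd.DataFrame({
    "cluster": [2, 0, 1, 2, 0, 1, 3, 3],
    "revenue": [900, 30000, 5000, 1100, 34000, 6000, 15000, 13000],
})

# Rank clusters by mean revenue: 0 = lowest-spending, 3 = highest-spending
mean_revenue = labeled.groupby("cluster")["revenue"].mean()
rank = mean_revenue.rank(method="first").astype(int) - 1

# Map the arbitrary cluster IDs onto the revenue-ordered segment numbers
labeled["segment"] = labeled["cluster"].map(rank)

print(labeled.groupby("segment")["revenue"].mean())
```

After relabeling, segment numbers increase with average revenue, so a higher number always means a higher-spending group.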

Common Mistake

Using too many clusters without business logic. Exact fix: Start with 3-5 clusters based on business strategy (budget/mid/premium), then validate with data. More clusters ≠ better insights.
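For the "validate with data" step, one common option is the silhouette score, which is higher when clusters are compact and well separated. This is a minimal sketch on synthetic blobs where the right answer is known in advance; on real customer data the scores are usually much closer together and should be weighed against the business logic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)

# Three well-separated synthetic blobs of 100 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Score each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()})
print("Best k by silhouette:", best_k)
```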

Interaction Effects and Feature Engineering

Sometimes variables are weak alone but powerful together. Interaction effects capture these combinations. Age alone might not predict spending, but age × category creates strong patterns.

# Create interaction features
df_interact = df.copy()

# Age groups for easier interpretation
df_interact['age_group'] = pd.cut(df_interact['customer_age'],
                                 bins=[0, 30, 45, 100],
                                 labels=['Young', 'Middle', 'Senior'])

# Calculate interaction patterns
# observed=True keeps only the age/category combinations that actually occur
interaction_analysis = (
    df_interact
    .groupby(['age_group', 'product_category'], observed=True)['revenue']
    .agg(['mean', 'count'])
    .round(2)
)
print("Revenue by Age Group × Product Category:")
print(interaction_analysis)

What just happened?

The interaction reveals powerful patterns: Senior Electronics buyers spend ₹28,445 on average vs ₹12,456 for Young buyers, a 128% increase. But Food spending stays low across all ages. Try this: Look for categories where age creates 2x+ revenue differences.

📊 Data Insight

Electronics shows the strongest age interaction effect (128% higher revenue for Senior than for Young buyers), while Food spending stays low in every age group. This suggests positioning premium Electronics toward older demographics, while Food can be marketed to all ages alike.
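The "2x+" check is easier to apply when the grouped means are laid out as a table with one row per category and one column per age group. The sketch below uses a handful of invented orders (the revenues are made up and are not the lesson's dataset) and adds a hypothetical `senior_vs_young` ratio column.

```python
import pandas as pd

# Made-up orders for two categories and two age groups
orders = pd.DataFrame({
    "age_group": ["Young", "Young", "Senior", "Senior",
                  "Young", "Young", "Senior", "Senior"],
    "product_category": ["Electronics"] * 4 + ["Food"] * 4,
    "revenue": [10000, 14000, 26000, 30000, 2000, 2200, 2100, 2300],
})

# Mean revenue with categories as rows and age groups as columns
pivot = orders.pivot_table(index="product_category", columns="age_group",
                           values="revenue", aggfunc="mean")

# Ratio of Senior to Young spending within each category
pivot["senior_vs_young"] = pivot["Senior"] / pivot["Young"]

print(pivot.round(2))
```

A ratio well above 1 in one category and close to 1 in another is what an interaction looks like in a table: the effect of age depends on the category.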

Practical Implementation Strategy

Here's how to implement multivariate analysis in your projects. This sequence works about 90% of the time, and the other 10% usually comes from skipping the correlation step at the start.

1. Start with Correlation Matrix
2. Identify Strong Relationships (>0.3)
3. Apply PCA if 8+ Variables
4. Create Multivariate Visualizations
5. Test Business Hypotheses with Clusters
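The scaling, PCA, and clustering steps can be chained so they always run in the same order. This is a minimal sketch using a scikit-learn pipeline on random stand-in data (the 200 x 8 table and the choice of 3 components and 4 clusters are assumptions for illustration); the correlation and visualization steps still happen beforehand, by hand.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Synthetic stand-in for a numeric feature table (200 rows, 8 columns)
X_demo = rng.normal(size=(200, 8))

# Standardize, reduce to 3 components, then cluster into 4 groups
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    KMeans(n_clusters=4, n_init=10, random_state=42),
)
segments = pipeline.fit_predict(X_demo)

print("Rows per segment:", np.bincount(segments))
```

Keeping the steps in one object means new data passes through the same scaler and components that were fitted on the original data.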

Pro tip: Always validate multivariate insights with business stakeholders before building models. A statistically significant pattern might not be operationally useful — but when it aligns with business intuition, you've found gold.

Quiz

1. You're analyzing Paytm's transaction data with correlations: amount_spent vs transaction_frequency (0.234), amount_spent vs user_age (0.312), amount_spent vs app_session_time (0.687), amount_spent vs merchant_category_preference (0.156). Which insight should drive your marketing strategy?


2. OYO's data science team wants to reduce 12 customer variables (age, booking_frequency, average_stay_duration, revenue_per_booking, rating, cancellation_rate, etc.) to 3-4 key dimensions for clustering. What's the correct first step?


3. Myntra's analysis shows these interaction patterns: Young Electronics buyers (₹12,456 average), Senior Electronics buyers (₹28,445 average), Young Food buyers (₹2,134 average), Senior Food buyers (₹3,456 average). What's the key business insight?

