Data Science
Bivariate Analysis
Discover relationships between two variables and extract actionable business insights using scatter plots, correlation analysis, and statistical tests.
Univariate analysis shows us what individual variables look like. Bivariate analysis answers the harder question: how do variables interact with each other? Does higher customer age lead to bigger orders? Which product categories correlate with better ratings?
Honestly, this is where data science gets interesting. Single variables tell stories. Variable relationships reveal business strategy. And the techniques here work whether you're analyzing 500 orders or 5 million.
Visual Relationship Detection
Your eyes process relationships faster than any correlation coefficient. Start visual. Always. The best analysts I know spend 60% of their bivariate time creating plots.
The scenario: Flipkart's pricing team needs to understand if customer age affects order values. Leadership wants evidence within 2 hours for tomorrow's strategy meeting.
# Load and inspect the dataset first
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")
print(f"Age range: {df['customer_age'].min()}-{df['customer_age'].max()}")
print(f"Revenue range: INR {df['revenue'].min():,.0f} - {df['revenue'].max():,.0f}")
Dataset shape: (8500, 11)
Age range: 18-65
Revenue range: INR 524 - 198,750
What just happened?
We got quick bounds on both variables. customer_age spans 47 years, revenue has a huge range from INR 524 to nearly 2 lakhs. Try this: always check ranges before plotting to set appropriate scales.
Now create the fundamental relationship plot:
# Create scatter plot to visualize age vs revenue relationship
plt.figure(figsize=(10, 6))
plt.scatter(df['customer_age'], df['revenue'], alpha=0.6, color='#0f766e')
plt.xlabel('Customer Age (years)', fontsize=12)
plt.ylabel('Revenue (INR)', fontsize=12)
plt.title('Customer Age vs Order Revenue', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()
[Scatter plot displayed showing positive correlation between age and revenue]
What just happened?
The scatter plot reveals an upward trend - older customers generally place higher-value orders. alpha=0.6 makes overlapping points visible. Try this: look for patterns, clusters, or outliers before calculating any statistics.
Clear positive correlation: older customers consistently generate higher revenue per order
The pattern is obvious. Customers in their 50s and 60s place orders worth 3-4x more than customers in their 20s. But how strong is this relationship? Visual inspection suggests correlation, but business decisions need numbers.
Correlation Analysis
Correlation coefficients measure relationship strength on a scale from -1 to +1. Here's what matters: 0.7+ is strong, 0.3-0.7 is moderate, below 0.3 is weak. Negative values indicate inverse relationships.
Pearson Correlation
Measures linear relationships. Its significance test assumes roughly normal data. Most common in business analysis.
Spearman Correlation
Measures monotonic relationships. Works with non-normal data and outliers.
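A quick synthetic demo makes the distinction concrete: on a perfectly monotonic but curved relationship, Spearman scores a full 1.0 while Pearson falls short. (Illustrative data only, not from the e-commerce dataset.)

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 51)
y = x.astype(float) ** 3   # perfectly monotonic, strongly curved

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)
print(f"Pearson:  {pearson_r:.3f}")   # under 1.0 - penalized by the curvature
print(f"Spearman: {spearman_r:.3f}")  # 1.000 - the ranks line up exactly
```

When the two coefficients diverge sharply on real data, that gap itself is a signal: the relationship is monotonic but not linear.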
The scenario: Swiggy's analytics team needs correlation coefficients for their quarterly business review. They want both Pearson and Spearman to account for data skewness.
# Calculate both correlation types
from scipy.stats import pearsonr, spearmanr
# Pearson correlation for linear relationships
pearson_corr, pearson_p = pearsonr(df['customer_age'], df['revenue'])
print(f"Pearson correlation: {pearson_corr:.3f}")
print(f"P-value: {pearson_p:.6f}")
print(f"Relationship strength: {'Strong' if abs(pearson_corr) > 0.7 else 'Moderate' if abs(pearson_corr) > 0.3 else 'Weak'}")
Pearson correlation: 0.684
P-value: 0.000012
Relationship strength: Moderate
# Spearman correlation for monotonic relationships
spearman_corr, spearman_p = spearmanr(df['customer_age'], df['revenue'])
print(f"Spearman correlation: {spearman_corr:.3f}")
print(f"P-value: {spearman_p:.6f}")
# Check if correlations differ significantly
diff = abs(pearson_corr - spearman_corr)
print(f"\nCorrelation difference: {diff:.3f}")
print("Data linearity:", "Linear" if diff < 0.1 else "Non-linear patterns present")
Spearman correlation: 0.697
P-value: 0.000008

Correlation difference: 0.013
Data linearity: Linear
What just happened?
Both correlations (~0.69) indicate a moderate-to-strong positive relationship, and p-values < 0.05 confirm statistical significance. The small difference (0.013) suggests a linear relationship without major outliers. Try this: always report the correlation and its p-value together.
📊 Data Insight
Customer age explains roughly 47% of revenue variance (0.684² ≈ 0.468). Correlation strength alone doesn't translate into a "percent uplift per decade" figure, though - quantifying that requires a regression slope, not just r.
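To put a rupee figure on "how much more per year of age", fit a simple linear regression and read off the slope. A minimal sketch with synthetic stand-in data - the assumed INR 1,500-per-year effect is made up; substitute df['customer_age'] and df['revenue'] for the real estimate:

```python
import numpy as np

# Synthetic stand-in data: an assumed linear age effect plus noise.
rng = np.random.default_rng(42)
age = rng.integers(18, 66, size=500)
revenue = 1500 * age + rng.normal(0, 20000, size=500)

# np.polyfit with deg=1 returns [slope, intercept]
slope, intercept = np.polyfit(age, revenue, deg=1)
print(f"Estimated effect: INR {slope:,.0f} per year of age")
print(f"Per decade: roughly INR {slope * 10:,.0f}")
```

The slope is what a pricing team can act on; the correlation only tells you how tightly the points hug that line.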
Correlation Matrix Analysis
Single correlations tell part of the story. Correlation matrices reveal the complete relationship landscape across all numerical variables simultaneously.
The scenario: Zomato's product team wants to understand all variable relationships before launching their premium subscription tier. They need a comprehensive correlation analysis across age, quantity, price, revenue, and ratings.
# Create comprehensive correlation matrix
import seaborn as sns
# Select numerical columns for correlation analysis
numerical_cols = ['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
correlation_matrix = df[numerical_cols].corr()
print("Correlation Matrix:")
print(correlation_matrix.round(3))
Correlation Matrix:
              customer_age  quantity  unit_price  revenue  rating
customer_age         1.000     0.156       0.621    0.684   0.423
quantity             0.156     1.000       0.089    0.846   0.234
unit_price           0.621     0.089       1.000    0.758   0.512
revenue              0.684     0.846       0.758    1.000   0.445
rating               0.423     0.234       0.512    0.445   1.000
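Since seaborn is already imported, a heatmap makes the strong cells far easier to spot than the raw printout. The sketch below rebuilds a small slice of the matrix inline so it runs standalone; in the walkthrough you would pass correlation_matrix directly:

```python
import matplotlib
matplotlib.use('Agg')  # non-GUI backend so the sketch also runs in scripts
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small inline slice of the matrix (values from the printout above);
# replace with the full correlation_matrix in practice.
correlation_matrix = pd.DataFrame(
    [[1.000, 0.846, 0.684],
     [0.846, 1.000, 0.156],
     [0.684, 0.156, 1.000]],
    index=['revenue', 'quantity', 'customer_age'],
    columns=['revenue', 'quantity', 'customer_age'],
)

plt.figure(figsize=(6, 5))
sns.heatmap(correlation_matrix, annot=True, fmt='.3f', cmap='coolwarm',
            vmin=-1, vmax=1, square=True)
plt.title('Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Pinning vmin=-1 and vmax=1 keeps the color scale comparable across datasets instead of stretching to whatever range this matrix happens to cover.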
# Identify strongest relationships (excluding diagonal)
import numpy as np
# Create mask for upper triangle (avoid duplicates)
mask = np.triu(np.ones_like(correlation_matrix), k=1).astype(bool)
correlations = correlation_matrix.where(mask).stack().sort_values(key=abs, ascending=False)
print("\nStrongest correlations (descending):")
for pair, corr_value in correlations.head(5).items():
    strength = "Strong" if abs(corr_value) > 0.7 else "Moderate" if abs(corr_value) > 0.3 else "Weak"
    print(f"{pair[0]} ↔ {pair[1]}: {corr_value:.3f} ({strength})")
Strongest correlations (descending):
quantity ↔ revenue: 0.846 (Strong)
unit_price ↔ revenue: 0.758 (Strong)
customer_age ↔ revenue: 0.684 (Moderate)
customer_age ↔ unit_price: 0.621 (Moderate)
unit_price ↔ rating: 0.512 (Moderate)
What just happened?
The strongest driver of revenue is quantity (0.846), not age. unit_price and customer_age also strongly influence revenue. Try this: focus on the top 3-5 correlations for business strategy - they usually carry most of the actionable insight.
Quantity drives revenue most strongly, followed by unit price and customer age
This changes the business narrative completely. While age correlates with revenue, quantity per order is the dominant factor. Older customers buy higher-priced items, but the real revenue driver is getting customers to purchase more items per transaction.
Categorical Relationships
Not every relationship involves numbers. Cross-tabulation and chi-square tests reveal how categorical variables interact. Does gender influence product category preferences? Which cities prefer which product types?
Pro tip: Cross-tabs work best with 2-5 categories per variable. More categories create sparse tables that are hard to interpret and test statistically.
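One way to honor that guideline is to lump infrequent levels into an "Other" bucket before cross-tabulating. A sketch with made-up category counts - the 10% share threshold is an arbitrary choice:

```python
import pandas as pd

# Made-up category data: three dominant levels, three rare ones.
s = pd.Series(['Clothing'] * 40 + ['Electronics'] * 35 + ['Books'] * 15 +
              ['Toys'] * 5 + ['Garden'] * 3 + ['Pets'] * 2)

share = s.value_counts(normalize=True)
keep = share[share >= 0.10].index            # keep categories with >= 10% share
collapsed = s.where(s.isin(keep), 'Other')   # rare levels become 'Other'
print(collapsed.value_counts())
```

The collapsed series now has four well-populated levels, which keeps expected cell counts healthy for the chi-square test that follows.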
The scenario: Myntra's merchandising team needs to understand gender-based product preferences for their upcoming marketing campaign targeting different demographics.
# Create cross-tabulation for gender vs product category
cross_tab = pd.crosstab(df['gender'], df['product_category'], margins=True)
print("Gender vs Product Category Cross-tabulation:")
print(cross_tab)
Gender vs Product Category Cross-tabulation:
product_category  Books  Clothing  Electronics  Food  Home   All
gender
Female              892      1847         1205   643   891  5478
Male                756       989         1234   578   465  4022
All                1648      2836         2439  1221  1356  9500
# Convert to percentages for easier interpretation
cross_tab_pct = pd.crosstab(df['gender'], df['product_category'], normalize='index') * 100
print("\nGender vs Product Category (Row Percentages):")
print(cross_tab_pct.round(1))
# Test for statistical significance
from scipy.stats import chi2_contingency
# Remove margin totals for chi-square test
contingency_table = cross_tab.iloc[:-1, :-1]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"\nChi-square test results:")
print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant relationship: {'Yes' if p_value < 0.05 else 'No'}")
Gender vs Product Category (Row Percentages):
product_category  Books  Clothing  Electronics  Food  Home
gender
Female             16.3      33.7         22.0  11.7  16.3
Male               18.8      24.6         30.7  14.4  11.6

Chi-square test results:
Chi-square statistic: 187.432
P-value: 0.000000
Significant relationship: Yes
What just happened?
Clear gender differences: females prefer Clothing (33.7% vs 24.6%), while males prefer Electronics (30.7% vs 22.0%). P-value < 0.001 confirms this isn't random variation. Try this: always look at row percentages to understand preference patterns within each group.
Distinct gender preferences: females favor clothing, males prefer electronics
The relationship is statistically significant and business-relevant. Marketing campaigns should emphasize clothing for female audiences and electronics for male audiences. But here's the catch: statistical significance doesn't always equal business significance. A 9-point difference in clothing preference is meaningful for campaign targeting.
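One way to put a number on business relevance alongside significance is an effect size such as Cramér's V, which rescales chi-square to a 0-1 strength measure. A sketch using the contingency counts printed above, rebuilt inline so it runs standalone (recomputed from the rounded counts, so the statistic may differ slightly from the output shown):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency counts from the cross-tab above.
# Columns: Books, Clothing, Electronics, Food, Home
observed = np.array([
    [892, 1847, 1205, 643, 891],   # Female
    [756,  989, 1234, 578, 465],   # Male
])

chi2, p, dof, expected = chi2_contingency(observed)
n = observed.sum()
min_dim = min(observed.shape) - 1          # min(rows, cols) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))  # Cramér's V effect size
print(f"Chi-square: {chi2:.1f} (dof={dof}), p = {p:.2e}")
print(f"Cramer's V: {cramers_v:.3f}")  # rough guide: ~0.1 small, ~0.3 medium
```

A huge chi-square with a modest V is common on large samples: the association is real, but each group's behavior still overlaps substantially.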
Common mistake: Assuming correlation = causation
Age correlates with revenue, but age doesn't cause higher spending - purchasing power and life stage do. Always ask "what's the underlying mechanism?" before making business recommendations based on correlations.
Statistical Significance Testing
Correlation coefficients and cross-tabs show relationships. Significance tests tell you if those relationships are real or just random noise. This is crucial for business decisions based on sample data.
| Test Type | Use Case | Data Requirement | Output |
|---|---|---|---|
| Pearson r-test | Linear correlation significance | Two continuous variables | P-value for correlation |
| Chi-square test | Categorical independence | Two categorical variables | P-value for association |
| T-test | Mean differences | Continuous vs categorical | P-value for difference |
| ANOVA | Multiple group comparisons | Continuous vs multi-category | P-value for group effect |
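The t-test row is the only technique in this table not demonstrated elsewhere in the lesson. A minimal sketch with synthetic two-group data - the city labels and revenue figures are stand-ins; swap in real revenue columns:

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-in data for two groups of order values.
rng = np.random.default_rng(0)
group_a = rng.normal(45000, 12000, size=300)  # e.g. one city's revenues
group_b = rng.normal(52000, 12000, size=300)  # e.g. another city's revenues

# Independent two-sample t-test on the group means
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.6f}")
print("Significant difference:", "Yes" if p_value < 0.05 else "No")
```

By default ttest_ind assumes equal variances; pass equal_var=False for Welch's t-test when the groups have clearly different spreads.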
The scenario: BigBasket's pricing team wants to test if average order values differ significantly between cities. They need statistical proof before adjusting regional pricing strategies.
# Test revenue differences across cities using ANOVA
from scipy.stats import f_oneway
# Group revenue by city
city_groups = [df[df['city'] == city]['revenue'] for city in df['city'].unique()]
city_names = df['city'].unique()
# Calculate means for each city
print("Average revenue by city:")
for city in city_names:
avg_revenue = df[df['city'] == city]['revenue'].mean()
print(f"{city}: INR {avg_revenue:,.0f}")
# Perform ANOVA test
f_stat, p_value = f_oneway(*city_groups)
print(f"\nANOVA Results:")
print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant differences: {'Yes' if p_value < 0.05 else 'No'}")
Average revenue by city:
Mumbai: INR 45,690
Delhi: INR 42,350
Bangalore: INR 51,240
Chennai: INR 39,870
Pune: INR 47,180

ANOVA Results:
F-statistic: 23.451
P-value: 0.000000
Significant differences: Yes
What just happened?
Bangalore leads with INR 51,240 average revenue, Chennai lowest at INR 39,870. F-statistic = 23.451 with p < 0.001 confirms real differences, not sampling variation. Try this: follow up ANOVA with post-hoc tests to identify which specific city pairs differ significantly.
📊 Data Insight
Bangalore customers spend 28% more than Chennai customers (INR 51,240 vs INR 39,870). This INR 11,370 difference per order justifies different pricing strategies and marketing spend allocation across cities.
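ANOVA says "some cities differ" but not which ones - that's what post-hoc tests answer. Tukey's HSD is the classic choice; the sketch below uses the simpler pairwise-t-tests-with-Bonferroni approach so it needs only scipy. City figures are synthetic stand-ins, not the dataset above:

```python
from itertools import combinations
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-in revenue samples for three cities.
rng = np.random.default_rng(7)
groups = {
    'Bangalore': rng.normal(51000, 9000, 200),
    'Mumbai':    rng.normal(45500, 9000, 200),
    'Chennai':   rng.normal(40000, 9000, 200),
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: split alpha across the 3 comparisons
results = {}
for a, b in pairs:
    t, p = ttest_ind(groups[a], groups[b])
    results[(a, b)] = p
    verdict = 'differ' if p < alpha else 'no clear difference'
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.2e} -> {verdict}")
```

Bonferroni is conservative; with many cities, Tukey's HSD (e.g. statsmodels' pairwise_tukeyhsd) controls the family-wise error rate with more power.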
Bivariate analysis transforms descriptive statistics into predictive insights. Age predicts revenue. Gender influences product preferences. Cities show distinct spending patterns. These relationships become the foundation for segmentation, targeting, and personalization strategies.
The key is moving beyond simple correlations to understanding mechanisms. Why do older customers spend more? Because they have higher disposable income and buy premium products. Why does Bangalore show higher revenues? Tech hub demographics with higher purchasing power. The statistics point the direction. Business context provides the roadmap.
Quiz
1. Based on the correlation matrix showing customer_age (0.684), quantity (0.846), and unit_price (0.758) correlations with revenue, what is the primary business insight for increasing revenue?
2. In the chi-square test showing Chi-square statistic: 187.432 and P-value: 0.000000 for gender vs product category, what does this result tell us about the relationship?
3. When analyzing customer age vs revenue, you get Pearson correlation: 0.684 and Spearman correlation: 0.697. What does the small difference (0.013) between these values indicate?
Up Next
Multivariate Analysis
Analyze complex relationships among three or more variables simultaneously using advanced techniques like PCA, clustering, and multiple regression.