Data Science Lesson 12 – Bivariate Analysis | Dataplexa

Bivariate Analysis

Discover relationships between two variables and extract actionable business insights using scatter plots, correlation analysis, and statistical tests.

1. Scatter plots reveal linear relationships
2. Correlation coefficients quantify strength
3. Cross-tabs analyze categorical relationships
4. Statistical tests validate significance

Univariate analysis shows us what individual variables look like. Bivariate analysis answers the harder question: how do variables interact with each other? Does higher customer age lead to bigger orders? Which product categories correlate with better ratings?

Honestly, this is where data science gets interesting. Single variables tell stories. Variable relationships reveal business strategy. And the techniques here work whether you're analyzing 500 orders or 5 million.

Visual Relationship Detection

Your eyes process relationships faster than any correlation coefficient. Start visual. Always. The best analysts I know spend 60% of their bivariate time creating plots.

The scenario: Flipkart's pricing team needs to understand if customer age affects order values. Leadership wants evidence within 2 hours for tomorrow's strategy meeting.

# Load and inspect the dataset first
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")
print(f"Age range: {df['customer_age'].min()}-{df['customer_age'].max()}")
print(f"Revenue range: INR {df['revenue'].min():,.0f} - {df['revenue'].max():,.0f}")

What just happened?

We got quick bounds on both variables. customer_age spans 47 years, revenue has a huge range from INR 524 to nearly 2 lakhs. Try this: always check ranges before plotting to set appropriate scales.

Now create the fundamental relationship plot:

# Create scatter plot to visualize age vs revenue relationship
plt.figure(figsize=(10, 6))
plt.scatter(df['customer_age'], df['revenue'], alpha=0.6, color='#0f766e')
plt.xlabel('Customer Age (years)', fontsize=12)
plt.ylabel('Revenue (INR)', fontsize=12)
plt.title('Customer Age vs Order Revenue', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

What just happened?

The scatter plot reveals an upward trend - older customers generally place higher-value orders. alpha=0.6 makes overlapping points visible. Try this: look for patterns, clusters, or outliers before calculating any statistics.

Clear positive correlation: older customers consistently generate higher revenue per order

The pattern is obvious. Customers in their 50s and 60s place orders worth 3-4x more than customers in their 20s. But how strong is this relationship? Visual inspection suggests correlation, but business decisions need numbers.

Correlation Analysis

Correlation coefficients measure relationship strength on a scale from -1 to +1. Here's what matters: 0.7+ is strong, 0.3-0.7 is moderate, below 0.3 is weak. Negative values indicate inverse relationships.

Pearson Correlation

Measures linear relationships. Assumes normal distribution. Most common in business analysis.

Spearman Correlation

Measures monotonic relationships. Works with non-normal data and outliers.
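
Spearman's robustness to outliers is easy to demonstrate. A quick sketch with synthetic data (not the lesson's dataset): a single extreme outlier drags Pearson down sharply while Spearman barely moves.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = np.arange(50, dtype=float)
y = 2 * x + rng.normal(0, 5, size=50)  # clean linear relationship

y_out = y.copy()
y_out[-1] = -500  # inject one extreme outlier

p_clean, _ = pearsonr(x, y)
p_out, _ = pearsonr(x, y_out)
s_out, _ = spearmanr(x, y_out)

print(f"Pearson (clean data):    {p_clean:.3f}")
print(f"Pearson (with outlier):  {p_out:.3f}")   # drops sharply
print(f"Spearman (with outlier): {s_out:.3f}")  # barely affected
```

Because Spearman works on ranks, the outlier only shifts one rank; Pearson works on raw values, so a single extreme point can dominate the result.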

The scenario: Swiggy's analytics team needs correlation coefficients for their quarterly business review. They want both Pearson and Spearman to account for data skewness.

# Calculate both correlation types
from scipy.stats import pearsonr, spearmanr

# Pearson correlation for linear relationships
pearson_corr, pearson_p = pearsonr(df['customer_age'], df['revenue'])
print(f"Pearson correlation: {pearson_corr:.3f}")
print(f"P-value: {pearson_p:.6f}")
print(f"Relationship strength: {'Strong' if abs(pearson_corr) > 0.7 else 'Moderate' if abs(pearson_corr) > 0.3 else 'Weak'}")

# Spearman correlation for monotonic relationships
spearman_corr, spearman_p = spearmanr(df['customer_age'], df['revenue'])
print(f"Spearman correlation: {spearman_corr:.3f}")
print(f"P-value: {spearman_p:.6f}")

# Check if correlations differ significantly
diff = abs(pearson_corr - spearman_corr)
print(f"\nCorrelation difference: {diff:.3f}")
print("Data linearity:", "Linear" if diff < 0.1 else "Non-linear patterns present")

What just happened?

Both correlations (~0.69) indicate a moderate-to-strong positive relationship, and p-values < 0.05 confirm statistical significance. The small gap between the two coefficients (0.013) suggests a roughly linear relationship without major outliers. Try this: always report the correlation and its p-value together.

📊 Data Insight

Customer age explains about 47% of revenue variance (0.684² ≈ 0.468). Note that the correlation coefficient alone can't tell you the rupee impact per year of age - for that, fit a regression line and read off the slope.
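
Correlation quantifies strength, not rate: to turn a relationship into a rupee figure per year of age, fit a regression line and read off the slope. A minimal sketch with synthetic stand-in data (the lesson's CSV isn't bundled here, and the slope of 800 is an assumption baked into the fake data):

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in for the customer_age / revenue columns
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=500).astype(float)
revenue = 800 * age + rng.normal(0, 8000, size=500)

# Fit a least-squares line: slope is the expected revenue change per year of age
result = linregress(age, revenue)
print(f"Slope: INR {result.slope:,.0f} extra revenue per year of age")
print(f"R-squared: {result.rvalue**2:.3f}")
print(f"Expected lift over 10 years of age: INR {result.slope * 10:,.0f}")
```

On real data, replace the synthetic arrays with df['customer_age'] and df['revenue'].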

Correlation Matrix Analysis

Single correlations tell part of the story. Correlation matrices reveal the complete relationship landscape across all numerical variables simultaneously.

The scenario: Zomato's product team wants to understand all variable relationships before launching their premium subscription tier. They need a comprehensive correlation analysis across age, quantity, price, revenue, and ratings.

# Create comprehensive correlation matrix
import seaborn as sns

# Select numerical columns for correlation analysis
numerical_cols = ['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
correlation_matrix = df[numerical_cols].corr()

print("Correlation Matrix:")
print(correlation_matrix.round(3))

# Identify strongest relationships (excluding diagonal)
import numpy as np

# Create mask for upper triangle (avoid duplicates)
mask = np.triu(np.ones_like(correlation_matrix), k=1).astype(bool)
correlations = correlation_matrix.where(mask).stack().sort_values(key=abs, ascending=False)

print("\nStrongest correlations (descending):")
for pair, corr_value in correlations.head(5).items():
    strength = "Strong" if abs(corr_value) > 0.7 else "Moderate" if abs(corr_value) > 0.3 else "Weak"
    print(f"{pair[0]} ↔ {pair[1]}: {corr_value:.3f} ({strength})")

What just happened?

The strongest driver of revenue is quantity (0.846), not age. unit_price and customer_age also influence revenue strongly. Try this: focus on the top 3-5 correlations for business strategy - they usually carry most of the actionable insight.

Quantity drives revenue most strongly, followed by unit price and customer age

This changes the business narrative completely. While age correlates with revenue, quantity per order is the dominant factor. Older customers buy higher-priced items, but the real revenue driver is getting customers to purchase more items per transaction.
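
The matrix is easier to scan as a heatmap, and seaborn handles that in one call. A sketch using a small synthetic stand-in for the five numerical columns (the real CSV isn't bundled here):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in: revenue is quantity * unit_price by construction
rng = np.random.default_rng(1)
qty = rng.integers(1, 10, size=300)
price = rng.uniform(200, 5000, size=300)
df = pd.DataFrame({
    'customer_age': rng.integers(18, 65, size=300),
    'quantity': qty,
    'unit_price': price,
    'revenue': qty * price,
    'rating': rng.uniform(1, 5, size=300),
})

corr = df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', vmin=-1, vmax=1)
plt.title('Correlation Heatmap', fontweight='bold')
plt.tight_layout()
plt.show()
```

annot=True prints each coefficient in its cell, and pinning vmin/vmax to -1 and 1 keeps the color scale honest across datasets.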

Categorical Relationships

Not every relationship involves numbers. Cross-tabulation and chi-square tests reveal how categorical variables interact. Does gender influence product category preferences? Which cities prefer which product types?

Pro tip: Cross-tabs work best with 2-5 categories per variable. More categories create sparse tables that are hard to interpret and test statistically.

The scenario: Myntra's merchandising team needs to understand gender-based product preferences for their upcoming marketing campaign targeting different demographics.

# Create cross-tabulation for gender vs product category
cross_tab = pd.crosstab(df['gender'], df['product_category'], margins=True)
print("Gender vs Product Category Cross-tabulation:")
print(cross_tab)

# Convert to percentages for easier interpretation
cross_tab_pct = pd.crosstab(df['gender'], df['product_category'], normalize='index') * 100
print("\nGender vs Product Category (Row Percentages):")
print(cross_tab_pct.round(1))

# Test for statistical significance
from scipy.stats import chi2_contingency

# Remove margin totals for chi-square test
contingency_table = cross_tab.iloc[:-1, :-1]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-square test results:")
print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant relationship: {'Yes' if p_value < 0.05 else 'No'}")

What just happened?

Clear gender differences: females prefer Clothing (33.7% vs 24.6%), while males prefer Electronics (30.7% vs 22.0%). P-value < 0.001 confirms this isn't random variation. Try this: always look at row percentages to understand preference patterns within each group.

Distinct gender preferences: females favor clothing, males prefer electronics

The relationship is statistically significant and business-relevant: marketing campaigns should emphasize clothing for female audiences and electronics for male audiences. One caveat: statistical significance doesn't automatically mean business significance - here, though, a 9-point difference in clothing preference is large enough to matter for campaign targeting.
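
An effect-size measure helps separate statistical from business significance. Cramér's V rescales the chi-square statistic onto a 0-1 range so you can judge practical strength, not just whether an association exists. A sketch with a small hypothetical table (not the lesson's actual counts):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V effect size for a contingency table (0 = none, 1 = perfect)."""
    table = np.asarray(table)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical gender x category counts
counts = [[340, 220, 260],
          [250, 310, 240]]
v = cramers_v(counts)
print(f"Cramer's V: {v:.3f}")  # rough guide: 0.1 small, 0.3 medium, 0.5 large
```

A table can have a vanishingly small p-value yet a tiny V when the sample is huge - exactly the significance-versus-importance trap.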

Common mistake: Assuming correlation = causation

Age correlates with revenue, but age doesn't cause higher spending - purchasing power and life stage do. Always ask "what's the underlying mechanism?" before making business recommendations based on correlations.

Statistical Significance Testing

Correlation coefficients and cross-tabs show relationships. Significance tests tell you if those relationships are real or just random noise. This is crucial for business decisions based on sample data.

Test Type       | Use Case                        | Data Requirement             | Output
----------------|---------------------------------|------------------------------|--------------------------
Pearson r-test  | Linear correlation significance | Two continuous variables     | P-value for correlation
Chi-square test | Categorical independence        | Two categorical variables    | P-value for association
T-test          | Mean differences                | Continuous vs categorical    | P-value for difference
ANOVA           | Multiple group comparisons      | Continuous vs multi-category | P-value for group effect
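
The t-test row of the table hasn't appeared in code yet in this lesson. A sketch comparing mean revenue between two groups, using synthetic samples (the real CSV isn't bundled, and the group means are assumptions):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
# Hypothetical revenue samples for two customer segments
group_a = rng.normal(45000, 12000, size=500)
group_b = rng.normal(41000, 12000, size=500)

# Welch's t-test (equal_var=False) doesn't assume equal group variances
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"t-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'}")
```

On real data, the two groups could be, say, revenue for repeat versus first-time customers.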

The scenario: BigBasket's pricing team wants to test if average order values differ significantly between cities. They need statistical proof before adjusting regional pricing strategies.

# Test revenue differences across cities using ANOVA
from scipy.stats import f_oneway

# Group revenue by city
city_groups = [df[df['city'] == city]['revenue'] for city in df['city'].unique()]
city_names = df['city'].unique()

# Calculate means for each city
print("Average revenue by city:")
for city in city_names:
    avg_revenue = df[df['city'] == city]['revenue'].mean()
    print(f"{city}: INR {avg_revenue:,.0f}")

# Perform ANOVA test
f_stat, p_value = f_oneway(*city_groups)
print(f"\nANOVA Results:")
print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant differences: {'Yes' if p_value < 0.05 else 'No'}")

What just happened?

Bangalore leads with INR 51,240 average revenue, Chennai lowest at INR 39,870. F-statistic = 23.451 with p < 0.001 confirms real differences, not sampling variation. Try this: follow up ANOVA with post-hoc tests to identify which specific city pairs differ significantly.
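
The post-hoc step can be sketched with pairwise Welch t-tests under a Bonferroni correction (Tukey's HSD is the textbook choice; this simpler stand-in shows the idea). The city samples here are synthetic, with assumed means:

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
# Hypothetical revenue samples per city (assumed means, not the real data)
cities = {
    'Bangalore': rng.normal(51000, 15000, size=300),
    'Mumbai':    rng.normal(46000, 15000, size=300),
    'Chennai':   rng.normal(40000, 15000, size=300),
}

pairs = list(combinations(cities, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold
results = {}
for a, b in pairs:
    _, p = ttest_ind(cities[a], cities[b], equal_var=False)
    results[(a, b)] = p
    verdict = 'differ' if p < alpha else 'no clear difference'
    print(f"{a} vs {b}: p={p:.6f} -> {verdict}")
```

Dividing alpha by the number of comparisons keeps the family-wise error rate near 5%; without it, running many pairwise tests inflates the chance of a false positive.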

📊 Data Insight

Bangalore customers spend 28% more than Chennai customers (INR 51,240 vs INR 39,870). This INR 11,370 difference per order justifies different pricing strategies and marketing spend allocation across cities.

Bivariate analysis transforms descriptive statistics into predictive insights. Age predicts revenue. Gender influences product preferences. Cities show distinct spending patterns. These relationships become the foundation for segmentation, targeting, and personalization strategies.

The key is moving beyond simple correlations to understanding mechanisms. Why do older customers spend more? Higher disposable income and a taste for premium products. Why does Bangalore show higher revenues? Tech-hub demographics with greater purchasing power. The statistics point the way; business context provides the roadmap.

Quiz

1. Based on the correlation matrix showing customer_age (0.684), quantity (0.846), and unit_price (0.758) correlations with revenue, what is the primary business insight for increasing revenue?


2. In the chi-square test showing Chi-square statistic: 187.432 and P-value: 0.000000 for gender vs product category, what does this result tell us about the relationship?


3. When analyzing customer age vs revenue, you get Pearson correlation: 0.684 and Spearman correlation: 0.697. What does the small difference (0.013) between these values indicate?


Up Next

Multivariate Analysis

Analyze complex relationships among three or more variables simultaneously using advanced techniques like PCA, clustering, and multiple regression.