Data Science
Hypothesis Testing
Master statistical hypothesis testing to validate business assumptions and make data-driven decisions with confidence.
What Is Hypothesis Testing?
Think of hypothesis testing like a court trial for your data. You start with an assumption (the defendant is innocent), gather evidence, and decide whether that evidence is strong enough to reject your original assumption.
Every hypothesis test has two competing statements. The null hypothesis (H₀) represents the status quo - what you assume is true until proven otherwise. The alternative hypothesis (H₁) is what you're trying to prove.
Statistical Hypotheses
H₀ (Null): No effect, no difference, status quo
H₁ (Alternative): There is an effect, there is a difference
Honestly, this trips up most beginners. They think we're trying to "prove" the alternative hypothesis. Wrong. We're trying to find enough evidence to reject the null hypothesis. Big difference.
One-Tailed Test
Tests direction: "greater than" or "less than"
Two-Tailed Test
Tests difference: "not equal to"
Type I Error (α)
Rejecting true null (false positive)
Type II Error (β)
Failing to reject a false null (false negative)
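To make the one-tailed vs. two-tailed distinction concrete, here's a minimal sketch using scipy's `alternative` parameter on simulated order values (the numbers are illustrative, not the course dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated order values with a true mean of 1600 (hypothetical data)
sample = rng.normal(loc=1600, scale=400, size=200)

# Two-tailed: is the mean different from 1500 in either direction?
t_two, p_two = stats.ttest_1samp(sample, 1500, alternative='two-sided')

# One-tailed: is the mean specifically greater than 1500?
t_one, p_one = stats.ttest_1samp(sample, 1500, alternative='greater')

print(f"two-tailed p: {p_two:.4f}")
print(f"one-tailed p: {p_one:.4f}")
```

When the observed effect lies in the predicted direction, the one-tailed p-value is half the two-tailed one, which is why you must pick the tail before looking at the data.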
Common Statistical Tests
Choosing the right test depends on your data type and what you're comparing. Here's the breakdown that covers 90% of real-world scenarios:
| Test Type | Use Case | Example |
|---|---|---|
| One-Sample t-test | Compare sample mean to known value | Is average order value ₹1000? |
| Two-Sample t-test | Compare means of two groups | Male vs Female spending |
| Chi-Square | Test independence of categories | City vs Product preference |
| ANOVA | Compare means of 3+ groups | Revenue across 5 cities |
Hands-On: One-Sample t-Test
The scenario: Flipkart's analytics team claims their average order value is ₹1,500. A data scientist needs to verify this claim using recent transaction data. The stakes? Marketing budget allocation for the next quarter.
# Import statistical testing library
import pandas as pd
import numpy as np
from scipy import stats
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Display first few rows to understand structure
print(df.head())

   order_id        date  customer_age gender       city product_category  \
0      1001  2023-01-05            34      M     Mumbai      Electronics
1      1002  2023-01-05            28      F      Delhi         Clothing
2      1003  2023-01-06            45      M  Bangalore             Food
3      1004  2023-01-06            31      F    Chennai            Books
4      1005  2023-01-07            29      M       Pune             Home

  product_name  quantity  unit_price  revenue  rating  returned
0   Smartphone         1     15000.0  15000.0     4.2     False
1        Shirt         2       800.0   1600.0     3.8     False
2        Pizza         3       450.0   1350.0     4.5     False
3        Novel         1       350.0    350.0     4.0     False
4      Cushion         4       275.0   1100.0     3.9     False
What just happened?
We loaded our ecommerce dataset with pd.read_csv() and examined the structure. The revenue column contains our order values. Try this: Check data types with df.dtypes
# Calculate sample statistics for revenue
sample_mean = df['revenue'].mean()
sample_std = df['revenue'].std()
sample_size = len(df)
# Display key statistics
print(f"Sample mean: ₹{sample_mean:.2f}")
print(f"Sample std: ₹{sample_std:.2f}")
print(f"Sample size: {sample_size}")

Sample mean: ₹3247.85
Sample std: ₹4156.73
Sample size: 1000
What just happened?
Our sample shows average revenue of ₹3,247.85, much higher than the claimed ₹1,500. The standard deviation ₹4,156.73 shows high variability. Try this: Check the median with df['revenue'].median()
# Set up hypothesis test parameters
claimed_mean = 1500 # Flipkart's claim
alpha = 0.05 # 5% significance level
# State the hypotheses clearly
print("H₀: μ = ₹1,500 (Flipkart's claim is correct)")
print("H₁: μ ≠ ₹1,500 (Flipkart's claim is incorrect)")
print(f"Significance level: {alpha}")
print(f"Test type: Two-tailed")

H₀: μ = ₹1,500 (Flipkart's claim is correct)
H₁: μ ≠ ₹1,500 (Flipkart's claim is incorrect)
Significance level: 0.05
Test type: Two-tailed
What just happened?
We formally stated our hypotheses. Using alpha = 0.05 means we accept 5% risk of falsely rejecting a true claim. Two-tailed test because we want to detect if the mean is significantly different (higher or lower). Try this: Change alpha to 0.01 for stricter testing
# Perform one-sample t-test using scipy
t_statistic, p_value = stats.ttest_1samp(df['revenue'], claimed_mean)
# Calculate degrees of freedom
df_freedom = sample_size - 1
# Display test results
print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"degrees of freedom: {df_freedom}")

t-statistic: 13.3104
p-value: 0.000000
degrees of freedom: 999
What just happened?
The stats.ttest_1samp() calculated our test statistic (13.31) and p-value (essentially 0). The high t-statistic shows our sample mean is many standard errors away from the claimed value. Try this: Run stats.ttest_1samp(df['revenue'][:50], claimed_mean) with smaller sample
# Make statistical decision
if p_value < alpha:
decision = "Reject H₀"
conclusion = "significant"
else:
decision = "Fail to reject H₀"
conclusion = "not significant"
print(f"Decision: {decision}")
print(f"Conclusion: The difference is {conclusion}")
print(f"Business impact: Flipkart's claim appears incorrect")

Decision: Reject H₀
Conclusion: The difference is significant
Business impact: Flipkart's claim appears incorrect
What just happened?
Since our p_value < alpha, we reject the null hypothesis. The evidence strongly suggests Flipkart's claimed average of ₹1,500 is incorrect - the actual average is significantly higher. Try this: Test different significance levels to see how decisions change
Our t-statistic (13.31) falls far outside the critical region, providing strong evidence against the null hypothesis
The scatter plot shows where our test statistic lands relative to the critical values. Our purple dot is way off the chart - that's how extreme our result is. The red dots mark the critical region boundaries at ±1.96 for α=0.05.
This visualization makes it crystal clear: our observed difference isn't just a random fluctuation. The average order value is genuinely different from ₹1,500, with statistical confidence exceeding 99.9%.
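You can reproduce those critical boundaries yourself with `stats.t.ppf`, plugging in the degrees of freedom and alpha from the test above:

```python
from scipy import stats

alpha = 0.05
df_freedom = 999          # n - 1 from the one-sample test above
t_statistic = 13.3104     # observed t-statistic from the same test

# Two-tailed critical values: the boundaries of the rejection region
t_crit = stats.t.ppf(1 - alpha / 2, df_freedom)
print(f"critical values: ±{t_crit:.3f}")

# Reject H0 when the observed statistic falls beyond the boundaries
print("Reject H0:", abs(t_statistic) > t_crit)
```

With df = 999 the critical value is about ±1.962, essentially the familiar ±1.96 from the normal distribution; our t of 13.31 is far beyond it.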
Two-Sample Testing: Comparing Groups
The scenario: Swiggy's marketing team needs to know if male and female customers have different average order values. This determines whether they should create gender-specific promotional campaigns or stick with universal messaging.
# Separate revenue by gender
male_revenue = df[df['gender'] == 'M']['revenue']
female_revenue = df[df['gender'] == 'F']['revenue']
# Calculate group statistics
male_mean = male_revenue.mean()
female_mean = female_revenue.mean()
print(f"Male avg order: ₹{male_mean:.2f}")
print(f"Female avg order: ₹{female_mean:.2f}")
print(f"Difference: ₹{male_mean - female_mean:.2f}")

Male avg order: ₹3285.67
Female avg order: ₹3210.03
Difference: ₹75.64
What just happened?
We filtered the data by gender using df['gender'] == 'M' boolean indexing. Males spend ₹75.64 more on average, but is this difference statistically significant or just random variation? Try this: Check sample sizes with len(male_revenue)
# Perform independent two-sample t-test
t_stat, p_val = stats.ttest_ind(male_revenue, female_revenue)
# Set significance level
alpha = 0.05
print(f"Two-sample t-test results:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_val:.4f}")
print(f"Alpha: {alpha}")

Two-sample t-test results:
t-statistic: 0.3842
p-value: 0.7011
Alpha: 0.05
What just happened?
The stats.ttest_ind() performed an independent two-sample test. Our p-value (0.7011) is much higher than α (0.05), suggesting the gender difference could easily be random. The low t-statistic (0.38) confirms weak evidence. Try this: Test with equal_var=False for unequal variances
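Following up on that "try this," here's a sketch of Welch's test (`equal_var=False`) on simulated groups with deliberately unequal variances (the data here is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical groups with clearly unequal spread
group_a = rng.normal(3200, 500, 400)
group_b = rng.normal(3200, 4000, 600)

# Student's t-test pools variances (scipy's default, equal_var=True)
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"pooled:  t={t_pooled:.3f}, p={p_pooled:.4f}")
print(f"Welch's: t={t_welch:.3f}, p={p_welch:.4f}")
```

When you're unsure whether the groups share a variance, Welch's version is generally the safer default: it costs almost nothing when variances are equal and protects you when they aren't.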
The difference between male and female order values appears minimal and statistically insignificant
The bar chart confirms what our statistical test revealed. While males spend slightly more, the difference is too small to be meaningful. From a business perspective, Swiggy shouldn't waste resources on gender-specific campaigns based on this evidence.
But here's what trips up most analysts: they see the ₹75 difference and think it's actionable. Statistical testing saves you from making expensive decisions based on random noise. The p-value of 0.70 means that if there were truly no gender difference, we'd see a gap of ₹75 or more about 70% of the time purely from random sampling.
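If that sampling interpretation feels abstract, a permutation simulation makes it concrete. Here we build two groups from the same distribution (so the null is true by construction, toy data rather than the Swiggy example) and count how often label shuffling produces a gap as large as the observed one:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two hypothetical groups drawn from the SAME distribution (H0 is true)
male = rng.normal(3200, 4000, 500)
female = rng.normal(3200, 4000, 500)
observed = abs(male.mean() - female.mean())

# Permutation test: shuffle the group labels and measure how often
# a difference this large appears purely by chance
pooled = np.concatenate([male, female])
count = 0
n_perm = 2000
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = abs(pooled[:500].mean() - pooled[500:].mean())
    if diff >= observed:
        count += 1

p_perm = count / n_perm
print(f"permutation p-value: {p_perm:.3f}")
```

The permutation p-value is literally the fraction of shuffles that beat the observed gap, which is exactly what a p-value reports: how surprising your result would be if the null were true.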
📊 Data Insight
Gender shows no significant impact on order values (p=0.70), but product categories vary dramatically from ₹350 (Books) to ₹15,000 (Electronics). Focus segmentation efforts on product preferences, not demographics.
Statistical Power and Effect Size
Here's where many data scientists get tripped up. You can have a statistically significant result that's practically meaningless, or a meaningful difference that doesn't reach statistical significance due to small sample size.
Common Mistake: Confusing Statistical and Practical Significance
A difference of ₹0.50 in order value might be statistically significant with 1 million samples, but completely irrelevant for business decisions. Always consider effect size alongside p-values.
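A quick simulation shows the trap: with a million samples per group, even a ₹0.50 difference clears the significance bar while the effect size stays negligible (all numbers here are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# One million orders per flow; true means differ by only ₹0.50
flow_a = rng.normal(1000.0, 50.0, 1_000_000)
flow_b = rng.normal(1000.5, 50.0, 1_000_000)

t_stat, p_val = stats.ttest_ind(flow_a, flow_b)

# Effect size in SD units, using the known simulation SD of 50
d = (flow_b.mean() - flow_a.mean()) / 50.0

print(f"p-value: {p_val:.2e}")   # tiny: statistically significant
print(f"effect size d: {d:.4f}") # ~0.01 SD: practically irrelevant
```

The p-value screams "significant" while Cohen's d sits around 0.01, far below even the "small" threshold of 0.2. Sample size buys you significance, not importance.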
# Calculate Cohen's d for effect size
def cohens_d(group1, group2):
# Calculate pooled standard deviation
pooled_std = np.sqrt(((len(group1)-1)*group1.std()**2 + (len(group2)-1)*group2.std()**2) / (len(group1)+len(group2)-2))
# Return effect size
return (group1.mean() - group2.mean()) / pooled_std
effect_size = cohens_d(male_revenue, female_revenue)
print(f"Cohen's d (effect size): {effect_size:.4f}")

Cohen's d (effect size): 0.0192
What just happened?
Cohen's d measures effect size standardized by pooled standard deviation. Our value (0.019) is tiny - Cohen's guidelines suggest 0.2 = small, 0.5 = medium, 0.8 = large effect. This confirms the gender difference is not just statistically insignificant but practically meaningless. Try this: Calculate effect size for age groups
Aim for 80% statistical power to reliably detect meaningful differences when they exist
The doughnut chart shows the standard 80% power target. Statistical power is your test's ability to detect a real effect when one exists. Low power means you might miss important business insights due to insufficient sample size.
Power depends on four factors: effect size, sample size, significance level, and population variability. You can't control population variability, but you can calculate required sample sizes before collecting data. This prevents the frustrating situation where you collect data but can't draw confident conclusions.
Pro Tip: Always calculate required sample size before data collection. Use power analysis to determine if your study can detect the minimum effect size that matters to your business.
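One way to run that power analysis with just scipy is to evaluate the noncentral t distribution directly. This sketch (a minimal implementation, assuming equal group sizes and a two-sided test) finds the per-group sample size needed for 80% power at a small effect of d = 0.2:

```python
import numpy as np
from scipy import stats

def power_two_sample(n, d, alpha=0.05):
    # Power of a two-sided, two-sample t-test with n per group and
    # effect size d (Cohen's d), via the noncentral t distribution
    df = 2 * n - 2
    nc = d * np.sqrt(n / 2)                 # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

# Smallest n per group that reaches 80% power for d = 0.2
n = 2
while power_two_sample(n, 0.2) < 0.80:
    n += 1
print(f"required sample size per group: {n}")
```

For d = 0.2 the answer lands around 394 per group, which is why "small but real" effects demand surprisingly large studies. Libraries like statsmodels (`TTestIndPower.solve_power`) wrap this same calculation.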
Quiz
1. You're testing if a new app feature increases average session time. Your t-test gives p-value = 0.023 with α = 0.05. What's the correct conclusion?
2. Your A/B test shows a statistically significant difference (p < 0.001) between two checkout flows. Average order values: Flow A = ₹1,245, Flow B = ₹1,267. What additional metric should you examine?
3. Zomato claims their average delivery fee is ₹500. You collect 200 delivery records and want to test this claim. Which test should you use?
Up Next
CI & p-Values
Dive deeper into confidence intervals and understand exactly what p-values tell you (and what they don't) about your data.