Data Science
Hypothesis Testing
Master statistical hypothesis testing to validate business assumptions and make data-driven decisions with confidence.
What Is Hypothesis Testing?
Think of hypothesis testing like a court trial for your data. You start with an assumption (the defendant is innocent), gather evidence, and decide whether that evidence is strong enough to reject your original assumption.
Every hypothesis test has two competing statements. The null hypothesis (H₀) represents the status quo - what you assume is true until proven otherwise. The alternative hypothesis (H₁) is what you're trying to prove.
Statistical Hypotheses
H₀ (Null): No effect, no difference, status quo
H₁ (Alternative): There is an effect, there is a difference
Honestly, this trips up most beginners. They think we're trying to "prove" the alternative hypothesis. Wrong. We're trying to find enough evidence to reject the null hypothesis. Big difference.
One-Tailed Test
Tests direction: "greater than" or "less than"
Two-Tailed Test
Tests difference: "not equal to"
Type I Error (α)
Rejecting true null (false positive)
Type II Error (β)
Failing to reject a false null (false negative)
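To make the one-tailed vs. two-tailed distinction concrete, here's a minimal sketch using scipy's `alternative` parameter on simulated order values (the numbers are illustrative, not the course dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated order values with a true mean of 1600 (hypothetical data)
sample = rng.normal(loc=1600, scale=400, size=200)

# Two-tailed: is the mean different from 1500 in either direction?
t_two, p_two = stats.ttest_1samp(sample, 1500, alternative='two-sided')

# One-tailed: is the mean specifically greater than 1500?
t_one, p_one = stats.ttest_1samp(sample, 1500, alternative='greater')

print(f"two-tailed p: {p_two:.4f}")
print(f"one-tailed p: {p_one:.4f}")
```

When the observed effect lies in the predicted direction, the one-tailed p-value is half the two-tailed one, which is why you must pick the tail before looking at the data.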
Common Statistical Tests
Choosing the right test depends on your data type and what you're comparing. Here's the breakdown that covers 90% of real-world scenarios:
| Test Type | Use Case | Example |
|---|---|---|
| One-Sample t-test | Compare sample mean to known value | Is average order value ₹1000? |
| Two-Sample t-test | Compare means of two groups | Male vs Female spending |
| Chi-Square | Test independence of categories | City vs Product preference |
| ANOVA | Compare means of 3+ groups | Revenue across 5 cities |
Hands-On: One-Sample t-Test
The scenario: Flipkart's analytics team claims their average order value is ₹1,500. A data scientist needs to verify this claim using recent transaction data. The stakes? Marketing budget allocation for the next quarter.
# Import statistical testing library
import pandas as pd
import numpy as np
from scipy import stats
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Display first few rows to understand structure
print(df.head())

   order_id        date  customer_age gender       city product_category  \
0      1001  2023-01-05            34      M     Mumbai      Electronics
1      1002  2023-01-05            28      F      Delhi         Clothing
2      1003  2023-01-06            45      M  Bangalore             Food
3      1004  2023-01-06            31      F    Chennai            Books
4      1005  2023-01-07            29      M       Pune             Home

  product_name  quantity  unit_price  revenue  rating  returned
0   Smartphone         1     15000.0  15000.0     4.2     False
1        Shirt         2       800.0   1600.0     3.8     False
2        Pizza         3       450.0   1350.0     4.5     False
3        Novel         1       350.0    350.0     4.0     False
4      Cushion         4       275.0   1100.0     3.9     False
What just happened?
We loaded our ecommerce dataset with pd.read_csv() and examined the structure. The revenue column contains our order values. Try this: Check data types with df.dtypes
# Calculate sample statistics for revenue
sample_mean = df['revenue'].mean()
sample_std = df['revenue'].std()
sample_size = len(df)
# Display key statistics
print(f"Sample mean: ₹{sample_mean:.2f}")
print(f"Sample std: ₹{sample_std:.2f}")
print(f"Sample size: {sample_size}")

Sample mean: ₹3247.85
Sample std: ₹4156.73
Sample size: 1000
What just happened?
Our sample shows average revenue of ₹3,247.85, much higher than the claimed ₹1,500. The standard deviation ₹4,156.73 shows high variability. Try this: Check the median with df['revenue'].median()
# Set up hypothesis test parameters
claimed_mean = 1500 # Flipkart's claim
alpha = 0.05 # 5% significance level
# State the hypotheses clearly
print("H₀: μ = ₹1,500 (Flipkart's claim is correct)")
print("H₁: μ ≠ ₹1,500 (Flipkart's claim is incorrect)")
print(f"Significance level: {alpha}")
print(f"Test type: Two-tailed")

H₀: μ = ₹1,500 (Flipkart's claim is correct)
H₁: μ ≠ ₹1,500 (Flipkart's claim is incorrect)
Significance level: 0.05
Test type: Two-tailed
What just happened?
We formally stated our hypotheses. Using alpha = 0.05 means we accept 5% risk of falsely rejecting a true claim. Two-tailed test because we want to detect if the mean is significantly different (higher or lower). Try this: Change alpha to 0.01 for stricter testing
# Perform one-sample t-test using scipy
t_statistic, p_value = stats.ttest_1samp(df['revenue'], claimed_mean)
# Calculate degrees of freedom
df_freedom = sample_size - 1
# Display test results
print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"degrees of freedom: {df_freedom}")

t-statistic: 13.3104
p-value: 0.000000
degrees of freedom: 999
What just happened?
The stats.ttest_1samp() calculated our test statistic (13.31) and p-value (essentially 0). The high t-statistic shows our sample mean is many standard errors away from the claimed value. Try this: Run stats.ttest_1samp(df['revenue'][:50], claimed_mean) with smaller sample
# Make statistical decision
if p_value < alpha:
decision = "Reject H₀"
conclusion = "significant"
else:
decision = "Fail to reject H₀"
conclusion = "not significant"
print(f"Decision: {decision}")
print(f"Conclusion: The difference is {conclusion}")
print(f"Business impact: Flipkart's claim appears incorrect")

Decision: Reject H₀
Conclusion: The difference is significant
Business impact: Flipkart's claim appears incorrect
What just happened?
Since our p_value < alpha, we reject the null hypothesis. The evidence strongly suggests Flipkart's claimed average of ₹1,500 is incorrect - the actual average is significantly higher. Try this: Test different significance levels to see how decisions change
Our t-statistic (13.31) falls far outside the critical region, providing strong evidence against the null hypothesis
The scatter plot shows where our test statistic lands relative to the critical values. Our purple dot is way off the chart - that's how extreme our result is. The red dots mark the critical region boundaries at ±1.96 for α=0.05.
This visualization makes it crystal clear: our observed difference isn't just a random fluctuation. The average order value is genuinely different from ₹1,500, with statistical confidence exceeding 99.9%.
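You can reproduce those critical boundaries yourself with `stats.t.ppf`, plugging in the degrees of freedom and alpha from the test above:

```python
from scipy import stats

alpha = 0.05
df_freedom = 999          # n - 1 from the one-sample test above
t_statistic = 13.3104     # observed t-statistic from the same test

# Two-tailed critical values: the boundaries of the rejection region
t_crit = stats.t.ppf(1 - alpha / 2, df_freedom)
print(f"critical values: ±{t_crit:.3f}")

# Reject H0 when the observed statistic falls beyond the boundaries
print("Reject H0:", abs(t_statistic) > t_crit)
```

With df = 999 the critical value is about ±1.962, essentially the familiar ±1.96 from the normal distribution; our t of 13.31 is far beyond it.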
Two-Sample Testing: Comparing Groups
The scenario: Swiggy's marketing team needs to know if male and female customers have different average order values. This determines whether they should create gender-specific promotional campaigns or stick with universal messaging.
# Separate revenue by gender
male_revenue = df[df['gender'] == 'M']['revenue']
female_revenue = df[df['gender'] == 'F']['revenue']
# Calculate group statistics
male_mean = male_revenue.mean()
female_mean = female_revenue.mean()
print(f"Male avg order: ₹{male_mean:.2f}")
print(f"Female avg order: ₹{female_mean:.2f}")
print(f"Difference: ₹{male_mean - female_mean:.2f}")

Male avg order: ₹3285.67
Female avg order: ₹3210.03
Difference: ₹75.64
What just happened?
We filtered the data by gender using df['gender'] == 'M' boolean indexing. Males spend ₹75.64 more on average, but is this difference statistically significant or just random variation? Try this: Check sample sizes with len(male_revenue)
# Perform independent two-sample t-test
t_stat, p_val = stats.ttest_ind(male_revenue, female_revenue)
# Set significance level
alpha = 0.05
print(f"Two-sample t-test results:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_val:.4f}")
print(f"Alpha: {alpha}")

Two-sample t-test results:
t-statistic: 0.3842
p-value: 0.7011
Alpha: 0.05
What just happened?
The stats.ttest_ind() performed an independent two-sample test. Our p-value (0.7011) is much higher than α (0.05), suggesting the gender difference could easily be random. The low t-statistic (0.38) confirms weak evidence. Try this: Test with equal_var=False for unequal variances
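Following up on that "try this," here's a sketch of Welch's test (`equal_var=False`) on simulated groups with deliberately unequal variances (the data here is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical groups with clearly unequal spread
group_a = rng.normal(3200, 500, 400)
group_b = rng.normal(3200, 4000, 600)

# Student's t-test pools variances (scipy's default, equal_var=True)
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"pooled:  t={t_pooled:.3f}, p={p_pooled:.4f}")
print(f"Welch's: t={t_welch:.3f}, p={p_welch:.4f}")
```

When you're unsure whether the groups share a variance, Welch's version is generally the safer default: it costs almost nothing when variances are equal and protects you when they aren't.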
The difference between male and female order values appears minimal and statistically insignificant
The bar chart confirms what our statistical test revealed. While males spend slightly more, the difference is too small to be meaningful. From a business perspective, Swiggy shouldn't waste resources on gender-specific campaigns based on this evidence.
But here's what trips up most analysts: they see the ₹75 difference and think it's actionable. Statistical testing saves you from making expensive decisions based on random noise. The p-value of 0.70 means that if there were truly no gender difference, we'd see a gap of ₹75 or more about 70% of the time purely from random sampling.
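If that sampling interpretation feels abstract, a permutation simulation makes it concrete. Here we build two groups from the same distribution (so the null is true by construction, toy data rather than the Swiggy example) and count how often label shuffling produces a gap as large as the observed one:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two hypothetical groups drawn from the SAME distribution (H0 is true)
male = rng.normal(3200, 4000, 500)
female = rng.normal(3200, 4000, 500)
observed = abs(male.mean() - female.mean())

# Permutation test: shuffle the group labels and measure how often
# a difference this large appears purely by chance
pooled = np.concatenate([male, female])
count = 0
n_perm = 2000
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = abs(pooled[:500].mean() - pooled[500:].mean())
    if diff >= observed:
        count += 1

p_perm = count / n_perm
print(f"permutation p-value: {p_perm:.3f}")
```

The permutation p-value is literally the fraction of shuffles that beat the observed gap, which is exactly what a p-value reports: how surprising your result would be if the null were true.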
📊 Data Insight
Gender shows no significant impact on order values (p=0.70), but product categories vary dramatically from ₹350 (Books) to ₹15,000 (Electronics). Focus segmentation efforts on product preferences, not demographics.
Statistical Power and Effect Size
Here's where many data scientists get tripped up. You can have a statistically significant result that's practically meaningless, or a meaningful difference that doesn't reach statistical significance due to small sample size.
Common Mistake: Confusing Statistical and Practical Significance
A difference of ₹0.50 in order value might be statistically significant with 1 million samples, but completely irrelevant for business decisions. Always consider effect size alongside p-values.
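A quick simulation shows the trap: with a million samples per group, even a ₹0.50 difference clears the significance bar while the effect size stays negligible (all numbers here are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# One million orders per flow; true means differ by only ₹0.50
flow_a = rng.normal(1000.0, 50.0, 1_000_000)
flow_b = rng.normal(1000.5, 50.0, 1_000_000)

t_stat, p_val = stats.ttest_ind(flow_a, flow_b)

# Effect size in SD units, using the known simulation SD of 50
d = (flow_b.mean() - flow_a.mean()) / 50.0

print(f"p-value: {p_val:.2e}")   # tiny: statistically significant
print(f"effect size d: {d:.4f}") # ~0.01 SD: practically irrelevant
```

The p-value screams "significant" while Cohen's d sits around 0.01, far below even the "small" threshold of 0.2. Sample size buys you significance, not importance.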
# Calculate Cohen's d for effect size
def cohens_d(group1, group2):
# Calculate pooled standard deviation
pooled_std = np.sqrt(((len(group1)-1)*group1.std()**2 + (len(group2)-1)*group2.std()**2) / (len(group1)+len(group2)-2))
# Return effect size
return (group1.mean() - group2.mean()) / pooled_std
effect_size = cohens_d(male_revenue, female_revenue)
print(f"Cohen's d (effect size): {effect_size:.4f}")

Cohen's d (effect size): 0.0192
What just happened?
Cohen's d measures effect size standardized by pooled standard deviation. Our value (0.019) is tiny - Cohen's guidelines suggest 0.2 = small, 0.5 = medium, 0.8 = large effect. This confirms the gender difference is not just statistically insignificant but practically meaningless. Try this: Calculate effect size for age groups
Aim for 80% statistical power to reliably detect meaningful differences when they exist
The doughnut chart shows the standard 80% power target. Statistical power is your test's ability to detect a real effect when one exists. Low power means you might miss important business insights due to insufficient sample size.
Power depends on four factors: effect size, sample size, significance level, and population variability. You can't control population variability, but you can calculate required sample sizes before collecting data. This prevents the frustrating situation where you collect data but can't draw confident conclusions.
Pro Tip: Always calculate required sample size before data collection. Use power analysis to determine if your study can detect the minimum effect size that matters to your business.
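One way to run that power analysis with just scipy is to evaluate the noncentral t distribution directly. This sketch (a minimal implementation, assuming equal group sizes and a two-sided test) finds the per-group sample size needed for 80% power at a small effect of d = 0.2:

```python
import numpy as np
from scipy import stats

def power_two_sample(n, d, alpha=0.05):
    # Power of a two-sided, two-sample t-test with n per group and
    # effect size d (Cohen's d), via the noncentral t distribution
    df = 2 * n - 2
    nc = d * np.sqrt(n / 2)                 # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

# Smallest n per group that reaches 80% power for d = 0.2
n = 2
while power_two_sample(n, 0.2) < 0.80:
    n += 1
print(f"required sample size per group: {n}")
```

For d = 0.2 the answer lands around 394 per group, which is why "small but real" effects demand surprisingly large studies. Libraries like statsmodels (`TTestIndPower.solve_power`) wrap this same calculation.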
Quiz
1. You're testing if a new app feature increases average session time. Your t-test gives p-value = 0.023 with α = 0.05. What's the correct conclusion?
2. Your A/B test shows a statistically significant difference (p < 0.001) between two checkout flows. Average order values: Flow A = ₹1,245, Flow B = ₹1,267. What additional metric should you examine?
3. Zomato claims their average delivery fee is ₹500. You collect 200 delivery records and want to test this claim. Which test should you use?
Up Next
CI & p-Values
Dive deeper into confidence intervals and understand exactly what p-values tell you (and what they don't) about your data.