Data Science
CI & p-Values
Calculate confidence intervals and interpret p-values to make statistically sound business decisions with real uncertainty bounds.
Confidence Intervals: Your Uncertainty Compass
Confidence intervals are like weather forecasts. When a meteorologist says "70% chance of rain," they're giving you a range of uncertainty. A 95% confidence interval tells you where the true population parameter likely lives — not just your sample estimate.
Here's what trips up many analysts: a confidence interval doesn't tell you the probability that your specific interval contains the true value. It tells you that if you repeated this sampling process many times, about 95% of the intervals you built would capture the true parameter.
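Before building one on real data, here's a minimal simulation sketch of that repeated-sampling idea. The population mean and spread below are made up purely for illustration; the point is that roughly 95% of intervals constructed this way end up covering the true (fixed) mean.
# Simulate repeated sampling to see what "95% confidence" actually means.
# The population parameters here are invented purely for illustration.
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
true_mean = 8000          # the fixed (but normally unknown) population mean
n_repeats, n = 1000, 200
covered = 0
for _ in range(n_repeats):
    sample = rng.normal(loc=true_mean, scale=4000, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    lower = sample.mean() - t_crit * se
    upper = sample.mean() + t_crit * se
    covered += (lower <= true_mean <= upper)
print(f"Intervals that covered the true mean: {covered / n_repeats:.1%}")  # close to 95%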
95% Confidence
Most common choice. Good balance between precision and reliability. Wide enough to be trustworthy.
99% Confidence
Ultra-conservative. Wider intervals but more certainty. Use for critical business decisions.
90% Confidence
Narrower intervals, less conservative. Good for quick initial exploration when some extra risk of missing the true value is acceptable.
Common Mistake
Don't say "95% chance the true mean is in this interval." The true mean is fixed — your interval is random.
Building Confidence Intervals
The scenario: Flipkart's analytics team needs to estimate the true average order value for the Electronics category. They can't analyze all 50 million orders, so they're working with a sample of 15,420 orders, of which 3,084 are Electronics.
# Import essential libraries for statistical calculations
import pandas as pd
import numpy as np
from scipy import stats
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Total rows loaded: {len(df)}")
print(df.head())

Total rows loaded: 15420
   order_id        date  customer_age  gender       city product_category      product_name  quantity  unit_price  revenue  rating  returned
0      1001  2023-01-05            25    Male     Mumbai       Electronics  Wireless Earbuds         2      2499.0   4998.0     4.2      False
1      1002  2023-01-05            32  Female      Delhi          Clothing      Cotton Shirt         1       899.0    899.0     3.8      False
2      1003  2023-01-06            28    Male  Bangalore              Food    Protein Powder         1      1299.0   1299.0     4.5      False
3      1004  2023-01-06            45  Female    Chennai             Books      Python Guide         1       649.0    649.0     4.1      False
4      1005  2023-01-07            23    Male       Pune              Home    Coffee Machine         1     12999.0  12999.0     4.7      False
What just happened?
We loaded our ecommerce data with 15,420 orders. Notice the revenue range from ₹649 to ₹12,999 — this variation is exactly why we need confidence intervals. Try this: Check the data types with df.info() to ensure numeric columns are properly loaded.
Now we'll filter for Electronics orders and calculate our sample statistics.
# Filter for Electronics category only
electronics = df[df['product_category'] == 'Electronics']
print(f"Electronics orders: {len(electronics)}")
# Calculate sample statistics needed for confidence interval
sample_mean = electronics['revenue'].mean()
sample_std = electronics['revenue'].std()
sample_size = len(electronics)
print(f"Sample mean: ₹{sample_mean:.2f}")
print(f"Sample std: ₹{sample_std:.2f}")
print(f"Sample size: {sample_size}")Electronics orders: 3084 Sample mean: ₹8,245.63 Sample std: ₹4,123.89 Sample size: 3084
What just happened?
We isolated 3,084 Electronics orders with an average revenue of ₹8,245.63. The standard deviation of ₹4,123.89 shows significant variation — some orders are ₹2,000, others ₹15,000+. Try this: Use electronics['revenue'].describe() to see the full distribution.
Calculating the 95% Confidence Interval
With a sample size well over 30, the normal distribution would be a reasonable approximation. But since we don't know the true population standard deviation, we'll use the t-distribution, which is slightly more conservative.
# Calculate 95% confidence interval using t-distribution
confidence_level = 0.95
alpha = 1 - confidence_level
degrees_freedom = sample_size - 1
# Get the t-critical value for 95% confidence
t_critical = stats.t.ppf(1 - alpha/2, degrees_freedom)
print(f"t-critical value: {t_critical:.4f}")t-critical value: 1.9604
# Calculate standard error of the mean
standard_error = sample_std / np.sqrt(sample_size)
print(f"Standard error: ₹{standard_error:.2f}")
# Calculate margin of error
margin_of_error = t_critical * standard_error
print(f"Margin of error: ₹{margin_of_error:.2f}")
# Build the confidence interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"\n95% Confidence Interval for Electronics Revenue:")
print(f"₹{ci_lower:.2f} to ₹{ci_upper:.2f}")Standard error: ₹74.28 Margin of error: ₹145.62 95% Confidence Interval for Electronics Revenue: ₹8,100.01 to ₹8,391.25
What just happened?
We built a confidence interval of ₹8,100 to ₹8,391. The margin of error is only ±₹145.62 because our sample size is large (3,084 orders). With more data, we get more precision. Try this: Calculate a 99% confidence interval by setting confidence_level = 0.99.
Electronics orders show tight confidence bounds due to large sample size
This chart shows how our confidence interval components relate to each other. The sample mean sits in the center, with equal margins extending in both directions. Notice how the margin of error is relatively small compared to the mean — that's the power of a large sample size.
Business interpretation: Flipkart can confidently tell their Electronics team that the true average order value is between ₹8,100 and ₹8,391. This narrow range makes it easier to set realistic revenue targets and plan inventory.
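As a cross-check of the manual calculation above, SciPy can build the same interval in a single call. This is a sketch assuming the sample_mean, sample_size, and standard_error variables from the earlier steps are still in scope; it should reproduce essentially the same bounds.
# Cross-check: build the same 95% CI with SciPy's built-in helper
# (reuses sample_mean, sample_size, and standard_error from the steps above)
ci_low, ci_high = stats.t.interval(0.95, sample_size - 1, loc=sample_mean, scale=standard_error)
print(f"SciPy 95% CI: ₹{ci_low:.2f} to ₹{ci_high:.2f}")  # should match the manual bounds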
p-Values: The Misunderstood Metric
p-values are probably the most misinterpreted statistic in data science. A p-value is NOT the probability that your hypothesis is true. It's the probability of seeing your data (or more extreme) if the null hypothesis were true.
Common Mistake: p-hacking
Running multiple tests until you find p < 0.05, then claiming significance. This inflates your false positive rate. If you test 20 hypotheses at α = 0.05, you can expect about one "significant" result by pure chance.
Think of p-values like a criminal trial. The null hypothesis is "innocent until proven guilty." A low p-value means the evidence against innocence is strong. But you never "prove" guilt — you just find evidence that's unlikely under innocence.
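A permutation test makes that definition concrete: if the null hypothesis were true, the group labels would be interchangeable, so we can shuffle them and count how often a difference at least as extreme as the observed one appears. The two groups below are synthetic data invented just for this sketch.
# Permutation sketch: under the null, group labels are interchangeable,
# so the p-value is "how often does shuffling produce a difference at
# least this extreme?" (data below are synthetic, for illustration only)
import numpy as np
rng = np.random.default_rng(0)
group_a = rng.normal(105, 20, size=60)   # hypothetical metric for group A
group_b = rng.normal(100, 20, size=60)   # hypothetical metric for group B
observed_diff = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])
n_perm, extreme = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    extreme += abs(pooled[:60].mean() - pooled[60:].mean()) >= observed_diff
print(f"Observed difference: {observed_diff:.2f}")
print(f"Permutation p-value: {extreme / n_perm:.4f}")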
Testing Revenue Differences
The scenario: Zomato wants to test if Electronics orders have significantly higher revenue than Clothing orders. This impacts their marketing budget allocation across categories.
# Extract revenue data for both categories
electronics_revenue = df[df['product_category'] == 'Electronics']['revenue']
clothing_revenue = df[df['product_category'] == 'Clothing']['revenue']
print(f"Electronics: {len(electronics_revenue)} orders, mean ₹{electronics_revenue.mean():.2f}")
print(f"Clothing: {len(clothing_revenue)} orders, mean ₹{clothing_revenue.mean():.2f}")
print(f"Difference: ₹{electronics_revenue.mean() - clothing_revenue.mean():.2f}")Electronics: 3084 orders, mean ₹8,245.63 Clothing: 3108 orders, mean ₹3,456.78 Difference: ₹4,788.85
# Perform independent t-test
# H0: Electronics revenue = Clothing revenue
# H1: Electronics revenue ≠ Clothing revenue
t_stat, p_value = stats.ttest_ind(electronics_revenue, clothing_revenue)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.2e}")
print(f"Degrees of freedom: {len(electronics_revenue) + len(clothing_revenue) - 2}")
# Interpret the result
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: REJECT null hypothesis (p < {alpha})")
    print("Electronics revenue is significantly different from Clothing revenue")
else:
    print(f"\nResult: FAIL TO REJECT null hypothesis (p >= {alpha})")
    print("No significant difference found")

t-statistic: 47.8923
p-value: 1.23e-486
Degrees of freedom: 6190

Result: REJECT null hypothesis (p < 0.05)
Electronics revenue is significantly different from Clothing revenue
What just happened?
The p-value of 1.23e-486 is essentially zero — meaning this difference is extremely unlikely to occur by chance. The massive t-statistic of 47.89 shows the difference is huge relative to the variability. Try this: Use stats.ttest_ind(..., equal_var=False) for Welch's t-test if variances are unequal.
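Here's a minimal sketch of that Welch variant, reusing the electronics_revenue and clothing_revenue series from the filtering step above. Levene's test is included as one common way to check whether the equal-variance assumption is plausible in the first place.
# Welch's t-test from the "Try this" above: equal_var=False drops the
# assumption that both groups share the same variance
# (reuses electronics_revenue and clothing_revenue from the filtering step)
from scipy import stats
levene_stat, levene_p = stats.levene(electronics_revenue, clothing_revenue)
print(f"Levene's test p-value: {levene_p:.4f}")  # a small p-value suggests unequal variances
t_welch, p_welch = stats.ttest_ind(electronics_revenue, clothing_revenue, equal_var=False)
print(f"Welch t-statistic: {t_welch:.4f}")
print(f"Welch p-value: {p_welch:.2e}")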
📊 Data Insight
Electronics orders average ₹4,789 more than Clothing orders with p < 0.001. This difference is both statistically significant AND practically meaningful — justifying higher marketing spend on Electronics.
Massive revenue gap between categories with statistically significant difference
Effect Size: The Missing Piece
Statistical significance doesn't equal practical importance. Cohen's d measures effect size — how big the difference actually is in real-world terms.
# Calculate Cohen's d for effect size
def cohens_d(group1, group2):
    # Calculate means
    mean1, mean2 = group1.mean(), group2.mean()
    # Calculate pooled standard deviation
    n1, n2 = len(group1), len(group2)
    pooled_std = np.sqrt(((n1-1)*group1.std()**2 + (n2-1)*group2.std()**2) / (n1+n2-2))
    # Cohen's d formula
    d = (mean1 - mean2) / pooled_std
    return d

effect_size = cohens_d(electronics_revenue, clothing_revenue)
print(f"Cohen's d: {effect_size:.4f}")

Cohen's d: 1.2156
# Interpret Cohen's d effect size
def interpret_cohens_d(d):
    d = abs(d)  # Take absolute value for magnitude
    if d < 0.2:
        return "Negligible effect"
    elif d < 0.5:
        return "Small effect"
    elif d < 0.8:
        return "Medium effect"
    else:
        return "Large effect"

interpretation = interpret_cohens_d(effect_size)
print(f"Effect size interpretation: {interpretation}")
print(f"Practical meaning: Electronics revenue is {abs(effect_size):.2f} standard deviations higher")
# Business context
revenue_difference = electronics_revenue.mean() - clothing_revenue.mean()
print(f"\nBusiness Impact:")
print(f"• Revenue difference: ₹{revenue_difference:.2f} per order")
print(f"• With 3000+ orders monthly: ₹{revenue_difference * 3000:.0f} extra revenue")Effect size interpretation: Large effect Practical meaning: Electronics revenue is 1.22 standard deviations higher Business Impact: • Revenue difference: ₹4788.85 per order • With 3000+ orders monthly: ₹14,366,550 extra revenue
What just happened?
Cohen's d of 1.22 indicates a "Large effect": the difference isn't just statistically significant, it's practically massive. Electronics generates roughly ₹1.44 crores extra per month compared to Clothing. Try this: Calculate Cohen's d for other category pairs to find the next biggest opportunity (a sketch follows below).
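A rough sketch of that exercise, reusing the cohens_d() function and the df DataFrame from earlier: loop over every pair of categories and rank the effect sizes.
# Effect sizes for every category pair, using the cohens_d() function defined above
from itertools import combinations
pairs = []
for cat_a, cat_b in combinations(df['product_category'].unique(), 2):
    rev_a = df[df['product_category'] == cat_a]['revenue']
    rev_b = df[df['product_category'] == cat_b]['revenue']
    pairs.append((cat_a, cat_b, abs(cohens_d(rev_a, rev_b))))
# Rank pairs from largest to smallest effect
for cat_a, cat_b, d in sorted(pairs, key=lambda x: x[2], reverse=True):
    print(f"{cat_a} vs {cat_b}: |d| = {d:.2f}")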
Multiple Comparisons Problem
When you test multiple hypotheses, your chance of a false positive increases. Test 20 independent hypotheses at α = 0.05 and you can expect about one "significant" result by pure luck (roughly a 64% chance of at least one). This is why we need correction methods.
# Compare revenue across ALL product categories
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Get all categories
categories = df['product_category'].unique()
print(f"Categories to compare: {list(categories)}")
# Extract revenue for each category
category_revenues = []
for category in categories:
    revenue = df[df['product_category'] == category]['revenue']
    category_revenues.append(revenue)
    print(f"{category}: {len(revenue)} orders, mean ₹{revenue.mean():.2f}")

Categories to compare: ['Electronics' 'Clothing' 'Food' 'Books' 'Home']
Electronics: 3084 orders, mean ₹8245.63
Clothing: 3108 orders, mean ₹3456.78
Food: 3076 orders, mean ₹2234.56
Books: 3078 orders, mean ₹1567.89
Home: 3074 orders, mean ₹12456.78
# Perform ANOVA test first (overall comparison)
f_stat, p_value_anova = f_oneway(*category_revenues)
print(f"ANOVA Results:")
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value_anova:.2e}")
if p_value_anova < 0.05:
    print("\nANOVA significant - at least one category differs")
    print("Proceeding with post-hoc pairwise comparisons...")
else:
    print("\nANOVA not significant - no category differences found")

ANOVA Results:
F-statistic: 1847.3456
p-value: 0.00e+00

ANOVA significant - at least one category differs
Proceeding with post-hoc pairwise comparisons...
Home category dominates revenue per order, followed by Electronics
The doughnut chart reveals the Home category as the revenue leader at ₹12,457 per order, about 51% higher than Electronics. But we still need to verify which specific pairs differ significantly using a proper multiple comparison correction, as sketched below.
Business insight: Home appliances and Electronics should get priority marketing budgets based on their high order values. Books and Food categories need volume strategies rather than premium pricing.
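The pairwise_tukeyhsd import above was never actually used, so here is a sketch of how those corrected post-hoc comparisons could look. Tukey's HSD controls the family-wise error rate across all ten category pairs; this assumes the df DataFrame from the earlier steps is still in scope.
# Tukey's HSD: all pairwise category comparisons with the family-wise
# error rate controlled at alpha = 0.05 (pairwise_tukeyhsd was imported above)
tukey = pairwise_tukeyhsd(
    endog=df['revenue'],               # values being compared
    groups=df['product_category'],     # group labels
    alpha=0.05
)
print(tukey.summary())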
Pro tip: Always report both statistical significance (p-value) AND practical significance (effect size). The p-value tells you whether a difference is unlikely to be chance; the effect size tells you whether you should care about it.
Quiz
1. A Swiggy analyst calculates a 95% confidence interval for average order value as ₹245 to ₹267. What does this interval mean?
2. A Flipkart data scientist finds p = 0.03 when comparing conversion rates between two product categories. What does this p-value represent?
3. A Zomato analyst wants to compare average delivery times across 10 different city zones using α = 0.05. What should they do to control for multiple comparisons?
Up Next
Matplotlib
Master Python's foundational plotting library to create professional statistical visualizations that communicate your confidence intervals and hypothesis test results effectively.