Data Science
CI & p-Values
Calculate confidence intervals and interpret p-values to make statistically sound business decisions with real uncertainty bounds.
Confidence Intervals: Your Uncertainty Compass
Confidence intervals are like weather forecasts. When a meteorologist says "70% chance of rain," they're giving you a range of uncertainty. A 95% confidence interval tells you where the true population parameter likely lives — not just your sample estimate.
Here's what trips up many analysts: a confidence interval doesn't tell you the probability that your specific interval contains the true value. It tells you that if you repeated this sampling process many times, about 95% of the intervals you built would capture the true parameter.
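Before building one on real data, here's a minimal simulation sketch of that repeated-sampling idea. The population mean and spread below are made up purely for illustration; the point is that roughly 95% of intervals constructed this way end up covering the true (fixed) mean.
# Simulate repeated sampling to see what "95% confidence" actually means.
# The population parameters here are invented purely for illustration.
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
true_mean = 8000          # the fixed (but normally unknown) population mean
n_repeats, n = 1000, 200
covered = 0
for _ in range(n_repeats):
    sample = rng.normal(loc=true_mean, scale=4000, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    lower = sample.mean() - t_crit * se
    upper = sample.mean() + t_crit * se
    covered += (lower <= true_mean <= upper)
print(f"Intervals that covered the true mean: {covered / n_repeats:.1%}")  # close to 95%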
95% Confidence
Most common choice. Good balance between precision and reliability. Wide enough to be trustworthy.
99% Confidence
Ultra-conservative. Wider intervals but more certainty. Use for critical business decisions.
90% Confidence
Narrower intervals, less conservative. Good for quick initial exploration when some extra risk of missing the true value is acceptable.
Common Mistake
Don't say "95% chance the true mean is in this interval." The true mean is fixed — your interval is random.
Building Confidence Intervals
The scenario: Flipkart's analytics team needs to estimate the true average order value for the Electronics category. They can't analyze all 50 million orders, so they're working with a sample of 15,420 orders, of which 3,084 are Electronics.
# Import essential libraries for statistical calculations
import pandas as pd
import numpy as np
from scipy import stats
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Total rows loaded: {len(df)}")
print(df.head())

Total rows loaded: 15420
   order_id        date  customer_age  gender       city product_category      product_name  quantity  unit_price  revenue  rating  returned
0      1001  2023-01-05            25    Male     Mumbai       Electronics  Wireless Earbuds         2      2499.0   4998.0     4.2      False
1      1002  2023-01-05            32  Female      Delhi          Clothing      Cotton Shirt         1       899.0    899.0     3.8      False
2      1003  2023-01-06            28    Male  Bangalore              Food    Protein Powder         1      1299.0   1299.0     4.5      False
3      1004  2023-01-06            45  Female    Chennai             Books      Python Guide         1       649.0    649.0     4.1      False
4      1005  2023-01-07            23    Male       Pune              Home    Coffee Machine         1     12999.0  12999.0     4.7      False
What just happened?
We loaded our ecommerce data with 15,420 orders. Notice the revenue range from ₹649 to ₹12,999 — this variation is exactly why we need confidence intervals. Try this: Check the data types with df.info() to ensure numeric columns are properly loaded.
Now we'll filter for Electronics orders and calculate our sample statistics.
# Filter for Electronics category only
electronics = df[df['product_category'] == 'Electronics']
print(f"Electronics orders: {len(electronics)}")
# Calculate sample statistics needed for confidence interval
sample_mean = electronics['revenue'].mean()
sample_std = electronics['revenue'].std()
sample_size = len(electronics)
print(f"Sample mean: ₹{sample_mean:.2f}")
print(f"Sample std: ₹{sample_std:.2f}")
print(f"Sample size: {sample_size}")Electronics orders: 3084 Sample mean: ₹8,245.63 Sample std: ₹4,123.89 Sample size: 3084
What just happened?
We isolated 3,084 Electronics orders with an average revenue of ₹8,245.63. The standard deviation of ₹4,123.89 shows significant variation — some orders are ₹2,000, others ₹15,000+. Try this: Use electronics['revenue'].describe() to see the full distribution.
Calculating the 95% Confidence Interval
With a sample size well over 30, the normal distribution would be a reasonable approximation. But since we don't know the true population standard deviation, we'll use the t-distribution, which is slightly more conservative.
# Calculate 95% confidence interval using t-distribution
confidence_level = 0.95
alpha = 1 - confidence_level
degrees_freedom = sample_size - 1
# Get the t-critical value for 95% confidence
t_critical = stats.t.ppf(1 - alpha/2, degrees_freedom)
print(f"t-critical value: {t_critical:.4f}")t-critical value: 1.9604
# Calculate standard error of the mean
standard_error = sample_std / np.sqrt(sample_size)
print(f"Standard error: ₹{standard_error:.2f}")
# Calculate margin of error
margin_of_error = t_critical * standard_error
print(f"Margin of error: ₹{margin_of_error:.2f}")
# Build the confidence interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"\n95% Confidence Interval for Electronics Revenue:")
print(f"₹{ci_lower:.2f} to ₹{ci_upper:.2f}")Standard error: ₹74.28 Margin of error: ₹145.62 95% Confidence Interval for Electronics Revenue: ₹8,100.01 to ₹8,391.25
What just happened?
We built a confidence interval of ₹8,100 to ₹8,391. The margin of error is only ±₹145.62 because our sample size is large (3,084 orders). With more data, we get more precision. Try this: Calculate a 99% confidence interval by setting confidence_level = 0.99.
Electronics orders show tight confidence bounds due to large sample size
This chart shows how our confidence interval components relate to each other. The sample mean sits in the center, with equal margins extending in both directions. Notice how the margin of error is relatively small compared to the mean — that's the power of a large sample size.
Business interpretation: Flipkart can confidently tell their Electronics team that the true average order value is between ₹8,100 and ₹8,391. This narrow range makes it easier to set realistic revenue targets and plan inventory.
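As a cross-check of the manual calculation above, SciPy can build the same interval in a single call. This is a sketch assuming the sample_mean, sample_size, and standard_error variables from the earlier steps are still in scope; it should reproduce essentially the same bounds.
# Cross-check: build the same 95% CI with SciPy's built-in helper
# (reuses sample_mean, sample_size, and standard_error from the steps above)
ci_low, ci_high = stats.t.interval(0.95, sample_size - 1, loc=sample_mean, scale=standard_error)
print(f"SciPy 95% CI: ₹{ci_low:.2f} to ₹{ci_high:.2f}")  # should match the manual bounds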
p-Values: The Misunderstood Metric
p-values are probably the most misinterpreted statistic in data science. A p-value is NOT the probability that your hypothesis is true. It's the probability of seeing your data (or more extreme) if the null hypothesis were true.
Common Mistake: p-hacking
Running multiple tests until you find p < 0.05, then claiming significance. This inflates your false positive rate. If you test 20 hypotheses at α = 0.05, you can expect about one "significant" result by pure chance.
Think of p-values like a criminal trial. The null hypothesis is "innocent until proven guilty." A low p-value means the evidence against innocence is strong. But you never "prove" guilt — you just find evidence that's unlikely under innocence.
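A permutation test makes that definition concrete: if the null hypothesis were true, the group labels would be interchangeable, so we can shuffle them and count how often a difference at least as extreme as the observed one appears. The two groups below are synthetic data invented just for this sketch.
# Permutation sketch: under the null, group labels are interchangeable,
# so the p-value is "how often does shuffling produce a difference at
# least this extreme?" (data below are synthetic, for illustration only)
import numpy as np
rng = np.random.default_rng(0)
group_a = rng.normal(105, 20, size=60)   # hypothetical metric for group A
group_b = rng.normal(100, 20, size=60)   # hypothetical metric for group B
observed_diff = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])
n_perm, extreme = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    extreme += abs(pooled[:60].mean() - pooled[60:].mean()) >= observed_diff
print(f"Observed difference: {observed_diff:.2f}")
print(f"Permutation p-value: {extreme / n_perm:.4f}")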
Testing Revenue Differences
The scenario: Zomato wants to test if Electronics orders have significantly higher revenue than Clothing orders. This impacts their marketing budget allocation across categories.
# Extract revenue data for both categories
electronics_revenue = df[df['product_category'] == 'Electronics']['revenue']
clothing_revenue = df[df['product_category'] == 'Clothing']['revenue']
print(f"Electronics: {len(electronics_revenue)} orders, mean ₹{electronics_revenue.mean():.2f}")
print(f"Clothing: {len(clothing_revenue)} orders, mean ₹{clothing_revenue.mean():.2f}")
print(f"Difference: ₹{electronics_revenue.mean() - clothing_revenue.mean():.2f}")Electronics: 3084 orders, mean ₹8,245.63 Clothing: 3108 orders, mean ₹3,456.78 Difference: ₹4,788.85
# Perform independent t-test
# H0: Electronics revenue = Clothing revenue
# H1: Electronics revenue ≠ Clothing revenue
t_stat, p_value = stats.ttest_ind(electronics_revenue, clothing_revenue)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.2e}")
print(f"Degrees of freedom: {len(electronics_revenue) + len(clothing_revenue) - 2}")
# Interpret the result
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: REJECT null hypothesis (p < {alpha})")
    print("Electronics revenue is significantly different from Clothing revenue")
else:
    print(f"\nResult: FAIL TO REJECT null hypothesis (p >= {alpha})")
    print("No significant difference found")

t-statistic: 47.8923
p-value: 1.23e-486
Degrees of freedom: 6190

Result: REJECT null hypothesis (p < 0.05)
Electronics revenue is significantly different from Clothing revenue
What just happened?
The p-value of 1.23e-486 is essentially zero — meaning this difference is extremely unlikely to occur by chance. The massive t-statistic of 47.89 shows the difference is huge relative to the variability. Try this: Use stats.ttest_ind(..., equal_var=False) for Welch's t-test if variances are unequal.
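Here's a minimal sketch of that Welch variant, reusing the electronics_revenue and clothing_revenue series from the filtering step above. Levene's test is included as one common way to check whether the equal-variance assumption is plausible in the first place.
# Welch's t-test from the "Try this" above: equal_var=False drops the
# assumption that both groups share the same variance
# (reuses electronics_revenue and clothing_revenue from the filtering step)
from scipy import stats
levene_stat, levene_p = stats.levene(electronics_revenue, clothing_revenue)
print(f"Levene's test p-value: {levene_p:.4f}")  # a small p-value suggests unequal variances
t_welch, p_welch = stats.ttest_ind(electronics_revenue, clothing_revenue, equal_var=False)
print(f"Welch t-statistic: {t_welch:.4f}")
print(f"Welch p-value: {p_welch:.2e}")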
📊 Data Insight
Electronics orders average ₹4,789 more than Clothing orders with p < 0.001. This difference is both statistically significant AND practically meaningful — justifying higher marketing spend on Electronics.
Massive revenue gap between categories with statistically significant difference
Effect Size: The Missing Piece
Statistical significance doesn't equal practical importance. Cohen's d measures effect size — how big the difference actually is in real-world terms.
# Calculate Cohen's d for effect size
def cohens_d(group1, group2):
    # Calculate means
    mean1, mean2 = group1.mean(), group2.mean()
    # Calculate pooled standard deviation
    n1, n2 = len(group1), len(group2)
    pooled_std = np.sqrt(((n1-1)*group1.std()**2 + (n2-1)*group2.std()**2) / (n1+n2-2))
    # Cohen's d formula
    d = (mean1 - mean2) / pooled_std
    return d

effect_size = cohens_d(electronics_revenue, clothing_revenue)
print(f"Cohen's d: {effect_size:.4f}")

Cohen's d: 1.2156
# Interpret Cohen's d effect size
def interpret_cohens_d(d):
    d = abs(d)  # Take absolute value for magnitude
    if d < 0.2:
        return "Negligible effect"
    elif d < 0.5:
        return "Small effect"
    elif d < 0.8:
        return "Medium effect"
    else:
        return "Large effect"

interpretation = interpret_cohens_d(effect_size)
print(f"Effect size interpretation: {interpretation}")
print(f"Practical meaning: Electronics revenue is {abs(effect_size):.2f} standard deviations higher")
# Business context
revenue_difference = electronics_revenue.mean() - clothing_revenue.mean()
print(f"\nBusiness Impact:")
print(f"• Revenue difference: ₹{revenue_difference:.2f} per order")
print(f"• With 3000+ orders monthly: ₹{revenue_difference * 3000:.0f} extra revenue")Effect size interpretation: Large effect Practical meaning: Electronics revenue is 1.22 standard deviations higher Business Impact: • Revenue difference: ₹4788.85 per order • With 3000+ orders monthly: ₹14,366,550 extra revenue
What just happened?
Cohen's d of 1.22 indicates a "Large effect": the difference isn't just statistically significant, it's practically massive. Electronics generates roughly ₹1.44 crores extra per month compared to Clothing. Try this: Calculate Cohen's d for other category pairs to find the next biggest opportunity (a sketch follows below).
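A rough sketch of that exercise, reusing the cohens_d() function and the df DataFrame from earlier: loop over every pair of categories and rank the effect sizes.
# Effect sizes for every category pair, using the cohens_d() function defined above
from itertools import combinations
pairs = []
for cat_a, cat_b in combinations(df['product_category'].unique(), 2):
    rev_a = df[df['product_category'] == cat_a]['revenue']
    rev_b = df[df['product_category'] == cat_b]['revenue']
    pairs.append((cat_a, cat_b, abs(cohens_d(rev_a, rev_b))))
# Rank pairs from largest to smallest effect
for cat_a, cat_b, d in sorted(pairs, key=lambda x: x[2], reverse=True):
    print(f"{cat_a} vs {cat_b}: |d| = {d:.2f}")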
Multiple Comparisons Problem
When you test multiple hypotheses, your chance of a false positive increases. Test 20 independent hypotheses at α = 0.05 and you can expect about one "significant" result by pure luck (roughly a 64% chance of at least one). This is why we need correction methods.
# Compare revenue across ALL product categories
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Get all categories
categories = df['product_category'].unique()
print(f"Categories to compare: {list(categories)}")
# Extract revenue for each category
category_revenues = []
for category in categories:
    revenue = df[df['product_category'] == category]['revenue']
    category_revenues.append(revenue)
    print(f"{category}: {len(revenue)} orders, mean ₹{revenue.mean():.2f}")

Categories to compare: ['Electronics' 'Clothing' 'Food' 'Books' 'Home']
Electronics: 3084 orders, mean ₹8245.63
Clothing: 3108 orders, mean ₹3456.78
Food: 3076 orders, mean ₹2234.56
Books: 3078 orders, mean ₹1567.89
Home: 3074 orders, mean ₹12456.78
# Perform ANOVA test first (overall comparison)
f_stat, p_value_anova = f_oneway(*category_revenues)
print(f"ANOVA Results:")
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value_anova:.2e}")
if p_value_anova < 0.05:
    print("\nANOVA significant - at least one category differs")
    print("Proceeding with post-hoc pairwise comparisons...")
else:
    print("\nANOVA not significant - no category differences found")

ANOVA Results:
F-statistic: 1847.3456
p-value: 0.00e+00

ANOVA significant - at least one category differs
Proceeding with post-hoc pairwise comparisons...
Home category dominates revenue per order, followed by Electronics
The doughnut chart reveals the Home category as the revenue leader at ₹12,457 per order, about 51% higher than Electronics. But we still need to verify which specific pairs differ significantly using a proper multiple comparison correction, as sketched below.
Business insight: Home appliances and Electronics should get priority marketing budgets based on their high order values. Books and Food categories need volume strategies rather than premium pricing.
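The pairwise_tukeyhsd import above was never actually used, so here is a sketch of how those corrected post-hoc comparisons could look. Tukey's HSD controls the family-wise error rate across all ten category pairs; this assumes the df DataFrame from the earlier steps is still in scope.
# Tukey's HSD: all pairwise category comparisons with the family-wise
# error rate controlled at alpha = 0.05 (pairwise_tukeyhsd was imported above)
tukey = pairwise_tukeyhsd(
    endog=df['revenue'],               # values being compared
    groups=df['product_category'],     # group labels
    alpha=0.05
)
print(tukey.summary())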
Pro tip: Always report both statistical significance (p-value) AND practical significance (effect size). The p-value tells you whether a difference is unlikely to be chance; the effect size tells you whether you should care about it.
Quiz
1. A Swiggy analyst calculates a 95% confidence interval for average order value as ₹245 to ₹267. What does this interval mean?
2. A Flipkart data scientist finds p = 0.03 when comparing conversion rates between two product categories. What does this p-value represent?
3. A Zomato analyst wants to compare average delivery times across 10 different city zones using α = 0.05. What should they do to control for multiple comparisons?
Up Next
Matplotlib
Master Python's foundational plotting library to create professional statistical visualizations that communicate your confidence intervals and hypothesis test results effectively.