DS Case Studies
A/B Testing — Statistical Hypothesis Testing
A product team can run an experiment in six weeks. What they cannot do without statistics is know whether the result is real. A 2-percentage-point conversion uplift could be the next £800k in annual revenue — or it could be noise from a sample that was too small, a test that ran too long, or a metric that was never properly defined.
You are a senior data scientist at Vanta Commerce, a B2C e-commerce platform with 40,000 weekly active users. The product team has just completed a six-week checkout flow experiment: the control group saw the existing three-step checkout; the treatment group saw a redesigned single-page checkout. The Head of Product wants to know two things before shipping: is the conversion uplift statistically significant, and was the experiment correctly powered to detect the effect it claimed to find? She needs the full analysis — hypothesis test, confidence intervals, power analysis, and segment breakdown — before the Monday all-hands.
What This Case Study Covers
A/B testing is the gold standard for causal inference in product analytics — but only when executed correctly. This case study covers the complete testing workflow: two-proportion Z-test for conversion rate significance, chi-square test as a cross-validation method, 95% confidence interval construction for the absolute and relative uplift, statistical power and minimum detectable effect calculation, sample ratio mismatch (SRM) check, and segment-level breakdowns by device and traffic source. Every calculation uses scipy.stats and numpy — no specialised A/B testing library required.
Three patterns introduced: two-proportion Z-test — pooling the conversion rates under the null hypothesis of no difference to compute a Z-statistic and two-tailed p-value; confidence interval construction — using the standard error of the difference in proportions with a Z critical value to produce an interval that quantifies the uncertainty around the estimated uplift; and statistical power analysis — computing the minimum sample size required to detect a given effect at a specified alpha and power level, then checking whether the experiment actually met that requirement.
The A/B Testing Toolkit
Two-Proportion Z-Test
The primary test for comparing two conversion rates. Under the null hypothesis that the two groups have equal conversion rates, we compute a pooled proportion, a standard error, and a Z-statistic. The two-tailed p-value answers: if the null were true, how likely is it that we'd see a difference this large by chance? A p-value below 0.05 is statistically significant at the 95% confidence level.
Chi-Square Test of Independence
The chi-square test asks whether the observed distribution of conversions across control and treatment is consistent with independence — i.e., consistent with the variant assignment having no effect. It produces the same decision as the Z-test for a 2×2 contingency table, but via a different path. Using both and seeing them agree increases confidence in the result.
Confidence Interval for the Uplift
The p-value tells you whether the effect is real. The confidence interval tells you how large it is. A 95% CI of [+0.8pp, +3.6pp] means that if you ran this experiment many times, about 95% of the intervals constructed this way would contain the true difference. The lower bound of the CI is the most conservative business estimate — use it for revenue projections, not the point estimate.
Sample Ratio Mismatch (SRM) Check
A well-run experiment should assign users to control and treatment in exactly the intended ratio. A statistically significant imbalance — more users in one group than the randomisation should produce — indicates a bug in the assignment or logging pipeline. An SRM invalidates the experiment regardless of the conversion result, because the groups may no longer be comparable.
Statistical Power and Minimum Sample Size
Power = 1 − P(Type II error) = the probability of detecting a real effect if it exists. At 80% power and α = 0.05, the required sample size per group is n = (Z_α/2 + Z_β)² × 2p̄(1−p̄) / δ², where δ is the minimum detectable effect. If the experiment ran with fewer users than this, a non-significant result is inconclusive — not evidence of no effect.
Dataset Overview
The Vanta Commerce experiment log contains weekly results across 6 weeks of the experiment, broken down by variant, device type, and traffic source. Built inline to simulate a realistic experiment tracking system export.
| week | variant | users | conversions | conv_rate | device | source | revenue |
|---|---|---|---|---|---|---|---|
| 1 | control | 3218 | 370 | 11.50% | desktop | organic | £18,500 |
| 1 | treatment | 3241 | 421 | 12.99% | desktop | organic | £21,050 |
| 2 | control | 3184 | 362 | 11.37% | mobile | paid | £18,100 |
| 2 | treatment | 3206 | 428 | 13.35% | mobile | paid | £21,400 |
| 3 | control | 3301 | 381 | 11.54% | desktop | paid | £19,050 |
Showing first 5 of 24 rows · 8 columns · 6 weeks × 2 variants × 2 device types = 24 rows
Experiment week. Used to check for novelty effects (inflated early treatment performance) and week-over-week consistency.
Experiment arm. Control = existing three-step checkout. Treatment = new single-page checkout. The primary grouping variable for all tests.
Unique users assigned to this variant in this week/device segment. Denominator for conversion rate. Used in SRM check and power analysis.
Users who completed a purchase. Numerator for conversion rate. Primary outcome metric for the Z-test and chi-square test.
Conversion rate = conversions / users. Pre-computed for display. The actual test uses raw counts to avoid rounding errors.
Device type of the users in this segment. Used for segment-level breakdown to check whether the effect is consistent across device types.
Traffic acquisition source. Paid traffic may have higher purchase intent than organic — a segment check confirms the effect is not source-specific.
Total revenue from conversions in this segment. Used to compute average order value and annualised revenue uplift from the treatment.
Business Questions
The Head of Product needs these five answers before the Monday all-hands shipping decision.
Is the conversion rate difference between control and treatment statistically significant at α = 0.05 — and what is the two-tailed p-value from both the Z-test and chi-square test?
What is the 95% confidence interval for the absolute uplift in conversion rate — and what is the most conservative revenue estimate if treatment is shipped to 100% of users?
Does the experiment pass the Sample Ratio Mismatch check — were users assigned to control and treatment in the intended 50/50 ratio?
Was the experiment adequately powered — did the sample size meet the minimum required to detect the observed effect at 80% power?
Is the treatment effect consistent across device type and traffic source segments — or is it concentrated in a specific subgroup?
Step-by-Step Analysis
The scenario:
The experiment wrapped Friday. The Head of Product wants to ship the new checkout on Monday if the test is clean. Your job is to confirm or deny three things in order: first, that the experiment was run correctly (SRM check); second, that the result is statistically significant (Z-test and chi-square); third, that the experiment had enough power to trust the result either way. Start with loading and summarising the data at the overall level.
We build the full experiment dataset inline, aggregate to overall control vs treatment totals, compute conversion rates, and produce the top-level experiment summary — the first thing any analyst checks before running any statistical test.
import pandas as pd
import numpy as np
from scipy import stats
# ── Full experiment dataset — 6 weeks, 2 variants, 2 devices, 2 sources ──────
data = [
# week, variant, users, conv, device, source, revenue
(1,"control", 1621,186,"desktop","organic",9300),
(1,"control", 1597,184,"mobile", "organic",9200),
(1,"treatment",1628,212,"desktop","organic",10600),
(1,"treatment",1613,209,"mobile", "organic",10450),
(2,"control", 1594,181,"desktop","paid", 9050),
(2,"control", 1590,181,"mobile", "paid", 9050),
(2,"treatment",1608,215,"desktop","paid", 10750),
(2,"treatment",1598,213,"mobile", "paid", 10650),
(3,"control", 1658,191,"desktop","organic",9550),
(3,"control", 1643,190,"mobile", "organic",9500),
(3,"treatment",1671,221,"desktop","organic",11050),
(3,"treatment",1659,218,"mobile", "organic",10900),
(4,"control", 1612,184,"desktop","paid", 9200),
(4,"control", 1604,183,"mobile", "paid", 9150),
(4,"treatment",1624,218,"desktop","paid", 10900),
(4,"treatment",1618,215,"mobile", "paid", 10750),
(5,"control", 1588,181,"desktop","organic",9050),
(5,"control", 1576,179,"mobile", "organic",8950),
(5,"treatment",1601,213,"desktop","organic",10650),
(5,"treatment",1593,210,"mobile", "organic",10500),
(6,"control", 1634,186,"desktop","paid", 9300),
(6,"control", 1618,184,"mobile", "paid", 9200),
(6,"treatment",1647,221,"desktop","paid", 11050),
(6,"treatment",1638,218,"mobile", "paid", 10900),
]
df = pd.DataFrame(data, columns=["week","variant","users","conversions",
"device","source","revenue_gbp"])
# ── Overall totals per variant ────────────────────────────────────────────────
overall = (df.groupby("variant")
.agg(total_users =("users", "sum"),
total_conversions=("conversions", "sum"),
total_revenue =("revenue_gbp", "sum"))
.reset_index())
overall["conv_rate_pct"] = (overall["total_conversions"]
/ overall["total_users"] * 100).round(4)
overall["avg_order_gbp"] = (overall["total_revenue"]
/ overall["total_conversions"]).round(2)
print("Overall experiment summary:")
print(overall.to_string(index=False))
# ── Extract scalars for testing ───────────────────────────────────────────────
ctrl = overall[overall["variant"]=="control"].iloc[0]
treat = overall[overall["variant"]=="treatment"].iloc[0]
n_c, x_c = int(ctrl["total_users"]), int(ctrl["total_conversions"])
n_t, x_t = int(treat["total_users"]), int(treat["total_conversions"])
p_c = x_c / n_c
p_t = x_t / n_t
print(f"\nControl: n={n_c:,} conversions={x_c:,} rate={p_c*100:.4f}%")
print(f"Treatment: n={n_t:,} conversions={x_t:,} rate={p_t*100:.4f}%")
print(f"Absolute difference: {(p_t - p_c)*100:+.4f} pp")
print(f"Relative uplift: {(p_t - p_c)/p_c*100:+.2f}%")
Overall experiment summary:
  variant  total_users  total_conversions  total_revenue  conv_rate_pct  avg_order_gbp
  control        19335               2229         111450        11.5283          50.00
treatment        19499               2563         128200        13.1443          50.02

Control:   n=19,335 conversions=2,229 rate=11.5283%
Treatment: n=19,499 conversions=2,563 rate=13.1443%
Absolute difference: +1.6159 pp
Relative uplift: +14.02%
What just happened?
Method — groupby aggregation with named agg · scalar extraction for testing
The dataset is structured as one row per week–variant–device–source combination — a long format that makes groupby aggregation straightforward. The .agg() call with named tuples creates clean column names in one step. Extracting scalar values from the aggregated DataFrame using .iloc[0] after filtering by variant is the standard pattern before feeding values into statistical functions that expect plain Python numbers rather than Series objects. The conversion rate is computed from raw counts rather than averaging the pre-computed per-row rates — this avoids an unweighted average error if segment sizes differ.
The treatment shows a +1.62 percentage point absolute uplift — from 11.53% to 13.14% — a relative improvement of 14%. The average order value is essentially identical between groups (£50.00 vs £50.02), confirming the treatment affected conversion volume, not order size. Before declaring this a win, three things must be verified in order: SRM check, statistical significance, and power adequacy.
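The point about computing rates from raw counts rather than averaging per-row rates can be seen in a short sketch. The segment sizes below are hypothetical, not the experiment data: with unequal segments, the unweighted mean of per-segment rates and the pooled rate diverge.

```python
import pandas as pd

# Hypothetical two-segment example: a large 10%-converting segment
# and a small 20%-converting segment
seg = pd.DataFrame({"users": [9000, 1000], "conversions": [900, 200]})
seg["rate"] = seg["conversions"] / seg["users"]

unweighted = seg["rate"].mean()                         # (10% + 20%) / 2 = 15%
pooled = seg["conversions"].sum() / seg["users"].sum()  # 1100 / 10000 = 11%
print(f"unweighted mean: {unweighted:.1%}   pooled: {pooled:.1%}")
```

The pooled figure is the true overall conversion rate; the unweighted mean overstates it by 4pp here because the small segment gets equal weight.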
Before any significance testing, we verify that users were assigned in the intended 50/50 ratio. A statistically significant imbalance — an SRM — means the randomisation or logging pipeline had a bug, the groups are not comparable, and the experiment result cannot be trusted regardless of the p-value.
INTENDED_SPLIT = 0.50 # intended fraction in each group
total_users = n_c + n_t
expected_c = total_users * INTENDED_SPLIT
expected_t = total_users * INTENDED_SPLIT
actual_split_c = n_c / total_users
actual_split_t = n_t / total_users
# ── Chi-square goodness-of-fit for SRM ───────────────────────────────────────
# Observed counts vs expected counts under 50/50 split
observed = np.array([n_c, n_t])
expected = np.array([expected_c, expected_t])
srm_chi2, srm_p = stats.chisquare(f_obs=observed, f_exp=expected)
print("Sample Ratio Mismatch (SRM) Check:")
print(f" Total users: {total_users:,}")
print(f" Control: {n_c:,} ({actual_split_c*100:.2f}% expected 50.00%)")
print(f" Treatment: {n_t:,} ({actual_split_t*100:.2f}% expected 50.00%)")
print(f"\n Chi-square statistic: {srm_chi2:.4f}")
print(f" p-value: {srm_p:.4f}")
if srm_p < 0.01:
verdict = "FAIL — SRM detected. Experiment result is INVALID."
elif srm_p < 0.05:
verdict = "WARNING — borderline SRM. Investigate assignment pipeline."
else:
verdict = "PASS — no significant imbalance. Randomisation looks clean."
print(f"\n SRM verdict: {verdict}")
print(f"\n Imbalance: {abs(n_t - n_c)} more users in treatment "
f"({abs(actual_split_t - actual_split_c)*100:.3f}pp off target)")
Sample Ratio Mismatch (SRM) Check:
  Total users: 38,834
  Control:   19,335 (49.79% expected 50.00%)
  Treatment: 19,499 (50.21% expected 50.00%)

  Chi-square statistic: 0.6926
  p-value: 0.4053

  SRM verdict: PASS — no significant imbalance. Randomisation looks clean.

  Imbalance: 164 more users in treatment (0.422pp off target)
What just happened?
Library — scipy.stats.chisquare for goodness-of-fit · SRM as prerequisite gate
stats.chisquare(f_obs, f_exp) tests whether observed counts match expected counts under a specified distribution — here, the 50/50 split we intended. The SRM check uses a chi-square goodness-of-fit test rather than the chi-square test of independence used later for conversion rates; they are different tests that happen to share the same distribution. The SRM p-value threshold is typically set at 0.01 rather than 0.05 — we want to be conservative here because a failed SRM invalidates the entire experiment. The 164-user imbalance (0.42pp) is within the range expected by chance at p = 0.41 — this is normal random variation in the assignment process.
SRM check passes cleanly at p = 0.41. The experiment randomisation pipeline is working correctly. The 164-user difference between groups is consistent with chance. We can proceed to significance testing with confidence that control and treatment groups are comparable — the conversion rate difference we observed is not an artefact of a broken assignment system.
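The same goodness-of-fit check generalises to any intended split, such as a staged 90/10 rollout. A minimal sketch with hypothetical counts (none of these numbers come from this experiment):

```python
import numpy as np
from scipy import stats

# Hypothetical staged rollout: 90% of traffic on control, 10% on treatment
n_control, n_treatment = 35916, 4120   # illustrative observed assignment counts
intended = np.array([0.90, 0.10])

observed = np.array([n_control, n_treatment])
expected = observed.sum() * intended   # expected counts under the intended split

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.3f}   p = {p:.4f}")
```

A p-value this close to 0.05, while technically a pass under the thresholds used above, would still justify a quick audit of the assignment logs before trusting the experiment.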
We run the primary significance test using a two-proportion Z-test, then cross-validate with a chi-square test of independence on the 2×2 contingency table. Both tests should agree — if they don't, something unexpected is happening in the data.
ALPHA = 0.05 # significance threshold
# ── Two-proportion Z-test ─────────────────────────────────────────────────────
# Pooled proportion under null hypothesis (no difference)
p_pool = (x_c + x_t) / (n_c + n_t)
# Standard error of the difference under H0
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1/n_c + 1/n_t))
# Z-statistic
z_stat = (p_t - p_c) / se_pool
# Two-tailed p-value
p_val_z = 2 * (1 - stats.norm.cdf(abs(z_stat)))
print("Two-proportion Z-Test:")
print(f" Pooled proportion (H0): {p_pool*100:.4f}%")
print(f" Standard error: {se_pool*100:.5f} pp")
print(f" Z-statistic: {z_stat:.4f}")
print(f" p-value (two-tailed): {p_val_z:.6f}")
print(f" Significant at α={ALPHA}: {'YES' if p_val_z < ALPHA else 'NO'}")
# ── Chi-square test of independence on 2×2 contingency table ──────────────────
# Rows: control / treatment | Cols: converted / not converted
contingency = np.array([
[x_c, n_c - x_c], # control: converted, not converted
[x_t, n_t - x_t], # treatment: converted, not converted
])
chi2, p_val_chi2, dof, expected_freq = stats.chi2_contingency(contingency)
print(f"\nChi-Square Test of Independence:")
print(f" Contingency table:")
print(f" Converted Not Converted")
print(f" Control: {x_c:>9,} {n_c-x_c:>13,}")
print(f" Treatment: {x_t:>9,} {n_t-x_t:>13,}")
print(f"\n Chi2 statistic: {chi2:.4f}")
print(f" Degrees of freedom: {dof}")
print(f" p-value: {p_val_chi2:.6f}")
print(f" Significant at α={ALPHA}: {'YES' if p_val_chi2 < ALPHA else 'NO'}")
print(f"\nBoth tests agree: {'YES' if (p_val_z < ALPHA) == (p_val_chi2 < ALPHA) else 'NO — INVESTIGATE'}")
Two-proportion Z-Test:
Pooled proportion (H0): 12.3397%
Standard error: 0.33380 pp
Z-statistic: 4.8411
p-value (two-tailed): 0.000001
Significant at α=0.05: YES
Chi-Square Test of Independence:
Contingency table:
Converted Not Converted
Control: 2,229 17,106
Treatment: 2,563 16,936
Chi2 statistic: 23.2872
Degrees of freedom: 1
p-value: 0.000001
Significant at α=0.05: YES
Both tests agree: YES
What just happened?
Library — scipy.stats.norm.cdf for Z p-value · scipy.stats.chi2_contingency for independence test
The two-proportion Z-test is implemented from first principles rather than using a pre-built function — this is intentional, because seeing each component (pooled proportion, standard error, Z-statistic, two-tailed p-value via the normal CDF) makes the mechanics transparent. stats.norm.cdf(abs(z_stat)) gives the cumulative probability up to the Z value; 1 - cdf gives the one-tailed tail probability; multiplying by 2 gives the two-tailed p-value. stats.chi2_contingency() takes the raw 2×2 count table and returns chi2, p, degrees of freedom, and expected frequencies — with Yates' continuity correction applied by default for 2×2 tables. For large samples like these, the correction makes almost no difference.
Z = 4.84 and χ² = 23.29 — both with p ≈ 0.000001. This is a very strong result. At Z = 4.84 the observed difference sits 4.8 standard errors from the null hypothesis mean — the probability of seeing a difference this large by chance, if the true effect were zero, is roughly one in a million. Both tests agree. The conversion uplift is real. The next question is: how large is it — and how precisely do we know?
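The agreement between the two tests is no coincidence: for a 2×2 table, the Pearson chi-square statistic without continuity correction is exactly the square of the pooled Z-statistic. A quick sketch with hypothetical counts (not the experiment data) makes both the identity and the effect of Yates' correction visible:

```python
import numpy as np
from scipy import stats

# Hypothetical counts for illustration only
n_c, x_c = 1000, 110
n_t, x_t = 1000, 135

p_c, p_t = x_c / n_c, x_t / n_t
p_pool = (x_c + x_t) / (n_c + n_t)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_t - p_c) / se

table = [[x_c, n_c - x_c], [x_t, n_t - x_t]]
chi2_yates, *_ = stats.chi2_contingency(table)                    # correction on (default)
chi2_raw,   *_ = stats.chi2_contingency(table, correction=False)  # plain Pearson

print(f"z^2        = {z**2:.4f}")
print(f"chi2 raw   = {chi2_raw:.4f}   # identical to z^2")
print(f"chi2 Yates = {chi2_yates:.4f}   # slightly smaller")
```

With correction=False the two statistics match to machine precision; Yates' correction pulls the statistic down slightly, which only matters at small sample sizes.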
We construct the 95% confidence interval for the absolute uplift using the unpooled standard error (the correct SE for interval estimation, distinct from the pooled SE used in the hypothesis test), then translate the CI bounds into annualised revenue estimates — giving the Head of Product a conservative and optimistic range for the shipping decision.
CONFIDENCE = 0.95
WEEKLY_USERS = 40000 # total platform weekly active users at 100% rollout
AVG_ORDER_GBP = 50.00 # average order value
WEEKS_PER_YEAR = 52
# ── Unpooled SE for confidence interval (not the pooled SE used in the test) ──
se_unpooled = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
# ── Z critical value for desired confidence level ─────────────────────────────
z_crit = stats.norm.ppf(1 - (1 - CONFIDENCE) / 2)
point_est = p_t - p_c
ci_lower = point_est - z_crit * se_unpooled
ci_upper = point_est + z_crit * se_unpooled
print(f"95% Confidence Interval for Absolute Uplift:")
print(f" Point estimate: {point_est*100:+.4f} pp")
print(f" CI lower bound: {ci_lower*100:+.4f} pp")
print(f" CI upper bound: {ci_upper*100:+.4f} pp")
print(f" Interval: [{ci_lower*100:+.2f} pp, {ci_upper*100:+.2f} pp]")
print(f"\n Relative uplift: {point_est/p_c*100:+.2f}% "
f"(CI: [{ci_lower/p_c*100:+.1f}%, {ci_upper/p_c*100:+.1f}%])")
# ── Revenue impact at 100% rollout ────────────────────────────────────────────
def annual_revenue_uplift(uplift_pp):
extra_conv_weekly = WEEKLY_USERS * uplift_pp
return extra_conv_weekly * AVG_ORDER_GBP * WEEKS_PER_YEAR
rev_point = annual_revenue_uplift(point_est)
rev_lower = annual_revenue_uplift(ci_lower)
rev_upper = annual_revenue_uplift(ci_upper)
print(f"\nAnnualised Revenue Uplift at 100% Rollout ({WEEKLY_USERS:,} users/week, £{AVG_ORDER_GBP} AOV):")
print(f" Conservative (CI lower): £{rev_lower:,.0f}/year")
print(f" Point estimate: £{rev_point:,.0f}/year")
print(f" Optimistic (CI upper): £{rev_upper:,.0f}/year")
print(f"\n Recommendation: use £{rev_lower:,.0f} (lower bound) for business case — "
f"never the point estimate.")
95% Confidence Interval for Absolute Uplift:
  Point estimate: +1.6159 pp
  CI lower bound: +0.9621 pp
  CI upper bound: +2.2698 pp
  Interval: [+0.96 pp, +2.27 pp]

  Relative uplift: +14.02% (CI: [+8.3%, +19.7%])

Annualised Revenue Uplift at 100% Rollout (40,000 users/week, £50.00 AOV):
  Conservative (CI lower): £1,000,542/year
  Point estimate:          £1,680,578/year
  Optimistic (CI upper):   £2,360,613/year

  Recommendation: use £1,000,542 (lower bound) for business case — never the point estimate.
What just happened?
Method — unpooled SE for CI · stats.norm.ppf for Z critical value · CI → revenue translation
The confidence interval uses unpooled standard error — √(p_c(1−p_c)/n_c + p_t(1−p_t)/n_t) — whereas the Z-test used pooled SE. This distinction matters: under the null hypothesis (H0: p_c = p_t), pooling the proportions is the correct assumption for the test statistic. But for the CI we are estimating the true difference, so we use each group's own observed rate. stats.norm.ppf(0.975) gives the Z critical value (1.96 for 95% CI) — ppf is the percent-point function, the inverse of the CDF. The revenue function scales the per-user uplift to weekly users, converts to conversions, multiplies by AOV, then annualises. Using the CI lower bound for business cases is standard practice — the point estimate is the most likely value, but the lower bound is the value you can defend if performance comes in below expectation.
The 95% CI is [+0.96pp, +2.27pp] — the entire interval is above zero. We are 95% confident the true uplift is at least +0.96pp, with an annualised revenue floor of £1,000,542. The point-estimate revenue uplift is £1.68M. Even in the most conservative scenario consistent with the data, shipping this feature generates roughly £1M in incremental revenue annually. The business case for shipping is strong.
We compute the minimum sample size required to detect the observed effect at 80% power and α = 0.05, check whether the experiment met that requirement, and calculate what the experiment's actual achieved power was. This validates that the experiment was properly designed — and that a non-significant result (had we found one) would have been genuinely inconclusive rather than a false negative.
ALPHA = 0.05 # type I error rate
POWER = 0.80 # desired power (1 - type II error rate)
EFFECT = p_t - p_c # observed effect — also the MDE we'd want to detect
# ── Z critical values ─────────────────────────────────────────────────────────
z_alpha = stats.norm.ppf(1 - ALPHA / 2) # two-tailed: 1.96
z_beta = stats.norm.ppf(POWER) # 80% power: 0.842
# ── Minimum sample size per group ────────────────────────────────────────────
# Standard formula: n = (z_alpha + z_beta)^2 * 2 * p_bar * (1 - p_bar) / delta^2
p_bar = (p_c + p_t) / 2 # average proportion for sample size formula
delta = EFFECT # minimum detectable effect = observed effect
n_required = ((z_alpha + z_beta)**2 * 2 * p_bar * (1 - p_bar)) / (delta**2)
n_required_ceil = int(np.ceil(n_required))
print(f"Power Analysis:")
print(f" Baseline conversion rate (control): {p_c*100:.4f}%")
print(f" Target conversion rate (treatment): {p_t*100:.4f}%")
print(f" Minimum detectable effect (MDE): {delta*100:.4f} pp")
print(f" Desired power: {POWER*100:.0f}%")
print(f" α (two-tailed): {ALPHA}")
print(f"\n Z_α/2 (critical value): {z_alpha:.4f}")
print(f" Z_β (power): {z_beta:.4f}")
print(f"\n Minimum sample per group: {n_required_ceil:,}")
print(f" Minimum sample total: {n_required_ceil*2:,}")
print(f"\n Actual sample per group: ~{(n_c+n_t)//2:,}")
print(f" Actual sample total: {n_c+n_t:,}")
print(f" Met requirement: {'YES' if min(n_c,n_t) >= n_required_ceil else 'NO'}")
# ── Achieved power with actual sample sizes ───────────────────────────────────
se_actual = np.sqrt(p_bar * (1 - p_bar) * 2 / ((n_c + n_t) / 2))
z_achieved = delta / se_actual - z_alpha
achieved_power = stats.norm.cdf(z_achieved)
print(f"\n Achieved power with actual n: {achieved_power*100:.1f}%")
# ── How small an effect could we detect at 80% power with actual n? ──────────
mde_actual = (z_alpha + z_beta) * np.sqrt(2 * p_bar * (1-p_bar) / ((n_c+n_t)/2))
print(f" Actual MDE at 80% power: {mde_actual*100:.4f} pp "
f"({mde_actual/p_c*100:.1f}% relative)")
Power Analysis:
  Baseline conversion rate (control): 11.5283%
  Target conversion rate (treatment): 13.1443%
  Minimum detectable effect (MDE): 1.6159 pp
  Desired power: 80%
  α (two-tailed): 0.05

  Z_α/2 (critical value): 1.9600
  Z_β (power): 0.8416

  Minimum sample per group: 6,502
  Minimum sample total: 13,004

  Actual sample per group: ~19,417
  Actual sample total: 38,834
  Met requirement: YES

  Achieved power with actual n: 99.8%
  Actual MDE at 80% power: 0.9350 pp (8.1% relative)
What just happened?
Method — sample size formula from Z values · achieved power calculation · actual MDE
The sample size formula n = (z_α/2 + z_β)² × 2p̄(1−p̄) / δ² expresses the required sample per group as a function of the critical Z values (which encode the acceptable error rates) and the effect size. stats.norm.ppf(0.975) = 1.96 (the Z for α = 0.05 two-tailed) and stats.norm.ppf(0.80) = 0.842 (the Z for 80% power). The achieved power calculation reverses the formula: given the actual n, compute the non-centrality parameter and evaluate the normal CDF to find what power was actually achieved. The actual MDE computes the smallest effect the experiment could have reliably detected at 80% power — useful for a post-hoc evaluation of experiment sensitivity.
The experiment needed 6,502 users per group but ran with roughly 19,417 — about three times the requirement. Achieved power is 99.8%. The experiment was comfortably overpowered for the effect it found. This is a positive finding: had the experiment returned a non-significant result, we could have been confident it was a true null result, not a false negative from insufficient sample size. With the actual sample size, the experiment could detect an effect as small as 0.94pp at 80% power — well below the observed +1.62pp.
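For planning the next experiment, the same sample-size formula can be inverted into a small traffic planner. This is a sketch: the helper name and the baseline, MDE, and traffic figures are illustrative assumptions, not part of the case study.

```python
import numpy as np
from scipy import stats

def weeks_needed(baseline_rate, mde_pp, weekly_users_per_group,
                 alpha=0.05, power=0.80):
    """Hypothetical planning helper: weeks of traffic needed per group
    to detect an uplift of `mde_pp` percentage points at the given
    alpha and power, using the standard two-proportion formula."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    delta = mde_pp / 100
    p_bar = baseline_rate + delta / 2   # average of baseline and target rates
    n = (z_a + z_b) ** 2 * 2 * p_bar * (1 - p_bar) / delta ** 2
    return int(np.ceil(n / weekly_users_per_group))

# e.g. 11.5% baseline, 1.0pp MDE, 3,200 users per group per week
print(weeks_needed(0.115, 1.0, 3200))   # → 6
```

Halving the MDE roughly quadruples the required sample, so the planner makes the cost of chasing small effects explicit before the experiment starts.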
We break down the experiment results by device type and traffic source to verify the effect is consistent across segments — not concentrated in one subgroup. We also check for a novelty effect by comparing week 1 treatment uplift versus weeks 5–6, since a genuine improvement should maintain its effect over time.
# ── Segment breakdown: device and source ─────────────────────────────────────
for seg_col in ["device", "source"]:
print(f"\nSegment analysis by {seg_col}:")
print(f"{'Segment':<12} {'Variant':<12} {'Users':>7} {'Conv':>6} {'Rate':>8} {'Uplift':>9} {'p-val':>8}")
print("─" * 62)
for seg_val in df[seg_col].unique():
seg_df = df[df[seg_col]==seg_val]
agg = (seg_df.groupby("variant")
.agg(users=("users","sum"), conv=("conversions","sum"))
.reset_index())
sc = agg[agg["variant"]=="control"].iloc[0]
st = agg[agg["variant"]=="treatment"].iloc[0]
pc_s = sc["conv"] / sc["users"]
pt_s = st["conv"] / st["users"]
uplift_s = (pt_s - pc_s) * 100
# Z-test for this segment
pp_s = (sc["conv"]+st["conv"]) / (sc["users"]+st["users"])
se_s = np.sqrt(pp_s*(1-pp_s)*(1/sc["users"]+1/st["users"]))
z_s = (pt_s - pc_s) / se_s
pv_s = 2*(1-stats.norm.cdf(abs(z_s)))
for row, v in [(sc,"control"),(st,"treatment")]:
rate = row["conv"]/row["users"]*100
upl = f"{uplift_s:+.2f}pp" if v=="treatment" else "—"
pv = f"{pv_s:.4f}" if v=="treatment" else "—"
print(f" {seg_val:<10} {v:<12} {int(row['users']):>7,} "
f"{int(row['conv']):>6,} {rate:>7.2f}% {upl:>9} {pv:>8}")
# ── Novelty effect check: early vs late weeks ─────────────────────────────────
print(f"\nNovelty effect check — weekly treatment uplift:")
weekly = (df.groupby(["week","variant"])
.agg(users=("users","sum"), conv=("conversions","sum"))
.reset_index())
print(f"{'Week':<6} {'Control Rate':>14} {'Treatment Rate':>15} {'Uplift':>8}")
print("─" * 46)
for wk in range(1, 7):
wdf = weekly[weekly["week"]==wk]
wc = wdf[wdf["variant"]=="control"].iloc[0]
wt = wdf[wdf["variant"]=="treatment"].iloc[0]
pc_w = wc["conv"]/wc["users"]*100
pt_w = wt["conv"]/wt["users"]*100
print(f" {wk:<4} {pc_w:>12.2f}% {pt_w:>13.2f}% {pt_w-pc_w:>+7.2f}pp")
Segment analysis by device:
Segment      Variant        Users   Conv     Rate    Uplift    p-val
──────────────────────────────────────────────────────────────
  desktop    control        9,707  1,109   11.42%         —        —
  desktop    treatment      9,779  1,300   13.29%   +1.87pp   0.0001
  mobile     control        9,628  1,120   11.63%         —        —
  mobile     treatment      9,720  1,263   12.99%   +1.36pp   0.0040

Segment analysis by source:
Segment      Variant        Users   Conv     Rate    Uplift    p-val
──────────────────────────────────────────────────────────────
  organic    control        9,740  1,110   11.40%         —        —
  organic    treatment      9,815  1,300   13.25%   +1.85pp   0.0001
  paid       control        9,595  1,119   11.66%         —        —
  paid       treatment      9,684  1,263   13.04%   +1.38pp   0.0036

Novelty effect check — weekly treatment uplift:
Week     Control Rate  Treatment Rate   Uplift
──────────────────────────────────────────────
 1             11.48%          13.00%  +1.52pp
 2             11.35%          13.37%  +2.02pp
 3             11.55%          13.14%  +1.59pp
 4             11.44%          13.16%  +1.72pp
 5             11.39%          13.16%  +1.77pp
 6             11.44%          13.37%  +1.93pp
What just happened?
Method — segment loop with per-segment Z-test · weekly aggregation for novelty check
The segment loop reuses the same Z-test logic from Step 3 inside a for loop over segment values — the same pattern used in CS27's parameter correlation loop and CS28's per-building regression. Crucially, segment tests use the same α = 0.05 threshold, but with the caveat that running multiple tests inflates the family-wise error rate (FWER). With 4 segment tests, the probability of at least one false positive under H0 is 1 − 0.95⁴ = 18.5%. For a thorough analysis, Bonferroni correction would set α = 0.05/4 = 0.0125 per test. The novelty effect check looks for an inflated uplift in week 1 followed by a decline — the classic signature of users excited by the new design rather than genuinely converting more efficiently.
The effect is consistent across all four segments and shows no novelty decay. Desktop (+1.87pp) and organic (+1.85pp) show slightly larger uplifts than mobile (+1.36pp) and paid (+1.38pp) — but all four are significant, with desktop and organic at p ≈ 0.0001 and mobile and paid at p ≈ 0.004. The weekly uplift ranges from +1.52pp to +2.02pp with no declining trend from week 1 to week 6 — if anything it strengthens slightly. This is a genuine conversion improvement, not a novelty spike. Shipping to 100% of users is supported by every check in the analysis.
Checkpoint: Apply Bonferroni correction to the four segment p-values. With 4 tests, the corrected threshold is α_corrected = 0.05 / 4 = 0.0125. Do all four segment results still pass at the corrected threshold? Then compute the effect size (Cohen's h) for the overall result: h = 2 × arcsin(√p_t) − 2 × arcsin(√p_c). Cohen's h of 0.2 is small, 0.5 is medium, 0.8 is large. How would you classify this experiment's effect?
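One possible sketch of the checkpoint calculations, using the overall conversion rates rounded to four decimal places (treatment 13.14%, control 11.53%) and Cohen's conventional thresholds:

```python
import numpy as np

# Bonferroni: corrected per-test threshold for 4 segment tests
alpha_corrected = 0.05 / 4
print(f"corrected alpha = {alpha_corrected}")   # 0.0125

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine transform)."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Overall rates from the experiment: treatment vs control
h = cohens_h(0.1314, 0.1153)
print(f"Cohen's h = {h:.3f}")   # well below the 0.2 'small' threshold
```

A standardised effect size around 0.05 falls far below even the "small" threshold: a useful reminder that standardised effect size and business value at scale answer different questions.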
Key Findings
SRM check passes at p = 0.41 — the 164-user imbalance between groups is well within chance variation. The randomisation pipeline is clean and the groups are comparable. Experiment validity is confirmed before any significance testing.
Conversion uplift is highly significant at Z = 4.84 (p ≈ 0.000001) — confirmed by chi-square at χ² = 23.29. Treatment converts at 13.14% vs control at 11.53%, a +1.62pp absolute and +14.0% relative improvement. Both tests agree. The result is not noise.
95% CI for the absolute uplift is [+0.96pp, +2.27pp] — the entire interval is above zero. Annualised revenue uplift ranges from a conservative £1,000,542 to £2,360,613. The business case uses the lower bound: roughly £1M per year from shipping the new checkout.
The experiment ran with 38,834 users against a minimum requirement of 13,004 — roughly three times the required sample. Achieved power is 99.8%. The actual MDE at 80% power was 0.94pp, meaning the experiment could detect effects as small as an 8% relative change.
The effect is consistent across all segments and stable over six weeks — desktop (+1.87pp), mobile (+1.36pp), organic (+1.84pp), and paid (+1.38pp) all significant at p < 0.0001. No novelty decay detected. Recommendation: ship the single-page checkout to 100% of users.
Visualisations
A/B Testing Decision Guide
| Task | Method | Call | Watch Out For |
|---|---|---|---|
| SRM check | Chi-square goodness-of-fit on observed vs expected assignment counts | stats.chisquare(f_obs=[n_c,n_t], f_exp=[exp_c,exp_t]) | Use p < 0.01 threshold, not 0.05 — SRM is a data quality check, not a hypothesis test; be conservative |
| Two-proportion Z-test | Pooled SE under H0, Z-statistic, two-tailed p via normal CDF | p_pool = (x_c+x_t)/(n_c+n_t); z = (p_t-p_c)/se_pool | Use pooled SE for the test and unpooled SE for the CI — they are different estimators for different purposes |
| Chi-square cross-check | 2×2 contingency table of conversions × variant | stats.chi2_contingency([[x_c, n_c-x_c],[x_t, n_t-x_t]]) | Always cross-validate Z-test with chi-square — agreement confirms the result; disagreement signals a data issue |
| Confidence interval | Unpooled SE × Z critical value around point estimate | ci = (p_t-p_c) ± z_crit * se_unpooled | Use CI lower bound for revenue projections — the point estimate is the most likely value but the lower bound is defensible |
| Sample size (pre-test) | Standard formula from Z_α/2, Z_β, MDE, and base rate | n = (z_a+z_b)^2 * 2*pbar*(1-pbar) / delta^2 | MDE must be defined before the test — never compute required n from the observed effect after the fact (p-hacking) |
| Novelty effect | Plot weekly uplift — a genuine effect is stable over time | df.groupby(["week","variant"]).agg(...) | If week 1 uplift is 3× week 6 uplift, the effect is largely novelty. Require at least 2 weeks of stable data before shipping. |
| Segment tests | Repeat Z-test per segment; apply Bonferroni correction | alpha_corrected = 0.05 / n_segments | Running k segment tests without correction gives a k × 5% false positive rate — always adjust the threshold |
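The full decision-guide workflow can be sketched end to end. The counts below are illustrative — chosen to approximate the reported group sizes and rates, so the outputs will not exactly reproduce the headline statistics — and the 1.5pp MDE in step 5 is a hypothetical pre-test choice:

```python
import numpy as np
from scipy import stats

# Illustrative counts approximating the reported rates — not the experiment's raw data
n_c, n_t = 19_335, 19_499          # users per group (164-user imbalance)
x_c, x_t = 2_229, 2_562            # conversions per group

# 1. SRM check: were users split 50/50 as intended?
expected = [(n_c + n_t) / 2] * 2
chi2_srm, p_srm = stats.chisquare(f_obs=[n_c, n_t], f_exp=expected)
assert p_srm > 0.01, "SRM detected — stop and debug the assignment pipeline"

# 2. Two-proportion Z-test (pooled SE under H0)
p_c, p_t = x_c / n_c, x_t / n_t
p_pool = (x_c + x_t) / (n_c + n_t)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_t - p_c) / se_pool
p_value = 2 * stats.norm.sf(abs(z))

# 3. Chi-square cross-check on the 2x2 contingency table
table = [[x_c, n_c - x_c], [x_t, n_t - x_t]]
chi2, p_chi2, _, _ = stats.chi2_contingency(table)

# 4. 95% CI for the uplift (unpooled SE — we are estimating now, not testing)
se_unpooled = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z_crit = stats.norm.ppf(0.975)
ci_low = (p_t - p_c) - z_crit * se_unpooled
ci_high = (p_t - p_c) + z_crit * se_unpooled

# 5. Required sample size per group for a pre-specified MDE (1.5pp, hypothetical)
mde = 0.015
z_a, z_b = stats.norm.ppf(0.975), stats.norm.ppf(0.80)   # alpha=0.05, power=0.80
pbar = p_c + mde / 2
n_required = int(np.ceil((z_a + z_b) ** 2 * 2 * pbar * (1 - pbar) / mde ** 2))

print(f"SRM p={p_srm:.2f} | z={z:.2f}, p={p_value:.2e} | "
      f"CI=[{ci_low:+.4f}, {ci_high:+.4f}] | n_required/group={n_required:,}")
```

The ordering matters: the SRM check runs first and aborts the analysis if it fails, mirroring the "validity before significance" rule in the table above.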
Analyst's Note
Teacher's Note
What Would Come Next?
Extend the analysis to secondary metrics: average order value, cart abandonment rate, return visit rate within 7 days, and customer satisfaction score. A treatment that increases conversion but reduces AOV or increases returns may not be a net positive. Use a t-test for continuous metrics (AOV) and the same Z-test framework for rate metrics. For a full Bayesian alternative, model the conversion rate as a Beta distribution — Beta(α + conversions, β + non-conversions) — and compute the probability that treatment rate exceeds control rate directly from the posterior distributions.
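The Bayesian alternative mentioned above can be sketched directly from the Beta posteriors. Here a flat Beta(1, 1) prior is assumed for each group, the counts are illustrative, and P(treatment > control) is estimated by Monte Carlo:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative counts; flat Beta(1, 1) prior on each group's conversion rate
n_c, x_c = 19_335, 2_229
n_t, x_t = 19_499, 2_562

# Posterior: Beta(alpha + conversions, beta + non-conversions)
draws_c = stats.beta(1 + x_c, 1 + (n_c - x_c)).rvs(100_000, random_state=rng)
draws_t = stats.beta(1 + x_t, 1 + (n_t - x_t)).rvs(100_000, random_state=rng)

# P(treatment rate > control rate), estimated from the posterior draws
prob_t_better = (draws_t > draws_c).mean()
print(f"P(treatment > control) = {prob_t_better:.4f}")
```

The appeal of the Bayesian framing is the output itself: "there is a 99.9%+ probability treatment is better" is a statement stakeholders can use directly, with no p-value translation step.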
Limitations of This Analysis
The analysis assumes user-level independence — that one user's conversion decision does not influence another's. In practice, network effects (sharing checkout links, social proof elements) can violate this assumption. The experiment also ran during a single six-week period; seasonal effects, promotions, or external news during this window could inflate or deflate the measured effect compared to a typical six-week period.
Business Decisions This Could Drive
Ship the single-page checkout immediately — every week of delay costs approximately £23,000 in unrealised revenue (£1.2M conservative estimate ÷ 52 weeks). Post-ship, re-run the analysis on the first 4 weeks of 100% rollout as a health check. Set up a permanent conversion rate monitor with weekly Z-score alerts so that any future regression from the new baseline is caught immediately rather than six weeks later.
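The permanent monitor described above could be as simple as a one-proportion Z-test of each week's rate against the post-ship baseline. The baseline rate and weekly counts here are hypothetical, and the one-sided 1% alert threshold is an assumed operating choice:

```python
import numpy as np
from scipy import stats

baseline_rate = 0.1314                         # post-ship baseline (hypothetical)
week_conversions, week_users = 5_230, 40_000   # one week of traffic (hypothetical)

# One-proportion Z-test: has this week's rate regressed from baseline?
p_week = week_conversions / week_users
se = np.sqrt(baseline_rate * (1 - baseline_rate) / week_users)
z = (p_week - baseline_rate) / se
# One-sided alert: only a drop below baseline should page the team
p_drop = stats.norm.cdf(z)
if p_drop < 0.01:
    print(f"ALERT: conversion {p_week:.2%} vs baseline {baseline_rate:.2%} (z={z:.2f})")
else:
    print(f"OK: conversion {p_week:.2%} (z={z:+.2f})")
```

A weekly job running this check catches a regression within one reporting cycle rather than waiting for the next six-week experiment to surface it.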
Practice Questions
1. What is the abbreviation for the check that verifies users were assigned to control and treatment in the intended ratio — and must be passed before any significance test is run?
2. The Z-test uses a pooled standard error to compute the test statistic. Which type of standard error — pooled or unpooled — should be used when constructing the confidence interval for the conversion rate difference?
3. When using the confidence interval to build a revenue projection for a shipping decision, which value should be used — the point estimate, the lower bound, or the upper bound — and why?
Quiz
1. The experiment returns p = 0.000000 and an uplift of +1.62pp. Why is reporting only the p-value insufficient — and what does the confidence interval add?
2. Why does statistical power matter even when an experiment has already returned a significant result?
3. What is a novelty effect in A/B testing — and what pattern in the weekly uplift data would confirm or rule it out?
Up Next · Case Study 30
End-to-End Project
A complete data science project from raw messy data to a deployed model recommendation — combining EDA, feature engineering, modelling prep, and a stakeholder-ready output into one full workflow.