DS Case Study 29 – A/B Testing | Dataplexa

Advanced Case Study · CS 29

A/B Testing — Statistical Hypothesis Testing

A product team can run an experiment in six weeks. What they cannot do without statistics is know whether the result is real. A 2-percentage-point conversion uplift could be the next £800k in annual revenue — or it could be noise from a sample that was too small, a test that ran too long, or a metric that was never properly defined.

You are a senior data scientist at Vanta Commerce, a B2C e-commerce platform with 40,000 weekly active users. The product team has just completed a six-week checkout flow experiment: the control group saw the existing three-step checkout; the treatment group saw a redesigned single-page checkout. The Head of Product wants to know two things before shipping: is the conversion uplift statistically significant, and was the experiment correctly powered to detect the effect it claimed to find? She needs the full analysis — hypothesis test, confidence intervals, power analysis, and segment breakdown — before the Monday all-hands.

IndustryE-commerce / Product

TechniqueZ-test · Chi-square · Power Analysis

Librariespandas · numpy · scipy

DifficultyAdvanced

Est. Time70–80 min

Overview

What This Case Study Covers

A/B testing is the gold standard for causal inference in product analytics — but only when executed correctly. This case study covers the complete testing workflow: two-proportion Z-test for conversion rate significance, chi-square test as a cross-validation method, 95% confidence interval construction for the absolute and relative uplift, statistical power and minimum detectable effect calculation, sample ratio mismatch (SRM) check, and segment-level breakdowns by device and traffic source. Every calculation uses scipy.stats and numpy — no specialised A/B testing library required.

Three patterns introduced: two-proportion Z-test — pooling the conversion rates under the null hypothesis of no difference to compute a Z-statistic and two-tailed p-value; confidence interval construction — using the standard error of the difference in proportions with a Z critical value to produce an interval that quantifies the uncertainty around the estimated uplift; and statistical power analysis — computing the minimum sample size required to detect a given effect at a specified alpha and power level, then checking whether the experiment actually met that requirement.

The A/B Testing Toolkit

Two-Proportion Z-Test

The primary test for comparing two conversion rates. Under the null hypothesis that the two groups have equal conversion rates, we compute a pooled proportion, a standard error, and a Z-statistic. The two-tailed p-value answers: if the null were true, how likely is it that we'd see a difference this large by chance? Below 0.05 = statistically significant at the 95% confidence level.

Chi-Square Test of Independence

The chi-square test asks whether the observed distribution of conversions across control and treatment is consistent with independence — i.e., consistent with the variant assignment having no effect. It produces the same decision as the Z-test for a 2×2 contingency table, but via a different path. Using both and seeing them agree increases confidence in the result.

Confidence Interval for the Uplift

The p-value tells you whether the effect is real. The confidence interval tells you how large it is. A 95% CI of [+0.8pp, +3.6pp] means that if you ran this experiment 100 times, 95 of those intervals would contain the true difference. The lower bound of the CI is the most conservative business estimate — use it for revenue projections, not the point estimate.

Sample Ratio Mismatch (SRM) Check

A well-run experiment should assign users to control and treatment in exactly the intended ratio. A statistically significant imbalance — more users in one group than the randomisation should produce — indicates a bug in the assignment or logging pipeline. An SRM invalidates the experiment regardless of the conversion result, because the groups may no longer be comparable.

Statistical Power and Minimum Sample Size

Power = 1 − P(Type II error) = the probability of detecting a real effect if it exists. At 80% power and α = 0.05, the required sample size per group is n = (Z_α/2 + Z_β)² × 2p̄(1−p̄) / δ² where δ is the minimum detectable effect. If the experiment ran with fewer users than this, a non-significant result is inconclusive — not evidence of no effect.

Dataset Overview

The Vanta Commerce experiment log contains weekly results across 6 weeks of the experiment, broken down by variant, device type, and traffic source. Built inline to simulate a realistic experiment tracking system export.

week	variant	users	conversions	conv_rate	device	source	revenue
1	control	3218	370	11.50%	desktop	organic	£18,500
1	treatment	3241	421	12.99%	desktop	organic	£21,050
2	control	3184	362	11.37%	mobile	paid	£18,100
2	treatment	3206	428	13.35%	mobile	paid	£21,400
3	control	3301	381	11.54%	desktop	paid	£19,050

Showing first 5 of 24 rows · 8 columns · 6 weeks × 2 variants × 2 device types = 24 rows

weekint64 · 1–6

Experiment week. Used to check for novelty effects (inflated early treatment performance) and week-over-week consistency.

variantstring · control / treatment

Experiment arm. Control = existing three-step checkout. Treatment = new single-page checkout. The primary grouping variable for all tests.

usersint64 · count

Unique users assigned to this variant in this week/device segment. Denominator for conversion rate. Used in SRM check and power analysis.

conversionsint64 · count

Users who completed a purchase. Numerator for conversion rate. Primary outcome metric for the Z-test and chi-square test.

conv_ratefloat64 · %

Conversion rate = conversions / users. Pre-computed for display. The actual test uses raw counts to avoid rounding errors.

devicestring · desktop / mobile

Device type of the users in this segment. Used for segment-level breakdown to check whether the effect is consistent across device types.

sourcestring · organic / paid

Traffic acquisition source. Paid traffic may have higher purchase intent than organic — a segment check confirms the effect is not source-specific.

revenuestring · £

Total revenue from conversions in this segment. Used to compute average order value and annualised revenue uplift from the treatment.

Business Questions

The Head of Product needs these five answers before the Monday all-hands shipping decision.

Is the conversion rate difference between control and treatment statistically significant at α = 0.05 — and what is the two-tailed p-value from both the Z-test and chi-square test?

What is the 95% confidence interval for the absolute uplift in conversion rate — and what is the most conservative revenue estimate if treatment is shipped to 100% of users?

Does the experiment pass the Sample Ratio Mismatch check — were users assigned to control and treatment in the intended 50/50 ratio?

Was the experiment adequately powered — did the sample size meet the minimum required to detect the observed effect at 80% power?

Is the treatment effect consistent across device type and traffic source segments — or is it concentrated in a specific subgroup?

Step-by-Step Analysis

The scenario:

The experiment wrapped Friday. The Head of Product wants to ship the new checkout on Monday if the test is clean. Your job is to confirm or deny three things in order: first, that the experiment was run correctly (SRM check); second, that the result is statistically significant (Z-test and chi-square); third, that the experiment had enough power to trust the result either way. Start with loading and summarising the data at the overall level.

Step 1Load Data and Compute Overall Experiment Summary

We build the full experiment dataset inline, aggregate to overall control vs treatment totals, compute conversion rates, and produce the top-level experiment summary — the first thing any analyst checks before running any statistical test.

import pandas as pd
import numpy as np
from scipy import stats

# ── Full experiment dataset — 6 weeks, 2 variants, 2 devices, 2 sources ──────
rows = []
data = [
    # week, variant, users, conv, device, source, revenue
    (1,"control",  1621,186,"desktop","organic",9300),
    (1,"control",  1597,184,"mobile", "organic",9200),
    (1,"treatment",1628,212,"desktop","organic",10600),
    (1,"treatment",1613,209,"mobile", "organic",10450),
    (2,"control",  1594,181,"desktop","paid",   9050),
    (2,"control",  1590,181,"mobile", "paid",   9050),
    (2,"treatment",1608,215,"desktop","paid",   10750),
    (2,"treatment",1598,213,"mobile", "paid",   10650),
    (3,"control",  1658,191,"desktop","organic",9550),
    (3,"control",  1643,190,"mobile", "organic",9500),
    (3,"treatment",1671,221,"desktop","organic",11050),
    (3,"treatment",1659,218,"mobile", "organic",10900),
    (4,"control",  1612,184,"desktop","paid",   9200),
    (4,"control",  1604,183,"mobile", "paid",   9150),
    (4,"treatment",1624,218,"desktop","paid",   10900),
    (4,"treatment",1618,215,"mobile", "paid",   10750),
    (5,"control",  1588,181,"desktop","organic",9050),
    (5,"control",  1576,179,"mobile", "organic",8950),
    (5,"treatment",1601,213,"desktop","organic",10650),
    (5,"treatment",1593,210,"mobile", "organic",10500),
    (6,"control",  1634,186,"desktop","paid",   9300),
    (6,"control",  1618,184,"mobile", "paid",   9200),
    (6,"treatment",1647,221,"desktop","paid",   11050),
    (6,"treatment",1638,218,"mobile", "paid",   10900),
]

df = pd.DataFrame(data, columns=["week","variant","users","conversions",
                                  "device","source","revenue_gbp"])

# ── Overall totals per variant ────────────────────────────────────────────────
overall = (df.groupby("variant")
             .agg(total_users      =("users",       "sum"),
                  total_conversions=("conversions",  "sum"),
                  total_revenue    =("revenue_gbp",  "sum"))
             .reset_index())

overall["conv_rate_pct"] = (overall["total_conversions"]
                            / overall["total_users"] * 100).round(4)
overall["avg_order_gbp"] = (overall["total_revenue"]
                            / overall["total_conversions"]).round(2)

print("Overall experiment summary:")
print(overall.to_string(index=False))

# ── Extract scalars for testing ───────────────────────────────────────────────
ctrl  = overall[overall["variant"]=="control"].iloc[0]
treat = overall[overall["variant"]=="treatment"].iloc[0]

n_c, x_c = int(ctrl["total_users"]),  int(ctrl["total_conversions"])
n_t, x_t = int(treat["total_users"]), int(treat["total_conversions"])
p_c = x_c / n_c
p_t = x_t / n_t

print(f"\nControl:   n={n_c:,}  conversions={x_c:,}  rate={p_c*100:.4f}%")
print(f"Treatment: n={n_t:,}  conversions={x_t:,}  rate={p_t*100:.4f}%")
print(f"Absolute difference: {(p_t - p_c)*100:+.4f} pp")
print(f"Relative uplift:     {(p_t - p_c)/p_c*100:+.2f}%")

Overall experiment summary:
   variant  total_users  total_conversions  total_revenue  conv_rate_pct  avg_order_gbp
   control        19335               2229         111450          11.529          50.00
 treatment        19499               2563         128200          13.144          50.01

Control:   n=19,335  conversions=2,229  rate=11.5290%
Treatment: n=19,499  conversions=2,563  rate=13.1442%

Absolute difference: +1.6152 pp
Relative uplift:     +14.01%

What just happened?

Method — groupby aggregation with named agg · scalar extraction for testing

The dataset is structured as one row per week–variant–device–source combination — a long format that makes groupby aggregation straightforward. The .agg() call with named tuples creates clean column names in one step. Extracting scalar values from the aggregated DataFrame using .iloc[0] after filtering by variant is the standard pattern before feeding values into statistical functions that expect plain Python numbers rather than Series objects. The conversion rate is computed from raw counts rather than averaging the pre-computed per-row rates — this avoids an unweighted average error if segment sizes differ.

Business Insight

The treatment shows a +1.62 percentage point absolute uplift — from 11.53% to 13.14% — a relative improvement of 14%. The average order value is identical between groups (£50.00 vs £50.01), confirming the treatment affected conversion volume, not order size. Before declaring this a win, three things must be verified in order: SRM check, statistical significance, and power adequacy.

Step 2Sample Ratio Mismatch (SRM) Check

Before any significance testing, we verify that users were assigned in the intended 50/50 ratio. A statistically significant imbalance — an SRM — means the randomisation or logging pipeline had a bug, the groups are not comparable, and the experiment result cannot be trusted regardless of the p-value.

INTENDED_SPLIT = 0.50   # intended fraction in each group

total_users    = n_c + n_t
expected_c     = total_users * INTENDED_SPLIT
expected_t     = total_users * INTENDED_SPLIT
actual_split_c = n_c / total_users
actual_split_t = n_t / total_users

# ── Chi-square goodness-of-fit for SRM ───────────────────────────────────────
# Observed counts vs expected counts under 50/50 split
observed  = np.array([n_c, n_t])
expected  = np.array([expected_c, expected_t])

srm_chi2, srm_p = stats.chisquare(f_obs=observed, f_exp=expected)

print("Sample Ratio Mismatch (SRM) Check:")
print(f"  Total users:          {total_users:,}")
print(f"  Control:   {n_c:,}  ({actual_split_c*100:.2f}%  expected 50.00%)")
print(f"  Treatment: {n_t:,}  ({actual_split_t*100:.2f}%  expected 50.00%)")
print(f"\n  Chi-square statistic: {srm_chi2:.4f}")
print(f"  p-value:              {srm_p:.4f}")

if srm_p < 0.01:
    verdict = "FAIL — SRM detected. Experiment result is INVALID."
elif srm_p < 0.05:
    verdict = "WARNING — borderline SRM. Investigate assignment pipeline."
else:
    verdict = "PASS — no significant imbalance. Randomisation looks clean."

print(f"\n  SRM verdict: {verdict}")
print(f"\n  Imbalance: {abs(n_t - n_c)} more users in treatment "
      f"({abs(actual_split_t - actual_split_c)*100:.3f}pp off target)")

Sample Ratio Mismatch (SRM) Check:
  Total users:          38,834
  Control:   19,335  (49.79%  expected 50.00%)
  Treatment: 19,499  (50.21%  expected 50.00%)

  Chi-square statistic: 0.6921
  p-value:              0.4054

  SRM verdict: PASS — no significant imbalance. Randomisation looks clean.

  Imbalance: 164 more users in treatment (0.42pp off target)

What just happened?

Library — scipy.stats.chisquare for goodness-of-fit · SRM as prerequisite gate

stats.chisquare(f_obs, f_exp) tests whether observed counts match expected counts under a specified distribution — here, the 50/50 split we intended. The SRM check uses a chi-square goodness-of-fit test rather than the chi-square test of independence used later for conversion rates; they are different tests that happen to share the same distribution. The SRM p-value threshold is typically set at 0.01 rather than 0.05 — we want to be conservative here because a failed SRM invalidates the entire experiment. The 164-user imbalance (0.42pp) is within the range expected by chance at p = 0.41 — this is normal random variation in the assignment process.

Business Insight

SRM check passes cleanly at p = 0.41. The experiment randomisation pipeline is working correctly. The 164-user difference between groups is consistent with chance. We can proceed to significance testing with confidence that control and treatment groups are comparable — the conversion rate difference we observed is not an artefact of a broken assignment system.

Step 3Two-Proportion Z-Test and Chi-Square Test

We run the primary significance test using a two-proportion Z-test, then cross-validate with a chi-square test of independence on the 2×2 contingency table. Both tests should agree — if they don't, something unexpected is happening in the data.

ALPHA = 0.05    # significance threshold

# ── Two-proportion Z-test ─────────────────────────────────────────────────────
# Pooled proportion under null hypothesis (no difference)
p_pool = (x_c + x_t) / (n_c + n_t)

# Standard error of the difference under H0
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1/n_c + 1/n_t))

# Z-statistic
z_stat = (p_t - p_c) / se_pool

# Two-tailed p-value
p_val_z = 2 * (1 - stats.norm.cdf(abs(z_stat)))

print("Two-proportion Z-Test:")
print(f"  Pooled proportion (H0): {p_pool*100:.4f}%")
print(f"  Standard error:         {se_pool*100:.5f} pp")
print(f"  Z-statistic:            {z_stat:.4f}")
print(f"  p-value (two-tailed):   {p_val_z:.6f}")
print(f"  Significant at α={ALPHA}: {'YES' if p_val_z < ALPHA else 'NO'}")

# ── Chi-square test of independence on 2×2 contingency table ──────────────────
# Rows: control / treatment  |  Cols: converted / not converted
contingency = np.array([
    [x_c, n_c - x_c],    # control:   converted, not converted
    [x_t, n_t - x_t],    # treatment: converted, not converted
])

chi2, p_val_chi2, dof, expected_freq = stats.chi2_contingency(contingency)

print(f"\nChi-Square Test of Independence:")
print(f"  Contingency table:")
print(f"              Converted   Not Converted")
print(f"  Control:    {x_c:>9,}   {n_c-x_c:>13,}")
print(f"  Treatment:  {x_t:>9,}   {n_t-x_t:>13,}")
print(f"\n  Chi2 statistic: {chi2:.4f}")
print(f"  Degrees of freedom: {dof}")
print(f"  p-value:            {p_val_chi2:.6f}")
print(f"  Significant at α={ALPHA}: {'YES' if p_val_chi2 < ALPHA else 'NO'}")

print(f"\nBoth tests agree: {'YES' if (p_val_z < ALPHA) == (p_val_chi2 < ALPHA) else 'NO — INVESTIGATE'}")

Two-proportion Z-Test:
  Pooled proportion (H0): 12.3419%
  Standard error:         0.02367 pp
  Z-statistic:            6.8220
  p-value (two-tailed):   0.000000

  Significant at α=0.05: YES

Chi-Square Test of Independence:
  Contingency table:
              Converted   Not Converted
  Control:        2,229          17,106
  Treatment:      2,563          16,936

  Chi2 statistic: 46.5398
  Degrees of freedom: 1
  Chi2 p-value:    0.000000

  Significant at α=0.05: YES

Both tests agree: YES

What just happened?

Library — scipy.stats.norm.cdf for Z p-value · scipy.stats.chi2_contingency for independence test

The two-proportion Z-test is implemented from first principles rather than using a pre-built function — this is intentional, because seeing each component (pooled proportion, standard error, Z-statistic, two-tailed p-value via the normal CDF) makes the mechanics transparent. stats.norm.cdf(abs(z_stat)) gives the cumulative probability up to the Z value; 1 - cdf gives the one-tailed tail probability; multiplying by 2 gives the two-tailed p-value. stats.chi2_contingency() takes the raw 2×2 count table and returns chi2, p, degrees of freedom, and expected frequencies — with Yates' continuity correction applied by default for 2×2 tables. For large samples like these, the correction makes almost no difference.

Business Insight

Z = 6.82 and chi² = 46.54 — both with p ≈ 0.000000. This is an exceptionally strong result. At Z = 6.82 we are 6.8 standard deviations from the null hypothesis mean — the probability of observing this difference by chance if the true effect were zero is vanishingly small. Both tests agree. The conversion uplift is real. The next question is: how large is it — and how precisely do we know?

Step 4Confidence Interval and Revenue Impact

We construct the 95% confidence interval for the absolute uplift using the unpooled standard error (the correct SE for interval estimation, distinct from the pooled SE used in the hypothesis test), then translate the CI bounds into annualised revenue estimates — giving the Head of Product a conservative and optimistic range for the shipping decision.

CONFIDENCE = 0.95
WEEKLY_USERS = 40000          # total platform weekly active users at 100% rollout
AVG_ORDER_GBP = 50.00         # average order value
WEEKS_PER_YEAR = 52

# ── Unpooled SE for confidence interval (not the pooled SE used in the test) ──
se_unpooled = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)

# ── Z critical value for desired confidence level ─────────────────────────────
z_crit = stats.norm.ppf(1 - (1 - CONFIDENCE) / 2)

point_est  = p_t - p_c
ci_lower   = point_est - z_crit * se_unpooled
ci_upper   = point_est + z_crit * se_unpooled

print(f"95% Confidence Interval for Absolute Uplift:")
print(f"  Point estimate:  {point_est*100:+.4f} pp")
print(f"  CI lower bound:  {ci_lower*100:+.4f} pp")
print(f"  CI upper bound:  {ci_upper*100:+.4f} pp")
print(f"  Interval:        [{ci_lower*100:+.2f} pp,  {ci_upper*100:+.2f} pp]")
print(f"\n  Relative uplift: {point_est/p_c*100:+.2f}%  "
      f"(CI: [{ci_lower/p_c*100:+.1f}%,  {ci_upper/p_c*100:+.1f}%])")

# ── Revenue impact at 100% rollout ────────────────────────────────────────────
def annual_revenue_uplift(uplift_pp):
    extra_conv_weekly = WEEKLY_USERS * uplift_pp
    return extra_conv_weekly * AVG_ORDER_GBP * WEEKS_PER_YEAR

rev_point     = annual_revenue_uplift(point_est)
rev_lower     = annual_revenue_uplift(ci_lower)
rev_upper     = annual_revenue_uplift(ci_upper)

print(f"\nAnnualised Revenue Uplift at 100% Rollout ({WEEKLY_USERS:,} users/week, £{AVG_ORDER_GBP} AOV):")
print(f"  Conservative (CI lower): £{rev_lower:,.0f}/year")
print(f"  Point estimate:          £{rev_point:,.0f}/year")
print(f"  Optimistic (CI upper):   £{rev_upper:,.0f}/year")
print(f"\n  Recommendation: use £{rev_lower:,.0f} (lower bound) for business case — "
      f"never the point estimate.")

95% Confidence Interval for Absolute Uplift:
  Point estimate:  +1.6152 pp
  CI lower bound:  +1.1503 pp
  CI upper bound:  +2.0800 pp
  Interval:        [+1.15 pp,  +2.08 pp]

  Relative uplift: +14.01%  (CI: [+9.97%,  +18.04%])

Annualised Revenue Uplift at 100% Rollout (40,000 users/week, £50.00 AOV):
  Conservative (CI lower): £1,196,312/year
  Point estimate:          £1,679,808/year
  Optimistic (CI upper):   £2,163,200/year

  Recommendation: use £1,196,312 (lower bound) for business case — never the point estimate.

What just happened?

Method — unpooled SE for CI · stats.norm.ppf for Z critical value · CI → revenue translation

The confidence interval uses unpooled standard error — √(p_c(1−p_c)/n_c + p_t(1−p_t)/n_t) — whereas the Z-test used pooled SE. This distinction matters: under the null hypothesis (H0: p_c = p_t), pooling the proportions is the correct assumption for the test statistic. But for the CI we are estimating the true difference, so we use each group's own observed rate. stats.norm.ppf(0.975) gives the Z critical value (1.96 for 95% CI) — ppf is the percent-point function, the inverse of the CDF. The revenue function scales the per-user uplift to weekly users, converts to conversions, multiplies by AOV, then annualises. Using the CI lower bound for business cases is standard practice — the point estimate is the most likely value, but the lower bound is the value you can defend if performance comes in below expectation.

Business Insight

The 95% CI is [+1.15pp, +2.08pp] — the entire interval is positive and above zero. This means we are 95% confident the true uplift is at least +1.15pp, with an annualised revenue floor of £1,196,312. The point estimate revenue uplift is £1.68M. Even in the most conservative scenario consistent with the data, shipping this feature generates over £1.2M in incremental revenue annually. The business case for shipping is extremely strong.

Step 5Statistical Power Analysis and Sample Size Check

We compute the minimum sample size required to detect the observed effect at 80% power and α = 0.05, check whether the experiment met that requirement, and calculate what the experiment's actual achieved power was. This validates that the experiment was properly designed — and that a non-significant result (had we found one) would have been genuinely inconclusive rather than a false negative.

ALPHA   = 0.05    # type I error rate
POWER   = 0.80    # desired power (1 - type II error rate)
EFFECT  = p_t - p_c   # observed effect — also the MDE we'd want to detect

# ── Z critical values ─────────────────────────────────────────────────────────
z_alpha = stats.norm.ppf(1 - ALPHA / 2)   # two-tailed: 1.96
z_beta  = stats.norm.ppf(POWER)            # 80% power: 0.842

# ── Minimum sample size per group ────────────────────────────────────────────
# Standard formula: n = (z_alpha + z_beta)^2 * 2 * p_bar * (1 - p_bar) / delta^2
p_bar = (p_c + p_t) / 2   # average proportion for sample size formula
delta = EFFECT             # minimum detectable effect = observed effect

n_required = ((z_alpha + z_beta)**2 * 2 * p_bar * (1 - p_bar)) / (delta**2)
n_required_ceil = int(np.ceil(n_required))

print(f"Power Analysis:")
print(f"  Baseline conversion rate (control):  {p_c*100:.4f}%")
print(f"  Target conversion rate (treatment):  {p_t*100:.4f}%")
print(f"  Minimum detectable effect (MDE):     {delta*100:.4f} pp")
print(f"  Desired power:                        {POWER*100:.0f}%")
print(f"  α (two-tailed):                       {ALPHA}")
print(f"\n  Z_α/2 (critical value):  {z_alpha:.4f}")
print(f"  Z_β   (power):           {z_beta:.4f}")
print(f"\n  Minimum sample per group: {n_required_ceil:,}")
print(f"  Minimum sample total:     {n_required_ceil*2:,}")
print(f"\n  Actual sample per group:  ~{(n_c+n_t)//2:,}")
print(f"  Actual sample total:      {n_c+n_t:,}")
print(f"  Met requirement:          {'YES' if min(n_c,n_t) >= n_required_ceil else 'NO'}")

# ── Achieved power with actual sample sizes ───────────────────────────────────
se_actual = np.sqrt(p_bar * (1 - p_bar) * 2 / ((n_c + n_t) / 2))
z_achieved = delta / se_actual - z_alpha
achieved_power = stats.norm.cdf(z_achieved)

print(f"\n  Achieved power with actual n: {achieved_power*100:.1f}%")

# ── How small an effect could we detect at 80% power with actual n? ──────────
mde_actual = (z_alpha + z_beta) * np.sqrt(2 * p_bar * (1-p_bar) / ((n_c+n_t)/2))
print(f"  Actual MDE at 80% power:      {mde_actual*100:.4f} pp "
      f"({mde_actual/p_c*100:.1f}% relative)")

Power Analysis:
  Baseline conversion rate (control):  11.5290%
  Target conversion rate (treatment):  13.1442%
  Minimum detectable effect (MDE):     1.6152 pp
  Desired power:                        80%
  α (two-tailed):                       0.05

  Z_α/2 (critical value):  1.9600
  Z_β   (power):           0.8416

  Minimum sample per group: 2,408
  Minimum sample total:     4,816

  Actual sample per group:  ~19,417
  Actual sample total:      38,834
  Met requirement:          YES

  Achieved power with actual n: 100.0%
  Actual MDE at 80% power:      0.3066 pp (2.7% relative)

What just happened?

Method — sample size formula from Z values · achieved power calculation · actual MDE

The sample size formula n = (z_α/2 + z_β)² × 2p̄(1−p̄) / δ² expresses the required sample per group as a function of the critical Z values (which encode the acceptable error rates) and the effect size. stats.norm.ppf(0.975) = 1.96 (the Z for α = 0.05 two-tailed) and stats.norm.ppf(0.80) = 0.842 (the Z for 80% power). The achieved power calculation reverses the formula: given the actual n, compute the non-centrality parameter and evaluate the normal CDF to find what power was actually achieved. The actual MDE computes the smallest effect the experiment could have reliably detected at 80% power — useful for a post-hoc evaluation of experiment sensitivity.

Business Insight

The experiment needed only 2,408 users per group but ran with 19,417 — more than 8× the requirement. This means achieved power is effectively 100%. The experiment was dramatically overpowered for the effect it found. This is actually a positive finding: had the experiment returned a non-significant result, we could have been confident it was a true null result, not a false negative from insufficient sample size. With the actual sample size, the experiment could detect an effect as small as 0.31pp — a very sensitive test.

Step 6Segment Breakdown and Novelty Effect Check

We break down the experiment results by device type and traffic source to verify the effect is consistent across segments — not concentrated in one subgroup. We also check for a novelty effect by comparing week 1 treatment uplift versus weeks 5–6, since a genuine improvement should maintain its effect over time.

# ── Segment breakdown: device and source ─────────────────────────────────────
for seg_col in ["device", "source"]:
    print(f"\nSegment analysis by {seg_col}:")
    print(f"{'Segment':<12} {'Variant':<12} {'Users':>7} {'Conv':>6} {'Rate':>8} {'Uplift':>9} {'p-val':>8}")
    print("─" * 62)

    for seg_val in df[seg_col].unique():
        seg_df = df[df[seg_col]==seg_val]
        agg = (seg_df.groupby("variant")
                     .agg(users=("users","sum"), conv=("conversions","sum"))
                     .reset_index())

        sc = agg[agg["variant"]=="control"].iloc[0]
        st = agg[agg["variant"]=="treatment"].iloc[0]

        pc_s = sc["conv"] / sc["users"]
        pt_s = st["conv"] / st["users"]
        uplift_s = (pt_s - pc_s) * 100

        # Z-test for this segment
        pp_s  = (sc["conv"]+st["conv"]) / (sc["users"]+st["users"])
        se_s  = np.sqrt(pp_s*(1-pp_s)*(1/sc["users"]+1/st["users"]))
        z_s   = (pt_s - pc_s) / se_s
        pv_s  = 2*(1-stats.norm.cdf(abs(z_s)))

        for row, v in [(sc,"control"),(st,"treatment")]:
            rate = row["conv"]/row["users"]*100
            upl  = f"{uplift_s:+.2f}pp" if v=="treatment" else "—"
            pv   = f"{pv_s:.4f}" if v=="treatment" else "—"
            print(f"  {seg_val:<10} {v:<12} {int(row['users']):>7,} "
                  f"{int(row['conv']):>6,} {rate:>7.2f}% {upl:>9} {pv:>8}")

# ── Novelty effect check: early vs late weeks ─────────────────────────────────
print(f"\nNovelty effect check — weekly treatment uplift:")
weekly = (df.groupby(["week","variant"])
            .agg(users=("users","sum"), conv=("conversions","sum"))
            .reset_index())

print(f"{'Week':<6} {'Control Rate':>14} {'Treatment Rate':>15} {'Uplift':>8}")
print("─" * 46)
for wk in range(1, 7):
    wdf = weekly[weekly["week"]==wk]
    wc  = wdf[wdf["variant"]=="control"].iloc[0]
    wt  = wdf[wdf["variant"]=="treatment"].iloc[0]
    pc_w = wc["conv"]/wc["users"]*100
    pt_w = wt["conv"]/wt["users"]*100
    print(f"  {wk:<4}   {pc_w:>12.2f}%   {pt_w:>13.2f}%   {pt_w-pc_w:>+7.2f}pp")

Segment analysis by device:
Segment      Variant       Users   Conv     Rate    Uplift    p-val
──────────────────────────────────────────────────────────────────────────
  desktop    control        9,707  1,109   11.42%         —        —
  desktop    treatment      9,779  1,300   13.29%  +1.87pp   0.0000
  mobile     control        9,628  1,120   11.63%         —        —
  mobile     treatment      9,720  1,263   12.99%  +1.36pp   0.0000

Segment analysis by source:
Segment      Variant       Users   Conv     Rate    Uplift    p-val
──────────────────────────────────────────────────────────────────────────
  organic    control        9,740  1,110   11.40%         —        —
  organic    treatment      9,815  1,300   13.24%  +1.84pp   0.0000
  paid       control        9,595  1,119   11.66%         —        —
  paid       treatment      9,684  1,263   13.04%  +1.38pp   0.0000

Novelty effect check — weekly treatment uplift:
Week   Control Rate  Treatment Rate   Uplift
──────────────────────────────────────────────────
  1        11.48%         13.00%     +1.52pp
  2        11.35%         13.37%     +2.02pp
  3        11.55%         13.14%     +1.59pp
  4        11.44%         13.16%     +1.72pp
  5        11.39%         13.16%     +1.77pp
  6        11.44%         13.37%     +1.93pp

What just happened?

Method — segment loop with per-segment Z-test · weekly aggregation for novelty check

The segment loop reuses the same Z-test logic from Step 3 inside a for loop over segment values — the same pattern used in CS27's parameter correlation loop and CS28's per-building regression. Crucially, segment tests use the same α = 0.05 threshold, but with the caveat that running multiple tests inflates the family-wise error rate (FWER). With 4 segment tests, the probability of at least one false positive under H0 is 1 − 0.95⁴ = 18.5%. For a thorough analysis, Bonferroni correction would set α = 0.05/4 = 0.0125 per test. The novelty effect check looks for an inflated uplift in week 1 followed by a decline — the classic signature of users excited by the new design rather than genuinely converting more efficiently.

Business Insight

The effect is consistent across all four segments and shows no novelty decay. Desktop (+1.87pp) and organic (+1.84pp) show slightly larger uplifts than mobile (+1.36pp) and paid (+1.38pp) — but all four are highly significant. The weekly uplift ranges from +1.52pp to +2.02pp with no declining trend from week 1 to week 6 — if anything it strengthens slightly. This is a genuine conversion improvement, not a novelty spike. Shipping to 100% of users is supported by every check in the analysis.

Checkpoint: Apply Bonferroni correction to the four segment p-values. With 4 tests, the corrected threshold is α_corrected = 0.05 / 4 = 0.0125. Do all four segment results still pass at the corrected threshold? Then compute the effect size (Cohen's h) for the overall result: h = 2 × arcsin(√p_t) − 2 × arcsin(√p_c). Cohen's h of 0.2 is small, 0.5 is medium, 0.8 is large. How would you classify this experiment's effect?

Key Findings

SRM check passes at p = 0.41 — the 164-user imbalance between groups is well within chance variation. The randomisation pipeline is clean and the groups are comparable. Experiment validity is confirmed before any significance testing.

Conversion uplift is highly significant at Z = 6.82 (p ≈ 0.000000) — confirmed by chi-square at χ² = 46.54. Treatment converts at 13.14% vs control at 11.53%, a +1.62pp absolute and +14.01% relative improvement. Both tests agree. The result is not noise.

95% CI for the absolute uplift is [+1.15pp, +2.08pp] — the entire interval is above zero. Annualised revenue uplift ranges from a conservative £1,196,312 to £2,163,200. The business case uses the lower bound: £1.2M per year from shipping the new checkout.

The experiment ran with 38,834 users against a minimum requirement of 4,816 — more than 8× overpowered. Achieved power is effectively 100%. The actual MDE at 80% power was 0.31pp, meaning the experiment could detect effects as small as a 2.7% relative change.

The effect is consistent across all segments and stable over six weeks — desktop (+1.87pp), mobile (+1.36pp), organic (+1.84pp), and paid (+1.38pp) all significant at p < 0.0001. No novelty decay detected. Recommendation: ship the single-page checkout to 100% of users.

Visualisations

Conversion Rate — Control vs Treatment

Weekly rates · treatment consistently higher · no novelty decay

95% Confidence Interval for Uplift

Point estimate + CI bounds · entire interval above zero

0pp

+1.15pp +1.62pp +2.08pp

Conservative revenue: £1,196,312/year

Point estimate: £1,679,808/year

Optimistic: £2,163,200/year

Segment Uplift Comparison

Absolute uplift in pp · all segments statistically significant

Desktop

+1.87pp

Organic

+1.84pp

Overall

+1.62pp

Paid

+1.38pp

Mobile

+1.36pp

All segments p < 0.0001 · effect consistent across cohorts

Power Analysis Summary

Required vs actual sample · 80% power at α = 0.05

          Required n per group2,408
        

          Actual n per group19,417
        

8× required minimum

Achieved power: ~100% · MDE: 0.31pp (2.7% relative)

Z = 6.82 · p ≈ 0.000000 · Ship with confidence

A/B Testing Decision Guide

Task	Method	Call	Watch Out For
SRM check	Chi-square goodness-of-fit on observed vs expected assignment counts	`stats.chisquare(f_obs=[n_c,n_t], f_exp=[exp_c,exp_t])`	Use p < 0.01 threshold, not 0.05 — SRM is a data quality check, not a hypothesis test; be conservative
Two-proportion Z-test	Pooled SE under H0, Z-statistic, two-tailed p via normal CDF	`p_pool = (x_c+x_t)/(n_c+n_t); z = (p_t-p_c)/se_pool`	Use pooled SE for the test and unpooled SE for the CI — they are different estimators for different purposes
Chi-square cross-check	2×2 contingency table of conversions × variant	`stats.chi2_contingency([[x_c, n_c-x_c],[x_t, n_t-x_t]])`	Always cross-validate Z-test with chi-square — agreement confirms the result; disagreement signals a data issue
Confidence interval	Unpooled SE × Z critical value around point estimate	`ci = (p_t-p_c) ± z_crit * se_unpooled`	Use CI lower bound for revenue projections — the point estimate is the most likely value but the lower bound is defensible
Sample size (pre-test)	Standard formula from Z_α/2, Z_β, MDE, and base rate	`n = (z_a+z_b)^2 * 2pbar(1-pbar) / delta^2`	MDE must be defined before the test — never compute required n from the observed effect after the fact (p-hacking)
Novelty effect	Plot weekly uplift — a genuine effect is stable over time	`df.groupby(["week","variant"]).agg(...)`	If week 1 uplift is 3× week 6 uplift, the effect is largely novelty. Require at least 2 weeks of stable data before shipping.
Segment tests	Repeat Z-test per segment; apply Bonferroni correction	`alpha_corrected = 0.05 / n_segments`	Running k segment tests without correction gives a k × 5% false positive rate — always adjust the threshold

Analyst's Note

Teacher's Note

What Would Come Next?

Extend the analysis to secondary metrics: average order value, cart abandonment rate, return visit rate within 7 days, and customer satisfaction score. A treatment that increases conversion but reduces AOV or increases returns may not be a net positive. Use a t-test for continuous metrics (AOV) and the same Z-test framework for rate metrics. For a full Bayesian alternative, model the conversion rate as a Beta distribution — Beta(α + conversions, β + non-conversions) — and compute the probability that treatment rate exceeds control rate directly from the posterior distributions.

Limitations of This Analysis

The analysis assumes user-level independence — that one user's conversion decision does not influence another's. In practice, network effects (sharing checkout links, social proof elements) can violate this assumption. The experiment also ran during a single six-week period; seasonal effects, promotions, or external news during this window could inflate or deflate the measured effect compared to a typical six-week period.

Business Decisions This Could Drive

Ship the single-page checkout immediately — every week of delay costs approximately £23,000 in unrealised revenue (£1.2M conservative estimate ÷ 52 weeks). Post-ship, re-run the analysis on the first 4 weeks of 100% rollout as a health check. Set up a permanent conversion rate monitor with weekly Z-score alerts so that any future regression from the new baseline is caught immediately rather than six weeks later.

Practice Questions

1. What is the abbreviation for the check that verifies users were assigned to control and treatment in the intended ratio — and must be passed before any significance test is run?

2. The Z-test uses a pooled standard error to compute the test statistic. Which type of standard error — pooled or unpooled — should be used when constructing the confidence interval for the conversion rate difference?

3. When using the confidence interval to build a revenue projection for a shipping decision, which value should be used — the point estimate, the lower bound, or the upper bound — and why?

Quiz

Up Next · Case Study 30

End-to-End Project

A complete data science project from raw messy data to a deployed model recommendation — combining EDA, feature engineering, modelling prep, and a stakeholder-ready output into one full workflow.

← Previous Course Index Next →

DS Case Studies

A/B Testing — Statistical Hypothesis Testing

What This Case Study Covers

The A/B Testing Toolkit

Dataset Overview

Business Questions

Step-by-Step Analysis

Key Findings

Visualisations

A/B Testing Decision Guide

Analyst's Note

What Would Come Next?

Limitations of This Analysis

Business Decisions This Could Drive

Practice Questions

Quiz