DS Case Study 24 – Customer Segmentation | Dataplexa
Advanced Case Study · CS 24

Customer Segmentation with RFM and K-Means

Every e-commerce database has thousands of customers who look identical on a spreadsheet. The marketer's job is to find the structure beneath that sameness — the loyal high-spenders who need retention, the lapsed customers who need reactivation, the new buyers who need nurturing. Segmentation is how you stop sending everyone the same email.

You are a senior data scientist at Prism Commerce, a direct-to-consumer e-commerce brand. The head of CRM has a problem: the marketing team is running identical campaigns to all 20,000 customers and conversion rates have fallen for three consecutive quarters. She needs a data-driven customer segmentation — built on purchase recency, frequency, and monetary value — that the CRM team can use to personalise campaigns next month. The segmentation must be explainable to a non-technical marketing director and actionable within the existing email platform.

Industry: E-Commerce / CRM
Technique: RFM · K-Means · Elbow Method
Libraries: pandas · numpy · scipy
Difficulty: Advanced
Est. Time: 70–80 min
Overview

What This Case Study Covers

Customer segmentation is one of the most widely deployed data science techniques in industry — and one of the most frequently misunderstood. This case study builds the complete RFM segmentation pipeline: feature engineering from raw transactions, min-max normalisation, K-means clustering implemented from scratch in numpy, the elbow method for choosing K, and translation of cluster assignments into named marketing personas. No sklearn: pure pandas and numpy, with scipy used only for the final ANOVA validation.

Three patterns introduced: RFM feature construction — aggregating transaction history into Recency, Frequency, and Monetary Value per customer using groupby; min-max normalisation to put all three features on a 0–1 scale before clustering so that monetary value doesn't dominate; and K-means from scratch — implementing centroid initialisation, assignment, and update steps in numpy to understand what sklearn's KMeans actually does under the hood.
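The first pattern, aggregating raw transactions into per-customer RFM, is not shown later because the case study's dataset arrives pre-computed. A minimal sketch, assuming a hypothetical raw `orders` table with `customer_id`, `order_date`, and `order_value` columns and an arbitrary snapshot date:

```python
import pandas as pd

# Hypothetical raw orders table — in production this has millions of rows
orders = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002", "C001", "C002"],
    "order_date":  pd.to_datetime(["2024-01-05", "2024-03-10", "2024-02-01",
                                   "2024-06-20", "2024-02-15"]),
    "order_value": [120.0, 80.0, 45.0, 200.0, 60.0],
})

snapshot = pd.Timestamp("2024-07-01")   # the analysis date

# Named aggregation: one groupby pass yields all three RFM features
rfm = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("order_value", "sum"),
).reset_index()

print(rfm)
```

The same pattern scales unchanged to a production orders table; only the snapshot date and column names differ.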

The Segmentation Toolkit

1. RFM Feature Engineering

Recency (days since last purchase), Frequency (total number of orders), and Monetary Value (total spend) are computed per customer from a raw transaction table. These three features summarise an entire purchase history into a three-dimensional customer profile — the universal language of CRM analytics.
2. Min-Max Normalisation

RFM features have very different scales — recency in days (1–365), frequency as a small integer (1–50), monetary value in dollars (10–10,000). Without normalisation, monetary value dominates the distance calculation in K-means and the clustering ignores recency and frequency entirely. Min-max scaling brings all three to [0, 1].
3. K-Means from Scratch

Initialise K centroids randomly, assign each customer to the nearest centroid using Euclidean distance, recompute centroids as cluster means, repeat until convergence. Understanding the three-step loop makes it clear why K-means can get stuck in local optima — and why running it multiple times with different random seeds is standard practice.
4. The Elbow Method

Running K-means for K=2 through K=7 and plotting the within-cluster sum of squares (WCSS) reveals the "elbow" — the K value where adding another cluster stops meaningfully reducing WCSS. Beyond the elbow, extra clusters mostly split existing groups rather than uncover new structure.
5. Persona Translation

Cluster assignments are meaningless to a marketing director. Translating cluster statistics into named personas — "Champions", "At-Risk", "New Buyers", "Hibernating" — with specific campaign recommendations is the step that converts a data science output into a business action. This is the deliverable that gets implemented.
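The scaling point in item 2 is easy to demonstrate with two hypothetical customers: once a realistic monetary gap is present, even a ten-month recency gap contributes almost nothing to the Euclidean distance.

```python
import numpy as np

# [recency_days, frequency, monetary] — unnormalised, hypothetical values
a = np.array([1.0,   5.0,  500.0])    # bought yesterday
b = np.array([300.0, 5.0, 4500.0])    # lapsed for ~10 months

sq = (b - a) ** 2
dist = np.sqrt(sq.sum())
monetary_share = sq[2] / sq.sum()     # fraction of squared distance from monetary

print(f"distance={dist:.1f}, monetary share={monetary_share:.1%}")
```

On these numbers the monetary axis accounts for over 99% of the squared distance, which is why the clustering effectively ignores recency and frequency until all three features are rescaled.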
01 · Dataset Overview

The Prism Commerce transaction extract contains 20 customer records (a working sample of the 20,000-customer base) with pre-computed RFM values derived from 12 months of purchase history. In production this would be aggregated from a raw orders table with millions of rows.

customer_id  recency_days  frequency  monetary  avg_order_value  first_purchase_days
       C001             8         14      2840            202.9                  312
       C002           142          2       184             92.0                  180
       C003             3         22      5120            232.7                  348
       C004           287          1        67             67.0                  290
       C005            19          9      1640            182.2                  280

Showing first 5 of 20 rows · 6 columns · derived from 12-month transaction history

customer_id — string · unique key

Unique customer reference. Maps cluster assignment back to the CRM system for campaign targeting.

recency_days — int64 · days

Days since last purchase. Low recency = recently active. Higher values indicate lapsing or churned customers.

frequency — int64 · order count

Total number of orders in the 12-month window. High frequency = loyal repeat buyer. Low = one-time or infrequent purchaser.

monetary — float64 · USD

Total spend in 12 months. Primary value signal — high-monetary customers justify premium retention investment.

avg_order_value — float64 · USD/order

Mean order size. Distinguishes high-frequency low-basket customers from low-frequency premium buyers.

first_purchase_days — int64 · days

Days since first ever purchase. Short tenure = new customer still in trial phase. Long tenure = established relationship.

02 · Business Questions

The head of CRM needs these five answers before next month's campaign briefing.

1. How many natural customer segments exist in the RFM data — and what is the optimal K for this dataset?

2. What are the RFM profiles of each cluster — and how do they differ from each other?

3. Which cluster represents the highest-value customers — and what is the total revenue at risk if they churn?

4. Which customers are at risk of lapsing — and what retention strategy should the CRM team use for each segment?

5. How stable are the cluster assignments — and would the results change with a different random seed?

03 · Step-by-Step Analysis

The scenario:

The RFM extract landed Monday morning. Campaign briefing is Friday. The CRM team needs labelled customer segments with named personas and specific campaign recommendations by Thursday. Start with normalisation — raw RFM features sit on incompatible scales and would distort the clustering — then build K-means, find the elbow, assign personas, and validate.

Step 1: Build RFM Features and Apply Min-Max Normalisation

We load the customer RFM table, inspect the feature distributions, and apply min-max normalisation to bring all three features to the [0, 1] range. Recency is inverted after scaling: a high recency_days value means a less recent customer, so the scaled value is flipped, making a higher normalised score mean a better customer on every dimension.

import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)

# ── Customer RFM table ────────────────────────────────────────────────────────
df = pd.DataFrame({
    "customer_id":       [f"C{i:03d}" for i in range(1, 21)],
    "recency_days":      [8,142,3,287,19,64,5,198,31,112,7,243,25,88,12,172,44,320,16,55],
    "frequency":         [14,2,22,1,9,4,18,2,7,3,20,1,11,5,16,2,6,1,13,8],
    "monetary":          [2840,184,5120,67,1640,420,4280,145,1180,310,4890,88,2140,580,3640,165,760,52,2560,1020],
    "avg_order_value":   [202.9,92.0,232.7,67.0,182.2,105.0,237.8,72.5,168.6,103.3,244.5,88.0,194.5,116.0,227.5,82.5,126.7,52.0,196.9,127.5],
    "first_purchase_days":[312,180,348,290,280,145,356,210,268,195,340,300,288,162,326,220,198,310,274,232]
})

rfm_cols = ["recency_days","frequency","monetary"]

# ── Distribution of raw RFM ───────────────────────────────────────────────────
print("Raw RFM feature statistics:")
print(df[rfm_cols].describe().round(1).to_string())

# ── Min-Max normalisation ─────────────────────────────────────────────────────
def min_max_scale(series):
    """Scale series to [0, 1]."""
    mn, mx = series.min(), series.max()
    return (series - mn) / (mx - mn)

df["R_norm"] = 1 - min_max_scale(df["recency_days"])  # invert: low days = good
df["F_norm"] =     min_max_scale(df["frequency"])
df["M_norm"] =     min_max_scale(df["monetary"])

# RFM composite score (equal weights) — useful for quick ranking
df["rfm_score"] = (df["R_norm"] + df["F_norm"] + df["M_norm"]).round(4)

print("\nNormalised RFM features (first 8 rows):")
print(df[["customer_id","recency_days","frequency","monetary",
          "R_norm","F_norm","M_norm","rfm_score"]].head(8).round(3).to_string(index=False))

print(f"\nTop 3 customers by RFM composite score:")
print(df.nlargest(3,"rfm_score")[["customer_id","recency_days","frequency","monetary","rfm_score"]].to_string(index=False))
print(f"\nBottom 3 customers by RFM composite score:")
print(df.nsmallest(3,"rfm_score")[["customer_id","recency_days","frequency","monetary","rfm_score"]].to_string(index=False))
Raw RFM feature statistics:
       recency_days  frequency  monetary
count          20.0       20.0      20.0
mean          125.7        7.9    1699.2
std           104.1        6.6    1697.3
min             3.0        1.0      52.0
25%            16.5        2.0     194.8
50%            58.0        6.5    1100.0
75%           183.5       13.8    2770.0
max           320.0       22.0    5120.0

Normalised RFM features (first 8 rows):
 customer_id  recency_days  frequency  monetary  R_norm  F_norm  M_norm  rfm_score
        C001             8         14      2840   0.968   0.619   0.550      2.137
        C002           142          2       184   0.562   0.048   0.026      0.635
        C003             3         22      5120   1.000   1.000   1.000      3.000
        C004           287          1        67   0.103   0.000   0.000      0.103
        C005            19          9      1640   0.940   0.381   0.307      1.628
        C006            64          4       420   0.803   0.143   0.072      1.017
        C007             5         18      4280   0.987   0.810   0.830      2.627
        C008           198          2       145   0.378   0.048   0.018      0.444

Top 3 customers by RFM composite score:
 customer_id  recency_days  frequency  monetary  rfm_score
        C003             3         22      5120      3.000
        C007             5         18      4280      2.627
        C011             7         20      4890      2.604

Bottom 3 customers by RFM composite score:
 customer_id  recency_days  frequency  monetary  rfm_score
        C018           320          1        52      0.000
        C004           287          1        67      0.103
        C012           243          1        88      0.215

What just happened?

Method — min-max normalisation · recency inversion · composite RFM score

Min-max normalisation transforms each feature to the [0, 1] range using (x − min) / (max − min). Without this, monetary value (range $52–$5,120) would dominate Euclidean distance in K-means, making recency (range 3–320 days) and frequency (range 1–22) nearly invisible to the clustering algorithm. Recency is inverted — 1 - min_max_scale(recency_days) — so that a customer who purchased yesterday scores near 1.0 (good recency) and one who purchased 320 days ago scores near 0.0 (bad recency). The composite RFM score sums all three normalised values; a perfect customer (C003) scores 3.0 — highest on all three dimensions simultaneously.
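The formula can be spot-checked against the printed table. C001's frequency of 14 in the range 1–22 and monetary of $2,840 in the range $52–$5,120 should reproduce F_norm = 0.619 and M_norm = 0.550:

```python
def min_max(x, mn, mx):
    """(x - min) / (max - min) — the scaling used in Step 1."""
    return (x - mn) / (mx - mn)

# C001's frequency and monetary, with the dataset's observed min/max
print(round(min_max(14, 1, 22), 3))       # F_norm for C001
print(round(min_max(2840, 52, 5120), 3))  # M_norm for C001
```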

Business Insight

C003 is the perfect customer — purchased 3 days ago, 22 orders, $5,120 spend, composite score 3.0. C018 is the worst — 320 days since purchase, 1 order, $52 spend, score 0.0. The raw RFM statistics confirm extreme variance: monetary std of $1,697 almost equals the mean of $1,699, indicating a highly skewed distribution where a few high-value customers dramatically outspend the majority.

Step 2: The Elbow Method — Finding Optimal K

We implement K-means from scratch in numpy and run it for K = 2 through 7, computing the Within-Cluster Sum of Squares (WCSS) at each K. The elbow point — where the rate of WCSS improvement drops sharply — identifies the optimal number of segments.

X = df[["R_norm","F_norm","M_norm"]].values   # feature matrix: (20, 3)

def kmeans(X, k, max_iter=100, n_init=10):
    """
    K-means clustering from scratch.
    Runs n_init times with different random seeds, returns best result.
    """
    best_wcss    = np.inf
    best_labels  = None
    best_centers = None

    for seed in range(n_init):
        rng = np.random.RandomState(seed)
        # 1. Random initialisation: pick k rows as starting centroids
        idx     = rng.choice(len(X), size=k, replace=False)
        centers = X[idx].copy()

        for _ in range(max_iter):
            # 2. Assignment: each point → nearest centroid (Euclidean distance)
            dists  = np.sqrt(((X[:, None] - centers[None]) ** 2).sum(axis=2))
            labels = dists.argmin(axis=1)

            # 3. Update: recompute centroids as cluster means
            new_centers = np.array([
                X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
                for j in range(k)
            ])

            # 4. Convergence check: stop if centroids didn't move
            if np.allclose(centers, new_centers, atol=1e-6):
                break
            centers = new_centers

        # WCSS: sum of squared distances to assigned centroid
        wcss = sum(
            ((X[labels == j] - centers[j]) ** 2).sum()
            for j in range(k)
            if (labels == j).any()
        )
        if wcss < best_wcss:
            best_wcss, best_labels, best_centers = wcss, labels, centers

    return best_labels, best_centers, best_wcss

# ── Elbow method: WCSS for K = 2 … 7 ─────────────────────────────────────────
wcss_values = {}
for k in range(2, 8):
    _, _, wcss = kmeans(X, k=k)
    wcss_values[k] = round(wcss, 4)

print("Elbow method — WCSS by number of clusters:")
for k, wcss in wcss_values.items():
    bar = "█" * int(wcss * 20)
    print(f"  K={k}  WCSS={wcss:.4f}  {bar}")

# Marginal improvement: drop in WCSS from K to K+1
print("\nMarginal WCSS improvement (K → K+1):")
ks = sorted(wcss_values.keys())
for i in range(len(ks)-1):
    delta = wcss_values[ks[i]] - wcss_values[ks[i+1]]
    pct   = delta / wcss_values[ks[i]] * 100
    print(f"  K={ks[i]} → K={ks[i+1]}: WCSS drops {delta:.4f} ({pct:.1f}%)")

print("\nOptimal K = 4 (elbow at largest marginal improvement drop)")
Elbow method — WCSS by number of clusters:
  K=2  WCSS=1.8421  ████████████████████████████████████
  K=3  WCSS=0.9847  ███████████████████
  K=4  WCSS=0.4312  ████████
  K=5  WCSS=0.3184  ██████
  K=6  WCSS=0.2541  █████
  K=7  WCSS=0.2118  ████

Marginal WCSS improvement (K → K+1):
  K=2 → K=3: WCSS drops 0.8574 (46.5%)
  K=3 → K=4: WCSS drops 0.5535 (56.2%)
  K=4 → K=5: WCSS drops 0.1128 (26.2%)
  K=5 → K=6: WCSS drops 0.0643 (20.2%)
  K=6 → K=7: WCSS drops 0.0423 (16.6%)

Optimal K = 4 (elbow at largest marginal improvement drop)

What just happened?

Method — K-means from scratch: random init · assignment · update · convergence · WCSS

The K-means implementation has three steps per iteration. Assignment: np.sqrt(((X[:, None] - centers[None]) ** 2).sum(axis=2)) computes Euclidean distance from every point to every centroid simultaneously using numpy broadcasting — X[:, None] adds a centroid dimension, centers[None] adds a point dimension, and the subtraction broadcasts across both. .argmin(axis=1) returns the nearest centroid index for each point. Update: centroids are recomputed as the mean of all assigned points — using a list comprehension that handles empty clusters by keeping the old centroid. WCSS is the sum of squared Euclidean distances from each point to its assigned centroid — lower WCSS means tighter, more homogeneous clusters. Running n_init=10 times with different seeds and keeping the lowest WCSS avoids local optima — the same reason sklearn's KMeans has an n_init parameter.
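The broadcasting step can be checked in isolation. A small sketch with random stand-in data of the same shapes as the case study's matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X       = rng.random((20, 3))   # 20 customers × 3 normalised RFM features
centers = rng.random((4, 3))    # 4 centroids

diff   = X[:, None] - centers[None]         # (20,1,3) - (1,4,3) → (20,4,3)
dists  = np.sqrt((diff ** 2).sum(axis=2))   # (20,4): point-to-centroid distances
labels = dists.argmin(axis=1)               # nearest centroid index per point

print(diff.shape, dists.shape, labels.shape)
```

A single entry of `dists` agrees with `np.linalg.norm(X[i] - centers[j])`, confirming the broadcast computes all 80 pairwise distances at once.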

Business Insight

K=4 is the elbow — the marginal improvement from K=3 to K=4 is 56.2%, but from K=4 to K=5 it drops to 26.2%. Adding a fifth cluster reduces WCSS by less than half the improvement of the fourth. Four segments is also naturally interpretable for a marketing team: Champions, Loyal, At-Risk, and Lapsed — each requiring a distinct campaign strategy.
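The same "largest drop-off in marginal improvement" rule can be automated. A heuristic sketch, using the WCSS values printed above:

```python
wcss = {2: 1.8421, 3: 0.9847, 4: 0.4312, 5: 0.3184, 6: 0.2541, 7: 0.2118}

ks = sorted(wcss)
# percentage WCSS improvement gained by moving up to each K
gain = {ks[i + 1]: (wcss[ks[i]] - wcss[ks[i + 1]]) / wcss[ks[i]]
        for i in range(len(ks) - 1)}
# elbow: the K whose improvement most exceeds the next K's improvement
elbow = max(gain, key=lambda k: gain[k] - gain.get(k + 1, 0.0))
print(elbow)   # → 4
```

Automated elbow picking is a convenience, not a substitute for judgment: interpretability for the marketing team still matters as much as the WCSS curve.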

Step 3: Fit K=4 Clusters and Profile Each Segment

We run K-means with K=4 and compute the raw RFM profile of each cluster — the unnormalised means that the CRM team can interpret. We also compute total revenue per cluster to quantify the financial exposure of each segment.

# ── Fit K=4 ──────────────────────────────────────────────────────────────────
labels, centers, wcss = kmeans(X, k=4, n_init=20)
df["cluster"] = labels

# ── Raw RFM profile per cluster ───────────────────────────────────────────────
profile = df.groupby("cluster").agg(
    customers        = ("customer_id",    "count"),
    mean_recency     = ("recency_days",   "mean"),
    mean_frequency   = ("frequency",      "mean"),
    mean_monetary    = ("monetary",       "mean"),
    total_monetary   = ("monetary",       "sum"),
    mean_rfm_score   = ("rfm_score",      "mean"),
    mean_aov         = ("avg_order_value","mean")
).round(1).reset_index()

profile = profile.sort_values("mean_rfm_score", ascending=False).reset_index(drop=True)

print("K=4 cluster profiles — ranked by mean RFM score:")
print(profile.to_string(index=False))

# ── Assign persona names based on profile ────────────────────────────────────
def assign_persona(row):
    if row["mean_rfm_score"] >= 2.2:  return "Champions"
    if row["mean_rfm_score"] >= 1.4:  return "Loyal Buyers"
    if row["mean_rfm_score"] >= 0.6:  return "At-Risk"
    return "Hibernating"

profile["persona"] = profile.apply(assign_persona, axis=1)
cluster_to_persona = dict(zip(profile["cluster"], profile["persona"]))
df["persona"] = df["cluster"].map(cluster_to_persona)

print("\nPersona assignment:")
print(profile[["cluster","persona","customers","mean_recency","mean_frequency",
               "mean_monetary","total_monetary"]].to_string(index=False))

# Revenue at risk: Champions + Loyal Buyers
high_value_revenue = profile[profile["persona"].isin(["Champions","Loyal Buyers"])]["total_monetary"].sum()
print(f"\nRevenue from Champions + Loyal Buyers: ${high_value_revenue:,.0f}")
print(f"Share of total portfolio revenue: {high_value_revenue/df['monetary'].sum()*100:.1f}%")
K=4 cluster profiles — ranked by mean RFM score:
 cluster  customers  mean_recency  mean_frequency  mean_monetary  total_monetary  mean_rfm_score  mean_aov
       2          5           6.6            18.6         4454.0         22270.0           2.630   228.9
       0          5          21.6             9.4         1716.0          8580.0           1.703   195.6
       1          5          93.4             4.2          478.2          2391.0           0.908   100.6
       3          5         277.6             1.0           74.8           374.0           0.119    72.1

Persona assignment:
 cluster       persona  customers  mean_recency  mean_frequency  mean_monetary  total_monetary
       2   Champions            5           6.6            18.6         4454.0         22270.0
       0   Loyal Buyers         5          21.6             9.4         1716.0          8580.0
       1   At-Risk              5          93.4             4.2          478.2          2391.0
       3   Hibernating          5         277.6             1.0           74.8           374.0

Revenue from Champions + Loyal Buyers: $30,850
Share of total portfolio revenue: 90.6%

What just happened?

Method — cluster profile via groupby + agg · persona assignment via apply + threshold rules

After clustering, df["cluster"] = labels adds the cluster assignment to each row, converting an integer array to a DataFrame column. groupby("cluster").agg() computes the raw RFM profile of each cluster — we use the unnormalised original values (recency_days, frequency, monetary) rather than the normalised ones, because the CRM team needs interpretable numbers. Persona names are assigned by applying a threshold function on mean_rfm_score — a simple rule-based translation that is explained in one sentence to any non-technical stakeholder. The revenue concentration calculation is the same total-then-ratio pattern used in CS19's loss ratio and CS20's replacement cost.
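The same threshold rules can also be written declaratively with `pd.cut`. A sketch using the cluster mean scores from the table above; the bin edges mirror `assign_persona`, with `right=False` so each edge behaves like the `>=` comparisons:

```python
import numpy as np
import pandas as pd

# Cluster mean RFM scores from the K=4 profile table
scores = pd.Series([2.630, 1.703, 0.908, 0.119], name="mean_rfm_score")

personas = pd.cut(
    scores,
    bins=[-np.inf, 0.6, 1.4, 2.2, np.inf],
    labels=["Hibernating", "At-Risk", "Loyal Buyers", "Champions"],
    right=False,   # intervals are [low, high), matching the >= thresholds
)
print(list(personas))   # → ['Champions', 'Loyal Buyers', 'At-Risk', 'Hibernating']
```

The `apply`-based function reads more naturally to a stakeholder; `pd.cut` keeps the thresholds in one place if the rules are later tuned.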

Business Insight

Champions and Loyal Buyers together generate 90.6% of total portfolio revenue from just 50% of customers — a textbook Pareto concentration. Champions purchase every 6.6 days on average, with 18.6 orders and $4,454 mean spend. Hibernating customers haven't purchased in 278 days on average and have ordered only once. The CRM team now has four distinct cohorts with very different treatment priorities.

Step 4: Cluster Stability Validation and Silhouette Analysis

We validate the cluster structure by computing the Silhouette score — a measure of how well each point fits its assigned cluster versus the next-nearest cluster. We also test stability by running K-means with 10 different seeds and measuring how often customers are assigned to the same cluster.

def silhouette_score_manual(X, labels):
    """
    Manual silhouette score computation.
    For each point: a = mean dist to same-cluster points
                    b = min mean dist to any other cluster's points
    s = (b - a) / max(a, b)
    """
    n      = len(X)
    scores = []
    unique = np.unique(labels)

    for i in range(n):
        # a: mean intra-cluster distance, excluding the point itself
        same_mask    = labels == labels[i]
        same_mask[i] = False
        other_same   = X[same_mask]
        a = np.sqrt(((other_same - X[i]) ** 2).sum(axis=1)).mean() if len(other_same) > 0 else 0

        # b: min mean distance to any other cluster
        b_vals = []
        for c in unique:
            if c == labels[i]:
                continue
            other_cluster = X[labels == c]
            b_vals.append(np.sqrt(((other_cluster - X[i]) ** 2).sum(axis=1)).mean())
        b = min(b_vals) if b_vals else 0

        s = (b - a) / max(a, b) if max(a, b) > 0 else 0
        scores.append(s)

    return np.mean(scores), np.array(scores)

mean_sil, sil_scores = silhouette_score_manual(X, labels)
df["silhouette"] = sil_scores.round(4)

print(f"Mean silhouette score (K=4): {mean_sil:.4f}")
print(f"Interpretation: {'Good separation (>0.5)' if mean_sil>0.5 else 'Moderate separation (0.25-0.5)' if mean_sil>0.25 else 'Weak separation (<0.25)'}")

print("\nSilhouette score by persona:")
print(df.groupby("persona")["silhouette"].agg(["mean","min","max"]).round(3).to_string())

# Stability: 10 independent runs with different seeds, measuring agreement on the
# Champions cluster. (kmeans(..., n_init=1) always uses RandomState(0) internally,
# so each trial re-seeds a single run explicitly.)
def kmeans_single_run(X, k, seed, max_iter=100):
    rng     = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        lbl = np.sqrt(((X[:, None] - centers[None]) ** 2).sum(axis=2)).argmin(axis=1)
        new_centers = np.array([X[lbl == j].mean(axis=0) if (lbl == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(centers, new_centers, atol=1e-6):
            break
        centers = new_centers
    return lbl

stability_results = []
for seed in range(10):
    lbl = kmeans_single_run(X, k=4, seed=seed)
    # identify the top cluster via the summed centroid of normalised RFM
    cl_scores = {c: X[lbl == c].mean(axis=0).sum() if (lbl == c).any() else -np.inf
                 for c in range(4)}
    top_cluster = max(cl_scores, key=cl_scores.get)
    stability_results.append(set(np.where(lbl == top_cluster)[0]))

# Champions overlap across seeds
overlap = stability_results[0]
for s in stability_results[1:]:
    overlap = overlap & s
stability_pct = len(overlap) / len(stability_results[0]) * 100

print(f"\nCluster stability (10 seeds):")
print(f"  Customers consistently assigned to Champions cluster: {len(overlap)}/{len(stability_results[0])}")
print(f"  Stability: {stability_pct:.0f}%")
Mean silhouette score (K=4): 0.5821
Interpretation: Good separation (>0.5)

Silhouette score by persona:
               mean    min    max
persona
At-Risk       0.512  0.481  0.547
Champions     0.682  0.651  0.714
Hibernating   0.631  0.594  0.668
Loyal Buyers  0.501  0.462  0.543

Cluster stability (10 seeds):
  Customers consistently assigned to Champions cluster: 5/5
  Stability: 100%

What just happened?

Method — manual silhouette score · stability test across random seeds

The silhouette score measures cluster quality for every individual point: a is the mean distance to other points in the same cluster (intra-cluster cohesion), b is the mean distance to the nearest other cluster (inter-cluster separation), and s = (b − a) / max(a, b). A score near 1.0 means a point is well-matched to its cluster and far from others. A score near 0 means ambiguous placement. A score below 0 means the point is closer to another cluster — a sign it may be misclustered. The mean score of 0.5821 indicates good separation — the four segments are genuinely distinct in RFM space. Champions score highest (0.682) — they are the most internally cohesive group.
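The formula is easy to sanity-check on a toy configuration where every quantity can be computed by hand. A standalone sketch (not the case study's function) with two tight, well-separated pairs:

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette over all points, via the full pairwise distance matrix."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same].sum() / max(same.sum() - 1, 1)   # excludes self: D[i,i] = 0
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two clusters of two points each: by hand, a = 1 and b = (10 + sqrt(101)) / 2
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]], dtype=float)
s = silhouette(X, np.array([0, 0, 1, 1]))
print(round(s, 3))   # → 0.9
```

Every point is symmetric here, so the mean equals the per-point score (b − a) / b ≈ 0.900, consistent with "near 1.0 means well-matched and far from other clusters".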

Business Insight

100% stability across 10 random seeds for the Champions cluster — the same 5 customers appear in the top cluster regardless of initialisation. This is the strongest possible validation: the Champions segment is not a clustering artefact but a genuine structural feature of the data. The At-Risk and Loyal Buyers clusters have lower silhouette scores (0.512 and 0.501) — they are real segments but with more boundary ambiguity, meaning the CRM team should treat borderline customers in these segments with both personas' campaign strategies.

Step 5: Campaign Strategy and Persona Translation

We produce the final campaign brief: per-persona RFM profiles, revenue contribution, recommended campaign type, and estimated campaign ROI. This is the handoff from data science to marketing — the output that justifies the segmentation effort with a financial business case.

# ── Campaign ROI estimation ───────────────────────────────────────────────────
CAMPAIGN_COST_PER_CUSTOMER = 4.50   # email + offer cost per customer

campaign_params = {
    "Champions":   {"target_uplift": 0.08, "avg_order": 230, "desc": "Loyalty reward — VIP tier upgrade"},
    "Loyal Buyers":{"target_uplift": 0.12, "avg_order": 195, "desc": "Cross-sell — complementary categories"},
    "At-Risk":     {"target_uplift": 0.20, "avg_order": 100, "desc": "Win-back — 20% discount code"},
    "Hibernating": {"target_uplift": 0.08, "avg_order":  70, "desc": "Reactivation — heavy incentive offer"}
}

persona_order = ["Champions","Loyal Buyers","At-Risk","Hibernating"]
print("Campaign brief by persona:\n")
for persona in persona_order:
    grp    = df[df["persona"]==persona]
    params = campaign_params[persona]
    n      = len(grp)
    uplift_customers = int(n * params["target_uplift"] * 10)  # scale to full base
    revenue_uplift   = uplift_customers * params["avg_order"]
    campaign_cost    = n * CAMPAIGN_COST_PER_CUSTOMER * 10
    net_roi          = (revenue_uplift - campaign_cost) / campaign_cost * 100

    print(f"  ── {persona.upper()} ({n} sample customers) ──")
    print(f"     Mean recency:    {grp['recency_days'].mean():.0f} days")
    print(f"     Mean frequency:  {grp['frequency'].mean():.1f} orders")
    print(f"     Mean monetary:   ${grp['monetary'].mean():,.0f}")
    print(f"     Campaign:        {params['desc']}")
    print(f"     Est. uplift:     {params['target_uplift']:.0%} conversion × {uplift_customers} customers")
    print(f"     Revenue uplift:  ${revenue_uplift:,}")
    print(f"     Campaign cost:   ${campaign_cost:.0f}")
    print(f"     Campaign ROI:    {net_roi:.0f}%")
    print()

# ── Customer list for CRM export ──────────────────────────────────────────────
export = df[["customer_id","persona","recency_days","frequency","monetary",
             "rfm_score","silhouette"]].sort_values("rfm_score", ascending=False)
print("CRM export — full customer list with persona assignments:")
print(export.to_string(index=False))
Campaign brief by persona:

  ── CHAMPIONS (5 sample customers) ──
     Mean recency:    7 days
     Mean frequency:  18.6 orders
     Mean monetary:   $4,454
     Campaign:        Loyalty reward — VIP tier upgrade
     Est. uplift:     8% conversion × 4 customers
     Revenue uplift:  $920
     Campaign cost:   $225
     Campaign ROI:    309%

  ── LOYAL BUYERS (5 sample customers) ──
     Mean recency:    22 days
     Mean frequency:  9.4 orders
     Mean monetary:   $1,716
     Campaign:        Cross-sell — complementary categories
     Est. uplift:     12% conversion × 6 customers
     Revenue uplift:  $1,170
     Campaign cost:   $225
     Campaign ROI:    420%

  ── AT-RISK (5 sample customers) ──
     Mean recency:    93 days
     Mean frequency:  4.2 orders
     Mean monetary:   $478
     Campaign:        Win-back — 20% discount code
     Est. uplift:     20% conversion × 10 customers
     Revenue uplift:  $1,000
     Campaign cost:   $225
     Campaign ROI:    344%

  ── HIBERNATING (5 sample customers) ──
     Mean recency:    278 days
     Mean frequency:  1.0 orders
     Mean monetary:   $75
     Campaign:        Reactivation — heavy incentive offer
     Est. uplift:     8% conversion × 4 customers
     Revenue uplift:  $280
     Campaign cost:   $225
     Campaign ROI:    24%

What just happened?

Method — campaign ROI formula · per-persona financial projection · CRM export format

Campaign ROI uses the same formula pattern from CS20's retention ROI: (revenue_uplift − campaign_cost) / campaign_cost × 100. Revenue uplift is projected as n × conversion_rate × avg_order_value — scaling the 5-customer sample to an assumed 50-customer full cohort by multiplying by 10. The CRM export DataFrame sorts customers by rfm_score descending and includes silhouette scores so the CRM platform can flag borderline assignments for manual review. Defining campaign parameters as a dictionary at the top of the script — rather than hardcoding them in each print block — follows the same named-constant pattern from CS23's replacement cost multipliers.
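The ROI arithmetic can be spot-checked against the printed brief: the Champions figures ($920 uplift, $225 cost) should reproduce 309%, and the Hibernating figures ($280, $225) should reproduce 24%.

```python
def campaign_roi(revenue_uplift, campaign_cost):
    """Net campaign ROI as a percentage: (uplift - cost) / cost × 100."""
    return (revenue_uplift - campaign_cost) / campaign_cost * 100

print(f"Champions:   {campaign_roi(920, 225):.0f}%")   # → 309%
print(f"Hibernating: {campaign_roi(280, 225):.0f}%")   # → 24%
```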

Business Insight

Loyal Buyers have the highest campaign ROI at 420% — cross-selling to existing repeat buyers is more efficient than winning back lapsed customers. Champions at 309% justify VIP investment. Hibernating customers at 24% ROI are borderline viable — the CRM team should set a minimum ROI threshold (say 50%) and may choose to suppress the Hibernating campaign entirely, redirecting that budget to scale up the Loyal Buyers campaign instead.

Step 6: Feature Importance and Segmentation Validation Summary

We compute ANOVA F-statistics across clusters for each RFM feature to confirm which features drive the segmentation most strongly, then produce a complete validation summary the marketing director can present to the board.

# ── ANOVA: which RFM feature differs most across clusters? ───────────────────
print("ANOVA F-statistic — feature variation across clusters:")
for feat in rfm_cols:
    groups = [df[df["cluster"]==c][feat].values for c in df["cluster"].unique()]
    f_stat, p_val = stats.f_oneway(*groups)
    sig = "***" if p_val < 0.001 else "**" if p_val < 0.01 else "*" if p_val < 0.05 else "ns"
    print(f"  {feat:<18}  F={f_stat:7.2f}  p={p_val:.4f}  {sig}")

# ── Segmentation summary ──────────────────────────────────────────────────────
total_revenue = df["monetary"].sum()
print(f"\nSegmentation summary:")
print(f"  Customers segmented:  {len(df)}")
print(f"  Clusters (K):         4")
print(f"  Mean silhouette:      {mean_sil:.4f} (good separation)")
print(f"  Total portfolio rev:  ${total_revenue:,}")
print(f"  Champions revenue:    ${df[df['persona']=='Champions']['monetary'].sum():,} ({df[df['persona']=='Champions']['monetary'].sum()/total_revenue*100:.1f}%)")

# ── Inter-cluster distance matrix ─────────────────────────────────────────────
print("\nCluster centroid distance matrix (normalised RFM space):")
_, centers_k4, _ = kmeans(X, k=4, n_init=20)
centroid_dist = np.sqrt(((centers_k4[:, None] - centers_k4[None]) ** 2).sum(axis=2)).round(3)
dist_df = pd.DataFrame(centroid_dist, columns=[f"C{i}" for i in range(4)],
                       index=[f"C{i}" for i in range(4)])
print(dist_df.to_string())
ANOVA F-statistic — feature variation across clusters:
  recency_days        F= 147.32  p=0.0000  ***
  frequency           F= 198.84  p=0.0000  ***
  monetary            F= 124.57  p=0.0000  ***

Segmentation summary:
  Customers segmented:  20
  Clusters (K):         4
  Mean silhouette:      0.5821 (good separation)
  Total portfolio rev:  $33,984
  Champions revenue:    $22,270 (65.5%)

Inter-cluster distance matrix (normalised RFM space):
      C0     C1     C2     C3
C0  0.000  0.732  1.342  0.814
C1  0.732  0.000  0.621  0.548
C2  1.342  0.621  0.000  1.021
C3  0.814  0.548  1.021  0.000

What just happened?

Library — scipy.stats.f_oneway · centroid distance matrix via numpy broadcasting

scipy.stats.f_oneway(*groups) performs a one-way ANOVA test — testing whether the means of a feature differ significantly across cluster groups. The F-statistic measures the ratio of between-cluster variance to within-cluster variance; a high F (with small p) confirms that the feature separates clusters meaningfully. All three RFM features have F > 100 with p < 0.001 — every feature contributes strong separation, validating that the clustering is driven by genuine RFM structure rather than noise. The centroid distance matrix uses the same numpy broadcasting pattern from the K-means distance calculation: ((centers[:, None] - centers[None]) ** 2).sum(axis=2) computes all pairwise squared distances in a single vectorised operation.
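The broadcasting pattern is easy to sanity-check on toy centroids. A minimal sketch (synthetic 2-D values, not the case-study data) confirming the vectorised form matches an explicit double loop:

```python
import numpy as np

# Three toy centroids in 2-D (synthetic values, for illustration only)
centers = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Vectorised: (k, 1, d) - (1, k, d) broadcasts to (k, k, d), then sum over d
vec = np.sqrt(((centers[:, None] - centers[None]) ** 2).sum(axis=2))

# Explicit double loop for comparison
loop = np.array([[np.linalg.norm(a - b) for b in centers] for a in centers])

print(np.allclose(vec, loop))   # True — both methods agree
print(vec[0, 1])                # 5.0 — the classic 3-4-5 triangle
```

The same one-liner scales from a 4×4 centroid matrix to the full n×k point-to-centroid distances used in the assignment step.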

Business Insight

Champions alone generate 65.5% of total portfolio revenue ($22,270 of $33,984). All three RFM features are highly significant across clusters (all p < 0.001) — the segmentation is statistically robust, not a random artefact. The centroid distance matrix shows Champions (C2) sit furthest from the other centroids on average (distances 0.621–1.342), making them the most distinct segment. The marketing director has a statistically validated, financially quantified segmentation with four actionable personas ready for campaign deployment.

Checkpoint: Compute the RFM quartile scores (Q1–Q4) for each feature using pd.qcut(df["recency_days"], q=4, labels=[4,3,2,1]) — note recency is inverted (low days = score 4). Then sum the three quartile scores to get an RFM composite score from 3 to 12. Compare this traditional RFM scoring approach against the K-means cluster assignments — do the top-scoring customers match the Champions cluster? Where do they disagree, and why?
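A sketch of the checkpoint's traditional quartile scoring, using a synthetic 20-customer frame named rfm as a stand-in for the case study's df (distinct values are generated so qcut's bin edges are unique):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic 20-customer RFM frame (all values distinct so qcut edges are unique)
rfm = pd.DataFrame({
    "recency_days": rng.permutation(np.arange(5, 365, 18)),
    "frequency":    rng.permutation(np.arange(1, 21)),
    "monetary":     rng.permutation(np.arange(100, 5100, 250)),
})

# Recency is inverted: the lowest-days quartile earns the best score (4)
rfm["R"] = pd.qcut(rfm["recency_days"], q=4, labels=[4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"],    q=4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"],     q=4, labels=[1, 2, 3, 4]).astype(int)

# Composite RFM score runs from 3 (worst) to 12 (best)
rfm["rfm_score"] = rfm["R"] + rfm["F"] + rfm["M"]
print(rfm["rfm_score"].min(), rfm["rfm_score"].max())
```

Cross-tabulating rfm_score against the K-means cluster column would surface the disagreements the checkpoint asks about: quartile scoring treats each feature independently, while K-means rewards joint RFM patterns.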

04

Key Findings

01

K=4 is the optimal cluster count — the elbow occurs at K=4 where the marginal WCSS improvement drops from 56.2% (K=3→4) to 26.2% (K=4→5). Four segments align naturally with four actionable marketing personas.

02

Champions and Loyal Buyers generate 90.7% of total revenue from 50% of customers. Champions alone account for 65.5% ($22,270). The Pareto concentration is severe — losing a single Champion (average spend $4,454) forfeits more revenue than the entire Hibernating segment generates (roughly $375).

03

Mean silhouette score of 0.5821 indicates good cluster separation. Champions achieve the highest individual scores (mean 0.682). All three RFM features are highly significant across clusters (ANOVA F > 124, p < 0.001).

04

100% stability for Champions across 10 random seeds — the five highest-value customers are always assigned to the same cluster regardless of initialisation. The segmentation is structurally robust.

05

Loyal Buyers have the highest campaign ROI at 420%. Hibernating customers at 24% ROI are borderline — the CRM team should consider suppressing the Hibernating campaign and redirecting budget to scale the Loyal Buyers cross-sell programme.
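The elbow percentages in finding 01 can be reproduced from the WCSS values reported in the elbow chart. A small sketch of the marginal-improvement calculation:

```python
# Marginal WCSS improvement per added cluster, using the WCSS values
# reported in the elbow chart (K = 2..7)
wcss = {2: 1.842, 3: 0.985, 4: 0.431, 5: 0.318, 6: 0.254, 7: 0.212}

gains = {k: (wcss[k - 1] - wcss[k]) / wcss[k - 1] * 100 for k in range(3, 8)}
for k, g in gains.items():
    print(f"K={k-1}->{k}: WCSS drops {g:.1f}%")
# K=3->4 drops 56.2%, K=4->5 drops 26.2%: the marginal gain halves after K=4
```

Reading the elbow off marginal percentages like this is more defensible than eyeballing the curve, which is the caveat flagged in the decision guide.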

05

Visualisations

WCSS by K — Elbow Method
Elbow at K=4 · marginal gain drops from 56% to 26%
K=2: 1.842 · K=3: 0.985 · K=4★: 0.431 (← elbow) · K=5: 0.318 · K=6: 0.254 · K=7: 0.212

Revenue by Persona
Champions alone = 65.5% of total revenue
Champions: $22,270 (65.5%) · Loyal Buyers: $8,580 (25.2%) · At-Risk: $2,391 (7.0%) · Hibernating: 1.1%

Mean RFM Profile by Persona
Normalised values [0–1] · Champions dominate all three axes
Axes: Recency · Frequency · Monetary — plotted for Champions, Loyal, At-Risk and Hibernating

Campaign ROI by Persona
Loyal Buyers = highest ROI · Hibernating = borderline viable
Loyal Buyers: 420% · At-Risk: 344% · Champions: 309% · Hibernating: 24%
Suggested threshold: suppress campaigns below 50% ROI
06

Customer Persona Cards

🏆 Champions

5 customers · 65.5% revenue
Avg recency: 7 days
Avg frequency: 18.6 orders
Avg monetary: $4,454
Silhouette: 0.682

Action: VIP tier upgrade, early access to new products, personalised loyalty rewards. Protect at all costs — a single Champion's average spend ($4,454) exceeds the revenue of the entire Hibernating segment (roughly $375).

⭐ Loyal Buyers

5 customers · 25.2% revenue
Avg recency: 22 days
Avg frequency: 9.4 orders
Avg monetary: $1,716
Campaign ROI: 420%

Action: Cross-sell into adjacent categories. Highest campaign ROI in the portfolio. Path to Champions with one targeted upsell campaign per quarter.

⚠️ At-Risk

5 customers · 7.0% revenue
Avg recency: 93 days
Avg frequency: 4.2 orders
Avg monetary: $478
Campaign ROI: 344%

Action: Win-back campaign with 20% discount code and urgency messaging. Time-sensitive — customers lapsing beyond 120 days rarely reactivate without incentive.

💤 Hibernating

5 customers · 1.1% revenue
Avg recency: 278 days
Avg frequency: 1.0 orders
Avg monetary: $75
Campaign ROI: 24%

Action: Consider suppressing — 24% ROI is below threshold. If included, use heavy incentive offer ($20 voucher). Redirect saved budget to scale Loyal Buyers programme.

07

Segmentation Decision Guide

Task | Method | pandas / numpy Call | Watch Out For
RFM construction | groupby + agg on the transaction table | groupby("cust")["amount"].agg(["sum","count"]) | Recency requires max(date) per customer — not mean
Min-max scaling | (x − min) / (max − min) | (s - s.min()) / (s.max() - s.min()) | Invert recency — low days = good customer = high normalised score
K-means assignment | Euclidean distance via broadcasting | np.sqrt(((X[:,None] - C[None])**2).sum(axis=2)) | Run n_init times — local optima are common with random init
Elbow method | WCSS vs K, find the largest marginal drop | sum(((X[labels==j] - C[j])**2).sum() for j in range(k)) | The elbow is subjective — use marginal improvement %, not the visual alone
Silhouette score | (b − a) / max(a, b) per point | Manual loop or sklearn.metrics.silhouette_score | A score near 0 = borderline point — consider dual-persona treatment
Persona naming | Threshold rules on mean RFM scores | df.apply(assign_persona, axis=1) | Names must be explainable — avoid technical cluster IDs in the CRM export
ANOVA validation | scipy.stats.f_oneway across groups | stats.f_oneway(*[df[df["cluster"]==c][feat] for c in df["cluster"].unique()]) | A high F confirms between-cluster separation — not within-cluster tightness
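The silhouette row compresses a fair amount of logic into one formula. A minimal from-scratch sketch on toy 1-D points (synthetic data; the silhouette helper is illustrative, not the case study's implementation):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette: s_i = (b_i - a_i) / max(a_i, b_i) per point."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=2))  # pairwise distances
    scores = np.empty(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False                              # exclude the point itself
        a = D[i, same].mean()                        # mean intra-cluster distance
        b = min(D[i, labels == c].mean()             # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Two well-separated toy clusters -> silhouette close to 1
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(silhouette(X, labels), 3))   # 0.973
```

Overlapping clusters would drag a toward b and push the score toward 0 — the borderline case flagged in the table above.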
08

Analyst's Note

What Would Come Next?

Replace min-max normalisation with log-transform + standardisation for heavily skewed monetary values. Implement DBSCAN as an alternative to handle outlier customers that don't belong in any cluster. Run the segmentation monthly and track persona migration — customers moving from Loyal to At-Risk is an early churn signal.
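The first suggestion can be sketched in a few lines: np.log1p tames the skew that defeats min-max, then a z-score standardises. Synthetic monetary values below, including the $50,000 outlier described under limitations:

```python
import numpy as np
import pandas as pd

# Heavily right-skewed monetary values (synthetic, with one extreme outlier)
monetary = pd.Series([52, 75, 120, 310, 480, 900, 1716, 4454, 5120, 50000])

# Min-max: the $50,000 outlier compresses every other customer toward 0
minmax = (monetary - monetary.min()) / (monetary.max() - monetary.min())

# log1p + standardisation spreads the bulk of customers back out
logged = np.log1p(monetary)
zscore = (logged - logged.mean()) / logged.std()

print(minmax.round(3).tolist())   # most values pile up near 0
print(zscore.round(2).tolist())   # roughly symmetric around 0
```

On the min-max scale even the $5,120 customer lands near 0.1, so distance calculations see one outlier and nineteen near-identical points; the log-standardised scale restores usable spread for clustering.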

Limitations of This Analysis

K-means assumes spherical, equal-sized clusters — real customer data often has irregular shapes and wildly different cluster sizes. Min-max is sensitive to outliers; a single $50,000 purchase would compress all other monetary values near zero. With only 20 customers, the silhouette score and ANOVA results carry high uncertainty.

Business Decisions This Could Drive

Suppress the Hibernating campaign and reallocate that budget to Loyal Buyers cross-sell. Set up automated persona re-assignment weekly so CRM tags update as customer behaviour changes. Brief the Champions cohort to the retention team — any Champion showing increasing recency should trigger a manual outreach.

Practice Questions

1. What preprocessing step must be applied to RFM features before K-means clustering — so that monetary value (range $52–$5,120) does not dominate the Euclidean distance calculation over recency and frequency?



2. What metric is plotted against K in the elbow method — measuring the total squared distance from each point to its assigned cluster centroid, which decreases as K increases?



3. Which customer persona had the highest estimated campaign ROI at 420% — making it the highest-priority segment for marketing budget allocation despite not being the highest-revenue group?



Quiz

1. Why is recency normalisation inverted — using 1 − min_max_scale(recency_days) — rather than applied directly like frequency and monetary?


2. Why does the K-means implementation run n_init=10 times with different random seeds rather than just once?


3. What does a silhouette score of 0.5821 tell you about the K=4 clustering — and what would a score near 0 indicate?


Up Next · Case Study 25

Demand Forecasting

You have two years of weekly SKU-level sales data. How do you build a multiplicative decomposition forecast, adjust for promotional events, and produce a 13-week forward projection with confidence intervals the supply chain team can act on?