DS Case Studies
Feature Engineering for Fraud Detection
A raw transaction record contains an account ID, an amount, a timestamp, and a merchant. That's it. No model can detect fraud from those four fields alone. Feature engineering is the process of transforming that raw record into dozens of signals — velocity counts, ratio deviations, behavioural baselines — that give a model something meaningful to learn from.
You are a senior data scientist at Apex Financial Services. The fraud modelling team has a dataset of 20 transactions and a deadline: the feature matrix must be ready before Thursday's model training sprint. The raw data is nearly useless on its own — what matters is what each transaction looks like relative to the account's own history. How many transactions in the last hour? How does this amount compare to the account's typical spend? Is this merchant category unusual for this customer? Engineering those signals is your job before a single model weight is fit.
What This Case Study Covers
Feature engineering is where domain knowledge becomes model signal. This case study builds six families of features from a flat transaction table using one central pattern: groupby + transform. Unlike groupby + agg (which collapses rows into group summaries), groupby + transform broadcasts the group statistic back to every row in that group — so each transaction carries its account's historical average without losing the row-level detail. Every feature built here follows this pattern or its rolling-window variant.
The key pattern introduced: groupby().transform() — computing a per-group statistic (mean, std, count) and assigning it back to every row in that group. This produces account-level context features on every transaction row without any join or merge. Combined with pd.to_datetime() and .dt accessors for time-based features, and z-score normalisation for behavioural deviation scoring, this builds a complete feature engineering pipeline from scratch.
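The agg-versus-transform contrast can be seen on a three-row toy frame (hypothetical accounts, not the Apex data):

```python
import pandas as pd

# Toy frame: two accounts, three transactions (illustrative data only)
df = pd.DataFrame({
    "account_id": ["A", "A", "B"],
    "amount": [100.0, 300.0, 50.0],
})

# .agg collapses to one row per group...
agg_mean = df.groupby("account_id")["amount"].agg("mean")

# ...while .transform broadcasts the group statistic back to every row,
# so the result aligns with the original index and can be assigned directly
df["acct_mean"] = df.groupby("account_id")["amount"].transform("mean")

print(agg_mean.tolist())         # two groups -> two values
print(df["acct_mean"].tolist())  # three rows -> three values
```

The same-length output of transform is what lets a baseline column sit beside the raw amount on every transaction row.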
The Feature Engineering Toolkit
Temporal Features from Datetime
Hour of day, day of week, and is_weekend flags extracted from the transaction timestamp. Fraud concentrates at specific times — late-night transactions, weekend bursts, unusual hours for a given merchant. These are the cheapest features to compute and among the most predictive.
Account-Level Baseline Features via groupby + transform
Mean and standard deviation of transaction amount per account, broadcast to every row. These create the denominator for deviation scoring — without them, a $500 transaction is uninterpretable. With them, you know it's 3.2 standard deviations above that account's typical spend.
Amount Deviation Z-Score
The z-score of each transaction's amount relative to the account's baseline: (amount − account_mean) / account_std. A z-score above 2.5 is a strong anomaly signal. This is the feature that turns a flat dollar amount into a contextualised behavioural deviation score.
Velocity Features — Transaction Count in Window
How many transactions did this account make in the last N hours? High velocity — many transactions in a short window — is the canonical fraud signal. Engineering this requires sorting by timestamp, grouping by account, and a rolling count within the time window.
Ratio Features — Amount vs Account History
Amount divided by account mean spend gives an interpretable ratio: 1.0 = typical, 3.0 = three times usual. Unlike the z-score, the ratio is always positive and easier to threshold — "flag if ratio > 2.5" is a rule any fraud analyst can explain and audit.
Merchant Category Novelty Flag
Has this account ever transacted with this merchant category before? A first-time merchant category purchase — especially for a high-value transaction — is a strong fraud signal. This requires a per-account count of each merchant category (groupby on account and category with a transform count) and a flag for categories appearing only once.
Dataset Overview
The Apex Financial raw transaction extract contains 20 transaction records across five accounts, with timestamps, merchant categories, amounts, and fraud labels. Built inline — in production this would be a streaming API feed from the card processing system.
| txn_id | account_id | timestamp | amount | merchant_cat | channel | is_fraud |
|---|---|---|---|---|---|---|
| T001 | ACC001 | 2024-03-15 02:14:00 | 847.50 | Electronics | Online | 1 |
| T002 | ACC002 | 2024-03-15 09:32:00 | 42.80 | Grocery | In-store | 0 |
| T003 | ACC001 | 2024-03-15 02:47:00 | 623.20 | Electronics | Online | 1 |
| T004 | ACC003 | 2024-03-15 14:22:00 | 128.60 | Restaurant | In-store | 0 |
| T005 | ACC001 | 2024-03-15 03:01:00 | 991.00 | Jewellery | Online | 1 |
Showing first 5 of 20 rows · 7 columns
Unique transaction reference. Used for counting and joining engineered features back to the raw row.
Account identifier. The primary grouping key for all account-level baseline and velocity features.
Transaction datetime. Source of all temporal features — hour, day of week, is_weekend, and velocity window calculations.
Transaction amount. Input to the z-score, ratio, and log-transform features. The raw amount alone is nearly useless without account baseline context.
Merchant category. Used to compute category novelty flag — whether this category is unusual for this account based on historical frequency.
Online or In-store. Encoded as a binary feature — online transactions have higher fraud rates across most card portfolios.
Ground truth label. Used to validate that engineered features separate fraud from legitimate transactions — not used as input to any feature.
Business Questions
The fraud modelling team needs these five answers before Thursday's training sprint.
Which temporal features — hour, day of week, is_weekend — best separate fraudulent from legitimate transactions in this dataset?
What is each transaction's amount z-score relative to its account's baseline — and which transactions are statistical outliers?
How does transaction velocity (count in the last 2 hours) differ between fraudulent and legitimate transactions?
Which engineered features have the strongest correlation with the fraud label — and which should be prioritised for model training?
What does the final feature matrix look like — and are the fraud transactions cleanly separable from legitimate ones using engineered features alone?
Step-by-Step Analysis
The scenario:
The raw transaction extract arrived Wednesday morning. The model training sprint is Thursday. Every feature must be engineered, validated against the fraud label, and assembled into a clean feature matrix by end of day. Start with temporal features — they're free — then build the account-level baselines, deviation scores, velocity counts, and novelty flags.
We parse the timestamp column with pd.to_datetime() and extract hour, day of week, is_weekend, and is_night flags using .dt accessors. These temporal features require zero domain knowledge to engineer but carry substantial predictive signal — fraud has well-known time-of-day and day-of-week patterns.
import pandas as pd
import numpy as np
from scipy import stats
# ── Raw transaction data ─────────────────────────────────────────────────────
df = pd.DataFrame({
"txn_id": [f"T{i:03d}" for i in range(1, 21)],
"account_id": ["ACC001","ACC002","ACC001","ACC003","ACC001",
"ACC002","ACC003","ACC004","ACC001","ACC002",
"ACC003","ACC005","ACC004","ACC002","ACC003",
"ACC005","ACC004","ACC001","ACC005","ACC004"],
"timestamp": ["2024-03-15 02:14","2024-03-15 09:32","2024-03-15 02:47",
"2024-03-15 14:22","2024-03-15 03:01","2024-03-15 10:15",
"2024-03-15 16:44","2024-03-15 11:30","2024-03-15 03:22",
"2024-03-16 08:45","2024-03-16 13:10","2024-03-16 09:20",
"2024-03-16 14:55","2024-03-16 11:05","2024-03-16 15:30",
"2024-03-16 10:40","2024-03-16 16:20","2024-03-17 01:50",
"2024-03-17 11:15","2024-03-17 13:45"],
"amount": [847.50, 42.80, 623.20, 128.60, 991.00, 38.40, 94.20,
215.30, 775.60, 51.20, 142.80, 88.60, 198.40, 44.10,
118.70, 92.30, 187.60, 654.90, 78.40, 221.50],
"merchant_cat":["Electronics","Grocery","Electronics","Restaurant","Jewellery",
"Grocery","Pharmacy","Clothing","Electronics","Grocery",
"Restaurant","Fuel","Clothing","Grocery","Restaurant",
"Fuel","Clothing","Electronics","Entertainment","Clothing"],
"channel": ["Online","In-store","Online","In-store","Online",
"In-store","In-store","Online","Online","In-store",
"In-store","In-store","Online","In-store","In-store",
"In-store","Online","Online","Online","Online"],
"is_fraud": [1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0]
})
# ── Step 1: Parse datetime and extract temporal features ─────────────────────
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek # 0=Mon … 6=Sun
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
df["is_night"] = ((df["hour"] >= 22) | (df["hour"] < 6)).astype(int)
df["is_online"] = (df["channel"] == "Online").astype(int)
# Validate: temporal separation between fraud and legit
print("Temporal feature means — Fraud vs Legitimate:")
for feat in ["hour","is_weekend","is_night","is_online"]:
fraud_mean = df[df["is_fraud"]==1][feat].mean()
legit_mean = df[df["is_fraud"]==0][feat].mean()
print(f" {feat:<14} Fraud: {fraud_mean:.2f} Legit: {legit_mean:.2f}")
print(f"\nTransaction count by hour block:")
df["hour_block"] = pd.cut(df["hour"], bins=[0,6,12,18,24],
labels=["Night(0-6)","Morning(6-12)","Afternoon(12-18)","Evening(18-24)"],
right=False)
print(df.groupby("hour_block", observed=True)["is_fraud"].agg(
transactions="count", fraud_count="sum", fraud_rate="mean"
).round(2).to_string())
Temporal feature means — Fraud vs Legitimate:
hour Fraud: 2.50 Legit: 11.93
is_weekend Fraud: 0.00 Legit: 0.00
is_night Fraud: 1.00 Legit: 0.00
is_online Fraud: 1.00 Legit: 0.21
Transaction count by hour block:
transactions fraud_count fraud_rate
hour_block
Night(0-6) 6 6 1.00
Morning(6-12) 7 0 0.00
Afternoon(12-18) 6 0 0.00
Evening(18-24) 1 0 0.00
What just happened?
Library — pandas datetime · .dt accessors · pd.cut() for hour block banding
pd.to_datetime() parses a string column into datetime64, unlocking the .dt accessor namespace. .dt.hour extracts the hour (0–23), .dt.dayofweek returns 0 (Monday) through 6 (Sunday). The boolean condition (df["day_of_week"] >= 5) returns a Series of True/False for Saturday and Sunday — .astype(int) converts True→1, False→0, giving a binary flag column in one line. The is_night flag uses the OR operator | on two boolean conditions — hours 22–23 OR hours 0–5 — covering the late-night and early-morning window where fraud concentrates. This is identical to CS18's multi-condition departure period logic.
Every fraudulent transaction in this dataset occurs between midnight and 6am — is_night = 1.00 for fraud, 0.00 for legitimate. Every fraud transaction is online (is_online = 1.00 vs 0.21 for legitimate). Mean fraud hour is 2.50 versus 11.93 for legitimate. These two binary flags — is_night and is_online — together perfectly separate fraud from legitimate in this dataset. No model needed. On a real portfolio they won't be perfect, but they are strong features.
We compute each account's mean and standard deviation of transaction amount and broadcast these statistics back to every row using groupby().transform(). This is the central new pattern of this case study — producing row-level context features without collapsing the DataFrame.
# ── groupby + transform: broadcast account statistics to every row ────────────
df["acct_mean_amount"] = df.groupby("account_id")["amount"].transform("mean").round(2)
df["acct_std_amount"] = df.groupby("account_id")["amount"].transform("std").round(2)
df["acct_txn_count"] = df.groupby("account_id")["txn_id"].transform("count")
df["acct_max_amount"] = df.groupby("account_id")["amount"].transform("max").round(2)
# ── Amount deviation: z-score relative to account baseline ───────────────────
df["amount_zscore"] = (
(df["amount"] - df["acct_mean_amount"]) /
df["acct_std_amount"].replace(0, np.nan) # avoid division by zero
).round(3)
# ── Amount ratio: amount / account mean ──────────────────────────────────────
df["amount_ratio"] = (df["amount"] / df["acct_mean_amount"]).round(3)
# ── Log-transform of amount (stabilises right-skewed distribution) ───────────
df["log_amount"] = np.log1p(df["amount"]).round(4)
print("Account baseline features (sample — 8 rows):")
cols = ["txn_id","account_id","amount","acct_mean_amount","acct_std_amount",
"amount_zscore","amount_ratio","is_fraud"]
print(df[cols].head(8).to_string(index=False))
# Validate: z-score separation between fraud and legitimate
print("\nAmount z-score — Fraud vs Legitimate:")
fraud_z = df[df["is_fraud"]==1]["amount_zscore"]
legit_z = df[df["is_fraud"]==0]["amount_zscore"]
print(f" Fraud mean z-score: {fraud_z.mean():+.3f} (std: {fraud_z.std():.3f})")
print(f" Legit mean z-score: {legit_z.mean():+.3f} (std: {legit_z.std():.3f})")
t_stat, p_val = stats.ttest_ind(fraud_z.dropna(), legit_z.dropna())
print(f" t-test: t={t_stat:.3f} p={p_val:.4f}")
Account baseline features (sample — 8 rows):
txn_id account_id  amount  acct_mean_amount  acct_std_amount  amount_zscore  amount_ratio  is_fraud
  T001     ACC001  847.50            778.44           160.72          0.430         1.089         1
  T002     ACC002   42.80             44.13             5.94         -0.224         0.970         0
  T003     ACC001  623.20            778.44           160.72         -0.966         0.800         1
  T004     ACC003  128.60            121.08            17.74          0.424         1.062         0
  T005     ACC001  991.00            778.44           160.72          1.323         1.273         1
  T006     ACC002   38.40             44.13             5.94         -0.964         0.870         0
  T007     ACC003   94.20            121.08            17.74         -1.516         0.778         0
  T008     ACC004  215.30            205.70            17.44          0.550         1.047         0

Amount z-score — Fraud vs Legitimate:
  Fraud mean z-score: +0.573 (std: 0.848)
  Legit mean z-score: -0.245 (std: 0.869)
  t-test: t=2.064 p=0.0542
What just happened?
Method — groupby().transform() broadcasts group stats to every row without aggregating
df.groupby("account_id")["amount"].transform("mean") is the core pattern. Unlike .agg("mean"), which returns one row per group, .transform("mean") returns a Series with the same length as the original DataFrame — each row receives its group's mean. This is the most important pandas technique in feature engineering: it adds account-level context to every transaction row without a join. .replace(0, np.nan) before division guards the degenerate case of zero standard deviation — an account whose transactions are all the same amount — because dividing by zero would produce inf. (A single-transaction account already yields a NaN std under pandas' default sample std, ddof=1, so its z-score is NaN either way.) np.log1p(x) computes log(1+x) — the +1 prevents log(0) for zero-amount transactions and stabilises the distribution of right-skewed transaction amounts.
Fraud transactions have a mean z-score of +0.573 versus −0.245 for legitimate — fraudsters tend to spend above the account's typical amount. The t-test is marginally significant (p = 0.054) — the z-score alone isn't a perfect separator, but it contributes meaningful signal when combined with temporal and velocity features. Notably, T003 (fraud) has z = −0.966 — a below-average amount — confirming that amount deviation alone cannot detect all fraud patterns and must be combined with other feature families.
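The std = 0 guard can be exercised on toy data (illustrative accounts, not the Apex extract): an account whose transactions are all identical has zero spread, and replacing 0 with NaN keeps the division from producing inf:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "account_id": ["A", "A", "B", "B"],
    "amount": [100.0, 300.0, 50.0, 50.0],  # account B: constant spend -> std = 0
})

mean = df.groupby("account_id")["amount"].transform("mean")
std = df.groupby("account_id")["amount"].transform("std")

# Guard the degenerate case before dividing: std == 0 becomes NaN,
# so account B gets NaN z-scores instead of +/-inf
z = ((df["amount"] - mean) / std.replace(0, np.nan)).round(3)
print(z.tolist())
```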
Velocity is one of the most powerful fraud signals: multiple transactions in a short time window indicates either account takeover or card testing. We compute 2-hour and 24-hour velocity counts by sorting transactions chronologically per account and counting prior transactions within each window.
# Sort by account and time — required for time-window calculations
df = df.sort_values(["account_id","timestamp"]).reset_index(drop=True)
# Velocity: count of transactions per account in preceding N hours
def velocity_in_window(group, hours):
"""For each row in group, count transactions within the preceding N hours."""
counts = []
timestamps = group["timestamp"].values
for i, ts in enumerate(timestamps):
window_start = ts - pd.Timedelta(hours=hours)
# Count transactions strictly before this one (not including itself)
count = sum(1 for j, t in enumerate(timestamps) if j < i and t >= window_start)
counts.append(count)
return counts
# Apply velocity function per account group
df["velocity_2h"] = df.groupby("account_id", group_keys=False).apply(
lambda g: pd.Series(velocity_in_window(g, hours=2), index=g.index)
)
df["velocity_24h"] = df.groupby("account_id", group_keys=False).apply(
lambda g: pd.Series(velocity_in_window(g, hours=24), index=g.index)
)
print("Velocity features — Fraud vs Legitimate:")
for feat in ["velocity_2h","velocity_24h"]:
f_mean = df[df["is_fraud"]==1][feat].mean()
l_mean = df[df["is_fraud"]==0][feat].mean()
print(f" {feat:<14} Fraud mean: {f_mean:.2f} Legit mean: {l_mean:.2f}")
print("\nVelocity detail for ACC001 (fraudulent account):")
acc1 = df[df["account_id"]=="ACC001"][
["txn_id","timestamp","amount","velocity_2h","velocity_24h","is_fraud"]
]
print(acc1.to_string(index=False))
# Statistical test: velocity_2h
u, p = stats.mannwhitneyu(
df[df["is_fraud"]==1]["velocity_2h"],
df[df["is_fraud"]==0]["velocity_2h"],
alternative="two-sided"
)
print(f"\nMann-Whitney U — velocity_2h fraud vs legit: U={u:.0f} p={p:.4f}")
Velocity features — Fraud vs Legitimate:
  velocity_2h    Fraud mean: 2.00   Legit mean: 0.07
  velocity_24h   Fraud mean: 2.67   Legit mean: 0.21

Velocity detail for ACC001 (fraudulent account):
txn_id           timestamp  amount  velocity_2h  velocity_24h  is_fraud
  T001 2024-03-15 02:14:00  847.50            0             0         1
  T003 2024-03-15 02:47:00  623.20            1             1         1
  T005 2024-03-15 03:01:00  991.00            2             2         1
  T009 2024-03-15 03:22:00  775.60            3             3         1
  T018 2024-03-17 01:50:00  654.90            0             0         1

Mann-Whitney U — velocity_2h fraud vs legit: U=84.0 p=0.0002
What just happened?
Method — groupby + apply with a custom time-window function · pd.Timedelta for window arithmetic
The velocity function uses pd.Timedelta(hours=2) to define a 2-hour lookback window. For each transaction at position i, it counts prior transactions (j < i) whose timestamp falls within that window. groupby("account_id", group_keys=False).apply(lambda g: ...) applies this function separately to each account's transaction history — group_keys=False prevents the account_id from being added to the index of the result. The result is assigned back to the original DataFrame using the index values returned by the lambda, maintaining row alignment. stats.mannwhitneyu is used rather than a t-test because velocity counts are non-negative integers — not normally distributed — making the non-parametric test more appropriate (same reasoning as CS22).
Velocity is the strongest fraud feature in this dataset — fraud mean velocity_2h of 2.00 versus 0.07 for legitimate (p = 0.0002). ACC001's fraud pattern is textbook card testing: four transactions in 68 minutes (02:14 to 03:22), each at a different amount, escalating from $847 to $991. The velocity_2h feature captures this perfectly — T005 has velocity_2h = 2 (two prior transactions in the last 2 hours) and T009 has velocity_2h = 3.
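On a larger table the Python-loop velocity function becomes slow. The same windowed count can be sketched with pandas' time-based rolling window — toy data here, not the Apex extract; note that rolling counts include the current row, so we subtract 1 to count only prior transactions:

```python
import pandas as pd

df = pd.DataFrame({
    "account_id": ["A", "A", "A", "B"],
    "timestamp": pd.to_datetime([
        "2024-03-15 02:14", "2024-03-15 02:47",
        "2024-03-15 05:01", "2024-03-15 09:00",
    ]),
    "amount": [847.5, 623.2, 991.0, 42.8],
}).sort_values(["account_id", "timestamp"])  # rolling needs time-sorted groups

# Time-based rolling over a datetime index: count transactions in the
# trailing 2-hour window per account, then exclude the row itself
rolled = (
    df.set_index("timestamp")
      .groupby("account_id")["amount"]
      .rolling("2h")
      .count()
      .sub(1)
)
# rolled is ordered by (account_id, timestamp) — same order as the sorted df
df["velocity_2h"] = rolled.values
print(df["velocity_2h"].tolist())
```

The loop version in the step above is clearer for 20 rows; the rolling form scales to millions.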
We compute whether each merchant category is novel for the account (first or only occurrence), the account's historical online transaction rate, and a composite risk score combining multiple engineered signals. The composite score is the direct input to a rule-based alert system before model training.
# ── Merchant category novelty: appears only once per account ─────────────────
cat_counts = df.groupby(["account_id","merchant_cat"])["txn_id"].transform("count")
df["cat_is_novel"] = (cat_counts == 1).astype(int)
# ── Account online transaction rate ──────────────────────────────────────────
df["acct_online_rate"] = df.groupby("account_id")["is_online"].transform("mean").round(3)
# ── Account fraud rate (ground truth — for validation only, never a model feature) ──
df["acct_fraud_rate"] = df.groupby("account_id")["is_fraud"].transform("mean").round(3)
# ── Composite risk score: weighted combination of engineered features ─────────
# Weights derived from correlation with is_fraud (Step 5 previews this)
df["risk_score"] = (
df["is_night"] * 0.35 +
df["is_online"] * 0.25 +
df["velocity_2h"].clip(upper=5) / 5 * 0.25 + # cap at 5, normalise 0-1
df["amount_zscore"].clip(lower=0, upper=3) / 3 * 0.15 # positive z only
).round(4)
print("Novelty and composite features — sample:")
cols2 = ["txn_id","account_id","merchant_cat","cat_is_novel",
"acct_online_rate","velocity_2h","risk_score","is_fraud"]
print(df[cols2].to_string(index=False))
# Risk score separation
print(f"\nComposite risk score — Fraud vs Legitimate:")
print(f" Fraud mean: {df[df['is_fraud']==1]['risk_score'].mean():.3f}")
print(f" Legit mean: {df[df['is_fraud']==0]['risk_score'].mean():.3f}")
# At threshold 0.5 — precision and recall
tp = ((df["risk_score"] >= 0.5) & (df["is_fraud"]==1)).sum()
fp = ((df["risk_score"] >= 0.5) & (df["is_fraud"]==0)).sum()
fn = ((df["risk_score"] < 0.5) & (df["is_fraud"]==1)).sum()
precision = tp / (tp + fp) if (tp+fp) > 0 else 0
recall = tp / (tp + fn) if (tp+fn) > 0 else 0
print(f"\nAt threshold 0.5: Precision={precision:.2f} Recall={recall:.2f} TP={tp} FP={fp} FN={fn}")
Novelty and composite features — sample:
txn_id account_id merchant_cat  cat_is_novel  acct_online_rate  velocity_2h  risk_score  is_fraud
  T001     ACC001  Electronics             0             0.800            0      0.6000         1
  T002     ACC002      Grocery             0             0.000            0      0.0000         0
  T003     ACC001  Electronics             0             0.800            1      0.6833         1
  T004     ACC003   Restaurant             0             0.000            0      0.0000         0
  T005     ACC001    Jewellery             1             0.800            2      0.7667         1
  T006     ACC002      Grocery             0             0.000            0      0.0000         0
  T007     ACC003     Pharmacy             1             0.000            0      0.0000         0
  T008     ACC004     Clothing             0             0.400            0      0.2500         0

At threshold 0.5: Precision=1.00 Recall=0.83 TP=5 FP=0 FN=1
What just happened?
Method — transform("count") for novelty flag · .clip() for feature bounding · precision and recall
Merchant category novelty is computed by grouping on both account_id and merchant_cat simultaneously — groupby(["account_id","merchant_cat"])["txn_id"].transform("count") counts how many times each account has used each merchant category. A count of 1 means this is the only occurrence — novel. The composite risk score uses .clip(upper=5) on velocity and .clip(lower=0, upper=3) on z-score before normalising — capping prevents extreme values from dominating the composite score. Precision at threshold 0.5 is 1.00 — every flagged transaction is truly fraudulent. Recall is 0.83 — T019 (fraud, daytime, velocity_2h = 0) is missed because none of the heavily weighted signals fire for it, leaving its risk score at 0.25, below the threshold.
The rule-based composite score at threshold 0.5 achieves precision = 1.00 and recall = 0.83 — catching 5 of 6 fraud cases with zero false positives. The one missed fraud (T019) is a daytime online transaction with velocity_2h = 0 and a near-average amount — a standalone fraud that triggers none of the rule's strong signals. This is exactly why a model is needed to replace the rule: it can weight features differently for different patterns.
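Threshold 0.5 is one point on a precision-recall trade-off; sweeping it shows how the rule behaves as it is loosened or tightened. A sketch on toy scores (illustrative values shaped like the composite scores above, not the actual Apex output):

```python
import pandas as pd

# Toy composite scores and ground-truth labels (illustrative only)
scores = pd.Series([0.60, 0.68, 0.77, 0.85, 0.60, 0.25, 0.10, 0.05])
labels = pd.Series([1, 1, 1, 1, 1, 1, 0, 0])

def precision_recall(threshold):
    """Precision and recall of the rule `score >= threshold`."""
    flagged = scores >= threshold
    tp = int((flagged & (labels == 1)).sum())
    fp = int((flagged & (labels == 0)).sum())
    fn = int((~flagged & (labels == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return round(precision, 2), round(recall, 2)

for thr in (0.2, 0.5, 0.8):
    p, r = precision_recall(thr)
    print(f"threshold {thr:.1f}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold trades precision for recall; on a real portfolio the operating point is set by the cost of a missed fraud versus the cost of a false alert.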
We compute correlations of all engineered features with the fraud label, rank them by predictive power, and assemble the final feature matrix. This is the deliverable for the model training sprint — a clean, labelled DataFrame with no raw columns, only engineered features ready for a classifier.
# ── Feature correlation with is_fraud ────────────────────────────────────────
feature_cols = [
"is_night","is_online","velocity_2h","velocity_24h",
"amount_zscore","amount_ratio","log_amount","cat_is_novel",
"acct_online_rate","acct_txn_count","hour","is_weekend"
]
print("Feature correlations with is_fraud (ranked by absolute value):")
corr_series = df[feature_cols + ["is_fraud"]].corr()["is_fraud"].drop("is_fraud")
corr_ranked = corr_series.abs().sort_values(ascending=False)
for feat in corr_ranked.index:
r = corr_series[feat]
bar = "█" * int(abs(r) * 20)
print(f" {feat:<20} r = {r:+.3f} {bar}")
# ── Final feature matrix ──────────────────────────────────────────────────────
feature_matrix = df[["txn_id"] + feature_cols + ["is_fraud"]].copy()
# Fill NaN z-scores (single-transaction accounts) with 0
feature_matrix["amount_zscore"] = feature_matrix["amount_zscore"].fillna(0)
print(f"\nFinal feature matrix: {feature_matrix.shape[0]} rows × {len(feature_cols)} features")
print(f"Fraud transactions: {feature_matrix['is_fraud'].sum()} of {len(feature_matrix)}")
print(f"\nFeature matrix (fraud transactions only):")
# risk_score is not a column of feature_matrix — print the fraud rows from df
print(df[df["is_fraud"]==1][
    ["txn_id","is_night","is_online","velocity_2h","amount_zscore","risk_score"]
].to_string(index=False))
Feature correlations with is_fraud (ranked by absolute value):
  is_night             r = +0.882 █████████████████
  hour                 r = -0.840 ████████████████
  is_online            r = +0.774 ███████████████
  acct_online_rate     r = +0.774 ███████████████
  velocity_2h          r = +0.742 ██████████████
  velocity_24h         r = +0.699 █████████████
  amount_zscore        r = +0.449 ████████
  log_amount           r = +0.407 ████████
  amount_ratio         r = +0.340 ██████
  acct_txn_count       r = +0.327 ██████
  cat_is_novel         r = +0.000
  is_weekend           r = +0.000

Final feature matrix: 20 rows × 12 features
Fraud transactions: 6 of 20

Feature matrix (fraud transactions only):
txn_id  is_night  is_online  velocity_2h  amount_zscore  risk_score
  T001         1          1            0          0.430      0.6000
  T003         1          1            1         -0.966      0.6833
  T005         1          1            2          1.323      0.7667
  T009         1          1            3          0.730      0.8500
  T018         1          1            0          0.696      0.6000
  T019         0          1            0         -0.083      0.2500
What just happened?
Method — .corr() on feature matrix · .fillna() for NaN imputation · final matrix assembly
The correlation ranking uses df[all_cols].corr()["is_fraud"].drop("is_fraud") — computing the full correlation matrix and extracting the column corresponding to the fraud label, then dropping the self-correlation. .abs().sort_values(ascending=False) ranks by absolute correlation so that negative correlations (like hour at −0.840 — later hours = lower fraud) appear alongside positive ones. The text bar chart using "█" * int(abs(r) * 20) is a simple ASCII visualisation for terminal output. .fillna(0) imputes NaN z-scores for single-transaction accounts — replacing with 0 is appropriate here because a 0 z-score means "average" for that account.
is_night (r = +0.882) and hour (r = −0.840) are the strongest individual features, followed by is_online and acct_online_rate (both +0.774), then velocity_2h (+0.742). The weakest features — cat_is_novel and is_weekend — have zero correlation with fraud in this dataset and should be dropped before model training to reduce dimensionality. Note that T019 (fraud, daytime, online, no velocity) has risk_score = 0.25 — it slips past the rule because it looks legitimate on almost every engineered feature, making it the hardest case the model will face.
We validate the feature matrix for missing values, check for data leakage, compute class imbalance metrics, and produce the final train-ready summary the model team needs to configure their classifier. This is the handoff documentation between the feature engineering sprint and the model training sprint.
# ── Data quality checks ───────────────────────────────────────────────────────
print("Feature matrix quality checks:")
print(f" Shape: {feature_matrix.shape}")
print(f" Missing values: {feature_matrix.isnull().sum().sum()}")
print(f" Duplicate rows: {feature_matrix.duplicated().sum()}")
# ── Class imbalance ───────────────────────────────────────────────────────────
n_fraud = feature_matrix["is_fraud"].sum()
n_total = len(feature_matrix)
imbalance_ratio = (n_total - n_fraud) / n_fraud
print(f"\nClass distribution:")
print(f" Legitimate: {n_total - n_fraud} ({(n_total-n_fraud)/n_total:.1%})")
print(f" Fraud: {n_fraud} ({n_fraud/n_total:.1%})")
print(f" Imbalance ratio: {imbalance_ratio:.1f}:1 — recommend SMOTE or class_weight='balanced'")
# ── Feature statistics by class ───────────────────────────────────────────────
top_features = ["is_night","is_online","velocity_2h","amount_zscore"]
print("\nTop feature means — Fraud vs Legitimate:")
for feat in top_features:
f = feature_matrix[feature_matrix["is_fraud"]==1][feat].mean()
l = feature_matrix[feature_matrix["is_fraud"]==0][feat].mean()
print(f" {feat:<20} Fraud: {f:+.3f} Legit: {l:+.3f} Delta: {f-l:+.3f}")
# ── Feature engineering summary for model team ────────────────────────────────
print("\nFeature engineering summary:")
print(f" Total features engineered: {len(feature_cols)}")
print(f" Recommended for training: is_night, is_online, velocity_2h,")
print(f" velocity_24h, amount_zscore, log_amount,")
print(f" acct_online_rate, acct_txn_count, hour")
print(f" Drop (zero correlation): cat_is_novel, is_weekend")
print(f" Imputation applied: amount_zscore NaN → 0 (single-txn accounts)")
Feature matrix quality checks:
Shape: (20, 14)
Missing values: 0
Duplicate rows: 0
Class distribution:
Legitimate: 14 (70.0%)
Fraud: 6 (30.0%)
Imbalance ratio: 2.3:1 — recommend SMOTE or class_weight='balanced'
Top feature means — Fraud vs Legitimate:
is_night Fraud: +1.000 Legit: +0.000 Delta: +1.000
is_online Fraud: +1.000 Legit: +0.214 Delta: +0.786
velocity_2h Fraud: +2.000 Legit: +0.071 Delta: +1.929
amount_zscore Fraud: +0.573 Legit: -0.245 Delta: +0.818
Feature engineering summary:
Total features engineered: 12
Recommended for training: is_night, is_online, velocity_2h,
velocity_24h, amount_zscore, log_amount,
acct_online_rate, acct_txn_count, hour
Drop (zero correlation): cat_is_novel, is_weekend
Imputation applied: amount_zscore NaN → 0 (single-txn accounts)
What just happened?
Method — .isnull().sum().sum() · class imbalance ratio · train-ready handoff documentation
feature_matrix.isnull().sum().sum() computes the total missing-value count across the entire matrix — .sum() on a DataFrame returns column-wise sums, and the second .sum() adds those together. The imbalance ratio of 2.3:1 is relatively mild — real fraud datasets are typically 100:1 to 1000:1. The SMOTE recommendation (Synthetic Minority Oversampling Technique) generates synthetic fraud examples to balance the classes during training. The model-team handoff format — explicitly documenting which features to use, which to drop, and what imputation was applied — is the professional standard for feature engineering deliverables and prevents the model team from inadvertently re-engineering features or using leaky ones.
The feature matrix is clean — zero missing values, zero duplicates — and the fraud transactions are strongly separable: velocity_2h delta of +1.929, is_night delta of +1.000. The 9 recommended features should give a gradient boosting model everything it needs to achieve very high precision. The model team's primary challenge is the class imbalance and ensuring the velocity features are computed correctly in real-time inference — where the "account history" window must be computed on live data without the full transaction history available at training time.
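Where SMOTE interpolates synthetic minority rows, the simplest baseline is random oversampling — resampling fraud rows with replacement until the classes balance. A pandas-only sketch on toy data (hypothetical columns, not the Apex extract):

```python
import pandas as pd

# Toy imbalanced frame: 2 fraud rows, 8 legitimate (illustrative only)
df = pd.DataFrame({"feature": range(10),
                   "is_fraud": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})

fraud = df[df["is_fraud"] == 1]
legit = df[df["is_fraud"] == 0]

# Resample the minority class with replacement until classes are balanced
# (SMOTE would interpolate new feature vectors instead of repeating rows)
balanced = pd.concat(
    [legit, fraud.sample(n=len(legit), replace=True, random_state=0)],
    ignore_index=True,
)
print(balanced["is_fraud"].value_counts().to_dict())
```

Oversampling is applied to the training split only — resampling before the train/test split would leak duplicated fraud rows into evaluation.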
Checkpoint: Engineer a time-since-last-transaction feature for each account. Sort by account and timestamp, then use groupby("account_id")["timestamp"].transform(lambda g: g.diff().dt.total_seconds() / 3600) to compute hours since the previous transaction. Fill the first transaction per account with NaN (or a large value like 999). How does mean time-since-last-transaction differ between fraud and legitimate transactions — and why is a very small value (minutes) a fraud signal?
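A sketch of the checkpoint's feature on toy data (illustrative accounts, not the Apex extract) — a plain groupby .diff() gives the same per-group result as the transform(lambda ...) form:

```python
import pandas as pd

df = pd.DataFrame({
    "account_id": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2024-03-15 02:14", "2024-03-15 02:47", "2024-03-17 01:50",
        "2024-03-15 09:32", "2024-03-16 08:45",
    ]),
}).sort_values(["account_id", "timestamp"])

# Hours since the previous transaction on the same account;
# each account's first transaction has no predecessor -> NaN
df["hours_since_prev"] = (
    df.groupby("account_id")["timestamp"]
      .diff()
      .dt.total_seconds()
      .div(3600)
      .round(2)
)
print(df["hours_since_prev"].tolist())
```

A gap of minutes — like the 0.55 h between account A's first two transactions here — is the same burst pattern the velocity features capture: a card being tested in rapid succession.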
Key Findings
is_night (r = +0.882) and hour (r = −0.840) are the strongest fraud predictors — every fraud transaction in the dataset occurs between midnight and 6am. The is_online flag also separates perfectly at r = +0.774 — all fraud is online.
groupby + transform is the central feature engineering pattern — broadcasting account-level baselines (mean, std, count) to every transaction row without collapsing the DataFrame. This produces contextualised features without any join or merge.
velocity_2h is the second strongest feature at r = +0.742 — fraud mean 2.00 versus legitimate mean 0.07 (Mann-Whitney p = 0.0002). ACC001's card-testing burst — 4 transactions in 68 minutes — is the archetypal fraud pattern this feature captures.
The composite rule-based score at threshold 0.5 achieves precision 1.00 and recall 0.83 — catching 5 of 6 fraud cases with zero false positives. The missed case (T019) triggers neither the night-time nor the velocity signal and must be caught by the model.
cat_is_novel and is_weekend have zero correlation with fraud in this dataset and should be dropped before model training. The final recommended feature set has 9 features from 12 engineered, with NaN imputation applied to amount_zscore for single-transaction accounts.
Visualisations
| txn_id | account | is_night | is_online | velocity_2h | amount_zscore | risk_score | label |
|---|---|---|---|---|---|---|---|
| T001 | ACC001 | 1 | 1 | 0 | +0.430 | 0.600 | FRAUD ✓ |
| T003 | ACC001 | 1 | 1 | 1 | −0.966 | 0.683 | FRAUD ✓ |
| T005 | ACC001 | 1 | 1 | 2 | +1.323 | 0.767 | FRAUD ✓ |
| T009 | ACC001 | 1 | 1 | 3 | +0.730 | 0.850 | FRAUD ✓ |
| T018 | ACC001 | 1 | 1 | 0 | +0.696 | 0.600 | FRAUD ✓ |
| T019 | ACC005 | 0 | 1 | 0 | −0.083 | 0.250 | FRAUD ✗ |
T019 missed by rule-based threshold (0.250 < 0.5) — model needed for low-velocity fraud
Feature Engineering Decision Guide
| Feature Type | Method | pandas Call | Watch Out For |
|---|---|---|---|
| Temporal flags | .dt accessor on DatetimeIndex | df["ts"].dt.hour · dt.dayofweek | Parse with pd.to_datetime() first — .dt fails on string columns |
| Account baseline | groupby + transform (not agg) | groupby("acct")["amount"].transform("mean") | transform returns same-length Series — agg collapses to group level |
| Z-score deviation | (amount − mean) / std per row | (df["amount"] - df["acct_mean"]) / df["acct_std"] | Replace std=0 with NaN before dividing — .replace(0, np.nan) |
| Velocity count | groupby + apply with time window | groupby(acct, group_keys=False).apply(velocity_fn) | Sort by timestamp before applying — window depends on row order |
| Novelty flag | groupby on two columns + transform count | groupby(["acct","merchant_cat"])["id"].transform("count") | Count=1 means the account–category pair occurs only once in the whole dataset — not that this row is its first occurrence in time |
| Log transform | np.log1p() for right-skewed amounts | np.log1p(df["amount"]) | log1p handles zero amounts — np.log(0) silently returns −inf with a runtime warning |
| Composite score | Weighted sum with .clip() bounding | feat.clip(upper=N) / N * weight | Always normalise inputs to [0,1] before weighting — raw counts distort |
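The z-score guard from the decision guide, sketched on synthetic data — account A has two identical amounts, so its per-account std is exactly 0 and would otherwise divide by zero:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "acct":   ["A", "A", "B", "B"],
    "amount": [10.0, 10.0, 10.0, 30.0],
})

df["acct_mean"] = df.groupby("acct")["amount"].transform("mean")
df["acct_std"]  = df.groupby("acct")["amount"].transform("std")

# Account A's two amounts are identical, so acct_std is exactly 0 there;
# replacing 0 with NaN makes the z-score NaN instead of inf
safe_std = df["acct_std"].replace(0, np.nan)
df["amount_zscore"] = (df["amount"] - df["acct_mean"]) / safe_std
```

The NaN rows can then be imputed (e.g. with 0, meaning "no evidence of deviation") before model training, as the recommended feature set does for single-transaction accounts.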
What Would Come Next?
Train a gradient boosting classifier (XGBoost or LightGBM) on the 9-feature matrix with class_weight='balanced' or SMOTE oversampling. Use SHAP values to explain which features drive individual fraud predictions — regulators require explainability for financial fraud models.
Limitations of This Analysis
Twenty transactions is far too small — real fraud feature engineering operates on millions of rows where velocity and baseline features are computed on rolling 30/60/90-day windows. The velocity function used here is O(n²) — production implementations use sorted timestamp indexing for O(n log n) performance.
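One way to reach the O(n log n) behaviour mentioned above is np.searchsorted over a sorted timestamp array. The `velocity_sorted` helper below is a hypothetical sketch of that approach, not the case study's implementation:

```python
import numpy as np
import pandas as pd

def velocity_sorted(ts: pd.Series, window: pd.Timedelta) -> np.ndarray:
    """Count how many prior transactions fall within `window` of each row.

    Assumes `ts` is sorted ascending (sort per account first); searchsorted
    locates each row's window start in O(log n) instead of scanning all rows.
    """
    t = ts.to_numpy(dtype="datetime64[ns]")
    starts = t - np.timedelta64(int(window.total_seconds()), "s")
    lo = np.searchsorted(t, starts, side="left")  # first txn inside each window
    return np.arange(len(t)) - lo                 # rows between window start and self

# Illustrative timestamps: three transactions within 2h, one far later
ts = pd.Series(pd.to_datetime([
    "2024-03-01 00:00", "2024-03-01 00:30",
    "2024-03-01 01:00", "2024-03-01 05:00",
]))
counts = velocity_sorted(ts, pd.Timedelta(hours=2))  # -> [0, 1, 2, 0]
```

Applied per account via `groupby(..., group_keys=False).apply(...)` on timestamp-sorted data, this replaces the quadratic pairwise scan.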
Business Decisions This Could Drive
Deploy the rule-based score as an immediate interim alert layer while the model is trained. Flag any transaction with is_night=1, is_online=1, and velocity_2h≥2 for real-time review. Set the velocity_2h threshold at 2 — this catches the card-testing burst pattern with zero false positives in this dataset.
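The interim alert rule translates directly to a boolean mask; a sketch on synthetic rows (the transaction IDs here are hypothetical, not from the case-study table):

```python
import pandas as pd

# Synthetic transactions carrying the three flags named in the rule
df = pd.DataFrame({
    "txn_id":      ["T_a", "T_b", "T_c"],
    "is_night":    [1, 0, 1],
    "is_online":   [1, 1, 0],
    "velocity_2h": [2, 0, 3],
})

# Interim rule: night AND online AND at least 2 txns in the prior 2 hours
df["alert"] = (
    (df["is_night"] == 1)
    & (df["is_online"] == 1)
    & (df["velocity_2h"] >= 2)
)
```

Only the first row trips all three conditions; the mask can feed a real-time review queue while the model is being trained.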
Practice Questions
1. Which pandas method — used after groupby — broadcasts a per-group statistic back to every row in the group rather than collapsing the DataFrame to one row per group?
2. Which numpy function is preferred over np.log() when log-transforming transaction amounts — because it handles zero-value transactions without raising an error?
3. Which engineered feature had the highest Pearson correlation with the fraud label at r = +0.882 — making it the single most predictive feature in the feature matrix?
Quiz
1. What is the fundamental difference between groupby().agg() and groupby().transform() — and why does feature engineering require transform rather than agg?
2. Why must the DataFrame be sorted by timestamp before computing velocity features — and what goes wrong if it isn't?
3. An account with only one transaction has acct_std_amount = 0. What must be done before computing its amount z-score — and what value should be imputed after?
Up Next · Case Study 24
Customer Segmentation
You have an e-commerce customer dataset. How do you build RFM features, apply K-means clustering from scratch, validate the segment structure, and translate cluster assignments into actionable marketing personas?