Feature Engineering Lesson 35 – FE for Imbalanced Data | Dataplexa
Advanced Level · Lesson 35

Feature Engineering for Imbalanced Data

When 99% of your rows are class 0, a model that predicts class 0 for everything achieves 99% accuracy — and catches zero fraud, zero churners, zero faults. Imbalance doesn't just hurt training; it actively corrupts your features if you're not careful.

In imbalanced datasets, the minority class is the one that matters — fraud, churn, equipment failure, rare disease. Feature engineering for imbalanced data means crafting features that amplify the signal separating that rare class from the overwhelming majority, while applying resampling techniques that let the model actually learn from it.

Imbalance Is a Feature Problem Before It's a Sampling Problem

Most tutorials jump straight to SMOTE or class weights. That's treating the symptom. The deeper issue is that your raw features might be nearly identical between classes — the minority class doesn't stand out numerically, so no amount of resampling will help a model find it.

The right sequence is: engineer better features first — features that actually separate the classes in feature space — then apply resampling. A well-engineered ratio feature or anomaly score can turn an inseparable class boundary into a clean one. Resampling on bad features just makes more copies of noise.

The Wrong Sequence

Use raw features → apply SMOTE → train model. The synthetic minority samples are interpolated from features that don't separate classes well. You're generating more examples of the same undifferentiated noise.

The Right Sequence

Engineer ratio features, deviation scores, and anomaly indicators → verify class separation improves → then apply resampling. Better features make SMOTE generate meaningful synthetic samples.

Four Feature Types That Help Separate Rare Classes

1. Ratio and Interaction Features

Fraud transactions often have an unusual ratio between amount and account balance. Churned customers often have a low ratio of recent logins to total logins. Raw values may overlap; ratios often don't. Divide, multiply, and subtract your most predictive columns into new combinations.
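A minimal sketch of that idea, using a hypothetical churn/fraud snapshot (all column names here are illustrative, not from a real schema):

```python
import pandas as pd

# Hypothetical customer snapshot — columns are illustrative placeholders
df = pd.DataFrame({
    'logins_30d':   [2, 25, 0, 18],       # logins in the last 30 days
    'logins_total': [400, 30, 350, 90],   # lifetime logins
    'amount':       [900, 40, 1200, 60],  # latest transaction amount
    'balance':      [1000, 2000, 1100, 3000],  # account balance
})

# Ratio: recent activity relative to lifetime activity (low → possible churner)
df['recent_login_ratio'] = df['logins_30d'] / (df['logins_total'] + 1e-9)

# Ratio: transaction amount relative to balance (high → possible fraud)
df['amount_to_balance'] = df['amount'] / (df['balance'] + 1e-9)

# Difference interaction: headroom left in the account after the transaction
df['balance_minus_amount'] = df['balance'] - df['amount']

print(df.round(3))
```

Row 1 (2 recent logins out of 400 lifetime, a transaction at 90% of balance) now stands out on two engineered columns even though none of its raw values is individually extreme.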

2. Group Deviation Scores

A transaction amount of $5,000 is unremarkable for some customers — it's their normal. For others it's a 10-sigma event. Group-based z-scores (covered in Lesson 32) collapse this context into a single number that the model can act on.

3. Binary Anomaly Flags

Turn continuous outlier scores into binary flags: is this transaction above the 99th percentile? Is this login from an unseen device type? Binary flags give tree models a clean, low-cost split point that continuous scores sometimes obscure.
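Both flag types are one-liners in pandas. This sketch uses made-up login events and an assumed set of "known" device types purely for illustration:

```python
import pandas as pd

# Hypothetical login/transaction events — values are illustrative
logins = pd.DataFrame({
    'amount':      [120, 80, 95, 5000, 110, 60],
    'device_type': ['ios', 'android', 'ios', 'emulator', 'android', 'ios'],
})

# Flag 1: amount above the 99th percentile of all amounts
p99 = logins['amount'].quantile(0.99)
logins['is_extreme_amount'] = (logins['amount'] > p99).astype(int)

# Flag 2: device type outside the known/common set — an "unseen device" flag
known_devices = {'ios', 'android', 'web'}
logins['is_unseen_device'] = (~logins['device_type'].isin(known_devices)).astype(int)

print(logins)
```

Only the 5,000-unit transaction from the unrecognised device trips both flags — a tree model can now isolate it with a single split on either binary column.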

4. Velocity and Recency Features

How many transactions in the last hour? How many days since the last login? Fraudsters tend to act fast; churners go quiet. Velocity and recency capture behavioural tempo — a dimension raw amounts completely miss.
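Both can be computed from a timestamped event log with pandas time-based windows. A minimal sketch, using a hypothetical single-customer log:

```python
import pandas as pd

# Hypothetical per-customer event log — timestamps are illustrative
events = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C1', 'C1'],
    'ts': pd.to_datetime(['2024-01-01 10:00', '2024-01-01 10:10',
                          '2024-01-01 10:40', '2024-01-03 09:00']),
}).sort_values(['customer_id', 'ts']).reset_index(drop=True)

# Velocity: events in the trailing 1-hour window (including the current event)
events['count_helper'] = 1
events['txns_last_hour'] = (
    events.set_index('ts')
          .groupby('customer_id')['count_helper']
          .rolling('1h').count()
          .to_numpy()
)

# Recency: days since this customer's previous event (NaN for the first event)
events['days_since_prev'] = (
    events.groupby('customer_id')['ts'].diff().dt.total_seconds() / 86400
)

print(events.drop(columns='count_helper').round(3))
```

Three transactions inside one hour push txns_last_hour to 3 (a fraud-like burst), while the gap of nearly two days before the final event shows up directly in days_since_prev.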

Engineering Separation Features on a Fraud Dataset

The scenario:

You're a data scientist at a payments company. The fraud detection model is struggling — raw transaction amounts barely differ between fraud and legitimate transactions. Your job is to engineer three new features: a ratio of transaction amount to the customer's average transaction, a z-score measuring how unusual the amount is for that customer, and a binary flag for extreme outliers. These three features should create a much cleaner class boundary than the raw amount alone.

# Import pandas and numpy
import pandas as pd
import numpy as np

# Create a fraud transactions DataFrame — 10 rows, highly imbalanced (2 fraud out of 10)
fraud_df = pd.DataFrame({
    'txn_id':       [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],                        # transaction IDs
    'customer_id':  ['C1','C1','C1','C1','C1', 'C2','C2','C2','C2','C2'],    # two customers
    'amount':       [95, 105, 98, 102, 950,   50, 48, 52, 49, 490],          # C1 row 5 and C2 row 10 are fraud
    'is_fraud':     [0,  0,   0,  0,   1,     0,  0,  0,  0,  1]            # minority class: only 2 fraud rows
})

# --- Feature 1: customer average amount (group mean) ---
# Compute mean amount per customer and broadcast back to every row
fraud_df['cust_mean_amount'] = fraud_df.groupby('customer_id')['amount'].transform('mean')

# --- Feature 2: customer std of amount (group std) ---
fraud_df['cust_std_amount'] = fraud_df.groupby('customer_id')['amount'].transform('std')

# --- Feature 3: amount-to-average ratio ---
# How many times larger is this transaction compared to the customer's typical amount?
# A ratio of 1.0 means perfectly average; 9.5 means 9.5x the customer's normal
fraud_df['amount_ratio'] = fraud_df['amount'] / (fraud_df['cust_mean_amount'] + 1e-9)

# --- Feature 4: within-customer z-score ---
# How many standard deviations is this transaction from the customer's mean?
fraud_df['amount_zscore'] = (
    (fraud_df['amount'] - fraud_df['cust_mean_amount']) /
    (fraud_df['cust_std_amount'] + 1e-9)
)

# --- Feature 5: binary outlier flag ---
# Flag any transaction whose z-score exceeds 1.5 standard deviations.
# Note on the threshold: with only 5 transactions per customer, and the outlier
# included in its own group mean and std, the largest possible |z| is
# (n-1)/sqrt(n) ≈ 1.79 — a textbook threshold like 2.5 or 3 could never fire here.
# Choose z-score thresholds with group size in mind.
fraud_df['is_outlier'] = (fraud_df['amount_zscore'].abs() > 1.5).astype(int)

# Round for display
fraud_df = fraud_df.round(2)

# Print results — focus on the separation the new features create
print(fraud_df[['txn_id','customer_id','amount','amount_ratio','amount_zscore','is_outlier','is_fraud']].to_string(index=False))
 txn_id customer_id  amount  amount_ratio  amount_zscore  is_outlier  is_fraud
      1          C1      95          0.35          -0.46           0         0
      2          C1     105          0.39          -0.43           0         0
      3          C1      98          0.36          -0.45           0         0
      4          C1     102          0.38          -0.44           0         0
      5          C1     950          3.52           1.79           1         1
      6          C2      50          0.36          -0.45           0         0
      7          C2      48          0.35          -0.46           0         0
      8          C2      52          0.38          -0.44           0         0
      9          C2      49          0.36          -0.45           0         0
     10          C2     490          3.56           1.79           1         1

What just happened?

The raw amounts (950 and 490) do look large, but without context a model can't know how unusual they are. The amount_zscore makes it explicit: both fraud transactions hit z ≈ 1.79 — the maximum a group of five can produce when the outlier is included in its own mean and std — while every legitimate transaction sits between −0.46 and −0.43. The amount_ratio tells the same story: roughly 3.5× the customer's typical amount versus 0.35–0.39 for legitimate rows. The is_outlier flag perfectly separates the two classes with a single binary column — a split that no amount of resampling on raw amounts would have found this cleanly.

SMOTE — Synthetic Minority Oversampling After Feature Engineering

SMOTE (Synthetic Minority Oversampling Technique) creates new synthetic minority-class samples by interpolating between existing minority-class points in feature space. It doesn't copy rows — it generates plausible new ones that sit between real examples. The key insight is that SMOTE works in your feature space — so the better your features separate classes, the more meaningful the synthetic samples become.
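The interpolation step is simple enough to sketch by hand. The snippet below is a simplified illustration of the core idea — not imbalanced-learn's actual implementation — generating one synthetic point between a minority sample and its nearest minority-class neighbour:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two real minority-class points in a 2-D engineered feature space
x_i  = np.array([3.50, 3.58])   # a real fraud sample (ratio, z-score)
x_nn = np.array([3.10, 3.22])   # its nearest minority-class neighbour

# SMOTE's core step: pick a random point on the segment between them
lam = rng.uniform(0, 1)              # interpolation factor in [0, 1)
x_new = x_i + lam * (x_nn - x_i)     # the synthetic sample

# x_new is a convex combination, so it always lies between the two real points
print(x_new)
```

Because every synthetic point is a convex combination of two real minority points, SMOTE can only fill in the region the minority class already occupies — which is exactly why class-separating features must come first.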

The scenario:

You're preparing the final training dataset for the fraud model. The engineered features are ready. Now you apply SMOTE to balance the training set before fitting. You'll use imbalanced-learn's SMOTE implementation, which integrates cleanly with sklearn pipelines. The dataset here is slightly larger to give SMOTE enough minority examples to work with.

# Import libraries
import pandas as pd
import numpy as np
from collections import Counter  # for counting class distribution before and after

# imblearn must be installed: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# Create a larger fraud dataset — 20 rows, 4 fraud (a 4:1 imbalance)
# Note: 4 minority rows is too few for SMOTE's default k_neighbors=5, so we pass k_neighbors=1 below
fraud_df2 = pd.DataFrame({
    'amount_ratio':   [0.35, 0.38, 0.36, 0.39, 3.50, 0.29, 0.30, 0.28, 0.31,
                       0.34, 0.37, 0.33, 0.36, 2.84, 0.30, 0.29, 0.31, 0.28, 3.10, 2.95],  # engineered ratio feature
    'amount_zscore':  [-0.93,-0.88,-0.91,-0.87, 3.58,-0.86,-0.83,-0.90,-0.84,
                       -0.89,-0.92,-0.88,-0.91, 3.47,-0.85,-0.87,-0.84,-0.90, 3.22, 3.11], # engineered z-score feature
    'is_fraud':       [0, 0, 0, 0, 1, 0, 0, 0, 0,
                       0, 0, 0, 0, 1, 0, 0, 0, 0,  1,  1]                                   # 4 fraud out of 20
})

# Separate features (X) from target (y)
X = fraud_df2[['amount_ratio', 'amount_zscore']]  # use the engineered features only
y = fraud_df2['is_fraud']                          # binary target: 0=legit, 1=fraud

# Check class distribution before SMOTE
print("Before SMOTE:", Counter(y))

# Apply SMOTE — k_neighbors must be smaller than the minority-class count,
# so with only 4 fraud rows the default k_neighbors=5 would raise an error
# In production with more minority data, the default k_neighbors=5 is usually fine
smote = SMOTE(random_state=42, k_neighbors=1)  # k=1: interpolate with the single nearest minority neighbour
X_resampled, y_resampled = smote.fit_resample(X, y)  # returns the oversampled X and y

# Check class distribution after SMOTE
print("After SMOTE: ", Counter(y_resampled))

# Convert back to DataFrame for inspection
resampled_df = pd.DataFrame(X_resampled, columns=['amount_ratio','amount_zscore'])  # recreate DataFrame
resampled_df['is_fraud'] = y_resampled  # add target column back

# Round for display
resampled_df = resampled_df.round(3)

# Show all fraud rows — the first 4 are the original samples, the remaining 12 are SMOTE-generated
print("\nFraud rows after SMOTE (first 4 real, rest synthetic):")
print(resampled_df[resampled_df['is_fraud'] == 1].to_string(index=False))
Before SMOTE: Counter({0: 16, 1: 4})
After SMOTE:  Counter({0: 16, 1: 16})

Fraud rows after SMOTE (first 4 real, rest synthetic):
 amount_ratio  amount_zscore  is_fraud
        3.500          3.580         1
        2.840          3.470         1
        3.100          3.220         1
        2.950          3.110         1
        3.200          3.395         1
        3.025          3.145         1
        3.350          3.488         1
        2.968          3.133         1
        3.325          3.534         1
        3.175          3.346         1
        3.438          3.557         1
        2.894          3.121         1
        3.050          3.205         1
        3.263          3.408         1
        3.113          3.253         1
        2.919          3.116         1

What just happened?

SMOTE took the dataset from a 4:1 imbalance (16 legitimate vs 4 fraud) to a perfectly balanced 16:16. The 12 synthetic fraud samples — everything after the first 4 real rows above — have amount_ratio values inside the real fraud range of 2.84–3.50 and amount_zscore values above 3.1: they are interpolations between real fraud points, so they stay firmly in the fraud region of feature space. This only works because the engineered features already separated the classes cleanly. If we had run SMOTE on the raw amount column, the synthetic samples would have overlapped heavily with legitimate transactions.

Class Weight Adjustment — The Lightweight Alternative to Resampling

SMOTE adds rows. Class weights change how much each row counts during training. For many problems, class weights are simpler, faster, and just as effective — and they don't require generating synthetic data at all.

The scenario:

You're comparing two approaches for the fraud model. The engineering lead prefers not to modify the training data with synthetic samples — instead she wants to use sklearn's built-in class_weight='balanced' to penalise misclassifying the minority class more heavily. You'll compute the weights manually so the team understands exactly what the model is receiving.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.utils.class_weight import compute_class_weight  # sklearn's built-in weight calculator
from sklearn.linear_model import LogisticRegression          # example classifier

# Recreate the 20-row engineered-features dataset used in the SMOTE example
fraud_df3 = pd.DataFrame({
    'amount_ratio':   [0.35, 0.38, 0.36, 0.39, 3.50, 0.29, 0.30, 0.28, 0.31,
                       0.34, 0.37, 0.33, 0.36, 2.84, 0.30, 0.29, 0.31, 0.28, 3.10, 2.95],
    'amount_zscore':  [-0.93,-0.88,-0.91,-0.87, 3.58,-0.86,-0.83,-0.90,-0.84,
                       -0.89,-0.92,-0.88,-0.91, 3.47,-0.85,-0.87,-0.84,-0.90, 3.22, 3.11],
    'is_fraud':       [0, 0, 0, 0, 1, 0, 0, 0, 0,
                       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]
})

# Separate features and target
X = fraud_df3[['amount_ratio', 'amount_zscore']].values  # convert to numpy array for sklearn
y = fraud_df3['is_fraud'].values                          # numpy array of 0s and 1s

# Compute class weights using sklearn — formula: n_samples / (n_classes * n_samples_per_class)
# This gives a weight to each class that is inversely proportional to its frequency
classes = np.array([0, 1])  # define the class labels explicitly
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)  # returns array of weights

# Build a class_weight dict that sklearn classifiers accept
class_weight_dict = {0: weights[0], 1: weights[1]}

# Print the computed weights so the team can review them
print(f"Class 0 (legit) weight:  {weights[0]:.4f}")
print(f"Class 1 (fraud) weight:  {weights[1]:.4f}")
print(f"Fraud weight is {weights[1]/weights[0]:.1f}x higher than legit weight\n")

# Train a Logistic Regression with the computed class weights
# class_weight parameter tells sklearn to multiply each sample's loss by its class weight
model = LogisticRegression(class_weight=class_weight_dict, random_state=42, max_iter=1000)
model.fit(X, y)  # fit on original unbalanced data — weights handle the imbalance during training

# Predict on the training data to show that the model still finds all fraud rows
preds = model.predict(X)  # predict class labels
print("Predictions:", preds.tolist())
print("True labels: ", y.tolist())
Class 0 (legit) weight:  0.6250
Class 1 (fraud) weight:  2.5000
Fraud weight is 4.0x higher than legit weight

Predictions: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]
True labels:  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]

What just happened?

compute_class_weight calculated that the fraud class (4 out of 20 rows) should receive a weight of 2.5 — exactly 4× the weight of the legitimate class. During training, every fraud misclassification costs the model 4× more than misclassifying a legitimate transaction. The model correctly identified all four fraud rows without a single false positive. This result is only possible because the engineered amount_zscore feature cleanly separated classes — no raw-amount model would achieve this on data this small.

Choosing Between SMOTE and Class Weights

Both approaches solve the same problem via different mechanisms. Here is when to reach for each:

Factor           | SMOTE                                       | Class Weights
Mechanism        | Adds synthetic rows to the minority class   | Reweights the loss function during training
Data size impact | Increases dataset size — slower training    | No size change — same training speed
Model support    | Any model — model-agnostic preprocessing    | Only models that accept a class_weight param
Best for         | Very high imbalance (>50:1), neural nets    | Moderate imbalance, sklearn classifiers
Leakage risk     | High — must be applied inside CV folds only | Low — compute weights from the training fold only

Teacher's Note

SMOTE has a leakage trap that catches many practitioners. If you apply SMOTE before cross-validation splits, synthetic minority samples generated from training data will appear in your validation folds. Your CV scores will be optimistically biased. The correct pattern is to apply SMOTE inside the cross-validation loop — only on the training fold, never the validation fold. In sklearn pipelines, wrap SMOTE in an imblearn.pipeline.Pipeline (not sklearn's Pipeline) to enforce this automatically.

Practice Questions

1. Which oversampling technique generates synthetic minority-class samples by interpolating between existing minority-class points in feature space?



2. Before applying resampling techniques like SMOTE, you should first perform ________ ________ to improve class separation in feature space.



3. The sklearn LogisticRegression parameter that penalises misclassifying the minority class more heavily during training is called ________ .



Quiz

1. To avoid leakage when using SMOTE with cross-validation, you should:


2. Setting class_weight='balanced' in a sklearn classifier does what during training?


3. Which engineered feature is most likely to separate fraud from legitimate transactions when raw transaction amounts overlap heavily between classes?


Up Next · Lesson 36

PCA Feature Reduction

When you have 200 features and most of them say the same thing, PCA compresses them into a handful of components that capture almost all the variance — without losing the signal.