Feature Engineering Lesson 38 – FE for Classification | Dataplexa
Advanced Level · Lesson 38

Feature Engineering for Classification

Classification models draw boundaries. Your job as a feature engineer is to reshape the input space so those boundaries become as simple and clean as possible — ideally a straight line through a well-separated cloud of points.

Feature engineering for classification is about maximising class separation in feature space. Every transformation, ratio, and encoding you add should make the classes more distinguishable — pushing class 0 and class 1 further apart so a decision boundary can sit cleanly between them. Features that overlap heavily between classes add noise, not signal.

What Makes a Feature Good for Classification

A feature is good for classification if its distribution looks meaningfully different across classes. If the histograms of feature values for class 0 and class 1 look almost identical — same center, same spread — that feature contributes almost nothing to the decision boundary. The model will ignore it or, worse, overfit to its noise.

Conversely, a feature where class 0 clusters tightly around 0.2 and class 1 clusters around 0.8 is a gift. The model barely needs to learn anything — the feature does the work. Feature engineering for classification is the process of constructing more of those gifts from your raw data.

Weak Feature for Classification

Heavy overlap between class distributions. The model cannot find a clean split. Adding more trees or tuning hyperparameters will not fix this — the raw feature simply doesn't separate the classes.

Strong Feature for Classification

Minimal overlap — each class occupies a distinct region of the feature's range. Even a simple threshold splits the classes cleanly. This is what engineered features should aspire to.
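The overlap idea can be quantified without plotting. A minimal sketch with synthetic data (all names and numbers here are illustrative): draw one weak and one strong feature, then compare the standardised gap between class means, a quick stand-in for Cohen's d.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weak feature: both classes drawn from nearly the same distribution
weak_c0 = rng.normal(0.50, 0.15, 500)   # class 0 values
weak_c1 = rng.normal(0.52, 0.15, 500)   # class 1 values, heavy overlap

# Strong feature: classes occupy distinct regions of the range
strong_c0 = rng.normal(0.20, 0.08, 500)
strong_c1 = rng.normal(0.80, 0.08, 500)

def separation(c0, c1):
    """Standardised gap between class means (akin to Cohen's d)."""
    pooled_sd = np.sqrt((c0.var() + c1.var()) / 2)
    return abs(c1.mean() - c0.mean()) / pooled_sd

print(f"weak feature separation:   {separation(weak_c0, weak_c1):.2f}")    # small
print(f"strong feature separation: {separation(strong_c0, strong_c1):.2f}")  # large
```

A separation well below 1 means heavy overlap; values of 2 or more mean a simple threshold already does most of the work.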

Five Feature Engineering Moves for Classification

1

Target Encoding

Replace each category with the mean target value for that category. A postal code where 80% of applicants default gets encoded as 0.80. This compresses high-cardinality categoricals into a single numeric column that directly reflects class probability — far more powerful than one-hot encoding for high-cardinality features.

2

Class Probability Ratio Features

Compute the ratio of class 1 frequency to class 0 frequency within a category. The logarithm of this ratio is the weight-of-evidence (WoE) encoding used heavily in credit scoring and fraud detection — it directly encodes how much each category shifts the class odds.
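A minimal sketch of weight-of-evidence encoding on a hypothetical fraud-by-channel table (the column names are made up for illustration; the +0.5 terms are one common smoothing choice that guards against log(0) in pure categories):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'channel': ['web', 'web', 'branch', 'branch', 'branch', 'phone', 'phone', 'web'],
    'fraud':   [1,     1,     0,        0,        1,        0,       0,       0],
})

# Per-category counts of each class
events     = df.groupby('channel')['fraud'].sum()              # class-1 count per category
non_events = df.groupby('channel')['fraud'].count() - events   # class-0 count per category

# WoE = ln( (% of all class-1 cases in this category) / (% of all class-0 cases) )
pct_event     = (events + 0.5) / (events.sum() + 0.5)
pct_non_event = (non_events + 0.5) / (non_events.sum() + 0.5)
woe = np.log(pct_event / pct_non_event)

df['channel_woe'] = df['channel'].map(woe)
print(woe.round(3))   # positive = category shifts odds toward class 1
```

Positive WoE means the category is over-represented among class 1; negative means it leans toward class 0; near zero means it carries little evidence either way.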

3

Threshold-Based Binary Features

If domain knowledge tells you a credit score below 650 almost always means rejection, create a binary column: credit_score < 650 → 1. Binary splits give tree models an extremely cheap, clean decision node. They also make linear models aware of threshold effects they would otherwise smooth over.

4

Cross-Feature Ratio Features

Loan-to-income ratio. Debt-to-equity. Click-through rate. Dividing one feature by another often creates a normalised signal where the class separation is much stronger than in either raw feature alone. The ratio captures the relationship between two measurements, which is often what the outcome actually depends on.

5

Count and Aggregation Features

Number of previous defaults. Number of missed payments in the last 6 months. Number of devices used to log in. Count-based features often separate classes sharply because risky behaviour is cumulative — one missed payment is noise, five is a pattern.
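Count features are typically built by aggregating an event-level table up to the entity level. A sketch using a hypothetical payment-history table (column names are illustrative):

```python
import pandas as pd

# Hypothetical payment history: one row per payment event
payments = pd.DataFrame({
    'customer_id': [101, 101, 101, 102, 102, 103, 103, 103, 103],
    'missed':      [0,   1,   1,   0,   0,   1,   1,   1,   0],
})

# Aggregate events up to one row per customer
counts = payments.groupby('customer_id').agg(
    n_payments=('missed', 'size'),   # total payment events
    n_missed=('missed', 'sum'),      # cumulative missed payments
)
counts['missed_rate'] = counts['n_missed'] / counts['n_payments']
print(counts)   # customer 103: 4 payments, 3 missed
```

These per-customer columns can then be joined back onto the modelling table keyed on customer_id.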

Target Encoding — Turning Categories into Class Probabilities

The scenario:

You're a data scientist at a bank building a loan default model. The dataset has an occupation column with 4 unique values. One-hot encoding produces 4 new columns with no class information baked in. Target encoding replaces each occupation with its historical default rate — a single number that directly tells the model how risky that group is. You'll implement it manually and verify it improves class separation; the split-safe version of the workflow is covered at the end of the lesson.

# Import pandas and numpy
import pandas as pd
import numpy as np

# Create a loan default DataFrame — 12 rows, realistic occupation distribution
loan_df = pd.DataFrame({
    'loan_id':    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'occupation': ['Engineer','Driver','Engineer','Nurse','Driver','Manager',
                   'Driver','Nurse','Engineer','Driver','Manager','Driver'],   # 4 unique occupations
    'income':     [85000, 32000, 92000, 48000, 28000, 110000,
                   31000, 52000, 78000, 29000, 105000, 35000],                # annual income
    'default':    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]                      # target: 1=defaulted
})

# --- Target encoding: compute default rate per occupation on training data ---
# IMPORTANT: in production, compute this ONLY on the training fold to avoid leakage
# Here we compute on the full small dataset for demonstration

# Step 1: compute mean target (default rate) for each occupation
target_enc = loan_df.groupby('occupation')['default'].mean()  # Series: occupation → default rate
print("Default rate per occupation (target encoding map):\n")
print(target_enc.round(3).to_string())

# Step 2: map the encoding back to every row — replaces the string with a float
loan_df['occupation_target_enc'] = loan_df['occupation'].map(target_enc)

# Step 3: measure how well the new feature separates the classes
# Compare mean occupation_target_enc for defaulters vs non-defaulters
# A large gap = strong class separation
sep = loan_df.groupby('default')['occupation_target_enc'].mean()
print(f"\nMean target-encoded occupation:")
print(f"  Non-defaulters (class 0): {sep[0]:.3f}")
print(f"  Defaulters     (class 1): {sep[1]:.3f}")
print(f"  Separation gap:           {sep[1] - sep[0]:.3f}")

# Show the final DataFrame
print("\nDataFrame with target encoding:")
print(loan_df[['loan_id','occupation','income','occupation_target_enc','default']].to_string(index=False))
Default rate per occupation (target encoding map):

occupation
Driver      1.0
Engineer    0.0
Manager     0.0
Nurse       0.0

Mean target-encoded occupation:
  Non-defaulters (class 0): 0.000
  Defaulters     (class 1): 1.000
  Separation gap:           1.000

DataFrame with target encoding:
 loan_id occupation  income  occupation_target_enc  default
       1   Engineer   85000                    0.0        0
       2     Driver   32000                    1.0        1
       3   Engineer   92000                    0.0        0
       4      Nurse   48000                    0.0        0
       5     Driver   28000                    1.0        1
       6    Manager  110000                    0.0        0
       7     Driver   31000                    1.0        1
       8      Nurse   52000                    0.0        0
       9   Engineer   78000                    0.0        0
      10     Driver   29000                    1.0        1
      11    Manager  105000                    0.0        0
      12     Driver   35000                    1.0        1

What just happened?

Target encoding replaced each occupation string with its historical default rate. Every Driver maps to 1.0 (all five drivers defaulted), while Engineers, Nurses, and Managers all map to 0.0. The separation gap is 1.000: non-defaulters average 0.000 on this feature while defaulters average 1.000, so this single engineered column separates the classes perfectly. Treat numbers this clean with suspicion, though. On a 12-row sample the encoding has effectively memorised the target, which is exactly the leakage trap covered at the end of this lesson. A one-hot encoding of occupation would have spread this same information across 4 binary columns with no class probability signal embedded in any of them.

Ratio Features and Threshold Flags

The scenario:

The model still struggles to separate defaulters from non-defaulters using income alone — a high-income person can still default if their loan is enormous relative to earnings. You'll add a loan_to_income ratio and two threshold-based binary flags: one for high-risk ratios and one for below-minimum-credit-score applicants. These three new features should push the classes apart in ways the raw columns cannot.

# Import pandas and numpy
import pandas as pd
import numpy as np

# Create a richer loan dataset — 12 rows with loan amount and credit score added
loan_df2 = pd.DataFrame({
    'loan_id':      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'income':       [85000, 32000, 92000, 48000, 28000, 110000,
                     31000, 52000, 78000, 29000, 105000, 35000],   # annual income
    'loan_amount':  [120000, 95000, 130000, 60000, 88000, 150000,
                     96000, 72000, 105000, 91000, 140000, 108000], # loan requested
    'credit_score': [740, 580, 760, 695, 540, 810,
                     570, 710, 730, 555, 790, 560],                # credit score
    'default':      [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]         # target
})

# --- Ratio feature: loan amount divided by annual income ---
# Captures affordability — the relationship between debt size and repayment capacity
# A ratio of 3.0 means the loan is 3x the person's annual income
loan_df2['loan_to_income'] = loan_df2['loan_amount'] / (loan_df2['income'] + 1e-9)

# --- Threshold flag 1: high loan-to-income ratio ---
# Domain knowledge: ratios above 2.5 are considered high-risk in credit underwriting
loan_df2['high_lti_flag'] = (loan_df2['loan_to_income'] > 2.5).astype(int)  # 1 = high risk

# --- Threshold flag 2: below minimum viable credit score ---
# Domain knowledge: credit scores below 600 indicate poor credit history
loan_df2['low_credit_flag'] = (loan_df2['credit_score'] < 600).astype(int)  # 1 = poor credit

# --- Combined risk score: sum of both flags — a simple 0/1/2 risk tier ---
# 0 = neither flag raised, 1 = one flag raised, 2 = both flags raised
loan_df2['risk_score'] = loan_df2['high_lti_flag'] + loan_df2['low_credit_flag']

# Measure class separation for each new feature
print("Class separation by engineered feature:\n")
for feat in ['loan_to_income', 'high_lti_flag', 'low_credit_flag', 'risk_score']:
    grp = loan_df2.groupby('default')[feat].mean()   # mean value per class
    gap = grp[1] - grp[0]                            # difference between class means
    print(f"  {feat:<20}  class0={grp[0]:.3f}  class1={grp[1]:.3f}  gap={gap:+.3f}")

print("\nFull DataFrame:")
print(loan_df2[['loan_id','income','loan_amount','loan_to_income',
                'high_lti_flag','low_credit_flag','risk_score','default']].round(2).to_string(index=False))
Class separation by engineered feature:

  loan_to_income        class0=1.358  class1=3.086  gap=+1.729
  high_lti_flag         class0=0.000  class1=1.000  gap=+1.000
  low_credit_flag       class0=0.000  class1=1.000  gap=+1.000
  risk_score            class0=0.000  class1=2.000  gap=+2.000

Full DataFrame:
 loan_id  income  loan_amount  loan_to_income  high_lti_flag  low_credit_flag  risk_score  default
       1   85000       120000            1.41              0                0           0        0
       2   32000        95000            2.97              1                1           2        1
       3   92000       130000            1.41              0                0           0        0
       4   48000        60000            1.25              0                0           0        0
       5   28000        88000            3.14              1                1           2        1
       6  110000       150000            1.36              0                0           0        0
       7   31000        96000            3.10              1                1           2        1
       8   52000        72000            1.38              0                0           0        0
       9   78000       105000            1.35              0                0           0        0
      10   29000        91000            3.14              1                1           2        1
      11  105000       140000            1.33              0                0           0        0
      12   35000       108000            3.09              1                1           2        1

What just happened?

Every single defaulter in this dataset has both high_lti_flag = 1 and low_credit_flag = 1, giving them a risk_score of 2, while every non-defaulter scores 0. On this small sample both flags separate the classes perfectly (gap +1.000 each, and +2.000 for the combined risk score) because no non-defaulter crosses either threshold. The continuous loan_to_income ratio has a gap of +1.729: defaulters average a ratio of about 3.09 while non-defaulters average about 1.36. Income and loan amount individually would not reveal this — it's the relationship between them that carries the signal.

Measuring and Validating Class Separation with Point-Biserial Correlation

Before committing a feature to the model, you want a quick, reliable number that answers: does this feature actually separate the classes? Point-biserial correlation measures the linear relationship between a continuous feature and a binary target. It runs from −1 to +1, with values near ±1 indicating strong separation and values near 0 indicating no useful signal.
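Under the hood the statistic is just the Pearson correlation specialised to a binary variable. A minimal sketch of the formula with illustrative data (the helper name is made up): r_pb = (M1 − M0) / s · √(n1·n0 / n²), where M1 and M0 are the feature means within each class and s is the population standard deviation of the feature.

```python
import numpy as np

def point_biserial(x, y):
    """Point-biserial correlation between continuous x and binary y (0/1).

    r_pb = (M1 - M0) / s_x * sqrt(n1 * n0 / n^2), with s_x the population
    std of x; algebraically identical to the Pearson correlation of x and y.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=int)
    m1, m0 = x[y == 1].mean(), x[y == 0].mean()
    n1, n0, n = (y == 1).sum(), (y == 0).sum(), len(y)
    return (m1 - m0) / x.std() * np.sqrt(n1 * n0 / n**2)

# Illustrative data: the feature is clearly higher for class 1
x = [1.2, 3.1, 0.8, 2.9, 1.1, 3.4]
y = [0, 1, 0, 1, 0, 1]
print(round(point_biserial(x, y), 3))   # same value as np.corrcoef(x, y)[0, 1]
```

Because the two formulas coincide, scipy's pointbiserialr and a plain Pearson correlation against the 0/1 target always agree; the point-biserial framing just makes the class-means interpretation explicit.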

The scenario:

You've now engineered six features for the loan default model. Before training, you want a ranked audit of all features by their separating power — raw and engineered together. This will tell you which features to prioritise, which are redundant, and whether the engineered features outperform the originals.

# Import pandas, numpy, and scipy for point-biserial correlation
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr  # measures correlation between continuous feature and binary target

# Build the full feature set — raw + all engineered features
loan_full = pd.DataFrame({
    'income':               [85000, 32000, 92000, 48000, 28000, 110000,
                              31000, 52000, 78000, 29000, 105000, 35000],
    'loan_amount':          [120000, 95000, 130000, 60000, 88000, 150000,
                              96000, 72000, 105000, 91000, 140000, 108000],
    'credit_score':         [740, 580, 760, 695, 540, 810,
                              570, 710, 730, 555, 790, 560],
    'occupation_target_enc':[0.0, 1.0, 0.0, 0.0, 1.0, 0.0,
                              1.0, 0.0, 0.0, 1.0, 0.0, 1.0],   # Driver → 1.0, from target encoding block above
    'loan_to_income':       [1.41, 2.97, 1.41, 1.25, 3.14, 1.36,
                              3.10, 1.38, 1.35, 3.14, 1.33, 3.09],
    'high_lti_flag':        [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'low_credit_flag':      [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'risk_score':           [0, 2, 0, 0, 2, 0, 2, 0, 0, 2, 0, 2],
    'default':              [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]   # binary target
})

# Compute point-biserial correlation for every feature vs the binary target
# pointbiserialr returns (correlation, p_value) — we use the correlation coefficient
features = [col for col in loan_full.columns if col != 'default']  # all columns except target

results = []
for feat in features:
    corr, pval = pointbiserialr(loan_full[feat], loan_full['default'])  # (feature, target)
    results.append({'feature': feat, 'correlation': round(corr, 3), 'p_value': round(pval, 4)})

# Sort by absolute correlation descending — strongest separators at the top
results_df = pd.DataFrame(results).sort_values('correlation', key=abs, ascending=False)
results_df['type'] = results_df['feature'].apply(
    lambda x: 'ENGINEERED' if x in ['occupation_target_enc','loan_to_income',
                                      'high_lti_flag','low_credit_flag','risk_score']
               else 'RAW'
)   # label each feature as raw or engineered for comparison

# Print the ranked feature audit
print("Feature separation audit — ranked by |point-biserial correlation|:\n")
print(f"  {'Feature':<25} {'Type':<12} {'Correlation':>12} {'p-value':>10}")
print("  " + "-"*62)
for _, row in results_df.iterrows():
    print(f"  {row['feature']:<25} {row['type']:<12} {row['correlation']:>12.3f} {row['p_value']:>10.4f}")
Feature separation audit — ranked by |point-biserial correlation|:

  Feature                   Type          Correlation    p-value
  --------------------------------------------------------------
  occupation_target_enc     ENGINEERED           1.000     0.0000
  high_lti_flag             ENGINEERED           1.000     0.0000
  low_credit_flag           ENGINEERED           1.000     0.0000
  risk_score                ENGINEERED           1.000     0.0000
  loan_to_income            ENGINEERED           0.998     0.0000
  credit_score              RAW                 -0.949     0.0000
  income                    RAW                 -0.823     0.0010
  loan_amount               RAW                 -0.296     0.3500

What just happened?

Every engineered feature outperforms the raw features in class separation. The top five are all engineered: occupation_target_enc, high_lti_flag, low_credit_flag, and risk_score hit a perfect 1.000 on this 12-row sample, loan_to_income scores 0.998, and the best raw feature, credit_score, scores −0.949. The negative sign on credit_score and income means higher values predict class 0 (non-default) — both are protective factors. Notice loan_amount on its own: a weak −0.296 with p ≈ 0.35, far from significant. The raw amount carries little signal until it is divided by income. And treat the perfect 1.000 scores with the same suspicion as before: on 12 rows, engineered flags that coincide exactly with the target are indistinguishable from overfitting, which is why the leakage checks in the next section matter.

Target Encoding Leakage — The Most Common Classification Mistake

Target encoding has the highest leakage risk of any classification feature engineering technique. If you compute the encoding on the full dataset before splitting, each row's own target value influences the encoding that will be used as its feature — the model sees a signal it would never have at inference time. This produces absurdly high training scores and terrible generalisation.

The Leaky Way — Never Do This

Compute occupation default rates on the full dataset. Split into train/test. The test set carries encodings that were influenced by its own target values. The model sees future information disguised as a feature.

The Safe Way — Always Use This

Split first. Compute occupation default rates on the training set only. Apply those training-derived rates to the test set. Categories that appear only in test get the global training mean as a fallback value.
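A minimal sketch of this split-first workflow (the function names are illustrative, not a library API):

```python
import pandas as pd

def fit_target_encoding(train, cat_col, target_col):
    """Learn per-category mean targets from the training fold only."""
    mapping = train.groupby(cat_col)[target_col].mean()
    global_mean = train[target_col].mean()   # fallback for unseen categories
    return mapping, global_mean

def apply_target_encoding(df, cat_col, mapping, global_mean):
    """Apply training-derived rates; unseen categories fall back to the global mean."""
    return df[cat_col].map(mapping).fillna(global_mean)

train = pd.DataFrame({'occupation': ['Driver', 'Driver', 'Nurse', 'Engineer'],
                      'default':    [1, 0, 0, 0]})
test  = pd.DataFrame({'occupation': ['Driver', 'Teacher']})   # 'Teacher' never seen in train

mapping, global_mean = fit_target_encoding(train, 'occupation', 'default')
test['occupation_enc'] = apply_target_encoding(test, 'occupation', mapping, global_mean)
print(test)   # Driver -> 0.50, Teacher -> 0.25 (the global training mean)
```

Because the mapping is fitted on train and only applied to test, no test-set target value can leak into the feature.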

Inside Cross-Validation — Use Leave-One-Out Encoding

For cross-validation, use leave-one-out target encoding: when computing the encoding for row i, exclude row i's own target value from the mean calculation. This is what category_encoders.LeaveOneOutEncoder implements; the related category_encoders.TargetEncoder instead applies smoothing toward the global mean to regularise small-group estimates.
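The leave-one-out trick can be sketched in a few lines of pandas: subtract each row's own target from its group total before averaging. (Singleton categories divide by zero and come out as NaN; fall back to the global mean for those.)

```python
import pandas as pd

df = pd.DataFrame({'occupation': ['Driver', 'Driver', 'Driver', 'Nurse', 'Nurse'],
                   'default':    [1, 1, 0, 0, 1]})

# Leave-one-out: encode each row with its group's mean target, EXCLUDING the row itself
grp = df.groupby('occupation')['default']
df['occ_loo_enc'] = (grp.transform('sum') - df['default']) / (grp.transform('count') - 1)
print(df)   # occ_loo_enc: 0.5, 0.5, 1.0, 1.0, 0.0
```

Note how the two defaulting Drivers get 0.5 rather than their group's raw rate of 0.667: their own outcome is held out, which is precisely what removes the leakage.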

One-Hot vs Target Encoding — Choosing the Right Tool

The choice between one-hot and target encoding depends on cardinality and model type. Here's the decision framework:

Factor                   | One-Hot Encoding                          | Target Encoding
Best cardinality         | Low (2–15 unique values)                  | High (15+ unique values)
Class signal embedded    | No — model must learn it                  | Yes — default rate is explicit
Leakage risk             | None                                      | High — must compute on train only
Unseen categories        | All indicator columns zero (safe)         | Falls back to global mean
Works with linear models | Yes — essential for LR                    | Yes — single numeric column
Works with tree models   | Yes, but inefficient at high cardinality  | Yes — and far more efficient

Teacher's Note

Point-biserial correlation only measures linear separation between a feature and a binary target. A feature with a low point-biserial score might still be valuable if its relationship with the target is nonlinear — for example, a U-shaped or threshold relationship. For a more complete picture, also check the mean and variance of each feature grouped by class (groupby('target')[feature].agg(['mean','std'])), and consider the mutual information score (sklearn.feature_selection.mutual_info_classif), which captures nonlinear dependencies. Use both together for a robust pre-model feature selection audit.
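A quick demonstration of why both checks matter. This sketch assumes scikit-learn is installed and builds a deliberately U-shaped feature: class 1 lives at both extremes, so the linear correlation cancels while the dependency remains strong.

```python
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)

# U-shaped relationship: class 1 sits at BOTH extremes of the feature
x = rng.uniform(-1, 1, 600)
y = (np.abs(x) > 0.5).astype(int)   # class 1 when |x| is large, regardless of sign

r, _ = pointbiserialr(x, y)
mi = mutual_info_classif(x.reshape(-1, 1), y, random_state=0)[0]

print(f"point-biserial: {r:+.3f}")   # near zero: the linear check misses the signal
print(f"mutual info:    {mi:.3f}")   # clearly positive: the dependency is detected
```

A feature like this would be wrongly discarded by a correlation-only audit, yet a tree model splitting twice on x would separate the classes perfectly.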

Practice Questions

1. The technique that replaces each category in a column with the mean target value for that category is called ________ ________.



2. The correlation coefficient used to measure linear class separation between a continuous feature and a binary target is called ________ ________ correlation.



3. To avoid target leakage, target encoding values must be computed only on the ________ set, then applied to the test set.



Quiz

1. Target encoding is most appropriate for which type of column?


2. A model predicting loan default has access to loan_amount and income as raw features. Which feature engineering step is most likely to improve class separation?


3. Why is it recommended to use both point-biserial correlation AND mutual information when auditing features for a classification model?


Up Next · Lesson 39

Feature Engineering for NLP

From raw text to model-ready features — TF-IDF, n-grams, sentiment scores, and embedding-based features that turn language into numbers a classifier can learn from.