Feature Engineering Course
Feature Engineering for Classification
Classification models draw boundaries. Your job as a feature engineer is to reshape the input space so those boundaries become as simple and clean as possible — ideally a straight line through a well-separated cloud of points.
Feature engineering for classification is about maximising class separation in feature space. Every transformation, ratio, and encoding you add should make the classes more distinguishable — pushing class 0 and class 1 further apart so a decision boundary can sit cleanly between them. Features that overlap heavily between classes add noise, not signal.
What Makes a Feature Good for Classification
A feature is good for classification if its distribution looks meaningfully different across classes. If the histograms of feature values for class 0 and class 1 look almost identical — same center, same spread — that feature contributes almost nothing to the decision boundary. The model will ignore it or, worse, overfit to its noise.
Conversely, a feature where class 0 clusters tightly around 0.2 and class 1 clusters around 0.8 is a gift. The model barely needs to learn anything — the feature does the work. Feature engineering for classification is the process of constructing more of those gifts from your raw data.
Weak Feature for Classification
Heavy overlap between class distributions. The model cannot find a clean split. Adding more trees or tuning hyperparameters will not fix this — the raw feature simply doesn't separate the classes.
Strong Feature for Classification
Minimal overlap — each class occupies a distinct region of the feature's range. Even a simple threshold splits the classes cleanly. This is what engineered features should aspire to.
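The weak/strong contrast is easy to check numerically. Below is a small synthetic sketch — the distributions and thresholds are invented for illustration, not drawn from the loan dataset used later in this lesson. One simulated feature has heavily overlapping class distributions, the other has well-separated ones, and we compare the gap between class means.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the comparison is reproducible
n = 500                          # samples per class

def make_feature(mu0, mu1, sd):
    """Simulate one feature: n class-0 draws around mu0, n class-1 draws around mu1."""
    return pd.DataFrame({
        'feature': np.concatenate([rng.normal(mu0, sd, n), rng.normal(mu1, sd, n)]),
        'target':  np.repeat([0, 1], n),
    })

weak = make_feature(0.50, 0.52, 0.15)    # class distributions almost identical
strong = make_feature(0.20, 0.80, 0.05)  # classes occupy distinct regions

# A large gap between class means relative to the spread = usable separation
for name, df in [('weak', weak), ('strong', strong)]:
    stats = df.groupby('target')['feature'].agg(['mean', 'std'])
    gap = stats.loc[1, 'mean'] - stats.loc[0, 'mean']
    print(f"{name:>6}: gap between class means = {gap:+.3f} (spread ~{stats['std'].mean():.2f})")
```

The weak feature's gap is a fraction of its spread, so the histograms overlap almost completely; the strong feature's gap is many times its spread, so even a fixed threshold near 0.5 would classify nearly perfectly.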
Five Feature Engineering Moves for Classification
Target Encoding
Replace each category with the mean target value for that category. A postal code where 80% of applicants default gets encoded as 0.80. This compresses high-cardinality categoricals into a single numeric column that directly reflects class probability — far more powerful than one-hot encoding for high-cardinality features.
Class Probability Ratio Features
Compute the ratio of class 1 frequency to class 0 frequency within a category. This weight-of-evidence encoding is used heavily in credit scoring and fraud detection — it directly encodes how much each category shifts the class odds.
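A minimal weight-of-evidence sketch on a tiny invented fraud dataset (the `channel` column, the smoothing constant `eps`, and the sign convention are illustrative — sign conventions vary between references, and production code would typically reach for a library encoder such as category_encoders' WOEEncoder):

```python
import numpy as np
import pandas as pd

# Tiny fraud dataset — 'channel' is the categorical, 'fraud' the binary target
df = pd.DataFrame({
    'channel': ['web', 'phone', 'web', 'branch', 'phone', 'phone', 'web', 'branch'],
    'fraud':   [0, 1, 0, 0, 1, 0, 0, 1],
})

# Per-category class counts, with a small constant to avoid log(0) or division
# by zero for categories that contain only one class
eps = 0.5
stats = df.groupby('channel')['fraud'].agg(['sum', 'count'])
n_fraud = stats['sum'] + eps                   # class-1 cases per category
n_clean = stats['count'] - stats['sum'] + eps  # class-0 cases per category

# Weight of evidence: log-ratio of each category's share of class 1
# to its share of class 0 — positive values shift the odds toward class 1
woe = np.log((n_fraud / n_fraud.sum()) / (n_clean / n_clean.sum()))
df['channel_woe'] = df['channel'].map(woe)
print(woe.round(3).to_string())
```

Here `phone` gets a positive score (it is over-represented among fraud cases) and `web` a negative one — exactly the "how much does this category shift the odds" signal the paragraph describes.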
Threshold-Based Binary Features
If domain knowledge tells you a credit score below 650 almost always means rejection, create a binary column: credit_score < 650 → 1. Binary splits give tree models an extremely cheap, clean decision node. They also make linear models aware of threshold effects they would otherwise smooth over.
Cross-Feature Ratio Features
Loan-to-income ratio. Debt-to-equity. Click-through rate. Dividing one feature by another often creates a normalised signal where the class separation is much stronger than in either raw feature alone. The ratio captures the relationship between two measurements, which is often what the outcome actually depends on.
Count and Aggregation Features
Number of previous defaults. Number of missed payments in the last 6 months. Number of devices used to log in. Count-based features often separate classes sharply because risky behaviour is cumulative — one missed payment is noise, five is a pattern.
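Count features usually come from aggregating a raw event log up to the entity level and joining the result back. A sketch using a hypothetical missed-payments table (the table and column names are invented for illustration):

```python
import pandas as pd

# Hypothetical event log: one row per missed payment
payments = pd.DataFrame({
    'customer_id': [101, 101, 102, 103, 103, 103, 103, 103],
    'days_late':   [5, 12, 3, 30, 45, 15, 60, 22],
})
customers = pd.DataFrame({'customer_id': [101, 102, 103, 104]})

# Collapse the event log into per-customer count and severity features
agg = (payments.groupby('customer_id')['days_late']
       .agg(n_missed='count', max_days_late='max')  # named aggregations
       .reset_index())

# Left-join back onto the customer table; customers with no events get 0, not NaN
customers = customers.merge(agg, on='customer_id', how='left').fillna(0)
print(customers)
```

Note the `fillna(0)`: a customer absent from the event log has zero missed payments, and leaving NaN there would either crash the model or silently encode "no events" as "unknown".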
Target Encoding — Turning Categories into Class Probabilities
The scenario:
You're a data scientist at a bank building a loan default model. The dataset has an occupation column with 4 unique values. One-hot encoding would produce 4 new columns with no class information baked in. Target encoding replaces each occupation with its historical default rate — a single number that directly tells the model how risky that group is. You'll implement it manually and verify it improves class separation.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a loan default DataFrame — 12 rows, realistic occupation distribution
loan_df = pd.DataFrame({
'loan_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'occupation': ['Engineer','Driver','Engineer','Nurse','Driver','Manager',
                  'Driver','Nurse','Engineer','Driver','Manager','Driver'], # 4 unique occupations
'income': [85000, 32000, 92000, 48000, 28000, 110000,
31000, 52000, 78000, 29000, 105000, 35000], # annual income
'default': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1] # target: 1=defaulted
})
# --- Target encoding: compute default rate per occupation on training data ---
# IMPORTANT: in production, compute this ONLY on the training fold to avoid leakage
# Here we compute on the full small dataset for demonstration
# Step 1: compute mean target (default rate) for each occupation
target_enc = loan_df.groupby('occupation')['default'].mean() # Series: occupation → default rate
print("Default rate per occupation (target encoding map):\n")
print(target_enc.round(3).to_string())
# Step 2: map the encoding back to every row — replaces the string with a float
loan_df['occupation_target_enc'] = loan_df['occupation'].map(target_enc)
# Step 3: measure how well the new feature separates the classes
# Compare mean occupation_target_enc for defaulters vs non-defaulters
# A large gap = strong class separation
sep = loan_df.groupby('default')['occupation_target_enc'].mean()
print(f"\nMean target-encoded occupation:")
print(f" Non-defaulters (class 0): {sep[0]:.3f}")
print(f" Defaulters (class 1): {sep[1]:.3f}")
print(f" Separation gap: {sep[1] - sep[0]:.3f}")
# Show the final DataFrame
print("\nDataFrame with target encoding:")
print(loan_df[['loan_id','occupation','income','occupation_target_enc','default']].to_string(index=False))
Default rate per occupation (target encoding map):

occupation
Driver      1.0
Engineer    0.0
Manager     0.0
Nurse       0.0

Mean target-encoded occupation:
  Non-defaulters (class 0): 0.000
  Defaulters (class 1): 1.000
  Separation gap: 1.000

DataFrame with target encoding:
loan_id occupation income occupation_target_enc default
1 Engineer 85000 0.0 0
2 Driver 32000 1.0 1
3 Engineer 92000 0.0 0
4 Nurse 48000 0.0 0
5 Driver 28000 1.0 1
6 Manager 110000 0.0 0
7 Driver 31000 1.0 1
8 Nurse 52000 0.0 0
9 Engineer 78000 0.0 0
10 Driver 29000 1.0 1
11 Manager 105000 0.0 0
12 Driver 35000 1.0 1
What just happened?
Target encoding replaced each occupation string with its historical default rate. Every Driver maps to 1.0 — every driver in this small sample defaulted — while Engineers, Nurses, and Managers all map to 0.0. The separation gap is 1.000: non-defaulters average 0.000 on this feature while defaulters average 1.000. That single engineered column separates the classes perfectly here — a clean illustration, and also exactly the kind of too-good-to-be-true score that should make you check for leakage on real data. A one-hot encoding of occupation would have spread the same information across 4 binary columns with no class probability signal embedded in any of them.
Ratio Features and Threshold Flags
The scenario:
The model still struggles to separate defaulters from non-defaulters using income alone — a high-income person can still default if their loan is enormous relative to earnings. You'll add a loan_to_income ratio and two threshold-based binary flags: one for high-risk ratios and one for below-minimum-credit-score applicants. These three new features should push the classes apart in ways the raw columns cannot.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a richer loan dataset — 12 rows with loan amount and credit score added
loan_df2 = pd.DataFrame({
'loan_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'income': [85000, 32000, 92000, 48000, 28000, 110000,
31000, 52000, 78000, 29000, 105000, 35000], # annual income
'loan_amount': [120000, 95000, 130000, 60000, 88000, 150000,
96000, 72000, 105000, 91000, 140000, 108000], # loan requested
'credit_score': [740, 580, 760, 695, 540, 810,
570, 710, 730, 555, 790, 560], # credit score
'default': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1] # target
})
# --- Ratio feature: loan amount divided by annual income ---
# Captures affordability — the relationship between debt size and repayment capacity
# A ratio of 3.0 means the loan is 3x the person's annual income
loan_df2['loan_to_income'] = loan_df2['loan_amount'] / (loan_df2['income'] + 1e-9)
# --- Threshold flag 1: high loan-to-income ratio ---
# Domain knowledge: ratios above 2.5 are considered high-risk in credit underwriting
loan_df2['high_lti_flag'] = (loan_df2['loan_to_income'] > 2.5).astype(int) # 1 = high risk
# --- Threshold flag 2: below minimum viable credit score ---
# Domain knowledge: credit scores below 600 indicate poor credit history
loan_df2['low_credit_flag'] = (loan_df2['credit_score'] < 600).astype(int) # 1 = poor credit
# --- Combined risk score: sum of both flags — a simple 0/1/2 risk tier ---
# 0 = neither flag raised, 1 = one flag raised, 2 = both flags raised
loan_df2['risk_score'] = loan_df2['high_lti_flag'] + loan_df2['low_credit_flag']
# Measure class separation for each new feature
print("Class separation by engineered feature:\n")
for feat in ['loan_to_income', 'high_lti_flag', 'low_credit_flag', 'risk_score']:
    grp = loan_df2.groupby('default')[feat].mean() # mean value per class
    gap = grp[1] - grp[0] # difference between class means
    print(f" {feat:<20} class0={grp[0]:.3f} class1={grp[1]:.3f} gap={gap:+.3f}")
print("\nFull DataFrame:")
print(loan_df2[['loan_id','income','loan_amount','loan_to_income',
'high_lti_flag','low_credit_flag','risk_score','default']].round(2).to_string(index=False))
Class separation by engineered feature:

 loan_to_income       class0=1.358 class1=3.086 gap=+1.729
 high_lti_flag        class0=0.000 class1=1.000 gap=+1.000
 low_credit_flag      class0=0.000 class1=1.000 gap=+1.000
 risk_score           class0=0.000 class1=2.000 gap=+2.000
Full DataFrame:
loan_id income loan_amount loan_to_income high_lti_flag low_credit_flag risk_score default
1 85000 120000 1.41 0 0 0 0
2 32000 95000 2.97 1 1 2 1
3 92000 130000 1.41 0 0 0 0
4 48000 60000 1.25 0 0 0 0
5 28000 88000 3.14 1 1 2 1
6 110000 150000 1.36 0 0 0 0
7 31000 96000 3.10 1 1 2 1
8 52000 72000 1.38 0 0 0 0
9 78000 105000 1.35 0 0 0 0
10 29000 91000 3.14 1 1 2 1
11 105000 140000 1.33 0 0 0 0
12 35000 108000 3.09 1 1 2 1
What just happened?
Every single defaulter in this dataset has both high_lti_flag = 1 and low_credit_flag = 1, giving them a risk_score of 2, while every non-defaulter scores 0 on both flags. Each flag achieves a clean gap of 1.000 — no non-defaulter has a credit score below 600 or a ratio above 2.5 in this sample. The loan_to_income ratio has a gap of +1.729: defaulters average a ratio of about 3.1 while non-defaulters average about 1.36. Income and loan amount individually would not reveal this — defaulters here actually request slightly smaller loans on average — it's the relationship between the two measurements that carries the signal.
Measuring and Validating Class Separation with Point-Biserial Correlation
Before committing a feature to the model, you want a quick, reliable number that answers: does this feature actually separate the classes? Point-biserial correlation measures the linear relationship between a continuous feature and a binary target. It runs from −1 to +1, with values near ±1 indicating strong separation and values near 0 indicating no useful signal.
The scenario:
You've now engineered six features for the loan default model. Before training, you want a ranked audit of all features by their separating power — raw and engineered together. This will tell you which features to prioritise, which are redundant, and whether the engineered features outperform the originals.
# Import pandas, numpy, and scipy for point-biserial correlation
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr # measures correlation between continuous feature and binary target
# Build the full feature set — raw + all engineered features
loan_full = pd.DataFrame({
'income': [85000, 32000, 92000, 48000, 28000, 110000,
31000, 52000, 78000, 29000, 105000, 35000],
'loan_amount': [120000, 95000, 130000, 60000, 88000, 150000,
96000, 72000, 105000, 91000, 140000, 108000],
'credit_score': [740, 580, 760, 695, 540, 810,
570, 710, 730, 555, 790, 560],
    'occupation_target_enc':[0.0, 1.0, 0.0, 0.0, 1.0, 0.0,
                             1.0, 0.0, 0.0, 1.0, 0.0, 1.0], # from target encoding block above
'loan_to_income': [1.41, 2.97, 1.41, 1.25, 3.14, 1.36,
3.10, 1.38, 1.35, 3.14, 1.33, 3.09],
'high_lti_flag': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
'low_credit_flag': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
'risk_score': [0, 2, 0, 0, 2, 0, 2, 0, 0, 2, 0, 2],
'default': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1] # binary target
})
# Compute point-biserial correlation for every feature vs the binary target
# pointbiserialr returns (correlation, p_value) — we use the correlation coefficient
features = [col for col in loan_full.columns if col != 'default'] # all columns except target
results = []
for feat in features:
    corr, pval = pointbiserialr(loan_full[feat], loan_full['default']) # (feature, target)
    results.append({'feature': feat, 'correlation': round(corr, 3), 'p_value': round(pval, 4)})
# Sort by absolute correlation descending — strongest separators at the top
results_df = pd.DataFrame(results).sort_values('correlation', key=abs, ascending=False)
results_df['type'] = results_df['feature'].apply(
lambda x: 'ENGINEERED' if x in ['occupation_target_enc','loan_to_income',
'high_lti_flag','low_credit_flag','risk_score']
else 'RAW'
) # label each feature as raw or engineered for comparison
# Print the ranked feature audit
print("Feature separation audit — ranked by |point-biserial correlation|:\n")
print(f" {'Feature':<25} {'Type':<12} {'Correlation':>12} {'p-value':>10}")
print(" " + "-"*62)
for _, row in results_df.iterrows():
    print(f" {row['feature']:<25} {row['type']:<12} {row['correlation']:>12.3f} {row['p_value']:>10.4f}")
Feature separation audit — ranked by |point-biserial correlation|:

 Feature                   Type          Correlation    p-value
 --------------------------------------------------------------
 occupation_target_enc     ENGINEERED          1.000     0.0000
 high_lti_flag             ENGINEERED          1.000     0.0000
 low_credit_flag           ENGINEERED          1.000     0.0000
 risk_score                ENGINEERED          1.000     0.0000
 loan_to_income            ENGINEERED          0.998     0.0000
 credit_score              RAW                -0.949     0.0000
 income                    RAW                -0.823     0.0010
 loan_amount               RAW                -0.296     0.3500
What just happened?
Every engineered feature outperforms the raw features in class separation. The four flag-style and encoded features all reach a perfect 1.000 — in this 12-row sample they split the classes exactly, which on real data would be a red flag for leakage rather than a triumph. loan_to_income follows at 0.998, still ahead of the best raw feature, credit_score, at −0.949. The negative sign on credit_score and income means higher values predict class 0 (non-default) — both are protective factors. loan_amount is the outlier: its correlation of −0.296 carries a p-value of 0.35, so on its own it tells the model almost nothing — yet dividing it by income produced a near-perfect separator. The signal was in the relationship, not the raw column.
Target Encoding Leakage — The Most Common Classification Mistake
Target encoding has the highest leakage risk of any classification feature engineering technique. If you compute the encoding on the full dataset before splitting, each row's own target value influences the encoding that will be used as its feature — the model sees a signal it would never have at inference time. This produces absurdly high training scores and terrible generalisation.
The Leaky Way — Never Do This
Compute occupation default rates on the full dataset. Split into train/test. The test set carries encodings that were influenced by its own target values. The model sees future information disguised as a feature.
The Safe Way — Always Use This
Split first. Compute occupation default rates on the training set only. Apply those training-derived rates to the test set. Categories that appear only in test get the global training mean as a fallback value.
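The safe procedure above can be sketched as follows. The positional split and the `Teacher` category are contrived for illustration — in practice the split comes from train_test_split or your CV folds:

```python
import pandas as pd

df = pd.DataFrame({
    'occupation': ['Engineer', 'Driver', 'Nurse', 'Driver', 'Manager',
                   'Engineer', 'Driver', 'Nurse', 'Teacher', 'Driver'],
    'default':    [0, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Step 1: split FIRST (a simple positional split for illustration)
train, test = df.iloc[:7].copy(), df.iloc[7:].copy()

# Step 2: compute the encoding map on training rows ONLY
enc = train.groupby('occupation')['default'].mean()
global_mean = train['default'].mean()  # fallback for categories unseen in train

# Step 3: apply the train-derived map to both sets
train['occ_enc'] = train['occupation'].map(enc)
test['occ_enc'] = test['occupation'].map(enc).fillna(global_mean)  # 'Teacher' -> global mean
print(test)
```

`Teacher` never appears in the training rows, so `map` returns NaN for it and `fillna(global_mean)` substitutes the training set's overall default rate — no test-set target value ever touches the encoding.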
Inside Cross-Validation — Use Leave-One-Out Encoding
For cross-validation, use leave-one-out target encoding: when computing the encoding for row i, exclude row i's own target value from the mean calculation. This is what category_encoders.LeaveOneOutEncoder implements; the library's TargetEncoder instead regularises small-group estimates by smoothing them toward the global mean.
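Leave-one-out encoding can be computed vectorised with two groupby transforms: for each row, subtract its own target from the group sum before averaging. A minimal sketch on invented data — note that a category containing a single row produces 0/0 here, so real implementations fall back to the global mean or add smoothing:

```python
import pandas as pd

df = pd.DataFrame({
    'occupation': ['Driver', 'Driver', 'Driver', 'Nurse', 'Nurse'],
    'default':    [1, 0, 1, 0, 0],
})

# Leave-one-out: encode row i with the mean target of its category
# EXCLUDING row i itself: (group sum - own target) / (group count - 1)
grp = df.groupby('occupation')['default']
df['occ_loo'] = (grp.transform('sum') - df['default']) / (grp.transform('count') - 1)
print(df)
```

The first Driver row (a defaulter) is encoded as 0.5 — the mean of the *other two* Drivers — while the second (a non-defaulter) is encoded as 1.0. Each row's own label never leaks into its own feature value.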
One-Hot vs Target Encoding — Choosing the Right Tool
The choice between one-hot and target encoding depends on cardinality and model type. Here's the decision framework:
| Factor | One-Hot Encoding | Target Encoding |
|---|---|---|
| Best cardinality | Low (2–15 unique values) | High (15+ unique values) |
| Class signal embedded | No — model must learn it | Yes — default rate is explicit |
| Leakage risk | None | High — must compute on train only |
| Unseen categories | Encoded as all zeros (safe) | Falls back to global mean |
| Works with linear models | Yes — essential for LR | Yes — single numeric column |
| Works with tree models | Yes — but inefficient at high cardinality | Yes — and far more efficient |
Teacher's Note
Point-biserial correlation only measures linear separation between a feature and a binary target. A feature with a low point-biserial score might still be valuable if its relationship with the target is nonlinear — for example, a U-shaped or threshold relationship. For a more complete picture, also check the mean and variance of each feature grouped by class (groupby('target')[feature].agg(['mean','std'])), and consider mutual information score (sklearn.feature_selection.mutual_info_classif) which captures nonlinear dependencies. Use both together for a robust pre-model feature selection audit.
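To see why the two measures complement each other, here is a quick synthetic check (assuming scipy and scikit-learn are available): a feature with a perfectly predictive but U-shaped relationship to the target scores near zero on point-biserial correlation yet clearly nonzero on mutual information.

```python
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = (np.abs(x) > 1).astype(int)  # U-shaped rule: both tails belong to class 1

r, _ = pointbiserialr(x, y)                                       # linear measure
mi = mutual_info_classif(x.reshape(-1, 1), y, random_state=0)[0]  # nonlinear measure

print(f"point-biserial r   = {r:+.3f}")  # near zero — the linear measure misses it
print(f"mutual information = {mi:.3f}")  # clearly positive — the dependency is detected
```

The class means of x are identical by symmetry, so the point-biserial score collapses to roughly zero even though y is a deterministic function of x; mutual information, which makes no linearity assumption, flags the feature as highly informative.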
Practice Questions
1. The technique that replaces each category in a column with the mean target value for that category is called ________ ________.
2. The correlation coefficient used to measure linear class separation between a continuous feature and a binary target is called ________ ________ correlation.
3. To avoid target leakage, target encoding values must be computed only on the ________ set, then applied to the test set.
Quiz
1. Target encoding is most appropriate for which type of column?
2. A model predicting loan default has access to loan_amount and income as raw features. Which feature engineering step is most likely to improve class separation?
3. Why is it recommended to use both point-biserial correlation AND mutual information when auditing features for a classification model?
Up Next · Lesson 39
Feature Engineering for NLP
From raw text to model-ready features — TF-IDF, n-grams, sentiment scores, and embedding-based features that turn language into numbers a classifier can learn from.