Feature Engineering Course
Weight of Evidence
Credit scoring teams have used Weight of Evidence for decades — long before gradient boosting existed. It turns a categorical column into a single number that measures how strongly that category separates good outcomes from bad ones. It's interpretable, leak-resistant when done correctly, and still standard in regulated financial modelling today.
Weight of Evidence (WoE) encodes each category by the log-ratio of the category's share of all events (target = 1) to its share of all non-events (target = 0). A positive WoE means the category holds a larger share of events than of non-events — its event rate sits above the overall rate, so it leans toward the positive class. A negative WoE means it skews toward non-events. Zero means the category behaves exactly like the overall population.
The WoE Formula
For each category c, Weight of Evidence is calculated as:

WoE(c) = ln( P(Events in c) / P(Non-events in c) )

where:
P(Events in c) = (count of 1s in c) / (total 1s in dataset)
P(Non-events in c) = (count of 0s in c) / (total 0s in dataset)
Positive WoE — category skews toward events (target = 1)
The category's share of all events exceeds its share of all non-events — its event rate sits above the population rate. In credit scoring, a high-WoE category is a risk flag; in conversion modelling, it marks a high-converting segment.
Zero WoE — category matches the population rate
The event rate in this category is identical to the overall dataset rate. This category provides no discriminatory information — it carries no signal above the baseline.
Negative WoE — category skews toward non-events (target = 0)
The category's share of all non-events exceeds its share of all events. In credit scoring, this is a good-risk indicator: the more negative the WoE, the cleaner the category's record.
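The three cases can be checked with quick arithmetic. A minimal sketch, using made-up share-of-events / share-of-non-events proportions (not the lesson's dataset):

```python
import numpy as np

# Positive WoE: category holds 40% of all events but only 20% of non-events
print(np.log(0.40 / 0.20))   # ln(2) ≈ 0.6931 — skews toward events

# Zero WoE: category holds equal shares of events and non-events
print(np.log(0.25 / 0.25))   # ln(1) = 0.0 — matches the population

# Negative WoE: 10% of all events vs 30% of all non-events
print(np.log(0.10 / 0.30))   # ln(1/3) ≈ -1.0986 — skews toward non-events
```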
Computing WoE from Scratch
The scenario: You're a credit risk analyst at a consumer lending firm. Your model predicts whether a loan will default within 12 months. One feature is employment_type — the borrower's employment status. You want to encode it using Weight of Evidence so the resulting column is directly interpretable by the compliance team and compatible with the logistic regression model used internally.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Loan applicant training data — employment_type is the categorical feature
loan_df = pd.DataFrame({
'loan_id': ['L01','L02','L03','L04','L05','L06',
'L07','L08','L09','L10','L11','L12'],
'employment_type': ['Full-time','Self-employed','Full-time','Contract',
'Self-employed','Full-time','Contract','Full-time',
'Self-employed','Contract','Full-time','Self-employed'],
'defaulted': [1,1,0,1,1,0,0,0,1,1,0,0]
})
# Total events (defaults) and non-events across the entire training set
total_events = loan_df['defaulted'].sum()
total_non_events = len(loan_df) - total_events
print(f"Total events (defaults): {total_events}")
print(f"Total non-events (no default): {total_non_events}\n")
# Group by employment_type and count events and non-events per category
woe_table = loan_df.groupby('employment_type')['defaulted'].agg(
events='sum',
total='count'
).reset_index()
# Non-events per category = total rows in category minus events
woe_table['non_events'] = woe_table['total'] - woe_table['events']
# Proportion of all events that fall in this category
woe_table['p_events'] = woe_table['events'] / total_events
# Proportion of all non-events that fall in this category
woe_table['p_non_events'] = woe_table['non_events'] / total_non_events
# WoE = ln(p_events / p_non_events) — add small epsilon to avoid log(0)
woe_table['woe'] = np.log(
(woe_table['p_events'] + 1e-6) / (woe_table['p_non_events'] + 1e-6)
).round(4)
# Print the full WoE table
print(woe_table[['employment_type','events','non_events',
'p_events','p_non_events','woe']].to_string(index=False))
Total events (defaults): 6
Total non-events (no default): 6
employment_type events non_events p_events p_non_events woe
Contract 2 1 0.333333 0.166667 0.6931
Full-time 1 4 0.166667 0.666667 -1.3863
Self-employed 3 1 0.500000 0.166667 1.0986
What just happened?
Self-employed borrowers have WoE = 1.10 — they represent 50% of all defaults but only 17% of all non-defaults. Strong risk signal. Full-time employees have WoE = −1.39 — they represent only 17% of defaults but 67% of non-defaults. Strong safety signal. Contract workers sit in between at 0.69. The WoE column gives the logistic regression model a single interpretable number per category that is already on the log-odds scale — mathematically the most natural input for a logistic regression.
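The epsilon in the code above matters more than it looks. If a category contains zero events, the raw ratio is 0 and ln(0) is undefined; the epsilon keeps the WoE large but finite. A minimal sketch with a hypothetical all-good category (2 of 6 non-defaults, 0 of 6 defaults):

```python
import numpy as np

# Hypothetical category with zero events in it
p_events, p_non_events = 0 / 6, 2 / 6

# Without smoothing the log is undefined (log(0) → -inf)
raw = np.log(p_events / p_non_events) if p_events > 0 else -np.inf

# With the same 1e-6 epsilon used above: a large but finite negative WoE
eps = 1e-6
smoothed = np.log((p_events + eps) / (p_non_events + eps))

print(raw)                  # -inf
print(round(smoothed, 2))   # ≈ -12.72
```

Extreme finite values like this are still a warning sign: a category that never contains an event is usually too small to trust, which is one motivation for the rare-label grouping covered in the next lesson.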
Applying WoE Encoding to the Dataset
The scenario: With the WoE table computed, you now need to map the WoE values back onto every row of the training dataset — and then apply the same mapping to the test set using only the training-derived values. This is the same pattern used in target and frequency encoding: fit on train, apply to both.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Re-create the training data and woe_table from the previous block
loan_train = pd.DataFrame({
'loan_id': ['L01','L02','L03','L04','L05','L06',
'L07','L08','L09','L10','L11','L12'],
'employment_type': ['Full-time','Self-employed','Full-time','Contract',
'Self-employed','Full-time','Contract','Full-time',
'Self-employed','Contract','Full-time','Self-employed'],
'defaulted': [1,1,0,1,1,0,0,0,1,1,0,0]
})
# WoE mapping derived from training data (from previous block)
woe_map = {
'Full-time': -1.3863,
'Contract': 0.6931,
'Self-employed': 1.0986
}
# Neutral fallback for unseen categories at test time
# WoE = 0.0 means the unknown category gets no discriminatory signal
fallback_woe = 0.0
# Apply WoE mapping to training data
loan_train['emp_woe'] = loan_train['employment_type'].map(woe_map)
# Test data — includes 'Unemployed', never seen in training
loan_test = pd.DataFrame({
'loan_id': ['T01','T02','T03','T04'],
'employment_type': ['Full-time','Unemployed','Self-employed','Contract']
})
# Apply WoE to test — fillna(0) treats unknown category as population-neutral
loan_test['emp_woe'] = loan_test['employment_type'].map(woe_map).fillna(fallback_woe)
# Print training result
print("Training data with WoE applied:")
print(loan_train[['loan_id','employment_type','emp_woe','defaulted']].to_string(index=False))
print()
print("Test data with WoE applied (Unemployed → fallback 0.0):")
print(loan_test.to_string(index=False))
Training data with WoE applied:
loan_id employment_type emp_woe defaulted
L01 Full-time -1.3863 1
L02 Self-employed 1.0986 1
L03 Full-time -1.3863 0
L04 Contract 0.6931 1
L05 Self-employed 1.0986 1
L06 Full-time -1.3863 0
L07 Contract 0.6931 0
L08 Full-time -1.3863 0
L09 Self-employed 1.0986 1
L10 Contract 0.6931 1
L11 Full-time -1.3863 0
L12 Self-employed 1.0986 0
Test data with WoE applied (Unemployed → fallback 0.0):
loan_id employment_type emp_woe
T01 Full-time -1.3863
T02 Unemployed 0.0000
T03 Self-employed 1.0986
T04 Contract 0.6931
What just happened?
The WoE map was applied to both sets using .map(). The unseen "Unemployed" category in the test set received the fallback value of 0.0 — meaning the model treats it as population-neutral: no evidence for or against default. The training labels align visibly with the WoE values — Self-employed rows (1.0986) are overwhelmingly defaults, while Full-time rows (−1.3863) are overwhelmingly not.
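The fit-on-train / apply-to-both pattern can be wrapped in a small helper. A sketch, not a library API — the function name `fit_woe_map` and the toy data are my own:

```python
import numpy as np
import pandas as pd

def fit_woe_map(df, cat_col, target_col, eps=1e-6):
    """Return {category: WoE} computed from training data only."""
    total_events = df[target_col].sum()
    total_non_events = len(df) - total_events
    grouped = df.groupby(cat_col)[target_col].agg(['sum', 'count'])
    p_events = grouped['sum'] / total_events
    p_non_events = (grouped['count'] - grouped['sum']) / total_non_events
    woe = np.log((p_events + eps) / (p_non_events + eps))
    return woe.round(4).to_dict()

# Tiny hypothetical example
train = pd.DataFrame({'emp': ['A', 'A', 'B', 'B'], 'y': [1, 0, 0, 0]})
woe_map = fit_woe_map(train, 'emp', 'y')

# Apply to test data, falling back to the neutral 0.0 for unseen categories
test = pd.DataFrame({'emp': ['A', 'B', 'C']})  # 'C' never seen in training
test['emp_woe'] = test['emp'].map(woe_map).fillna(0.0)
print(test)
```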
Information Value — Measuring Column Discriminatory Power
The scenario: Your team has computed WoE for three candidate features: employment_type, loan_purpose, and region. Before deciding which ones to include in the final model, you want to rank them by Information Value (IV) — the standard WoE companion metric that quantifies how much discriminatory power a feature has overall.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Pre-computed WoE tables for three features (condensed for illustration)
# Each entry: (p_events, p_non_events, woe)
woe_data = {
'employment_type': [
(0.1667, 0.6667, -1.3863), # Full-time
(0.3333, 0.1667, 0.6931), # Contract
(0.5000, 0.1667, 1.0986), # Self-employed
],
'loan_purpose': [
(0.4000, 0.3000, 0.2877), # Home improvement
(0.3500, 0.4000, -0.1335), # Car purchase
(0.2500, 0.3000, -0.1823), # Debt consolidation
],
'region': [
(0.5000, 0.5000, 0.0000), # North
(0.3000, 0.2500, 0.1823), # South
(0.2000, 0.2500, -0.2231), # West
]
}
# IV = sum over categories of (p_events - p_non_events) * WoE
# Higher IV = stronger overall discriminatory power for the feature
iv_results = []
for feature, categories in woe_data.items():
    iv = sum((pe - pne) * woe for pe, pne, woe in categories)
    iv_results.append({'feature': feature, 'iv': round(iv, 4)})
iv_df = pd.DataFrame(iv_results).sort_values('iv', ascending=False).reset_index(drop=True)
# Standard IV interpretation thresholds
def iv_label(iv):
    if iv < 0.02:
        return 'Useless'
    elif iv < 0.1:
        return 'Weak'
    elif iv < 0.3:
        return 'Medium'
    elif iv < 0.5:
        return 'Strong'
    else:
        return 'Very strong'
iv_df['strength'] = iv_df['iv'].apply(iv_label)
print("Information Value ranking:")
print(iv_df.to_string(index=False))
Information Value ranking:
feature iv strength
employment_type 1.1748 Very strong
loan_purpose 0.0446 Weak
region 0.0203 Weak
What just happened?
IV sums all the category-level WoE differences into a single feature-level score. employment_type scores 1.17 — "Very strong" — confirming it is a powerful discriminator, though on a real dataset an IV above 0.5 would also warrant a leakage check. loan_purpose and region both come in as "Weak" — their categories barely separate defaults from non-defaults. In a real credit scorecard, features with IV below 0.02 are typically dropped before modelling.
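IV can also be computed straight from a raw column, combining the WoE table from earlier with the IV sum in one pass. A sketch on hypothetical data (column names and values are my own, not the lesson's loan dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature and binary target
df = pd.DataFrame({
    'channel': ['web','web','web','store','store','store','phone','phone'],
    'y':       [1,    1,    0,    1,      0,      0,      1,      0]
})

eps = 1e-6
total_events = df['y'].sum()               # 4
total_non_events = len(df) - total_events  # 4

# Category-level shares of events and non-events, then WoE
g = df.groupby('channel')['y'].agg(['sum', 'count'])
p_e = g['sum'] / total_events
p_ne = (g['count'] - g['sum']) / total_non_events
woe = np.log((p_e + eps) / (p_ne + eps))

# IV = sum over categories of (p_events - p_non_events) * WoE
iv = ((p_e - p_ne) * woe).sum()
print(round(iv, 4))   # 0.3466 — "Strong" on the standard scale
```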
IV Interpretation Guide
| IV Range | Strength | Recommendation |
|---|---|---|
| < 0.02 | Useless | Drop — provides no discriminatory signal |
| 0.02 – 0.1 | Weak | Include with caution — marginal value |
| 0.1 – 0.3 | Medium | Include — meaningful signal for the model |
| 0.3 – 0.5 | Strong | Include — strong discriminator |
| > 0.5 | Very strong | Investigate — may indicate target leakage |
Teacher's Note
WoE encoding has a special mathematical property that makes it a natural fit for logistic regression: the encoded values are already on the log-odds scale. When you feed WoE-encoded features into a logistic regression, the model's coefficient for that feature directly represents how much additional log-odds each unit of WoE adds to the prediction. This makes the model output genuinely interpretable to compliance teams and regulators — which is exactly why WoE remains the encoding of choice in Basel-regulated credit scoring models. Watch out for IV values above 0.5 on features you haven't carefully audited: extremely high IV sometimes signals that a feature is inadvertently leaking the target, not that it's genuinely that powerful.
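The log-odds claim can be sanity-checked empirically, assuming scikit-learn is available. The data below is synthetic and constructed so that each group's default rate exactly matches the sigmoid of its WoE, so the true slope is 1 and the true intercept is 0; C is set high to effectively switch off regularization for the check:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical WoE-encoded feature: 5 rows at -1.3863 (1/5 default),
# 3 rows at 0.6931 (2/3 default), 4 rows at 1.0986 (3/4 default)
X = np.array([[-1.3863]] * 5 + [[0.6931]] * 3 + [[1.0986]] * 4)
y = np.array([0, 0, 0, 0, 1,  1, 1, 0,  1, 1, 1, 0])

model = LogisticRegression(C=1e6).fit(X, y)

# Coefficient lands near 1.0 and intercept near 0.0:
# one unit of WoE contributes about one unit of log-odds
print(round(model.coef_[0][0], 2), round(model.intercept_[0], 2))
```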
Practice Questions
1. WoE is calculated as the natural ________ of the ratio of event proportion to non-event proportion.
2. What WoE value indicates that a category has the same event rate as the overall population?
3. What companion metric summarises the overall discriminatory power of a feature using its WoE values?
Quiz
1. Why is WoE encoding particularly well-suited for logistic regression models?
2. A feature has an Information Value of 0.85. What should you do?
3. Why do we add a small epsilon (1e-6) when computing WoE?
Up Next · Lesson 21
Rare Label Encoding
Handle the long tail of infrequent categories — group rare labels together before they distort your encoding or overfit your model.