Feature Engineering Lesson 20 – Weight of Evidence | Dataplexa
Intermediate Level · Lesson 20

Weight of Evidence

Credit scoring teams have used Weight of Evidence for decades — long before gradient boosting existed. It turns a categorical column into a single number that measures how strongly that category separates good outcomes from bad ones. It's interpretable, leak-resistant when done correctly, and still standard in regulated financial modelling today.

Weight of Evidence (WoE) encodes each category as the log-ratio of the share of all events (target = 1) that fall in the category to the share of all non-events (target = 0) that fall in it. A positive WoE means the category holds a larger share of the events than of the non-events — it leans toward the positive class. A negative WoE means it skews toward non-events. Zero means the category behaves exactly like the overall population.

The WoE Formula

For each category c, Weight of Evidence is calculated as:

WoE(c) = ln ( P(Events in c) / P(Non-events in c) )

where:
  P(Events in c) = (count of 1s in c) / (total 1s in dataset)
  P(Non-events in c) = (count of 0s in c) / (total 0s in dataset)
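As a quick sanity check of the formula, here is a minimal sketch with hypothetical counts: a category holding 2 of a dataset's 6 events and 1 of its 6 non-events.

```python
import math

# Hypothetical category: 2 of the dataset's 6 events, 1 of its 6 non-events
p_events = 2 / 6        # share of all events falling in this category
p_non_events = 1 / 6    # share of all non-events falling in this category

# WoE(c) = ln( P(Events in c) / P(Non-events in c) )
woe = math.log(p_events / p_non_events)
print(round(woe, 4))    # ln(2) ≈ 0.6931 — the category skews toward events
```

A positive result here matches the rule above: the category captures a bigger slice of the events than of the non-events.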

Positive WoE — category skews toward events (target = 1)

This category accounts for a larger share of all defaults than of all non-defaults. In credit scoring, a high-WoE category is a risk flag. In conversion modelling, it indicates a high-converting segment.


Zero WoE — category matches the population rate

The event rate in this category is identical to the overall dataset rate. This category provides no discriminatory information — it carries no signal above the baseline.

Negative WoE — category skews toward non-events (target = 0)

This category accounts for a larger share of all non-defaults than of all defaults. In credit scoring, this is a good-risk indicator. The more negative the WoE, the cleaner the category's record.

Computing WoE from Scratch

The scenario: You're a credit risk analyst at a consumer lending firm. Your model predicts whether a loan will default within 12 months. One feature is employment_type — the borrower's employment status. You want to encode it using Weight of Evidence so the resulting column is directly interpretable by the compliance team and compatible with the logistic regression model used internally.

# Import pandas and numpy
import pandas as pd
import numpy as np

# Loan applicant training data — employment_type is the categorical feature
loan_df = pd.DataFrame({
    'loan_id':         ['L01','L02','L03','L04','L05','L06',
                        'L07','L08','L09','L10','L11','L12'],
    'employment_type': ['Full-time','Self-employed','Full-time','Contract',
                        'Self-employed','Full-time','Contract','Full-time',
                        'Self-employed','Contract','Full-time','Self-employed'],
    'defaulted':       [0,1,0,1,1,0,0,1,1,1,0,0]
})

# Total events (defaults) and non-events across the entire training set
total_events     = loan_df['defaulted'].sum()
total_non_events = len(loan_df) - total_events
print(f"Total events (defaults):     {total_events}")
print(f"Total non-events (no default): {total_non_events}\n")

# Group by employment_type and count events and non-events per category
woe_table = loan_df.groupby('employment_type')['defaulted'].agg(
    events='sum',
    total='count'
).reset_index()

# Non-events per category = total rows in category minus events
woe_table['non_events'] = woe_table['total'] - woe_table['events']

# Proportion of all events that fall in this category
woe_table['p_events'] = woe_table['events'] / total_events

# Proportion of all non-events that fall in this category
woe_table['p_non_events'] = woe_table['non_events'] / total_non_events

# WoE = ln(p_events / p_non_events) — add small epsilon to avoid log(0)
woe_table['woe'] = np.log(
    (woe_table['p_events'] + 1e-6) / (woe_table['p_non_events'] + 1e-6)
).round(4)

# Print the full WoE table
print(woe_table[['employment_type','events','non_events',
                 'p_events','p_non_events','woe']].to_string(index=False))
Total events (defaults):     6
Total non-events (no default): 6

 employment_type  events  non_events  p_events  p_non_events     woe
       Contract       2           1  0.333333      0.166667  0.6931
      Full-time       1           4  0.166667      0.666667 -1.3863
  Self-employed       3           1  0.500000      0.166667  1.0986

What just happened?

Self-employed borrowers have WoE = 1.10 — they represent 50% of all defaults but only 17% of all non-defaults. Strong risk signal. Full-time employees have WoE = −1.39 — they represent only 17% of defaults but 67% of non-defaults. Strong safety signal. Contract workers sit in between at 0.69. The WoE column gives the logistic regression model a single interpretable number per category that is already on the log-odds scale — mathematically the most natural input for a logistic regression.
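One way to see why WoE is "already on the log-odds scale": algebraically, WoE(c) = ln(events_c / non_events_c) − ln(total_events / total_non_events), i.e. the category's log-odds minus the population log-odds. A minimal sketch checking that identity against the counts above (variable names here are illustrative):

```python
import math

# Counts from the employment_type example: (events, non_events) per category
counts = {'Full-time': (1, 4), 'Contract': (2, 1), 'Self-employed': (3, 1)}
total_events, total_non_events = 6, 6

# Population log-odds: ln(6/6) = 0, so here WoE equals the category log-odds
population_log_odds = math.log(total_events / total_non_events)

woe_values = {}
for cat, (events, non_events) in counts.items():
    # Definition: log-ratio of event share to non-event share
    woe = math.log((events / total_events) / (non_events / total_non_events))
    # Equivalent form: category log-odds minus population log-odds
    alt = math.log(events / non_events) - population_log_odds
    assert abs(woe - alt) < 1e-12
    woe_values[cat] = round(woe, 4)

print(woe_values)
```

Both forms agree for every category, which is why feeding WoE into a logistic regression is mathematically natural: the feature already lives in the model's native units.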

Applying WoE Encoding to the Dataset

The scenario: With the WoE table computed, you now need to map the WoE values back onto every row of the training dataset — and then apply the same mapping to the test set using only the training-derived values. This is the same pattern used in target and frequency encoding: fit on train, apply to both.

# Import pandas and numpy
import pandas as pd
import numpy as np

# Re-create the training data and woe_table from the previous block
loan_train = pd.DataFrame({
    'loan_id':         ['L01','L02','L03','L04','L05','L06',
                        'L07','L08','L09','L10','L11','L12'],
    'employment_type': ['Full-time','Self-employed','Full-time','Contract',
                        'Self-employed','Full-time','Contract','Full-time',
                        'Self-employed','Contract','Full-time','Self-employed'],
    'defaulted':       [0,1,0,1,1,0,0,1,1,1,0,0]
})

# WoE mapping derived from training data (from previous block)
woe_map = {
    'Full-time':     -1.3863,
    'Contract':       0.6931,
    'Self-employed':  1.0986
}

# Fallback WoE for unseen categories at test time
# Using 0.0 means the unknown category gets no discriminatory signal
fallback_woe = 0.0

# Apply WoE mapping to training data
loan_train['emp_woe'] = loan_train['employment_type'].map(woe_map)

# Test data — includes 'Unemployed', never seen in training
loan_test = pd.DataFrame({
    'loan_id':         ['T01','T02','T03','T04'],
    'employment_type': ['Full-time','Unemployed','Self-employed','Contract']
})

# Apply WoE to test — fillna(0) treats unknown category as population-neutral
loan_test['emp_woe'] = loan_test['employment_type'].map(woe_map).fillna(fallback_woe)

# Print training result
print("Training data with WoE applied:")
print(loan_train[['loan_id','employment_type','emp_woe','defaulted']].to_string(index=False))
print()
print("Test data with WoE applied (Unemployed → fallback 0.0):")
print(loan_test.to_string(index=False))
Training data with WoE applied:
 loan_id employment_type   emp_woe  defaulted
     L01       Full-time   -1.3863          0
     L02   Self-employed    1.0986          1
     L03       Full-time   -1.3863          0
     L04        Contract    0.6931          1
     L05   Self-employed    1.0986          1
     L06       Full-time   -1.3863          0
     L07        Contract    0.6931          0
     L08       Full-time   -1.3863          1
     L09   Self-employed    1.0986          1
     L10        Contract    0.6931          1
     L11       Full-time   -1.3863          0
     L12   Self-employed    1.0986          0

Test data with WoE applied (Unemployed → fallback 0.0):
 loan_id employment_type  emp_woe
     T01       Full-time  -1.3863
     T02      Unemployed   0.0000
     T03   Self-employed   1.0986
     T04        Contract   0.6931

What just happened?

The WoE map was applied to both sets using .map(). The unseen "Unemployed" category in the test set received the fallback value of 0.0 — meaning the model treats it as population-neutral: no evidence for or against default. The training labels line up with the WoE values almost perfectly: the Self-employed rows (WoE 1.0986) are mostly defaults, while the Full-time rows (WoE −1.3863) are mostly non-defaults.
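The fit-on-train, apply-to-both pattern can be wrapped in small helpers so that leaking test information becomes harder to do by accident. This is a minimal sketch, not a library API; `fit_woe` and `apply_woe` are hypothetical names, and the epsilon smoothing mirrors the earlier block:

```python
import numpy as np
import pandas as pd

def fit_woe(series, target, eps=1e-6):
    """Learn a {category: WoE} mapping from training data only."""
    df = pd.DataFrame({'cat': series, 'y': target})
    total_events = df['y'].sum()
    total_non_events = len(df) - total_events
    grouped = df.groupby('cat')['y'].agg(['sum', 'count'])
    p_events = grouped['sum'] / total_events
    p_non_events = (grouped['count'] - grouped['sum']) / total_non_events
    return np.log((p_events + eps) / (p_non_events + eps)).round(4).to_dict()

def apply_woe(series, woe_map, fallback=0.0):
    """Map categories to training WoE; unseen categories get the fallback."""
    return series.map(woe_map).fillna(fallback)

# Fit on train, apply anywhere — same pattern as target/frequency encoding
train_cats = pd.Series(['A', 'A', 'B', 'B', 'B'])
train_y    = pd.Series([1, 0, 0, 0, 1])
woe_map = fit_woe(train_cats, train_y)
print(apply_woe(pd.Series(['A', 'B', 'C']), woe_map).tolist())
```

The unseen category 'C' falls back to 0.0, the population-neutral value, exactly as 'Unemployed' did above.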

Information Value — Measuring Column Discriminatory Power

The scenario: Your team has computed WoE for three candidate features: employment_type, loan_purpose, and region. Before deciding which ones to include in the final model, you want to rank them by Information Value (IV) — the standard WoE companion metric that quantifies how much discriminatory power a feature has overall.

# Import pandas and numpy
import pandas as pd
import numpy as np

# Pre-computed WoE tables for three features (condensed for illustration)
# Each entry: (p_events, p_non_events, woe)
woe_data = {
    'employment_type': [
        (0.1667, 0.6667, -1.3863),   # Full-time
        (0.3333, 0.1667,  0.6931),   # Contract
        (0.5000, 0.1667,  1.0986),   # Self-employed
    ],
    'loan_purpose': [
        (0.4000, 0.3000,  0.2877),   # Home improvement
        (0.3500, 0.4000, -0.1335),   # Car purchase
        (0.2500, 0.3000, -0.1823),   # Debt consolidation
    ],
    'region': [
        (0.5000, 0.5000,  0.0000),   # North
        (0.3000, 0.2500,  0.1823),   # South
        (0.2000, 0.2500, -0.2231),   # West
    ]
}

# IV = sum over categories of (p_events - p_non_events) * WoE
# Higher IV = stronger overall discriminatory power for the feature
iv_results = []
for feature, categories in woe_data.items():
    iv = sum((pe - pne) * woe for pe, pne, woe in categories)
    iv_results.append({'feature': feature, 'iv': round(iv, 4)})

iv_df = pd.DataFrame(iv_results).sort_values('iv', ascending=False).reset_index(drop=True)

# Standard IV interpretation thresholds
def iv_label(iv):
    if iv < 0.02:   return 'Useless'
    elif iv < 0.1:  return 'Weak'
    elif iv < 0.3:  return 'Medium'
    elif iv < 0.5:  return 'Strong'
    else:           return 'Very strong'

iv_df['strength'] = iv_df['iv'].apply(iv_label)

print("Information Value ranking:")
print(iv_df.to_string(index=False))
Information Value ranking:
         feature      iv     strength
 employment_type  1.1748  Very strong
    loan_purpose  0.0446         Weak
          region  0.0203         Weak

What just happened?

IV sums up all the category-level WoE differences into a single feature-level score. employment_type scores 1.17 — "Very strong" — confirming it is a powerful discriminator. loan_purpose and region both come in as "Weak" — their categories barely separate defaults from non-defaults. In a real credit scorecard, features with IV below 0.02 are typically dropped before modelling.

IV Interpretation Guide

IV Range      Strength      Recommendation
< 0.02        Useless       Drop — provides no discriminatory signal
0.02 – 0.1    Weak          Include with caution — marginal value
0.1 – 0.3     Medium        Include — meaningful signal for the model
0.3 – 0.5     Strong        Include — strong discriminator
> 0.5         Very strong   Investigate — may indicate target leakage
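Putting WoE and IV together, the whole pipeline from a raw categorical column to a feature-level score fits in one small function. This is a sketch under the same epsilon-smoothing assumption as before; `information_value` is a hypothetical helper, and the example data reuses the employment_type counts from this lesson. Because it works from raw counts rather than the rounded table values, the last decimal place can differ slightly from a hand calculation.

```python
import numpy as np
import pandas as pd

def information_value(series, target, eps=1e-6):
    """Compute IV for one categorical feature against a binary target."""
    df = pd.DataFrame({'cat': series, 'y': target})
    total_events = df['y'].sum()
    total_non_events = len(df) - total_events
    grouped = df.groupby('cat')['y'].agg(['sum', 'count'])
    p_events = grouped['sum'] / total_events
    p_non_events = (grouped['count'] - grouped['sum']) / total_non_events
    woe = np.log((p_events + eps) / (p_non_events + eps))
    # IV = sum over categories of (p_events - p_non_events) * WoE
    return float(((p_events - p_non_events) * woe).sum())

# employment_type counts from this lesson: Full-time (1 of 5 default),
# Contract (2 of 3), Self-employed (3 of 4)
emp = pd.Series(['Full-time'] * 5 + ['Contract'] * 3 + ['Self-employed'] * 4)
y   = pd.Series([0, 0, 0, 0, 1,  1, 0, 1,  1, 1, 1, 0])
print(round(information_value(emp, y), 4))
```

The result lands in the "Very strong" band of the table above — which, per the guide, is also the cue to check the feature for leakage before trusting it.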

Teacher's Note

WoE encoding has a special mathematical property that makes it a natural fit for logistic regression: the encoded values are already on the log-odds scale. When you feed WoE-encoded features into a logistic regression, the model's coefficient for that feature directly represents how much additional log-odds each unit of WoE adds to the prediction. This makes the model output genuinely interpretable to compliance teams and regulators — which is exactly why WoE remains the encoding of choice in Basel-regulated credit scoring models. Watch out for IV values above 0.5 on features you haven't carefully audited: extremely high IV sometimes signals that a feature is inadvertently leaking the target, not that it's genuinely that powerful.

Practice Questions

1. WoE is calculated as the natural ________ of the ratio of event proportion to non-event proportion.



2. What WoE value indicates that a category has the same event rate as the overall population?



3. What companion metric summarises the overall discriminatory power of a feature using its WoE values?



Quiz

1. Why is WoE encoding particularly well-suited for logistic regression models?


2. A feature has an Information Value of 0.85. What should you do?


3. Why do we add a small epsilon (1e-6) when computing WoE?


Up Next · Lesson 21

Rare Label Encoding

Handle the long tail of infrequent categories — group rare labels together before they distort your encoding or overfit your model.