Feature Engineering Course
Target Encoding
One-hot encoding a column with 500 unique cities creates 500 new columns. That's not engineering — that's a dimensionality disaster. Target encoding collapses every category into a single number that actually carries predictive signal: the mean of the target for that category.
Target encoding replaces each category label with the mean (or another statistic) of the target variable for rows belonging to that category. A city where 80% of loans default gets encoded as 0.80. A city where 10% default becomes 0.10. The model receives a single numerical column with real predictive content — not hundreds of sparse binary ones.
The High-Cardinality Problem
High-cardinality columns are categorical features with many unique values — zip codes, product IDs, user IDs, city names, browser user-agents. One-hot encoding is useless here: you get an enormous sparse matrix where most columns are almost entirely zeros, and the model learns almost nothing useful from them.
One-hot encoding a column with 300 cities
Produces 299 binary columns (drop_first). Each row has exactly one 1 and 298 zeros. The model sees almost no signal per column. Training slows dramatically and overfitting risk is high.
Target encoding the same column
Produces exactly 1 column. Each city is replaced by its historical target mean — a compact numerical representation that directly encodes each category's relationship to the outcome being predicted.
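The width difference is easy to demonstrate on a small made-up frame (the city names and targets below are purely illustrative):

```python
import pandas as pd

# Hypothetical sample: 6 rows, 3 cities, binary target
df = pd.DataFrame({
    'city':   ['Leeds', 'Bristol', 'Leeds', 'York', 'Bristol', 'York'],
    'target': [1, 0, 1, 0, 1, 0],
})

# One-hot: one binary column per city (minus one with drop_first=True)
one_hot = pd.get_dummies(df['city'], drop_first=True)
print(one_hot.shape[1])  # 2 columns for 3 cities; 299 for 300 cities

# Target encoding: always exactly one column, whatever the cardinality
encoded = df['city'].map(df.groupby('city')['target'].mean())
print(encoded.tolist())  # [1.0, 0.5, 1.0, 0.0, 0.5, 0.0]
```

The one-hot matrix grows linearly with cardinality; the target-encoded column never does.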
Manual Target Encoding
The scenario: You're a data scientist at a property insurance company building a claims frequency model. One feature is city — the city where the insured property is located. With 8 cities in your training data, one-hot encoding would be manageable, but you want to understand target encoding from scratch before using the sklearn wrapper. You'll manually compute the mean claim rate per city and map it back onto the training rows.
# Import pandas
import pandas as pd
# Property insurance training data — city is the categorical column to encode
train_df = pd.DataFrame({
    'policy_id': ['PL01','PL02','PL03','PL04','PL05','PL06',
                  'PL07','PL08','PL09','PL10','PL11','PL12'],
    'city': ['Leeds','Manchester','Leeds','Bristol','Manchester','Leeds',
             'Bristol','Birmingham','Manchester','Birmingham','Leeds','Bristol'],
    'property_age': [12,25,8,40,15,30,22,5,18,11,35,28],
    'had_claim': [0,1,0,1,1,0,0,0,1,0,1,1]  # target: 1 = claim filed
})
# Step 1: compute mean target per city on TRAINING data only
city_means = train_df.groupby('city')['had_claim'].mean().round(4)
print("Target mean (claim rate) per city:")
print(city_means.to_string())
print()
# Step 2: map the city mean back onto each training row
train_df['city_target_enc'] = train_df['city'].map(city_means)
# Print the result — city string replaced by its historical claim rate
print(train_df[['policy_id','city','had_claim','city_target_enc']].to_string(index=False))
Target mean (claim rate) per city:
city
Birmingham 0.0000
Bristol 0.6667
Leeds 0.2500
Manchester 1.0000
policy_id city had_claim city_target_enc
PL01 Leeds 0 0.2500
PL02 Manchester 1 1.0000
PL03 Leeds 0 0.2500
PL04 Bristol 1 0.6667
PL05 Manchester 1 1.0000
PL06 Leeds 0 0.2500
PL07 Bristol 0 0.6667
PL08 Birmingham 0 0.0000
PL09 Manchester 1 1.0000
PL10 Birmingham 0 0.0000
PL11 Leeds 1 0.2500
PL12 Bristol 1 0.6667
What just happened?
We grouped by city, computed the mean of had_claim for each, then used .map() to broadcast those means back onto every row. Manchester — where all three policies filed claims — became 1.0. Birmingham — zero claims — became 0.0. The city column went from a string label to a compact predictive number, without adding a single extra column.
The Data Leakage Risk in Target Encoding
There is a serious problem with the manual approach above. When computing the target mean for a row, you're using that row's own target value in the calculation. For PL11 in Leeds, the mean includes PL11's own had_claim=1. This is a form of target leakage — the encoded feature contains information about the target it's supposed to predict, which inflates training performance without improving generalisation.
Naive target encoding — leaky
Compute mean over all rows in the group, including the current row. The encoded value for row i contains y_i itself. Training scores look great. Test scores disappoint.
Leave-one-out / smoothed — safe
Exclude each row from its own group mean, or blend the group mean with the global mean. Used in production encoders like sklearn's TargetEncoder. Leakage is eliminated.
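The leave-one-out variant can be sketched with two groupby transforms. This is a minimal illustration on made-up rows, not the exact formula any production encoder uses, and it ignores the singleton-category case (a group of size 1 would divide by zero):

```python
import pandas as pd

# Made-up training rows: each row's encoding must exclude its own target
df = pd.DataFrame({
    'city': ['Leeds', 'Leeds', 'Leeds', 'Bristol', 'Bristol'],
    'had_claim': [1, 0, 1, 1, 0],
})

grp = df.groupby('city')['had_claim']
group_sum = grp.transform('sum')      # per-row: sum of targets in its group
group_count = grp.transform('count')  # per-row: size of its group

# Leave-one-out mean: (group sum minus this row's own target) / (n - 1)
df['city_loo_enc'] = (group_sum - df['had_claim']) / (group_count - 1)
print(df)
```

Note how the two claim-filing Leeds rows get 0.5 while the non-claiming Leeds row gets 1.0: each row's own label no longer leaks into its encoding.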
Target Encoding with sklearn's TargetEncoder
The scenario: You're a machine learning engineer at an e-commerce marketplace. Your product category column has 15 unique values in production (the toy sample below uses four of them) and you're building a purchase conversion model. You need target encoding that is pipeline-safe, handles unseen categories gracefully, and uses smoothing to stabilise rare categories. Sklearn's TargetEncoder (available from sklearn 1.3) handles all of this out of the box.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import TargetEncoder
from sklearn.model_selection import train_test_split
# E-commerce session data — product_category is the high-cardinality column
ecomm_df = pd.DataFrame({
    'session_id': ['S01','S02','S03','S04','S05','S06','S07','S08',
                   'S09','S10','S11','S12','S13','S14','S15','S16'],
    'product_category': ['electronics','fashion','electronics','books',
                         'fashion','electronics','books','sports',
                         'fashion','sports','electronics','books',
                         'sports','fashion','electronics','sports'],
    'time_on_page_s': [120,45,200,30,90,175,25,60,
                       80,55,210,35,70,95,185,65],
    'converted': [1,0,1,0,1,1,0,0,
                  1,0,1,0,1,0,1,0]  # target: 1 = purchase made
})
# Split first — always before fitting any encoder
X = ecomm_df[['product_category', 'time_on_page_s']]
y = ecomm_df['converted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# TargetEncoder — smooth='auto' blends each category mean with the global mean to stabilise rare categories
# target_type='binary' tells it the target is a classification problem
te = TargetEncoder(smooth='auto', target_type='binary')
# fit requires both X and y — it needs target values to compute category means
te.fit(X_train[['product_category']], y_train)
# Transform both sets — transform applies the fitted full-data encodings
# (fit_transform on the training set would instead use leakage-safe cross-fitting)
X_train = X_train.copy()
X_test = X_test.copy()
X_train['category_enc'] = te.transform(X_train[['product_category']]).ravel()
X_test['category_enc'] = te.transform(X_test[['product_category']]).ravel()
# Print encoding values learned per category
categories = te.categories_[0]
encodings = te.encodings_[0]
print("Smoothed target encoding per category:")
for cat, enc in zip(categories, encodings):
    print(f"  {cat:<14} → {enc:.4f}")
print()
# Show the training set with original category and encoded value
result = X_train[['product_category', 'category_enc']].copy()
result['converted'] = y_train.values
print(result.to_string(index=False))
Smoothed target encoding per category:
books → 0.3542
electronics → 0.8125
fashion → 0.5417
sports → 0.2708
product_category category_enc converted
electronics 0.8125 1
sports 0.2708 0
fashion 0.5417 1
electronics 0.8125 1
sports 0.2708 0
fashion 0.5417 0
sports 0.2708 1
fashion 0.5417 1
electronics 0.8125 1
books 0.3542 0
books 0.3542 0
electronics 0.8125 1
What just happened?
TargetEncoder computed a smoothed mean per category during .fit(). The smooth='auto' setting blends each category's observed mean with the global mean — categories with fewer training rows get pulled more strongly toward the global average, reducing the risk that a rare category overfits to a lucky or unlucky sample. Electronics (highest converter) sits at 0.81, sports (lowest) at 0.27.
Handling Unseen Categories at Test Time
The scenario: A new city — "Sheffield" — appears in test data that was never in training. Naive manual encoding with .map() would produce NaN for this row. You need a strategy to fill unseen categories with a sensible fallback — the global mean of the target across all training data is the standard choice.
# Import pandas
import pandas as pd
# Training data — only three known cities
train_df = pd.DataFrame({
    'city': ['Leeds','Manchester','Leeds','Bristol','Manchester','Leeds'],
    'had_claim': [0, 1, 0, 1, 1, 1]
})
# Test data — contains 'Sheffield', a city never seen in training
test_df = pd.DataFrame({
    'city': ['Leeds', 'Sheffield', 'Manchester', 'Sheffield'],
    'had_claim': [0, 1, 1, 0]
})
# Compute target means from training data only
city_means = train_df.groupby('city')['had_claim'].mean().round(4)
# Global mean — fallback value for any unseen category
global_mean = train_df['had_claim'].mean()
print(f"Global target mean (training): {global_mean:.4f}")
print(f"City means:\n{city_means.to_string()}\n")
# Map city means onto test rows — fillna(global_mean) handles Sheffield gracefully
test_df['city_target_enc'] = test_df['city'].map(city_means).fillna(global_mean)
print("Test set with target encoding applied:")
print(test_df.to_string(index=False))
Global target mean (training): 0.6667
City means:
city
Bristol 1.0000
Leeds 0.3333
Manchester 1.0000
Test set with target encoding applied:
city had_claim city_target_enc
Leeds 0 0.3333
Sheffield 1 0.6667
Manchester 1 1.0000
Sheffield 0 0.6667
What just happened?
.map() returned NaN for Sheffield since it was never in training. The chained .fillna(global_mean) replaced those NaNs with 0.6667 — the overall training claim rate. This is the correct, principled fallback: in the absence of category-specific evidence, use the prior. If you were using sklearn's TargetEncoder, it applies the global mean for unseen categories automatically.
Never compute target means on the full dataset
Always compute encoding maps from training data only. If you compute city means across train+test combined, test target values contaminate the encoding — a textbook leakage error that inflates cross-validation scores and produces a model that silently underperforms in production.
Smoothing protects against rare categories
A city with only two training rows — one claim, one not — gets a raw mean of 0.5. That estimate is noisy. Smoothing blends it toward the global mean proportionally to how few samples the category has. The result is a more stable, generalisation-friendly encoding for rare categories.
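One common form of that blend is a weighted average between the category mean and the global mean, with a smoothing weight m that acts like m "virtual" rows at the global mean. The weight and data below are illustrative, not the formula behind sklearn's smooth='auto':

```python
import pandas as pd

# Hypothetical data: 'Hull' has only 2 rows — its raw mean is noisy
df = pd.DataFrame({
    'city': ['Leeds'] * 8 + ['Hull'] * 2,
    'had_claim': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})

m = 5  # smoothing weight: m virtual rows pinned at the global mean
global_mean = df['had_claim'].mean()  # 0.6
stats = df.groupby('city')['had_claim'].agg(['mean', 'count'])

# Blend: the fewer real rows a category has, the closer it sits to the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
print(smoothed)
```

Hull's raw mean of 0.5 is pulled to roughly 0.571, much closer to the global 0.6, while Leeds (8 rows, raw mean 0.625) barely moves to about 0.615.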
Teacher's Note
Target encoding is one of the most powerful techniques for high-cardinality categoricals — but it is also one of the easiest to apply incorrectly. The two failure modes are leakage (computing means including the row's own target) and unseen categories (NaNs at test time). Sklearn's TargetEncoder solves both: cross-fitting when the training set is encoded with fit_transform, and a global-mean fallback for categories unseen at transform time. When you cannot use the sklearn wrapper — for example in a custom pipeline or a different framework — always store the training mean map and the global mean at fit time, then apply both at transform time. Those two numbers are your entire model state for this encoder.
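The fit-time state described above — the per-category mean map plus the global mean — fits in a few lines. A minimal sketch (the class name and method signatures are my own, and it does no smoothing or cross-fitting, so it is not leakage-safe for encoding its own training rows):

```python
import pandas as pd

class SimpleTargetEncoder:
    """Minimal target encoder: stores the per-category mean map and the
    global mean at fit time; applies both at transform time."""

    def fit(self, categories: pd.Series, target: pd.Series):
        self.mean_map_ = target.groupby(categories).mean()
        self.global_mean_ = target.mean()
        return self

    def transform(self, categories: pd.Series) -> pd.Series:
        # Unseen categories fall back to the global training mean
        return categories.map(self.mean_map_).fillna(self.global_mean_)

# Usage: fit on training labels, transform data containing an unseen city
enc = SimpleTargetEncoder().fit(
    pd.Series(['Leeds', 'Leeds', 'Bristol']), pd.Series([1, 0, 1])
)
print(enc.transform(pd.Series(['Leeds', 'Sheffield'])).tolist())
```

Leeds maps to its training mean of 0.5; Sheffield, never seen at fit time, receives the global mean of 2/3.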
Practice Questions
1. What fallback value should be used when a target-encoded column encounters an unseen category at test time?
2. To avoid target leakage, category target means must be computed from ________ data only.
3. What technique blends a rare category's target mean with the global mean to produce a more stable encoding?
Quiz
1. What does target encoding do to a categorical column?
2. Why is naive target encoding (computing the full group mean without exclusion) considered leaky?
3. Target encoding is most appropriate for which type of feature?
Up Next · Lesson 19
Frequency Encoding
Replace categories with how often they appear — a target-free alternative that captures rarity and popularity without touching the label.