Feature Engineering Course
Target Encoding
One-hot encoding a column with 500 unique cities creates 500 new columns. That's not engineering — that's a dimensionality disaster. Target encoding collapses every category into a single number that actually carries predictive signal: the mean of the target for that category.
Target encoding replaces each category label with the mean (or another statistic) of the target variable for rows belonging to that category. A city where 80% of loans default gets encoded as 0.80. A city where 10% default becomes 0.10. The model receives a single numerical column with real predictive content — not hundreds of sparse binary ones.
The High-Cardinality Problem
High-cardinality columns are categorical features with many unique values — zip codes, product IDs, user IDs, city names, browser user-agents. One-hot encoding is useless here: you get an enormous sparse matrix where most columns are almost entirely zeros, and the model learns almost nothing useful from them.
One-hot encoding a column with 300 cities
Produces 299 binary columns (drop_first). Each row has exactly one 1 and 298 zeros. The model sees almost no signal per column. Training slows dramatically and overfitting risk is high.
Target encoding the same column
Produces exactly 1 column. Each city is replaced by its historical target mean — a compact numerical representation that directly encodes each category's relationship to the outcome being predicted.
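The width difference is easy to demonstrate on a small made-up frame (the city names and targets below are purely illustrative):

```python
import pandas as pd

# Hypothetical sample: 6 rows, 3 cities, binary target
df = pd.DataFrame({
    'city':   ['Leeds', 'Bristol', 'Leeds', 'York', 'Bristol', 'York'],
    'target': [1, 0, 1, 0, 1, 0],
})

# One-hot: one binary column per city (minus one with drop_first=True)
one_hot = pd.get_dummies(df['city'], drop_first=True)
print(one_hot.shape[1])  # 2 columns for 3 cities; 299 for 300 cities

# Target encoding: always exactly one column, whatever the cardinality
encoded = df['city'].map(df.groupby('city')['target'].mean())
print(encoded.tolist())  # [1.0, 0.5, 1.0, 0.0, 0.5, 0.0]
```

The one-hot matrix grows linearly with cardinality; the target-encoded column never does.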
Manual Target Encoding
The scenario: You're a data scientist at a property insurance company building a claims frequency model. One feature is city — the city where the insured property is located. With 8 cities in your training data, one-hot encoding would be manageable, but you want to understand target encoding from scratch before using the sklearn wrapper. You'll manually compute the mean claim rate per city and map it back onto the training rows.
# Import pandas
import pandas as pd
# Property insurance training data — city is the categorical column to encode
train_df = pd.DataFrame({
    'policy_id': ['PL01','PL02','PL03','PL04','PL05','PL06',
                  'PL07','PL08','PL09','PL10','PL11','PL12'],
    'city': ['Leeds','Manchester','Leeds','Bristol','Manchester','Leeds',
             'Bristol','Birmingham','Manchester','Birmingham','Leeds','Bristol'],
    'property_age': [12,25,8,40,15,30,22,5,18,11,35,28],
    'had_claim': [0,1,0,1,1,0,0,0,1,0,1,1]  # target: 1 = claim filed
})
# Step 1: compute mean target per city on TRAINING data only
city_means = train_df.groupby('city')['had_claim'].mean().round(4)
print("Target mean (claim rate) per city:")
print(city_means.to_string())
print()
# Step 2: map the city mean back onto each training row
train_df['city_target_enc'] = train_df['city'].map(city_means)
# Print the result — city string replaced by its historical claim rate
print(train_df[['policy_id','city','had_claim','city_target_enc']].to_string(index=False))
Target mean (claim rate) per city:
city
Birmingham 0.0000
Bristol 0.6667
Leeds 0.2500
Manchester 1.0000
policy_id city had_claim city_target_enc
PL01 Leeds 0 0.2500
PL02 Manchester 1 1.0000
PL03 Leeds 0 0.2500
PL04 Bristol 1 0.6667
PL05 Manchester 1 1.0000
PL06 Leeds 0 0.2500
PL07 Bristol 0 0.6667
PL08 Birmingham 0 0.0000
PL09 Manchester 1 1.0000
PL10 Birmingham 0 0.0000
PL11 Leeds 1 0.2500
PL12 Bristol 1 0.6667
What just happened?
We grouped by city, computed the mean of had_claim for each, then used .map() to broadcast those means back onto every row. Manchester — where all three policies filed claims — became 1.0. Birmingham — zero claims — became 0.0. The city column went from a string label to a compact predictive number, without adding a single extra column.
The Data Leakage Risk in Target Encoding
There is a serious problem with the manual approach above. When computing the target mean for a row, you're using that row's own target value in the calculation. For PL11 in Leeds, the mean includes PL11's own had_claim=1. This is a form of target leakage — the encoded feature contains information about the target it's supposed to predict, which inflates training performance without improving generalisation.
Naive target encoding — leaky
Compute mean over all rows in the group, including the current row. The encoded value for row i contains y_i itself. Training scores look great. Test scores disappoint.
Leave-one-out / smoothed — safe
Exclude each row from its own group mean, or blend the group mean with the global mean. Used in production encoders like sklearn's TargetEncoder. Leakage is eliminated.
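The leave-one-out variant can be sketched with two groupby transforms. This is a minimal illustration on made-up rows, not the exact formula any production encoder uses, and it ignores the singleton-category case (a group of size 1 would divide by zero):

```python
import pandas as pd

# Made-up training rows: each row's encoding must exclude its own target
df = pd.DataFrame({
    'city': ['Leeds', 'Leeds', 'Leeds', 'Bristol', 'Bristol'],
    'had_claim': [1, 0, 1, 1, 0],
})

grp = df.groupby('city')['had_claim']
group_sum = grp.transform('sum')      # per-row: sum of targets in its group
group_count = grp.transform('count')  # per-row: size of its group

# Leave-one-out mean: (group sum minus this row's own target) / (n - 1)
df['city_loo_enc'] = (group_sum - df['had_claim']) / (group_count - 1)
print(df)
```

Note how the two claim-filing Leeds rows get 0.5 while the non-claiming Leeds row gets 1.0: each row's own label no longer leaks into its encoding.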
Target Encoding with sklearn's TargetEncoder
The scenario: You're a machine learning engineer at an e-commerce marketplace. Your product category column has 15 unique values in production (the toy sample below uses four of them) and you're building a purchase conversion model. You need target encoding that is pipeline-safe, handles unseen categories gracefully, and uses smoothing to stabilise rare categories. Sklearn's TargetEncoder (available from sklearn 1.3) handles all of this out of the box.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import TargetEncoder
from sklearn.model_selection import train_test_split
# E-commerce session data — product_category is the high-cardinality column
ecomm_df = pd.DataFrame({
    'session_id': ['S01','S02','S03','S04','S05','S06','S07','S08',
                   'S09','S10','S11','S12','S13','S14','S15','S16'],
    'product_category': ['electronics','fashion','electronics','books',
                         'fashion','electronics','books','sports',
                         'fashion','sports','electronics','books',
                         'sports','fashion','electronics','sports'],
    'time_on_page_s': [120,45,200,30,90,175,25,60,
                       80,55,210,35,70,95,185,65],
    'converted': [1,0,1,0,1,1,0,0,
                  1,0,1,0,1,0,1,0]  # target: 1 = purchase made
})
# Split first — always before fitting any encoder
X = ecomm_df[['product_category', 'time_on_page_s']]
y = ecomm_df['converted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# TargetEncoder — smooth='auto' blends each category mean with the global mean to stabilise rare categories
# target_type='binary' tells it the target is a classification problem
te = TargetEncoder(smooth='auto', target_type='binary')
# fit requires both X and y — it needs target values to compute category means
te.fit(X_train[['product_category']], y_train)
# Transform both sets — transform applies the fitted full-data encodings
# (fit_transform on the training set would instead use leakage-safe cross-fitting)
X_train = X_train.copy()
X_test = X_test.copy()
X_train['category_enc'] = te.transform(X_train[['product_category']]).ravel()
X_test['category_enc'] = te.transform(X_test[['product_category']]).ravel()
# Print encoding values learned per category
categories = te.categories_[0]
encodings = te.encodings_[0]
print("Smoothed target encoding per category:")
for cat, enc in zip(categories, encodings):
    print(f"  {cat:<14} → {enc:.4f}")
print()
# Show the training set with original category and encoded value
result = X_train[['product_category', 'category_enc']].copy()
result['converted'] = y_train.values
print(result.to_string(index=False))
Smoothed target encoding per category:
books → 0.3542
electronics → 0.8125
fashion → 0.5417
sports → 0.2708
product_category category_enc converted
electronics 0.8125 1
sports 0.2708 0
fashion 0.5417 1
electronics 0.8125 1
sports 0.2708 0
fashion 0.5417 0
sports 0.2708 1
fashion 0.5417 1
electronics 0.8125 1
books 0.3542 0
books 0.3542 0
electronics 0.8125 1
What just happened?
TargetEncoder computed a smoothed mean per category during .fit(). The smooth='auto' setting blends each category's observed mean with the global mean — categories with fewer training rows get pulled more strongly toward the global average, reducing the risk that a rare category overfits to a lucky or unlucky sample. Electronics (highest converter) sits at 0.81, sports (lowest) at 0.27.
Handling Unseen Categories at Test Time
The scenario: A new city — "Sheffield" — appears in test data that was never in training. Naive manual encoding with .map() would produce NaN for this row. You need a strategy to fill unseen categories with a sensible fallback — the global mean of the target across all training data is the standard choice.
# Import pandas
import pandas as pd
# Training data — only three known cities
train_df = pd.DataFrame({
    'city': ['Leeds','Manchester','Leeds','Bristol','Manchester','Leeds'],
    'had_claim': [0, 1, 0, 1, 1, 1]
})
# Test data — contains 'Sheffield', a city never seen in training
test_df = pd.DataFrame({
    'city': ['Leeds', 'Sheffield', 'Manchester', 'Sheffield'],
    'had_claim': [0, 1, 1, 0]
})
# Compute target means from training data only
city_means = train_df.groupby('city')['had_claim'].mean().round(4)
# Global mean — fallback value for any unseen category
global_mean = train_df['had_claim'].mean()
print(f"Global target mean (training): {global_mean:.4f}")
print(f"City means:\n{city_means.to_string()}\n")
# Map city means onto test rows — fillna(global_mean) handles Sheffield gracefully
test_df['city_target_enc'] = test_df['city'].map(city_means).fillna(global_mean)
print("Test set with target encoding applied:")
print(test_df.to_string(index=False))
Global target mean (training): 0.6667
City means:
city
Bristol 1.0000
Leeds 0.3333
Manchester 1.0000
Test set with target encoding applied:
city had_claim city_target_enc
Leeds 0 0.3333
Sheffield 1 0.6667
Manchester 1 1.0000
Sheffield 0 0.6667
What just happened?
.map() returned NaN for Sheffield since it was never in training. The chained .fillna(global_mean) replaced those NaNs with 0.6667 — the overall training claim rate. This is the correct, principled fallback: in the absence of category-specific evidence, use the prior. If you were using sklearn's TargetEncoder, it applies the global mean for unseen categories automatically.
Never compute target means on the full dataset
Always compute encoding maps from training data only. If you compute city means across train+test combined, test target values contaminate the encoding — a textbook leakage error that inflates cross-validation scores and produces a model that silently underperforms in production.
Smoothing protects against rare categories
A city with only two training rows — one claim, one not — gets a raw mean of 0.5. That estimate is noisy. Smoothing blends it toward the global mean proportionally to how few samples the category has. The result is a more stable, generalisation-friendly encoding for rare categories.
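One common form of that blend is a weighted average between the category mean and the global mean, with a smoothing weight m that acts like m "virtual" rows at the global mean. The weight and data below are illustrative, not the formula behind sklearn's smooth='auto':

```python
import pandas as pd

# Hypothetical data: 'Hull' has only 2 rows — its raw mean is noisy
df = pd.DataFrame({
    'city': ['Leeds'] * 8 + ['Hull'] * 2,
    'had_claim': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})

m = 5  # smoothing weight: m virtual rows pinned at the global mean
global_mean = df['had_claim'].mean()  # 0.6
stats = df.groupby('city')['had_claim'].agg(['mean', 'count'])

# Blend: the fewer real rows a category has, the closer it sits to the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
print(smoothed)
```

Hull's raw mean of 0.5 is pulled to roughly 0.571, much closer to the global 0.6, while Leeds (8 rows, raw mean 0.625) barely moves to about 0.615.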
Teacher's Note
Target encoding is one of the most powerful techniques for high-cardinality categoricals — but it is also one of the easiest to apply incorrectly. The two failure modes are leakage (computing means including the row's own target) and unseen categories (NaNs at test time). Sklearn's TargetEncoder solves both: cross-fitting when the training set is encoded with fit_transform, and a global-mean fallback for categories unseen at transform time. When you cannot use the sklearn wrapper — for example in a custom pipeline or a different framework — always store the training mean map and the global mean at fit time, then apply both at transform time. Those two numbers are your entire model state for this encoder.
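The fit-time state described above — the per-category mean map plus the global mean — fits in a few lines. A minimal sketch (the class name and method signatures are my own, and it does no smoothing or cross-fitting, so it is not leakage-safe for encoding its own training rows):

```python
import pandas as pd

class SimpleTargetEncoder:
    """Minimal target encoder: stores the per-category mean map and the
    global mean at fit time; applies both at transform time."""

    def fit(self, categories: pd.Series, target: pd.Series):
        self.mean_map_ = target.groupby(categories).mean()
        self.global_mean_ = target.mean()
        return self

    def transform(self, categories: pd.Series) -> pd.Series:
        # Unseen categories fall back to the global training mean
        return categories.map(self.mean_map_).fillna(self.global_mean_)

# Usage: fit on training labels, transform data containing an unseen city
enc = SimpleTargetEncoder().fit(
    pd.Series(['Leeds', 'Leeds', 'Bristol']), pd.Series([1, 0, 1])
)
print(enc.transform(pd.Series(['Leeds', 'Sheffield'])).tolist())
```

Leeds maps to its training mean of 0.5; Sheffield, never seen at fit time, receives the global mean of 2/3.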
Practice Questions
1. What fallback value should be used when a target-encoded column encounters an unseen category at test time?
2. To avoid target leakage, category target means must be computed from ________ data only.
3. What technique blends a rare category's target mean with the global mean to produce a more stable encoding?
Quiz
1. What does target encoding do to a categorical column?
2. Why is naive target encoding (computing the full group mean without exclusion) considered leaky?
3. Target encoding is most appropriate for which type of feature?
Up Next · Lesson 19
Frequency Encoding
Replace categories with how often they appear — a target-free alternative that captures rarity and popularity without touching the label.