Feature Engineering Lesson 21 – Rare Label Encoding | Dataplexa
Intermediate Level · Lesson 21

Rare Label Encoding

Some categories only show up a handful of times in your data — and those rare labels can quietly destroy your model. Here's how to find them, handle them, and stop them from breaking your pipeline.

A rare label is a category that appears so infrequently in your training data that the model has almost no signal to learn from it — and that the encoder has likely never seen before when it hits production data.

Why Rare Labels Are a Real Problem

Picture this: you train a loan approval model. The "employer_industry" column has 200 unique values. Most of them appear thousands of times. But "Whaling" and "Antarctic Research" each appear twice. You encode, train, and deploy. Six months later a new applicant comes in with "Street Performance" as their industry — a value your encoder has never seen. It crashes.

That's not a contrived edge case. That's Tuesday in production. Rare labels cause four distinct types of damage:

1. Encoder collapse at inference time

OrdinalEncoder and LabelEncoder throw a hard error when they see an unseen category. Your pipeline crashes rather than degrading gracefully.
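A minimal sketch of the failure on a toy two-category fit (the make names here are illustrative). Recent scikit-learn versions also offer handle_unknown='use_encoded_value' as a built-in escape hatch, shown at the end:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Fit on two known categories only
enc = OrdinalEncoder()
enc.fit(np.array([['Toyota'], ['Ford']]))

# An unseen category raises a hard ValueError at transform time
try:
    enc.transform(np.array([['Pontiac']]))
except ValueError as e:
    print("Crashed:", e)

# Built-in mitigation: route unknowns to a sentinel code instead
safe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
safe.fit(np.array([['Toyota'], ['Ford']]))
print(safe.transform(np.array([['Pontiac']])))  # [[-1.]]
```

The sentinel approach stops the crash, but the model still has no meaningful signal for that `-1` code — which is why the grouping strategy later in this lesson is usually the better fix.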

2. Overfitting on noise

A tree model might split on a rare category that appeared twice, both times with the outcome "default". That's not signal — that's luck. The model memorises it anyway.

3. High-cardinality bloat

One-hot encoding a column with 200 categories adds 200 columns. If 150 of those are rare, you've added 150 near-zero columns that contribute nothing but noise and memory usage.
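The bloat is easy to see with pandas' get_dummies on a hypothetical 200-category column (the category names are made up for illustration):

```python
import pandas as pd

# Hypothetical column: 10,000 rows spread evenly over 200 category names
cats = [f'cat_{i}' for i in range(200)]
col = pd.Series(cats * 50)               # every category present, 50 rows each

dummies = pd.get_dummies(col)
print(dummies.shape)                     # (10000, 200): one new column per category
print((dummies == 0).to_numpy().mean())  # 0.995: the matrix is 99.5% zeros
```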

4. Target encoding contamination

Target encoding a rare label on 2 observations gives a wildly unreliable mean. That estimate leaks into the model as if it were solid signal.
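A toy illustration of the problem, using made-up industry data. Both rows of the rare label happened to default, so its encoded mean comes out as a "perfect" 1.0:

```python
import pandas as pd

# 1,000 'Retail' rows with a 10% default rate, plus 2 'Whaling' rows
# that both happened to default
df = pd.DataFrame({
    'industry': ['Retail'] * 1000 + ['Whaling'] * 2,
    'default':  [0] * 900 + [1] * 100 + [1, 1],
})

# Target encoding = per-category mean of the target
means = df.groupby('industry')['default'].mean()
print(means)
# Retail sits near the true base rate (0.10); Whaling is encoded as 1.0
# from just two observations: noise dressed up as solid signal.
```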

Defining "Rare" — The Threshold Decision

There is no universal rule. The most common approach is to define rare as any category whose frequency falls below a percentage threshold — typically 1% or 5% of the total training set.

Raw category counts

Checking absolute counts is dangerous. A category appearing 50 times might be rare in a 50,000-row dataset but perfectly common in a 500-row one. Counts don't travel across datasets.

Relative frequency (recommended)

Frequency as a proportion of total rows. A 1% threshold means any category seen in fewer than 1 in 100 rows is rare. This scales with your dataset size and is consistent across retraining cycles.
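The difference is quick to demonstrate with two toy Series of different sizes that share the same absolute count:

```python
import pandas as pd

# 50 occurrences of 'B' in each dataset, but very different dataset sizes
small = pd.Series(['A'] * 450 + ['B'] * 50)        # 500 rows
large = pd.Series(['A'] * 49_950 + ['B'] * 50)     # 50,000 rows

# Same raw count...
print(small.value_counts()['B'], large.value_counts()['B'])          # 50 50
# ...very different relative frequencies
print(small.value_counts(normalize=True)['B'])                       # 0.10  (common)
print(large.value_counts(normalize=True)['B'])                       # 0.001 (rare)
```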

Step 1 — Identifying Rare Labels

The scenario: You're a data scientist at an insurance company building a claims-frequency model. One of your key features is vehicle_make — the car manufacturer. Marketing collected this field as free text, and people typed whatever they felt like. Before you even think about encoding, you need to understand which makes are represented reliably and which are statistical noise.

# Import the tools we need
import pandas as pd
import numpy as np

# Build a realistic insurance DataFrame — 300 rows, messy vehicle_make column
np.random.seed(42)

# Most rows belong to common makes; a long tail of rare entries
common_makes = ['Toyota', 'Ford', 'Honda', 'Chevrolet', 'BMW']
rare_makes   = ['Studebaker', 'DeLorean', 'Trabant', 'Moskvitch',
                'Yugo', 'Reliant', 'Lada', 'Wartburg']

# Sample: 270 rows from the common makes, 30 from the rare tail
make_pool = (
    np.random.choice(common_makes, size=270).tolist() +
    np.random.choice(rare_makes, size=30).tolist()
)
np.random.shuffle(make_pool)  # Mix them together

# Create the DataFrame
insurance_df = pd.DataFrame({
    'policy_id':    range(1, 301),
    'vehicle_make': make_pool,
    'vehicle_age':  np.random.randint(1, 20, size=300),
    'annual_claim': np.random.choice([0, 1], size=300, p=[0.75, 0.25])
})

# Step 1: Calculate frequency of each category as a proportion
make_freq = insurance_df['vehicle_make'].value_counts(normalize=True)

# Step 2: Print frequency table rounded to 4 decimal places
print(make_freq.round(4))
print()

# Step 3: Define our threshold — 5% of total rows
threshold = 0.05

# Step 4: Identify which categories fall below the threshold
rare_labels = make_freq[make_freq < threshold].index.tolist()

# Step 5: Print them
print(f"Rare labels (frequency < {threshold}):")
print(rare_labels)
vehicle_make
Toyota        0.1933
Ford          0.1867
Honda         0.1800
Chevrolet     0.1733
BMW           0.1667
Studebaker    0.0233
DeLorean      0.0133
Lada          0.0133
Yugo          0.0100
Reliant       0.0100
Trabant       0.0100
Wartburg      0.0100
Moskvitch     0.0100
Name: proportion, dtype: float64

Rare labels (frequency < 0.05):
['Studebaker', 'DeLorean', 'Lada', 'Yugo', 'Reliant', 'Trabant', 'Wartburg', 'Moskvitch']

What just happened?

value_counts(normalize=True) gave us each category as a proportion of the total 300 rows. The five common makes all sit above 16% — reliable signal. Eight rare makes are all below 2.5%, well under our 5% threshold. These are the labels we need to handle before encoding.

Step 2 — Grouping Rare Labels into "Other"

The most robust strategy is to collapse all rare labels into a single "Other" bucket. This bucket is present at training time, so your encoder has seen it. When a new unseen value arrives in production, you map it to "Other" too — no crash, no exception, graceful degradation.

The scenario: You've identified the rare makes. Now you need to transform the column so the encoder only ever sees six categories: the five common makes and "Other". This transformation must be learned on training data and applied identically to validation and test sets — so you'll write it as a reusable function that takes the frequency map as input, not the data itself.

# A proper rare-label encoder stores the "frequent" labels from training
# and replaces anything not in that list with 'Other'

# Step 1: Learn the frequent labels from training data
# (In production this would be your training set only)
freq_map = insurance_df['vehicle_make'].value_counts(normalize=True)

# Step 2: Keep only labels above the threshold
frequent_labels = freq_map[freq_map >= threshold].index.tolist()
print("Frequent labels kept:", frequent_labels)
print()

# Step 3: Write the encoding function
def encode_rare(series, frequent_labels, fill_value='Other'):
    # Replace any value not in frequent_labels with fill_value
    return series.where(series.isin(frequent_labels), other=fill_value)

# Step 4: Apply to our column
insurance_df['vehicle_make_encoded'] = encode_rare(
    insurance_df['vehicle_make'],
    frequent_labels
)

# Step 5: Check value counts on the new column
print(insurance_df['vehicle_make_encoded'].value_counts())
print()

# Step 6: Confirm original column still intact
print("Original unique values:", insurance_df['vehicle_make'].nunique())
print("Encoded unique values: ", insurance_df['vehicle_make_encoded'].nunique())
Frequent labels kept: ['Toyota', 'Ford', 'Honda', 'Chevrolet', 'BMW']

vehicle_make_encoded
Toyota       58
Ford         56
Honda        54
Chevrolet    52
BMW          50
Other        30
Name: count, dtype: int64

Original unique values: 13
Encoded unique values:  6

What just happened?

We reduced a 13-category column to 6 categories. The eight rare makes — totalling 30 rows — were merged into the Other bucket. The encoder will now learn this 6-category vocabulary. Any new label that appears in production (even one we've never seen) gets routed to Other automatically.

Step 3 — Building a Production-Safe Pipeline

The pattern above works, but real pipelines need a class-based encoder that fits on training data, stores state, and transforms consistently. Let's build one using sklearn's BaseEstimator and TransformerMixin.

The scenario: Your manager wants the rare-label step embedded inside a full sklearn Pipeline alongside scaling and a classifier. If it can't be wrapped in Pipeline(), it won't go to production. You need to turn your function into a proper transformer.

# Build a custom sklearn-compatible RareLabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin

class RareLabelEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, threshold=0.05, fill_value='Other', variables=None):
        # threshold: minimum frequency to be considered "frequent"
        self.threshold  = threshold
        # fill_value: what to replace rare labels with
        self.fill_value = fill_value
        # variables: list of columns to encode (None = auto-detect object cols)
        self.variables  = variables

    def fit(self, X, y=None):
        # Auto-detect categorical columns if not specified
        if self.variables is None:
            self.variables_ = X.select_dtypes(include='object').columns.tolist()
        else:
            self.variables_ = self.variables

        # For each column, store the set of frequent labels
        self.frequent_labels_ = {}
        for col in self.variables_:
            freq = X[col].value_counts(normalize=True)              # proportions
            frequent = freq[freq >= self.threshold].index.tolist()  # above threshold
            self.frequent_labels_[col] = frequent                   # store it
        return self  # always return self in fit()

    def transform(self, X):
        X = X.copy()  # never modify the original DataFrame
        for col in self.variables_:
            frequent = self.frequent_labels_[col]
            # Replace anything not in frequent with fill_value
            X[col] = X[col].where(X[col].isin(frequent), other=self.fill_value)
        return X

# --- Demo ---
# Fit on training data (simulating a train/test split)
train_df = insurance_df[['vehicle_make', 'vehicle_age']].iloc[:240].copy()
test_df  = insurance_df[['vehicle_make', 'vehicle_age']].iloc[240:].copy()

# Add a completely new unseen label to test set to prove resilience
test_df.loc[test_df.index[0], 'vehicle_make'] = 'Pontiac'

# Fit the encoder on training data only
encoder = RareLabelEncoder(threshold=0.05)
encoder.fit(train_df)

# Transform both sets
train_encoded = encoder.transform(train_df)
test_encoded  = encoder.transform(test_df)

print("Test set vehicle_make (first 10 rows):")
print(test_encoded['vehicle_make'].head(10).values)
Test set vehicle_make (first 10 rows):
['Other' 'Honda' 'Toyota' 'Ford' 'Chevrolet' 'BMW' 'Toyota' 'Ford' 'Honda' 'Ford']

What just happened?

We injected 'Pontiac' — a brand the encoder never saw — into the test set. Instead of crashing, it silently mapped to 'Other'. The encoder fits on training data, stores frequent_labels_, and applies the same rule at inference time. This is what production resilience looks like.
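Because the class follows the fit/transform contract, it drops straight into a Pipeline. A sketch of that integration — the class is restated compactly here so the snippet runs on its own, the toy data and downstream OneHotEncoder/LogisticRegression steps are illustrative, and handle_unknown='ignore' on the one-hot step is an extra safety net rather than a requirement:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

class RareLabelEncoder(BaseEstimator, TransformerMixin):
    """Compact restatement of the transformer built above."""
    def __init__(self, threshold=0.05, fill_value='Other'):
        self.threshold = threshold
        self.fill_value = fill_value

    def fit(self, X, y=None):
        # Store frequent labels per object column, learned from training data
        self.frequent_labels_ = {
            col: X[col].value_counts(normalize=True)
                       .loc[lambda f: f >= self.threshold].index.tolist()
            for col in X.select_dtypes(include='object').columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, frequent in self.frequent_labels_.items():
            X[col] = X[col].where(X[col].isin(frequent), self.fill_value)
        return X

# Toy training data: two frequent makes plus a thin rare tail
train = pd.DataFrame({'vehicle_make': ['Toyota'] * 60 + ['Ford'] * 35
                                      + ['Yugo'] * 3 + ['Trabant'] * 2})
y = np.array([0, 1] * 50)

pipe = Pipeline([
    ('rare',   RareLabelEncoder(threshold=0.05)),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('model',  LogisticRegression()),
])
pipe.fit(train, y)

# An unseen make at inference time routes to 'Other' — no crash
test = pd.DataFrame({'vehicle_make': ['Pontiac', 'Toyota']})
print(pipe.predict(test))
```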

Step 4 — Handling Multiple Columns at Once

The scenario: Your churn model has five categorical columns. They all have long tails. You don't want to write separate encoding logic for each — you want one transformer that handles all of them in a single .fit_transform() call and reports which labels were marked rare in each column.

# Build a multi-column DataFrame with varying levels of cardinality
np.random.seed(7)

churn_df = pd.DataFrame({
    'plan_type':     np.random.choice(
                        ['Basic','Standard','Premium','Trial','Legacy','Alpha'],
                        p=[0.35, 0.30, 0.25, 0.05, 0.03, 0.02],
                        size=500),
    'payment_method': np.random.choice(
                        ['Card','Bank','PayPal','Crypto','Cheque'],
                        p=[0.50, 0.30, 0.12, 0.05, 0.03],
                        size=500),
    'region':        np.random.choice(
                        ['North','South','East','West','Central','Overseas'],
                        p=[0.28, 0.25, 0.22, 0.18, 0.04, 0.03],
                        size=500),
    'churned':       np.random.choice([0,1], size=500, p=[0.70, 0.30])
})

# Fit the RareLabelEncoder (reusing our class from earlier)
enc = RareLabelEncoder(threshold=0.05)
enc.fit(churn_df[['plan_type','payment_method','region']])

# Inspect what was stored as "frequent" for each column
for col, labels in enc.frequent_labels_.items():
    print(f"\n{col}:")
    print(f"  Frequent: {labels}")
    freq = churn_df[col].value_counts(normalize=True)
    rare  = freq[freq < 0.05].index.tolist()
    print(f"  Rare    : {rare}")

# Transform (the encoder is already fitted above — no need to refit)
churn_encoded = enc.transform(churn_df[['plan_type','payment_method','region']])
print("\n\nUnique values after encoding:")
for col in ['plan_type','payment_method','region']:
    print(f"  {col}: {sorted(churn_encoded[col].unique())}")
plan_type:
  Frequent: ['Basic', 'Standard', 'Premium']
  Rare    : ['Trial', 'Legacy', 'Alpha']

payment_method:
  Frequent: ['Card', 'Bank', 'PayPal']
  Rare    : ['Crypto', 'Cheque']

region:
  Frequent: ['North', 'South', 'East', 'West']
  Rare    : ['Central', 'Overseas']


Unique values after encoding:
  plan_type: ['Basic', 'Other', 'Premium', 'Standard']
  payment_method: ['Bank', 'Card', 'Other', 'PayPal']
  region: ['East', 'North', 'Other', 'South', 'West']

What just happened?

Three columns were processed simultaneously. The encoder identified rare labels per-column — 3 in plan_type, 2 in payment_method, 2 in region — and collapsed them all into Other. Each column's vocabulary is now tight and stable, ready for one-hot or ordinal encoding downstream.

Choosing the Right Threshold

The 5% rule is a starting point, not a law. Here's how to think about calibrating it for your situation:

Situation                        Suggested threshold   Reason
Large dataset (>100k rows)       1%                    Even 1% is 1,000 rows — plenty of signal. You can afford to be selective.
Medium dataset (10k–100k rows)   5%                    The standard heuristic. Catches most noise without over-collapsing.
Small dataset (<10k rows)        10%                   Be aggressive: with small data, even a category holding dozens of rows can be pure noise.
Domain requires granularity      0.5–1%                E.g. fraud models, where a rare merchant type IS the signal. Collapse carefully or not at all.

The airport security analogy

Think of rare label encoding like a security checkpoint. Anyone in your "trusted frequent traveller" list — common labels — passes through quickly. Everyone else goes to the same secondary screening queue — "Other". You don't reject them; you just handle them uniformly. The checkpoint doesn't crash because of an unfamiliar passport.

When NOT to collapse rare labels

In fraud detection, a rare merchant category (say, "cryptocurrency exchange") might be exactly the signal you want. Collapsing it into "Other" destroys that signal. Always check whether rare labels correlate with your target before deciding to merge them.
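A quick way to run that check is a per-category summary of row count and target rate, sorted by the target rate. The data below is a made-up illustration of the fraud scenario:

```python
import pandas as pd

# Toy claims data: two common merchant types plus one rare one
df = pd.DataFrame({
    'merchant_type': ['grocery'] * 480 + ['fuel'] * 500
                     + ['crypto_exchange'] * 20,
    'is_fraud':      [0] * 470 + [1] * 10      # grocery: ~2% fraud
                     + [0] * 495 + [1] * 5     # fuel:    1% fraud
                     + [0] * 8 + [1] * 12,     # crypto:  60% fraud
})

# Row count and fraud rate per category, rarest-but-riskiest first
summary = (df.groupby('merchant_type')['is_fraud']
             .agg(rows='size', fraud_rate='mean')
             .sort_values('fraud_rate', ascending=False))
print(summary)
# crypto_exchange is only 2% of rows, but its fraud rate dwarfs the others.
# Collapsing it into 'Other' would throw that signal away.
```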

Teacher's Note

The rare label encoder must always be fit on training data only and then applied to validation and test sets — never re-fit on the full dataset. If you re-fit on test data, you are leaking information: the encoder will now keep labels as frequent that only appear a handful of times in production, making it too lenient. Fit once, transform everywhere. Treat your frequent_labels_ dictionary as read-only after fitting. When you serialise your pipeline to disk with joblib or pickle, this dictionary travels with it automatically — that is the whole point of a stateful transformer.
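A sketch of that round trip — the class is restated compactly so the snippet runs standalone, and the toy data and file path are illustrative. (In a separate loading process, the class definition must be importable; here the dump and load happen in the same session.)

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RareLabelEncoder(BaseEstimator, TransformerMixin):
    """Compact restatement; the fitted state lives in frequent_labels_."""
    def __init__(self, threshold=0.05, fill_value='Other'):
        self.threshold = threshold
        self.fill_value = fill_value

    def fit(self, X, y=None):
        self.frequent_labels_ = {
            col: X[col].value_counts(normalize=True)
                       .loc[lambda f: f >= self.threshold].index.tolist()
            for col in X.select_dtypes(include='object').columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, frequent in self.frequent_labels_.items():
            X[col] = X[col].where(X[col].isin(frequent), self.fill_value)
        return X

# Fit once on training data: Toyota is 95% of rows, Yugo only 5%
train = pd.DataFrame({'make': ['Toyota'] * 95 + ['Yugo'] * 5})
enc = RareLabelEncoder(threshold=0.10).fit(train)

# Serialise, then restore: the fitted dictionary travels with the object
path = os.path.join(tempfile.mkdtemp(), 'rare_encoder.joblib')
joblib.dump(enc, path)
restored = joblib.load(path)
print(restored.frequent_labels_)   # {'make': ['Toyota']}
```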

Practice Questions

1. When a value in the test set was never seen in training data, rare label encoding maps it to what bucket? (one word)



2. Which argument do you pass to value_counts() to get category frequencies as proportions rather than raw counts?



3. The RareLabelEncoder should be fitted on which data split only? (one word)



Quiz

1. What happens when sklearn's OrdinalEncoder encounters a category in the test set that was not present during training?


2. Why is defining rarity using relative frequency preferred over raw counts?


3. In a fraud detection model you discover a category "cryptocurrency exchange" appears in only 0.3% of rows. What should you do before collapsing it into "Other"?


Up Next · Lesson 22

Outlier-Based Features

Turn anomalies into signals — engineer features that tell your model exactly how extreme a data point really is.