Feature Engineering Course
Frequency Encoding
Sometimes the most useful thing about a category is how common it is. A browser used by 40% of your visitors tells a very different story than one used by 0.02%. Frequency encoding captures that signal without ever looking at the target variable — making it one of the safest, most leak-proof encoding methods available.
Frequency encoding replaces each category value with the proportion (or count) of times it appears in the training data. It requires no target variable, produces a single compact numerical column, and handles high-cardinality features gracefully. Rare categories get low values; common ones get high values — and that contrast is often exactly what the model needs.
Count vs Proportion Encoding
There are two variants. Count encoding replaces each category with the raw number of times it appears in training. Proportion encoding (more common in practice) divides by the total number of rows, giving a value between 0 and 1. Proportion encoding is preferred because it stays valid even if the training set size changes between experiments.
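The two variants differ by a single argument in pandas. A minimal sketch, using a made-up browser column for illustration:

```python
import pandas as pd

# Hypothetical browser column to illustrate the two variants
browsers = pd.Series(['Chrome', 'Chrome', 'Firefox', 'Chrome', 'Safari'])

# Count encoding: raw number of appearances per category
counts = browsers.value_counts()

# Proportion encoding: normalize=True divides by the total row count
proportions = browsers.value_counts(normalize=True)

# Map either variant back onto the rows
count_encoded = browsers.map(counts)        # Chrome -> 3, Firefox -> 1, Safari -> 1
prop_encoded = browsers.map(proportions)    # Chrome -> 0.6, Firefox -> 0.2, Safari -> 0.2
```

Note that the two encodings are perfectly correlated on a fixed dataset; the proportion form only matters once you compare runs on differently sized training sets.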
No target required — zero leakage risk
Unlike target encoding, frequency encoding never touches the label column. There is nothing to leak. It can be fitted safely on any data fold, including in cross-validation inner loops, without worrying about label bleedthrough.
Works for any cardinality
Whether your column has 5 categories or 50,000 zip codes, frequency encoding always produces exactly one column. The rare tail of a long-tailed distribution gets naturally compressed toward zero, making it easy for the model to treat rarity as a signal.
Limitation: same frequency, different meaning
Two categories that appear equally often get the same encoded value — even if they have completely different relationships to the target. If "pizza" and "sushi" both appear 200 times but one converts at 80% and the other at 20%, frequency encoding sees them as identical.
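The limitation is easy to demonstrate with a toy frame (the food/converted columns here are illustrative, not from the lesson's datasets):

```python
import pandas as pd

# Two categories with equal frequency but very different conversion rates
df = pd.DataFrame({
    'food': ['pizza'] * 4 + ['sushi'] * 4,
    'converted': [1, 1, 1, 0,  0, 0, 0, 1]  # pizza converts 75%, sushi 25%
})

freq = df['food'].value_counts(normalize=True)
df['food_freq'] = df['food'].map(freq)

# Both categories receive the identical encoded value (0.5)...
assert df['food_freq'].nunique() == 1

# ...even though their conversion rates differ sharply
print(df.groupby('food')['converted'].mean())
```

This is exactly the gap target encoding fills, which is why the two are often used side by side (as shown later in this lesson).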
Basic Frequency Encoding
The scenario: You're a data engineer at a ride-hailing company building a demand forecasting model. One feature is pickup_zone — the zone where each ride originates. With hundreds of zones in a real deployment, one-hot encoding is out. Target encoding is available but your company has strict leakage policies on the training pipeline. Frequency encoding is the approved method: replace each zone with how often it appears as a proportion of all training trips.
# Import pandas
import pandas as pd
# Ride-hailing trip training data
rides_train = pd.DataFrame({
'trip_id': ['R01','R02','R03','R04','R05','R06',
'R07','R08','R09','R10','R11','R12'],
'pickup_zone':['City','Airport','City','Suburbs','City','Airport',
'City','Suburbs','City','Docklands','City','Airport'],
'fare_gbp': [8,24,9,14,7,22,11,13,8,16,10,25],
'surge_pricing':[0,1,0,0,1,1,0,0,0,1,0,1]
})
# Step 1: compute proportion frequency of each zone in training data
# value_counts(normalize=True) gives proportions that sum to 1.0
zone_freq = rides_train['pickup_zone'].value_counts(normalize=True)
print("Zone proportions in training data:")
print(zone_freq.round(4).to_string())
print()
# Step 2: map proportions back onto each training row
rides_train['zone_freq_enc'] = rides_train['pickup_zone'].map(zone_freq).round(4)
# Print the encoded result
print(rides_train[['trip_id','pickup_zone','zone_freq_enc','surge_pricing']].to_string(index=False))
Zone proportions in training data:
pickup_zone
City 0.5000
Airport 0.2500
Suburbs 0.1667
Docklands 0.0833
trip_id pickup_zone zone_freq_enc surge_pricing
R01 City 0.5000 0
R02 Airport 0.2500 1
R03 City 0.5000 0
R04 Suburbs 0.1667 0
R05 City 0.5000 1
R06 Airport 0.2500 1
R07 City 0.5000 0
R08 Suburbs 0.1667 0
R09 City 0.5000 0
R10 Docklands 0.0833 1
R11 City 0.5000 0
R12 Airport 0.2500 1
What just happened?
value_counts(normalize=True) computed the proportion of each zone across all training rows. City — appearing in 6 of 12 trips — became 0.5. Docklands — appearing only once — became 0.083. The .map() then stamped those proportions onto every row. Notice how surge pricing correlates with lower-frequency zones: Airport and Docklands have higher surge rates than City. The frequency encoding captures that rarity signal even without ever looking at the target.
Applying the Frequency Map to Test Data
The scenario: New trips arrive in your test set — including pickups from "Heathrow", a zone that never appeared in training. You need to apply the same frequency map from training to the test set, and handle the unseen zone with a sensible fallback — just as you did with target encoding in Lesson 18.
# Import pandas
import pandas as pd
# Re-establish the training frequency map from the previous block
rides_train = pd.DataFrame({
'pickup_zone':['City','Airport','City','Suburbs','City','Airport',
'City','Suburbs','City','Docklands','City','Airport']
})
# Compute proportion frequency on training data only
zone_freq = rides_train['pickup_zone'].value_counts(normalize=True)
# Test data — contains 'Heathrow', a zone never seen in training
rides_test = pd.DataFrame({
'trip_id': ['T01','T02','T03','T04','T05'],
'pickup_zone':['City','Heathrow','Airport','Heathrow','Suburbs'],
'fare_gbp': [9,35,23,38,12]
})
# Fallback: minimum observed frequency in training — rare category gets rare treatment
# Alternative: use 0 or a small epsilon — choice depends on how you want the model to treat unknowns
fallback_freq = zone_freq.min()
print(f"Fallback frequency for unseen zones: {fallback_freq:.4f}")
# Map training frequencies onto test — fillna applies the fallback to unseen 'Heathrow'
rides_test['zone_freq_enc'] = rides_test['pickup_zone'].map(zone_freq).fillna(fallback_freq).round(4)
print("\nTest set with frequency encoding applied:")
print(rides_test[['trip_id','pickup_zone','zone_freq_enc']].to_string(index=False))
Fallback frequency for unseen zones: 0.0833
Test set with frequency encoding applied:
trip_id pickup_zone zone_freq_enc
T01 City 0.5000
T02 Heathrow 0.0833
T03 Airport 0.2500
T04 Heathrow 0.0833
T05 Suburbs 0.1667
What just happened?
The training frequency map was applied to the test rows using .map(). Heathrow — unseen in training — received NaN from the map, which .fillna() replaced with the minimum observed training frequency (0.0833). This treats an unknown zone as "rare" — the most conservative, sensible assumption. The pipeline never crashed, and no target information was needed at any point.
Combining Frequency Encoding with Target Encoding
The scenario: You're a senior data scientist at a digital advertising firm. Your click-through rate model uses ad_platform as a feature. You decide to create both a frequency-encoded column and a target-encoded column — giving the model two complementary signals: how popular the platform is, and how well it converts. Together they carry more information than either alone.
# Import pandas
import pandas as pd
# Ad campaign data — platform is the categorical feature
ads_df = pd.DataFrame({
'ad_id': ['A01','A02','A03','A04','A05','A06','A07','A08',
'A09','A10','A11','A12'],
'ad_platform':['Google','Meta','Google','TikTok','Meta','Google',
'TikTok','Google','Meta','TikTok','Google','Meta'],
'spend_usd': [500,300,450,200,320,480,190,510,290,210,470,310],
'clicked': [1,0,1,1,1,1,0,1,0,1,1,1] # target: 1 = ad was clicked
})
# Frequency encoding — proportion of rows each platform appears in
platform_freq = ads_df['ad_platform'].value_counts(normalize=True)
ads_df['platform_freq'] = ads_df['ad_platform'].map(platform_freq).round(4)
# Target encoding — mean click rate per platform (naive, for illustration only)
# In production this would be smoothed or computed with leave-one-out
platform_ctr = ads_df.groupby('ad_platform')['clicked'].mean()
ads_df['platform_target_enc'] = ads_df['ad_platform'].map(platform_ctr).round(4)
# Print both encoded columns side by side
print("Platform frequency and target encoding comparison:")
print(ads_df[['ad_id','ad_platform','platform_freq','platform_target_enc','clicked']].to_string(index=False))
print("\nSummary per platform:")
summary = ads_df.groupby('ad_platform').agg(
count=('ad_id','count'),
freq_enc=('platform_freq','first'),
target_enc=('platform_target_enc','first')
).reset_index()
print(summary.to_string(index=False))
Platform frequency and target encoding comparison:
ad_id ad_platform platform_freq platform_target_enc clicked
A01 Google 0.4167 1.0000 1
A02 Meta 0.3333 0.5000 0
A03 Google 0.4167 1.0000 1
A04 TikTok 0.2500 0.6667 1
A05 Meta 0.3333 0.5000 1
A06 Google 0.4167 1.0000 1
A07 TikTok 0.2500 0.6667 0
A08 Google 0.4167 1.0000 1
A09 Meta 0.3333 0.5000 0
A10 TikTok 0.2500 0.6667 1
A11 Google 0.4167 1.0000 1
A12 Meta 0.3333 0.5000 1
Summary per platform:
ad_platform count freq_enc target_enc
Google 5 0.4167 1.0000
Meta 4 0.3333 0.5000
TikTok 3 0.2500 0.6667
What just happened?
The summary table reveals why both columns matter. Google and Meta have fairly close frequencies (0.42 vs 0.33) but very different CTRs (1.00 vs 0.50) — frequency encoding alone can't tell them apart. TikTok is the least frequent (0.25) with a moderate CTR (0.67) — a smaller channel that still converts reasonably well. Together, the two columns give the model complementary signals — volume and effectiveness — with neither redundant to the other.
Frequency vs Target Encoding — When to Use Each
| Consideration | Frequency Encoding | Target Encoding |
|---|---|---|
| Leakage risk | None | High if done naively |
| Requires target variable | No | Yes |
| Signal type | Popularity / rarity | Outcome rate per category |
| Unseen category default | 0 or min frequency | Global target mean |
| Best for | Rarity as a signal, unsupervised pipelines | Supervised models where category predicts outcome |
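To make the pattern from this lesson reusable across pipelines, the fit/transform steps can be packaged in a small class. This is a sketch of one possible design, not part of any library; the fallback strategy defaults to the minimum training frequency, as used above:

```python
import pandas as pd

class FrequencyEncoder:
    """Minimal fit/transform wrapper for proportion frequency encoding.

    Sketch only: learns category proportions on training data and maps
    unseen categories to the minimum observed training frequency.
    """
    def fit(self, series: pd.Series) -> 'FrequencyEncoder':
        # Learn proportions on training data only — no target involved
        self.freq_ = series.value_counts(normalize=True)
        self.fallback_ = self.freq_.min()
        return self

    def transform(self, series: pd.Series) -> pd.Series:
        # Unseen categories map to NaN, then receive the fallback
        return series.map(self.freq_).fillna(self.fallback_)

# Usage with a small subset of the lesson's zones
train = pd.Series(['City', 'City', 'Airport', 'Suburbs'])
test = pd.Series(['City', 'Heathrow'])
enc = FrequencyEncoder().fit(train)
print(enc.transform(test).tolist())  # [0.5, 0.25]
```

Keeping the learned map and the fallback as fitted attributes mirrors the scikit-learn convention, so the same object encodes train, validation, and test consistently.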
Teacher's Note
Frequency encoding is underrated. It seems almost too simple — just count how often each value appears — but in fraud detection, anomaly detection, and recommendation systems, category rarity is genuinely one of the strongest signals you can give a model. A transaction in a zip code that represents 0.001% of all training transactions is inherently suspicious, regardless of what the target column says. The model learns that small frequency values warrant scrutiny. Whenever you reach for target encoding, ask yourself first whether frequency alone might already be enough — it's leakage-free, cheaper to compute, and requires no careful cross-fitting.
Practice Questions
1. Which pandas method computes the frequency (count or proportion) of each unique value in a Series?
2. What argument do you pass to value_counts() to get proportions instead of raw counts?
3. How much leakage risk does frequency encoding introduce compared to target encoding?
Quiz
1. What value does frequency encoding assign to each category?
2. What is the main limitation of frequency encoding compared to target encoding?
3. An unseen category appears in the test set during frequency encoding. What is the recommended approach?
Up Next · Lesson 20
Weight of Evidence
A classical encoding from credit scoring that measures how strongly each category separates the two classes — and why it's still widely used in regulated industries today.