Feature Engineering Course
Frequency Encoding
Sometimes the most useful thing about a category is how common it is. A browser used by 40% of your visitors tells a very different story than one used by 0.02%. Frequency encoding captures that signal without ever looking at the target variable — making it one of the safest, most leak-proof encoding methods available.
Frequency encoding replaces each category value with the proportion (or count) of times it appears in the training data. It requires no target variable, produces a single compact numerical column, and handles high-cardinality features gracefully. Rare categories get low values; common ones get high values — and that contrast is often exactly what the model needs.
Count vs Proportion Encoding
There are two variants. Count encoding replaces each category with the raw number of times it appears in training. Proportion encoding (more common in practice) divides by the total number of rows, giving a value between 0 and 1. Proportion encoding is preferred because it stays valid even if the training set size changes between experiments.
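The two variants differ by a single argument in pandas. A minimal sketch, using a made-up browser column for illustration:

```python
import pandas as pd

# Hypothetical browser column to illustrate the two variants
browsers = pd.Series(['Chrome', 'Chrome', 'Firefox', 'Chrome', 'Safari'])

# Count encoding: raw number of appearances per category
counts = browsers.value_counts()

# Proportion encoding: normalize=True divides by the total row count
proportions = browsers.value_counts(normalize=True)

# Map either variant back onto the rows
count_encoded = browsers.map(counts)        # Chrome -> 3, Firefox -> 1, Safari -> 1
prop_encoded = browsers.map(proportions)    # Chrome -> 0.6, Firefox -> 0.2, Safari -> 0.2
```

Note that the two encodings are perfectly correlated on a fixed dataset; the proportion form only matters once you compare runs on differently sized training sets.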
No target required — zero leakage risk
Unlike target encoding, frequency encoding never touches the label column. There is nothing to leak. It can be fitted safely on any data fold, including in cross-validation inner loops, without worrying about label bleedthrough.
Works for any cardinality
Whether your column has 5 categories or 50,000 zip codes, frequency encoding always produces exactly one column. The rare tail of a long-tailed distribution gets naturally compressed toward zero, making it easy for the model to treat rarity as a signal.
Limitation: same frequency, different meaning
Two categories that appear equally often get the same encoded value — even if they have completely different relationships to the target. If "pizza" and "sushi" both appear 200 times but one converts at 80% and the other at 20%, frequency encoding sees them as identical.
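The limitation is easy to demonstrate with a toy frame (the food/converted columns here are illustrative, not from the lesson's datasets):

```python
import pandas as pd

# Two categories with equal frequency but very different conversion rates
df = pd.DataFrame({
    'food': ['pizza'] * 4 + ['sushi'] * 4,
    'converted': [1, 1, 1, 0,  0, 0, 0, 1]  # pizza converts 75%, sushi 25%
})

freq = df['food'].value_counts(normalize=True)
df['food_freq'] = df['food'].map(freq)

# Both categories receive the identical encoded value (0.5)...
assert df['food_freq'].nunique() == 1

# ...even though their conversion rates differ sharply
print(df.groupby('food')['converted'].mean())
```

This is exactly the gap target encoding fills, which is why the two are often used side by side (as shown later in this lesson).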
Basic Frequency Encoding
The scenario: You're a data engineer at a ride-hailing company building a demand forecasting model. One feature is pickup_zone — the zone where each ride originates. With hundreds of zones in a real deployment, one-hot encoding is out. Target encoding is available but your company has strict leakage policies on the training pipeline. Frequency encoding is the approved method: replace each zone with how often it appears as a proportion of all training trips.
# Import pandas
import pandas as pd
# Ride-hailing trip training data
rides_train = pd.DataFrame({
'trip_id': ['R01','R02','R03','R04','R05','R06',
'R07','R08','R09','R10','R11','R12'],
'pickup_zone':['City','Airport','City','Suburbs','City','Airport',
'City','Suburbs','City','Docklands','City','Airport'],
'fare_gbp': [8,24,9,14,7,22,11,13,8,16,10,25],
'surge_pricing':[0,1,0,0,1,1,0,0,0,1,0,1]
})
# Step 1: compute proportion frequency of each zone in training data
# value_counts(normalize=True) gives proportions that sum to 1.0
zone_freq = rides_train['pickup_zone'].value_counts(normalize=True)
print("Zone proportions in training data:")
print(zone_freq.round(4).to_string())
print()
# Step 2: map proportions back onto each training row
rides_train['zone_freq_enc'] = rides_train['pickup_zone'].map(zone_freq).round(4)
# Print the encoded result
print(rides_train[['trip_id','pickup_zone','zone_freq_enc','surge_pricing']].to_string(index=False))
Zone proportions in training data:
pickup_zone
City 0.5000
Airport 0.2500
Suburbs 0.1667
Docklands 0.0833
trip_id pickup_zone zone_freq_enc surge_pricing
R01 City 0.5000 0
R02 Airport 0.2500 1
R03 City 0.5000 0
R04 Suburbs 0.1667 0
R05 City 0.5000 1
R06 Airport 0.2500 1
R07 City 0.5000 0
R08 Suburbs 0.1667 0
R09 City 0.5000 0
R10 Docklands 0.0833 1
R11 City 0.5000 0
R12 Airport 0.2500 1
What just happened?
value_counts(normalize=True) computed the proportion of each zone across all training rows. City — appearing in 6 of 12 trips — became 0.5. Docklands — appearing only once — became 0.083. The .map() then stamped those proportions onto every row. Notice how surge pricing correlates with lower-frequency zones: Airport and Docklands have higher surge rates than City. The frequency encoding captures that rarity signal even without ever looking at the target.
Applying the Frequency Map to Test Data
The scenario: New trips arrive in your test set — including pickups from "Heathrow", a zone that never appeared in training. You need to apply the same frequency map from training to the test set, and handle the unseen zone with a sensible fallback — just as you did with target encoding in Lesson 18.
# Import pandas
import pandas as pd
# Re-establish the training frequency map from the previous block
rides_train = pd.DataFrame({
'pickup_zone':['City','Airport','City','Suburbs','City','Airport',
'City','Suburbs','City','Docklands','City','Airport']
})
# Compute proportion frequency on training data only
zone_freq = rides_train['pickup_zone'].value_counts(normalize=True)
# Test data — contains 'Heathrow', a zone never seen in training
rides_test = pd.DataFrame({
'trip_id': ['T01','T02','T03','T04','T05'],
'pickup_zone':['City','Heathrow','Airport','Heathrow','Suburbs'],
'fare_gbp': [9,35,23,38,12]
})
# Fallback: minimum observed frequency in training — rare category gets rare treatment
# Alternative: use 0 or a small epsilon — choice depends on how you want the model to treat unknowns
fallback_freq = zone_freq.min()
print(f"Fallback frequency for unseen zones: {fallback_freq:.4f}")
# Map training frequencies onto test — fillna applies the fallback to unseen 'Heathrow'
rides_test['zone_freq_enc'] = rides_test['pickup_zone'].map(zone_freq).fillna(fallback_freq).round(4)
print("\nTest set with frequency encoding applied:")
print(rides_test[['trip_id','pickup_zone','zone_freq_enc']].to_string(index=False))
Fallback frequency for unseen zones: 0.0833
Test set with frequency encoding applied:
trip_id pickup_zone zone_freq_enc
T01 City 0.5000
T02 Heathrow 0.0833
T03 Airport 0.2500
T04 Heathrow 0.0833
T05 Suburbs 0.1667
What just happened?
The training frequency map was applied to the test rows using .map(). Heathrow — unseen in training — received NaN from the map, which .fillna() replaced with the minimum observed training frequency (0.0833). This treats an unknown zone as "rare" — the most conservative, sensible assumption. The pipeline never crashed, and no target information was needed at any point.
Combining Frequency Encoding with Target Encoding
The scenario: You're a senior data scientist at a digital advertising firm. Your click-through rate model uses ad_platform as a feature. You decide to create both a frequency-encoded column and a target-encoded column — giving the model two complementary signals: how popular the platform is, and how well it converts. Together they carry more information than either alone.
# Import pandas
import pandas as pd
# Ad campaign data — platform is the categorical feature
ads_df = pd.DataFrame({
'ad_id': ['A01','A02','A03','A04','A05','A06','A07','A08',
'A09','A10','A11','A12'],
'ad_platform':['Google','Meta','Google','TikTok','Meta','Google',
'TikTok','Google','Meta','TikTok','Google','Meta'],
'spend_usd': [500,300,450,200,320,480,190,510,290,210,470,310],
'clicked': [1,0,1,1,1,1,0,1,0,1,1,1] # target: 1 = ad was clicked
})
# Frequency encoding — proportion of rows each platform appears in
platform_freq = ads_df['ad_platform'].value_counts(normalize=True)
ads_df['platform_freq'] = ads_df['ad_platform'].map(platform_freq).round(4)
# Target encoding — mean click rate per platform (naive, for illustration only)
# In production this would be smoothed or computed with leave-one-out
platform_ctr = ads_df.groupby('ad_platform')['clicked'].mean()
ads_df['platform_target_enc'] = ads_df['ad_platform'].map(platform_ctr).round(4)
# Print both encoded columns side by side
print("Platform frequency and target encoding comparison:")
print(ads_df[['ad_id','ad_platform','platform_freq','platform_target_enc','clicked']].to_string(index=False))
print("\nSummary per platform:")
summary = ads_df.groupby('ad_platform').agg(
count=('ad_id','count'),
freq_enc=('platform_freq','first'),
target_enc=('platform_target_enc','first')
).reset_index()
print(summary.to_string(index=False))
Platform frequency and target encoding comparison:
ad_id ad_platform platform_freq platform_target_enc clicked
A01 Google 0.4167 1.0000 1
A02 Meta 0.3333 0.5000 0
A03 Google 0.4167 1.0000 1
A04 TikTok 0.2500 0.6667 1
A05 Meta 0.3333 0.5000 1
A06 Google 0.4167 1.0000 1
A07 TikTok 0.2500 0.6667 0
A08 Google 0.4167 1.0000 1
A09 Meta 0.3333 0.5000 0
A10 TikTok 0.2500 0.6667 1
A11 Google 0.4167 1.0000 1
A12 Meta 0.3333 0.5000 1
Summary per platform:
ad_platform count freq_enc target_enc
Google 5 0.4167 1.0000
Meta 4 0.3333 0.5000
TikTok 3 0.2500 0.6667
What just happened?
The summary table reveals why both columns matter. Google and Meta have fairly close frequencies (0.42 vs 0.33) but very different CTRs (1.00 vs 0.50) — frequency encoding alone can't tell them apart. TikTok is the least frequent (0.25) with a moderate CTR (0.67) — a smaller channel that still converts reasonably well. Together, the two columns give the model complementary signals — volume and effectiveness — with neither redundant to the other.
Frequency vs Target Encoding — When to Use Each
| Consideration | Frequency Encoding | Target Encoding |
|---|---|---|
| Leakage risk | None | High if done naively |
| Requires target variable | No | Yes |
| Signal type | Popularity / rarity | Outcome rate per category |
| Unseen category default | 0 or min frequency | Global target mean |
| Best for | Rarity as a signal, unsupervised pipelines | Supervised models where category predicts outcome |
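To make the pattern from this lesson reusable across pipelines, the fit/transform steps can be packaged in a small class. This is a sketch of one possible design, not part of any library; the fallback strategy defaults to the minimum training frequency, as used above:

```python
import pandas as pd

class FrequencyEncoder:
    """Minimal fit/transform wrapper for proportion frequency encoding.

    Sketch only: learns category proportions on training data and maps
    unseen categories to the minimum observed training frequency.
    """
    def fit(self, series: pd.Series) -> 'FrequencyEncoder':
        # Learn proportions on training data only — no target involved
        self.freq_ = series.value_counts(normalize=True)
        self.fallback_ = self.freq_.min()
        return self

    def transform(self, series: pd.Series) -> pd.Series:
        # Unseen categories map to NaN, then receive the fallback
        return series.map(self.freq_).fillna(self.fallback_)

# Usage with a small subset of the lesson's zones
train = pd.Series(['City', 'City', 'Airport', 'Suburbs'])
test = pd.Series(['City', 'Heathrow'])
enc = FrequencyEncoder().fit(train)
print(enc.transform(test).tolist())  # [0.5, 0.25]
```

Keeping the learned map and the fallback as fitted attributes mirrors the scikit-learn convention, so the same object encodes train, validation, and test consistently.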
Teacher's Note
Frequency encoding is underrated. It seems almost too simple — just count how often each value appears — but in fraud detection, anomaly detection, and recommendation systems, category rarity is genuinely one of the strongest signals you can give a model. A transaction in a zip code that represents 0.001% of all training transactions is inherently suspicious, regardless of what the target column says. The model learns that small frequency values warrant scrutiny. Whenever you reach for target encoding, ask yourself first whether frequency alone might already be enough — it's leakage-free, cheaper to compute, and requires no careful cross-fitting.
Practice Questions
1. Which pandas method computes the frequency (count or proportion) of each unique value in a Series?
2. What argument do you pass to value_counts() to get proportions instead of raw counts?
3. How much leakage risk does frequency encoding introduce compared to target encoding?
Quiz
1. What value does frequency encoding assign to each category?
2. What is the main limitation of frequency encoding compared to target encoding?
3. An unseen category appears in the test set during frequency encoding. What is the recommended approach?
Up Next · Lesson 20
Weight of Evidence
A classical encoding from credit scoring that measures how strongly each category separates the two classes — and why it's still widely used in regulated industries today.