Feature Engineering Course
Categorical Features
Machine learning models are mathematical. They multiply, add, and compute distances. The moment you hand them a string like "north" or "self-employed", they have nothing to work with. Categorical encoding is the bridge between human-readable labels and the numbers models actually need — and choosing the wrong bridge is one of the most common and most quietly damaging mistakes in applied ML.
There are four encoding strategies you'll use constantly: one-hot for unordered categories with low cardinality, ordinal mapping for categories with a natural ranking, frequency encoding for high-cardinality columns where one-hot would explode the feature space, and binary encoding, which packs many categories into a handful of base-2 bit columns. Matching the right strategy to the right column type is the whole game.
One-Hot Encoding — The Standard for Nominal Categories
When categories have no natural order — neighbourhood, property_type, colour — one-hot encoding is the default. It creates one binary column per category. Each row gets a 1 in the column that matches it and 0 everywhere else. No false ranking is implied. The model learns from patterns in the binary flags directly.
The scenario: The housing dataset has a property_type column with three values: flat, house, and bungalow. You need to encode it for use in a logistic regression model that will predict whether the property sells above £350,000. Before encoding you check the distribution — a category that appears in only 1 or 2 rows will cause problems during train/test splitting.
import pandas as pd

housing_df = pd.DataFrame({
    'property_type': ['flat', 'house', 'bungalow', 'house', 'flat',
                      'house', 'flat', 'bungalow', 'house', 'flat'],
    'sale_price': [245000, 410000, 310000, 560000, 198000,
                   348000, 230000, 285000, 495000, 215000]
})

# Step 1 — check distribution before encoding
# A category appearing only once will be all-0 or all-1 after a split — no signal
print("Category distribution:\n")
counts = housing_df['property_type'].value_counts()
pcts = housing_df['property_type'].value_counts(normalize=True).mul(100).round(1)
for val in counts.index:
    print(f"  {val:<12} {counts[val]} rows ({pcts[val]}%)")

# Step 2 — pd.get_dummies() creates one binary column per category
# prefix='type' keeps column names readable; dtype=int gives 0/1 rather than True/False
# drop_first=True removes the 'bungalow' column — it's implied when flat=0 and house=0
encoded = pd.get_dummies(housing_df['property_type'], prefix='type',
                         drop_first=True, dtype=int)
result = pd.concat([encoded, housing_df['sale_price']], axis=1)
print("\nAfter one-hot encoding:\n")
print(result.to_string(index=False))

# Step 3 — validate each binary column against the target
print("\nCorrelation with sale_price:\n")
for col in ['type_flat', 'type_house']:
    corr = result[col].corr(result['sale_price'])
    print(f"  {col:<15} {corr:+.4f}")
Category distribution:
flat 4 rows (40.0%)
house 4 rows (40.0%)
bungalow 2 rows (20.0%)
After one-hot encoding:
type_flat type_house sale_price
1 0 245000
0 1 410000
0 0 310000
0 1 560000
1 0 198000
0 1 348000
1 0 230000
0 0 285000
0 1 495000
1 0 215000
Correlation with sale_price:
type_flat -0.7524
type_house +0.6732
What just happened?
.value_counts(normalize=True) checks distribution before encoding — a category appearing fewer than twice causes problems in train/test splits. pd.get_dummies() created one binary column per category with bungalow as the implicit baseline (all zeros). The correlations confirm clear signal — flats at −0.75 and houses at +0.67. Both are worth keeping. The bungalow signal is captured through its absence in the other columns.
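One practical wrinkle: pd.get_dummies() builds columns only for the labels it actually sees, so encoding a new batch of data can produce a different column set than the training run. A minimal pandas-only guard is to fix the category list up front. This sketch is not part of the lesson's dataset, and the unseen label 'maisonette' is hypothetical:

```python
import pandas as pd

# Category set learned at training time — fixed before encoding any new data
train_categories = ['bungalow', 'flat', 'house']
new_col = pd.Series(['house', 'maisonette'], name='property_type')

# Under a fixed Categorical, unseen labels become NaN and encode as all zeros,
# so the column set stays identical between training and serving
as_cat = pd.Series(pd.Categorical(new_col, categories=train_categories))
dummies = pd.get_dummies(as_cat, prefix='type', dtype=int)
print(dummies.to_string(index=False))
```

Note that the unseen row comes out all-zero, which is indistinguishable from the baseline category; whether that is acceptable depends on how you want unknowns treated.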
Ordinal Encoding — When Order Actually Exists
Ordinal encoding assigns integers that preserve a meaningful ranking. Poor → 0, fair → 1, good → 2, excellent → 3. The numbers aren't arbitrary — they reflect real-world ordering, and the model can use the mathematical distance between them. Using one-hot on an ordinal column throws away that ordering information; using integer encoding on a nominal column invents a false one.
The scenario: The property dataset includes a condition rating (poor/fair/good/excellent) assigned by an inspector. The analytics lead says: "Don't one-hot this — the ordering is real and matters. Encode it as integers, then verify that the mean sale price actually increases monotonically with condition rank. If it doesn't, we need to understand why."
import pandas as pd

housing_df = pd.DataFrame({
    'condition': ['good', 'fair', 'excellent', 'poor', 'good',
                  'excellent', 'fair', 'poor', 'excellent', 'good'],
    'sale_price': [295000, 220000, 480000, 165000, 310000,
                   510000, 240000, 180000, 495000, 330000]
})

# .map() applies a dictionary to each value in the column
# The integer values must reflect the true ranking — you set this deliberately
ordinal_map = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}
housing_df['condition_enc'] = housing_df['condition'].map(ordinal_map)

# Validate monotonicity — mean price should increase with condition rank
# If it doesn't, the ordering may not hold in this specific dataset
print("Mean sale price by condition rank:\n")
summary = (housing_df.groupby(['condition_enc', 'condition'])['sale_price']
           .agg(['mean', 'count']).round(0).reset_index())
summary.columns = ['rank', 'condition', 'mean_price', 'count']
print(summary.to_string(index=False))

# Correlation — ordinal encoding should show strong positive relationship
corr = housing_df['condition_enc'].corr(housing_df['sale_price'])
print(f"\nCorrelation of condition_enc with sale_price: {corr:+.4f}")
Mean sale price by condition rank:
rank condition mean_price count
0 poor 172500.0 2
1 fair 230000.0 2
2 good 311667.0 3
3 excellent 495000.0 3
Correlation of condition_enc with sale_price: +0.9610
What just happened?
.map(ordinal_map) applies the ranking dictionary to every row in one pass. The mean price table is the critical validation step — price should rise monotonically with condition rank if the encoding is correctly ordered. Here it does: £172k → £230k → £312k → £495k. A correlation of +0.961 makes condition one of the strongest predictors in the dataset. This is what happens when you match the right encoding to the right column type.
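The same ranking can also be declared through pandas' ordered categorical dtype, which records the order in the data itself rather than in a separate dictionary. A brief sketch, not part of the lesson's code:

```python
import pandas as pd

conditions = pd.Series(['good', 'fair', 'excellent', 'poor'])

# Declare the ranking once; .codes then yields integers in that order
order = ['poor', 'fair', 'good', 'excellent']
ranked = pd.Categorical(conditions, categories=order, ordered=True)
print(ranked.codes.tolist())  # [2, 1, 3, 0]
```

A side benefit is that comparisons like `ranked >= 'good'` work directly on the ordered dtype, which keeps later filtering code readable.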
Frequency Encoding — One Number Per Category
One-hot encoding works beautifully when a column has 3–15 unique values. Apply it to a column with 500 unique postcodes and you have just added 499 new columns — almost all of them near-zero. The model has to navigate a massively expanded feature space where most features carry almost no signal. This is the high-cardinality trap, and it is a common cause of overfitting and slow training.
Frequency encoding solves this by replacing each category label with a single number: how often that category appears in the dataset, expressed as a proportion between 0 and 1. A postcode appearing in 30% of rows gets the value 0.30. One appearing in 5% gets 0.05. The entire column — no matter how many unique values it has — is reduced to one numeric column the model can use directly.
The underlying assumption is that how common a category is correlates with the target. In many real-world datasets this holds — popular postcodes cluster around typical price ranges, frequently chosen payment methods belong to higher-value customers, commonly selected product categories attract a certain buyer profile. When that relationship exists, frequency encoding captures it in one column at zero dimensionality cost. When it does not, target encoding (Lesson 18) is the stronger alternative.
The scenario: The property dataset has a postcode column. In production this would contain thousands of unique values. The modelling lead says: "Don’t one-hot this. Use frequency encoding — replace each postcode with how often it appears in the dataset. Show me the frequency map, apply it, and validate whether the encoded column actually carries signal before we include it."
import pandas as pd

housing_df = pd.DataFrame({
    'postcode': ['SW1', 'E1', 'SW1', 'N1', 'E1', 'SW1', 'N1', 'W1', 'E1', 'W1'],
    'sale_price': [480000, 245000, 510000, 310000, 260000,
                   495000, 295000, 410000, 255000, 420000]
})

# Show how many unique values this column has — the first cardinality check
print(f"Unique postcodes: {housing_df['postcode'].nunique()}")
print(f"Rows: {len(housing_df)}")
print(f"\nOne-hot encoding would create {housing_df['postcode'].nunique() - 1} new columns\n")

# FREQUENCY ENCODING — replace each category with its proportion in the dataset
# .value_counts(normalize=True) gives proportions (0 to 1) instead of raw counts
freq_map = housing_df['postcode'].value_counts(normalize=True)
housing_df['postcode_freq'] = housing_df['postcode'].map(freq_map)

print("Frequency map:\n")
for postcode, freq in freq_map.items():
    print(f"  {postcode:<6} {freq:.2f} ({freq*100:.0f}% of rows)")

print("\nFrequency encoded dataset:\n")
print(housing_df[['postcode', 'postcode_freq', 'sale_price']].to_string(index=False))

corr = housing_df['postcode_freq'].corr(housing_df['sale_price'])
print(f"\nCorrelation of postcode_freq with sale_price: {corr:+.4f}")
Unique postcodes: 4
Rows: 10
One-hot encoding would create 3 new columns
Frequency map:
SW1 0.30 (30% of rows)
E1 0.30 (30% of rows)
N1 0.20 (20% of rows)
W1 0.20 (20% of rows)
Frequency encoded dataset:
postcode postcode_freq sale_price
SW1 0.3 480000
E1 0.3 245000
SW1 0.3 510000
N1 0.2 310000
E1 0.3 260000
SW1 0.3 495000
N1 0.2 295000
W1 0.2 410000
E1 0.3 255000
W1 0.2 420000
Correlation of postcode_freq with sale_price: +0.1736
What just happened?
.value_counts(normalize=True) gives proportions summing to 1.0 across all categories. .map(freq_map) replaces each category label with its proportion. The correlation of +0.17 is weak in this small sample — how common a postcode is does not strongly predict price here. In real datasets with thousands of rows, frequency encoding often works better because popular postcodes tend to cluster around consistent price ranges. When frequency encoding is weak, target encoding (Lesson 18) is the next tool to reach for.
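One discipline matters in production: the frequency map must be fitted on the training split only, because computing proportions over the full dataset leaks test information into the features. A minimal sketch of that pattern, with illustrative split contents and a 0.0 fallback for unseen categories:

```python
import pandas as pd

train = pd.DataFrame({'postcode': ['SW1', 'E1', 'SW1', 'N1', 'E1', 'SW1', 'N1', 'W1']})
test = pd.DataFrame({'postcode': ['SW1', 'E1', 'EC2']})  # 'EC2' unseen at train time

# Fit the frequency map on the training split only, never on the full dataset
freq_map = train['postcode'].value_counts(normalize=True)

# Apply the same map to both splits; unseen test categories fall back to 0.0
train['postcode_freq'] = train['postcode'].map(freq_map)
test['postcode_freq'] = test['postcode'].map(freq_map).fillna(0.0)
print(test.to_string(index=False))
```

Treating an unseen category as frequency 0.0 is one reasonable convention; mapping it to the minimum observed frequency is another, depending on how you want rare-but-real categories handled.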
Binary Encoding — Compact Representation for High-Cardinality
Binary encoding is the middle ground between one-hot and frequency encoding. It converts each category to an integer first, then represents that integer in binary (base-2) digits — one column per bit. A column with 8 unique categories needs only 3 binary columns (2³ = 8) instead of 7 one-hot columns. With 100 unique categories it needs just 7 columns (2⁷ = 128) instead of 99. The more categories you have, the bigger the saving.
The scenario: The property dataset now includes an agent_code column — a unique code for each of 8 estate agents who handled the sales. One-hot encoding would create 7 new binary columns. Your lead says: "That's too many for 8 categories — use binary encoding. Show me how the integer representations map to binary columns, and check whether any of those columns carry correlation with sale price."
import pandas as pd
import numpy as np

housing_df = pd.DataFrame({
    'agent_code': ['AG1', 'AG2', 'AG3', 'AG4', 'AG5', 'AG6', 'AG7', 'AG8',
                   'AG1', 'AG3'],
    'sale_price': [480000, 245000, 510000, 310000, 260000, 495000, 295000, 410000,
                   520000, 490000]
})

# Step 1: Label encode — assign an integer to each unique category
# pd.factorize() returns (integer_codes, unique_labels)
# We use it here just to get a consistent integer per category
codes, uniques = pd.factorize(housing_df['agent_code'])
housing_df['agent_int'] = codes

print("Integer mapping:")
for i, label in enumerate(uniques):
    # format(i, '03b') converts integer i to a 3-digit binary string
    print(f"  {label} → int={i} → binary={format(i, '03b')}")

# Step 2: convert each integer to binary columns
# n_bits = number of bits needed to represent all categories
n_bits = int(np.ceil(np.log2(housing_df['agent_code'].nunique())))
for bit in range(n_bits):
    # Right-shift by bit position and AND with 1 extracts each individual bit
    housing_df[f'agent_b{bit}'] = (housing_df['agent_int'] >> bit) & 1

binary_cols = [f'agent_b{i}' for i in range(n_bits)]
print(f"\nOne-hot would need {housing_df['agent_code'].nunique() - 1} columns")
print(f"Binary encoding needs {n_bits} columns\n")

# Validate — do any binary columns correlate with sale price?
print("Correlation of binary columns with sale_price:\n")
for col in binary_cols:
    corr = housing_df[col].corr(housing_df['sale_price'])
    print(f"  {col}  {corr:+.4f}")

print("\nSample output:\n")
print(housing_df[['agent_code', 'agent_int'] + binary_cols + ['sale_price']].to_string(index=False))
Integer mapping:
AG1 → int=0 → binary=000
AG2 → int=1 → binary=001
AG3 → int=2 → binary=010
AG4 → int=3 → binary=011
AG5 → int=4 → binary=100
AG6 → int=5 → binary=101
AG7 → int=6 → binary=110
AG8 → int=7 → binary=111
One-hot would need 7 columns
Binary encoding needs 3 columns
Correlation of binary columns with sale_price:
agent_b0 -0.1547
agent_b1 +0.0388
agent_b2 -0.3714
Sample output:
agent_code agent_int agent_b0 agent_b1 agent_b2 sale_price
AG1 0 0 0 0 480000
AG2 1 1 0 0 245000
AG3 2 0 1 0 510000
AG4 3 1 1 0 310000
AG5 4 0 0 1 260000
AG6 5 1 0 1 495000
AG7 6 0 1 1 295000
AG8 7 1 1 1 410000
AG1 0 0 0 0 520000
AG3 2 0 1 0 490000
What just happened?
pd.factorize() assigns a unique integer to each category label. np.ceil(np.log2(n)) calculates the minimum number of bits needed to represent n categories — 8 categories need exactly 3 bits (2³ = 8). The bitwise right-shift and AND operation (agent_int >> bit) & 1 extracts each individual bit position. The result: 8 unique agent codes represented in just 3 columns instead of 7. In this small sample the binary columns show weak correlations, but in a real dataset where certain agents consistently handle higher-value properties, those bits would carry genuine signal.
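The factorize-and-shift steps generalise into a small helper you can reuse on any column. This is a sketch under the same approach as the lesson, not library code; if a dependency is acceptable, the third-party category_encoders package provides a BinaryEncoder built on the same idea:

```python
import numpy as np
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Encode a categorical series into ceil(log2(n)) bit columns."""
    codes, _ = pd.factorize(series)
    # max(1, ...) guards the degenerate single-category case (log2(1) == 0)
    n_bits = max(1, int(np.ceil(np.log2(series.nunique()))))
    return pd.DataFrame(
        {f'{prefix}_b{bit}': (codes >> bit) & 1 for bit in range(n_bits)},
        index=series.index,
    )

# 4 unique categories need only 2 bit columns
bits = binary_encode(pd.Series(['AG1', 'AG2', 'AG3', 'AG4']), 'agent')
print(bits.to_string(index=False))
```

As with factorize in the lesson, the integer assignment follows order of first appearance, so the helper must be fitted once and reused if encodings need to stay stable across batches.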
Encoding Strategy Decision Guide
| Situation | Strategy | pandas method | Watch out for |
|---|---|---|---|
| Nominal, <20 unique values | One-hot | pd.get_dummies(drop_first=True) | Categories with <2 rows per split |
| Ordinal — natural order exists | Integer map | series.map(dict) | Validate monotonicity with target |
| Yes/No binary column | Binary (0/1) | (col == 'yes').astype(int) | 90%+ one value = near-zero signal |
| Nominal, 20+ unique values | Frequency | value_counts(normalize=True) | May need target encoding if weak |
| Nominal, many categories, compact needed | Binary | factorize + bitwise shift | log₂(n) columns instead of n−1 |
| Nominal, high-card, supervised | Target encoding | groupby mean — see Lesson 18 | Leakage risk — must use CV folds |
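The yes/no row in the table is the one strategy the lesson doesn't walk through, so here is the whole recipe in a few lines. The column name and values are illustrative, not from the lesson's dataset:

```python
import pandas as pd

has_garden = pd.Series(['yes', 'no', 'yes', 'yes', 'no',
                        'yes', 'yes', 'yes', 'yes', 'yes'])

# Dominance check first — a column that is ~90%+ one value carries near-zero signal
dominance = has_garden.value_counts(normalize=True).max()
print(f"Dominant value share: {dominance:.0%}")

# Straight 0/1 encoding for a genuine yes/no flag
encoded = (has_garden == 'yes').astype(int)
print(encoded.tolist())
```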
Teacher's Note
The dummy variable trap: keeping all 3 binary columns from a 3-category one-hot creates perfect multicollinearity that breaks linear models — always use drop_first=True. Also, always handle unseen categories before deployment: if a new property arrives with a neighbourhood your encoder never saw, it will produce NaN and silently break predictions. Map unknowns to the most frequent category as a safe fallback.
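The fallback rule in the note takes only a few lines of pandas. A sketch with illustrative values (not the lesson's dataset), mapping any unseen label to the most frequent training category:

```python
import pandas as pd

train = pd.Series(['SW1', 'E1', 'SW1', 'N1', 'E1', 'SW1'], name='neighbourhood')
new_data = pd.Series(['SW1', 'W9', 'E1'], name='neighbourhood')  # 'W9' never seen

known = set(train.unique())
most_frequent = train.mode()[0]  # safe fallback category

# Replace any unseen label with the most frequent training category before encoding
cleaned = new_data.where(new_data.isin(known), most_frequent)
print(cleaned.tolist())
```

Mapping unknowns to the mode is a blunt but safe default; logging how often the fallback fires is worth adding in production, since a rising rate signals drift in the incoming data.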
Practice Questions
1. Which argument in pd.get_dummies() removes one redundant binary column to prevent the dummy variable trap in linear models?
2. The encoding technique that replaces each category with how often it appears in the dataset — expressed as a proportion — is called ___.
3. After ordinal encoding a column like condition, you validate it by checking that the mean sale price increases with the rank — a property called ___.
Quiz
1. Why is integer encoding (north=0, east=1, south=2) wrong for a nominal feature like neighbourhood?
2. A postcode column has 500 unique values. Why is one-hot encoding the wrong choice?
3. What is the main risk of not handling unknown categories before deploying a one-hot or frequency encoder to production?
Up Next · Lesson 6
Date & Time Features
A single datetime column contains month, year, day-of-week, season, time since an event, and more. Learn to extract every usable signal from a timestamp and know which ones are actually worth keeping.