Feature Engineering Lesson 5 – Categorical Features | Dataplexa
Beginner Level · Lesson 5

Categorical Features

Machine learning models are mathematical. They multiply, add, and compute distances. The moment you hand them a string like "north" or "self-employed", they have nothing to work with. Categorical encoding is the bridge between human-readable labels and the numbers models actually need — and choosing the wrong bridge is one of the most common and most quietly damaging mistakes in applied ML.

There are four encoding strategies you'll use constantly: one-hot for unordered categories with low cardinality, ordinal mapping for categories with a natural ranking, frequency encoding for high-cardinality columns where one-hot would explode the feature space, and binary encoding, which packs each category's integer code into base-2 bit columns as a compact middle ground. Matching the right strategy to the right column type is the whole game.

One-Hot Encoding — The Standard for Nominal Categories

When categories have no natural order — neighbourhood, property_type, colour — one-hot encoding is the default. It creates one binary column per category. Each row gets a 1 in the column that matches it and 0 everywhere else. No false ranking is implied. The model learns from patterns in the binary flags directly.

The scenario: The housing dataset has a property_type column with three values: flat, house, and bungalow. You need to encode it for use in a logistic regression model that will predict whether the property sells above £350,000. Before encoding you check the distribution — a category that appears in only 1 or 2 rows will cause problems during train/test splitting.

import pandas as pd

housing_df = pd.DataFrame({
    'property_type': ['flat','house','bungalow','house','flat',
                      'house','flat','bungalow','house','flat'],
    'sale_price':    [245000,410000,310000,560000,198000,
                      348000,230000,285000,495000,215000]
})

# Step 1 — check distribution before encoding
# A category appearing only once lands entirely in train or test after a split —
# its dummy column is all zeros in the other split, so it carries no signal
print("Category distribution:\n")
counts = housing_df['property_type'].value_counts()
pcts   = housing_df['property_type'].value_counts(normalize=True).mul(100).round(1)
for val in counts.index:
    print(f"  {val:<12}  {counts[val]} rows  ({pcts[val]}%)")

# Step 2 — pd.get_dummies() creates one binary column per category
# prefix='type' keeps column names readable
# drop_first=True removes the 'bungalow' column — it's implied when flat=0 and house=0
# dtype=int keeps the output as 0/1 rather than the True/False default of newer pandas
encoded = pd.get_dummies(housing_df['property_type'], prefix='type', drop_first=True, dtype=int)
result  = pd.concat([encoded, housing_df['sale_price']], axis=1)

print("\nAfter one-hot encoding:\n")
print(result.to_string(index=False))

# Step 3 — validate each binary column against the target
print("\nCorrelation with sale_price:\n")
for col in ['type_flat', 'type_house']:
    corr = result[col].corr(result['sale_price'])
    print(f"  {col:<15}  {corr:+.4f}")
Category distribution:

  flat          4 rows  (40.0%)
  house         4 rows  (40.0%)
  bungalow      2 rows  (20.0%)

After one-hot encoding:

  type_flat  type_house  sale_price
          1           0      245000
          0           1      410000
          0           0      310000
          0           1      560000
          1           0      198000
          0           1      348000
          1           0      230000
          0           0      285000
          0           1      495000
          1           0      215000

Correlation with sale_price:

  type_flat       -0.7524
  type_house      +0.6732

What just happened?

.value_counts(normalize=True) checks the distribution before encoding — a category appearing in only one or two rows causes problems in train/test splits. pd.get_dummies() created one binary column per category, with bungalow as the implicit baseline (all zeros). The correlations confirm clear signal — flats at −0.75 and houses at +0.67 — so both columns are worth keeping. The bungalow signal is captured through its absence in the other columns.
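One practical wrinkle worth knowing: pd.get_dummies() only creates columns for categories present in the data it is given, so new data with a different mix of categories produces mismatched columns. A minimal sketch of keeping new data aligned with the training columns via .reindex() (the tiny frames here are illustrative, not from the lesson's dataset):

```python
import pandas as pd

train = pd.DataFrame({'property_type': ['flat', 'house', 'bungalow', 'house']})
train_enc = pd.get_dummies(train['property_type'], prefix='type',
                           drop_first=True, dtype=int)

# New data that happens to contain only flats — get_dummies alone would
# produce a single type_flat column, breaking the model's expected input shape
new = pd.DataFrame({'property_type': ['flat', 'flat']})
new_enc = pd.get_dummies(new['property_type'], prefix='type',
                         drop_first=False, dtype=int)

# reindex() forces the exact training columns; missing ones are filled with 0
new_enc = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(new_enc.to_string(index=False))
```

Note that drop_first=False is used on the new data: the baseline column was already dropped when the training columns were fixed, and reindexing to those columns applies the same choice consistently.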

Ordinal Encoding — When Order Actually Exists

Ordinal encoding assigns integers that preserve a meaningful ranking. Poor → 0, fair → 1, good → 2, excellent → 3. The numbers aren't arbitrary — they reflect real-world ordering, and the model can use the mathematical distance between them. Using one-hot on an ordinal column throws away that ordering information; using integer encoding on a nominal column invents a false one.

The scenario: The property dataset includes a condition rating (poor/fair/good/excellent) assigned by an inspector. The analytics lead says: "Don't one-hot this — the ordering is real and matters. Encode it as integers, then verify that the mean sale price actually increases monotonically with condition rank. If it doesn't, we need to understand why."

import pandas as pd

housing_df = pd.DataFrame({
    'condition': ['good','fair','excellent','poor','good',
                  'excellent','fair','poor','excellent','good'],
    'sale_price':[295000,220000,480000,165000,310000,
                  510000,240000,180000,495000,330000]
})

# .map() applies a dictionary to each value in the column
# The integer values must reflect the true ranking — you set this deliberately
ordinal_map = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}
housing_df['condition_enc'] = housing_df['condition'].map(ordinal_map)

# Validate monotonicity — mean price should increase with condition rank
# If it doesn't, the ordering may not hold in this specific dataset
print("Mean sale price by condition rank:\n")
summary = (housing_df.groupby(['condition_enc','condition'])['sale_price']
           .agg(['mean','count']).round(0).reset_index())
summary.columns = ['rank','condition','mean_price','count']
print(summary.to_string(index=False))

# Correlation — ordinal encoding should show strong positive relationship
corr = housing_df['condition_enc'].corr(housing_df['sale_price'])
print(f"\nCorrelation of condition_enc with sale_price: {corr:+.4f}")
Mean sale price by condition rank:

  rank  condition  mean_price  count
     0       poor    172500.0      2
     1       fair    230000.0      2
     2       good    311667.0      3
     3  excellent    495000.0      3

Correlation of condition_enc with sale_price: +0.9610

What just happened?

.map(ordinal_map) applies the ranking dictionary to every row in one pass. The mean price table is the critical validation step — price should rise monotonically with condition rank if the encoding is correctly ordered. Here it does: £172k → £230k → £312k → £495k. A correlation of +0.961 makes condition one of the strongest predictors in the dataset. This is what happens when you match the right encoding to the right column type.
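One caveat with .map(): any value missing from the dictionary silently becomes NaN. A hedged alternative sketch using pd.Categorical with ordered=True, where .codes encodes unknown values as −1 so they are easy to spot (the 'mint' rating is a made-up unseen value for illustration):

```python
import pandas as pd

order = ['poor', 'fair', 'good', 'excellent']
ratings = pd.Series(['good', 'fair', 'excellent', 'poor', 'mint'])  # 'mint' is unseen

# ordered=True declares the ranking; .codes gives the integer rank per row
cat = pd.Categorical(ratings, categories=order, ordered=True)
codes = pd.Series(cat.codes, index=ratings.index)

print(codes.tolist())  # unseen 'mint' becomes -1 instead of NaN
```

The explicit −1 makes unseen labels easy to filter or remap before modelling, rather than letting NaN propagate.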

Frequency Encoding — One Number Per Category

One-hot encoding works beautifully when a column has 3–15 unique values. Apply it to a column with 500 unique postcodes and you have just added 499 new columns — almost all of them near-zero. The model has to navigate a massively expanded feature space where most features carry almost no signal. This is the high-cardinality trap, and it is a common cause of overfitting and slow training.

Frequency encoding solves this by replacing each category label with a single number: how often that category appears in the dataset, expressed as a proportion between 0 and 1. A postcode appearing in 30% of rows gets the value 0.30. One appearing in 5% gets 0.05. The entire column — no matter how many unique values it has — is reduced to one numeric column the model can use directly.

The underlying assumption is that how common a category is correlates with the target. In many real-world datasets this holds — popular postcodes cluster around typical price ranges, frequently chosen payment methods belong to higher-value customers, commonly selected product categories attract a certain buyer profile. When that relationship exists, frequency encoding captures it in one column at zero dimensionality cost. When it does not, target encoding (Lesson 18) is the stronger alternative.

The scenario: The property dataset has a postcode column. In production this would contain thousands of unique values. The modelling lead says: "Don’t one-hot this. Use frequency encoding — replace each postcode with how often it appears in the dataset. Show me the frequency map, apply it, and validate whether the encoded column actually carries signal before we include it."

import pandas as pd

housing_df = pd.DataFrame({
    'postcode':   ['SW1','E1','SW1','N1','E1','SW1','N1','W1','E1','W1'],
    'sale_price': [480000,245000,510000,310000,260000,495000,295000,410000,255000,420000]
})

# Show how many unique values this column has — the first cardinality check
print(f"Unique postcodes: {housing_df['postcode'].nunique()}")
print(f"Rows: {len(housing_df)}")
print(f"\nOne-hot encoding would create {housing_df['postcode'].nunique() - 1} new columns\n")

# FREQUENCY ENCODING — replace each category with its proportion in the dataset
# .value_counts(normalize=True) gives proportions (0 to 1) instead of raw counts
freq_map = housing_df['postcode'].value_counts(normalize=True)
housing_df['postcode_freq'] = housing_df['postcode'].map(freq_map)

print("Frequency map:\n")
for postcode, freq in freq_map.items():
    print(f"  {postcode:<6}  {freq:.2f}  ({freq*100:.0f}% of rows)")

print("\nFrequency encoded dataset:\n")
print(housing_df[['postcode','postcode_freq','sale_price']].to_string(index=False))

corr = housing_df['postcode_freq'].corr(housing_df['sale_price'])
print(f"\nCorrelation of postcode_freq with sale_price: {corr:+.4f}")
Unique postcodes: 4
Rows: 10

One-hot encoding would create 3 new columns

Frequency map:

  SW1     0.30  (30% of rows)
  E1      0.30  (30% of rows)
  N1      0.20  (20% of rows)
  W1      0.20  (20% of rows)

Frequency encoded dataset:

  postcode  postcode_freq  sale_price
       SW1            0.3      480000
        E1            0.3      245000
       SW1            0.3      510000
        N1            0.2      310000
        E1            0.3      260000
       SW1            0.3      495000
        N1            0.2      295000
        W1            0.2      410000
        E1            0.3      255000
        W1            0.2      420000

Correlation of postcode_freq with sale_price: +0.1736

What just happened?

.value_counts(normalize=True) gives proportions summing to 1.0 across all categories. .map(freq_map) replaces each category label with its proportion. The correlation of +0.17 is weak in this small sample — how common a postcode is does not strongly predict price here. In real datasets with thousands of rows, frequency encoding often works better because popular postcodes tend to cluster around consistent price ranges. When frequency encoding is weak, target encoding (Lesson 18) is the next tool to reach for.
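In a real pipeline the frequency map must be computed on the training split only and then applied to the test split — computing it on the full dataset leaks test information into the feature. A minimal sketch (the tiny train/test frames here are illustrative):

```python
import pandas as pd

train = pd.DataFrame({'postcode': ['SW1', 'E1', 'SW1', 'N1', 'E1', 'SW1']})
test  = pd.DataFrame({'postcode': ['SW1', 'W1']})  # 'W1' never appears in training

# Fit on train only; unseen test categories map to NaN, so fill them with 0
freq_map = train['postcode'].value_counts(normalize=True)
train['postcode_freq'] = train['postcode'].map(freq_map)
test['postcode_freq']  = test['postcode'].map(freq_map).fillna(0.0)

print(test.to_string(index=False))
```

Filling unseen categories with 0 treats them as "never observed", which is consistent with what a frequency of zero means; another defensible choice is the minimum training frequency.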

Binary Encoding — Compact Representation for High-Cardinality

Binary encoding is the middle ground between one-hot and frequency encoding. It converts each category to an integer first, then represents that integer in binary (base-2) digits — one column per bit. A column with 8 unique categories needs only 3 binary columns (2³ = 8) instead of 7 one-hot columns. With 100 unique categories it needs just 7 columns (2⁷ = 128) instead of 99. The more categories you have, the bigger the saving.

The scenario: The property dataset now includes an agent_code column — a unique code for each of 8 estate agents who handled the sales. One-hot encoding would create 7 new binary columns. Your lead says: "That's too many for 8 categories — use binary encoding. Show me how the integer representations map to binary columns, and check whether any of those columns carry correlation with sale price."

import pandas as pd
import numpy as np

housing_df = pd.DataFrame({
    'agent_code': ['AG1','AG2','AG3','AG4','AG5','AG6','AG7','AG8',
                   'AG1','AG3'],
    'sale_price': [480000,245000,510000,310000,260000,495000,295000,410000,
                   520000,490000]
})

# Step 1: Label encode — assign an integer to each unique category
# pd.factorize() returns (integer_codes, unique_labels)
# We use it here just to get a consistent integer per category
codes, uniques = pd.factorize(housing_df['agent_code'])
housing_df['agent_int'] = codes

print("Integer mapping:")
for i, label in enumerate(uniques):
    # format(i, '03b') converts integer i to a 3-digit binary string
    print(f"  {label}  →  int={i}  →  binary={format(i, '03b')}")

# Step 2: convert each integer to binary columns
# n_bits = number of bits needed to represent all categories
n_bits = int(np.ceil(np.log2(housing_df['agent_code'].nunique())))

for bit in range(n_bits):
    # Right-shift by bit position and AND with 1 extracts each individual bit
    housing_df[f'agent_b{bit}'] = (housing_df['agent_int'] >> bit) & 1

binary_cols = [f'agent_b{i}' for i in range(n_bits)]

print(f"\nOne-hot would need {housing_df['agent_code'].nunique() - 1} columns")
print(f"Binary encoding needs {n_bits} columns\n")

# Validate — do any binary columns correlate with sale price?
print("Correlation of binary columns with sale_price:\n")
for col in binary_cols:
    corr = housing_df[col].corr(housing_df['sale_price'])
    print(f"  {col}  {corr:+.4f}")

print("\nSample output:\n")
print(housing_df[['agent_code','agent_int'] + binary_cols + ['sale_price']].to_string(index=False))
Integer mapping:
  AG1  →  int=0  →  binary=000
  AG2  →  int=1  →  binary=001
  AG3  →  int=2  →  binary=010
  AG4  →  int=3  →  binary=011
  AG5  →  int=4  →  binary=100
  AG6  →  int=5  →  binary=101
  AG7  →  int=6  →  binary=110
  AG8  →  int=7  →  binary=111

One-hot would need 7 columns
Binary encoding needs 3 columns

Correlation of binary columns with sale_price:

  agent_b0  -0.1547
  agent_b1  +0.0388
  agent_b2  -0.3714

Sample output:

 agent_code  agent_int  agent_b0  agent_b1  agent_b2  sale_price
        AG1          0         0         0         0      480000
        AG2          1         1         0         0      245000
        AG3          2         0         1         0      510000
        AG4          3         1         1         0      310000
        AG5          4         0         0         1      260000
        AG6          5         1         0         1      495000
        AG7          6         0         1         1      295000
        AG8          7         1         1         1      410000
        AG1          0         0         0         0      520000
        AG3          2         0         1         0      490000

What just happened?

pd.factorize() assigns a unique integer to each category label. np.ceil(np.log2(n)) calculates the minimum number of bits needed to represent n categories — 8 categories need exactly 3 bits (2³ = 8). The bitwise right-shift and AND operation (agent_int >> bit) & 1 extracts each individual bit position. The result: 8 unique agent codes represented in just 3 columns instead of 7. In this small sample the binary columns show weak correlations, but in a real dataset where certain agents consistently handle higher-value properties, those bits would carry genuine signal.

Encoding Strategy Decision Guide

Situation                                 | Strategy        | pandas method                   | Watch out for
Nominal, <20 unique values                | One-hot         | pd.get_dummies(drop_first=True) | Categories with <2 rows per split
Ordinal — natural order exists            | Integer map     | series.map(dict)                | Validate monotonicity with target
Yes/No binary column                      | Binary (0/1)    | (col == 'yes').astype(int)      | 90%+ one value = near-zero signal
Nominal, 20+ unique values                | Frequency       | value_counts(normalize=True)    | May need target encoding if weak
Nominal, many categories, compact needed  | Binary          | factorize + bitwise shift       | log₂(n) columns instead of n−1
Nominal, high-card, supervised            | Target encoding | groupby mean — see Lesson 18    | Leakage risk — must use CV folds
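The yes/no row in the table is the one strategy the lesson does not demonstrate. A minimal sketch (the has_garden column is hypothetical), including the near-constant-column check the table warns about:

```python
import pandas as pd

df = pd.DataFrame({'has_garden': ['yes', 'no', 'yes', 'yes', 'no']})

# The comparison returns booleans; .astype(int) maps True -> 1, False -> 0
df['has_garden_enc'] = (df['has_garden'] == 'yes').astype(int)

# Watch out: if one value dominates (90%+ of rows), the flag carries almost no signal
dominance = df['has_garden_enc'].value_counts(normalize=True).max()
print(df['has_garden_enc'].tolist())  # [1, 0, 1, 1, 0]
print(f"Dominant value share: {dominance:.0%}")
```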

Teacher's Note

The dummy variable trap: keeping all three dummy columns from a 3-category one-hot creates perfect multicollinearity — each column is exactly determined by the other two — which breaks linear models, so always use drop_first=True. And always handle unseen categories before deployment: if a new property arrives with a neighbourhood your encoder never saw, the encoding will produce NaN and silently break predictions. Mapping unknowns to the most frequent category is a safe fallback.
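The fallback described in the note can be sketched in a few lines — any label the encoder never saw is replaced with the most frequent training category before encoding (the property-type values here are illustrative, and 'maisonette' is a made-up unseen label):

```python
import pandas as pd

train_vals = pd.Series(['flat', 'house', 'flat', 'bungalow', 'flat'])
known = set(train_vals)
most_frequent = train_vals.value_counts().idxmax()   # 'flat' in this sample

incoming = pd.Series(['house', 'maisonette', 'flat'])  # 'maisonette' is unseen

# .where() keeps values where the condition holds and substitutes the fallback elsewhere
safe = incoming.where(incoming.isin(known), most_frequent)
print(safe.tolist())  # ['house', 'flat', 'flat']
```

After this substitution, any downstream encoder (one-hot, frequency, ordinal) only ever sees labels it was fitted on.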

Practice Questions

1. Which argument in pd.get_dummies() removes one redundant binary column to prevent the dummy variable trap in linear models?



2. The encoding technique that replaces each category with how often it appears in the dataset — expressed as a proportion — is called ___.



3. After ordinal encoding a column like condition, you validate it by checking that the mean sale price increases with the rank — a property called ___.



Quiz

1. Why is integer encoding (north=0, east=1, south=2) wrong for a nominal feature like neighbourhood?


2. A postcode column has 500 unique values. Why is one-hot encoding the wrong choice?


3. What is the main risk of not handling unknown categories before deploying a one-hot or frequency encoder to production?


Up Next · Lesson 6

Date & Time Features

A single datetime column contains month, year, day-of-week, season, time since an event, and more. Learn to extract every usable signal from a timestamp and know which ones are actually worth keeping.