Feature Engineering Course
Encoding Basics
Machine learning models speak numbers. Categories like "red", "urban", or "platinum" mean nothing to an algorithm until you translate them. Encoding is that translation — and choosing the wrong method can silently break your model.
Encoding converts categorical features — text labels, nominal groups, ordinal ranks — into numerical representations that a model can learn from. The method you choose depends on whether the categories have a natural order, how many unique values exist, and what kind of model you're feeding the data into.
The Two Fundamental Encoding Methods
Before touching code, you need to understand the conceptual difference. The wrong choice doesn't throw an error — it silently introduces bias or throws away information.
Label Encoding — assign an integer to each category
Maps each unique category to a number: 0, 1, 2, 3... Simple and compact. The danger: it implies an ordering that may not exist. If "red=0, green=1, blue=2", the model thinks blue is greater than red, which is nonsense for colour labels.
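A minimal sketch of the failure mode (toy colour data, assuming scikit-learn is available) — the encoder assigns integers alphabetically, so a linear model would treat "red" as numerically greater than "green":

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['red', 'green', 'blue', 'green'])

# Alphabetical assignment: blue=0, green=1, red=2 — an ordering the colours don't have
print(dict(zip(le.classes_, range(len(le.classes_)))))  # {'blue': 0, 'green': 1, 'red': 2}
print(codes.tolist())  # [2, 1, 0, 1]
```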
One-Hot Encoding — create a binary column per category
Creates one new binary (0/1) column for each unique category value. No false ordering. Ideal for nominal categories with no natural rank. The trade-off: if a column has 500 unique values, you get 500 new columns — the curse of dimensionality.
Ordinal Encoding — encode with meaningful order
Like label encoding, but the integers reflect a real rank: "low=0, medium=1, high=2". Correct for features where the order carries information — satisfaction ratings, education levels, severity scores.
When to Use Each Method
| Category Type | Example | Use |
|---|---|---|
| Nominal, few categories (<10) | city, colour, payment method | One-Hot |
| Nominal, many categories (>15) | zip code, product SKU | Target / Frequency Encoding |
| Ordinal — ranked categories | low/medium/high, star rating | Ordinal Encoding |
| Binary categories | yes/no, male/female, true/false | Label Encoding (safe here) |
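The table recommends target or frequency encoding for high-cardinality columns without showing it. As a minimal frequency-encoding sketch (column names are illustrative), each category is replaced by how often it occurs in the training data:

```python
import pandas as pd

df = pd.DataFrame({'zip_code': ['90210', '10001', '90210', '60601', '90210', '10001']})

# Map each zip code to its relative frequency — one numeric column, no matter
# how many unique zip codes exist
freq = df['zip_code'].value_counts(normalize=True)
df['zip_freq'] = df['zip_code'].map(freq)

print(df)
# 90210 appears 3/6 times → 0.5; 10001 → 0.333…; 60601 → 0.167…
```

Frequency encoding keeps dimensionality flat, at the cost of collapsing any two categories that happen to occur equally often.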
Label Encoding with scikit-learn
The scenario: You're a data analyst at a telecom company building a customer churn model. One feature is contract_type with three values: "month-to-month", "one-year", and "two-year". These have a natural order — longer contracts generally mean more committed customers — and, conveniently, alphabetical order happens to match it, so label encoding works here. You also have a binary paperless_billing column that's a straightforward yes/no.
# Import pandas and LabelEncoder from sklearn
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Telecom customer churn dataset with categorical features
churn_df = pd.DataFrame({
'customer_id': ['T01', 'T02', 'T03', 'T04', 'T05',
'T06', 'T07', 'T08', 'T09', 'T10'],
'contract_type': ['month-to-month', 'two-year', 'one-year',
'month-to-month', 'two-year', 'one-year',
'month-to-month', 'two-year', 'month-to-month', 'one-year'],
'paperless_billing': ['Yes', 'No', 'Yes', 'Yes', 'No',
'No', 'Yes', 'No', 'Yes', 'Yes'],
'churned': [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
})
# Instantiate a LabelEncoder — one per column, they track their own mappings
le_contract = LabelEncoder()
le_billing = LabelEncoder()
# fit_transform learns the unique values and encodes in one step
# Alphabetical order is used by default: month-to-month=0, one-year=1, two-year=2
churn_df['contract_encoded'] = le_contract.fit_transform(churn_df['contract_type'])
# Binary column: No=0, Yes=1 — label encoding is perfectly safe here
churn_df['billing_encoded'] = le_billing.fit_transform(churn_df['paperless_billing'])
# Print the mapping so we know exactly which integer maps to which label
print("contract_type mapping:")
for i, cls in enumerate(le_contract.classes_):
    print(f"  {i} → {cls}")
print("\npaperless_billing mapping:")
for i, cls in enumerate(le_billing.classes_):
    print(f"  {i} → {cls}")
print()
# Show raw labels alongside their encoded integers
print(churn_df[['customer_id', 'contract_type', 'contract_encoded',
'paperless_billing', 'billing_encoded']].to_string(index=False))
contract_type mapping:
0 → month-to-month
1 → one-year
2 → two-year
paperless_billing mapping:
0 → No
1 → Yes
customer_id contract_type contract_encoded paperless_billing billing_encoded
T01 month-to-month 0 Yes 1
T02 two-year 2 No 0
T03 one-year 1 Yes 1
T04 month-to-month 0 Yes 1
T05 two-year 2 No 0
T06 one-year 1 No 0
T07 month-to-month 0 Yes 1
T08 two-year 2 No 0
T09 month-to-month 0 Yes 1
T10 one-year 1 Yes 1
What just happened?
LabelEncoder scanned each column, sorted the unique values alphabetically, and assigned integers starting from 0. The .classes_ attribute stores the mapping so you can always look up which integer corresponds to which label. For paperless_billing the result is a clean 0/1 binary column. (Strictly, scikit-learn documents LabelEncoder for encoding target labels; for input features, OrdinalEncoder — covered below — does the same job across multiple columns at once.)
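A fitted LabelEncoder can also reverse the mapping with inverse_transform — handy when you need readable labels back after prediction. A small standalone sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['month-to-month', 'two-year', 'one-year'])

# inverse_transform maps integers back to the original string labels
labels = le.inverse_transform([2, 0, 1])
print(labels.tolist())  # ['two-year', 'month-to-month', 'one-year']
```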
One-Hot Encoding with pandas
The scenario: You're building a property price prediction model at a real estate firm. One feature is property_type — "apartment", "house", "studio", "townhouse". These four categories have no natural ordering whatsoever. A studio isn't "less than" an apartment. One-hot encoding is the right choice, and pandas makes it a single line with pd.get_dummies().
# Import pandas
import pandas as pd
# Property listings data — property_type is nominal with no inherent order
housing_df = pd.DataFrame({
'listing_id': ['P01', 'P02', 'P03', 'P04', 'P05',
'P06', 'P07', 'P08', 'P09', 'P10'],
'property_type': ['apartment', 'house', 'studio', 'townhouse', 'apartment',
'studio', 'house', 'apartment', 'townhouse', 'studio'],
'price_gbp': [320000, 580000, 195000, 440000, 310000,
185000, 620000, 295000, 455000, 200000]
})
# pd.get_dummies() creates one binary column per unique category value
# prefix= adds the original column name so you know where each new column came from
# drop_first=True drops the first dummy column to avoid multicollinearity
# dtype=int gives 0/1 integers — newer pandas versions default to boolean dummies
encoded = pd.get_dummies(housing_df['property_type'], prefix='type',
                         drop_first=True, dtype=int)
# Join the new dummy columns back onto the original DataFrame
housing_df = pd.concat([housing_df, encoded], axis=1)
# Drop the original text column — the model only needs the encoded versions
housing_df = housing_df.drop(columns=['property_type'])
# Print the result — four categories became three binary columns (one dropped)
print(housing_df.to_string(index=False))
listing_id price_gbp type_house type_studio type_townhouse
P01 320000 0 0 0
P02 580000 1 0 0
P03 195000 0 1 0
P04 440000 0 0 1
P05 310000 0 0 0
P06 185000 0 1 0
P07 620000 1 0 0
P08 295000 0 0 0
P09 455000 0 0 1
P10 200000 0 1 0
What just happened?
pd.get_dummies() created one column for each unique property type, then drop_first=True removed type_apartment — the first alphabetically. When all three remaining columns are 0, the model knows it's an apartment. This is the dummy variable trap fix: keeping all four columns would make one perfectly predictable from the others, which breaks linear models.
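To see the trap concretely: without drop_first, the dummy columns in every row sum to exactly 1, so any one column is fully determined by the others. A quick check on toy data:

```python
import pandas as pd

types = pd.Series(['apartment', 'house', 'studio', 'townhouse', 'apartment'])

# Keep all four dummy columns — no drop_first
full = pd.get_dummies(types, prefix='type', dtype=int)

# Every row sums to exactly 1: perfect multicollinearity for a linear model
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```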
Ordinal Encoding with scikit-learn
The scenario: You're a data scientist at an e-commerce platform predicting customer lifetime value. Your dataset includes a membership_tier column — "bronze", "silver", "gold", "platinum" — and a satisfaction_score column — "low", "medium", "high". Both are clearly ordinal. Using sklearn's OrdinalEncoder lets you specify the exact order of categories rather than relying on alphabetical assignment.
# Import pandas and OrdinalEncoder
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# E-commerce customer data with two ordinal features
customer_df = pd.DataFrame({
'customer_id': ['E01', 'E02', 'E03', 'E04', 'E05',
'E06', 'E07', 'E08', 'E09', 'E10'],
'membership_tier': ['gold', 'bronze', 'platinum', 'silver', 'bronze',
'gold', 'platinum', 'silver', 'bronze', 'gold'],
'satisfaction_score': ['high', 'low', 'high', 'medium', 'low',
'medium', 'high', 'medium', 'low', 'high']
})
# OrdinalEncoder takes categories= to define the order for each column explicitly
# The list order maps directly: first item = 0, second = 1, and so on
oe = OrdinalEncoder(categories=[
['bronze', 'silver', 'gold', 'platinum'], # membership_tier order
['low', 'medium', 'high'] # satisfaction_score order
])
# Fit and transform both columns at once — must pass as a 2D array (two columns)
feature_cols = ['membership_tier', 'satisfaction_score']
encoded = oe.fit_transform(customer_df[feature_cols])
# Store results back in the DataFrame with new column names
customer_df['tier_encoded'] = encoded[:, 0].astype(int)
customer_df['satisfaction_encoded'] = encoded[:, 1].astype(int)
# Print the original labels alongside their encoded values
print(customer_df[['customer_id', 'membership_tier', 'tier_encoded',
'satisfaction_score', 'satisfaction_encoded']].to_string(index=False))
customer_id membership_tier tier_encoded satisfaction_score satisfaction_encoded
E01 gold 2 high 2
E02 bronze 0 low 0
E03 platinum 3 high 2
E04 silver 1 medium 1
E05 bronze 0 low 0
E06 gold 2 medium 1
E07 platinum 3 high 2
E08 silver 1 medium 1
E09 bronze 0 low 0
E10 gold 2 high 2
What just happened?
OrdinalEncoder used the exact order we specified in categories= to assign integers — bronze gets 0, platinum gets 3, low gets 0, high gets 2. This is crucial: if we had used LabelEncoder instead, it would have sorted alphabetically and assigned bronze=0, gold=1, platinum=2, silver=3 — putting silver above platinum, which is completely wrong.
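One practical wrinkle: by default OrdinalEncoder raises an error on a category it never saw during fit. Since scikit-learn 0.24 you can opt into a sentinel value instead — a hedged sketch with toy data:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(
    categories=[['low', 'medium', 'high']],
    handle_unknown='use_encoded_value',  # map unseen categories...
    unknown_value=-1                     # ...to this sentinel instead of erroring
)
oe.fit(np.array([['low'], ['medium'], ['high']]))

# 'extreme' was never seen during fit — it becomes -1 rather than raising
print(oe.transform(np.array([['high'], ['extreme']])).ravel().tolist())  # [2.0, -1.0]
```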
One-Hot Encoding with scikit-learn
The scenario: You've joined a data team at a food delivery startup where the pipeline uses sklearn Pipeline objects for reproducibility. The pd.get_dummies() approach from earlier won't fit into that system. You need sklearn's OneHotEncoder, which is a proper transformer that can be fitted on training data and applied to new orders at inference time without manual column management.
# Import pandas and OneHotEncoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Food delivery order data with cuisine type as a nominal feature
orders_df = pd.DataFrame({
'order_id': ['O01', 'O02', 'O03', 'O04', 'O05',
'O06', 'O07', 'O08', 'O09', 'O10'],
'cuisine': ['italian', 'indian', 'chinese', 'italian', 'mexican',
'chinese', 'indian', 'mexican', 'italian', 'chinese'],
'delivery_mins': [32, 45, 28, 38, 41, 25, 50, 36, 30, 27]
})
# sparse_output=False returns a dense numpy array instead of a sparse matrix
# drop='first' removes the first category column to avoid multicollinearity
ohe = OneHotEncoder(sparse_output=False, drop='first')
# Fit on the cuisine column — must be 2D, so double brackets [[]]
ohe.fit(orders_df[['cuisine']])
# Transform produces a numpy array — we convert it to a DataFrame with proper names
encoded_array = ohe.transform(orders_df[['cuisine']])
encoded_cols = ohe.get_feature_names_out(['cuisine'])
encoded_df = pd.DataFrame(encoded_array, columns=encoded_cols, dtype=int)
# Concatenate the encoded columns back to the original DataFrame
orders_df = pd.concat([orders_df.drop(columns=['cuisine']), encoded_df], axis=1)
# Print the categories the encoder learned during fit
print(f"Categories learned: {ohe.categories_[0].tolist()}")
print(f"Columns created: {encoded_cols.tolist()}")
print()
print(orders_df.to_string(index=False))
Categories learned: ['chinese', 'indian', 'italian', 'mexican']
Columns created: ['cuisine_indian', 'cuisine_italian', 'cuisine_mexican']
order_id delivery_mins cuisine_indian cuisine_italian cuisine_mexican
O01 32 0 1 0
O02 45 1 0 0
O03 28 0 0 0
O04 38 0 1 0
O05 41 0 0 1
O06 25 0 0 0
O07 50 1 0 0
O08 36 0 0 1
O09 30 0 1 0
O10 27 0 0 0
What just happened?
OneHotEncoder learned four cuisine categories during .fit() and created three binary columns after dropping the first (cuisine_chinese). Orders O03, O06, and O10 all show zeros across all three columns — the model knows they are Chinese orders by elimination. get_feature_names_out() generates clean, readable column names automatically.
Label vs One-Hot vs Ordinal — Side by Side
Use for binary / quick encoding
Label Encoding
Safe for binary columns (yes/no) and tree-based models that don't assume numerical order. Risky for nominal multi-class columns in linear models.
Use for nominal categories
One-Hot Encoding
No false ordering. Works with all model types. Use drop='first' to avoid multicollinearity. Avoid for high-cardinality columns.
Use for ranked categories
Ordinal Encoding
Always specify categories= explicitly. Never rely on alphabetical sorting for ordinal features — it will almost certainly produce the wrong order.
Teacher's Note
Always check for unseen categories before transforming test data. If a new category appears at inference time that wasn't present during training, both LabelEncoder and OrdinalEncoder will raise an error by default. OneHotEncoder can handle this gracefully with handle_unknown='ignore', which fills unknown categories with all zeros. In production pipelines, always set this parameter explicitly so a new cuisine or new contract type doesn't crash your deployment.
Practice Questions
1. A feature has values "red", "green", "blue" with no natural order. Which encoding method should you use?
2. What argument do you pass to pd.get_dummies() to avoid the dummy variable trap?
3. Which OrdinalEncoder parameter lets you define the exact rank order of categories?
Quiz
1. Why is label encoding risky for nominal features like "city" in a linear regression model?
2. Which OneHotEncoder parameter prevents errors when an unseen category appears at inference time?
3. A satisfaction survey column has values "poor", "fair", "good", "excellent". What is the correct encoding approach?
Up Next · Lesson 14
Feature Construction
Build entirely new features from existing ones — ratios, differences, flags, and combinations that give your model signals it could never find on its own.