Feature Engineering Lesson 13 – Encoding Basics | Dataplexa
Beginner Level · Lesson 13

Encoding Basics

Machine learning models speak numbers. Categories like "red", "urban", or "platinum" mean nothing to an algorithm until you translate them. Encoding is that translation — and choosing the wrong method can silently break your model.

Encoding converts categorical features — text labels, nominal groups, ordinal ranks — into numerical representations that a model can learn from. The method you choose depends on whether the categories have a natural order, how many unique values exist, and what kind of model you're feeding the data into.

The Three Fundamental Encoding Methods

Before touching code, you need to understand the conceptual difference. The wrong choice doesn't throw an error — it silently introduces bias or throws away information.

1. Label Encoding — assign an integer to each category

Maps each unique category to a number: 0, 1, 2, 3... Simple and compact. The danger: it implies an ordering that may not exist. If "red=0, green=1, blue=2", the model thinks blue is greater than red, which is nonsense for colour labels.
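To make the danger concrete, here is a minimal sketch (the colour labels are made up for illustration) showing the arbitrary alphabetical order LabelEncoder assigns:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['red', 'green', 'blue', 'red'])

# Alphabetical assignment: blue=0, green=1, red=2
print(codes.tolist())   # [2, 1, 0, 2]
```

A linear model reading these integers would treat red as "twice" green, an ordering that exists only because of the alphabet.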

2. One-Hot Encoding — create a binary column per category

Creates one new binary (0/1) column for each unique category value. No false ordering. Ideal for nominal categories with no natural rank. The trade-off: if a column has 500 unique values, you get 500 new columns — the curse of dimensionality.
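A quick sketch of that column explosion, with hypothetical product codes: every unique value becomes its own column.

```python
import pandas as pd

# 500 hypothetical product SKUs, all unique
skus = pd.Series([f'SKU-{i:03d}' for i in range(500)])

# One-hot encoding creates one column per unique value
dummies = pd.get_dummies(skus)
print(dummies.shape)   # (500, 500)
```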

3. Ordinal Encoding — encode with meaningful order

Like label encoding, but the integers reflect a real rank: "low=0, medium=1, high=2". Correct for features where the order carries information — satisfaction ratings, education levels, severity scores.
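Conceptually this is just a lookup that respects the real rank. A minimal sketch, with a made-up severity column and an explicit mapping:

```python
import pandas as pd

severity = pd.Series(['low', 'high', 'medium', 'low'])

# The mapping encodes the real rank, not alphabetical order
order = {'low': 0, 'medium': 1, 'high': 2}
encoded = severity.map(order)
print(encoded.tolist())   # [0, 2, 1, 0]
```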

When to Use Each Method

Category type                      Example                           Use
Nominal, few categories (<10)      city, colour, payment method      One-Hot
Nominal, many categories (>15)     zip code, product SKU             Target / Frequency Encoding
Ordinal — ranked categories        low/medium/high, star rating      Ordinal Encoding
Binary categories                  yes/no, male/female, true/false   Label Encoding (safe here)
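The table mentions frequency encoding for high-cardinality columns; it isn't covered by the worked examples below, but the idea fits in a few lines. A minimal sketch, with made-up city names: replace each category by how often it occurs.

```python
import pandas as pd

cities = pd.Series(['london', 'leeds', 'london', 'york', 'london', 'leeds'])

# Count each category, then map the counts back onto the column
counts = cities.value_counts()
encoded = cities.map(counts)
print(encoded.tolist())   # [3, 2, 3, 1, 3, 2]
```

The result is one number per category regardless of cardinality, at the cost of merging categories that happen to share a frequency.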

Label Encoding with scikit-learn

The scenario: You're a data analyst at a telecom company building a customer churn model. One feature is contract_type with three values: "month-to-month", "one-year", and "two-year". These have a natural order, since longer contracts generally mean more committed customers, and the alphabetical order LabelEncoder assigns happens to match that rank here, so label encoding works. You also have a binary paperless_billing column that's a straightforward yes/no.

# Import pandas and LabelEncoder from sklearn
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Telecom customer churn dataset with categorical features
churn_df = pd.DataFrame({
    'customer_id': ['T01', 'T02', 'T03', 'T04', 'T05',
                    'T06', 'T07', 'T08', 'T09', 'T10'],
    'contract_type': ['month-to-month', 'two-year', 'one-year',
                     'month-to-month', 'two-year', 'one-year',
                     'month-to-month', 'two-year', 'month-to-month', 'one-year'],
    'paperless_billing': ['Yes', 'No', 'Yes', 'Yes', 'No',
                         'No', 'Yes', 'No', 'Yes', 'Yes'],
    'churned': [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
})

# Instantiate a LabelEncoder — one per column, they track their own mappings
le_contract = LabelEncoder()
le_billing = LabelEncoder()

# fit_transform learns the unique values and encodes in one step
# Alphabetical order is used by default: month-to-month=0, one-year=1, two-year=2
churn_df['contract_encoded'] = le_contract.fit_transform(churn_df['contract_type'])

# Binary column: No=0, Yes=1 — label encoding is perfectly safe here
churn_df['billing_encoded'] = le_billing.fit_transform(churn_df['paperless_billing'])

# Print the mapping so we know exactly which integer maps to which label
print("contract_type mapping:")
for i, cls in enumerate(le_contract.classes_):
    print(f"  {i} → {cls}")

print("\npaperless_billing mapping:")
for i, cls in enumerate(le_billing.classes_):
    print(f"  {i} → {cls}")

print()
# Show raw labels alongside their encoded integers
print(churn_df[['customer_id', 'contract_type', 'contract_encoded',
                'paperless_billing', 'billing_encoded']].to_string(index=False))
contract_type mapping:
  0 → month-to-month
  1 → one-year
  2 → two-year

paperless_billing mapping:
  0 → No
  1 → Yes

 customer_id   contract_type  contract_encoded paperless_billing  billing_encoded
         T01  month-to-month                 0               Yes                1
         T02        two-year                 2                No                0
         T03        one-year                 1               Yes                1
         T04  month-to-month                 0               Yes                1
         T05        two-year                 2                No                0
         T06        one-year                 1                No                0
         T07  month-to-month                 0               Yes                1
         T08        two-year                 2                No                0
         T09  month-to-month                 0               Yes                1
         T10        one-year                 1               Yes                1

What just happened?

LabelEncoder scanned each column, sorted the unique values alphabetically, and assigned integers starting from 0. The .classes_ attribute stores the mapping so you can always look up which integer corresponds to which label. For paperless_billing the result is a clean 0/1 binary column.
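Because .classes_ stores the mapping, the encoder can also translate integers back to labels with inverse_transform. A minimal sketch reusing the contract values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['month-to-month', 'two-year', 'one-year'])
print(codes.tolist())    # [0, 2, 1]

# inverse_transform reverses the mapping, useful when reporting predictions
labels = le.inverse_transform([2, 0, 1])
print(labels.tolist())   # ['two-year', 'month-to-month', 'one-year']
```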

One-Hot Encoding with pandas

The scenario: You're building a property price prediction model at a real estate firm. One feature is property_type — "apartment", "house", "studio", "townhouse". These four categories have no natural ordering whatsoever. A studio isn't "less than" an apartment. One-hot encoding is the right choice, and pandas makes it a single line with pd.get_dummies().

# Import pandas
import pandas as pd

# Property listings data — property_type is nominal with no inherent order
housing_df = pd.DataFrame({
    'listing_id': ['P01', 'P02', 'P03', 'P04', 'P05',
                   'P06', 'P07', 'P08', 'P09', 'P10'],
    'property_type': ['apartment', 'house', 'studio', 'townhouse', 'apartment',
                     'studio', 'house', 'apartment', 'townhouse', 'studio'],
    'price_gbp': [320000, 580000, 195000, 440000, 310000,
                 185000, 620000, 295000, 455000, 200000]
})

# pd.get_dummies() creates one binary column per unique category value
# prefix= adds the original column name so you know where each new column came from
# drop_first=True drops the first dummy column to avoid multicollinearity
encoded = pd.get_dummies(housing_df['property_type'], prefix='type', drop_first=True)

# Join the new dummy columns back onto the original DataFrame
housing_df = pd.concat([housing_df, encoded], axis=1)

# Drop the original text column — the model only needs the encoded versions
housing_df = housing_df.drop(columns=['property_type'])

# Print the result — four categories became three binary columns (one dropped)
print(housing_df.to_string(index=False))
 listing_id  price_gbp  type_house  type_studio  type_townhouse
        P01     320000           0            0               0
        P02     580000           1            0               0
        P03     195000           0            1               0
        P04     440000           0            0               1
        P05     310000           0            0               0
        P06     185000           0            1               0
        P07     620000           1            0               0
        P08     295000           0            0               0
        P09     455000           0            0               1
        P10     200000           0            1               0

What just happened?

pd.get_dummies() created one column for each unique property type, then drop_first=True removed type_apartment (the first alphabetically). When all three remaining columns are 0, the model knows it's an apartment. This avoids the dummy variable trap: keeping all four columns would make any one of them perfectly predictable from the others, which breaks linear models.
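You can check the redundancy directly: without drop_first, every row has exactly one 1, so each column is fully determined by the others. A minimal check with the same four property types:

```python
import pandas as pd

types = pd.Series(['apartment', 'house', 'studio', 'townhouse', 'apartment'])

# Keep all four dummy columns deliberately
full = pd.get_dummies(types, prefix='type')

# Every row sums to exactly 1: any column equals 1 minus the sum of the rest
print(full.sum(axis=1).tolist())   # [1, 1, 1, 1, 1]
```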

Ordinal Encoding with scikit-learn

The scenario: You're a data scientist at an e-commerce platform predicting customer lifetime value. Your dataset includes a membership_tier column — "bronze", "silver", "gold", "platinum" — and a satisfaction_score column — "low", "medium", "high". Both are clearly ordinal. Using sklearn's OrdinalEncoder lets you specify the exact order of categories rather than relying on alphabetical assignment.

# Import pandas and OrdinalEncoder
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# E-commerce customer data with two ordinal features
customer_df = pd.DataFrame({
    'customer_id': ['E01', 'E02', 'E03', 'E04', 'E05',
                    'E06', 'E07', 'E08', 'E09', 'E10'],
    'membership_tier': ['gold', 'bronze', 'platinum', 'silver', 'bronze',
                       'gold', 'platinum', 'silver', 'bronze', 'gold'],
    'satisfaction_score': ['high', 'low', 'high', 'medium', 'low',
                          'medium', 'high', 'medium', 'low', 'high']
})

# OrdinalEncoder takes categories= to define the order for each column explicitly
# The list order maps directly: first item = 0, second = 1, and so on
oe = OrdinalEncoder(categories=[
    ['bronze', 'silver', 'gold', 'platinum'],   # membership_tier order
    ['low', 'medium', 'high']                   # satisfaction_score order
])

# Fit and transform both columns at once — must pass as a 2D array (two columns)
feature_cols = ['membership_tier', 'satisfaction_score']
encoded = oe.fit_transform(customer_df[feature_cols])

# Store results back in the DataFrame with new column names
customer_df['tier_encoded'] = encoded[:, 0].astype(int)
customer_df['satisfaction_encoded'] = encoded[:, 1].astype(int)

# Print the original labels alongside their encoded values
print(customer_df[['customer_id', 'membership_tier', 'tier_encoded',
                   'satisfaction_score', 'satisfaction_encoded']].to_string(index=False))
 customer_id membership_tier  tier_encoded satisfaction_score  satisfaction_encoded
         E01            gold             2               high                     2
         E02          bronze             0                low                     0
         E03        platinum             3               high                     2
         E04          silver             1             medium                     1
         E05          bronze             0                low                     0
         E06            gold             2             medium                     1
         E07        platinum             3               high                     2
         E08          silver             1             medium                     1
         E09          bronze             0                low                     0
         E10            gold             2               high                     2

What just happened?

OrdinalEncoder used the exact order we specified in categories= to assign integers — bronze gets 0, platinum gets 3, low gets 0, high gets 2. This is crucial: if we had used LabelEncoder instead, it would have sorted alphabetically and assigned bronze=0, gold=1, platinum=2, silver=3 — putting silver above platinum, which is completely wrong.
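You can see that failure mode directly by fitting a LabelEncoder on the same tier labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['gold', 'bronze', 'platinum', 'silver'])

# Alphabetical assignment: bronze=0, gold=1, platinum=2, silver=3
print(le.classes_.tolist())   # ['bronze', 'gold', 'platinum', 'silver']
```

Silver ends up with the highest integer, ranked above platinum, which is exactly the distortion OrdinalEncoder's categories= parameter prevents.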

One-Hot Encoding with scikit-learn

The scenario: You've joined a data team at a food delivery startup where the pipeline uses sklearn Pipeline objects for reproducibility. The pd.get_dummies() approach from earlier won't fit into that system. You need sklearn's OneHotEncoder, which is a proper transformer that can be fitted on training data and applied to new orders at inference time without manual column management.

# Import pandas and OneHotEncoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Food delivery order data with cuisine type as a nominal feature
orders_df = pd.DataFrame({
    'order_id': ['O01', 'O02', 'O03', 'O04', 'O05',
                'O06', 'O07', 'O08', 'O09', 'O10'],
    'cuisine': ['italian', 'indian', 'chinese', 'italian', 'mexican',
               'chinese', 'indian', 'mexican', 'italian', 'chinese'],
    'delivery_mins': [32, 45, 28, 38, 41, 25, 50, 36, 30, 27]
})

# sparse_output=False returns a dense numpy array instead of a sparse matrix
# drop='first' removes the first category column to avoid multicollinearity
ohe = OneHotEncoder(sparse_output=False, drop='first')

# Fit on the cuisine column — must be 2D, so double brackets [[]]
ohe.fit(orders_df[['cuisine']])

# Transform produces a numpy array — we convert it to a DataFrame with proper names
encoded_array = ohe.transform(orders_df[['cuisine']])
encoded_cols = ohe.get_feature_names_out(['cuisine'])
encoded_df = pd.DataFrame(encoded_array, columns=encoded_cols, dtype=int)

# Concatenate the encoded columns back to the original DataFrame
orders_df = pd.concat([orders_df.drop(columns=['cuisine']), encoded_df], axis=1)

# Print the categories the encoder learned during fit
print(f"Categories learned: {ohe.categories_[0].tolist()}")
print(f"Columns created:    {encoded_cols.tolist()}")
print()
print(orders_df.to_string(index=False))
Categories learned: ['chinese', 'indian', 'italian', 'mexican']
Columns created:    ['cuisine_indian', 'cuisine_italian', 'cuisine_mexican']

 order_id  delivery_mins  cuisine_indian  cuisine_italian  cuisine_mexican
      O01             32               0                1                0
      O02             45               1                0                0
      O03             28               0                0                0
      O04             38               0                1                0
      O05             41               0                0                1
      O06             25               0                0                0
      O07             50               1                0                0
      O08             36               0                0                1
      O09             30               0                1                0
      O10             27               0                0                0

What just happened?

OneHotEncoder learned four cuisine categories during .fit() and created three binary columns after dropping the first (cuisine_chinese). Orders O03, O06, and O10 all show zeros across all three columns — the model knows they are Chinese orders by elimination. get_feature_names_out() generates clean, readable column names automatically.

Label vs One-Hot vs Ordinal — Side by Side

Use for binary / quick encoding

Label Encoding

Safe for binary columns (yes/no) and tree-based models that don't assume numerical order. Risky for nominal multi-class columns in linear models.

Use for nominal categories

One-Hot Encoding

No false ordering. Works with all model types. Use drop='first' to avoid multicollinearity. Avoid for high-cardinality columns.

Use for ranked categories

Ordinal Encoding

Always specify categories= explicitly. Never rely on alphabetical sorting for ordinal features — it will almost certainly produce the wrong order.

Teacher's Note

Always check for unseen categories before transforming test data. If a new category appears at inference time that wasn't present during training, both LabelEncoder and OrdinalEncoder will raise an error by default. OneHotEncoder can handle this gracefully with handle_unknown='ignore', which fills unknown categories with all zeros. In production pipelines, always set this parameter explicitly so a new cuisine or new contract type doesn't crash your deployment.

Practice Questions

1. A feature has values "red", "green", "blue" with no natural order. Which encoding method should you use?



2. What argument do you pass to pd.get_dummies() to avoid the dummy variable trap?



3. Which OrdinalEncoder parameter lets you define the exact rank order of categories?



Quiz

1. Why is label encoding risky for nominal features like "city" in a linear regression model?


2. Which OneHotEncoder parameter prevents errors when an unseen category appears at inference time?


3. A satisfaction survey column has values "poor", "fair", "good", "excellent". What is the correct encoding approach?


Up Next · Lesson 14

Feature Construction

Build entirely new features from existing ones — ratios, differences, flags, and combinations that give your model signals it could never find on its own.