Data Science Lesson 8 – Encoding | Dataplexa
Feature Engineering · Lesson 8

Encoding

Transform categorical data into numerical formats that machine learning algorithms can process effectively.

1. Identify categorical features in your dataset
2. Choose appropriate encoding technique
3. Apply encoding and validate output
4. Prepare encoded data for machine learning

Why Encoding Matters

Machine learning algorithms speak numbers. Period. When your dataset contains text like "Electronics" or "Mumbai", algorithms throw error messages faster than you can debug them. Encoding translates these categories into numerical values that algorithms understand while preserving the original meaning.

Think of it like translating English to French. The meaning stays the same, but the format changes completely. The "Electronics" category becomes a 2 or [0,0,1,0,0] depending on your encoding choice. Both represent "Electronics" but in mathematically digestible formats.

Key Insight

The encoding method you choose directly impacts model performance. One-hot encoding works perfectly for 5 categories but creates nightmare scenarios with 500 categories. Choose wisely.

Label Encoding

The simplest encoding technique assigns sequential numbers to categories. Delhi becomes 0, Mumbai becomes 1, and so on. Perfect for ordinal data where categories have natural ordering, dangerous for nominal data where no ranking exists.
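One caveat worth seeing in code: LabelEncoder always assigns numbers alphabetically, so for genuinely ordinal data you often want an explicit mapping that preserves the intended order. A minimal sketch with a hypothetical education column:

```python
import pandas as pd

# Hypothetical ordinal column: education level
df_edu = pd.DataFrame({'education': ['Master', 'High School', 'Bachelor', 'Bachelor']})

# An explicit mapping preserves the intended ranking
# (LabelEncoder would sort alphabetically: Bachelor=0, High School=1, Master=2)
order = {'High School': 0, 'Bachelor': 1, 'Master': 2}
df_edu['education_encoded'] = df_edu['education'].map(order)
print(df_edu)
```

The model now sees High School < Bachelor < Master, which is the ordering you actually mean.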

The scenario: Flipkart's recommendation team needs to encode product categories for their new ML model. They start with label encoding to understand category distribution patterns.

# Load data and examine categories first
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('dataplexa_ecommerce.csv')
print("Unique categories:")
print(df['product_category'].unique())

What just happened?

We identified 5 unique categories in our product data. Label encoding will assign numbers 0-4 to these categories alphabetically. Try this: Check value counts with df['product_category'].value_counts()

Now we apply label encoding to convert these text categories into numbers:

# Apply label encoding to product categories
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['product_category'])

# Show the mapping
mapping_df = pd.DataFrame({
    'Original': df['product_category'].unique(),
    'Encoded': le.transform(df['product_category'].unique())
})
print(mapping_df)

What just happened?

LabelEncoder automatically assigned Books=0, Clothing=1, Electronics=2 etc. in alphabetical order. The fit_transform() method learned the mapping and applied it. Try this: Use le.inverse_transform([0,1,2]) to convert back.

Common Mistake

Using label encoding on nominal categories like cities creates artificial ordering. The algorithm thinks Mumbai (1) > Delhi (0), which makes no sense. Use one-hot encoding instead.

One-Hot Encoding

One-hot encoding creates separate binary columns for each category. Each row gets a 1 in exactly one column and 0s everywhere else. No artificial ordering, no mathematical relationships between categories. Clean and fair.

Why does this matter? Because machine learning algorithms interpret numerical distance as similarity. Label encoding makes "Electronics" (2) seem twice as far from "Books" (0) as "Clothing" (1) is. One-hot encoding eliminates these fake relationships.

The scenario: Swiggy's data team needs to encode city information for their delivery time prediction model. Cities have no natural ranking, so one-hot encoding prevents the model from learning nonsensical city hierarchies.

# One-hot encode city column
city_encoded = pd.get_dummies(df['city'], prefix='city')
print("One-hot encoded cities:")
print(city_encoded.head())

What just happened?

Each city became its own column with 1 for that city and 0 for all others. Row 1 shows city_Delhi = 1, meaning this order came from Delhi. Try this: Add drop_first=True to avoid multicollinearity.

But hold on. Five cities created five columns. What happens with 500 cities? Your dataset explodes from manageable to unwieldy. This curse of dimensionality kills model performance and training speed.

# Combine original data with encoded cities
df_encoded = pd.concat([df, city_encoded], axis=1)

# Check the shape increase
print(f"Original shape: {df.shape}")
print(f"After encoding: {df_encoded.shape}")
print(f"Added {city_encoded.shape[1]} new columns")

What just happened?

Our dataset expanded from 11 to 16 columns after encoding just one categorical feature. The axis=1 parameter concatenated columns horizontally. Try this: Use df_encoded.memory_usage().sum() to see memory impact.

Mumbai leads with 2,340 orders, while Chennai has the fewest at 1,650 orders

This chart reveals something important. Mumbai dominates order volume, but the one-hot encoding treats all cities equally. Each gets exactly one column regardless of frequency. This democratic approach prevents the model from assuming Mumbai is "bigger" than other cities numerically.

But notice the distribution. High-volume cities like Mumbai get the same representation as low-volume cities like Chennai. Sometimes you want this fairness. Sometimes you don't. The business context determines the right choice.

Target Encoding

Target encoding replaces categories with their average target values. Instead of arbitrary numbers or sparse columns, categories get encoded based on their actual relationship with what you're predicting. Electronics might become 78,940 if that's the average revenue for electronics orders.

This technique shines when dealing with high-cardinality features. Product names with 10,000 unique values? Target encoding creates one meaningful column instead of 10,000 sparse ones. The encoding directly captures predictive power.

The scenario: Paytm's fraud detection team needs to encode merchant categories for their risk model. Some categories consistently show higher fraud rates, and target encoding captures these patterns directly.

# Calculate average revenue by product category
target_encoding = df.groupby('product_category')['revenue'].mean()
print("Target encoding mapping:")
print(target_encoding)

What just happened?

Each category now has its average revenue as its encoded value. Electronics averages ₹78,940 while Books average only ₹8,420. This encoding captures business reality. Try this: Use .median() instead of .mean() for outlier resistance.

# Apply target encoding to the dataset
df['category_target_encoded'] = df['product_category'].map(target_encoding)

# Show results
print("Original vs Target Encoded:")
print(df[['product_category', 'category_target_encoded', 'revenue']].head(8))

What just happened?

Each category got replaced with its average revenue. Row 0 shows Electronics → 78940.20 and actual revenue 85430.0. The encoding captures category spending patterns. Try this: Compare correlation between encoded values and target variable.

Data Leakage Warning

Target encoding on training data creates data leakage. The encoding "knows" future target values. Always use cross-validation or holdout sets to compute target statistics properly.
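One common fix is out-of-fold target encoding: each row's encoded value is computed only from the other folds, so no row "sees" its own target. A sketch under the same column names, with toy data standing in for the real file:

```python
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, column, target, n_splits=5, seed=42):
    """Out-of-fold target encoding: each row is encoded using
    target means computed on the other folds only."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        # Means come from the training folds, never the row being encoded
        fold_means = df.iloc[train_idx].groupby(column)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][column].map(fold_means).to_numpy()
    # Categories missing from a fold fall back to the global mean
    return encoded.fillna(df[target].mean())

# Toy data standing in for dataplexa_ecommerce.csv
toy = pd.DataFrame({
    'product_category': ['A', 'A', 'B', 'B', 'A', 'B'] * 3,
    'revenue': [10, 12, 100, 110, 11, 105] * 3,
})
toy['enc'] = oof_target_encode(toy, 'product_category', 'revenue', n_splits=3)
print(toy.head())
```

At inference time you would still use the full-training-set means; the out-of-fold values are only for building leakage-free training features.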

Electronics generates 9.4x more revenue than Books, captured perfectly in target encoding

This visualization shows why target encoding works so well. The natural ordering from Books (₹8,420) to Electronics (₹78,940) reflects real business patterns. Unlike arbitrary label encoding numbers, these values carry meaning that directly relates to your prediction target.

The steep climb from Clothing to Electronics reveals the premium electronics market's impact. Your model learns that when it sees 78940.20, it should predict higher revenues than when it sees 8420.50.

Choosing the Right Encoding

Label Encoding

Best for: Ordinal data, tree-based models, memory constraints. Example: Education level (High School < Bachelor < Master)

One-Hot Encoding

Best for: Nominal data, linear models, low cardinality. Example: Colors (Red, Blue, Green)

Target Encoding

Best for: High cardinality, strong target correlation. Example: ZIP codes, product IDs

Frequency Encoding

Best for: When frequency matters more than category. Example: Customer segments by activity
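Frequency encoding is the only technique in this table not demonstrated elsewhere in the lesson; a minimal sketch on toy city data:

```python
import pandas as pd

# Toy data standing in for the ecommerce dataset
df_toy = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Mumbai', 'Pune', 'Mumbai', 'Delhi']})

# Replace each category with how often it occurs (as a share of all rows)
freq = df_toy['city'].value_counts(normalize=True)
df_toy['city_freq'] = df_toy['city'].map(freq)
print(df_toy)
```

Mumbai (3 of 6 rows) becomes 0.5, Delhi 0.33, Pune 0.17. One numeric column, no matter how many cities, and the values carry real information when volume correlates with the target.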

📊 Data Insight

Cardinality matters most. Under 10 categories? One-hot encoding works great. Over 50 categories? Consider target or frequency encoding. Between 10 and 50? Test both approaches.

The Cardinality Test

# Check cardinality for all categorical columns
categorical_cols = ['gender', 'city', 'product_category', 'product_name']

for col in categorical_cols:
    cardinality = df[col].nunique()
    print(f"{col}: {cardinality} unique values")
    
    # Recommend encoding based on cardinality
    if cardinality <= 10:
        recommendation = "One-hot encoding"
    elif cardinality <= 50:
        recommendation = "Label or target encoding"
    else:
        recommendation = "Target or frequency encoding"
    
    print(f"  → Recommendation: {recommendation}\n")

What just happened?

The cardinality check revealed product_name has 847 unique values, far too many for one-hot encoding. Gender and city are perfect for one-hot with only 2-5 categories each. Try this: Add frequency analysis with df[col].value_counts().head()

Perfect example of why you can't use one encoding for everything. Product names with 847 unique values would create 847 new columns. Your model would spend more time processing sparse matrices than learning patterns. Choose encoding methods per column, not per dataset.

Handling Unseen Categories

Real-world data throws curveballs. Your model trains on Delhi, Mumbai, Bangalore, Chennai, and Pune. Then production data arrives with orders from Hyderabad. Unseen categories break poorly designed encoding pipelines.

The scenario: Zomato's pricing model encounters a new restaurant category not present in training data. The system needs to handle this gracefully without crashing or producing nonsensical predictions.

# Simulate unseen category scenario
from sklearn.preprocessing import LabelEncoder

# Train encoder on subset of data
train_cities = ['Delhi', 'Mumbai', 'Bangalore']
le_safe = LabelEncoder()
le_safe.fit(train_cities)

# Try to transform unseen city
test_cities = ['Mumbai', 'Hyderabad']  # Hyderabad is unseen

try:
    result = le_safe.transform(test_cities)
    print("Success:", result)
except ValueError as e:
    print("Error:", str(e))

What just happened?

LabelEncoder crashed when it encountered 'Hyderabad', a city not in its training vocabulary. This breaks production systems instantly. Try this: Build a custom encoder that handles unknown categories gracefully.

# Build robust encoding that handles unseen categories
def safe_target_encoding(train_data, column, target, unknown_value='mean'):
    # Calculate target encoding mapping
    encoding_map = train_data.groupby(column)[target].mean().to_dict()
    
    # Define fallback for unseen categories
    if unknown_value == 'mean':
        fallback = train_data[target].mean()
    elif unknown_value == 'median':
        fallback = train_data[target].median()
    else:
        fallback = unknown_value
    
    def encode_column(data):
        return data.map(encoding_map).fillna(fallback)
    
    return encode_column, encoding_map, fallback

# Test the safe encoder
encoder_func, mapping, fallback_val = safe_target_encoding(
    df, 'product_category', 'revenue', 'mean'
)

print(f"Encoding mapping: {mapping}")
print(f"Fallback value: {fallback_val:.2f}")

What just happened?

Our safe encoder creates a mapping for known categories and sets a fallback value of ₹33,504 (overall mean) for unknown categories. No more crashes, just graceful degradation. Try this: Use the median instead of mean for outlier resistance.

Always plan for unseen categories in production. Use global statistics (mean/median/mode) as fallbacks, or create an "Other" category during training to represent rare cases.
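The "Other" category idea can be sketched like this; the rarity threshold is an illustrative assumption:

```python
import pandas as pd

def bucket_rare(series, min_count=50, other='Other'):
    """Collapse categories seen fewer than min_count times into one bucket."""
    counts = series.value_counts()
    keep = set(counts[counts >= min_count].index)
    return series.where(series.isin(keep), other), keep

def apply_bucket(series, keep, other='Other'):
    """At inference time, anything outside the kept vocabulary -> 'Other'."""
    return series.where(series.isin(keep), other)

# Toy training data: Nagpur is too rare to keep its own category
train = pd.Series(['Delhi'] * 60 + ['Mumbai'] * 55 + ['Nagpur'] * 3)
train_bucketed, vocab = bucket_rare(train)

# Production data with an unseen city
test = pd.Series(['Mumbai', 'Hyderabad'])
print(apply_bucket(test, vocab))
```

Because rare training categories and unseen production categories share the same "Other" bucket, the model has actually learned a sensible encoding for the fallback value instead of receiving one it has never seen.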

6% of production data contains previously unseen categories requiring fallback handling

This chart represents a typical production scenario. 94% of your data matches training categories perfectly, but that crucial 6% consists of new categories, typos, or edge cases. Without proper fallback mechanisms, these 6% cause system failures.

The yellow slice represents your safety net. Instead of crashing, unknown categories get reasonable default values. Your model continues making predictions, albeit with slightly reduced confidence for these edge cases.

Quiz

1. A Flipkart analyst wants to encode product categories for a revenue prediction model. The categories are Electronics (avg revenue ₹75,000), Books (avg revenue ₹8,000), and Clothing (avg revenue ₹25,000). What would target encoding produce for these categories?


2. Your e-commerce dataset has a "product_name" column with 5,000 unique product names. What's the best encoding approach for this high-cardinality categorical feature?


3. You've built a target encoder on training data with cities: Delhi, Mumbai, Bangalore. In production, you encounter orders from Hyderabad (not in training data). What's the most robust way to handle this?


Up Next

Feature Scaling

Transform your encoded features to the same scale so algorithms treat all variables fairly and converge faster.