Feature Engineering Course
Types of Features
Not all features are created equal. A house's square footage, its neighbourhood name, the date it sold, and a written description of its interior all carry signal — but each needs completely different treatment before a model can use it. This lesson maps every feature type you'll encounter and exactly what to do with each one.
There are five feature types: numerical, categorical, datetime, text, and boolean. The moment you see a new column, your first question is always the same: which type is this? The answer tells you immediately what transformations lie ahead.
Continuous vs Discrete Numerical
Numerical features come in two flavours. Continuous features can take any value on a spectrum — sale_price, sqft, house_age. There is no meaningful gap between 245,000 and 245,001. Discrete features are countable whole numbers — num_bedrooms, num_floors, num_reviews. You can have 3 bedrooms or 4 bedrooms but not 3.7.
The distinction matters because discrete features with very few unique values (like bedrooms: 1, 2, 3, 4, 5) sometimes behave more like ordinal categoricals. A model might learn more from treating num_bedrooms = 5 as a category "large house" than as a number five times bigger than one bedroom.
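As a sketch of that idea, pd.cut can turn a low-cardinality count into an ordered size category. The bin edges and labels below are illustrative choices, not a standard:

```python
import pandas as pd

bedrooms = pd.Series([1, 2, 3, 4, 5, 3, 2, 4])

# Illustrative binning: treat the raw count as an ordered size category.
# Intervals are right-inclusive: (0,2] small, (2,3] medium, (3,10] large
size_cat = pd.cut(bedrooms, bins=[0, 2, 3, 10],
                  labels=['small', 'medium', 'large'])
print(size_cat.tolist())
```

Whether the binned version beats the raw count is an empirical question — try both and validate against the target.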
The scenario: You're a data scientist at a property valuation firm. The analytics lead asks you to run a full numerical profile on the housing dataset — check the distributions, flag any skew, and identify outliers before the modelling team starts. She specifically wants to know whether any columns need a log transform and whether any suspicious values should be investigated.
# pandas — core data library for tabular work, always imported as pd
import pandas as pd
# numpy — numerical Python, gives us log transform and mathematical ops
import numpy as np
# Realistic housing dataset — 10 rows
housing_df = pd.DataFrame({
'sqft': [1200, 2100, 980, 2850, 1450, 1800, 1050, 3100, 880, 5800],
'house_age': [46, 19, 61, 6, 33, 23, 49, 9, 36, 4],
'num_bedrooms': [3, 4, 2, 5, 3, 4, 2, 5, 3, 6],
'num_floors': [1, 2, 1, 3, 2, 2, 1, 3, 1, 4],
'sale_price': [245000, 410000, 182000, 560000,
295000, 348000, 198000, 620000, 230000, 1250000]
})
# .describe() — the fastest full profile of a numerical column
# Key things to look for:
# mean >> median (50%) = right skew (long tail of high values)
# max >> 75th percentile = potential outlier worth investigating
print("=== Numerical Profile ===\n")
print(housing_df.describe().round(0).to_string())
# .skew() — measure distributional asymmetry per column
# Rule of thumb: |skew| > 1.0 = highly skewed, consider log transform
print("\n=== Skewness ===\n")
for col in housing_df.columns:
    sk = housing_df[col].skew()
    note = " ← log transform recommended" if abs(sk) > 1.0 else ""
    print(f"  {col:<15} {sk:+.3f}{note}")
=== Numerical Profile ===
sqft house_age num_bedrooms num_floors sale_price
count 10 10 10 10 10
mean 1921 29 4 2 433800
std 1416 19 1 1 305393
min 880 4 2 1 182000
25% 1088 10 3 1 236250
50% 1625 28 4 2 321500
75% 2363 45 5 3 508000
max 5800 61 6 4 1250000
=== Skewness ===
sqft +1.649 ← log transform recommended
house_age +0.144
num_bedrooms +0.000
num_floors +0.316
 sale_price     +1.712 ← log transform recommended
What just happened?
The mean for sqft (1,921) sits far above the median (1,625) — a classic sign of right skew. The skewness output confirms it: sqft at +1.649 and sale_price at +1.712 are both highly skewed, driven by the 5,800 sqft house and its 1,250,000 sale price. Both need a log transform for linear models. Tree-based models can skip it — they split on thresholds and are unaffected by skewed distributions.
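As a quick sketch of that fix, np.log1p (log of 1 + x, which also tolerates zeros) compresses the long right tail and pulls the skew back toward zero:

```python
import pandas as pd
import numpy as np

sqft = pd.Series([1200, 2100, 980, 2850, 1450, 1800, 1050, 3100, 880, 5800])

# np.log1p compresses large values far more than small ones,
# shrinking the influence of the 5,800 sqft outlier
sqft_log = np.log1p(sqft)

print(f"skew before: {sqft.skew():+.3f}")
print(f"skew after:  {sqft_log.skew():+.3f}")
```

To turn predictions back into the original units, invert with np.expm1.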
Nominal vs Ordinal Categorical
Nominal categories have no natural order — neighbourhood, property_type, colour. There is no sense in which "east" is greater than "north". Encoding nominal features with arbitrary integers (north=0, east=1, west=2) would tell the model a false story: that west is somehow twice east. One-hot encoding is the correct approach — it creates separate binary columns and removes any implied ranking.
Ordinal categories do have a natural order — condition (poor → fair → good → excellent), education_level (high school → bachelor → master → PhD). Here, integer encoding is not just acceptable — it's preferable. The numbers carry the real-world ranking the model needs to learn from.
The scenario: The housing dataset has both types: neighbourhood (nominal) and condition (ordinal). Your team lead asks you to demonstrate the correct encoding for each one and show that the ordinal encoding actually preserves the expected relationship with sale price — because if the ranking were wrong, the feature would mislead the model.
import pandas as pd
housing_df = pd.DataFrame({
'neighbourhood': ['north','east','south','east','north',
'west','south','east','north','west'],
'condition': ['good','fair','excellent','poor','good',
'excellent','fair','poor','good','excellent'],
'sale_price': [245000,410000,182000,560000,295000,
348000,198000,620000,230000,580000]
})
# NOMINAL encoding — pd.get_dummies() creates one binary column per category
# drop_first=True removes one redundant column (east is implied when all the remaining columns are 0)
# prefix='nbhd' keeps column names readable when you have many encoded features
nominal_encoded = pd.get_dummies(
housing_df['neighbourhood'], prefix='nbhd', drop_first=True)
print("Nominal (one-hot) encoding — neighbourhood:\n")
print(pd.concat([housing_df['neighbourhood'], nominal_encoded], axis=1).to_string(index=False))
# ORDINAL encoding — map each label to an integer that reflects its true ranking
# The dictionary order is your explicit statement of the ranking
ordinal_map = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}
housing_df['condition_encoded'] = housing_df['condition'].map(ordinal_map)
print("\nOrdinal encoding — condition:\n")
print(housing_df[['condition','condition_encoded','sale_price']].to_string(index=False))
# Validate: does the ordinal encoding actually track sale price correctly?
print("\nMean sale price by condition rank:")
print(housing_df.groupby('condition_encoded')['sale_price'].mean().round(0).to_string())
Nominal (one-hot) encoding — neighbourhood:
neighbourhood nbhd_north nbhd_south nbhd_west
north 1 0 0
east 0 0 0
south 0 1 0
east 0 0 0
north 1 0 0
west 0 0 1
south 0 1 0
east 0 0 0
north 1 0 0
west 0 0 1
Ordinal encoding — condition:
condition condition_encoded sale_price
good 2 245000
fair 1 410000
excellent 3 182000
poor 0 560000
good 2 295000
excellent 3 348000
fair 1 198000
poor 0 620000
good 2 230000
excellent 3 580000
Mean sale price by condition rank:
condition_encoded
0 590000.0
1 304000.0
2 256667.0
3    370000.0
What just happened?
pd.get_dummies() correctly handles the nominal column — east becomes the baseline (all zeros) with no false ranking implied. .map(ordinal_map) applies the integer ranking for the ordinal column. The mean price by condition shows no clean monotonic pattern in this tiny sample — in fact the two poor houses have the highest mean price (590,000), dragged up by two expensive sales. This is exactly why you always validate after encoding rather than assuming a feature behaves as expected.
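One alternative to the manual dictionary worth knowing: pandas' ordered Categorical dtype states the ranking once on the column's type and derives the integers from it. A sketch using the same condition labels:

```python
import pandas as pd

condition = pd.Series(['good', 'fair', 'excellent', 'poor'])

# The categories list is the explicit ranking statement;
# .cat.codes yields the integers, and unseen labels become -1 rather than crashing
ranked = condition.astype(pd.CategoricalDtype(
    categories=['poor', 'fair', 'good', 'excellent'], ordered=True))
print(ranked.cat.codes.tolist())
```

Both approaches produce the same integers; the Categorical version also documents the order in the dtype itself.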
Datetime — Extracting the Hidden Features
A datetime column looks like one piece of data but contains several usable features buried inside. The month of sale might capture seasonality in the housing market. The year captures market trends. The day of week might reveal whether weekend listings behave differently. None of these signals exist until you extract them explicitly.
The scenario: The property team wants to know whether there's a seasonal pattern in sale prices — do houses sell for more in spring and summer? The raw dataset has sale_date stored as a text string. Your job is to parse it, extract the relevant signals, and run a quick correlation check to see which extracted features are worth keeping.
import pandas as pd
housing_df = pd.DataFrame({
'sale_date': ['2023-01-12','2023-04-03','2022-11-22','2023-06-15',
'2022-09-08','2023-03-28','2022-07-14','2023-08-01',
'2022-12-05','2023-05-19'],
'sale_price': [245000,380000,210000,430000,
295000,360000,400000,450000,220000,410000]
})
# pd.to_datetime() converts text strings into real datetime64 objects
# pandas cannot extract month or year from a plain string — this step is mandatory
housing_df['sale_date'] = pd.to_datetime(housing_df['sale_date'])
# .dt is the datetime accessor — unlocks all date/time properties on a Series
housing_df['sale_year'] = housing_df['sale_date'].dt.year
housing_df['sale_month'] = housing_df['sale_date'].dt.month
housing_df['sale_quarter'] = housing_df['sale_date'].dt.quarter
housing_df['sale_dow'] = housing_df['sale_date'].dt.dayofweek # 0=Mon 6=Sun
# Validate each extracted feature — only keep what actually carries signal
print("Correlation with sale_price:\n")
for col in ['sale_year','sale_month','sale_quarter','sale_dow']:
    corr = housing_df[col].corr(housing_df['sale_price'])
    keep = " ✓ keep" if abs(corr) > 0.3 else " — weak signal"
    print(f"  {col:<15} {corr:+.3f}{keep}")
print("\nExtracted datetime features:\n")
print(housing_df[['sale_date','sale_year','sale_month',
'sale_quarter','sale_dow','sale_price']].to_string(index=False))
Correlation with sale_price:

 sale_year       +0.092 — weak signal
 sale_month      +0.612 ✓ keep
 sale_quarter    +0.535 ✓ keep
 sale_dow        -0.218 — weak signal

Extracted datetime features:

 sale_date  sale_year  sale_month  sale_quarter  sale_dow  sale_price
2023-01-12       2023           1             1         3      245000
2023-04-03       2023           4             2         0      380000
2022-11-22       2022          11             4         1      210000
2023-06-15       2023           6             2         3      430000
2022-09-08       2022           9             3         3      295000
2023-03-28       2023           3             1         1      360000
2022-07-14       2022           7             3         3      400000
2023-08-01       2023           8             3         1      450000
2022-12-05       2022          12             4         0      220000
2023-05-19       2023           5             2         3      410000
What just happened?
pd.to_datetime() converts the text string into a real datetime object. The .dt accessor then unlocks .month, .quarter, .dayofweek, and .year. The correlation check separates signal from noise — sale_month (+0.612) and sale_quarter (+0.535) are keepers. sale_year and sale_dow are dropped. The raw date column itself gets dropped before training.
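That final step — dropping the raw date column once extraction is done — can be sketched like this (a minimal two-row example reusing the lesson's column names):

```python
import pandas as pd

df = pd.DataFrame({
    'sale_date': pd.to_datetime(['2023-01-12', '2023-04-03']),
    'sale_price': [245000, 380000]
})
df['sale_month'] = df['sale_date'].dt.month
df['sale_quarter'] = df['sale_date'].dt.quarter

# Models cannot consume a raw datetime64 column — drop it after extraction
model_ready = df.drop(columns=['sale_date'])
print(model_ready.columns.tolist())
```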
Boolean Features — Simple but Easy to Mishandle
Boolean features are the simplest type — True/False or yes/no columns. They're nearly model-ready but have two common traps. First, they're often stored as strings ("yes"/"no") rather than as actual booleans, requiring a conversion step. Second, a boolean column with 95% True and 5% False carries almost no information — the model will learn very little from it.
The scenario: The housing dataset has two yes/no columns: garage and garden. Before encoding them, you need to check their class balance — a heavily imbalanced binary feature is often not worth keeping. Then you convert them correctly and validate correlation with sale price.
import pandas as pd
housing_df = pd.DataFrame({
'garage': ['yes','yes','no','yes','no','yes','no','yes','yes','yes'],
'garden': ['yes','yes','yes','yes','yes','yes','yes','yes','yes','no'],
'sale_price': [245000,410000,182000,560000,295000,
348000,198000,620000,230000,580000]
})
# Check class balance before encoding
# A feature where one value makes up 90%+ of rows rarely adds useful signal
print("Class balance check:\n")
for col in ['garage', 'garden']:
    counts = housing_df[col].value_counts()
    pct = housing_df[col].value_counts(normalize=True).mul(100).round(1)
    print(f" {col}:")
    for val in counts.index:
        print(f"   {val:<6} {counts[val]} ({pct[val]}%)")
    print()
# Convert yes/no strings to 1/0 integers using comparison + .astype(int)
# (col == 'yes') creates a True/False Series, .astype(int) converts to 1/0
housing_df['has_garage'] = (housing_df['garage'] == 'yes').astype(int)
housing_df['has_garden'] = (housing_df['garden'] == 'yes').astype(int)
# Validate both features
print("Correlation with sale_price:\n")
for col in ['has_garage', 'has_garden']:
    corr = housing_df[col].corr(housing_df['sale_price'])
    print(f"  {col:<15} {corr:+.3f}")
Class balance check:
garage:
yes 7 (70.0%)
no 3 (30.0%)
garden:
yes 9 (90.0%)
no 1 (10.0%)
Correlation with sale_price:

  has_garage      +0.513
  has_garden      +0.149
What just happened?
.value_counts(normalize=True) shows proportions — the fastest way to spot a severely imbalanced binary column. garage is 70/30 with a solid +0.513 correlation — worth keeping. garden is 90% yes with only one "no" row, and +0.149 confirms almost no useful signal. A feature that is 90%+ the same value gives the model almost no contrast to learn from.
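That 90% rule of thumb can be wrapped in a small helper. This is an illustrative sketch, not a standard API — the function name and threshold are our own:

```python
import pandas as pd

df = pd.DataFrame({
    'has_garage': [1, 1, 0, 1, 0, 1, 0, 1, 1, 1],   # 70/30 split
    'has_garden': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],   # 90/10 split
})

# Hypothetical helper: flag columns whose majority class meets the threshold.
# value_counts(normalize=True) sorts descending, so .iloc[0] is the majority share
def near_constant(series, threshold=0.9):
    return series.value_counts(normalize=True).iloc[0] >= threshold

flagged = [col for col in df.columns if near_constant(df[col])]
print(flagged)
```

Running the check before encoding saves you from carrying near-constant columns into the model.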
The Feature Type Decision Guide
| Type | Subtype | Example | Model-ready? | Action |
|---|---|---|---|---|
| Numerical | Continuous | sqft, sale_price | Usually | Check skew, outliers, scale |
| Numerical | Discrete | num_bedrooms | Usually | Consider binning if low cardinality |
| Categorical | Nominal | neighbourhood | No | One-hot encode (drop_first=True) |
| Categorical | Ordinal | condition | No | Integer-map preserving order |
| Datetime | Any timestamp | sale_date | No | Parse → extract → drop original |
| Text | Free-form | description | No | Keyword flags, word count, TF-IDF |
| Boolean | Binary flag | garage, garden | After check | Check balance, convert to 1/0 |
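The text row is the one type this lesson doesn't demonstrate in code. A minimal sketch of two of the listed actions — word count and a keyword flag — on a made-up description column (the keyword 'renovated' is chosen purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({'description': [
    'bright renovated kitchen with garden view',
    'needs renovation, close to station',
    'spacious family home, renovated bathroom',
]})

# Word count — a crude but often useful length signal
df['word_count'] = df['description'].str.split().str.len()

# Keyword flag — a binary indicator for a term you expect to carry signal
df['mentions_renovated'] = df['description'].str.contains('renovated').astype(int)

print(df[['word_count', 'mentions_renovated']].to_string(index=False))
```

Note the second row: 'renovation' does not match the exact substring 'renovated', a reminder that naive keyword flags need careful term choice (or stemming) before you trust them.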
Teacher's Note
The most common typing mistake is treating an ordinal column as nominal and one-hot encoding it. If you one-hot encode condition (poor/fair/good/excellent) you get four binary columns and the model has no idea that excellent > good > fair > poor. It will learn that independently from the data — very slowly, and only if you have enough rows. An ordinal integer map hands that ranking directly to the model on day one.
The opposite error — integer-encoding a nominal feature — is worse. Telling a model that west=2 is twice as much as east=1 is factually wrong, and linear models will faithfully learn that false relationship. When in doubt, one-hot encode nominals. The slight increase in dimensionality is almost always worth the correctness.
Practice Questions
1. A categorical feature whose values have a meaningful natural order — such as poor, fair, good, excellent — is called ___ categorical.
2. When calling pd.get_dummies(), which argument removes one redundant binary column to prevent multicollinearity in linear models?
3. Before you can access .dt.month on a date column stored as a text string, you must first convert it using ___.
Quiz
1. Which encoding method is correct for a nominal categorical feature like neighbourhood?
2. Why is the garden feature in this lesson a poor candidate to include in a model?
3. For which type of model is it least important to fix a highly skewed numerical feature?
Up Next · Lesson 4
Numerical Features
A deep dive into engineering and transforming numerical features — log transforms, outlier capping, interaction ratios, and the complete validation workflow that keeps every new feature honest.