Feature Engineering Course
Introduction to Feature Engineering
Most people think building a good machine learning model is about picking the right algorithm. It isn't. It's about giving that algorithm the right information — and that's exactly what feature engineering does.
The Core Idea
Raw data describes things.
Features explain them.
A model doesn't understand that 1978 means "old". It needs you to calculate the age. A model doesn't understand that "SW1A" is an expensive postcode. It needs you to map it to a number. That translation — from raw description to meaningful signal — is feature engineering.
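That translation can be sketched in a few lines of plain Python. The postcode-to-band lookup table below is invented purely for illustration; a real project would derive it from actual area statistics:

```python
# A minimal sketch of the "raw description -> meaningful signal" translation.
CURRENT_YEAR = 2024

raw_year_built = 1978
raw_postcode = "SW1A 2AA"

# Translate a raw year into an age the model can reason about
house_age = CURRENT_YEAR - raw_year_built  # 46

# Translate a postcode district into a numeric income band
# (0 = low, 1 = medium, 2 = high) via a hypothetical lookup table
INCOME_BAND = {"SW1A": 2, "E1": 1, "BD3": 0}
area_income_band = INCOME_BAND[raw_postcode.split()[0]]  # 2

print(house_age, area_income_band)
```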
Two Agents, Same Spreadsheet
Here's the clearest way I know to explain what feature engineering actually does. Imagine two estate agents both trying to predict whether a house will sell above £300,000.
Agent A — Raw Data
Gets handed the spreadsheet exactly as it came from the database.
postcode: SW1A 2AA
sale_date: 2023-03-15
garage: yes
Stares at it. The numbers don't tell a story she can act on.
Agent B — Engineered Features
Transforms the same data before making a single prediction.
area_income_band: high
sale_month: 3 (March)
has_garage: 1
Now the data tells a real story a model can learn from.
Agent B didn't collect new data. He didn't change the algorithm. He just transformed what he already had into a form the model could actually use. That's feature engineering — nothing more, nothing less.
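Agent B's transformations can be sketched with pandas. The income-band lookup here is a made-up stand-in for whatever real mapping you would build:

```python
import pandas as pd

# Agent A's raw spreadsheet row, exactly as it came from the database
raw = pd.DataFrame({
    'postcode': ['SW1A 2AA'],
    'sale_date': ['2023-03-15'],
    'garage': ['yes'],
})

# Agent B's transformations: same data, model-readable form
income_band = {'SW1A': 'high', 'E1': 'medium'}  # hypothetical lookup
raw['area_income_band'] = raw['postcode'].str.split().str[0].map(income_band)
raw['sale_month'] = pd.to_datetime(raw['sale_date']).dt.month
raw['has_garage'] = (raw['garage'] == 'yes').astype(int)

print(raw[['area_income_band', 'sale_month', 'has_garage']])
```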
Why It Matters More Than the Algorithm
Machine learning competitions have proven this repeatedly: the winning teams rarely win because of a better algorithm. They win because of better features. A simple logistic regression fed brilliant features will outperform a deep neural network fed raw garbage almost every time.
The reason is mechanical. Algorithms are optimisers — they find patterns in the numbers you give them. But they can only find patterns that are present and visible in the data. If signal is buried in the wrong format, mislabelled, or drowned in noise, the algorithm has nothing to grip. No amount of hyperparameter tuning fixes that.
House age vs year built
A model sees year_built = 1978 as a slightly-below-average integer. It sees house_age = 46 as something it can learn correlates with renovation costs, resale difficulty, and buyer hesitation. Same underlying fact — completely different learning opportunity.
Loan-to-income ratio
A bank has loan_amount = £320k and income = £95k sitting in separate columns. A lender's risk officer cares about the ratio: 3.37× income. Neither column alone tells that story. Engineering loan_to_income_ratio packages exactly the signal the model needs.
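A minimal sketch of that calculation, using the figures above:

```python
import pandas as pd

loans = pd.DataFrame({
    'loan_amount': [320_000],
    'income': [95_000],
})

# Neither column alone carries the risk signal; the ratio does
loans['loan_to_income_ratio'] = loans['loan_amount'] / loans['income']

print(loans['loan_to_income_ratio'].round(2).iloc[0])  # 3.37
```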
Purchase timing
A raw timestamp like 2023-12-22 18:43:07 is nearly useless to a model. Extracting days_until_christmas = 3, is_evening = 1, and is_weekend = 0 gives it the seasonal and behavioural context it needs to predict purchase likelihood accurately.
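Here is one way to pull those three features out of that timestamp with pandas. Note the 6 p.m. cutoff for "evening" is an assumption for this sketch, not a standard:

```python
import pandas as pd

ts = pd.Timestamp('2023-12-22 18:43:07')

# Seasonal context: days until Christmas of the same year
christmas = pd.Timestamp(year=ts.year, month=12, day=25)
days_until_christmas = (christmas - ts.normalize()).days  # 3

# Behavioural context: time of day and day of week
is_evening = int(ts.hour >= 18)      # 18:43 -> 1
is_weekend = int(ts.dayofweek >= 5)  # 2023-12-22 is a Friday -> 0

print(days_until_christmas, is_evening, is_weekend)
```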
The Five Things Feature Engineering Does
Feature engineering isn't one technique — it's a collection of five activities. You'll learn all of them across this course. Here's the map:
| Activity | What it means | Example | Covered in |
|---|---|---|---|
| Creation | Build new columns from existing ones | house_age = 2024 − year_built | Lessons 2, 14, 17 |
| Transformation | Change how a column is distributed or scaled | log(sale_price) to fix skew | Lessons 10, 12, 23 |
| Encoding | Convert text categories into numbers | "north" → 0, "east" → 1 | Lessons 5, 13, 18–21 |
| Cleaning | Handle missing values and outliers | Fill NaN income with median | Lessons 8, 9, 29 |
| Selection | Decide which features to keep or drop | Drop house_id — no signal | Lessons 24–28, 43 |
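The whole table can be exercised on one tiny invented dataset, one line per activity. The column values and the direction codes are illustrative only:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'house_id': [101, 102, 103],
    'year_built': [1978, 2005, 1963],
    'direction': ['north', 'east', 'north'],
    'income': [52_000, np.nan, 61_000],
    'sale_price': [245_000, 410_000, 182_000],
})

# Creation: build a new column from an existing one
df['house_age'] = 2024 - df['year_built']

# Transformation: compress the skewed target with a log
df['log_sale_price'] = np.log(df['sale_price'])

# Encoding: text categories to numbers
df['direction_code'] = df['direction'].map({'north': 0, 'east': 1})

# Cleaning: fill the missing income with the median
df['income'] = df['income'].fillna(df['income'].median())

# Selection: drop the identifier, it carries no signal
df = df.drop(columns=['house_id'])

print(df.columns.tolist())
```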
Your First Feature in Code
The scenario: You've just joined a property analytics team. Your manager asks one question: "Does the age of a house affect its sale price?" The dataset has year_built. It doesn't have house_age. Your job is to create that feature — then validate it against the target before declaring it useful. This is the complete workflow in miniature: observe, create, validate.
```python
# pandas is Python's core library for working with tabular data
# We always import it as pd — universal convention across the industry
import pandas as pd

# A small housing dataset — 8 realistic rows
housing_df = pd.DataFrame({
    'year_built': [1978, 2005, 1963, 2018, 1991, 2001, 1975, 2015],
    'sqft': [1200, 2100, 980, 2850, 1450, 1800, 1050, 3100],
    'sale_price': [245000, 410000, 182000, 560000,
                   295000, 348000, 198000, 620000]
})

# FEATURE CREATION — translate year_built into something the model understands
# 1978 is just a number; 46 years old is a meaningful concept
housing_df['house_age'] = 2024 - housing_df['year_built']

# VALIDATION — does this new feature actually correlate with sale price?
# .corr() returns -1 (perfect negative) to +1 (perfect positive)
# Near 0 = no linear relationship — feature probably not worth keeping
corr_age = housing_df['house_age'].corr(housing_df['sale_price'])
corr_sqft = housing_df['sqft'].corr(housing_df['sale_price'])

print(f"house_age vs sale_price: {corr_age:+.3f}")
print(f"sqft vs sale_price: {corr_sqft:+.3f}")
print()
print(housing_df[['year_built', 'house_age', 'sqft', 'sale_price']])
```

```
house_age vs sale_price: -0.891
sqft vs sale_price: +0.968

   year_built  house_age  sqft  sale_price
0        1978         46  1200      245000
1        2005         19  2100      410000
2        1963         61   980      182000
3        2018          6  2850      560000
4        1991         33  1450      295000
5        2001         23  1800      348000
6        1975         49  1050      198000
7        2015          9  3100      620000
```
house_age correlates at −0.891 with sale price — a strong negative signal. Older houses sell for less, so the feature passes validation. sqft is even stronger at +0.968. Both are worth keeping. This is the habit: create, then immediately check the number before the feature goes anywhere near a model.
Practice this in your browser — no setup needed
Open Google Colab, paste any code block from this course, and press Shift + Enter. Try changing year_built values or adding extra rows and watch how the correlation shifts. Experimenting hands-on will build intuition faster than reading ever will.
The Three Questions Every Feature Engineer Asks
Before you open a code editor, feature engineering starts in your head. The best practitioners run three questions on every column they encounter:
What does this column really mean?
year_built isn't just a year — it's a proxy for age, renovation likelihood, energy efficiency, building code era, and buyer appeal. Understanding the real-world meaning reveals which transformations are worth trying.
What would a domain expert want to know?
An estate agent doesn't think "this house was built in 1978." They think "this house is 46 years old — it probably needs a new boiler." That expert's mental model is your blueprint. Talk to the people who understand the business problem; they'll tell you which features matter.
Does the signal I created actually exist in this data?
Every feature idea is a hypothesis. Maybe house age doesn't predict price in your specific market. You check, every time. Intuition points the direction — data either confirms it or sends you back. Never assume. Always measure.
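Here is that check in miniature, reusing the housing numbers from earlier plus a plausible-sounding but weak candidate feature, a sequential house_id (the ids are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'house_id': [101, 102, 103, 104, 105, 106, 107, 108],
    'house_age': [46, 19, 61, 6, 33, 23, 49, 9],
    'sale_price': [245000, 410000, 182000, 560000,
                   295000, 348000, 198000, 620000],
})

# Hypothesis A: age predicts price — the data confirms it (strong negative)
print(round(df['house_age'].corr(df['sale_price']), 2))

# Hypothesis B: house_id predicts price — much weaker; reject the idea
print(round(df['house_id'].corr(df['sale_price']), 2))
```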
Where This Course Takes You
Lessons 1–15
Beginner
Feature types, numerical and categorical handling, datetime extraction, missing data, outliers, scaling, encoding fundamentals, and building your first complete FE workflow.
Lessons 16–35
Intermediate
Polynomial and interaction features, target encoding, weight of evidence, wrapper and filter selection methods, rolling windows, lag features, and imbalanced datasets.
Lessons 36–45
Advanced
PCA, NLP-specific FE, time series engineering, computer vision features, automated FE with Featuretools, ML-based selection, and a complete end-to-end project.
Teacher's Note
Feature engineering is where data science meets domain knowledge. The more you understand the actual problem you're solving — the business, the users, the context — the better your features will be. A data scientist who's spent an afternoon talking to underwriters will engineer better loan features than someone who's only read about credit risk in papers.
As you work through this course, keep asking: "What would a real expert in this field care about?" That question will unlock more good features than any formula ever could.
Practice Questions
1. The process of transforming raw data into meaningful inputs that improve a model's ability to learn is called ___.
2. After creating a new feature, you should always check its ___ with the target variable to confirm it carries useful signal before adding it to a model.
3. Building a new column from existing ones — such as calculating house_age from year_built — is the feature engineering activity called feature ___.
Quiz
1. According to this lesson, what do most people incorrectly believe is the key to building a good ML model?
2. Why is creating house_age from year_built considered good feature engineering?
3. Dropping house_id from a dataset because it carries no predictive signal is an example of which feature engineering activity?
Up Next · Lesson 2
What Are Features?
Get a precise definition of what a feature actually is — and learn to tell features apart from targets, identifiers, and leaky columns that will silently destroy your model's real-world performance.