Feature Engineering Course
What Are Features?
Everyone uses the word "feature" — but very few beginners get a definition precise enough to actually work with. Let's fix that, because everything in this course builds on it.
A feature is any measurable property of the thing you're predicting. One row = one observation. The features describe it. The target is what you're trying to predict about it. Not every column is a feature — some are identifiers, some are targets, some are leakage risks. Deciding what belongs in the model, and in what form, is the job.
Features vs Targets vs Non-Features
Every column in a dataset plays exactly one of four roles. Getting this classification right before you write a single line of model code prevents the most common and most expensive beginner mistakes.
Feature — model input
sqft, house_age, has_garage, neighbourhood_encoded — measurable properties that help the model learn the relationship with the target.
Target — model output
sale_price, approved — what the model is trying to predict. Never used as input. Using it as input is data leakage.
Identifier — exclude before training
house_id, listing_url — unique row labels carry zero predictive signal and cause models to memorise training data rather than learning general patterns.
Leaky feature — dangerous to include
days_on_market — you don't know this at the moment you'd use the model to predict price. Looks amazing in test metrics. Fails completely in production.
The Four Feature Types
Every feature in every dataset falls into one of four types. The type determines exactly what you do with it before it can enter a model.
🔢 Numerical
Continuous numbers a model can do maths on directly — sqft, sale_price, house_age, credit_score.
Action: Check for skew, outliers, and scale differences. Usually model-ready after that.
🏷️ Categorical
Text labels the model cannot compute on — neighbourhood, employment_type, property_type.
Action: Encode into numbers before use — one-hot, ordinal, or target encoding depending on the situation.
📅 Datetime
Timestamps hiding multiple features inside — sale_date, application_date, last_login.
Action: Parse the string, then extract year, month, day-of-week, quarter, season — each becomes its own feature.
📝 Text
Free-form written content — property_description, review_text, support_ticket.
Action: Tokenise, clean, and vectorise — the most complex type, covered in depth in Lesson 7.
Running a Feature Type Audit
The scenario: You're a data scientist at a mortgage lending company. A colleague sends over a loan application dataset and says: "Can you run a quick audit before we start modelling? I need to know which features are ready to use, which ones need transformation, and whether there are any columns we should drop entirely." This is the first thing you do with any new dataset — classify every column before touching a single value.
# pandas — core data manipulation library, always imported as pd
import pandas as pd
# numpy — numerical Python library, imported as np
# used here for np.nan — the standard way to represent a missing value
import numpy as np
# A realistic loan applications dataset — 10 rows, 8 columns
loan_df = pd.DataFrame({
    'application_id': ['L001','L002','L003','L004','L005',
                       'L006','L007','L008','L009','L010'],
    'application_date': ['2023-01-15','2023-02-03','2023-02-18','2023-03-07',
                         '2023-03-22','2023-04-10','2023-05-01','2023-05-19',
                         '2023-06-08','2023-06-25'],
    'loan_amount': [150000,320000,85000,500000,210000,
                    175000,430000,95000,280000,360000],
    'annual_income': [52000,95000,38000,140000,67000,
                      48000,112000,41000,78000,105000],
    'employment_type': ['employed','self-employed','employed','employed',
                        'unemployed','employed','self-employed',
                        'employed','employed','self-employed'],
    'credit_score': [720,810,640,890,580,695,775,660,740,820],
    'property_type': ['apartment','house','apartment','house','apartment',
                      'house','house','apartment','house','apartment'],
    'approved': [1,1,0,1,0,0,1,0,1,1]  # target: 1=approved, 0=rejected
})
# .dtypes shows the data type pandas assigned to every column
# object = text/string | int64 = integer | float64 = decimal number
print("Column dtypes:\n")
print(loan_df.dtypes)
# A manual audit dictionary — classify every column by its role
# This becomes a shared reference document for the whole team
audit = {
    'application_id': 'IDENTIFIER — drop before training, no signal',
    'application_date': 'DATETIME — parse and extract month, quarter, DOW',
    'loan_amount': 'NUMERICAL — check skew and outliers',
    'annual_income': 'NUMERICAL — check skew and outliers',
    'employment_type': 'CATEGORICAL — one-hot encode before use',
    'credit_score': 'NUMERICAL — already on a standard scale',
    'property_type': 'CATEGORICAL — one-hot encode before use',
    'approved': 'TARGET — predict this, never use as input'
}
print("\nFeature audit:\n")
for col, verdict in audit.items():
    print(f" {col:<20} {verdict}")
Column dtypes:
application_id object
application_date object
loan_amount int64
annual_income int64
employment_type object
credit_score int64
property_type object
approved int64
dtype: object
Feature audit:
application_id IDENTIFIER — drop before training, no signal
application_date DATETIME — parse and extract month, quarter, DOW
loan_amount NUMERICAL — check skew and outliers
annual_income NUMERICAL — check skew and outliers
employment_type CATEGORICAL — one-hot encode before use
credit_score NUMERICAL — already on a standard scale
property_type CATEGORICAL — one-hot encode before use
approved TARGET — predict this, never use as input
What just happened?
.dtypes is a pandas attribute — no parentheses — that returns the stored data type of every column. object means text/string. Notice application_date is object — a plain string that pandas cannot do date arithmetic on until you parse it with pd.to_datetime(). Three of the eight columns need transformation before a model can use them. The audit makes that visible before anyone writes a line of modelling code.
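A first pass of this audit can even be scripted from the dtypes. The helper below is a sketch (suggest_types is my own name, not a pandas function); it suggests a type for each column, while the role call (identifier, target, leaky) still needs human judgment:

```python
import pandas as pd

def suggest_types(df: pd.DataFrame) -> dict:
    """First-pass type suggestions from pandas dtypes alone."""
    suggestions = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            suggestions[col] = 'NUMERICAL'
        elif pd.api.types.is_datetime64_any_dtype(s):
            suggestions[col] = 'DATETIME'
        else:
            # object columns: try a datetime parse on a small sample;
            # otherwise treat low-cardinality text as categorical
            try:
                pd.to_datetime(s.head(), format='%Y-%m-%d')
                suggestions[col] = 'DATETIME (stored as text, parse it)'
            except (ValueError, TypeError):
                unique_ratio = s.nunique() / len(s)
                if unique_ratio > 0.9:
                    suggestions[col] = 'TEXT / IDENTIFIER? (nearly all unique)'
                else:
                    suggestions[col] = 'CATEGORICAL'
    return suggestions

df = pd.DataFrame({
    'application_id': ['L001', 'L002', 'L003'],
    'application_date': ['2023-01-15', '2023-02-03', '2023-02-18'],
    'loan_amount': [150000, 320000, 85000],
    'employment_type': ['employed', 'self-employed', 'employed'],
})
for col, suggestion in suggest_types(df).items():
    print(f" {col:<18} {suggestion}")
```

Treat the output as a starting point for the manual audit, not a replacement: a dtype check can never tell you that a perfectly numeric column is leaky.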
Numerical Features — Checking the Distribution
The scenario: Your manager looks at the loan dataset and says: "Two of our numerical columns — loan amount and annual income — look very different in scale. A £500k loan sitting next to a £38k salary might confuse a distance-based model like KNN. Can you show me the distributions and tell me if we have a skew problem before we decide on scaling?" You pull summary statistics and run a skewness check.
import pandas as pd
loan_df = pd.DataFrame({
    'loan_amount': [150000,320000,85000,500000,210000,
                    175000,430000,95000,280000,360000],
    'annual_income': [52000,95000,38000,140000,67000,
                      48000,112000,41000,78000,105000],
    'credit_score': [720,810,640,890,580,695,775,660,740,820],
    'approved': [1,1,0,1,0,0,1,0,1,1]
})
# .describe() runs count, mean, std, min, quartiles and max in one shot
# std = standard deviation — measures how spread out values are from the mean
# A large gap between mean and max hints at right skew or outliers
print("Summary statistics:\n")
print(loan_df[['loan_amount','annual_income','credit_score']].describe().round(0))
# .skew() measures distributional asymmetry
# Positive = long right tail (a few very large values pulling the mean up)
# Negative = long left tail
# Rule of thumb: |skew| > 1.0 = highly skewed, consider log transform
print("\nSkewness check:\n")
for col in ['loan_amount', 'annual_income', 'credit_score']:
    sk = loan_df[col].skew()
    flag = " ← consider log transform" if abs(sk) > 1.0 else " ✓ acceptable"
    print(f" {col:<20} {sk:+.3f}{flag}")
Summary statistics:
loan_amount annual_income credit_score
count 10.0 10.0 10.0
mean 260500.0 77600.0 733.0
std 141489.0 34452.0 94.0
min 85000.0 38000.0 580.0
25% 156250.0 49000.0 669.0
50% 245000.0 72500.0 730.0
75% 350000.0 102500.0 801.0
max 500000.0 140000.0 890.0
Skewness check:
loan_amount +0.381 ✓ acceptable
annual_income +0.525 ✓ acceptable
credit_score +0.038 ✓ acceptable
What just happened?
.describe() runs count, mean, std, min, quartiles, and max in one shot — the fastest statistical profile of any numerical column. .skew() measures distributional asymmetry. None of the three features are severely skewed (all under |1.0|), so log transformation is not needed here. The real issue is scale: loan_amount ranges 85k–500k while credit_score runs 580–890. In a KNN model, that difference would make credit score nearly invisible in every distance calculation.
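That scale gap can be closed with standardisation. Here is a minimal sketch using plain pandas so the arithmetic stays visible; in a real pipeline you would fit something like scikit-learn's StandardScaler on the training split only, then apply it to the test split:

```python
import pandas as pd

loan_df = pd.DataFrame({
    'loan_amount': [150000, 320000, 85000, 500000, 210000,
                    175000, 430000, 95000, 280000, 360000],
    'credit_score': [720, 810, 640, 890, 580, 695, 775, 660, 740, 820],
})

# z-score standardisation: (value - mean) / std
# Afterwards both columns have mean ~0 and std 1, so neither one
# dominates a distance calculation purely because of its units
scaled = (loan_df - loan_df.mean()) / loan_df.std()

print(scaled.round(2))
```

After scaling, a £500k loan and an 890 credit score are both just "a couple of standard deviations above average", which is exactly the footing a distance-based model needs.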
Categorical Features — Why Strings Break Models
The scenario: A junior data scientist on your team asks: "Why can't we just leave employment_type as text? Surely the model can figure out that 'employed' is different from 'unemployed'?" You write a demonstration that shows exactly what happens — and then show the correct encoding.
import pandas as pd
loan_df = pd.DataFrame({
    'employment_type': ['employed','self-employed','employed','employed',
                        'unemployed','employed','self-employed',
                        'employed','employed','self-employed'],
    'approved': [1,1,0,1,0,0,1,0,1,1]
})
# .value_counts() shows category distribution — check for imbalance before encoding
# A category with only 1 row causes problems in train/test splits
print("Category distribution:\n")
print(loan_df['employment_type'].value_counts())
# pd.get_dummies() — pandas built-in one-hot encoder
# Creates one binary column per category (1 = row has that category, 0 = it doesn't)
# prefix='emp' adds a readable prefix to every new column name
# drop_first=True drops one column to avoid the dummy variable trap
# (if self-employed=0 and unemployed=0, the person must be employed — no third column needed)
encoded = pd.get_dummies(loan_df['employment_type'],
                         prefix='emp',
                         drop_first=True,
                         dtype=int)  # force 0/1 ints; recent pandas otherwise returns True/False
result = pd.concat([encoded, loan_df['approved']], axis=1)
print("\nAfter one-hot encoding:\n")
print(result.to_string())
Category distribution:
employed 5
self-employed 3
unemployed 2
Name: employment_type, dtype: int64
After one-hot encoding:
emp_self-employed emp_unemployed approved
0 0 0 1
1 1 0 1
2 0 0 0
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 0
8 0 0 1
9 1 0 1
What just happened?
.value_counts() checks category distribution before encoding — a category that appears only once will cause problems in train/test splits. pd.get_dummies() is pandas' built-in one-hot encoder. prefix='emp' adds a readable prefix to every new column name. drop_first=True removes the 'employed' column — it is implied when both other flags are 0. The string "employed" is gone. The model gets pure 0/1 integers it can compute with.
Datetime Features — One Column, Many Signals
The scenario: Your team lead says: "There's probably a seasonal pattern in loan approvals — January applications might behave differently from June ones. Day-of-week could matter too. Can you extract the useful time signals from application_date before we throw the raw timestamp away?" You parse the column and pull out everything that could carry signal.
import pandas as pd
loan_df = pd.DataFrame({
    'application_date': ['2023-01-15','2023-02-03','2023-02-18','2023-03-07',
                         '2023-03-22','2023-04-10','2023-05-01','2023-05-19',
                         '2023-06-08','2023-06-25'],
    'approved': [1,1,0,1,0,0,1,0,1,1]
})
# pd.to_datetime() converts text strings into proper datetime64 objects
# Without this step, Python treats '2023-01-15' as plain text — no date maths possible
loan_df['application_date'] = pd.to_datetime(loan_df['application_date'])
# .dt is the datetime accessor — it unlocks all date/time properties on a Series
# Without .dt, calling .month on a column throws an AttributeError
loan_df['app_month'] = loan_df['application_date'].dt.month # 1=Jan, 12=Dec
loan_df['app_quarter'] = loan_df['application_date'].dt.quarter # Q1–Q4
loan_df['app_dow'] = loan_df['application_date'].dt.dayofweek # 0=Mon, 6=Sun
loan_df['app_year'] = loan_df['application_date'].dt.year
# Show extracted features alongside the original and target
print(loan_df[['application_date','app_month','app_quarter',
               'app_dow','app_year','approved']].to_string(index=False))
application_date app_month app_quarter app_dow app_year approved
2023-01-15 1 1 6 2023 1
2023-02-03 2 1 4 2023 1
2023-02-18 2 1 5 2023 0
2023-03-07 3 1 1 2023 1
2023-03-22 3 1 2 2023 0
2023-04-10 4 2 0 2023 0
2023-05-01 5 2 0 2023 1
2023-05-19 5 2 4 2023 0
2023-06-08 6 2 3 2023 1
2023-06-25 6 2 6 2023 1
What just happened?
pd.to_datetime() converts text strings into datetime64 objects — mandatory before any date arithmetic. The .dt accessor unlocks all date and time properties on a Series. Without it, calling .month on a column throws an AttributeError. One raw text column became four usable numeric features. Each would then be validated against the target before deciding which to keep.
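One refinement the scenario does not need yet but that follows directly from this extraction: month 12 and month 1 are neighbours in time yet numerically far apart. A common fix, sketched below, is to encode a cyclical feature as a sine/cosine pair:

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 2, 3, 6, 12])

# Map each month onto a circle so December (12) sits next to January (1):
# the month index becomes an angle, and sin/cos give its coordinates
angle = 2 * np.pi * (months - 1) / 12
month_sin = np.sin(angle)
month_cos = np.cos(angle)

cyc = pd.DataFrame({'month': months,
                    'month_sin': month_sin.round(3),
                    'month_cos': month_cos.round(3)})
print(cyc.to_string(index=False))
```

On the circle, the straight-line distance between December and January is about 0.52, versus exactly 2.0 between June and December, which matches their true seasonal relationship far better than raw month numbers do.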
Text Features — From Words to Numbers
The scenario: Each loan application includes a short free-text field where applicants describe their employment situation. Your team lead asks: "Can we get any signal from those descriptions? Can we tell from the words alone whether someone is likely to get approved?" You start with the simplest possible text feature engineering — word counts and keyword flags — before the full NLP pipeline covered in Lesson 7.
import pandas as pd
loan_df = pd.DataFrame({
    'description': [
        'stable government job for ten years',
        'running my own consultancy business since 2018',
        'between jobs currently looking for work',
        'senior engineer at large tech company permanent contract',
        'recently made redundant still searching',
        'part time work irregular hours',
        'director of ltd company profitable for five years',
        'temporary contract ending soon',
        'nurse full time NHS permanent',
        'freelance designer multiple clients regular income'
    ],
    'approved': [1,1,0,1,0,0,1,0,1,1]
})
# Basic text feature 1: word count — longer descriptions may indicate more stability
# .str.split() splits each string into a list of words
# .str.len() counts how many words are in that list
loan_df['word_count'] = loan_df['description'].str.split().str.len()
# Basic text feature 2: keyword flags — presence of high-signal words
# .str.contains() returns True/False — .astype(int) converts to 1/0
loan_df['has_permanent'] = loan_df['description'].str.contains(
    'permanent|stable|full time', case=False).astype(int)
loan_df['has_risk_word'] = loan_df['description'].str.contains(
    'redundant|temporary|irregular|searching|between', case=False).astype(int)
# Check whether our new text features correlate with approval
print("Text feature correlations with approved:\n")
for col in ['word_count', 'has_permanent', 'has_risk_word']:
    corr = loan_df[col].corr(loan_df['approved'])
    print(f" {col:<18} {corr:+.3f}")
print("\nSample output:\n")
print(loan_df[['description','word_count',
               'has_permanent','has_risk_word','approved']].to_string(index=False))
Text feature correlations with approved:
word_count +0.645
has_permanent +0.535
has_risk_word -1.000
Sample output:
description word_count has_permanent has_risk_word approved
stable government job for ten years 6 1 0 1
running my own consultancy business since 2018 7 0 0 1
between jobs currently looking for work 6 0 1 0
senior engineer at large tech company permanent contract 8 1 0 1
recently made redundant still searching 5 0 1 0
part time work irregular hours 5 0 1 0
director of ltd company profitable for five years 8 0 0 1
temporary contract ending soon 4 0 1 0
nurse full time NHS permanent 5 1 0 1
freelance designer multiple clients regular income 6 0 0 1
What just happened?
.str.split().str.len() chains two string operations: split each description into a list of words, then count how many are in that list. .str.contains() checks for a keyword pattern; the pipe | means "or" and case=False ignores capitalisation. In this tiny hand-built sample the keyword flags are extremely strong: has_risk_word correlates at −1.000 because every rejected description happens to contain a risk word, and has_permanent comes in at +0.535. Even raw word count lands at +0.645. Treat the perfect −1.000 as an artifact of ten carefully written rows rather than something you would see in production, but the point stands: two keyword flags pulled real signal out of free text.
Try it yourself — extend the keyword list
Open Google Colab and paste the text features code block. Then add your own keywords to the has_permanent and has_risk_word patterns — try adding "contract", "pension", or "director" — and rerun the correlations. You'll immediately see how choosing keywords deliberately versus randomly changes the signal you extract.
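For a taste of where Lesson 7 takes this, here is the vectorising step mentioned in the audit table, sketched with scikit-learn's TfidfVectorizer on three of the descriptions (assuming scikit-learn is available):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    'stable government job for ten years',
    'between jobs currently looking for work',
    'temporary contract ending soon',
]

# TF-IDF turns each description into a weighted word-count vector:
# a word scores high in a document when it is frequent there
# but rare across the other documents
vec = TfidfVectorizer()
X = vec.fit_transform(descriptions)

print("Vocabulary size:", len(vec.vocabulary_))
print("Matrix shape:", X.shape)
```

Instead of two hand-picked flags, every distinct word becomes a feature with a learned weight; the cost is dimensionality, which is why Lesson 7 spends time on cleaning and pruning the vocabulary.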
The Complete Feature Audit Table
Here is the full loan dataset classified — every column mapped to its type, role, required action, and whether it goes into the model:
| Column | Type | Role | Action required | Include? |
|---|---|---|---|---|
| application_id | Identifier | Non-feature | Drop before training | ✗ |
| application_date | Datetime | Feature source | Extract month, quarter, day-of-week | → |
| loan_amount | Numerical | Feature | Scale before use | ✓ |
| annual_income | Numerical | Feature | Scale before use | ✓ |
| employment_type | Categorical | Feature | One-hot encode | ✓ |
| credit_score | Numerical | Feature | Ready to use | ✓ |
| property_type | Categorical | Feature | One-hot encode | ✓ |
| description | Text | Feature source | Keyword flags, word count; TF-IDF in L7 | → |
| approved | Binary | Target | Predict this — never use as input | ✗ |
Teacher's Note
The most expensive mistake in feature engineering is including a leaky feature — something containing future information about the target. In the loan dataset, imagine a column called underwriter_flag that an underwriter manually adds after reviewing the application outcome. If you train on that, your model learns from a human decision that already incorporated the approval result. It'll score brilliantly in testing and be completely useless in production.
Always ask before including any column: "At the exact moment I would use this model to make a prediction, would this feature's value be available?" If the answer is "only after the fact" — it's leakage. Drop it without hesitation.
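That test can be made concrete with a toy version of the underwriter_flag example. The column below is hypothetical and deliberately constructed to mirror the outcome, which is exactly what a post-decision annotation does:

```python
import pandas as pd

df = pd.DataFrame({
    'credit_score': [720, 810, 640, 890, 580, 695, 775, 660, 740, 820],
    # Hypothetical leaky column: an underwriter's note recorded AFTER
    # the decision, so it simply copies the outcome
    'underwriter_flag': [1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
    'approved': [1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
})

print("Correlation with the target:\n")
print(df.corr()['approved'].drop('approved').round(3))
# underwriter_flag scores a perfect 1.0 because it IS the outcome in
# disguise; it would not exist at prediction time, so any model
# leaning on it collapses in production
```

An honest feature like credit_score shows a strong but imperfect correlation; a leaky one sits suspiciously close to 1.0. A near-perfect single-feature correlation is itself a red flag worth investigating.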
Practice Questions
1. The column a model is trained to predict — such as approved or sale_price — is called the ___.
2. The pandas function used to perform one-hot encoding on a categorical column is pd.___().
3. To access datetime properties like .month or .dayofweek on a pandas Series, you must first use the ___ accessor.
Quiz
1. Why should a column like application_id be dropped before model training?
2. From the text feature engineering results, which engineered feature had the strongest relationship with loan approval?
3. What is a leaky feature?
Up Next · Lesson 3
Types of Features
Go deep on continuous vs discrete numerics, nominal vs ordinal categoricals, and exactly which transformation each type demands before a model can use it.