Feature Engineering Course
Feature Construction
The best features in your dataset don't always exist yet. Sometimes you have to build them — from ratios, differences, flags, and combinations of what you already have. That's feature construction, and it's where domain knowledge turns into model performance.
Feature construction (also called feature creation) is the process of engineering entirely new columns by combining, transforming, or extracting information from existing features. Unlike scaling or encoding — which reshape existing data — construction generates signals that were never directly measured but are often the most predictive thing in your entire dataset.
Why Constructed Features Often Outperform Raw Ones
A bank has monthly_income and monthly_debt as separate columns. Each one alone tells part of the story. But the ratio — debt_to_income_ratio — tells the whole story in a single number. A model could theoretically learn this relationship from raw data, but you're making its job dramatically easier — and more reliable on small datasets — by encoding the relationship directly.
Ratio features
Divide one feature by another to express a proportion or rate. Debt-to-income, click-through rate, revenue-per-user. Ratios normalise scale and often carry more signal than either raw value alone.
Difference features
Subtract one value from another to capture change, gap, or deviation. Price vs list price, actual vs expected, current vs historical. Differences highlight relative performance rather than absolute magnitude.
Flag features
Binary 0/1 indicators that capture whether a condition is met. Is the account overdue? Did the customer open the email? Has the property been renovated? Flags encode business rules directly into the feature space.
Aggregate features
Group-level statistics: the average spend per customer segment, total transactions per account, maximum price in a product category. Aggregates compress multi-row information into a single per-entity number.
Ratio and Difference Features
The scenario: You're a data scientist at a mortgage lending company. Your model predicts loan default risk. The underwriting team has told you that raw income and raw debt figures matter less than their relationship — how much of someone's income is already spoken for. You're going to construct three new features: debt_to_income_ratio, loan_to_value_ratio, and income_after_debt — all derived from columns already in the dataset.
# Import pandas
import pandas as pd
# Mortgage applicant data with raw financial features
mortgage_df = pd.DataFrame({
    'applicant_id': ['M01', 'M02', 'M03', 'M04', 'M05',
                     'M06', 'M07', 'M08', 'M09', 'M10'],
    'monthly_income': [5200, 8400, 3900, 12000, 6100,
                       4500, 9800, 7200, 3200, 11000],
    'monthly_debt': [1800, 2100, 2200, 1500, 3100,
                     1200, 2800, 4100, 1900, 1800],
    'loan_amount': [180000, 320000, 140000, 480000, 210000,
                    160000, 390000, 270000, 120000, 420000],
    'property_value': [220000, 400000, 175000, 600000, 250000,
                       200000, 480000, 310000, 150000, 520000]
})
# Ratio: monthly debt divided by monthly income — the core risk metric in lending
mortgage_df['debt_to_income'] = (mortgage_df['monthly_debt'] /
                                 mortgage_df['monthly_income']).round(3)
# Ratio: loan amount divided by property value — how much of the asset is borrowed
mortgage_df['loan_to_value'] = (mortgage_df['loan_amount'] /
                                mortgage_df['property_value']).round(3)
# Difference: how much income remains after servicing all existing debt each month
mortgage_df['income_after_debt'] = mortgage_df['monthly_income'] - mortgage_df['monthly_debt']
# Print the constructed features alongside the applicant IDs
print(mortgage_df[['applicant_id', 'monthly_income', 'monthly_debt',
                   'debt_to_income', 'loan_to_value', 'income_after_debt']].to_string(index=False))
applicant_id monthly_income monthly_debt debt_to_income loan_to_value income_after_debt
M01 5200 1800 0.346 0.818 3400
M02 8400 2100 0.250 0.800 6300
M03 3900 2200 0.564 0.800 1700
M04 12000 1500 0.125 0.800 10500
M05 6100 3100 0.508 0.840 3000
M06 4500 1200 0.267 0.800 3300
M07 9800 2800 0.286 0.812 7000
M08 7200 4100 0.569 0.871 3100
M09 3200 1900 0.594 0.800 1300
M10 11000 1800 0.164 0.808 9200
What just happened?
Three new columns were created entirely from arithmetic on existing ones. M03 and M08 both earn less than M04, but their debt_to_income ratios of 0.564 and 0.569 flag them as high-risk — over half their income already goes to debt. M04 earns the highest income and has a ratio of just 0.125. That risk story was invisible in the raw columns alone.
Flag Features
The scenario: You're working at a subscription box company building a churn prediction model. Your dataset tracks customer behaviour over time. The data team suspects that certain binary conditions — whether a customer has ever contacted support, whether they skipped a delivery, whether their last order was a discounted one — are stronger predictors of churn than any continuous feature. You're going to construct these as explicit binary flag columns.
# Import pandas
import pandas as pd
# Subscription customer behavioural data
sub_df = pd.DataFrame({
    'customer_id': ['C01', 'C02', 'C03', 'C04', 'C05',
                    'C06', 'C07', 'C08', 'C09', 'C10'],
    'support_contacts': [0, 3, 0, 1, 0, 5, 0, 2, 0, 1],
    'skipped_deliveries': [0, 0, 2, 0, 3, 1, 0, 4, 0, 0],
    'last_order_discount': [0, 15, 0, 0, 20, 10, 0, 25, 0, 5],
    'months_subscribed': [24, 3, 8, 14, 2, 5, 36, 6, 18, 11]
})
# Flag: has the customer ever contacted support? (1 if support_contacts > 0)
sub_df['has_contacted_support'] = (sub_df['support_contacts'] > 0).astype(int)
# Flag: has the customer ever skipped a delivery? (1 if skipped_deliveries > 0)
sub_df['has_skipped_delivery'] = (sub_df['skipped_deliveries'] > 0).astype(int)
# Flag: was the last order discounted at all? (1 if discount > 0)
sub_df['last_order_was_discounted'] = (sub_df['last_order_discount'] > 0).astype(int)
# Flag: is the customer a long-term subscriber? (1 if subscribed > 12 months)
sub_df['is_long_term'] = (sub_df['months_subscribed'] > 12).astype(int)
# Print the raw behavioural columns and the four constructed flags
flag_cols = ['customer_id', 'support_contacts', 'skipped_deliveries',
             'has_contacted_support', 'has_skipped_delivery',
             'last_order_was_discounted', 'is_long_term']
print(sub_df[flag_cols].to_string(index=False))
customer_id support_contacts skipped_deliveries has_contacted_support has_skipped_delivery last_order_was_discounted is_long_term
C01 0 0 0 0 0 1
C02 3 0 1 0 1 0
C03 0 2 0 1 0 0
C04 1 0 1 0 0 1
C05 0 3 0 1 1 0
C06 5 1 1 1 1 0
C07 0 0 0 0 0 1
C08 2 4 1 1 1 0
C09 0 0 0 0 0 1
C10 1 0 1 0 1 0
What just happened?
The boolean condition inside each parenthesis returns a True/False Series. .astype(int) converts that to 1 and 0. C06 now has all four risk flags lit — 5 support contacts, skipped a delivery, last order was discounted, and subscribed for only 5 months. That profile is a clean churn signal even before the model sees it.
Aggregate Features from Group Statistics
The scenario: You're a data analyst at a retail chain building a store performance model. Each row in your transaction dataset represents a single sale. Before modelling, you need to construct store-level aggregate features — the average transaction value per store, the total number of transactions, and the maximum single-sale amount. These group-level summaries compress hundreds of rows per store into a handful of powerful features.
# Import pandas
import pandas as pd
# Individual transaction records across three stores
transactions_df = pd.DataFrame({
    'transaction_id': range(1, 13),
    'store_id': ['S01', 'S01', 'S02', 'S01', 'S03', 'S02',
                 'S03', 'S01', 'S02', 'S03', 'S01', 'S02'],
    'sale_amount': [120, 340, 95, 210, 450, 180,
                    320, 155, 270, 490, 88, 310]
})
# groupby + agg computes multiple statistics per store in one call
# 'mean', 'sum', 'count', 'max' are all standard aggregation functions
store_features = transactions_df.groupby('store_id')['sale_amount'].agg(
    avg_transaction='mean',
    total_revenue='sum',
    transaction_count='count',
    max_sale='max'
).reset_index()
# Round avg_transaction for readability
store_features['avg_transaction'] = store_features['avg_transaction'].round(1)
# Merge the store-level features back onto the original transaction rows
# left join so every transaction keeps its row, now enriched with store stats
enriched_df = transactions_df.merge(store_features, on='store_id', how='left')
# Print the store-level aggregate feature table first
print("Store-level aggregate features:")
print(store_features.to_string(index=False))
print()
# Print the first six rows of the enriched transaction table
print("Enriched transaction rows (first 6):")
print(enriched_df.head(6).to_string(index=False))
Store-level aggregate features:
store_id avg_transaction total_revenue transaction_count max_sale
S01 182.6 913 5 340
S02 213.8 855 4 310
S03 420.0 1260 3 490
Enriched transaction rows (first 6):
transaction_id store_id sale_amount avg_transaction total_revenue transaction_count max_sale
1 S01 120 182.6 913 5 340
2 S01 340 182.6 913 5 340
3 S02 95 213.8 855 4 310
4 S01 210 182.6 913 5 340
5 S03 450 420.0 1260 3 490
6 S02 180 213.8 855 4 310
What just happened?
groupby().agg() collapsed all transactions per store into a single summary row. Then .merge() broadcast those store-level stats back onto every individual transaction row. Now each transaction knows its store's average sale and total revenue — context a model can use to judge whether this particular sale is above or below the norm for that location.
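When the goal is only to broadcast one group statistic back onto every row, without keeping a separate store-level table, pandas' groupby().transform() collapses the aggregate-then-merge pattern into a single call. A minimal sketch on the same toy data:

```python
import pandas as pd

# Same toy transaction data as above
transactions_df = pd.DataFrame({
    'store_id': ['S01', 'S01', 'S02', 'S01', 'S03', 'S02',
                 'S03', 'S01', 'S02', 'S03', 'S01', 'S02'],
    'sale_amount': [120, 340, 95, 210, 450, 180,
                    320, 155, 270, 490, 88, 310]
})

# transform returns a Series aligned to the original rows,
# so the per-store mean lands directly on each transaction
transactions_df['store_avg'] = (
    transactions_df.groupby('store_id')['sale_amount']
    .transform('mean')
    .round(1)
)
print(transactions_df.head(3))
```

The trade-off: transform computes one statistic per call, so when you need several store-level features at once, the agg-plus-merge pattern above remains the better fit.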
Combining Construction Techniques
The scenario: You're finalising the feature set for a credit card fraud detection model. The raw data has transaction amount, the customer's average transaction, and whether the merchant category is high-risk. A single constructed feature — amount_vs_avg_ratio — combines ratio logic with the group aggregate you just computed. Then a flag marks whether both the ratio is high and the merchant is risky. These compound features are often the most powerful ones in fraud models.
# Import pandas
import pandas as pd
# Credit card transaction data for fraud detection
fraud_df = pd.DataFrame({
    'txn_id': ['TX01', 'TX02', 'TX03', 'TX04', 'TX05',
               'TX06', 'TX07', 'TX08', 'TX09', 'TX10'],
    'amount': [45, 1800, 62, 33, 2400, 78, 55, 3100, 41, 90],
    'customer_avg_txn': [60, 55, 70, 40, 65, 80, 50, 72, 45, 88],
    'high_risk_merchant': [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
})
# Ratio: how many times larger is this transaction vs the customer's usual spend?
fraud_df['amount_vs_avg_ratio'] = (fraud_df['amount'] /
                                   fraud_df['customer_avg_txn']).round(2)
# Flag: is this transaction more than 5x the customer's average spend?
fraud_df['is_unusually_large'] = (fraud_df['amount_vs_avg_ratio'] > 5).astype(int)
# Compound flag: large AND at a high-risk merchant — the classic fraud signal
fraud_df['high_risk_flag'] = (
    (fraud_df['is_unusually_large'] == 1) &
    (fraud_df['high_risk_merchant'] == 1)
).astype(int)
# Print the constructed features for all transactions
print(fraud_df[['txn_id', 'amount', 'customer_avg_txn', 'amount_vs_avg_ratio',
                'is_unusually_large', 'high_risk_merchant', 'high_risk_flag']].to_string(index=False))
txn_id amount customer_avg_txn amount_vs_avg_ratio is_unusually_large high_risk_merchant high_risk_flag
TX01 45 60 0.75 0 0 0
TX02 1800 55 32.73 1 1 1
TX03 62 70 0.89 0 0 0
TX04 33 40 0.82 0 0 0
TX05 2400 65 36.92 1 1 1
TX06 78 80 0.97 0 0 0
TX07 55 50 1.10 0 0 0
TX08 3100 72 43.06 1 1 1
TX09 41 45 0.91 0 0 0
TX10 90 88 1.02 0 0 0
What just happened?
TX02, TX05, and TX08 were flagged by high_risk_flag. All three have amounts 30–43× the customer's normal spend AND were at high-risk merchants. The & operator combines two boolean conditions — both must be true for the compound flag to fire. TX10 spent slightly above average but at a normal merchant, so it stays clean.
Protect against division by zero
When constructing ratio features, always check whether the denominator can be zero. Use df['col'].replace(0, np.nan) or add a small epsilon before dividing to avoid silent infinities breaking your pipeline.
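A minimal sketch of the replace-with-NaN approach, using a hypothetical revenue/visits frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data where the denominator can be zero
df = pd.DataFrame({'revenue': [500, 0, 320],
                   'visits': [100, 0, 0]})

# pandas does not raise on division by zero; it returns inf.
# Swapping zero denominators for NaN keeps the ratio column
# finite-or-missing instead of silently infinite.
safe_visits = df['visits'].replace(0, np.nan)
df['revenue_per_visit'] = df['revenue'] / safe_visits
print(df)
```

NaN rows can then be imputed or flagged downstream, whereas inf values tend to break scalers and models much later in the pipeline, where the cause is harder to trace.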
Construct on train, apply to test
For aggregate features computed from group statistics, compute the aggregates on training data only and join them onto the test set. Computing aggregates on the full dataset before splitting leaks future group statistics into training — the same data leakage rule from Lessons 10 and 12 applies here.
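A minimal sketch of the leak-free pattern, using hypothetical pre-split train/test frames: compute the group statistics from training rows only, left-join them onto the test set, and fall back to a training-wide value for groups the training data never saw.

```python
import pandas as pd

# Hypothetical pre-split data
train = pd.DataFrame({'store_id': ['S01', 'S01', 'S02', 'S02'],
                      'sale_amount': [120, 340, 95, 180]})
test = pd.DataFrame({'store_id': ['S01', 'S02', 'S03'],
                     'sale_amount': [200, 150, 400]})

# Aggregates come from training rows only
train_stats = (train.groupby('store_id')['sale_amount']
               .agg(store_avg='mean')
               .reset_index())

# Left join onto test; S03 never appeared in training,
# so its store_avg is NaN after the merge
test = test.merge(train_stats, on='store_id', how='left')

# Fill unseen groups with a training-wide fallback
test['store_avg'] = test['store_avg'].fillna(train['sale_amount'].mean())
print(test)
```

The fallback step matters in production too: a brand-new store, customer, or merchant will always arrive eventually, and a NaN aggregate should degrade gracefully rather than crash the model.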
Teacher's Note
Feature construction is where your domain knowledge pays its biggest dividends. A ratio or flag that encodes something a subject-matter expert already knows — like a debt-to-income threshold or a fraud velocity rule — will almost always beat a raw column that a model has to discover the same insight from scratch. Talk to the people who understand the business problem. The best constructed features often come from a ten-minute conversation with an underwriter, a fraud analyst, or a customer success manager rather than from automated feature search tools.
Practice Questions
1. After creating a boolean condition like (df['col'] > 0), what method converts it to a binary 0/1 integer column?
2. Which pandas method is used to compute group-level aggregate statistics like mean or sum per category?
3. To avoid data leakage, group-level aggregate features should be computed on ________ data only.
Quiz
1. Which constructed feature best captures credit risk from income and debt columns?
2. What is a flag feature?
3. What should you do before constructing a ratio feature to avoid silent errors in production?
Up Next · Lesson 15
Feature Engineering Workflow
Bring everything together — a structured end-to-end process for applying FE techniques in the right order on any real dataset.