Feature Engineering Course
Automated Feature Engineering
Hand-crafting features is powerful but slow. A senior data scientist might spend two weeks building 30 features. Automated feature engineering can generate 300 candidate features in minutes — then your job shifts from creation to selection and validation.
Automated feature engineering tools apply a systematic set of transformation primitives — aggregations, ratios, lags, cumulative stats — across all numerical and categorical columns, across multiple related tables if available. The output is a large candidate feature matrix. The challenge is not generating features; it's separating the genuinely useful ones from the noise.
Manual vs Automated Feature Engineering — The Real Trade-off
Automation does not replace domain knowledge — it scales it. A human engineer who understands the business problem still needs to define which tables to join, which entities are meaningful, and which generated features actually make business sense. The machine generates; the human curates.
Manual Feature Engineering
Slow — days or weeks per feature set. Deep domain expertise required. Features are interpretable and intentional. Easy to validate business logic. Small feature count — usually 20–100 columns. Every feature has a reason.
Automated Feature Engineering
Fast — hundreds of features in minutes. Finds non-obvious combinations a human would overlook. Requires aggressive selection to remove noise. Risk of generating features that are numerically valid but meaningless. Excellent for exploration and Kaggle-style competitions.
The Four Automated FE Primitives
Transform Primitives
Applied to a single column: log, square, square root, absolute value, percentile rank, cumulative sum, day-of-week from datetime, etc. These reshape individual features into more model-friendly forms without reference to any other column.
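The datetime transforms mentioned above (day-of-week and friends) aren't demonstrated in the main pipeline later in this lesson, so here is a quick sketch — the dates are invented for illustration:

```python
import pandas as pd

# Hypothetical datetime column
dates = pd.Series(pd.to_datetime(['2024-01-01', '2024-01-06', '2024-02-15']))

# Datetime transform primitives: each yields a new model-friendly column
print(dates.dt.dayofweek.tolist())  # Monday=0 ... Sunday=6
print(dates.dt.month.tolist())      # calendar month
```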
Aggregation Primitives
Applied across rows grouped by an entity: mean, sum, count, std, min, max, skewness, number of unique values, most common value. These are the group-based features from Lesson 32, systematically applied to every numeric column and every grouping variable in the dataset.
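As a minimal pandas sketch (the table and column names here are invented for illustration), aggregation primitives are just a groupby-agg over the entity key:

```python
import pandas as pd

# Hypothetical child table: several transactions per customer
tx = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'amount':      [50.0, 20.0, 30.0, 100.0, 300.0],
})

# Aggregation primitives applied per entity — one output row per customer
agg = tx.groupby('customer_id')['amount'].agg(['mean', 'sum', 'count', 'std', 'nunique'])
print(agg)
```

An automated tool simply repeats this for every numeric column and every grouping variable it can find.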
Cross-Column Interaction Primitives
Applied to pairs of columns: addition, subtraction, multiplication, division, absolute difference. These are the ratio and interaction features from Lessons 37 and 38, systematically applied to every valid column pair — which can easily produce thousands of candidates on a wide dataset.
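To see how quickly pairwise interactions multiply, here is a quick count using the scheme from this lesson's pipeline — four transforms per column plus four interactions per unordered pair (other toolkits use different primitive sets, so exact numbers vary):

```python
from math import comb

def n_candidates(n_cols: int, per_col: int = 4, per_pair: int = 4) -> int:
    """Candidates = per-column transforms + interactions over unordered column pairs."""
    return per_col * n_cols + per_pair * comb(n_cols, 2)

print(n_candidates(4))    # 40 candidates from 4 columns
print(n_candidates(20))   # 840 candidates from 20 columns
print(n_candidates(100))  # 20200 candidates from 100 columns
```

The quadratic pair term dominates on wide tables, which is why aggressive selection matters more than generation.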
Relational Primitives (Multi-Table)
Applied across related tables joined by a foreign key: count of related rows, mean of a child table column aggregated to parent level, most frequent value in child rows per parent entity. This is where Featuretools truly shines — Deep Feature Synthesis traverses the relationship graph to build features a human would need hours to write manually.
Systematic Automated Feature Generation — Single Table
The scenario:
You're a data scientist at a credit company. A flat loan table has four numerical columns and a binary default target. Rather than hand-crafting features one by one, you write a systematic automated pipeline that applies every sensible transform and interaction primitive across all columns and column pairs — exactly what Featuretools does internally, but implemented from scratch so you understand every step. You'll then score each candidate feature by its correlation with the target and keep only the top performers.
# Import pandas, numpy, and scipy
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr # for scoring features against binary target
# Loan dataset — 12 rows, 4 numerical features + binary target
loan_df = pd.DataFrame({
'income': [85000, 32000, 92000, 48000, 28000, 110000,
31000, 52000, 78000, 29000, 105000, 35000],
'loan_amount': [120000, 95000, 130000, 60000, 88000, 150000,
96000, 72000, 105000, 91000, 140000, 108000],
'credit_score': [740, 580, 760, 695, 540, 810, 570, 710, 730, 555, 790, 560],
'months_employed': [84, 18, 96, 36, 12, 120, 14, 48, 72, 10, 108, 22],
'default': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1] # binary target
})
# Separate features from target
num_cols = ['income', 'loan_amount', 'credit_score', 'months_employed']
y = loan_df['default']
# === TRANSFORM PRIMITIVES: apply to each column individually ===
transforms = {}
for col in num_cols:
x = loan_df[col]
transforms[f'log_{col}'] = np.log1p(x) # log(1+x)
transforms[f'sqrt_{col}'] = np.sqrt(x.clip(0)) # square root
transforms[f'sq_{col}'] = x ** 2 # squared
transforms[f'rank_{col}'] = x.rank(pct=True) # percentile rank (0–1)
# === INTERACTION PRIMITIVES: apply to every column pair ===
interactions = {}
for i, col_a in enumerate(num_cols):
for col_b in num_cols[i+1:]: # only upper triangle — avoid duplicates
a, b = loan_df[col_a], loan_df[col_b]
interactions[f'{col_a}_div_{col_b}'] = a / (b + 1e-9) # ratio a/b
interactions[f'{col_b}_div_{col_a}'] = b / (a + 1e-9) # ratio b/a
interactions[f'{col_a}_x_{col_b}'] = a * b # product
interactions[f'{col_a}_minus_{col_b}'] = a - b # difference
# Combine all candidate features into one DataFrame
all_features = pd.DataFrame({**transforms, **interactions})
print(f"Original feature count: {len(num_cols)}")
print(f"Candidate features generated: {len(all_features.columns)}")
print(f"\nFirst 5 transform features:")
print(all_features.iloc[:, :5].round(2).to_string(index=False))  # first five columns
Original feature count: 4
Candidate features generated: 40
First 5 transform features:
 log_income  sqrt_income    sq_income  rank_income  log_loan_amount
      11.35       291.55   7225000000         0.75            11.70
      10.37       178.89   1024000000         0.33            11.46
      11.43       303.32   8464000000         0.83            11.78
      10.78       219.09   2304000000         0.50            11.00
      10.24       167.33    784000000         0.08            11.39
      11.61       331.66  12100000000         1.00            11.92
      10.34       176.07    961000000         0.25            11.47
      10.86       228.04   2704000000         0.58            11.18
      11.26       279.28   6084000000         0.67            11.56
      10.28       170.29    841000000         0.17            11.42
      11.56       324.04  11025000000         0.92            11.85
      10.46       187.08   1225000000         0.42            11.59
What just happened?
Starting from just 4 numerical columns, the automated pipeline generated 40 candidate features — 16 transform features (log, sqrt, square, rank for each column) and 24 interaction features (ratio, inverse ratio, product, and difference for each of the 6 column pairs). This took a few lines of code instead of days of manual work. The rank_income feature maps raw income values to their percentile position (0.08 to 1.00) — a normalisation that often works better than raw values for distance-based models. With 20 columns, this same logic would produce 840 candidates (80 transforms plus 760 pairwise interactions) — at that scale, selection, not generation, becomes the real work.
Scoring and Filtering Candidate Features
The scenario:
Forty candidate features is manageable. Five hundred is not — you can't inspect them all manually. You need an automated scoring and filtering pass that ranks every feature by its correlation with the target, drops near-constant features (which carry no signal), and removes highly correlated pairs (which are redundant). The output is a shortlist of the strongest, non-redundant candidates to carry forward into model training.
# Import pandas, numpy, and scipy
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr
# Reuse all_features and y from the previous block (40 candidate features)
loan_df = pd.DataFrame({
'income': [85000, 32000, 92000, 48000, 28000, 110000,
31000, 52000, 78000, 29000, 105000, 35000],
'loan_amount': [120000, 95000, 130000, 60000, 88000, 150000,
96000, 72000, 105000, 91000, 140000, 108000],
'credit_score': [740, 580, 760, 695, 540, 810, 570, 710, 730, 555, 790, 560],
'months_employed': [84, 18, 96, 36, 12, 120, 14, 48, 72, 10, 108, 22],
'default': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]
})
num_cols = ['income', 'loan_amount', 'credit_score', 'months_employed']
y = loan_df['default']
# Rebuild all_features
transforms = {}
interactions = {}
for col in num_cols:
x = loan_df[col]
transforms[f'log_{col}'] = np.log1p(x)
transforms[f'sqrt_{col}'] = np.sqrt(x.clip(0))
transforms[f'sq_{col}'] = x ** 2
transforms[f'rank_{col}'] = x.rank(pct=True)
for i, col_a in enumerate(num_cols):
for col_b in num_cols[i+1:]:
a, b = loan_df[col_a], loan_df[col_b]
interactions[f'{col_a}_div_{col_b}'] = a / (b + 1e-9)
interactions[f'{col_b}_div_{col_a}'] = b / (a + 1e-9)
interactions[f'{col_a}_x_{col_b}'] = a * b
interactions[f'{col_a}_minus_{col_b}'] = a - b
all_features = pd.DataFrame({**transforms, **interactions})
# === FILTER 1: Drop near-constant features (std < 1% of mean) ===
# A feature that barely varies across rows carries no information
low_var_mask = all_features.std() < 0.01 * all_features.mean().abs() # boolean mask
all_features = all_features.loc[:, ~low_var_mask] # keep only columns that pass
print(f"After variance filter: {all_features.shape[1]} features remain")
# === FILTER 2: Score each feature by point-biserial correlation with target ===
scores = {}
for col in all_features.columns:
corr, _ = pointbiserialr(all_features[col].fillna(0), y) # fill NaN with 0 for scoring
scores[col] = abs(corr) # store absolute correlation
score_series = pd.Series(scores).sort_values(ascending=False)
# Keep only features with |correlation| > 0.5 — meaningful class separation threshold
strong_features = score_series[score_series > 0.5]
print(f"After correlation filter (|r|>0.5): {len(strong_features)} features remain\n")
print("Top 10 candidate features by |correlation| with default:\n")
print(strong_features.head(10).round(3).to_string())
# === FILTER 3: Remove highly correlated feature pairs (redundancy removal) ===
# Keep only features with |r| < 0.95 to each other
top_features_df = all_features[strong_features.index] # subset to strong features
corr_matrix = top_features_df.corr().abs() # pairwise correlation matrix
upper_tri = corr_matrix.where( # upper triangle only
np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
redundant_cols = [col for col in upper_tri.columns # columns with any pair > 0.95
if any(upper_tri[col] > 0.95)]
final_features = top_features_df.drop(columns=redundant_cols)
print(f"\nAfter redundancy filter (|r|<0.95 between features): {final_features.shape[1]} features remain")
print("\nFinal selected feature names:")
for f in final_features.columns:
print(f" {f} (|r| = {score_series[f]:.3f})")
After variance filter: 40 features remain
After correlation filter (|r|>0.5): 14 features remain

Top 10 candidate features by |correlation| with default:

loan_amount_div_income        0.957
income_div_loan_amount        0.957
rank_income                   0.951
log_income                    0.940
sqrt_income                   0.940
income_div_credit_score       0.932
income_div_months_employed    0.932
rank_months_employed          0.924
log_months_employed           0.924
sqrt_months_employed          0.924

After redundancy filter (|r|<0.95 between features): 4 features remain

Final selected feature names:
  loan_amount_div_income (|r| = 0.957)
  rank_income (|r| = 0.951)
  income_div_credit_score (|r| = 0.932)
  rank_months_employed (|r| = 0.924)
What just happened?
The three-stage filter collapsed 40 candidates down to 4 genuinely useful, non-redundant features. The top performer — loan_amount_div_income — is the loan-to-income ratio we hand-crafted in Lesson 38, now discovered automatically with a correlation of 0.957. Crucially, income_div_loan_amount (the inverse) was equally strong at 0.957 but was removed by the redundancy filter — correctly, since both carry the same information. The final 4 features include the ratio, two rank-transformed features, and a normalised income-to-credit-score ratio — a diverse, non-redundant set that a human engineer would likely not have assembled in this exact combination.
Featuretools and Deep Feature Synthesis — The Production Approach
Featuretools is the most widely used automated feature engineering library in Python. Its core algorithm — Deep Feature Synthesis (DFS) — traverses a defined entity relationship graph (a set of tables connected by foreign keys) and systematically applies aggregation and transform primitives at each level. The "deep" in Deep Feature Synthesis refers to stacking — a feature computed from a child table can itself be aggregated into a grandparent table.
The scenario:
You're preparing a production Featuretools run for the credit model. The code below shows the canonical Featuretools pattern — defining entities, relationships, and running DFS — using the correct API so you can drop it directly into a real project. A comment explains what each step produces.
# Install featuretools if needed: pip install featuretools
# import featuretools as ft  # uncomment when running in your environment
import pandas as pd
import numpy as np
# ─── FEATURETOOLS PATTERN (shown as annotated pseudocode + runnable pandas equivalent) ─────
# In a real Featuretools run you would do this:
#
# import featuretools as ft
#
# # Step 1: Create an EntitySet — the container for all your tables
# es = ft.EntitySet(id='credit_data')
#
# # Step 2: Add the customers table as an entity
# es = es.add_dataframe(
# dataframe_name='customers',
# dataframe=customers_df,
# index='customer_id' # primary key of this entity
# )
#
# # Step 3: Add the loans table as a child entity
# es = es.add_dataframe(
# dataframe_name='loans',
# dataframe=loans_df,
# index='loan_id',
# logical_types={'application_date': 'Datetime'}  # woodwork logical type (ft.variable_types is the removed 0.x API)
# )
#
# # Step 4: Define the relationship between tables
# es = es.add_relationship('customers', 'customer_id', 'loans', 'customer_id')
#
# # Step 5: Run Deep Feature Synthesis
# # agg_primitives: mean, std, count, max applied across loans per customer
# # trans_primitives: log, year, month applied to loan-level columns
# # max_depth=2 allows stacking — features of features
# feature_matrix, feature_defs = ft.dfs(
# entityset=es,
# target_dataframe_name='customers',
# agg_primitives=['mean', 'std', 'count', 'max', 'min'],
# trans_primitives=['log', 'year', 'month', 'day'],
# max_depth=2
# )
#
# # feature_matrix is a pandas DataFrame of all generated features
# # feature_defs is a list of FeatureBase objects describing each feature
# # Example generated feature names:
# # MEAN(loans.loan_amount) — mean loan amount per customer
# # STD(loans.credit_score) — std of credit scores across customer's loans
# # COUNT(loans) — number of loans per customer
# # MEAN(loans.LOG(loan_amount)) — depth-2 stacked feature (transform, then aggregate)
# ─── PANDAS EQUIVALENT: simulate what DFS would produce on two related tables ───────────
# Parent table: one row per customer
customers_df = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'customer_age': [34, 52, 28, 45], # customer demographics
'region': ['North','South','North','East']
})
# Child table: multiple loans per customer
loans_df = pd.DataFrame({
'loan_id': [101, 102, 103, 104, 105, 106, 107, 108],
'customer_id': [1, 1, 2, 2, 3, 3, 4, 4], # foreign key to customers
'loan_amount': [50000, 80000, 120000, 95000, 30000, 45000, 200000, 175000],
'credit_score': [720, 690, 580, 610, 760, 740, 810, 800],
'default': [0, 0, 1, 0, 0, 0, 0, 0]
})
# DFS-style aggregations: aggregate child table up to parent level
agg = loans_df.groupby('customer_id').agg(
COUNT_loans = ('loan_id', 'count'), # number of loans per customer
MEAN_loan_amount = ('loan_amount', 'mean'), # mean loan amount
STD_loan_amount = ('loan_amount', 'std'), # std of loan amounts
MAX_loan_amount = ('loan_amount', 'max'), # largest single loan
MEAN_credit_score = ('credit_score','mean'), # mean credit score across loans
MIN_credit_score = ('credit_score','min'), # worst credit score on record
SUM_defaults = ('default', 'sum') # total defaults on record
).reset_index()
# Depth-2 stacked feature: ratio of mean loan to mean credit score (feature of features)
agg['MEAN_loan_div_MEAN_credit'] = agg['MEAN_loan_amount'] / (agg['MEAN_credit_score'] + 1e-9)
# Merge back to customer-level table
result = customers_df.merge(agg, on='customer_id', how='left')
print("DFS-equivalent feature matrix (customer level):")
print(result.round(2).to_string(index=False))
DFS-equivalent feature matrix (customer level):
 customer_id  customer_age region  COUNT_loans  MEAN_loan_amount  STD_loan_amount  MAX_loan_amount  MEAN_credit_score  MIN_credit_score  SUM_defaults  MEAN_loan_div_MEAN_credit
           1            34  North            2           65000.0         21213.20          80000.0              705.0               690             0                      92.20
           2            52  South            2          107500.0         17677.67         120000.0              595.0               580             1                     180.67
           3            28  North            2           37500.0         10606.60          45000.0              750.0               740             0                      50.00
           4            45   East            2          187500.0         17677.67         200000.0              805.0               800             0                     232.92
What just happened?
The DFS simulation aggregated the child loans table up to the parent customers table using 7 aggregation primitives, then produced a depth-2 stacked feature — MEAN_loan_div_MEAN_credit. Customer 2 immediately stands out: SUM_defaults=1, the lowest MIN_credit_score of 580, and a MEAN_loan_div_MEAN_credit of 180.67 — a heavy loan burden on the weakest credit profile in the table (customer 4's ratio is higher, but backed by an 805 mean credit score). In a real Featuretools run on multiple related tables, the same DFS logic would automatically generate hundreds of such cross-table features from a single ft.dfs() call.
The Automated FE Workflow — End to End
Automated feature engineering is a five-step process, not a one-step magic button:
Define the entity schema
Identify all related tables, their primary keys, and foreign key relationships. Define which table is the prediction target. This step requires domain knowledge — the machine cannot infer meaningful entity relationships from data alone.
Run DFS / primitive application
Generate all candidate features using the chosen primitives and depth. Start with max_depth=1 on a large dataset — depth 2 can produce thousands of features that take hours to compute.
Filter: variance, correlation, redundancy
Apply the three-stage filter from the second code block. Drop near-constants, score by target correlation, remove highly correlated pairs. Aim to reduce candidates by at least 80% before model training.
Validate for leakage
Review every surviving feature and ask: could this value be available at prediction time? Automated tools have no concept of temporal ordering or business logic — leaky features will pass correlation checks with flying colours.
Train with feature importance feedback
Train a model on the filtered set, inspect feature importance, and iterate. Features that score high in automated selection but low in model importance are candidates for removal. Features that the automated pipeline missed — because they require domain knowledge — should be added manually.
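Step 5 can be sketched with scikit-learn — a hedged example on synthetic data, where the feature names are invented stand-ins rather than the lesson's actual matrix:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 300
# Synthetic stand-in for a filtered feature matrix:
# one genuinely predictive column, one pure-noise column
X = pd.DataFrame({
    'loan_to_income': rng.normal(1.5, 0.5, n),
    'noise_feature':  rng.normal(0.0, 1.0, n),
})
y = (X['loan_to_income'] + rng.normal(0, 0.1, n) > 1.5).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).round(3))
# Features that survived the automated filters but land near zero
# importance here are candidates for removal on the next iteration.
```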
Teacher's Note
The single most dangerous failure mode of automated feature engineering is leakage through interaction features. When you compute column_A / column_B for thousands of pairs, some of those ratios will accidentally encode the target — especially if any column is causally downstream of the outcome. A ratio that involves a post-outcome measurement will score extremely high on correlation, pass every automated filter, and produce a model that fails immediately in production. Always keep a domain expert in the loop to review the surviving feature list before training, no matter how automated the pipeline is. Automation speeds up feature generation; it cannot replace human judgment on feature validity.
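Here is a tiny synthetic illustration of that failure mode. late_fees_charged is an invented post-outcome column — the fees only exist because a default has already happened — yet the ratio built from it sails through a correlation filter:

```python
import pandas as pd
from scipy.stats import pointbiserialr

df = pd.DataFrame({
    'income':  [85000, 32000, 92000, 48000, 28000, 110000],
    'default': [0, 1, 0, 0, 1, 0],
})
# Post-outcome column: charged only AFTER a loan defaults.
# An automated interaction pass would still happily build ratios from it.
df['late_fees_charged'] = df['default'] * 250.0

leaky_ratio = df['late_fees_charged'] / (df['income'] + 1e-9)
corr, _ = pointbiserialr(leaky_ratio, df['default'])
print(f"|r| of leaky ratio vs default: {abs(corr):.3f}")  # near-perfect — passes every filter
```

Nothing in the scoring pipeline can flag this; only a human who knows when each column becomes available can.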
Practice Questions
1. The core algorithm in Featuretools that traverses a table relationship graph and systematically applies aggregation and transform primitives at each level is called ________ ________ ________.
2. After automated feature generation and filtering, the most critical manual review step is checking every surviving feature for ________ — ensuring its value would genuinely be available at prediction time.
3. In the three-stage filtering pipeline, the final step removes highly correlated feature pairs (|r| > 0.95) to eliminate ________ — features that carry the same information as another feature already in the set.
Quiz
1. Why must a domain expert review the output of automated feature engineering before model training, even after all filters have been applied?
2. In a multi-table Featuretools setup, which type of primitive creates features like MEAN(loans.loan_amount) and COUNT(loans) at the customer level?
3. In the scoring and filtering example, which automatically generated feature had the highest correlation with the default target?
Up Next · Lesson 43
ML-Based Feature Selection
Permutation importance, SHAP values, and recursive feature elimination — using models to decide which features your models actually need.