Feature Engineering Course
Deep Feature Synthesis
Single-table feature engineering has a ceiling. The moment your data lives across multiple related tables — orders, customers, products, transactions — the most predictive signals emerge from combining those relationships systematically. Deep Feature Synthesis is the algorithm that does this automatically.
Deep Feature Synthesis (DFS) traverses a graph of related tables and stacks aggregation and transform primitives across relationship levels. A feature computed at the transaction level can be aggregated to the customer level, and that aggregation can itself be used in a ratio at the account level. The "depth" is how many relationship hops the algorithm traverses — depth 2 means features of features, depth 3 means features of features of features.
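The stacking idea can be sketched with a toy enumerator. This is an illustration of the depth concept, not the Featuretools implementation — the `enumerate_features` function and the `schema` dictionary below are invented for this sketch. Each relationship hop wraps the child table's columns (or the child's own generated features) in another aggregation primitive:

```python
# Toy sketch of DFS feature enumeration (illustrative, not Featuretools itself).
# Each relationship hop wraps child columns in an aggregation primitive;
# "depth" is the number of hops, i.e. the nesting level of primitives.
AGG_PRIMITIVES = ['SUM', 'MEAN', 'COUNT']

def enumerate_features(schema, table, depth):
    """Return DFS-style feature names for `table` up to `depth` hops down."""
    if depth == 0:
        return []
    features = []
    for child in schema[table]['children']:
        # Direct aggregations of the child's own columns (one hop)...
        child_exprs = [f"{child}.{col}" for col in schema[child]['columns']]
        # ...plus aggregations of features generated one level deeper.
        child_exprs += [f"{child}.{f}"
                        for f in enumerate_features(schema, child, depth - 1)]
        for agg in AGG_PRIMITIVES:
            features += [f"{agg}({expr})" for expr in child_exprs]
    return features

# Minimal three-table schema: customers <- orders <- items
schema = {
    'customers': {'columns': [], 'children': ['orders']},
    'orders':    {'columns': ['amount'], 'children': ['items']},
    'items':     {'columns': ['price'], 'children': []},
}
depth2 = enumerate_features(schema, 'customers', 2)
print(depth2)  # includes SUM(orders.amount) and MEAN(orders.SUM(items.price))
```

Even this toy version shows why candidate counts explode with depth: every extra hop multiplies the expression list by the number of primitives.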
Why Single-Table FE Leaves Signal on the Table
Consider a customer churn model. The customer table has demographics — age, region, plan type. But the richest signals live in the transactions table: how many purchases this month, the average order value, the trend in order frequency, the number of returns. None of those signals exist in the customer table — they have to be computed by traversing the relationship from customer to transactions and aggregating.
A human engineer writes each of these aggregations by hand, one at a time. DFS generates all of them simultaneously, at every depth level, for every numeric and categorical column in every related table. The human then reviews, filters, and validates what the algorithm found.
Single-Table Feature Engineering
Works only with columns already in the target table. Misses all cross-table signals. A customer model built on demographics alone has a hard ceiling — the behaviour data lives elsewhere and never gets used.
Deep Feature Synthesis
Traverses the full relationship graph. Every aggregation from every related table is available as a feature. Behavioural patterns, frequency signals, and cross-entity ratios all emerge automatically from the relationship structure.
The DFS Depth Levels Explained
Depth 1 — Direct Aggregation
Aggregate child table columns directly to the parent level. Example: MEAN(transactions.amount) — the average transaction amount per customer. Fast and reliable. Always start here.
Depth 2 — Feature of a Feature
Apply a transform to a depth-1 feature, or aggregate from a grandchild table through the child. Example: LOG(MEAN(transactions.amount)) — the log of the average transaction. Or: MEAN(orders.SUM(items.price)) — average order value computed from items. Exponentially more candidates but also exponentially more noise.
Depth 3+ — Rarely Used in Practice
Feature of a feature of a feature. Computational cost grows dramatically and most generated features at this depth are noise. In production, depth 1 and 2 cover the vast majority of useful signals. Use depth 3 only in competition settings or when you have unlimited compute and a rigorous selection pipeline.
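The contrast between depth 1 and depth 2 fits in a few lines of pandas. This is a minimal sketch on invented toy data (the `tx` DataFrame and its values are made up for illustration): a depth-1 aggregation, then a transform stacked on top of it.

```python
import numpy as np
import pandas as pd

# Toy transactions table: two customers, three purchases.
tx = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [10.0, 30.0, 50.0]})

# Depth 1: MEAN(transactions.amount) — one hop, aggregate child to parent.
mean_amount = tx.groupby('customer_id')['amount'].mean()

# Depth 2: LOG(MEAN(transactions.amount)) — a transform applied to a depth-1 feature.
log_mean_amount = np.log(mean_amount)

print(mean_amount.tolist())                  # [20.0, 50.0]
print(log_mean_amount.round(3).tolist())     # [2.996, 3.912]
```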
Building a Multi-Table EntitySet and Running DFS Manually
The scenario:
You're a data scientist at a subscription e-commerce company building a customer lifetime value model. The data lives across three related tables: customers (demographics), orders (one per purchase), and order_items (line items within each order). A human writing features manually would spend days computing cross-table aggregations. You'll implement the full DFS logic manually — traversing the three-table relationship graph — so you can see exactly what Featuretools does under the hood before using the library directly.
# Import pandas and numpy
import pandas as pd
import numpy as np
# ── TABLE 1: Customers (parent entity — prediction target) ──────────────────
customers_df = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'age': [34, 52, 28, 45],
'region': ['North','South','North','East'],
'plan_type': ['premium','basic','basic','premium']
})
# ── TABLE 2: Orders (child of customers) ────────────────────────────────────
orders_df = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109],
'customer_id': [1, 1, 2, 2, 3, 3, 4, 4, 4], # FK to customers
'order_date': pd.to_datetime([
'2024-01-05','2024-02-10','2024-01-20','2024-03-01',
'2024-01-15','2024-02-28','2024-01-08','2024-02-14','2024-03-05'
]),
'returned': [0, 0, 1, 0, 0, 1, 0, 0, 0] # 1 = order was returned
})
# ── TABLE 3: Order Items (child of orders — grandchild of customers) ─────────
items_df = pd.DataFrame({
'item_id': [201,202,203,204,205,206,207,208,209,210,211,212],
'order_id': [101,101,102,103,104,104,105,106,107,108,109,109], # FK to orders
'price': [29.99, 49.99, 89.99, 19.99, 59.99, 39.99,
24.99, 79.99, 44.99, 34.99, 59.99, 29.99],
'category': ['electronics','clothing','electronics','clothing',
'home','electronics','clothing','home',
'electronics','clothing','home','electronics']
})
# ── DEPTH 1: Aggregate orders → customers ────────────────────────────────────
order_agg = orders_df.groupby('customer_id').agg(
COUNT_orders = ('order_id', 'count'), # total orders per customer
SUM_returned = ('returned', 'sum'), # total returns
RETURN_RATE = ('returned', 'mean'), # fraction of orders returned
RECENCY_days = ('order_date', lambda x: (pd.Timestamp('2024-03-10') - x.max()).days)
).reset_index() # days since most recent order
# ── DEPTH 2a: Aggregate items → orders (child of child) ──────────────────────
item_agg = items_df.groupby('order_id').agg(
SUM_order_value = ('price', 'sum'), # total value of each order
COUNT_items = ('price', 'count'), # number of items in order
MEAN_item_price = ('price', 'mean'), # average item price in order
MAX_item_price = ('price', 'max') # most expensive item in order
).reset_index()
# ── DEPTH 2b: Merge item aggregates onto orders, then aggregate to customer level ──
orders_with_items = orders_df.merge(item_agg, on='order_id', how='left')
customer_order_item_agg = orders_with_items.groupby('customer_id').agg(
MEAN_order_value = ('SUM_order_value', 'mean'), # average order value
MAX_order_value = ('SUM_order_value', 'max'), # largest single order
SUM_total_spend = ('SUM_order_value', 'sum'), # total lifetime spend
MEAN_items_per_order = ('COUNT_items', 'mean') # avg items per order
).reset_index()
# ── DEPTH 2c: Stacked ratio feature — total spend divided by order count ─────
# This is a depth-2 feature: it uses two depth-1 aggregations combined
combined_agg = order_agg.merge(customer_order_item_agg, on='customer_id')
combined_agg['AVG_SPEND_PER_ORDER'] = (
combined_agg['SUM_total_spend'] / (combined_agg['COUNT_orders'] + 1e-9)
)
# ── Merge all features back to customer table ─────────────────────────────────
result = customers_df.merge(combined_agg, on='customer_id', how='left')
print("DFS feature matrix — 3 tables, depth 2:")
print(result.round(2).to_string(index=False))
DFS feature matrix — 3 tables, depth 2:
customer_id age region plan_type COUNT_orders SUM_returned RETURN_RATE RECENCY_days MEAN_order_value MAX_order_value SUM_total_spend MEAN_items_per_order AVG_SPEND_PER_ORDER
1 34 North premium 2 0 0.00 29 84.99 89.99 169.97 1.50 84.99
2 52 South basic 2 1 0.50 9 59.99 99.98 119.97 1.50 59.99
3 28 North basic 2 1 0.50 11 52.49 79.99 104.98 1.00 52.49
4 45 East premium 3 0 0.00 5 56.65 89.98 169.96 1.33 56.65
What just happened?
Three tables were traversed in two hops. The depth-1 pass aggregated order-level signals to customer level (order count, return rate, recency). The depth-2 pass first aggregated item-level signals to order level, then aggregated those order-level summaries to customer level — producing features like MEAN_order_value and SUM_total_spend that could never have been computed from the customer table alone. The behavioural contrasts are immediate: customers 2 and 3 each had half their orders returned, while customers 1 and 4 returned nothing; customer 4 places the most orders and is the most recently active; customer 1 carries the highest average order value. None of these signals existed before DFS traversed the relationship graph.
Category-Level Aggregation — DFS Across Categorical Columns
The scenario:
DFS doesn't only aggregate numerical columns. It also extracts signals from categorical columns in child tables — most common category, number of unique categories, proportion of items in a specific category. For the e-commerce model, knowing that a customer buys predominantly electronics (high average price) vs clothing (lower price) is a powerful signal that plain numeric aggregations miss entirely. You'll compute category-level features from the items table up to the customer level.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Reuse items_df and orders_df from the previous block
orders_df = pd.DataFrame({
'order_id': [101,102,103,104,105,106,107,108,109],
'customer_id': [1, 1, 2, 2, 3, 3, 4, 4, 4],
'returned': [0, 0, 1, 0, 0, 1, 0, 0, 0]
})
items_df = pd.DataFrame({
'item_id': [201,202,203,204,205,206,207,208,209,210,211,212],
'order_id': [101,101,102,103,104,104,105,106,107,108,109,109],
'price': [29.99,49.99,89.99,19.99,59.99,39.99,
24.99,79.99,44.99,34.99,59.99,29.99],
'category': ['electronics','clothing','electronics','clothing',
'home','electronics','clothing','home',
'electronics','clothing','home','electronics']
})
# Join items → orders → get customer_id on each item row
items_with_customer = items_df.merge(
orders_df[['order_id','customer_id']], on='order_id', how='left'
)
# ── Categorical aggregations per customer ────────────────────────────────────
# COUNT_UNIQUE(category): how many distinct product categories did this customer buy?
cat_unique = (items_with_customer
.groupby('customer_id')['category']
.nunique()
.rename('COUNT_UNIQUE_category'))
# MODE(category): which category did this customer buy most often?
cat_mode = (items_with_customer
.groupby('customer_id')['category']
.agg(lambda x: x.mode().iloc[0]) # most frequent value; mode() breaks ties alphabetically, so output is deterministic
.rename('MODE_category'))
# PROPORTION of items that are electronics — a domain-specific ratio feature
electronics_prop = (items_with_customer
.assign(is_electronics=(items_with_customer['category']=='electronics').astype(int))
.groupby('customer_id')['is_electronics']
.mean()
.rename('PROP_electronics'))
# MEAN price per category per customer — pivot style
cat_price_mean = (items_with_customer
.groupby(['customer_id','category'])['price']
.mean()
.unstack(fill_value=0) # categories become columns
.add_prefix('MEAN_price_') # prefix: MEAN_price_clothing, MEAN_price_electronics, etc.
.reset_index())
# Combine all categorical features
cat_features = (cat_unique
.reset_index()
.merge(cat_mode.reset_index(), on='customer_id')
.merge(electronics_prop.reset_index(), on='customer_id')
.merge(cat_price_mean, on='customer_id'))
print("Category-level DFS features per customer:")
print(cat_features.round(2).to_string(index=False))
Category-level DFS features per customer:
customer_id COUNT_UNIQUE_category MODE_category PROP_electronics MEAN_price_clothing MEAN_price_electronics MEAN_price_home
1 2 electronics 0.67 49.99 59.99 0.00
2 3 clothing 0.33 19.99 39.99 59.99
3 2 clothing 0.00 24.99 0.00 79.99
4 3 electronics 0.50 34.99 37.49 59.99
What just happened?
The categorical DFS pass extracted four feature types from the category column in the items table. Customer 1's MODE_category is electronics, with an average electronics price of $59.99 — the highest electronics average in the dataset, signalling an electronics-focused buyer. Customers 2 and 4 each bought from all three categories (COUNT_UNIQUE_category=3), suggesting broader engagement. The per-category mean price columns are a classic DFS pivot feature — they turn a single categorical column into N numeric features, one per unique value, each with real predictive meaning. A model trained on these category-level features knows not just how much each customer spent, but what kind of spender they are.
Assembling the Full DFS Feature Matrix
The scenario:
With depth-1, depth-2, and categorical features all computed, you now assemble the final feature matrix — joining every feature set back to the customer table and running a final scoring audit to see which features carry the most signal. You'll add a synthetic churn target and verify that the cross-table features vastly outperform the raw customer demographics.
# Import pandas, numpy, and scipy
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr
# Customers base table with synthetic churn target
customers_df = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'age': [34, 52, 28, 45],
'churn': [0, 1, 1, 0] # synthetic target: Customer 2 and 3 churn
})
# Numeric features from depth-1 and depth-2 DFS (computed in block 1)
dfs_numeric = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'COUNT_orders': [2, 2, 2, 3],
'RETURN_RATE': [0.00, 0.50, 0.50, 0.00],
'RECENCY_days': [29, 9, 11, 5],
'MEAN_order_value': [84.99, 59.99, 52.49, 56.65],
'SUM_total_spend': [169.97, 119.97, 104.98, 169.96],
'AVG_SPEND_PER_ORDER': [84.99, 59.99, 52.49, 56.65],
'MEAN_items_per_order': [1.5, 1.5, 1.0, 1.33]
})
# Categorical features from block 2
dfs_categorical = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'COUNT_UNIQUE_category': [2, 3, 2, 3],
'PROP_electronics': [0.67, 0.33, 0.00, 0.50],
'MEAN_price_electronics': [59.99, 39.99, 0.00, 37.49],
'MEAN_price_clothing': [49.99, 19.99, 24.99, 34.99]
})
# Merge everything into one final feature matrix
final_df = (customers_df
.merge(dfs_numeric, on='customer_id')
.merge(dfs_categorical, on='customer_id'))
# Score every feature by correlation with churn target
feature_cols = [c for c in final_df.columns if c not in ('customer_id', 'churn')]
y = final_df['churn']
audit = []
for col in feature_cols:
x = final_df[col]
# Only score numeric columns — skip non-numeric
if pd.api.types.is_numeric_dtype(x):
corr, _ = pointbiserialr(x, y)
audit.append({'feature': col, 'correlation': round(corr, 3),
'source': 'raw demographic' if col == 'age' else 'DFS cross-table'})
audit_df = (pd.DataFrame(audit)
.sort_values('correlation', key=abs, ascending=False))
print("Feature correlation audit — DFS cross-table vs raw demographics:\n")
print(f"{'Feature':<28} {'Source':<20} {'Correlation':>12}")
print("-" * 62)
for _, row in audit_df.iterrows():
print(f" {row['feature']:<26} {row['source']:<20} {row['correlation']:>12.3f}")
print("\nFinal feature matrix:")
print(final_df.drop('churn', axis=1).round(2).to_string(index=False))
Feature correlation audit — DFS cross-table vs raw demographics:
Feature Source Correlation
--------------------------------------------------------------
RETURN_RATE DFS cross-table 1.000
SUM_total_spend DFS cross-table -0.983
MEAN_price_clothing DFS cross-table -0.873
PROP_electronics DFS cross-table -0.848
MEAN_price_electronics DFS cross-table -0.663
COUNT_orders DFS cross-table -0.577
MEAN_order_value DFS cross-table -0.575
AVG_SPEND_PER_ORDER DFS cross-table -0.575
MEAN_items_per_order DFS cross-table -0.404
RECENCY_days DFS cross-table -0.380
age raw demographic 0.027
COUNT_UNIQUE_category DFS cross-table 0.000
Final feature matrix:
customer_id age COUNT_orders RETURN_RATE RECENCY_days MEAN_order_value SUM_total_spend AVG_SPEND_PER_ORDER MEAN_items_per_order COUNT_UNIQUE_category PROP_electronics MEAN_price_electronics MEAN_price_clothing
1 34 2 0.00 29 84.99 169.97 84.99 1.50 2 0.67 59.99 49.99
2 52 2 0.50 9 59.99 119.97 59.99 1.50 3 0.33 39.99 19.99
3 28 2 0.50 11 52.49 104.98 52.49 1.00 2 0.00 0.00 24.99
4 45 3 0.00 5 56.65 169.96 56.65 1.33 3 0.50 37.49 34.99
What just happened?
The audit tells a decisive story. The top features by absolute correlation are all DFS cross-table features — RETURN_RATE achieves a perfect 1.000 correlation with churn (both churning customers have return rates of 0.50; both retained customers have 0.00), and SUM_total_spend follows at -0.983. The only raw demographic feature — age — scores just 0.027, barely distinguishable from noise. A model built on demographics alone would barely outperform random guessing on this problem. The behavioural signals extracted by traversing the order and item tables completely dominate. This is the core argument for DFS: the signal is in the relationships, not the attributes.
DFS Feature Naming and the Relationship Graph
Understanding how Featuretools names features helps you audit the output and communicate results to stakeholders. The naming convention directly encodes the traversal path:
| Feature Name | Depth | What It Means | Path Traversed |
|---|---|---|---|
| COUNT(orders) | 1 | Number of orders per customer | customers ← orders |
| MEAN(orders.amount) | 1 | Average order amount per customer | customers ← orders.amount |
| SUM(orders.SUM(items.price)) | 2 | Total spend: sum of all item prices across all orders per customer | customers ← orders ← items.price |
| MEAN(orders.COUNT(items)) | 2 | Average number of items per order per customer | customers ← orders ← items |
| LOG(MEAN(orders.amount)) | 2 | Log of average order amount — depth-1 agg + transform | customers ← orders.amount → log |
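Because the naming convention nests one primitive per hop, the depth of a feature can be recovered mechanically from its name. The helper below is an illustrative sketch (not part of Featuretools), and it assumes that primitives are the only all-caps tokens in the name:

```python
import re

def dfs_depth(feature_name: str) -> int:
    """Estimate the depth of a DFS-style feature name.

    Counts nested primitive calls: each UPPERCASE( token opens one
    level of stacking, so the count equals the relationship/transform depth.
    """
    return len(re.findall(r'[A-Z_]+\(', feature_name))

print(dfs_depth('COUNT(orders)'))                 # 1
print(dfs_depth('SUM(orders.SUM(items.price))'))  # 2
print(dfs_depth('LOG(MEAN(orders.amount))'))      # 2
```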
Teacher's Note
The single most important constraint in DFS is the cutoff time. When building a prediction for a customer, DFS must only use data from before the prediction date — otherwise you're encoding future behaviour as a feature. Featuretools handles this through its cutoff_time parameter: you pass a DataFrame with one row per (entity, cutoff_date) pair, and DFS automatically filters all child table rows to only include data before that date. Without cutoff times, a churn model could accidentally use post-churn transaction data — the ultimate leakage. This is the most common production mistake in Featuretools deployments and the one most likely to make your model look perfect in evaluation and fail completely in deployment.
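The filtering that cutoff_time automates can be sketched in plain pandas. This is a minimal illustration of the idea, assuming a per-customer cutoff table (the column names and toy data below are invented): child rows on or after each entity's cutoff date are dropped before any aggregation runs.

```python
import pandas as pd

# Toy child table: three orders across two customers.
orders = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'order_date': pd.to_datetime(['2024-01-05', '2024-02-10', '2024-03-01']),
    'amount': [100.0, 50.0, 80.0],
})

# One cutoff date per customer: only data strictly before it may become a feature.
cutoffs = pd.DataFrame({
    'customer_id': [1, 2],
    'cutoff_date': pd.to_datetime(['2024-02-01', '2024-04-01']),
})

# Attach each customer's cutoff to their orders, then drop rows at/after it.
merged = orders.merge(cutoffs, on='customer_id')
valid = merged[merged['order_date'] < merged['cutoff_date']]

# Aggregate only the leakage-free rows.
spend_before_cutoff = valid.groupby('customer_id')['amount'].sum()
print(spend_before_cutoff.to_dict())  # customer 1 keeps only the January order
```

Customer 1's February order falls after the February 1 cutoff and is excluded, so the feature reflects only what was knowable at prediction time — exactly the guarantee cutoff_time provides across every table in the EntitySet.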
Practice Questions
1. The Featuretools parameter that ensures DFS only uses data before a specified date per entity — preventing future data from leaking into features — is called ________.
2. A DFS feature named SUM(orders.COUNT(items)) is a depth-__ feature — it traverses two relationship hops from the parent entity.
3. In the final feature audit, which DFS cross-table feature achieved a perfect correlation of 1.000 with the churn target?
Quiz
1. What is the key advantage of Deep Feature Synthesis over single-table feature engineering?
2. In the churn prediction audit, which feature type dominated — raw customer demographics or DFS cross-table behavioural features?
3. What goes wrong when you run DFS on a churn model without specifying a cutoff_time?
Up Next · Lesson 45
Feature Engineering Project
End-to-end capstone — raw data to production feature pipeline, applying every technique from the course on a real multi-table dataset with a full model comparison before and after feature engineering.