Feature Engineering Course
Deep Feature Synthesis
Single-table feature engineering has a ceiling. The moment your data lives across multiple related tables — orders, customers, products, transactions — the most predictive signals emerge from combining those relationships systematically. Deep Feature Synthesis is the algorithm that does this automatically.
Deep Feature Synthesis (DFS) traverses a graph of related tables and stacks aggregation and transform primitives across relationship levels. A feature computed at the transaction level can be aggregated to the customer level, and that aggregation can itself be used in a ratio at the account level. The "depth" is how many relationship hops the algorithm traverses — depth 2 means features of features, depth 3 means features of features of features.
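The stacking idea can be sketched with a toy enumerator. This is an illustration of the depth concept, not the Featuretools implementation — the `enumerate_features` function and the `schema` dictionary below are invented for this sketch. Each relationship hop wraps the child table's columns (or the child's own generated features) in another aggregation primitive:

```python
# Toy sketch of DFS feature enumeration (illustrative, not Featuretools itself).
# Each relationship hop wraps child columns in an aggregation primitive;
# "depth" is the number of hops, i.e. the nesting level of primitives.
AGG_PRIMITIVES = ['SUM', 'MEAN', 'COUNT']

def enumerate_features(schema, table, depth):
    """Return DFS-style feature names for `table` up to `depth` hops down."""
    if depth == 0:
        return []
    features = []
    for child in schema[table]['children']:
        # Direct aggregations of the child's own columns (one hop)...
        child_exprs = [f"{child}.{col}" for col in schema[child]['columns']]
        # ...plus aggregations of features generated one level deeper.
        child_exprs += [f"{child}.{f}"
                        for f in enumerate_features(schema, child, depth - 1)]
        for agg in AGG_PRIMITIVES:
            features += [f"{agg}({expr})" for expr in child_exprs]
    return features

# Minimal three-table schema: customers <- orders <- items
schema = {
    'customers': {'columns': [], 'children': ['orders']},
    'orders':    {'columns': ['amount'], 'children': ['items']},
    'items':     {'columns': ['price'], 'children': []},
}
depth2 = enumerate_features(schema, 'customers', 2)
print(depth2)  # includes SUM(orders.amount) and MEAN(orders.SUM(items.price))
```

Even this toy version shows why candidate counts explode with depth: every extra hop multiplies the expression list by the number of primitives.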
Why Single-Table FE Leaves Signal on the Table
Consider a customer churn model. The customer table has demographics — age, region, plan type. But the richest signals live in the transactions table: how many purchases this month, the average order value, the trend in order frequency, the number of returns. None of those signals exist in the customer table — they have to be computed by traversing the relationship from customer to transactions and aggregating.
A human engineer writes each of these aggregations by hand, one at a time. DFS generates all of them simultaneously, at every depth level, for every numeric and categorical column in every related table. The human then reviews, filters, and validates what the algorithm found.
Single-Table Feature Engineering
Works only with columns already in the target table. Misses all cross-table signals. A customer model built on demographics alone has a hard ceiling — the behaviour data lives elsewhere and never gets used.
Deep Feature Synthesis
Traverses the full relationship graph. Every aggregation from every related table is available as a feature. Behavioural patterns, frequency signals, and cross-entity ratios all emerge automatically from the relationship structure.
The DFS Depth Levels Explained
Depth 1 — Direct Aggregation
Aggregate child table columns directly to the parent level. Example: MEAN(transactions.amount) — the average transaction amount per customer. Fast and reliable. Always start here.
Depth 2 — Feature of a Feature
Apply a transform to a depth-1 feature, or aggregate from a grandchild table through the child. Example: LOG(MEAN(transactions.amount)) — the log of the average transaction. Or: MEAN(orders.SUM(items.price)) — average order value computed from items. Exponentially more candidates but also exponentially more noise.
Depth 3+ — Rarely Used in Practice
Feature of a feature of a feature. Computational cost grows dramatically and most generated features at this depth are noise. In production, depth 1 and 2 cover the vast majority of useful signals. Use depth 3 only in competition settings or when you have unlimited compute and a rigorous selection pipeline.
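The contrast between depth 1 and depth 2 fits in a few lines of pandas. This is a minimal sketch on invented toy data (the `tx` DataFrame and its values are made up for illustration): a depth-1 aggregation, then a transform stacked on top of it.

```python
import numpy as np
import pandas as pd

# Toy transactions table: two customers, three purchases.
tx = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [10.0, 30.0, 50.0]})

# Depth 1: MEAN(transactions.amount) — one hop, aggregate child to parent.
mean_amount = tx.groupby('customer_id')['amount'].mean()

# Depth 2: LOG(MEAN(transactions.amount)) — a transform applied to a depth-1 feature.
log_mean_amount = np.log(mean_amount)

print(mean_amount.tolist())                  # [20.0, 50.0]
print(log_mean_amount.round(3).tolist())     # [2.996, 3.912]
```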
Building a Multi-Table EntitySet and Running DFS Manually
The scenario:
You're a data scientist at a subscription e-commerce company building a customer lifetime value model. The data lives across three related tables: customers (demographics), orders (one per purchase), and order_items (line items within each order). A human writing features manually would spend days computing cross-table aggregations. You'll implement the full DFS logic manually — traversing the three-table relationship graph — so you can see exactly what Featuretools does under the hood before using the library directly.
# Import pandas and numpy
import pandas as pd
import numpy as np
# ── TABLE 1: Customers (parent entity — prediction target) ──────────────────
customers_df = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'age': [34, 52, 28, 45],
'region': ['North','South','North','East'],
'plan_type': ['premium','basic','basic','premium']
})
# ── TABLE 2: Orders (child of customers) ────────────────────────────────────
orders_df = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109],
'customer_id': [1, 1, 2, 2, 3, 3, 4, 4, 4], # FK to customers
'order_date': pd.to_datetime([
'2024-01-05','2024-02-10','2024-01-20','2024-03-01',
'2024-01-15','2024-02-28','2024-01-08','2024-02-14','2024-03-05'
]),
'returned': [0, 0, 1, 0, 0, 1, 0, 0, 0] # 1 = order was returned
})
# ── TABLE 3: Order Items (child of orders — grandchild of customers) ─────────
items_df = pd.DataFrame({
'item_id': [201,202,203,204,205,206,207,208,209,210,211,212],
'order_id': [101,101,102,103,104,104,105,106,107,108,109,109], # FK to orders
'price': [29.99, 49.99, 89.99, 19.99, 59.99, 39.99,
24.99, 79.99, 44.99, 34.99, 59.99, 29.99],
'category': ['electronics','clothing','electronics','clothing',
'home','electronics','clothing','home',
'electronics','clothing','home','electronics']
})
# ── DEPTH 1: Aggregate orders → customers ────────────────────────────────────
order_agg = orders_df.groupby('customer_id').agg(
COUNT_orders = ('order_id', 'count'), # total orders per customer
SUM_returned = ('returned', 'sum'), # total returns
RETURN_RATE = ('returned', 'mean'), # fraction of orders returned
RECENCY_days = ('order_date', lambda x: (pd.Timestamp('2024-03-10') - x.max()).days)
).reset_index() # days since most recent order
# ── DEPTH 2a: Aggregate items → orders (child of child) ──────────────────────
item_agg = items_df.groupby('order_id').agg(
SUM_order_value = ('price', 'sum'), # total value of each order
COUNT_items = ('price', 'count'), # number of items in order
MEAN_item_price = ('price', 'mean'), # average item price in order
MAX_item_price = ('price', 'max') # most expensive item in order
).reset_index()
# ── DEPTH 2b: Merge item aggregates onto orders, then aggregate to customer level ──
orders_with_items = orders_df.merge(item_agg, on='order_id', how='left')
customer_order_item_agg = orders_with_items.groupby('customer_id').agg(
MEAN_order_value = ('SUM_order_value', 'mean'), # average order value
MAX_order_value = ('SUM_order_value', 'max'), # largest single order
SUM_total_spend = ('SUM_order_value', 'sum'), # total lifetime spend
MEAN_items_per_order = ('COUNT_items', 'mean') # avg items per order
).reset_index()
# ── DEPTH 2c: Stacked ratio feature — total spend divided by order count ─────
# This is a depth-2 feature: it uses two depth-1 aggregations combined
combined_agg = order_agg.merge(customer_order_item_agg, on='customer_id')
combined_agg['AVG_SPEND_PER_ORDER'] = (
combined_agg['SUM_total_spend'] / (combined_agg['COUNT_orders'] + 1e-9)
)
# ── Merge all features back to customer table ─────────────────────────────────
result = customers_df.merge(combined_agg, on='customer_id', how='left')
print("DFS feature matrix — 3 tables, depth 2:")
print(result.round(2).to_string(index=False))
DFS feature matrix — 3 tables, depth 2:
customer_id age region plan_type COUNT_orders SUM_returned RETURN_RATE RECENCY_days MEAN_order_value MAX_order_value SUM_total_spend MEAN_items_per_order AVG_SPEND_PER_ORDER
1 34 North premium 2 0 0.00 29 84.99 89.99 169.97 1.50 84.99
2 52 South basic 2 1 0.50 9 59.99 99.98 119.97 1.50 59.99
3 28 North basic 2 1 0.50 11 52.49 79.99 104.98 1.00 52.49
4 45 East premium 3 0 0.00 5 56.65 89.98 169.96 1.33 56.65
What just happened?
Three tables were traversed in two hops. The depth-1 pass aggregated order-level signals to customer level (order count, return rate, recency). The depth-2 pass first aggregated item-level signals to order level, then aggregated those order-level summaries to customer level — producing features like MEAN_order_value and SUM_total_spend that could never have been computed from the customer table alone. The behavioural contrasts are immediate: customers 2 and 3 each had half their orders returned, while customers 1 and 4 returned nothing; customer 4 places the most orders and is the most recently active; customer 1 carries the highest average order value. None of these signals existed before DFS traversed the relationship graph.
Category-Level Aggregation — DFS Across Categorical Columns
The scenario:
DFS doesn't only aggregate numerical columns. It also extracts signals from categorical columns in child tables — most common category, number of unique categories, proportion of items in a specific category. For the e-commerce model, knowing that a customer buys predominantly electronics (high average price) vs clothing (lower price) is a powerful signal that plain numeric aggregations miss entirely. You'll compute category-level features from the items table up to the customer level.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Reuse items_df and orders_df from the previous block
orders_df = pd.DataFrame({
'order_id': [101,102,103,104,105,106,107,108,109],
'customer_id': [1, 1, 2, 2, 3, 3, 4, 4, 4],
'returned': [0, 0, 1, 0, 0, 1, 0, 0, 0]
})
items_df = pd.DataFrame({
'item_id': [201,202,203,204,205,206,207,208,209,210,211,212],
'order_id': [101,101,102,103,104,104,105,106,107,108,109,109],
'price': [29.99,49.99,89.99,19.99,59.99,39.99,
24.99,79.99,44.99,34.99,59.99,29.99],
'category': ['electronics','clothing','electronics','clothing',
'home','electronics','clothing','home',
'electronics','clothing','home','electronics']
})
# Join items → orders → get customer_id on each item row
items_with_customer = items_df.merge(
orders_df[['order_id','customer_id']], on='order_id', how='left'
)
# ── Categorical aggregations per customer ────────────────────────────────────
# COUNT_UNIQUE(category): how many distinct product categories did this customer buy?
cat_unique = (items_with_customer
.groupby('customer_id')['category']
.nunique()
.rename('COUNT_UNIQUE_category'))
# MODE(category): which category did this customer buy most often?
cat_mode = (items_with_customer
.groupby('customer_id')['category']
.agg(lambda x: x.mode().iloc[0]) # most frequent value; mode() breaks ties alphabetically, so output is deterministic
.rename('MODE_category'))
# PROPORTION of items that are electronics — a domain-specific ratio feature
electronics_prop = (items_with_customer
.assign(is_electronics=(items_with_customer['category']=='electronics').astype(int))
.groupby('customer_id')['is_electronics']
.mean()
.rename('PROP_electronics'))
# MEAN price per category per customer — pivot style
cat_price_mean = (items_with_customer
.groupby(['customer_id','category'])['price']
.mean()
.unstack(fill_value=0) # categories become columns
.add_prefix('MEAN_price_') # prefix: MEAN_price_clothing, MEAN_price_electronics, etc.
.reset_index())
# Combine all categorical features
cat_features = (cat_unique
.reset_index()
.merge(cat_mode.reset_index(), on='customer_id')
.merge(electronics_prop.reset_index(), on='customer_id')
.merge(cat_price_mean, on='customer_id'))
print("Category-level DFS features per customer:")
print(cat_features.round(2).to_string(index=False))
Category-level DFS features per customer:
customer_id COUNT_UNIQUE_category MODE_category PROP_electronics MEAN_price_clothing MEAN_price_electronics MEAN_price_home
1 2 electronics 0.67 49.99 59.99 0.00
2 3 clothing 0.33 19.99 39.99 59.99
3 2 clothing 0.00 24.99 0.00 79.99
4 3 electronics 0.50 34.99 37.49 59.99
What just happened?
The categorical DFS pass extracted four feature types from the category column in the items table. Customer 1's MODE_category is electronics, with an average electronics price of $59.99 — the highest electronics average in the dataset, signalling an electronics-focused buyer. Customers 2 and 4 each bought from all three categories (COUNT_UNIQUE_category=3), suggesting broader engagement. The per-category mean price columns are a classic DFS pivot feature — they turn a single categorical column into N numeric features, one per unique value, each with real predictive meaning. A model trained on these category-level features knows not just how much each customer spent, but what kind of spender they are.
Assembling the Full DFS Feature Matrix
The scenario:
With depth-1, depth-2, and categorical features all computed, you now assemble the final feature matrix — joining every feature set back to the customer table and running a final scoring audit to see which features carry the most signal. You'll add a synthetic churn target and verify that the cross-table features vastly outperform the raw customer demographics.
# Import pandas, numpy, and scipy
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr
# Customers base table with synthetic churn target
customers_df = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'age': [34, 52, 28, 45],
'churn': [0, 1, 1, 0] # synthetic target: Customer 2 and 3 churn
})
# Numeric features from depth-1 and depth-2 DFS (computed in block 1)
dfs_numeric = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'COUNT_orders': [2, 2, 2, 3],
'RETURN_RATE': [0.00, 0.50, 0.50, 0.00],
'RECENCY_days': [29, 9, 11, 5],
'MEAN_order_value': [84.99, 59.99, 52.49, 56.65],
'SUM_total_spend': [169.97, 119.97, 104.98, 169.96],
'AVG_SPEND_PER_ORDER': [84.99, 59.99, 52.49, 56.65],
'MEAN_items_per_order': [1.5, 1.5, 1.0, 1.33]
})
# Categorical features from block 2
dfs_categorical = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'COUNT_UNIQUE_category': [2, 3, 2, 3],
'PROP_electronics': [0.67, 0.33, 0.00, 0.50],
'MEAN_price_electronics': [59.99, 39.99, 0.00, 37.49],
'MEAN_price_clothing': [49.99, 19.99, 24.99, 34.99]
})
# Merge everything into one final feature matrix
final_df = (customers_df
.merge(dfs_numeric, on='customer_id')
.merge(dfs_categorical, on='customer_id'))
# Score every feature by correlation with churn target
feature_cols = [c for c in final_df.columns if c not in ('customer_id', 'churn')]
y = final_df['churn']
audit = []
for col in feature_cols:
x = final_df[col]
# Only score numeric columns — skip non-numeric
if pd.api.types.is_numeric_dtype(x):
corr, _ = pointbiserialr(x, y)
audit.append({'feature': col, 'correlation': round(corr, 3),
'source': 'raw demographic' if col == 'age' else 'DFS cross-table'})
audit_df = (pd.DataFrame(audit)
.sort_values('correlation', key=abs, ascending=False))
print("Feature correlation audit — DFS cross-table vs raw demographics:\n")
print(f"{'Feature':<28} {'Source':<20} {'Correlation':>12}")
print("-" * 62)
for _, row in audit_df.iterrows():
print(f" {row['feature']:<26} {row['source']:<20} {row['correlation']:>12.3f}")
print("\nFinal feature matrix:")
print(final_df.drop('churn', axis=1).round(2).to_string(index=False))
Feature correlation audit — DFS cross-table vs raw demographics:
Feature Source Correlation
--------------------------------------------------------------
RETURN_RATE DFS cross-table 1.000
SUM_total_spend DFS cross-table -0.983
MEAN_price_clothing DFS cross-table -0.873
PROP_electronics DFS cross-table -0.848
MEAN_price_electronics DFS cross-table -0.663
COUNT_orders DFS cross-table -0.577
MEAN_order_value DFS cross-table -0.575
AVG_SPEND_PER_ORDER DFS cross-table -0.575
MEAN_items_per_order DFS cross-table -0.404
RECENCY_days DFS cross-table -0.380
age raw demographic 0.027
COUNT_UNIQUE_category DFS cross-table 0.000
Final feature matrix:
customer_id age COUNT_orders RETURN_RATE RECENCY_days MEAN_order_value SUM_total_spend AVG_SPEND_PER_ORDER MEAN_items_per_order COUNT_UNIQUE_category PROP_electronics MEAN_price_electronics MEAN_price_clothing
1 34 2 0.00 29 84.99 169.97 84.99 1.50 2 0.67 59.99 49.99
2 52 2 0.50 9 59.99 119.97 59.99 1.50 3 0.33 39.99 19.99
3 28 2 0.50 11 52.49 104.98 52.49 1.00 2 0.00 0.00 24.99
4 45 3 0.00 5 56.65 169.96 56.65 1.33 3 0.50 37.49 34.99
What just happened?
The audit tells a decisive story. The top features by absolute correlation are all DFS cross-table features — RETURN_RATE achieves a perfect 1.000 correlation with churn (both churning customers have return rates of 0.50; both retained customers have 0.00), and SUM_total_spend follows at -0.983. The only raw demographic feature — age — scores just 0.027, barely distinguishable from noise. A model built on demographics alone would barely outperform random guessing on this problem. The behavioural signals extracted by traversing the order and item tables completely dominate. This is the core argument for DFS: the signal is in the relationships, not the attributes.
DFS Feature Naming and the Relationship Graph
Understanding how Featuretools names features helps you audit the output and communicate results to stakeholders. The naming convention directly encodes the traversal path:
| Feature Name | Depth | What It Means | Path Traversed |
|---|---|---|---|
| COUNT(orders) | 1 | Number of orders per customer | customers ← orders |
| MEAN(orders.amount) | 1 | Average order amount per customer | customers ← orders.amount |
| SUM(orders.SUM(items.price)) | 2 | Total spend: sum of all item prices across all orders per customer | customers ← orders ← items.price |
| MEAN(orders.COUNT(items)) | 2 | Average number of items per order per customer | customers ← orders ← items |
| LOG(MEAN(orders.amount)) | 2 | Log of average order amount — depth-1 agg + transform | customers ← orders.amount → log |
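Because the naming convention nests one primitive per hop, the depth of a feature can be recovered mechanically from its name. The helper below is an illustrative sketch (not part of Featuretools), and it assumes that primitives are the only all-caps tokens in the name:

```python
import re

def dfs_depth(feature_name: str) -> int:
    """Estimate the depth of a DFS-style feature name.

    Counts nested primitive calls: each UPPERCASE( token opens one
    level of stacking, so the count equals the relationship/transform depth.
    """
    return len(re.findall(r'[A-Z_]+\(', feature_name))

print(dfs_depth('COUNT(orders)'))                 # 1
print(dfs_depth('SUM(orders.SUM(items.price))'))  # 2
print(dfs_depth('LOG(MEAN(orders.amount))'))      # 2
```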
Teacher's Note
The single most important constraint in DFS is the cutoff time. When building a prediction for a customer, DFS must only use data from before the prediction date — otherwise you're encoding future behaviour as a feature. Featuretools handles this through its cutoff_time parameter: you pass a DataFrame with one row per (entity, cutoff_date) pair, and DFS automatically filters all child table rows to only include data before that date. Without cutoff times, a churn model could accidentally use post-churn transaction data — the ultimate leakage. This is the most common production mistake in Featuretools deployments and the one most likely to make your model look perfect in evaluation and fail completely in deployment.
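The filtering that cutoff_time automates can be sketched in plain pandas. This is a minimal illustration of the idea, assuming a per-customer cutoff table (the column names and toy data below are invented): child rows on or after each entity's cutoff date are dropped before any aggregation runs.

```python
import pandas as pd

# Toy child table: three orders across two customers.
orders = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'order_date': pd.to_datetime(['2024-01-05', '2024-02-10', '2024-03-01']),
    'amount': [100.0, 50.0, 80.0],
})

# One cutoff date per customer: only data strictly before it may become a feature.
cutoffs = pd.DataFrame({
    'customer_id': [1, 2],
    'cutoff_date': pd.to_datetime(['2024-02-01', '2024-04-01']),
})

# Attach each customer's cutoff to their orders, then drop rows at/after it.
merged = orders.merge(cutoffs, on='customer_id')
valid = merged[merged['order_date'] < merged['cutoff_date']]

# Aggregate only the leakage-free rows.
spend_before_cutoff = valid.groupby('customer_id')['amount'].sum()
print(spend_before_cutoff.to_dict())  # customer 1 keeps only the January order
```

Customer 1's February order falls after the February 1 cutoff and is excluded, so the feature reflects only what was knowable at prediction time — exactly the guarantee cutoff_time provides across every table in the EntitySet.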
Practice Questions
1. The Featuretools parameter that ensures DFS only uses data before a specified date per entity — preventing future data from leaking into features — is called ________.
2. A DFS feature named SUM(orders.COUNT(items)) is a depth-__ feature — it traverses two relationship hops from the parent entity.
3. In the final feature audit, which DFS cross-table feature achieved a perfect correlation of 1.000 with the churn target?
Quiz
1. What is the key advantage of Deep Feature Synthesis over single-table feature engineering?
2. In the churn prediction audit, which feature type dominated — raw customer demographics or DFS cross-table behavioural features?
3. What goes wrong when you run DFS on a churn model without specifying a cutoff_time?
Up Next · Lesson 45
Feature Engineering Project
End-to-end capstone — raw data to production feature pipeline, applying every technique from the course on a real multi-table dataset with a full model comparison before and after feature engineering.