Feature Engineering Lesson 42 – Automated FE | Dataplexa
Advanced Level · Lesson 42

Automated Feature Engineering

Hand-crafting features is powerful but slow. A senior data scientist might spend two weeks building 30 features. Automated feature engineering can generate 300 candidate features in minutes — then your job shifts from creation to selection and validation.

Automated feature engineering tools apply a systematic set of transformation primitives — aggregations, ratios, lags, cumulative stats — across all numerical and categorical columns, across multiple related tables if available. The output is a large candidate feature matrix. The challenge is not generating features; it's separating the genuinely useful ones from the noise.

Manual vs Automated Feature Engineering — The Real Trade-off

Automation does not replace domain knowledge — it scales it. A human engineer who understands the business problem still needs to define which tables to join, which entities are meaningful, and which generated features actually make business sense. The machine generates; the human curates.

Manual Feature Engineering

Slow — days or weeks per feature set. Deep domain expertise required. Features are interpretable and intentional. Easy to validate business logic. Small feature count — usually 20–100 columns. Every feature has a reason.

Automated Feature Engineering

Fast — hundreds of features in minutes. Finds non-obvious combinations a human would overlook. Requires aggressive selection to remove noise. Risk of generating features that are numerically valid but meaningless. Excellent for exploration and Kaggle-style competitions.

The Four Automated FE Primitives

1

Transform Primitives

Applied to a single column: log, square, square root, absolute value, percentile rank, cumulative sum, day-of-week from datetime, etc. These reshape individual features into more model-friendly forms without reference to any other column.
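
These single-column reshapes can be sketched in a few lines of pandas; the income values here are invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented income column used only to demonstrate transform primitives
income = pd.Series([85000, 32000, 92000, 48000])

# Each transform consults only this one column, no other feature is involved
log_income    = np.log1p(income)        # log(1 + x) compresses right-skewed scales
rank_income   = income.rank(pct=True)   # percentile rank in (0, 1]
cumsum_income = income.cumsum()         # running total (order-dependent)

print(rank_income.tolist())   # [0.75, 0.25, 1.0, 0.5]
```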

2

Aggregation Primitives

Applied across rows grouped by an entity: mean, sum, count, std, min, max, skewness, number of unique values, most common value. These are the group-based features from Lesson 32, systematically applied to every numeric column and every grouping variable in the dataset.
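
All of these primitives fall out of a single groupby call; a minimal sketch with an invented transactions table:

```python
import pandas as pd

# Invented child table: several transactions per customer
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'amount':      [100.0, 300.0, 50.0, 50.0, 200.0],
})

# One aggregation primitive per output column, evaluated per entity
agg = tx.groupby('customer_id')['amount'].agg(
    ['mean', 'sum', 'count', 'std', 'nunique']
)
print(agg.loc[1, 'mean'])    # 200.0
print(agg.loc[2, 'count'])   # 3
```

An automated tool simply repeats this pattern for every numeric column and every grouping key it can find.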

3

Cross-Column Interaction Primitives

Applied to pairs of columns: addition, subtraction, multiplication, division, absolute difference. These are the ratio and interaction features from Lessons 37 and 38, systematically applied to every valid column pair — which can easily produce thousands of candidates on a wide dataset.

4

Relational Primitives (Multi-Table)

Applied across related tables joined by a foreign key: count of related rows, mean of a child table column aggregated to parent level, most frequent value in child rows per parent entity. This is where Featuretools truly shines — Deep Feature Synthesis traverses the relationship graph to build features a human would need hours to write manually.

Systematic Automated Feature Generation — Single Table

The scenario:

You're a data scientist at a credit company. A flat loan table has four numerical feature columns and a binary default target. Rather than hand-crafting features one by one, you write a systematic automated pipeline that applies every sensible transform and interaction primitive across all column pairs — exactly what Featuretools does internally, but implemented from scratch so you understand every step. You'll then score each candidate feature by its correlation with the target and keep only the top performers.

# Import pandas, numpy, and scipy
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr  # for scoring features against binary target

# Loan dataset — 12 rows, 4 numerical features + binary target
loan_df = pd.DataFrame({
    'income':       [85000, 32000, 92000, 48000, 28000, 110000,
                     31000, 52000, 78000, 29000, 105000, 35000],
    'loan_amount':  [120000, 95000, 130000, 60000, 88000, 150000,
                     96000, 72000, 105000, 91000, 140000, 108000],
    'credit_score': [740, 580, 760, 695, 540, 810, 570, 710, 730, 555, 790, 560],
    'months_employed': [84, 18, 96, 36, 12, 120, 14, 48, 72, 10, 108, 22],
    'default':      [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]   # binary target
})

# Separate features from target
num_cols = ['income', 'loan_amount', 'credit_score', 'months_employed']
y = loan_df['default']

# === TRANSFORM PRIMITIVES: apply to each column individually ===
transforms = {}

for col in num_cols:
    x = loan_df[col]
    transforms[f'log_{col}']    = np.log1p(x)                         # log(1+x)
    transforms[f'sqrt_{col}']   = np.sqrt(x.clip(0))                  # square root
    transforms[f'sq_{col}']     = x ** 2                              # squared
    transforms[f'rank_{col}']   = x.rank(pct=True)                    # percentile rank (0–1)

# === INTERACTION PRIMITIVES: apply to every column pair ===
interactions = {}

for i, col_a in enumerate(num_cols):
    for col_b in num_cols[i+1:]:               # only upper triangle — avoid duplicates
        a, b = loan_df[col_a], loan_df[col_b]
        interactions[f'{col_a}_div_{col_b}'] = a / (b + 1e-9)          # ratio a/b
        interactions[f'{col_b}_div_{col_a}'] = b / (a + 1e-9)          # ratio b/a
        interactions[f'{col_a}_x_{col_b}']   = a * b                   # product
        interactions[f'{col_a}_minus_{col_b}'] = a - b                 # difference

# Combine all candidate features into one DataFrame
all_features = pd.DataFrame({**transforms, **interactions})

print(f"Original feature count:   {len(num_cols)}")
print(f"Candidate features generated: {len(all_features.columns)}")
print("\nFirst 5 transform features:")
print(all_features.iloc[:, :5].round(2).to_string(index=False))
Original feature count:   4
Candidate features generated: 40

First 5 transform features:
 log_income  sqrt_income    sq_income  rank_income  log_loan_amount
      11.35       291.55   7225000000         0.75            11.70
      10.37       178.89   1024000000         0.33            11.46
      11.43       303.32   8464000000         0.83            11.78
      10.78       219.09   2304000000         0.50            11.00
      10.24       167.33    784000000         0.08            11.39
      11.61       331.66  12100000000         1.00            11.92
      10.34       176.07    961000000         0.25            11.47
      10.86       228.04   2704000000         0.58            11.18
      11.26       279.28   6084000000         0.67            11.56
      10.28       170.29    841000000         0.17            11.42
      11.56       324.04  11025000000         0.92            11.85
      10.46       187.08   1225000000         0.42            11.59

What just happened?

Starting from just 4 numerical columns, the automated pipeline generated 40 candidate features — 16 transform features (log, sqrt, square, rank for each column) and 24 interaction features (ratio, inverse ratio, product, and difference for each of the 6 column pairs). This took a few lines of code instead of days of manual work. The rank_income feature maps raw income values to their percentile position (0.08 to 1.00) — a normalisation that often works better than raw values for distance-based models. In a real Featuretools run with 20 columns, the same logic would produce 300–500 candidates automatically.

Scoring and Filtering Candidate Features

The scenario:

Forty candidate features is manageable. Five hundred is not — you can't inspect them all manually. You need an automated scoring and filtering pass that ranks every feature by its correlation with the target, drops near-constant features (which carry no signal), and removes highly correlated pairs (which are redundant). The output is a shortlist of the strongest, non-redundant candidates to carry forward into model training.

# Import pandas, numpy, and scipy
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr

# Rebuild all_features and y from the previous block (40 candidate features)
loan_df = pd.DataFrame({
    'income':          [85000, 32000, 92000, 48000, 28000, 110000,
                        31000, 52000, 78000, 29000, 105000, 35000],
    'loan_amount':     [120000, 95000, 130000, 60000, 88000, 150000,
                        96000, 72000, 105000, 91000, 140000, 108000],
    'credit_score':    [740, 580, 760, 695, 540, 810, 570, 710, 730, 555, 790, 560],
    'months_employed': [84, 18, 96, 36, 12, 120, 14, 48, 72, 10, 108, 22],
    'default':         [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]
})
num_cols = ['income', 'loan_amount', 'credit_score', 'months_employed']
y = loan_df['default']

# Rebuild all_features
transforms    = {}
interactions  = {}
for col in num_cols:
    x = loan_df[col]
    transforms[f'log_{col}']  = np.log1p(x)
    transforms[f'sqrt_{col}'] = np.sqrt(x.clip(0))
    transforms[f'sq_{col}']   = x ** 2
    transforms[f'rank_{col}'] = x.rank(pct=True)
for i, col_a in enumerate(num_cols):
    for col_b in num_cols[i+1:]:
        a, b = loan_df[col_a], loan_df[col_b]
        interactions[f'{col_a}_div_{col_b}']   = a / (b + 1e-9)
        interactions[f'{col_b}_div_{col_a}']   = b / (a + 1e-9)
        interactions[f'{col_a}_x_{col_b}']     = a * b
        interactions[f'{col_a}_minus_{col_b}'] = a - b
all_features = pd.DataFrame({**transforms, **interactions})

# === FILTER 1: Drop near-constant features (std < 1% of mean) ===
# A feature that barely varies across rows carries no information
low_var_mask = all_features.std() < 0.01 * all_features.mean().abs()  # boolean mask
all_features = all_features.loc[:, ~low_var_mask]   # keep only columns that pass
print(f"After variance filter: {all_features.shape[1]} features remain")

# === FILTER 2: Score each feature by point-biserial correlation with target ===
scores = {}
for col in all_features.columns:
    corr, _ = pointbiserialr(all_features[col].fillna(0), y)  # fill NaN with 0 for scoring
    scores[col] = abs(corr)   # store absolute correlation

score_series = pd.Series(scores).sort_values(ascending=False)

# Keep only features with |correlation| > 0.5 — meaningful class separation threshold
strong_features = score_series[score_series > 0.5]
print(f"After correlation filter (|r|>0.5): {len(strong_features)} features remain\n")
print("Top 10 candidate features by |correlation| with default:\n")
print(strong_features.head(10).round(3).to_string())

# === FILTER 3: Remove highly correlated feature pairs (redundancy removal) ===
# Keep only features with |r| < 0.95 to each other
top_features_df = all_features[strong_features.index]           # subset to strong features
corr_matrix     = top_features_df.corr().abs()                  # pairwise correlation matrix
upper_tri       = corr_matrix.where(                            # upper triangle only
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
redundant_cols  = [col for col in upper_tri.columns            # columns with any pair > 0.95
                   if any(upper_tri[col] > 0.95)]
final_features  = top_features_df.drop(columns=redundant_cols)

print(f"\nAfter redundancy filter (|r|<0.95 between features): {final_features.shape[1]} features remain")
print("\nFinal selected feature names:")
for f in final_features.columns:
    print(f"  {f}  (|r| = {score_series[f]:.3f})")
After variance filter: 40 features remain

After correlation filter (|r|>0.5): 14 features remain

Top 10 candidate features by |correlation| with default:

loan_amount_div_income              0.957
income_div_loan_amount              0.957
rank_income                         0.951
log_income                          0.940
sqrt_income                         0.940
income_div_credit_score             0.932
income_div_months_employed          0.932
rank_months_employed                0.924
log_months_employed                 0.924
sqrt_months_employed                0.924

After redundancy filter (|r|<0.95 between features): 4 features remain

Final selected feature names:
  loan_amount_div_income       (|r| = 0.957)
  rank_income                  (|r| = 0.951)
  income_div_credit_score      (|r| = 0.932)
  rank_months_employed         (|r| = 0.924)

What just happened?

The three-stage filter collapsed 40 candidates down to 4 genuinely useful, non-redundant features. The top performer — loan_amount_div_income — is the loan-to-income ratio we hand-crafted in Lesson 38, now discovered automatically with a correlation of 0.957. Crucially, income_div_loan_amount (the inverse) was equally strong at 0.957 but was removed by the redundancy filter — correctly, since both carry the same information. The final 4 features include the ratio, two rank-transformed features, and a normalised income-to-credit-score ratio — a diverse, non-redundant set that a human engineer would likely not have assembled in this exact combination.

Featuretools and Deep Feature Synthesis — The Production Approach

Featuretools is the most widely used automated feature engineering library in Python. Its core algorithm — Deep Feature Synthesis (DFS) — traverses a defined entity relationship graph (a set of tables connected by foreign keys) and systematically applies aggregation and transform primitives at each level. The "deep" in Deep Feature Synthesis refers to stacking — a feature computed from a child table can itself be aggregated into a grandparent table.

The scenario:

You're preparing a production Featuretools run for the credit model. The code below shows the canonical Featuretools pattern — defining entities, relationships, and running DFS — using the correct API so you can drop it directly into a real project. A comment explains what each step produces.

# Install featuretools if needed: pip install featuretools
# import featuretools as ft  ← uncomment when running in your environment
import pandas as pd
import numpy as np

# ─── FEATURETOOLS PATTERN (shown as annotated pseudocode + runnable pandas equivalent) ─────

# In a real Featuretools run you would do this:
#
#   import featuretools as ft
#
#   # Step 1: Create an EntitySet — the container for all your tables
#   es = ft.EntitySet(id='credit_data')
#
#   # Step 2: Add the customers table as an entity
#   es = es.add_dataframe(
#       dataframe_name='customers',
#       dataframe=customers_df,
#       index='customer_id'            # primary key of this entity
#   )
#
#   # Step 3: Add the loans table as a child entity
#   es = es.add_dataframe(
#       dataframe_name='loans',
#       dataframe=loans_df,
#       index='loan_id',
#       logical_types={'application_date': 'Datetime'}   # Woodwork logical type
#   )
#
#   # Step 4: Define the relationship between tables
#   es = es.add_relationship('customers', 'customer_id', 'loans', 'customer_id')
#
#   # Step 5: Run Deep Feature Synthesis
#   # agg_primitives: mean, std, count, max, min applied across loans per customer
#   # trans_primitives: natural_logarithm, year, month, day applied to loan-level columns
#   # max_depth=2 allows stacking — features of features
#   feature_matrix, feature_defs = ft.dfs(
#       entityset=es,
#       target_dataframe_name='customers',
#       agg_primitives=['mean', 'std', 'count', 'max', 'min'],
#       trans_primitives=['natural_logarithm', 'year', 'month', 'day'],
#       max_depth=2
#   )
#
#   # feature_matrix is a pandas DataFrame of all generated features
#   # feature_defs is a list of FeatureBase objects describing each feature
#   # Example generated feature names:
#   #   MEAN(loans.loan_amount)           — mean loan amount per customer
#   #   STD(loans.credit_score)           — std of credit scores across customer's loans
#   #   COUNT(loans)                      — number of loans per customer
#   #   MEAN(loans.NATURAL_LOGARITHM(loan_amount)) — depth-2 stacked feature (transform, then aggregate)

# ─── PANDAS EQUIVALENT: simulate what DFS would produce on two related tables ───────────

# Parent table: one row per customer
customers_df = pd.DataFrame({
    'customer_id':  [1, 2, 3, 4],
    'customer_age': [34, 52, 28, 45],    # customer demographics
    'region':       ['North','South','North','East']
})

# Child table: multiple loans per customer
loans_df = pd.DataFrame({
    'loan_id':      [101, 102, 103, 104, 105, 106, 107, 108],
    'customer_id':  [1, 1, 2, 2, 3, 3, 4, 4],      # foreign key to customers
    'loan_amount':  [50000, 80000, 120000, 95000, 30000, 45000, 200000, 175000],
    'credit_score': [720, 690, 580, 610, 760, 740, 810, 800],
    'default':      [0, 0, 1, 0, 0, 0, 0, 0]
})

# DFS-style aggregations: aggregate child table up to parent level
agg = loans_df.groupby('customer_id').agg(
    COUNT_loans           = ('loan_id',     'count'),      # number of loans per customer
    MEAN_loan_amount      = ('loan_amount', 'mean'),       # mean loan amount
    STD_loan_amount       = ('loan_amount', 'std'),        # std of loan amounts
    MAX_loan_amount       = ('loan_amount', 'max'),        # largest single loan
    MEAN_credit_score     = ('credit_score','mean'),       # mean credit score across loans
    MIN_credit_score      = ('credit_score','min'),        # worst credit score on record
    SUM_defaults          = ('default',     'sum')         # total defaults on record
).reset_index()

# Depth-2-style stacked feature: ratio of two aggregates (a feature built from features)
agg['MEAN_loan_div_MEAN_credit'] = agg['MEAN_loan_amount'] / (agg['MEAN_credit_score'] + 1e-9)

# Merge back to customer-level table
result = customers_df.merge(agg, on='customer_id', how='left')

print("DFS-equivalent feature matrix (customer level):")
print(result.round(2).to_string(index=False))
DFS-equivalent feature matrix (customer level):
 customer_id  customer_age region  COUNT_loans  MEAN_loan_amount  STD_loan_amount  MAX_loan_amount  MEAN_credit_score  MIN_credit_score  SUM_defaults  MEAN_loan_div_MEAN_credit
           1            34  North            2           65000.0         21213.20            80000              705.0               690             0                      92.20
           2            52  South            2          107500.0         17677.67           120000              595.0               580             1                     180.67
           3            28  North            2           37500.0         10606.60            45000              750.0               740             0                      50.00
           4            45   East            2          187500.0         17677.67           200000              805.0               800             0                     232.92

What just happened?

The DFS simulation aggregated the child loans table up to the parent customers table using six aggregation primitives (count, mean, std, max, min, sum), then produced a depth-2 stacked feature — MEAN_loan_div_MEAN_credit. Customer 2 immediately stands out: they have SUM_defaults=1, the lowest MIN_credit_score of 580, and a high MEAN_loan_div_MEAN_credit of 180.67 despite mid-sized loans. In a real Featuretools run on multiple related tables, the same DFS logic would automatically generate hundreds of such cross-table features from a single ft.dfs() call.

The Automated FE Workflow — End to End

Automated feature engineering is a five-step process, not a one-step magic button:

1

Define the entity schema

Identify all related tables, their primary keys, and foreign key relationships. Define which table is the prediction target. This step requires domain knowledge — the machine cannot infer meaningful entity relationships from data alone.

2

Run DFS / primitive application

Generate all candidate features using the chosen primitives and depth. Start with max_depth=1 on a large dataset — depth 2 can produce thousands of features that take hours to compute.

3

Filter: variance, correlation, redundancy

Apply the three-stage filter from the second code block. Drop near-constants, score by target correlation, remove highly correlated pairs. Aim to reduce candidates by at least 80% before model training.

4

Validate for leakage

Review every surviving feature and ask: could this value be available at prediction time? Automated tools have no concept of temporal ordering or business logic — leaky features will pass correlation checks with flying colours.

5

Train with feature importance feedback

Train a model on the filtered set, inspect feature importance, and iterate. Features that score high in automated selection but low in model importance are candidates for removal. Features that the automated pipeline missed — because they require domain knowledge — should be added manually.
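
A minimal sketch of this feedback step, assuming scikit-learn is available; the column names and synthetic data below are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in for a filtered feature matrix: one real signal, one noise column
X = pd.DataFrame({
    'loan_to_income': rng.normal(1.5, 0.5, n),   # pretend this drives default risk
    'noise_feature':  rng.normal(0.0, 1.0, n),   # survived filtering by chance
})
y = (X['loan_to_income'] > 1.5).astype(int)      # target depends only on the signal

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))

# Low-importance survivors like noise_feature become removal candidates;
# domain features the pipeline missed get added back by hand
```

These are impurity-based importances; Lesson 43 covers permutation importance and SHAP, which are more robust for this kind of pruning.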

Teacher's Note

The single most dangerous failure mode of automated feature engineering is leakage through interaction features. When you compute column_A / column_B for thousands of pairs, some of those ratios will accidentally encode the target — especially if any column is causally downstream of the outcome. A ratio that involves a post-outcome measurement will score extremely high on correlation, pass every automated filter, and produce a model that fails immediately in production. Always keep a domain expert in the loop to review the surviving feature list before training, no matter how automated the pipeline is. Automation speeds up feature generation; it cannot replace human judgment on feature validity.
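
To make this failure mode concrete, here is a tiny invented illustration: a hypothetical recovery_fee column, which only exists after a default has happened, sails straight through the correlation filter used earlier in this lesson:

```python
import pandas as pd
from scipy.stats import pointbiserialr

default = pd.Series([0, 1, 0, 0, 1, 0, 1, 0])

# Hypothetical post-outcome column: recovery fees are only charged AFTER a default
recovery_fee = pd.Series([0.0, 450.0, 0.0, 0.0, 600.0, 0.0, 380.0, 0.0])
income       = pd.Series([85000.0, 32000.0, 92000.0, 48000.0,
                          28000.0, 110000.0, 31000.0, 52000.0])

# An automated interaction pass would happily generate this ratio...
leaky_ratio = recovery_fee / (income + 1e-9)
corr, _ = pointbiserialr(leaky_ratio, default)
print(f"|r| with default: {abs(corr):.3f}")   # far above the 0.5 filter threshold

# ...but recovery_fee cannot exist at prediction time: this is leakage, not signal
```

No statistical filter can catch this. Only the knowledge that recovery fees are charged after the outcome reveals the leak, which is exactly why the human review step exists.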

Practice Questions

1. The core algorithm in Featuretools that traverses a table relationship graph and systematically applies aggregation and transform primitives at each level is called ________ ________ ________.



2. After automated feature generation and filtering, the most critical manual review step is checking every surviving feature for ________ — ensuring its value would genuinely be available at prediction time.



3. In the three-stage filtering pipeline, the final step removes highly correlated feature pairs (|r| > 0.95) to eliminate ________ — features that carry the same information as another feature already in the set.



Quiz

1. Why must a domain expert review the output of automated feature engineering before model training, even after all filters have been applied?


2. In a multi-table Featuretools setup, which type of primitive creates features like MEAN(loans.loan_amount) and COUNT(loans) at the customer level?


3. In the scoring and filtering example, which automatically generated feature had the highest correlation with the default target?


Up Next · Lesson 43

ML-Based Feature Selection

Permutation importance, SHAP values, and recursive feature elimination — using models to decide which features your models actually need.