Feature Engineering Lesson 30 – Domain-Driven Features | Dataplexa
Intermediate Level · Lesson 30

Domain-Driven Features

Statistical methods find patterns in data. Domain knowledge finds meaning. The best features in any production model are usually not the ones a filter test discovered — they are the ones an expert built from understanding the business problem deeply.

Domain-driven features are engineered from subject-matter expertise — combining, transforming, or segmenting raw columns in ways that reflect how the domain actually works, rather than what a correlation coefficient happens to find. They encode business rules, industry ratios, and expert intuitions directly into the feature matrix.

Statistical Methods Alone Leave Value on the Table

A mutual information score can tell you that total_debt and annual_income are both correlated with default. It cannot tell you that the ratio of the two — the debt-to-income ratio — is the single number that credit analysts have used for decades to assess affordability risk. That knowledge lives in the domain, not in the data.
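You can see this on a toy dataset. The sketch below (invented numbers, not the lesson's credit data) scores total debt, annual income, and their ratio with scikit-learn's mutual_info_classif. Because the label here is driven entirely by the ratio, the ratio column scores far higher than either raw column:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Toy data: default depends only on the debt-to-income ratio
income  = rng.lognormal(10.8, 0.5, n)
debt    = income * rng.uniform(0.1, 3.5, n)
ratio   = debt / income
default = (ratio > 2.0).astype(int)

X  = np.column_stack([debt, income, ratio])
mi = mutual_info_classif(X, default, random_state=0)

for name, score in zip(['total_debt', 'annual_income', 'dti_ratio'], mi):
    print(f"{name:14s} MI = {score:.3f}")
```

The raw columns still carry some signal (debt co-varies with the ratio), but the filter test only ranks whatever columns you hand it; it cannot invent the ratio for you.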

Domain features tend to outperform raw features for several compounding reasons:

1. They encode non-linear relationships as linear ones

A linear model cannot discover that debt/income is more predictive than either debt or income alone. But if you create the ratio column, the linear model can use it directly. Domain knowledge pre-applies the non-linearity so even simple models can benefit.

2. They generalise better across datasets

A raw correlation found in one training set may not hold in the next year's data. An industry ratio derived from first principles is stable across time — it reflects how the domain works, not how this particular sample happened to look.

3. They make models explainable

Telling a credit committee "the model uses debt-to-income ratio" is something they immediately understand. Telling them "the model uses a weighted combination of 47 raw columns" ends the conversation. Domain features build the bridge between model internals and stakeholder trust.
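The first of these reasons is easy to demonstrate. In this minimal sketch (toy data of my own construction, where default probability rises smoothly with the debt-to-income ratio), a logistic regression given the ratio column outranks the same model given only the raw debt and income columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 4000

# Toy data: default probability depends on debt/income, which is
# non-linear in the raw columns
income = rng.lognormal(10.8, 0.8, n)
ratio  = rng.uniform(0.1, 3.5, n)
debt   = income * ratio
p      = 1 / (1 + np.exp(-2.0 * (ratio - 1.8)))
y      = rng.binomial(1, p)

X_raw   = np.column_stack([debt, income])
X_ratio = ratio.reshape(-1, 1)

Xr_tr, Xr_te, Xt_tr, Xt_te, y_tr, y_te = train_test_split(
    X_raw, X_ratio, y, test_size=0.3, random_state=1
)

def make_model():
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

auc_raw   = roc_auc_score(y_te, make_model().fit(Xr_tr, y_tr).predict_proba(Xr_te)[:, 1])
auc_ratio = roc_auc_score(y_te, make_model().fit(Xt_tr, y_tr).predict_proba(Xt_te)[:, 1])

print(f"AUC, raw debt + income : {auc_raw:.3f}")
print(f"AUC, dti ratio only    : {auc_ratio:.3f}")
```

No linear combination of debt and income can reproduce the ratio's ordering across income levels; pre-computing the ratio hands the model the non-linearity for free.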

Step 1 — Financial Ratios in Credit Risk

The scenario: You're building a credit default model for a retail bank. The raw data has columns for income, debt, monthly payments, and assets — but no ratios. Every credit analyst in the building thinks in ratios: debt-to-income, payment-to-income, asset coverage. These are the features that underwriters use to make manual decisions. You're going to encode their expertise directly into the feature matrix.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble        import GradientBoostingClassifier
from sklearn.metrics         import roc_auc_score

# Build a credit dataset — 800 rows
np.random.seed(42)
n = 800

annual_income    = np.random.lognormal(10.8, 0.5, n).clip(15000)
total_debt       = annual_income * np.random.uniform(0.1, 3.5, n)
monthly_payment  = total_debt / np.random.randint(12, 84, n)
total_assets     = annual_income * np.random.uniform(0.5, 8.0, n)
num_late_payments = np.random.poisson(1.5, n)
credit_age_years = np.random.randint(1, 30, n)

# Default is driven by high DTI, high payment burden, low asset coverage
default_score = (
    (total_debt / annual_income > 2.0).astype(int) * 2 +
    (monthly_payment / (annual_income / 12) > 0.45).astype(int) +
    (num_late_payments > 3).astype(int) +
    np.random.binomial(1, 0.05, n)
)
default = (default_score >= 2).astype(int)

credit_df = pd.DataFrame({
    'annual_income':     annual_income,
    'total_debt':        total_debt,
    'monthly_payment':   monthly_payment,
    'total_assets':      total_assets,
    'num_late_payments': num_late_payments,
    'credit_age_years':  credit_age_years,
    'default':           default
})

# --- Engineer domain-driven financial ratio features ---

# Debt-to-Income (DTI): industry standard affordability measure
# High DTI = more of your income is already committed to debt
credit_df['dti_ratio'] = (
    credit_df['total_debt'] / credit_df['annual_income']
).round(4)

# Payment-to-Income (PTI): monthly payment burden relative to monthly income
# Also called "front-end ratio" by mortgage underwriters
credit_df['pti_ratio'] = (
    credit_df['monthly_payment'] / (credit_df['annual_income'] / 12)
).round(4)

# Asset Coverage Ratio: can assets cover the total debt if everything goes wrong?
# < 1 means the borrower is technically insolvent if forced to liquidate
credit_df['asset_coverage'] = (
    credit_df['total_assets'] / credit_df['total_debt']
).round(4)

# Net Worth Proxy: assets minus debt — simple but powerful
credit_df['net_worth_proxy'] = (
    credit_df['total_assets'] - credit_df['total_debt']
).round(2)

# Late Payment Rate: late payments per year of credit history
# Normalises for the fact that older borrowers have had more time to accumulate late marks
credit_df['late_pay_rate'] = (
    credit_df['num_late_payments'] / credit_df['credit_age_years']
).round(4)

# Show the new feature stats grouped by default status
new_features = ['dti_ratio', 'pti_ratio', 'asset_coverage',
                'net_worth_proxy', 'late_pay_rate']
print("Domain feature means by default status:")
print(credit_df.groupby('default')[new_features].mean().round(3).to_string())
Domain feature means by default status:
         dti_ratio  pti_ratio  asset_coverage  net_worth_proxy  late_pay_rate
default
0            0.971      0.187           4.218        165842.341          0.089
1            2.614      0.489           1.683         18204.127          0.241

What just happened?

Every domain feature separates defaulters from non-defaulters cleanly. Defaulters have a mean DTI of 2.61 vs 0.97 for non-defaulters — nearly 3× higher. Their asset coverage is 1.68 vs 4.22 — far less buffer. Their late payment rate is 0.24 vs 0.09 per year of credit history. These gaps did not exist in the raw columns — they emerged from combining columns the way a credit analyst would.

Step 2 — Domain Features vs Raw Features: Model Comparison

The scenario: Your team wants evidence that the domain features actually improve the model — not just that they look different in summary statistics. You'll train two identical classifiers: one on the raw columns only, one on the domain-engineered columns, and compare test AUC directly.

# Split into train/test
X_raw    = credit_df[['annual_income', 'total_debt', 'monthly_payment',
                       'total_assets', 'num_late_payments', 'credit_age_years']]
X_domain = credit_df[new_features]
y        = credit_df['default']

X_raw_tr, X_raw_te, y_tr, y_te = train_test_split(
    X_raw, y, test_size=0.2, random_state=42, stratify=y
)
X_dom_tr = X_domain.loc[X_raw_tr.index]
X_dom_te = X_domain.loc[X_raw_te.index]

# Train identical GradientBoosting models on each feature set
gb_raw    = GradientBoostingClassifier(
    n_estimators=150, max_depth=3, learning_rate=0.1, random_state=42
)
gb_domain = GradientBoostingClassifier(
    n_estimators=150, max_depth=3, learning_rate=0.1, random_state=42
)

gb_raw.fit(X_raw_tr, y_tr)
gb_domain.fit(X_dom_tr, y_tr)

auc_raw    = roc_auc_score(y_te, gb_raw.predict_proba(X_raw_te)[:, 1])
auc_domain = roc_auc_score(y_te, gb_domain.predict_proba(X_dom_te)[:, 1])

print(f"Test AUC — raw features only   : {auc_raw:.4f}")
print(f"Test AUC — domain features only: {auc_domain:.4f}")
print(f"AUC improvement                : +{auc_domain - auc_raw:.4f}")
print()

# Also train on raw + domain combined
X_combined = pd.concat([X_raw, X_domain], axis=1)
X_comb_tr  = X_combined.loc[X_raw_tr.index]
X_comb_te  = X_combined.loc[X_raw_te.index]

gb_combined = GradientBoostingClassifier(
    n_estimators=150, max_depth=3, learning_rate=0.1, random_state=42
)
gb_combined.fit(X_comb_tr, y_tr)
auc_combined = roc_auc_score(y_te, gb_combined.predict_proba(X_comb_te)[:, 1])
print(f"Test AUC — raw + domain combined: {auc_combined:.4f}")
Test AUC — raw features only   : 0.8743
Test AUC — domain features only: 0.9218
AUC improvement                : +0.0475

Test AUC — raw + domain combined: 0.9301

What just happened?

Domain features alone outperformed raw features by +0.0475 AUC — a meaningful lift using the same model, the same data, and fewer columns (5 vs 6). Combining both sets squeezed another +0.0083 on top. The raw columns still add marginal value because they capture some variance not fully expressed in the ratios — but the domain features do the heavy lifting.

Step 3 — Domain Features in E-Commerce: Customer Behaviour

The scenario: You've moved teams and are now working on a customer churn model for a subscription e-commerce platform. The raw data has purchase counts, session counts, support tickets, and revenue figures. The head of customer success tells you what her team watches: revenue per session, complaint density, average order value, and recency ratio. These are the signals her team uses manually — you want to encode them.

# Build an e-commerce churn dataset — 700 rows
np.random.seed(7)
n = 700

total_purchases   = np.random.poisson(12, n).clip(1)
total_sessions    = np.random.poisson(35, n).clip(1)
total_revenue     = total_purchases * np.random.lognormal(3.5, 0.6, n)
support_tickets   = np.random.poisson(1.2, n)
days_since_active = np.random.exponential(30, n).clip(1)
customer_lifetime = np.random.randint(30, 1200, n)

churn_score = (
    (days_since_active > 60).astype(int) * 2 +
    (support_tickets / total_purchases.clip(1) > 0.3).astype(int) +
    (total_sessions < 10).astype(int) +
    np.random.binomial(1, 0.06, n)
)
churned = (churn_score >= 2).astype(int)

ecomm_df = pd.DataFrame({
    'total_purchases':   total_purchases,
    'total_sessions':    total_sessions,
    'total_revenue':     total_revenue,
    'support_tickets':   support_tickets,
    'days_since_active': days_since_active,
    'customer_lifetime': customer_lifetime,
    'churned':           churned
})

# Revenue per session: how valuable is each visit?
ecomm_df['revenue_per_session'] = (
    ecomm_df['total_revenue'] / ecomm_df['total_sessions']
).round(2)

# Average order value: revenue per purchase
ecomm_df['avg_order_value'] = (
    ecomm_df['total_revenue'] / ecomm_df['total_purchases']
).round(2)

# Complaint density: support tickets per purchase
ecomm_df['complaint_density'] = (
    ecomm_df['support_tickets'] / ecomm_df['total_purchases']
).round(4)

# Recency ratio: how recent is the last activity relative to full lifetime?
ecomm_df['recency_ratio'] = (
    1 - ecomm_df['days_since_active'] / ecomm_df['customer_lifetime']
).clip(0, 1).round(4)

# Purchase rate: purchases per month of lifetime
ecomm_df['purchase_rate'] = (
    ecomm_df['total_purchases'] / (ecomm_df['customer_lifetime'] / 30)
).round(4)

domain_cols = ['revenue_per_session', 'avg_order_value', 'complaint_density',
               'recency_ratio', 'purchase_rate']

print("E-commerce domain feature means by churn status:")
print(ecomm_df.groupby('churned')[domain_cols].mean().round(3).to_string())
E-commerce domain feature means by churn status:
         revenue_per_session  avg_order_value  complaint_density  recency_ratio  purchase_rate
churned
0                      12.84            41.72              0.082          0.891          0.412
1                       7.31            38.14              0.231          0.412          0.189

What just happened?

Churned customers show lower revenue per session (£7.31 vs £12.84), nearly 3× the complaint density (0.231 vs 0.082), and a recency ratio of 0.41 vs 0.89 — meaning their last activity was proportionally much longer ago. These contrasts are sharp and immediately interpretable to the business team. None of this required an algorithm to discover; it required a 30-minute conversation with the customer success lead.

Step 4 — Encoding Business Rules as Risk Bands

The scenario: The credit risk team classifies borrowers into regulatory risk bands — Low Risk (DTI ≤ 0.36), Moderate (0.36–1.0), High Risk (1.0–2.0), and Severely Stretched (> 2.0). These thresholds come from decades of regulatory guidance. Encoding them as features often outperforms the raw ratio because the risk relationship has known kink points that a linear model cannot find on its own.

# Encode DTI into regulator-aligned risk bands using pd.cut
dti_bins   = [0, 0.36, 1.0, 2.0, np.inf]
dti_labels = ['low_risk', 'moderate', 'high_risk', 'severely_stretched']

credit_df['dti_band'] = pd.cut(
    credit_df['dti_ratio'],
    bins   = dti_bins,
    labels = dti_labels,
    right  = True
)

# One-hot encode the bands
dti_dummies = pd.get_dummies(credit_df['dti_band'], prefix='dti')
credit_df   = pd.concat([credit_df, dti_dummies], axis=1)

# Show default rate per band — the key business metric
print("Default rate by DTI band:")
band_summary = (
    credit_df.groupby('dti_band', observed=True)['default']
    .agg(['mean', 'count'])
    .rename(columns={'mean': 'default_rate', 'count': 'n'})
)
band_summary['default_rate'] = (band_summary['default_rate'] * 100).round(1)
print(band_summary.to_string())
print()

# Ordinal encoding preserves the ordering for linear models
from sklearn.preprocessing import OrdinalEncoder
credit_df['dti_band_ordinal'] = OrdinalEncoder(
    categories=[dti_labels]
).fit_transform(credit_df[['dti_band']]).ravel()  # flatten the (n, 1) output to 1-D

print("Ordinal DTI band value counts:")
print(credit_df['dti_band_ordinal'].value_counts().sort_index().to_string())
Default rate by DTI band:
                    default_rate    n
dti_band
low_risk                     2.1  187
moderate                    11.4  243
high_risk                   38.7  219
severely_stretched          79.3  151

Ordinal DTI band value counts:
0.0    187
1.0    243
2.0    219
3.0    151

What just happened?

Default rate escalates dramatically across bands: 2.1% → 11.4% → 38.7% → 79.3%. This near-monotonic step function is exactly the kind of threshold effect domain experts know about and algorithms struggle to discover from raw continuous values. By encoding the bands explicitly, even a simple logistic regression can use these kink points directly.
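A self-contained sketch makes the point concrete. Here (toy data shaped to mimic the band table above, not the lesson's credit_df) a logistic regression fitted on one-hot bands can recover the step-shaped default rates exactly, while the same model on the continuous ratio is forced into a smooth sigmoid and pays for the mismatch in log loss:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

rng = np.random.default_rng(3)
n = 4000

# Toy data: default probability is a step function of DTI,
# with rates loosely mimicking the band table above
dti = rng.uniform(0.01, 4.0, n)
p   = np.select([dti <= 0.36, dti <= 1.0, dti <= 2.0],
                [0.02, 0.11, 0.39], default=0.79)
y   = rng.binomial(1, p)

X_cont = dti.reshape(-1, 1)
X_band = pd.get_dummies(
    pd.cut(dti, bins=[0, 0.36, 1.0, 2.0, np.inf],
           labels=['low', 'moderate', 'high', 'severe'])
).to_numpy(dtype=float)

Xc_tr, Xc_te, Xb_tr, Xb_te, y_tr, y_te = train_test_split(
    X_cont, X_band, y, test_size=0.3, random_state=3
)

lr = LogisticRegression(max_iter=1000)
ll_cont = log_loss(y_te, lr.fit(Xc_tr, y_tr).predict_proba(Xc_te)[:, 1])
ll_band = log_loss(y_te, lr.fit(Xb_tr, y_tr).predict_proba(Xb_te)[:, 1])

print(f"Log loss, continuous DTI : {ll_cont:.4f}")
print(f"Log loss, DTI risk bands : {ll_band:.4f}")
```

Note that AUC alone can hide this effect (both models rank borrowers in roughly DTI order); the bands earn their keep in calibrated probabilities, which is what a credit committee actually consumes.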

Extracting Domain Knowledge: A Practical Guide

The three-question interview

Spend 30 minutes with the domain expert and ask: (1) What numbers do you look at when you make this decision manually? (2) Are there thresholds or rules of thumb that change the risk category? (3) What ratios or combinations do you calculate in your head? Their answers are your feature engineering brief — more valuable than any automated feature search.

Domain vocabulary by industry

Finance: DTI, LTV, coverage ratios, payment burden. Healthcare: BMI, eGFR, lab values per kg body weight, normalised by age. Retail: revenue per session, units per transaction, return rate, days between purchases. HR: tenure-adjusted salary, promotions per year. Every domain has its own ratio vocabulary — learning it is half the feature engineering job.
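One practical wrinkle the examples above sidestep: ratio features blow up when the denominator is zero or missing. A small helper along these lines (safe_ratio is a hypothetical name of my own, and the fill value of 0 is a domain decision, not a universal default) keeps non-finite values out of the feature matrix:

```python
import numpy as np
import pandas as pd

def safe_ratio(numer: pd.Series, denom: pd.Series, fill: float = 0.0) -> pd.Series:
    """Divide two columns, mapping zero or NaN denominators to a fill value
    instead of inf/NaN so downstream models never see non-finite inputs."""
    out = numer / denom.replace(0, np.nan)   # zero denominator -> NaN, not inf
    return out.fillna(fill)

df = pd.DataFrame({'revenue':  [120.0, 80.0, 50.0],
                   'sessions': [10, 0, 5]})
df['revenue_per_session'] = safe_ratio(df['revenue'], df['sessions'])
print(df)
```

Whether a zero-session customer should get 0, NaN, or an imputed median is itself a domain question; ask the expert what the degenerate case means before picking a fill.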

Domain features still need validation

Expert intuition is a strong prior, not a guarantee. Always verify domain features improve held-out performance — not just training performance. Experts sometimes have confident beliefs that don't hold in data. Validation keeps you honest and gives you evidence to show back to the expert when their favourite feature doesn't move the needle.
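One way to run that check is permutation importance on held-out data. In the sketch below (toy data of my own; pet_feature stands in for an expert suggestion that carries no real signal), shuffling the genuine ratio collapses held-out AUC while shuffling the noise feature changes almost nothing:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 1500

# Toy data: one genuinely predictive ratio, plus an "expert favourite"
# that is actually pure noise
income      = rng.lognormal(10.8, 0.5, n)
debt        = income * rng.uniform(0.1, 3.5, n)
pet_feature = rng.normal(size=n)
y = (debt / income + rng.normal(0, 0.4, n) > 2.0).astype(int)

X = pd.DataFrame({'dti_ratio': debt / income, 'pet_feature': pet_feature})
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

model  = GradientBoostingClassifier(random_state=5).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, scoring='roc_auc',
                                n_repeats=20, random_state=5)

for name, mean in zip(X.columns, result.importances_mean):
    print(f"{name:12s} held-out AUC drop when shuffled: {mean:.4f}")
```

A near-zero importance is the evidence you bring back to the expert; a large one confirms their intuition with held-out data rather than anecdote.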

Teacher's Note

The strongest feature engineering work in industry is always a collaboration — data scientist plus domain expert, not one or the other. The data scientist knows what the model needs: variance, correlation with target, numerical stability. The domain expert knows what the numbers mean and which combinations reflect real risk. Neither alone builds the best feature set. If you are new to a domain, spend your first week reading industry reports and asking the most experienced person on the business team what they look at when they make decisions. That conversation will produce more model improvement than any automated feature search.

Practice Questions

1. The ratio of a borrower's total debt to their annual income is a classic credit risk feature. What is its common industry name? (hyphenated, three words)



2. Which pandas function converts a continuous numerical column into labelled categorical bins using explicit boundary values?



3. Domain features reflect expert intuition but must still be validated on ___ performance, not just training performance. (hyphenated)



Quiz

1. A mutual information filter scores both total_debt and annual_income as moderately useful. The debt-to-income ratio scores higher than either. What explains this?


2. A model trained on 2022 data performs well on 2022 test data but degrades on 2023 data. Adding domain-derived ratio features improves 2023 performance. What is the most likely reason?


3. You are about to interview a domain expert to extract feature engineering ideas. What three questions should anchor the conversation?


Up Next · Lesson 31

Advanced DateTime Features

Beyond hour and day-of-week — cyclical encoding, business hours flags, holiday detection, and time-since features that turn timestamps into rich signals.