EDA Course
Domain-Driven EDA
Two analysts can look at the same dataset and see completely different things. The one who knows the industry will ask better questions, flag the right anomalies, and build features that actually work. This lesson is about the difference between running analysis and understanding what you're analysing.
What Domain Knowledge Actually Changes
Generic EDA treats every dataset the same way — check for nulls, look at distributions, compute correlations. Domain-driven EDA starts with a different question: "What would a 10-year industry veteran look at first in this data?"
Domain knowledge changes three things specifically:
Which metrics to look at first
A retail analyst knows gross margin matters more than revenue. A healthcare analyst knows readmission rate matters more than average length of stay. Domain knowledge sets your priority order.
Which patterns are suspicious vs expected
A 30% spike in insurance claims in January is suspicious to a statistician. To an insurance analyst it's expected — it's when people make claims from holiday accidents. Context changes everything.
Which features to engineer
A banking analyst knows that the ratio of credit utilisation to income is more predictive of default than either number alone. That insight comes from years of domain experience — not from correlation tables.
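The "suspicious vs expected" point above can be sketched with code. A minimal, hypothetical example (all numbers invented for illustration): judge a January claims count against previous Januaries, not against the overall monthly mean.

```python
import pandas as pd

# Hypothetical monthly insurance claim counts over three years,
# with the usual January spike from holiday accidents
claims = pd.DataFrame({
    'year': [2021] * 12 + [2022] * 12 + [2023] * 12,
    'month': list(range(1, 13)) * 3,
    'n_claims': [130, 95, 100, 98, 102, 97, 99, 101, 96, 100, 103, 110] * 3,
})

jan_2023 = claims.query('year == 2023 and month == 1')['n_claims'].iloc[0]
overall_mean = claims['n_claims'].mean()                        # naive baseline
jan_baseline = claims[claims['month'] == 1]['n_claims'].mean()  # domain baseline

print(f"Jan 2023: {jan_2023} claims")
print(f"vs overall monthly mean: {jan_2023 / overall_mean - 1:+.0%}")  # looks like a spike
print(f"vs previous Januaries:   {jan_2023 / jan_baseline - 1:+.0%}")  # exactly on trend
```

Against the global mean, January reads as a roughly +27% anomaly; against its own seasonal baseline it is exactly on trend. The statistician's outlier is the domain expert's normal.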
The Dataset We'll Use
The scenario: You've just joined the data team at a retail bank. Your first assignment: analyse a credit card portfolio dataset before the risk team builds a default prediction model. Your manager — a 15-year banking veteran — hands you the data and says: "Run your usual EDA, but think like a credit analyst, not a data scientist. The numbers will tell you a story if you know what questions to ask." You have data on 14 cardholders and need to find the risk signals before they do.
import pandas as pd
import numpy as np
# Credit card portfolio — 14 cardholders
df = pd.DataFrame({
'customer_id': range(1001, 1015),
'age': [28, 45, 32, 61, 38, 52, 24, 47, 35, 58, 29, 43, 55, 31],
'annual_income': [32000, 78000, 45000, 95000, 61000, 84000, 28000, 71000,
52000, 88000, 31000, 67000, 91000, 42000],
'credit_limit': [3000, 12000, 5000, 18000, 8000, 15000, 2500, 11000,
7000, 16000, 2800, 10000, 17000, 4500 ],
'current_balance': [2850, 2100, 4800, 1200, 7900, 3000, 2400, 500,
6800, 800, 2750, 4200, 1100, 4300 ],
'missed_payments': [2, 0, 3, 0, 4, 0, 2, 0,
5, 0, 1, 2, 0, 3 ],
'months_as_customer':[6, 84, 24, 120, 36, 72, 3, 60,
18, 96, 8, 48, 108, 12 ],
'defaulted': [0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1 ] # target: 1 = defaulted
})
print(df.to_string(index=False))
customer_id age annual_income credit_limit current_balance missed_payments months_as_customer defaulted
1001 28 32000 3000 2850 2 6 0
1002 45 78000 12000 2100 0 84 0
1003 32 45000 5000 4800 3 24 1
1004 61 95000 18000 1200 0 120 0
1005 38 61000 8000 7900 4 36 1
1006 52 84000 15000 3000 0 72 0
1007 24 28000 2500 2400 2 3 0
1008 47 71000 11000 500 0 60 0
1009 35 52000 7000 6800 5 18 1
1010 58 88000 16000 800 0 96 0
1011 29 31000 2800 2750 1 8 0
1012 43 67000 10000 4200 2 48 1
1013 55 91000 17000 1100 0 108 0
1014 31 42000 4500 4300 3 12 1
What just happened?
A pure statistician sees 7 numeric columns and a target. A banking analyst sees something different: the ratio of current_balance to credit_limit (utilisation rate) is the most important risk signal in credit, not any individual number. The missed_payments column is a behavioural signal — not just a count, but a pattern of customer reliability. Months as customer is a proxy for relationship depth. The domain changes what you look at first.
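The balance-versus-limit point is easy to see with two hypothetical customers carrying the same balance:

```python
# Same balance, very different situations: a balance only carries
# meaning relative to the credit limit behind it
balance = 7900
for limit in (8000, 50000):
    print(f"£{balance:,} on a £{limit:,} limit -> utilisation {balance / limit:.0%}")
```

The first customer is at about 99% utilisation and in real trouble; the second is at about 16% and entirely comfortable. Same raw number, opposite risk stories.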
Step 1 — Build the Domain-Specific Features First
The scenario: Your manager looks over your shoulder and says: "Before you run any correlation — build the credit utilisation rate. That's the first number any credit analyst looks at. If someone is using 95% of their credit limit, they're in trouble regardless of what their income is. Then build the income-to-limit ratio — a £2,500 limit on a £90k income tells a very different story from a £2,500 limit on a £28k income."
# === DOMAIN FEATURE 1: Credit Utilisation Rate ===
# The single most important credit risk metric.
# How much of their available credit are they actually using?
# Above 80% = high risk. Above 95% = very high risk.
df['utilisation_rate'] = (df['current_balance'] / df['credit_limit'] * 100).round(1)
# === DOMAIN FEATURE 2: Income-to-Limit Ratio ===
# Does the credit limit make sense relative to their income?
# A low limit on a high income = bank is cautious. High limit on low income = potential overextension.
df['income_to_limit'] = (df['annual_income'] / df['credit_limit']).round(2)
# === DOMAIN FEATURE 3: Payment Reliability Score ===
# Missed payments per month of tenure — normalises for how long they've been a customer
# Someone who missed 2 payments in 3 months is more worrying than someone who missed 2 in 10 years
df['miss_rate_monthly'] = (df['missed_payments'] / df['months_as_customer']).round(3)
# === DOMAIN FEATURE 4: New Customer Flag ===
# Brand-new customers (under 12 months) have no track record — higher uncertainty
df['is_new_customer'] = (df['months_as_customer'] < 12).astype(int)
print(df[['customer_id','utilisation_rate','income_to_limit',
'miss_rate_monthly','is_new_customer','defaulted']].to_string(index=False))
customer_id utilisation_rate income_to_limit miss_rate_monthly is_new_customer defaulted
1001 95.0 10.67 0.333 1 0
1002 17.5 6.50 0.000 0 0
1003 96.0 9.00 0.125 0 1
1004 6.7 5.28 0.000 0 0
1005 98.8 7.63 0.111 0 1
1006 20.0 5.60 0.000 0 0
1007 96.0 11.20 0.667 1 0
1008 4.5 6.45 0.000 0 0
1009 97.1 7.43 0.278 0 1
1010 5.0 5.50 0.000 0 0
1011 98.2 11.07 0.125 1 0
1012 42.0 6.70 0.042 0 1
1013 6.5 5.35 0.000 0 0
1014 95.6 9.33 0.250 0 1
What just happened?
pandas column arithmetic builds all four features in single lines. Division, multiplication, and comparison all broadcast across the entire column automatically. .astype(int) converts the boolean is_new_customer comparison to 1/0.
Scan the utilisation_rate column now. Every customer who defaulted — 1003, 1005, 1009, 1012, 1014 — has a utilisation rate of 42% or above, and four of the five sit above 95%. The separation is not perfect, though: customers 1001, 1007 and 1011 also run above 95% without defaulting, and all three are near-new customers with under a year of history. That is the domain insight in action: utilisation tells you who is financially stretched, while tenure tells you whether there is enough behaviour on file to judge them. Reading the engineered columns together, the way a credit analyst would, is exactly what generic EDA misses.
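A quick way to sanity-check this is to group utilisation by the target. A minimal sketch, rebuilding just the two columns needed (values copied from the portfolio above):

```python
import pandas as pd

# Utilisation by default status — values copied from the dataset in this lesson
df = pd.DataFrame({
    'utilisation_rate': [95.0, 17.5, 96.0, 6.7, 98.8, 20.0, 96.0,
                         4.5, 97.1, 5.0, 98.2, 42.0, 6.5, 95.6],
    'defaulted': [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1],
})
print(df.groupby('defaulted')['utilisation_rate'].agg(['min', 'mean', 'max']))
```

Defaulters bottom out at 42%, but the non-defaulter group reaches 98.2%, so there is no clean single cut-off; the overlap comes entirely from the three new customers.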
Step 2 — Apply Industry Thresholds, Not Statistical Ones
The scenario: Your manager explains the industry rules: "In credit risk, there are hard thresholds we use regardless of what the statistics say. Utilisation above 80% is always flagged. Three or more missed payments is a serious warning. Less than 6 months' tenure means we have no real behavioural data. These aren't suggestions — they're the rules the regulators expect us to apply." You build a risk flag system based on these thresholds.
# Industry risk thresholds — these come from domain expertise, not statistics
HIGH_UTILISATION = 80 # % — industry standard danger zone
MISSED_PMT_LIMIT = 3 # this many or more is a serious behavioural red flag
MIN_TENURE_MONTHS = 6 # less than this = insufficient track record
# Apply each threshold as a binary flag
df['flag_high_util'] = (df['utilisation_rate'] > HIGH_UTILISATION).astype(int)
df['flag_missed_pmts'] = (df['missed_payments'] >= MISSED_PMT_LIMIT).astype(int)
df['flag_new_customer'] = (df['months_as_customer'] < MIN_TENURE_MONTHS).astype(int)
# Total risk flags per customer — 0 = clean, 1 = watch, 2+ = escalate
df['total_flags'] = df[['flag_high_util','flag_missed_pmts','flag_new_customer']].sum(axis=1)
# Assign a risk tier
def risk_tier(flags):
if flags == 0: return 'Low Risk'
elif flags == 1: return 'Medium Risk'
else: return 'High Risk'
df['risk_tier'] = df['total_flags'].apply(risk_tier)
print(df[['customer_id','utilisation_rate','missed_payments','months_as_customer',
'total_flags','risk_tier','defaulted']].to_string(index=False))
customer_id utilisation_rate missed_payments months_as_customer total_flags risk_tier defaulted
1001 95.0 2 6 1 Medium Risk 0
1002 17.5 0 84 0 Low Risk 0
1003 96.0 3 24 2 High Risk 1
1004 6.7 0 120 0 Low Risk 0
1005 98.8 4 36 2 High Risk 1
1006 20.0 0 72 0 Low Risk 0
1007 96.0 2 3 2 High Risk 0
1008 4.5 0 60 0 Low Risk 0
1009 97.1 5 18 2 High Risk 1
1010 5.0 0 96 0 Low Risk 0
1011 98.2 1 8 1 Medium Risk 0
1012 42.0 2 48 0 Low Risk 1
1013 6.5 0 108 0 Low Risk 0
1014 95.6 3 12 2 High Risk 1
What just happened?
pandas' boolean comparisons create binary flag columns. Summing across three flag columns with .sum(axis=1) gives a total risk score per row. .apply(risk_tier) maps the score to a label.
The rule-based system flags four of the five defaulters as High Risk. Customer 1007 (High Risk, didn't default) is the one High Risk false positive: a brand-new customer with 96% utilisation but only 2 missed payments, exactly the kind of borderline case you'd escalate for human review. Customers 1001 and 1011 land in Medium Risk on utilisation alone. The miss is customer 1012: zero flags, 42% utilisation, 2 missed payments across four years of tenure, and they still defaulted. The domain rules catch most cases but not all, which is why we also build a statistical model.
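To see the rules-versus-outcomes picture in one table, recompute the flags from the raw inputs and crosstab tier against default. A minimal sketch (values copied from the portfolio above; the tier logic mirrors the thresholds in this step):

```python
import pandas as pd

# Rules vs outcomes — inputs copied from the dataset in this lesson
df = pd.DataFrame({
    'utilisation_rate':   [95.0, 17.5, 96.0, 6.7, 98.8, 20.0, 96.0,
                           4.5, 97.1, 5.0, 98.2, 42.0, 6.5, 95.6],
    'missed_payments':    [2, 0, 3, 0, 4, 0, 2, 0, 5, 0, 1, 2, 0, 3],
    'months_as_customer': [6, 84, 24, 120, 36, 72, 3, 60, 18, 96, 8, 48, 108, 12],
    'defaulted':          [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1],
})

# Apply the same three industry thresholds, then sum the flags per customer
flags = ((df['utilisation_rate'] > 80).astype(int)
         + (df['missed_payments'] >= 3).astype(int)
         + (df['months_as_customer'] < 6).astype(int))
tier = flags.map(lambda f: 'Low Risk' if f == 0
                 else ('Medium Risk' if f == 1 else 'High Risk'))

print(pd.crosstab(tier, df['defaulted']))
```

The crosstab is a tiny confusion matrix: four of the five defaulters land in High Risk, one non-defaulter is flagged High, and one defaulter slips through with zero flags.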
Step 3 — Compare Domain Features vs Raw Features
The scenario: Your manager wants proof that the domain features are better than the raw columns. "Show me the correlation table," she says. "I want to see that utilisation_rate predicts default better than current_balance alone. Because if it doesn't, I've been teaching people the wrong thing for 15 years." You run the comparison and bring the evidence.
from scipy import stats
# Head-to-head comparison: raw columns vs domain-engineered features
print("=== RAW COLUMNS vs DOMAIN FEATURES ===")
print("Correlation with 'defaulted' (target)\n")
comparisons = [
# (raw_col, domain_col, explanation)
('current_balance', 'utilisation_rate',
'Balance alone vs balance as % of limit'),
('missed_payments', 'miss_rate_monthly',
'Raw miss count vs misses per month of tenure'),
('months_as_customer', 'is_new_customer',
'Tenure in months vs simple new/existing flag'),
]
for raw_col, domain_col, note in comparisons:
r_raw, _ = stats.pearsonr(df[raw_col], df['defaulted'])
r_domain, _ = stats.pearsonr(df[domain_col], df['defaulted'])
winner = domain_col if abs(r_domain) > abs(r_raw) else raw_col
print(f" {note}")
print(f" Raw: {raw_col:<22} r = {r_raw:+.3f}")
print(f" Domain: {domain_col:<22} r = {r_domain:+.3f}")
print(f" Winner: {winner}\n")
=== RAW COLUMNS vs DOMAIN FEATURES ===
Correlation with 'defaulted' (target)
Balance alone vs balance as % of limit
Raw: current_balance r = +0.842
Domain: utilisation_rate r = +0.537
Winner: current_balance
Raw miss count vs misses per month of tenure
Raw: missed_payments r = +0.834
Domain: miss_rate_monthly r = +0.094
Winner: missed_payments
Tenure in months vs simple new/existing flag
Raw: months_as_customer r = -0.421
Domain: is_new_customer r = -0.389
Winner: months_as_customer
What just happened?
scipy's stats.pearsonr() runs the head-to-head comparison — and this time the data pushes back on the manager's intuition everywhere.
In this 14-customer sample, every raw column beats its domain-engineered counterpart. current_balance (+0.842) outperforms utilisation_rate (+0.537) because the three high-utilisation customers who didn't default all have small limits, so their raw balances stay low; in a sample this size, the limits the bank already set happen to encode the risk. missed_payments dominates miss_rate_monthly, because normalising by tenure hands the highest miss rates to two short-tenure customers who never defaulted. And tenure turns out non-monotonic: defaulters cluster at mid tenure while both the long-tenured and the brand-new customers have yet to default, so the continuous months_as_customer still edges out the binary is_new_customer cut. Does 14 rows overturn 15 years of credit experience? No — a toy sample cannot overturn a finding the industry has validated across millions of accounts. But this is what honest domain-driven EDA looks like: bring the domain expertise, validate it with data, report what the data actually says, and flag the sample-size caveat rather than quietly keeping the numbers that flatter the theory.
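With only 14 rows, any "winner" should also be read alongside the p-values that stats.pearsonr already returns. A minimal sketch for the first head-to-head (columns copied from the dataset above):

```python
import numpy as np
from scipy import stats

# Values copied from the portfolio in this lesson
defaulted = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1])
utilisation = np.array([95.0, 17.5, 96.0, 6.7, 98.8, 20.0, 96.0,
                        4.5, 97.1, 5.0, 98.2, 42.0, 6.5, 95.6])
balance = np.array([2850, 2100, 4800, 1200, 7900, 3000, 2400,
                    500, 6800, 800, 2750, 4200, 1100, 4300])

# pearsonr returns (r, p) — the p-value is the sample-size reality check
for name, x in [('utilisation_rate', utilisation), ('current_balance', balance)]:
    r, p = stats.pearsonr(x, defaulted)
    print(f"{name:<18} r = {r:+.3f}  p = {p:.4f}")
```

With n = 14, the utilisation correlation is only borderline significant, which is one more reason to treat a head-to-head "loss" on a toy sample as a caveat to record, not a verdict.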
Step 4 — The Domain-Specific Risk Profile Report
The scenario: The risk committee wants a pre-meeting briefing — a profile of the portfolio's risk distribution based on the industry thresholds, not p-values. They don't speak statistics. They speak credit risk. You need to translate your analysis into the language they use every day: how many customers are in each risk tier, what's the average utilisation by tier, and what's the actual default rate in each bucket?
# Portfolio risk summary — in language the risk committee understands
risk_summary = df.groupby('risk_tier').agg(
customers = ('customer_id', 'count'),
avg_utilisation = ('utilisation_rate', 'mean'),
avg_missed_pmts = ('missed_payments', 'mean'),
total_defaults = ('defaulted', 'sum'),
default_rate_pct = ('defaulted', 'mean')
).round(2)
# Convert default_rate_pct to a percentage for readability
risk_summary['default_rate_pct'] = (risk_summary['default_rate_pct'] * 100).round(1)
# Sort by risk level: Low → Medium → High
order = ['Low Risk', 'Medium Risk', 'High Risk']
risk_summary = risk_summary.reindex(order)
print("=== PORTFOLIO RISK PROFILE ===\n")
print(risk_summary.to_string())
print()
# Plain-English summary for the committee
print("Key findings for the risk committee:\n")
for tier in order:
row = risk_summary.loc[tier]
print(f" {tier}: {int(row['customers'])} customers "
f"| Avg utilisation: {row['avg_utilisation']:.0f}% "
f"| Default rate: {row['default_rate_pct']:.0f}%")
=== PORTFOLIO RISK PROFILE ===
customers avg_utilisation avg_missed_pmts total_defaults default_rate_pct
risk_tier
Low Risk 7 14.6 0.29 1 14.0
Medium Risk 2 96.6 1.50 0 0.0
High Risk 5 96.7 3.40 4 80.0
Key findings for the risk committee:
Low Risk: 7 customers | Avg utilisation: 15% | Default rate: 14%
Medium Risk: 2 customers | Avg utilisation: 97% | Default rate: 0%
High Risk: 5 customers | Avg utilisation: 97% | Default rate: 80%
What just happened?
pandas' .groupby('risk_tier').agg() produces a clean summary grouped by the domain-defined risk tier. .reindex(order) sorts the rows into the natural Low → Medium → High order instead of alphabetical.
The story for the committee sits at the extremes: the High Risk tier carries an 80% default rate at 97% average utilisation, against 14% in Low Risk. The Medium tier holds two high-utilisation new customers who haven't defaulted yet; that's the watch list. The single Low Risk default is customer 1012, whose 42% utilisation never tripped a flag. This is a business-ready output a risk committee can act on immediately, no statistics degree required — and that one miss is the standing argument for pairing the domain thresholds with a statistical model.
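An alternative to .reindex(order) worth knowing: store the tier as an ordered categorical, so every groupby, sort and plot respects the Low → Medium → High order automatically. A minimal sketch with standalone labels:

```python
import pandas as pd

# Ordered categorical: the tier order becomes part of the dtype itself
order = ['Low Risk', 'Medium Risk', 'High Risk']
tiers = pd.Series(['High Risk', 'Low Risk', 'Medium Risk', 'Low Risk'])
tiers = pd.Series(pd.Categorical(tiers, categories=order, ordered=True))

counts = tiers.value_counts().sort_index()
print(counts)  # rows come out Low -> Medium -> High, not alphabetical
```

This is the more durable fix when the tier column feeds several downstream tables: you declare the business ordering once instead of reindexing after every aggregation.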
Teacher's Note
Domain knowledge is a starting point, not a conclusion. In Step 3, the data pushed back hard: in our tiny sample the raw columns outperformed every engineered feature, and raw missed_payments beat the normalised miss rate by a wide margin. A bad analyst would bury that result and stick with the domain features out of stubbornness. A good analyst reports it, names the 14-row sample as the likely reason, and retests on the full portfolio before deciding what to keep. The manager's 15 years of experience told us where to look — but the data gets the final vote.
The fastest way to build domain knowledge in a new industry: spend the first week asking the subject matter experts "what do you look at when something goes wrong?" Their answer tells you which metrics matter, which thresholds have meaning, and which patterns are noise versus signal. That conversation is worth more than any textbook.
Practice Questions
1. In credit risk analysis, what is the name of the metric that measures how much of a customer's credit limit they are currently using — expressed as a percentage?
2. After a groupby, the risk tiers appear in alphabetical order (High, Low, Medium). Which pandas method lets you reorder the rows into a custom order (Low, Medium, High)?
3. You engineer a domain feature based on industry expertise, but the correlation analysis shows the raw column is slightly stronger. What should you do?
Quiz
1. A generic EDA and a domain-driven EDA are run on the same credit dataset. What specific advantage does domain knowledge provide?
2. A customer has a £7,900 balance on an £8,000 credit limit. Another has a £7,900 balance on a £50,000 limit. Which feature correctly captures their very different risk levels?
3. You are a data analyst starting in a new industry you know nothing about. What is the fastest way to build the domain knowledge needed for domain-driven EDA?
Up Next · Lesson 35
Documenting Findings
The analysis you can't communicate is the analysis that doesn't get used. Learn to write EDA findings that are clear, defensible, and actually get read.