DS Case Studies
Analysing Credit Risk
A loan approval is a bet. The bank bets that the borrower will repay. Credit risk analysis is the discipline of estimating how good that bet is — finding which borrower profiles concentrate default risk so the bank can price loans correctly, set appropriate limits, and avoid the catastrophic concentration of bad debt that has brought down financial institutions throughout history.
You are a data analyst in the risk division at CrestBank, a regional retail bank. The credit committee meets quarterly to review the loan book and has flagged a concern: the default rate on personal loans has risen from 8% to 14% over the past two quarters. The chief risk officer needs to understand which applicant segments are driving this increase — specifically whether it is concentrated in a particular income band, employment type, or loan-to-value bracket — before the committee meeting on Friday.
What This Case Study Covers
Credit risk analysis is structurally identical to return rate analysis (CS14) and fraud detection (CS11) — you have a binary outcome (default vs no default) and need to find which categorical and continuous variables predict it. The difference is the domain: the financial variables (DTI ratio, loan-to-value, credit score) have specific interpretations that drive real lending policy decisions.
This case study introduces two new domain-specific derived metrics: Debt-to-Income (DTI) ratio — monthly debt obligations divided by gross monthly income, the primary affordability measure used in lending decisions — and Loan-to-Value (LTV) ratio — loan amount divided by collateral value, measuring how much of the asset is financed by debt. Both are computed as derived columns before any groupby analysis.
The Credit Risk Toolkit
DTI Ratio — The Affordability Test
Debt-to-Income = total monthly debt payments ÷ gross monthly income. A DTI above 43% is typically where lending risk rises sharply — the borrower has less than 57 cents of every dollar left for living expenses and unexpected costs after debt service. Most banks have hard DTI limits between 40% and 50%.
Loan-to-Value (LTV) Ratio
LTV = loan amount ÷ collateral value. A mortgage with 20% down payment has an 80% LTV. High LTV means the borrower has little equity — if they default, the bank may not recover the full loan from selling the collateral. LTV above 90% is considered high risk in most lending frameworks.
Default Rate by Segment
Group by income band, employment type, and DTI tier and compute the default rate within each group. The pattern across segments tells the chief risk officer exactly which type of borrower the bank has been approving that it should not have — or should have charged a higher rate to compensate for the risk.
Pivot Table — Default Rate by Income × Employment
The two-dimensional view shows whether high default rates are universal across employment types or concentrated in specific income-employment combinations. A self-employed borrower in a high income band may be safer than an employed borrower in a low income band — the pivot table reveals these interactions.
Credit Score Correlation and Band Analysis
Credit score is a composite external risk indicator. Correlating it with default status confirms whether the bank's current credit score thresholds are set correctly, and grouping by score band reveals at which score range the default rate inflects — the natural place to draw the approval cutoff.
Dataset Overview
CrestBank's loan portfolio sample contains 20 applications covering income, employment, loan amount, collateral value, credit score, and default outcome. Built with pd.DataFrame().
| app_id | income_annual | employment | loan_amount | collateral_value | monthly_debt | credit_score | defaulted |
|---|---|---|---|---|---|---|---|
| A001 | $62,000 | Employed | $18,000 | $28,000 | $820 | 718 | 0 |
| A002 | $34,000 | Self-Employed | $22,000 | $24,000 | $1,240 | 634 | 1 |
| A003 | $88,000 | Employed | $45,000 | $72,000 | $1,480 | 762 | 0 |
| A004 | $41,000 | Contract | $28,000 | $30,000 | $1,620 | 601 | 1 |
| A005 | $75,000 | Employed | $32,000 | $55,000 | $1,100 | 741 | 0 |
Showing first 5 of 20 rows · 8 columns
app_id — Unique application reference. Used for labelling high-risk profiles.
income_annual — Gross annual income. Divided by 12 to compute monthly income for DTI. Used to define income bands with pd.cut().
employment — Employed, Self-Employed, or Contract. Employment stability affects default risk — contract and self-employed income is more variable.
loan_amount — Total loan requested. Divided by collateral_value to compute the LTV ratio.
collateral_value — Estimated value of collateral backing the loan. The denominator for the LTV calculation.
monthly_debt — Total existing monthly debt obligations. Divided by monthly income to compute the DTI ratio.
credit_score — External credit bureau score. Higher = lower default risk. Used to define credit tiers and compute correlation with default.
defaulted — 1 = loan defaulted, 0 = performing. The binary outcome variable used throughout all groupby and correlation analyses.
Business Questions
The chief risk officer needs these five answers before the credit committee meeting on Friday.
What are the DTI and LTV ratios across the portfolio — and which applicants breach the standard risk thresholds?
Which employment type and income band have the highest default rates?
Using the pivot table, which income-employment combinations carry the most concentrated default risk?
At which credit score band does the default rate inflect — and what is the correlation between credit score and default?
What does the combined risk profile of defaulted borrowers look like — average DTI, LTV, and credit score versus performing borrowers?
Step-by-Step Analysis
The scenario:
The loan portfolio extract landed this morning. The credit committee meets at 2pm Friday. You have three days to find the risk concentration and build the segment-level summary the chief risk officer needs. Start by computing the derived risk ratios — the raw columns alone tell you almost nothing.
Every credit risk analysis begins with computing the derived ratios that the raw columns do not provide. DTI and LTV convert raw financial numbers into industry-standard risk metrics that can be benchmarked against policy thresholds.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"app_id": ["A001","A002","A003","A004","A005","A006","A007","A008",
"A009","A010","A011","A012","A013","A014","A015","A016",
"A017","A018","A019","A020"],
"income_annual": [62000,34000,88000,41000,75000,29000,110000,52000,
38000,68000,45000,92000,31000,58000,84000,36000,
72000,48000,27000,95000],
"employment": ["Employed","Self-Employed","Employed","Contract","Employed",
"Self-Employed","Employed","Contract","Self-Employed","Employed",
"Contract","Employed","Self-Employed","Employed","Employed",
"Contract","Employed","Self-Employed","Self-Employed","Employed"],
"loan_amount": [18000,22000,45000,28000,32000,18500,55000,31000,
24000,28000,29000,48000,20000,22000,42000,21000,
35000,26000,17000,50000],
"collateral_value": [28000,24000,72000,30000,55000,20000,90000,34000,
25000,48000,32000,78000,21000,35000,68000,22000,
58000,28000,18000,85000],
"monthly_debt": [820,1240,1480,1620,1100,1380,1920,1740,
1560,980,1680,1540,1420,860,1260,1580,
1140,1480,1320,1640],
"credit_score": [718,634,762,601,741,588,798,622,
612,734,598,771,574,722,756,595,
728,641,562,784],
"defaulted": [0,1,0,1,0,1,0,1,
1,0,1,0,1,0,0,1,
0,1,1,0]
})
print(f"Portfolio: {len(df)} loans | Defaults: {df['defaulted'].sum()} | Default rate: {df['defaulted'].mean():.1%}")
# Derived metrics
df["monthly_income"] = df["income_annual"] / 12
df["dti"] = (df["monthly_debt"] / df["monthly_income"] * 100).round(1)
df["ltv"] = (df["loan_amount"] / df["collateral_value"] * 100).round(1)
# Risk flags
DTI_THRESHOLD = 43.0 # % — standard affordability threshold
LTV_THRESHOLD = 90.0 # % — high LTV threshold
df["high_dti"] = (df["dti"] > DTI_THRESHOLD).astype(int)
df["high_ltv"] = (df["ltv"] > LTV_THRESHOLD).astype(int)
print("\nDTI and LTV distribution:")
print(df[["dti","ltv"]].describe().round(1))
print("\nApplications breaching DTI > 43%:", df["high_dti"].sum())
print("Applications breaching LTV > 90%:", df["high_ltv"].sum())
# Show the derived risk profile
print("\nFull risk profile:")
print(df[["app_id","income_annual","employment","dti","ltv",
"credit_score","defaulted"]].to_string(index=False))
Portfolio: 20 loans | Defaults: 10 | Default rate: 50.0%
DTI and LTV distribution:
dti ltv
count 20.0 20.0
mean 33.7 77.2
std 16.2 16.7
min 15.9 58.2
25% 18.8 61.4
50% 29.0 77.4
75% 47.9 93.0
max 58.7 96.0
Applications breaching DTI > 43%: 8
Applications breaching LTV > 90%: 10
Full risk profile:
app_id income_annual employment dti ltv credit_score defaulted
A001 62000 Employed 15.9 64.3 718 0
A002 34000 Self-Employed 43.8 91.7 634 1
A003 88000 Employed 20.2 62.5 762 0
A004 41000 Contract 47.4 93.3 601 1
A005 75000 Employed 17.6 58.2 741 0
A006 29000 Self-Employed 57.1 92.5 588 1
A007 110000 Employed 20.9 61.1 798 0
A008 52000 Contract 40.2 91.2 622 1
A009 38000 Self-Employed 49.3 96.0 612 1
A010 68000 Employed 17.3 58.3 734 0
A011 45000 Contract 44.8 90.6 598 1
A012 92000 Employed 20.1 61.5 771 0
A013 31000 Self-Employed 55.0 95.2 574 1
A014 58000 Employed 17.8 62.9 722 0
A015 84000 Employed 18.0 61.8 756 0
A016 36000 Contract 52.7 95.5 595 1
A017 72000 Employed 19.0 60.3 728 0
A018 48000 Self-Employed 37.0 92.9 641 1
A019 27000 Self-Employed 58.7 94.4 562 1
A020 95000 Employed 20.7 58.8 784 0
What just happened?
Method — domain-specific derived ratios as first step
Before any groupby or correlation, we compute the two industry-standard risk ratios from raw columns. dti = monthly_debt / monthly_income * 100 requires first deriving monthly_income = income_annual / 12. This chain of derivations — column from column from column — is the standard pattern for financial analytics where the meaningful variables are ratios, not raw values. The raw columns (annual income, monthly debt) are inputs; the ratios (DTI, LTV) are the features that actually predict default.
The portfolio has a mean DTI of 33.7%; 8 of 20 applications breach the 43% DTI threshold and 10 of 20 breach the 90% LTV threshold. The pattern in the full table is immediately striking: every defaulted borrower (A002, A004, A006, A008, A009, A011, A013, A016, A018, A019) has a DTI of 37% or higher, an LTV above 90%, and a credit score below 650. Every performing borrower has a DTI below 21%, an LTV below 65%, and a credit score of 718 or above. The separation is almost perfect even before groupby analysis.
We use pd.cut() to create income bands and then group independently by employment type and income band to find which segments concentrate the most default risk.
# Income bands using pd.cut()
df["income_band"] = pd.cut(
df["income_annual"],
bins = [0, 40000, 65000, 90000, float("inf")],
labels = ["Low (<40k)","Mid (40–65k)","High (65–90k)","Premium (90k+)"]
)
# DTI tiers
df["dti_tier"] = pd.cut(
df["dti"],
bins = [0, 36, 43, 55, float("inf")],
labels = ["Low","Moderate","High","Very High"]
)
# Default rate by employment type
emp_default = df.groupby("employment").agg(
applications = ("app_id", "count"),
defaults = ("defaulted", "sum"),
default_rate = ("defaulted", "mean"),
mean_dti = ("dti", "mean"),
mean_score = ("credit_score","mean")
).round(2).reset_index()
emp_default["rate_pct"] = (emp_default["default_rate"] * 100).round(0)
emp_default = emp_default.sort_values("default_rate", ascending=False)
print("Default rate by employment type:")
print(emp_default[["employment","applications","defaults",
"rate_pct","mean_dti","mean_score"]].to_string(index=False))
# Default rate by income band
inc_default = df.groupby("income_band", observed=True).agg(
applications = ("app_id", "count"),
defaults = ("defaulted", "sum"),
default_rate = ("defaulted", "mean"),
mean_dti = ("dti", "mean")
).round(2).reset_index()
inc_default["rate_pct"] = (inc_default["default_rate"] * 100).round(0)
print("\nDefault rate by income band:")
print(inc_default[["income_band","applications","defaults",
"rate_pct","mean_dti"]].to_string(index=False))
Default rate by employment type:
employment applications defaults rate_pct mean_dti mean_score
Self-Employed 6 6 100.0 50.2 601.8
Contract 4 4 100.0 46.3 604.0
Employed 10 0 0.0 18.8 751.4
Default rate by income band:
income_band applications defaults rate_pct mean_dti
Low (<40k) 6 6 100.0 52.8
Mid (40–65k) 6 4 67.0 33.9
High (65–90k) 5 0 0.0 18.4
Premium (90k+) 3 0 0.0 20.6
What just happened?
Method — pd.cut() for income banding
We use pd.cut() with explicit dollar thresholds to create four income bands. The choice of thresholds — $40k, $65k, $90k — reflects CrestBank's internal lending tiers rather than equal-width bins. This is the correct approach when the bin boundaries have business meaning. observed=True in the subsequent groupby() suppresses empty tier rows, the same pattern from earlier case studies involving categorical groupby keys.
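The observed=True behaviour mentioned above can be seen in a tiny standalone sketch (toy scores, not the CrestBank data — the band with no members simply disappears from the grouped output):

```python
import pandas as pd

# Toy scores: nothing falls in the 650-699 band, mirroring the CrestBank portfolio
scores = pd.Series([574, 612, 641, 718, 756])
bands = pd.cut(scores, bins=[0, 599, 649, 699, 749, 850],
               labels=["<600", "600-649", "650-699", "700-749", "750+"])

# observed=False keeps the empty 650-699 band as a zero-count row;
# observed=True drops it from the result entirely
print(scores.groupby(bands, observed=False).count())
print(scores.groupby(bands, observed=True).count())
```

With observed=False the empty band shows a count of 0, which is sometimes what a committee report needs — the right choice depends on whether an empty segment is itself meaningful.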
The results are stark and actionable. Self-Employed and Contract borrowers have 100% default rates — every single one defaulted. Employed borrowers have a 0% default rate. The income band analysis mirrors this: Low income borrowers default 100% of the time, Mid income 67%, while High and Premium income borrowers have zero defaults. The bank has been approving Self-Employed and Contract borrowers in the low and mid income bands — a combination the data shows is uniformly catastrophic.
The pivot table shows default rates at the intersection of employment type and income band. Then we analyse credit score bands to find the natural approval threshold — the score level where default risk inflects sharply.
# Pivot: default rate by employment × income band
pivot_default = pd.pivot_table(
df,
index = "employment",
columns = "income_band",
values = "defaulted",
aggfunc = "mean",
fill_value = 0
).round(2)
print("Default rate pivot (employment × income band):")
print((pivot_default * 100).round(0).to_string())
# Credit score band analysis
df["score_band"] = pd.cut(
df["credit_score"],
bins = [0, 599, 649, 699, 749, 850],
labels = ["<600","600–649","650–699","700–749","750+"]
)
score_default = df.groupby("score_band", observed=True).agg(
count = ("app_id", "count"),
defaults = ("defaulted", "sum"),
default_rate = ("defaulted", "mean"),
mean_dti = ("dti", "mean")
).round(2).reset_index()
score_default["rate_pct"] = (score_default["default_rate"] * 100).round(0)
print("\nDefault rate by credit score band:")
print(score_default[["score_band","count","defaults",
"rate_pct","mean_dti"]].to_string(index=False))
# Credit score correlation with default
score_corr = df["credit_score"].corr(df["defaulted"]).round(3)
dti_corr = df["dti"].corr(df["defaulted"]).round(3)
ltv_corr = df["ltv"].corr(df["defaulted"]).round(3)
print(f"\nCorrelations with default:")
print(f" credit_score: {score_corr:+.3f}")
print(f" dti: {dti_corr:+.3f}")
print(f" ltv: {ltv_corr:+.3f}")
Default rate pivot (employment × income band):
income_band Low (<40k) Mid (40–65k) High (65–90k) Premium (90k+)
employment
Contract 100.0 100.0 0.0 0.0
Employed 0.0 0.0 0.0 0.0
Self-Employed 100.0 100.0 0.0 0.0
Default rate by credit score band:
score_band count defaults rate_pct mean_dti
<600 5 5 100.0 53.7
600–649 5 5 100.0 43.5
700–749 5 0 0.0 17.5
750+ 5 0 0.0 20.0
Correlations with default:
credit_score: -0.948
dti: +0.948
ltv: +0.993
What just happened?
Method — pivot table on default rate across two categorical dimensions
Using aggfunc="mean" on the binary defaulted column produces the default rate for each employment-income combination — the same pattern from CS11 (fraud rate by merchant × country) and CS14 (return rate by supplier × category). The binary-mean-as-rate pattern is one of the most reusable techniques across all risk analytics domains. One caveat: fill_value=0 renders combinations with no applicants as a 0% default rate — cross-check the rate pivot against a count pivot before reading an empty cell as safe.
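The binary-mean-as-rate pattern, and the fill_value behaviour, fit in a few lines — a hypothetical mini-portfolio, not the CrestBank extract:

```python
import pandas as pd

# Hypothetical mini-portfolio: the mean of a 0/1 column is the event rate
toy = pd.DataFrame({
    "employment": ["Contract", "Contract", "Employed", "Employed"],
    "band":       ["Low",      "Low",      "Mid",      "Mid"],
    "defaulted":  [1, 0, 0, 0],
})
rate = pd.pivot_table(toy, index="employment", columns="band",
                      values="defaulted", aggfunc="mean", fill_value=0)
n = pd.pivot_table(toy, index="employment", columns="band",
                   values="defaulted", aggfunc="count", fill_value=0)
print(rate)  # Contract/Low = 0.5; empty combinations show as 0
print(n)     # the count pivot exposes which cells are actually empty
```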
All three risk metrics correlate strongly with default — LTV at r = +0.993, credit score at r = −0.948, and DTI at r = +0.948 — reflecting how cleanly this small portfolio separates. The score band analysis reveals a clean breakpoint: every borrower with a score below 700 defaulted (the highest defaulter score is 641, and no applicant scored between 650 and 699), and every borrower at 700 or above performed. The bank's current approval threshold — whatever it is — needs to be raised to 700 minimum. The pivot table adds nuance: Contract and Self-Employed borrowers in the Mid income band default 100% of the time, suggesting the employment-type restriction should apply even to higher-income applicants who are not permanently employed.
The chief risk officer needs a concise before-and-after profile — what the typical defaulted borrower looks like versus the performing borrower — to present to the credit committee as the justification for tightening lending criteria.
# Full profile comparison: defaulted vs performing
profile = df.groupby("defaulted").agg(
count = ("app_id", "count"),
mean_income = ("income_annual", "mean"),
mean_dti = ("dti", "mean"),
mean_ltv = ("ltv", "mean"),
mean_score = ("credit_score", "mean"),
pct_employed = ("employment", lambda x: (x == "Employed").mean() * 100)
).round(1)
profile.index = profile.index.map({0:"Performing", 1:"Defaulted"})
print("Borrower profile: Performing vs Defaulted:")
print(profile.to_string())
# Identify the single riskiest applicant profile
# High DTI + low score + non-employed
df["risk_score"] = (
(df["dti"] > DTI_THRESHOLD).astype(int) +
(df["credit_score"] < 650 ).astype(int) +
(df["employment"] != "Employed" ).astype(int) +
(df["ltv"] > LTV_THRESHOLD ).astype(int)
)
print("\nComposite risk score distribution:")
print(df.groupby("risk_score")["defaulted"].agg(
count="count", defaults="sum",
default_rate="mean"
).round(2).to_string())
# Proposed policy recommendation summary
print("\nProposed credit policy tightening:")
approved_new = df[
(df["credit_score"] >= 700) &
(df["employment"] == "Employed") &
(df["dti"] <= 43)
]
print(f" Applicants passing new criteria: {len(approved_new)} of {len(df)}")
print(f" Default rate among approved: {approved_new['defaulted'].mean():.1%}")
print(f" Defaults avoided vs current: {df['defaulted'].sum() - approved_new['defaulted'].sum()}")
Borrower profile: Performing vs Defaulted:
count mean_income mean_dti mean_ltv mean_score pct_employed
Performing 10 80400.0 18.8 61.0 751.4 100.0
Defaulted 10 38100.0 48.6 93.3 602.7 0.0
Composite risk score distribution:
count defaults default_rate
risk_score
0 10 0 0.00
3 2 2 1.00
4 8 8 1.00
Proposed credit policy tightening:
Applicants passing new criteria: 10 of 20
Default rate among approved: 0.0%
Defaults avoided vs current: 10
What just happened?
Method — lambda for % of a category inside .agg()
The line pct_employed = ("employment", lambda x: (x == "Employed").mean() * 100) computes the percentage of Employed borrowers within each default group using a lambda inside named .agg(). This is the same pattern from CS11 where we counted escalation flags per department — a boolean comparison followed by .mean() gives the proportion, which we scale to a percentage. It is the cleanest way to get a categorical proportion in a multi-metric aggregation call.
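The same lambda-inside-agg idiom in isolation — a hypothetical two-group example:

```python
import pandas as pd

# Hypothetical data: share of Employed borrowers within each outcome group
toy = pd.DataFrame({
    "outcome":    ["defaulted", "defaulted", "performing", "performing"],
    "employment": ["Contract",  "Employed",  "Employed",   "Employed"],
})
out = toy.groupby("outcome").agg(
    # boolean comparison -> .mean() gives the proportion, scaled to a percentage
    pct_employed=("employment", lambda x: (x == "Employed").mean() * 100),
)
print(out)  # defaulted -> 50.0, performing -> 100.0
```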
The profiles could not be more different. Performing borrowers average $80,400 income, 18.8% DTI, 61.0% LTV, and credit score 751 — all permanently employed. Defaulted borrowers average $38,100 income, 48.6% DTI, 93.3% LTV, and credit score 603 — zero permanently employed. The proposed policy (credit score ≥ 700, Employed only, DTI ≤ 43%) would have approved 10 of 20 applicants with a 0% default rate — eliminating all 10 defaults while still approving half the portfolio. The credit committee has a clear, defensible policy recommendation backed by the data.
Checkpoint: Test a relaxed version of the policy — credit_score >= 680 instead of 700, allowing Contract employment. How many additional applicants would be approved and what is the resulting default rate? This sensitivity analysis is exactly what the credit committee will ask for — the trade-off between approval volume and default rate at different threshold settings.
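One way to run that sensitivity check is to wrap the policy filter in a function and sweep the thresholds. The sketch below uses a small hypothetical portfolio so it stands alone — in practice you would pass the full df built earlier:

```python
import pandas as pd

def policy_outcome(portfolio, min_score, employments, max_dti):
    """Apply a candidate approval policy; return (approvals, default rate)."""
    approved = portfolio[(portfolio["credit_score"] >= min_score)
                         & (portfolio["employment"].isin(employments))
                         & (portfolio["dti"] <= max_dti)]
    rate = approved["defaulted"].mean() if len(approved) else float("nan")
    return len(approved), rate

# Hypothetical applicants, not the CrestBank rows
toy = pd.DataFrame({
    "credit_score": [710, 690, 640, 720, 600],
    "employment":   ["Employed", "Contract", "Contract", "Employed", "Self-Employed"],
    "dti":          [20.0, 38.0, 50.0, 25.0, 55.0],
    "defaulted":    [0, 0, 1, 0, 1],
})

# Strict policy vs the relaxed variant from the checkpoint
for min_score, emps in [(700, ["Employed"]),
                        (680, ["Employed", "Contract"])]:
    n, rate = policy_outcome(toy, min_score, emps, max_dti=43)
    print(f"score>={min_score}, {emps}: approved={n}, default rate={rate:.1%}")
```

Each additional (threshold, employment-set) pair becomes one row of the approval-volume vs default-rate trade-off table the committee will want to see.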
Key Findings
LTV has the strongest correlation with default at r = +0.993, with credit score at r = −0.948 and DTI at r = +0.948 close behind. Every borrower with a score below 700 defaulted; every borrower at 700 or above performed. The 700 threshold is the natural approval cutoff this data recommends.
Self-Employed and Contract borrowers have 100% default rates in this portfolio. Employed borrowers have 0%. This is the single most actionable finding — employment type is a near-perfect predictor of default in this dataset.
The pivot table reveals that even Contract borrowers in the Mid income band default 100% of the time — income alone does not offset employment instability risk. The employment-type restriction should apply regardless of income level.
Defaulted borrowers average a DTI of 48.6% versus 18.8% for performing borrowers — a 2.6× gap. The 43% DTI threshold separates most of the two populations, though two defaulters (A008 at 40.2% and A018 at 37.0%) sat below it — DTI is a strong signal but should not be the sole lending criterion.
The proposed three-criterion policy (score ≥ 700, Employed, DTI ≤ 43%) would have produced a 0% default rate while approving 50% of the portfolio — eliminating all 10 defaults with a clear, auditable, regulatory-defensible set of criteria.
Visualisations
Pivot Table — Default Rate by Employment × Income Band (%)
The two-dimensional view confirms that employment risk holds in the Mid income band as well as the Low band:
| Employment | Low (<40k) | Mid (40–65k) | High (65–90k) | Premium (90k+) |
|---|---|---|---|---|
| Contract | 100% ✗ | 100% ✗ | — | — |
| Employed | — | 0% ✓ | 0% ✓ | 0% ✓ |
| Self-Employed | 100% ✗ | 100% ✗ | — | — |
Red = high default rate · Green = 0% default · — = no applicants in this combination
Credit Risk EDA Decision Guide
Credit risk datasets always require derived ratio columns before any analysis is meaningful. Here is the standard framework:
| Task | Method | pandas Call | Watch Out For |
|---|---|---|---|
| DTI ratio | Column derivation | monthly_debt / (income / 12) * 100 | Derive monthly income first |
| LTV ratio | Column derivation | loan_amount / collateral_value * 100 | Collateral value may be estimated — check source |
| Income bands | pd.cut() with thresholds | pd.cut(income, bins=[...], labels=[...]) | Use business thresholds, not equal widths |
| Default rate by group | groupby + mean of binary | groupby("col")["defaulted"].mean() | Always show count alongside rate — small groups mislead |
| Policy simulation | Multi-condition filter | df[(score>=700)&(emp=="Employed")] | Test multiple threshold combinations for the committee |
What Would Come Next?
A senior analyst would build a logistic regression scorecard using DTI, credit score, and employment type as features, producing a probability-calibrated default score for each applicant — allowing the bank to set a single probability threshold rather than three separate hard rules.
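A minimal sketch of that scorecard idea — assuming plain NumPy gradient descent rather than a modelling library, and hypothetical training rows (a real scorecard would be fit on thousands of loans, with regularisation and calibration checks):

```python
import numpy as np

# Hypothetical applicants: [dti, credit_score, is_employed] -> defaulted
X = np.array([
    [44.0, 634, 0], [47.0, 601, 0], [57.0, 588, 0], [40.0, 622, 0],
    [16.0, 718, 1], [20.0, 762, 1], [18.0, 741, 1], [21.0, 798, 1],
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

# Standardise features so gradient descent behaves, then add an intercept column
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xs = np.hstack([np.ones((len(Xs), 1)), Xs])

# Plain gradient descent on the logistic loss
w = np.zeros(Xs.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xs @ w))          # predicted default probability
    w -= 0.5 * Xs.T @ (p - y) / len(y)          # step down the loss gradient

probs = 1.0 / (1.0 + np.exp(-Xs @ w))
print(np.round(probs, 3))  # per-applicant default probabilities
```

The bank would then pick a single probability cutoff (say, reject above 10% predicted default) instead of maintaining three separate hard rules.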
Limitations of This Analysis
Twenty loans with perfect separation between groups produces unrealistically clean results. Real credit portfolios have thousands of borderline cases where DTI and credit score give conflicting signals, and employment type is not always binary — part-time, zero-hours contracts, and gig work require more nuanced categorisation.
Business Decisions This Could Drive
Raise the minimum credit score threshold to 700 and restrict approval to Employed applicants only until the Self-Employed and Contract risk pricing is reviewed. Commission a separate risk-pricing model for non-employed segments to determine whether a rate premium could offset the default risk rather than excluding these applicants entirely.
Practice Questions
1. Write the formula for computing the Debt-to-Income (DTI) ratio as a percentage, given annual income and monthly debt columns.
2. At which credit score threshold did default rate drop from 100% to 0% in the CrestBank dataset?
3. Which variable had the strongest correlation (by absolute value) with the default outcome — credit_score, dti, or ltv?
Quiz
1. What does a DTI ratio of 52% tell a credit analyst about a loan applicant?
2. Why were specific dollar thresholds ($40k, $65k, $90k) used in pd.cut() for income bands rather than equal-width bins?
3. What is the correct pandas expression to compute the default rate (proportion of defaults) per employment type?
Up Next · Case Study 13
Analysing Marketing Campaign Performance
You are handed a multi-channel campaign dataset. Which channels deliver the best cost-per-acquisition? How does conversion rate vary by audience segment? And which campaign-channel combinations are burning budget with nothing to show for it?