DS Case Studies
Analysing Bank Loan Approval Data
Every loan application tells a story about risk. The bank's job is to read that story accurately — approving creditworthy applicants and declining those who are likely to default. Your job is to find the patterns in the data that separate one group from the other.
You are a junior data analyst at CrestBank, a regional lending institution. The credit risk team has asked you to analyse the last 12 loan applications to understand what distinguishes approved from rejected applicants. They want to know which variables — income, credit score, loan amount, employment type — are most predictive of approval, and which applicant profiles the bank is consistently declining.
What This Case Study Covers
Loan approval analysis is a classification problem at heart — each application gets a binary outcome. Before any model is built, an analyst needs to understand the data distributions for each outcome class. The key questions are always: what does a typical approved applicant look like? What does a typical rejected applicant look like? And how large are the gaps between the two profiles?
This case study covers four layers: approval rate analysis overall and by employment type, profile comparison — mean income, credit score, and loan amount for approved vs rejected applicants, threshold analysis to find the income and credit score levels that separate outcomes, and debt-to-income ratio as a derived risk metric that banks use as a primary lending signal.
The Loan EDA Toolkit
Approval Rate Analysis
Compute the overall approval rate and break it down by employment type. Salaried, self-employed, and contract workers may face systematically different approval rates — a signal that the bank's risk model treats employment stability as a key variable.Profile Comparison
Compute mean income, credit score, and loan amount for approved vs rejected applicants. The size of the gap on each variable indicates how strongly it separates the two classes — large gaps are strong predictors, small gaps are weak ones.Debt-to-Income Ratio
Divide loan amount by annual income to compute the debt-to-income (DTI) ratio — the single most commonly used lending metric in retail banking. A high DTI means a large loan relative to what the applicant earns. Banks typically reject applications above 40–50% DTI.Credit Score Threshold Analysis
Find the minimum credit score among approved applicants and the maximum among rejected ones. The overlap zone — applicants in between — is where the bank's decision boundary sits, and where the most marginal decisions are being made.Correlation with Approval
Encode approval as a binary flag (1/0) and compute Pearson correlation with each numeric variable. High positive correlation means more of that variable leads to more approvals. High negative correlation means more of it leads to rejections.Dataset Overview
CrestBank's loan applications dataset contains 12 records — one row per application — covering income, credit score, loan amount, employment type, and the final approval decision. Built with pd.DataFrame().
| app_id | annual_income | credit_score | loan_amount | loan_term_yrs | employment_type | approved |
|---|---|---|---|---|---|---|
| A001 | $72,000 | 745 | $18,000 | 5 | Salaried | Yes |
| A002 | $34,000 | 612 | $22,000 | 7 | Self-Employed | No |
| A003 | $95,000 | 790 | $35,000 | 10 | Salaried | Yes |
| A004 | $41,000 | 638 | $28,000 | 7 | Contract | No |
| A005 | $58,000 | 710 | $15,000 | 5 | Salaried | Yes |
Showing first 5 of 12 rows · 7 columns
Unique application reference. Used for labelling — never analysed directly.
Applicant's annual gross income. A primary affordability signal — higher income reduces the perceived risk of a given loan size.
FICO-style credit score. The single most widely used indicator of creditworthiness — scores above 700 are generally considered good.
Amount requested by the applicant. Combined with income to compute the debt-to-income ratio — the primary derived risk metric.
Requested loan repayment period. Longer terms mean lower monthly payments but more total interest — a secondary risk signal.
Salaried, Self-Employed, or Contract. Employment stability directly affects income predictability — a key lending consideration.
Yes or No — the final lending decision. This is what we are trying to understand and ultimately model.
Business Questions
The credit risk team needs these five questions answered to calibrate their lending criteria for the next quarter.
What is the overall loan approval rate — and does it differ significantly by employment type?
What are the average income and credit score for approved vs rejected applicants — how large are the gaps?
Which applicants have a debt-to-income ratio above 40% — and were they approved or rejected?
What is the minimum credit score among approved applicants — and what is the maximum among rejected ones?
Which numeric variable correlates most strongly with loan approval — income, credit score, loan amount, or loan term?
Step-by-Step Analysis
The scenario:
The credit risk team has exported the last quarter's applications. Before they can refine their approval criteria, they need to understand what the data is telling them about the patterns in their current decisions. Open your notebook and work through it systematically.
Before computing any approval rate, we need to confirm data types are correct. The approved column is stored as Yes/No text — we will create a binary flag immediately so it can be used in all subsequent calculations.
import pandas as pd
import numpy as np
# Build the CrestBank loan applications dataset
df = pd.DataFrame({
"app_id": ["A001","A002","A003","A004","A005","A006",
"A007","A008","A009","A010","A011","A012"],
"annual_income": [72000,34000,95000,41000,58000,120000,
29000,83000,47000,67000,38000,110000],
"credit_score": [745,612,790,638,710,810,
590,755,665,720,625,780],
"loan_amount": [18000,22000,35000,28000,15000,40000,
20000,25000,30000,22000,18000,45000],
"loan_term_yrs": [5,7,10,7,5,10,7,5,10,5,7,10],
"employment_type": ["Salaried","Self-Employed","Salaried","Contract","Salaried",
"Salaried","Self-Employed","Salaried","Contract","Salaried",
"Self-Employed","Salaried"],
"approved": ["Yes","No","Yes","No","Yes","Yes",
"No","Yes","No","Yes","No","Yes"]
})
# Basic inspection
print("Shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())
# Create binary approval flag — needed for all numeric calculations
df["approved_flag"] = (df["approved"] == "Yes").astype(int)
# Overall approval rate
overall_rate = df["approved_flag"].mean()
print(f"\nOverall approval rate: {overall_rate:.1%}")
print(f"Approved: {df['approved_flag'].sum()} | Rejected: {(df['approved_flag']==0).sum()}")
# Approval rate by employment type
emp_approval = df.groupby("employment_type")["approved_flag"].agg(
count="count",
approved="sum",
approval_rate="mean"
).round(2).reset_index()
emp_approval["approval_rate"] = (emp_approval["approval_rate"] * 100).round(0)
print("\nApproval rate by employment type:")
print(emp_approval.to_string(index=False))
Shape: (12, 7)
Missing values: 0
Overall approval rate: 58.3%
Approved: 7 | Rejected: 5
Approval rate by employment type:
employment_type count approved approval_rate
Contract 2 0 0.0
Self-Employed 3 0 0.0
Salaried 7 7 100.0What just happened?
Method — creating a binary flag from a string columnThe approved column contains text ("Yes"/"No") which cannot be used in numeric calculations like .mean() or .corr(). The expression (df["approved"] == "Yes").astype(int) first creates a boolean Series (True/False), then converts it to integers (1/0) with .astype(int). This two-step pattern — boolean comparison then .astype(int) — is the standard way to encode a binary categorical variable for numeric analysis.
We passed a dictionary to .agg() — count="count", approved="sum", approval_rate="mean" — to apply three different functions and assign descriptive output column names simultaneously. This named dictionary syntax was introduced in pandas 0.25 and is the cleanest way to build a multi-metric summary table in a single grouped operation.
The employment type split is absolute: every salaried applicant was approved (100%), while every self-employed and contract applicant was rejected (0%). This is a stark pattern that immediately raises a fairness concern — is the bank systematically excluding non-salaried workers based on employment type alone, or is employment type simply a proxy for the underlying financial variables like income stability and credit score? Step 2 will tell us which interpretation is correct.
The employment type split could reflect genuine financial differences rather than bias. We compare the average income, credit score, and loan amount for each outcome group to understand what is actually driving the decisions.
# Profile comparison — mean values for approved vs rejected applicants
profile = df.groupby("approved")[
["annual_income","credit_score","loan_amount","loan_term_yrs"]
].mean().round(0)
print("Average profile by approval outcome:")
print(profile.to_string())
# Compute the gap between approved and rejected on each variable
gap = profile.loc["Yes"] - profile.loc["No"]
print("\nGap (Approved minus Rejected):")
print(gap.to_string())
# Express income gap as a percentage above rejected applicants
income_pct = (gap["annual_income"] / profile.loc["No","annual_income"] * 100)
score_pct = (gap["credit_score"] / profile.loc["No","credit_score"] * 100)
print(f"\nApproved applicants earn {income_pct:.1f}% more than rejected applicants")
print(f"Approved applicants have credit scores {score_pct:.1f}% higher than rejected")
Average profile by approval outcome:
annual_income credit_score loan_amount loan_term_yrs
approved
No 37800 626 23600 7.6
Yes 86429 758 28857 6.6
Gap (Approved minus Rejected):
annual_income 48629.0
credit_score 132.0
loan_amount 5257.0
loan_term_yrs -1.0
Approved applicants earn 128.6% more than rejected applicants
Approved applicants have credit scores 21.1% higher than rejectedWhat just happened?
Method — .loc[] for label-based row extraction after groupbyAfter groupby("approved").mean(), the outcome labels "Yes" and "No" become the row index. We use .loc["Yes"] and .loc["No"] to extract each row as a Series, then subtract them to get the gap. This is the standard pattern for extracting specific groups from a grouped result — always use .loc[] with the group label, not .iloc[] with a position, since group order can change after sorting.
The financial gap between approved and rejected applicants is enormous. Approved applicants earn 128.6% more on average ($86,429 vs $37,800) and have credit scores 21% higher (758 vs 626). This explains the employment type split from Step 1 — the self-employed and contract workers in this dataset also happen to have significantly lower incomes and credit scores. Employment type is not being used as a standalone rejection reason; it correlates with the underlying financial variables that are driving the decision. Whether that is fair is a separate question — but it is an important distinction for the credit risk team to understand.
The debt-to-income (DTI) ratio is the most widely used lending risk metric in retail banking. It measures how large a loan is relative to what the applicant earns annually. Most banks set a hard rejection threshold around 40–50% DTI.
# Compute debt-to-income ratio: loan amount as a % of annual income
df["dti_ratio"] = (df["loan_amount"] / df["annual_income"] * 100).round(1)
# Show all applicants with their DTI and approval outcome
dti_view = df[["app_id","annual_income","loan_amount","dti_ratio","approved"]].sort_values(
"dti_ratio", ascending=False
)
print("All applicants sorted by DTI ratio:")
print(dti_view.to_string(index=False))
# Mean DTI for approved vs rejected
dti_by_outcome = df.groupby("approved")["dti_ratio"].agg(["mean","max","min"]).round(1)
print("\nDTI statistics by approval outcome:")
print(dti_by_outcome.to_string())
# Flag high-DTI applicants (above 40%) and check their outcomes
high_dti = df[df["dti_ratio"] > 40][["app_id","dti_ratio","approved","credit_score"]]
print(f"\nHigh DTI applicants (>40%): {len(high_dti)}")
print(high_dti.to_string(index=False))
All applicants sorted by DTI ratio:
app_id annual_income loan_amount dti_ratio approved
A007 29000 20000 69.0 No
A002 34000 22000 64.7 No
A004 41000 28000 68.3 No
A009 47000 30000 63.8 No
A011 38000 18000 47.4 No
A001 72000 18000 25.0 Yes
A010 67000 22000 32.8 Yes
A005 58000 15000 25.9 Yes
A008 83000 25000 30.1 Yes
A003 95000 35000 36.8 Yes
A006 120000 40000 33.3 Yes
A012 110000 45000 40.9 Yes
DTI statistics by approval outcome:
mean max min
approved
No 62.6 69.0 47.4
Yes 32.1 40.9 25.0
High DTI applicants (>40%): 6
app_id dti_ratio approved credit_score
A012 40.9 Yes 780
A011 47.4 No 625
A009 63.8 No 665
A002 64.7 No 612
A004 68.3 No 638
A007 69.0 No 590What just happened?
Method — derived metric from two raw columnsThe DTI ratio does not exist in the raw data — we computed it by dividing loan_amount by annual_income and multiplying by 100 to express it as a percentage. This single line creates a new column that is available for all subsequent filtering and grouping. Derived metrics like DTI are often more analytically powerful than either raw column alone, because they capture a relationship between variables rather than an absolute value.
The DTI split is as stark as the employment type split. Rejected applicants have a mean DTI of 62.6% — well above any standard lending threshold — while approved applicants average 32.1%. The one exception is A012, which was approved despite a 40.9% DTI — but that applicant has a credit score of 780, one of the highest in the dataset. This reveals an important nuance in CrestBank's decision process: a very high credit score can partially compensate for a borderline DTI. That interaction between variables is exactly the kind of insight the credit risk team needs to document explicitly in their lending criteria.
The credit risk team wants to know where the decision boundary sits on credit score — and which numeric variable is the strongest predictor of approval overall. We find the threshold and run the correlation analysis to answer both questions.
# Credit score threshold analysis
min_approved_score = df[df["approved"] == "Yes"]["credit_score"].min()
max_rejected_score = df[df["approved"] == "No"]["credit_score"].max()
print(f"Minimum credit score among APPROVED applicants: {min_approved_score}")
print(f"Maximum credit score among REJECTED applicants: {max_rejected_score}")
print(f"Overlap zone: {max_rejected_score} to {min_approved_score}")
# Applicants in the overlap zone (if any)
overlap = df[
(df["credit_score"] >= max_rejected_score) &
(df["credit_score"] <= min_approved_score)
][["app_id","credit_score","annual_income","dti_ratio","approved"]]
print(f"\nApplicants in the credit score overlap zone: {len(overlap)}")
if len(overlap) > 0:
print(overlap.to_string(index=False))
# Correlation of each numeric variable with approval flag
corr_vars = ["annual_income","credit_score","loan_amount","loan_term_yrs","dti_ratio"]
print("\nCorrelation with approval (Pearson r):")
for var in corr_vars:
r = df[var].corr(df["approved_flag"]).round(3)
print(f" {var:<18} r = {r:+.3f}")
Minimum credit score among APPROVED applicants: 710 Maximum credit score among REJECTED applicants: 665 Overlap zone: 665 to 710 Applicants in the credit score overlap zone: 0 Correlation with approval (Pearson r): annual_income r = +0.928 credit_score r = +0.906 loan_amount r = +0.308 loan_term_yrs r = -0.098 dti_ratio r = -0.941
What just happened?
Method — min/max filtering on a subsetWe filtered the DataFrame to just approved applicants with df[df["approved"] == "Yes"], then called .min() on the credit score column. The same pattern with df[df["approved"] == "No"] and .max() gives the ceiling for rejected applicants. Together these two numbers define the decision boundary — any credit score between them is the ambiguous zone where the bank's other variables (DTI, income) presumably break the tie.
There is a clean separation in credit scores — the lowest approved score (710) is above the highest rejected score (665). No applicants fall in the overlap zone, meaning credit score alone is a near-perfect classifier in this dataset. But the correlation analysis adds important nuance: DTI ratio has the strongest correlation with approval at −0.941 (negative because higher DTI = less likely to be approved), followed by income (+0.928) and credit score (+0.906). Loan term has almost no linear relationship with approval (r = −0.098). For the credit risk team: DTI is the primary signal, income is the secondary signal, credit score is the tertiary signal — and loan term should be removed from the decision model entirely.
Checkpoint: Try computing a combined risk score — df["risk_score"] = df["dti_ratio"] - (df["credit_score"] / 10) + (df["annual_income"] / 10000) — and ranking applicants by it. Does this simple formula rank applicants in the same order as the actual approval decision? Where does it get it wrong? This is how a credit analyst begins to prototype a scoring model before handing it to a data scientist for formal modelling.
Key Findings
The overall approval rate is 58.3%. Every salaried applicant was approved (100%) while every self-employed and contract applicant was rejected (0%) — but this reflects underlying financial differences, not employment type being used as a standalone rejection criterion.
Approved applicants earn 128.6% more on average ($86,429 vs $37,800) and have credit scores 21% higher (758 vs 626). The income gap alone is the single largest observable difference between the two groups.
DTI ratio is the strongest predictor of approval at r = −0.941. Rejected applicants average a DTI of 62.6% vs 32.1% for approved applicants. Every applicant with DTI above 47% was rejected — with one exception (A012, DTI 40.9%) compensated by a credit score of 780.
There is clean credit score separation — the minimum approved score (710) is above the maximum rejected score (665) with no overlap. In this dataset, any applicant with a credit score above 710 was approved; below 665, rejected.
Loan term has virtually no correlation with approval (r = −0.098). The credit risk team can safely remove loan term from any future scoring model — it contributes no predictive power in this dataset.
Visualisations
Loan EDA Decision Guide
Every loan approval dataset has the same structure — a binary outcome and a set of financial predictor variables. Here is the framework for approaching it systematically:
| Question | Metric | pandas Method | Watch Out For |
|---|---|---|---|
| What is the approval rate? | Mean of binary flag | (df["col"]=="Yes").astype(int).mean() | Always encode string to 0/1 first |
| How do groups differ? | Mean per outcome class | groupby("approved").mean() | Use .loc[] to extract rows by label |
| What is the risk ratio? | DTI = loan / income × 100 | df["loan"] / df["income"] * 100 | DTI threshold varies by bank policy |
| Where is the boundary? | Min approved / max rejected | df[df["approved"]=="Yes"]["col"].min() | Overlap zone = ambiguous cases |
| Which variable predicts most? | Pearson correlation with flag | df["col"].corr(df["flag"]) | Negative r = variable predicts rejection |
Analyst's Note
Teacher's Note
What Would Come Next?
A senior analyst would build a logistic regression model using DTI, income, and credit score as features — the three most predictive variables identified here — and backtest it against historical default data to check whether approved applicants who later defaulted had identifiable characteristics the current model misses.
Limitations of This Analysis
Twelve applications produce clean separation that is unrealistic at scale. Real lending datasets have thousands of borderline cases where variables interact in complex ways, and important predictors like existing debt obligations and payment history on existing accounts are absent here.
Business Decisions This Could Drive
Formalise a hard DTI ceiling of 45% and a minimum credit score of 700 for standard applications. Documenting these thresholds explicitly reduces decision inconsistency and provides a defensible audit trail for regulatory review.
Practice Questions
1. Which variable had the strongest correlation with loan approval in the CrestBank dataset?
2. After creating a boolean Series with df["approved"] == "Yes", which method converts it to a numeric 0/1 column?
3. Write the pandas formula used to compute the debt-to-income ratio as a percentage in this case study.
Quiz
1. Every self-employed and contract applicant was rejected. What does the deeper analysis reveal about why?
2. Why is the debt-to-income (DTI) ratio a primary rejection signal in lending?
3. Which variable should CrestBank remove from their scoring model based on this analysis, and why?
Up Next · Case Study 8
Exploring House Price Data
You are handed a property listings dataset. Which features drive price the most — size, location, or age? How does price per square foot vary by neighbourhood? And which properties look undervalued relative to their attributes?