DS Case Study X – Bank Loan Approval Basics | Dataplexa
Beginner Case Study · CS 7

Analysing Bank Loan Approval Data

Every loan application tells a story about risk. The bank's job is to read that story accurately — approving creditworthy applicants and declining those who are likely to default. Your job is to find the patterns in the data that separate one group from the other.

You are a junior data analyst at CrestBank, a regional lending institution. The credit risk team has asked you to analyse the last 12 loan applications to understand what distinguishes approved from rejected applicants. They want to know which variables — income, credit score, loan amount, employment type — are most predictive of approval, and which applicant profiles the bank is consistently declining.

IndustryBanking & Finance
TechniqueEDA · Classification
Librariespandas · numpy
DifficultyBeginner
Est. Time40–50 min
Overview

What This Case Study Covers

Loan approval analysis is a classification problem at heart — each application gets a binary outcome. Before any model is built, an analyst needs to understand the data distributions for each outcome class. The key questions are always: what does a typical approved applicant look like? What does a typical rejected applicant look like? And how large are the gaps between the two profiles?

This case study covers four layers: approval rate analysis overall and by employment type, profile comparison — mean income, credit score, and loan amount for approved vs rejected applicants, threshold analysis to find the income and credit score levels that separate outcomes, and debt-to-income ratio as a derived risk metric that banks use as a primary lending signal.

The Loan EDA Toolkit

1

Approval Rate Analysis

Compute the overall approval rate and break it down by employment type. Salaried, self-employed, and contract workers may face systematically different approval rates — a signal that the bank's risk model treats employment stability as a key variable.
2

Profile Comparison

Compute mean income, credit score, and loan amount for approved vs rejected applicants. The size of the gap on each variable indicates how strongly it separates the two classes — large gaps are strong predictors, small gaps are weak ones.
3

Debt-to-Income Ratio

Divide loan amount by annual income to compute the debt-to-income (DTI) ratio — the single most commonly used lending metric in retail banking. A high DTI means a large loan relative to what the applicant earns. Banks typically reject applications above 40–50% DTI.
4

Credit Score Threshold Analysis

Find the minimum credit score among approved applicants and the maximum among rejected ones. The overlap zone — applicants in between — is where the bank's decision boundary sits, and where the most marginal decisions are being made.
5

Correlation with Approval

Encode approval as a binary flag (1/0) and compute Pearson correlation with each numeric variable. High positive correlation means more of that variable leads to more approvals. High negative correlation means more of it leads to rejections.
01

Dataset Overview

CrestBank's loan applications dataset contains 12 records — one row per application — covering income, credit score, loan amount, employment type, and the final approval decision. Built with pd.DataFrame().

app_idannual_incomecredit_scoreloan_amountloan_term_yrsemployment_typeapproved
A001$72,000745$18,0005SalariedYes
A002$34,000612$22,0007Self-EmployedNo
A003$95,000790$35,00010SalariedYes
A004$41,000638$28,0007ContractNo
A005$58,000710$15,0005SalariedYes

Showing first 5 of 12 rows · 7 columns

app_idstring · unique identifier

Unique application reference. Used for labelling — never analysed directly.

annual_incomeint64 · USD per year

Applicant's annual gross income. A primary affordability signal — higher income reduces the perceived risk of a given loan size.

credit_scoreint64 · 300–850

FICO-style credit score. The single most widely used indicator of creditworthiness — scores above 700 are generally considered good.

loan_amountint64 · USD

Amount requested by the applicant. Combined with income to compute the debt-to-income ratio — the primary derived risk metric.

loan_term_yrsint64 · years

Requested loan repayment period. Longer terms mean lower monthly payments but more total interest — a secondary risk signal.

employment_typeobject · 3 categories

Salaried, Self-Employed, or Contract. Employment stability directly affects income predictability — a key lending consideration.

approvedobject · binary target

Yes or No — the final lending decision. This is what we are trying to understand and ultimately model.

02

Business Questions

The credit risk team needs these five questions answered to calibrate their lending criteria for the next quarter.

1

What is the overall loan approval rate — and does it differ significantly by employment type?

2

What are the average income and credit score for approved vs rejected applicants — how large are the gaps?

3

Which applicants have a debt-to-income ratio above 40% — and were they approved or rejected?

4

What is the minimum credit score among approved applicants — and what is the maximum among rejected ones?

5

Which numeric variable correlates most strongly with loan approval — income, credit score, loan amount, or loan term?

03

Step-by-Step Analysis

The scenario:

The credit risk team has exported the last quarter's applications. Before they can refine their approval criteria, they need to understand what the data is telling them about the patterns in their current decisions. Open your notebook and work through it systematically.

Step 1Load and Inspect, Then Compute Approval Rates

Before computing any approval rate, we need to confirm data types are correct. The approved column is stored as Yes/No text — we will create a binary flag immediately so it can be used in all subsequent calculations.

import pandas as pd
import numpy as np

# Build the CrestBank loan applications dataset
df = pd.DataFrame({
    "app_id":          ["A001","A002","A003","A004","A005","A006",
                        "A007","A008","A009","A010","A011","A012"],
    "annual_income":   [72000,34000,95000,41000,58000,120000,
                        29000,83000,47000,67000,38000,110000],
    "credit_score":    [745,612,790,638,710,810,
                        590,755,665,720,625,780],
    "loan_amount":     [18000,22000,35000,28000,15000,40000,
                        20000,25000,30000,22000,18000,45000],
    "loan_term_yrs":   [5,7,10,7,5,10,7,5,10,5,7,10],
    "employment_type": ["Salaried","Self-Employed","Salaried","Contract","Salaried",
                        "Salaried","Self-Employed","Salaried","Contract","Salaried",
                        "Self-Employed","Salaried"],
    "approved":        ["Yes","No","Yes","No","Yes","Yes",
                        "No","Yes","No","Yes","No","Yes"]
})

# Basic inspection
print("Shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())

# Create binary approval flag — needed for all numeric calculations
df["approved_flag"] = (df["approved"] == "Yes").astype(int)

# Overall approval rate
overall_rate = df["approved_flag"].mean()
print(f"\nOverall approval rate: {overall_rate:.1%}")
print(f"Approved: {df['approved_flag'].sum()} | Rejected: {(df['approved_flag']==0).sum()}")

# Approval rate by employment type
emp_approval = df.groupby("employment_type")["approved_flag"].agg(
    count="count",
    approved="sum",
    approval_rate="mean"
).round(2).reset_index()
emp_approval["approval_rate"] = (emp_approval["approval_rate"] * 100).round(0)

print("\nApproval rate by employment type:")
print(emp_approval.to_string(index=False))
Shape: (12, 7)
Missing values: 0

Overall approval rate: 58.3%
Approved: 7 | Rejected: 5

Approval rate by employment type:
 employment_type  count  approved  approval_rate
        Contract      2         0           0.0
    Self-Employed      3         0           0.0
         Salaried      7         7         100.0

What just happened?

Method — creating a binary flag from a string column

The approved column contains text ("Yes"/"No") which cannot be used in numeric calculations like .mean() or .corr(). The expression (df["approved"] == "Yes").astype(int) first creates a boolean Series (True/False), then converts it to integers (1/0) with .astype(int). This two-step pattern — boolean comparison then .astype(int) — is the standard way to encode a binary categorical variable for numeric analysis.

Method — .agg() with named aggregations

We passed a dictionary to .agg()count="count", approved="sum", approval_rate="mean" — to apply three different functions and assign descriptive output column names simultaneously. This named dictionary syntax was introduced in pandas 0.25 and is the cleanest way to build a multi-metric summary table in a single grouped operation.

Business Insight

The employment type split is absolute: every salaried applicant was approved (100%), while every self-employed and contract applicant was rejected (0%). This is a stark pattern that immediately raises a fairness concern — is the bank systematically excluding non-salaried workers based on employment type alone, or is employment type simply a proxy for the underlying financial variables like income stability and credit score? Step 2 will tell us which interpretation is correct.

Step 2Applicant Profile Comparison — Approved vs Rejected

The employment type split could reflect genuine financial differences rather than bias. We compare the average income, credit score, and loan amount for each outcome group to understand what is actually driving the decisions.

# Profile comparison — mean values for approved vs rejected applicants
profile = df.groupby("approved")[
    ["annual_income","credit_score","loan_amount","loan_term_yrs"]
].mean().round(0)

print("Average profile by approval outcome:")
print(profile.to_string())

# Compute the gap between approved and rejected on each variable
gap = profile.loc["Yes"] - profile.loc["No"]
print("\nGap (Approved minus Rejected):")
print(gap.to_string())

# Express income gap as a percentage above rejected applicants
income_pct = (gap["annual_income"] / profile.loc["No","annual_income"] * 100)
score_pct  = (gap["credit_score"]  / profile.loc["No","credit_score"]  * 100)
print(f"\nApproved applicants earn {income_pct:.1f}% more than rejected applicants")
print(f"Approved applicants have credit scores {score_pct:.1f}% higher than rejected")
Average profile by approval outcome:
          annual_income  credit_score  loan_amount  loan_term_yrs
approved
No               37800           626        23600              7.6
Yes              86429           758        28857              6.6

Gap (Approved minus Rejected):
annual_income    48629.0
credit_score       132.0
loan_amount       5257.0
loan_term_yrs       -1.0

Approved applicants earn 128.6% more than rejected applicants
Approved applicants have credit scores 21.1% higher than rejected

What just happened?

Method — .loc[] for label-based row extraction after groupby

After groupby("approved").mean(), the outcome labels "Yes" and "No" become the row index. We use .loc["Yes"] and .loc["No"] to extract each row as a Series, then subtract them to get the gap. This is the standard pattern for extracting specific groups from a grouped result — always use .loc[] with the group label, not .iloc[] with a position, since group order can change after sorting.

Business Insight

The financial gap between approved and rejected applicants is enormous. Approved applicants earn 128.6% more on average ($86,429 vs $37,800) and have credit scores 21% higher (758 vs 626). This explains the employment type split from Step 1 — the self-employed and contract workers in this dataset also happen to have significantly lower incomes and credit scores. Employment type is not being used as a standalone rejection reason; it correlates with the underlying financial variables that are driving the decision. Whether that is fair is a separate question — but it is an important distinction for the credit risk team to understand.

Step 3Debt-to-Income Ratio Analysis

The debt-to-income (DTI) ratio is the most widely used lending risk metric in retail banking. It measures how large a loan is relative to what the applicant earns annually. Most banks set a hard rejection threshold around 40–50% DTI.

# Compute debt-to-income ratio: loan amount as a % of annual income
df["dti_ratio"] = (df["loan_amount"] / df["annual_income"] * 100).round(1)

# Show all applicants with their DTI and approval outcome
dti_view = df[["app_id","annual_income","loan_amount","dti_ratio","approved"]].sort_values(
    "dti_ratio", ascending=False
)
print("All applicants sorted by DTI ratio:")
print(dti_view.to_string(index=False))

# Mean DTI for approved vs rejected
dti_by_outcome = df.groupby("approved")["dti_ratio"].agg(["mean","max","min"]).round(1)
print("\nDTI statistics by approval outcome:")
print(dti_by_outcome.to_string())

# Flag high-DTI applicants (above 40%) and check their outcomes
high_dti = df[df["dti_ratio"] > 40][["app_id","dti_ratio","approved","credit_score"]]
print(f"\nHigh DTI applicants (>40%): {len(high_dti)}")
print(high_dti.to_string(index=False))
All applicants sorted by DTI ratio:
 app_id  annual_income  loan_amount  dti_ratio approved
   A007          29000        20000       69.0       No
   A002          34000        22000       64.7       No
   A004          41000        28000       68.3       No
   A009          47000        30000       63.8       No
   A011          38000        18000       47.4       No
   A001          72000        18000       25.0      Yes
   A010          67000        22000       32.8      Yes
   A005          58000        15000       25.9      Yes
   A008          83000        25000       30.1      Yes
   A003          95000        35000       36.8      Yes
   A006         120000        40000       33.3      Yes
   A012         110000        45000       40.9      Yes

DTI statistics by approval outcome:
          mean   max   min
approved
No        62.6  69.0  47.4
Yes       32.1  40.9  25.0

High DTI applicants (>40%): 6
 app_id  dti_ratio approved  credit_score
   A012       40.9      Yes           780
   A011       47.4       No           625
   A009       63.8       No           665
   A002       64.7       No           612
   A004       68.3       No           638
   A007       69.0       No           590

What just happened?

Method — derived metric from two raw columns

The DTI ratio does not exist in the raw data — we computed it by dividing loan_amount by annual_income and multiplying by 100 to express it as a percentage. This single line creates a new column that is available for all subsequent filtering and grouping. Derived metrics like DTI are often more analytically powerful than either raw column alone, because they capture a relationship between variables rather than an absolute value.

Business Insight

The DTI split is as stark as the employment type split. Rejected applicants have a mean DTI of 62.6% — well above any standard lending threshold — while approved applicants average 32.1%. The one exception is A012, which was approved despite a 40.9% DTI — but that applicant has a credit score of 780, one of the highest in the dataset. This reveals an important nuance in CrestBank's decision process: a very high credit score can partially compensate for a borderline DTI. That interaction between variables is exactly the kind of insight the credit risk team needs to document explicitly in their lending criteria.

Step 4Credit Score Thresholds and Correlation Analysis

The credit risk team wants to know where the decision boundary sits on credit score — and which numeric variable is the strongest predictor of approval overall. We find the threshold and run the correlation analysis to answer both questions.

# Credit score threshold analysis
min_approved_score = df[df["approved"] == "Yes"]["credit_score"].min()
max_rejected_score = df[df["approved"] == "No"]["credit_score"].max()

print(f"Minimum credit score among APPROVED applicants: {min_approved_score}")
print(f"Maximum credit score among REJECTED applicants: {max_rejected_score}")
print(f"Overlap zone: {max_rejected_score} to {min_approved_score}")

# Applicants in the overlap zone (if any)
overlap = df[
    (df["credit_score"] >= max_rejected_score) &
    (df["credit_score"] <= min_approved_score)
][["app_id","credit_score","annual_income","dti_ratio","approved"]]
print(f"\nApplicants in the credit score overlap zone: {len(overlap)}")
if len(overlap) > 0:
    print(overlap.to_string(index=False))

# Correlation of each numeric variable with approval flag
corr_vars = ["annual_income","credit_score","loan_amount","loan_term_yrs","dti_ratio"]
print("\nCorrelation with approval (Pearson r):")
for var in corr_vars:
    r = df[var].corr(df["approved_flag"]).round(3)
    print(f"  {var:<18} r = {r:+.3f}")
Minimum credit score among APPROVED applicants: 710
Maximum credit score among REJECTED applicants: 665

Overlap zone: 665 to 710

Applicants in the credit score overlap zone: 0

Correlation with approval (Pearson r):
  annual_income      r = +0.928
  credit_score       r = +0.906
  loan_amount        r = +0.308
  loan_term_yrs      r = -0.098
  dti_ratio          r = -0.941

What just happened?

Method — min/max filtering on a subset

We filtered the DataFrame to just approved applicants with df[df["approved"] == "Yes"], then called .min() on the credit score column. The same pattern with df[df["approved"] == "No"] and .max() gives the ceiling for rejected applicants. Together these two numbers define the decision boundary — any credit score between them is the ambiguous zone where the bank's other variables (DTI, income) presumably break the tie.

Business Insight

There is a clean separation in credit scores — the lowest approved score (710) is above the highest rejected score (665). No applicants fall in the overlap zone, meaning credit score alone is a near-perfect classifier in this dataset. But the correlation analysis adds important nuance: DTI ratio has the strongest correlation with approval at −0.941 (negative because higher DTI = less likely to be approved), followed by income (+0.928) and credit score (+0.906). Loan term has almost no linear relationship with approval (r = −0.098). For the credit risk team: DTI is the primary signal, income is the secondary signal, credit score is the tertiary signal — and loan term should be removed from the decision model entirely.

Checkpoint: Try computing a combined risk score — df["risk_score"] = df["dti_ratio"] - (df["credit_score"] / 10) + (df["annual_income"] / 10000) — and ranking applicants by it. Does this simple formula rank applicants in the same order as the actual approval decision? Where does it get it wrong? This is how a credit analyst begins to prototype a scoring model before handing it to a data scientist for formal modelling.

04

Key Findings

01

The overall approval rate is 58.3%. Every salaried applicant was approved (100%) while every self-employed and contract applicant was rejected (0%) — but this reflects underlying financial differences, not employment type being used as a standalone rejection criterion.

02

Approved applicants earn 128.6% more on average ($86,429 vs $37,800) and have credit scores 21% higher (758 vs 626). The income gap alone is the single largest observable difference between the two groups.

03

DTI ratio is the strongest predictor of approval at r = −0.941. Rejected applicants average a DTI of 62.6% vs 32.1% for approved applicants. Every applicant with DTI above 47% was rejected — with one exception (A012, DTI 40.9%) compensated by a credit score of 780.

04

There is clean credit score separation — the minimum approved score (710) is above the maximum rejected score (665) with no overlap. In this dataset, any applicant with a credit score above 710 was approved; below 665, rejected.

05

Loan term has virtually no correlation with approval (r = −0.098). The credit risk team can safely remove loan term from any future scoring model — it contributes no predictive power in this dataset.

05

Visualisations

Approval Rate by Employment Type
% of applicants approved per employment category
Salaried
100%
100%
Self-Employed
0%
Contract
0%
Mean DTI Ratio — Approved vs Rejected
Debt-to-income ratio · bank threshold ≈ 40%
Rejected
62.6%
62.6%
Approved
32.1%
32.1%
Bank threshold
40%
40%
Correlation with Loan Approval — Pearson r
Positive = more of this variable → more approvals · Negative = more → more rejections
dti_ratio
−0.941
−0.941
annual_income
+0.928
+0.928
credit_score
+0.906
+0.906
loan_amount
+0.308
+0.308
loan_term_yrs
−0.098
−0.098
06

Loan EDA Decision Guide

Every loan approval dataset has the same structure — a binary outcome and a set of financial predictor variables. Here is the framework for approaching it systematically:

Question Metric pandas Method Watch Out For
What is the approval rate?Mean of binary flag(df["col"]=="Yes").astype(int).mean()Always encode string to 0/1 first
How do groups differ?Mean per outcome classgroupby("approved").mean()Use .loc[] to extract rows by label
What is the risk ratio?DTI = loan / income × 100df["loan"] / df["income"] * 100DTI threshold varies by bank policy
Where is the boundary?Min approved / max rejecteddf[df["approved"]=="Yes"]["col"].min()Overlap zone = ambiguous cases
Which variable predicts most?Pearson correlation with flagdf["col"].corr(df["flag"])Negative r = variable predicts rejection
07

Analyst's Note

Teacher's Note

What Would Come Next?

A senior analyst would build a logistic regression model using DTI, income, and credit score as features — the three most predictive variables identified here — and backtest it against historical default data to check whether approved applicants who later defaulted had identifiable characteristics the current model misses.

Limitations of This Analysis

Twelve applications produce clean separation that is unrealistic at scale. Real lending datasets have thousands of borderline cases where variables interact in complex ways, and important predictors like existing debt obligations and payment history on existing accounts are absent here.

Business Decisions This Could Drive

Formalise a hard DTI ceiling of 45% and a minimum credit score of 700 for standard applications. Documenting these thresholds explicitly reduces decision inconsistency and provides a defensible audit trail for regulatory review.

Practice Questions

1. Which variable had the strongest correlation with loan approval in the CrestBank dataset?



2. After creating a boolean Series with df["approved"] == "Yes", which method converts it to a numeric 0/1 column?



3. Write the pandas formula used to compute the debt-to-income ratio as a percentage in this case study.



Quiz

1. Every self-employed and contract applicant was rejected. What does the deeper analysis reveal about why?


2. Why is the debt-to-income (DTI) ratio a primary rejection signal in lending?


3. Which variable should CrestBank remove from their scoring model based on this analysis, and why?


Up Next · Case Study 8

Exploring House Price Data

You are handed a property listings dataset. Which features drive price the most — size, location, or age? How does price per square foot vary by neighbourhood? And which properties look undervalued relative to their attributes?