DS Case Studies
Diagnosing HR Attrition Patterns
Replacing an employee costs between 50% and 200% of their annual salary once you account for recruiting fees, onboarding time, and lost productivity. Yet most organisations track attrition as a single headline number — and have no idea which departments, roles, or salary bands are driving it.
You are a data analyst at Meridian Group, a professional services firm. The Chief People Officer has escalated a concern: voluntary attrition has climbed from 11% to 18% over six months, and the board wants to know why. She needs a department-by-department breakdown of attrition rates, an analysis of whether salary band and tenure predict who leaves, and a replacement cost estimate by segment so she can build the business case for targeted retention investment. The findings are due before the board meeting on Friday.
What This Case Study Covers
HR attrition analysis is the final case study in the Intermediate tier — and it brings together every technique from CS11 through CS19. Attrition rate is a binary column mean, computed exactly as churn in CS14, readmission in CS16, and on-time performance in CS18. Replacement cost is a financial metric derived from an aggregated salary figure multiplied by an industry multiplier — the same structure as CS16's ward cost model and CS18's delay cost model. The new element is retention ROI: computing the financial return on a proposed retention investment, giving the CPO a dollar-for-dollar business case for the board.
The capstone pattern introduced here: retention ROI calculation — estimating how much attrition reduction a retention programme would achieve, multiplying saved headcount by replacement cost, and comparing that saving to the programme's cost. This is the final translation from analytical output to boardroom business case, closing the loop on every cost-modelling pattern introduced across the Intermediate tier.
The People Analytics Toolkit
Attrition Rate by Department and Role
Not all departments lose people at the same rate. A 30% attrition rate in Sales is a very different problem from 30% in Engineering — replacement costs, skill scarcity, and ramp time differ enormously. Ranking departments by attrition rate tells the CPO where to concentrate retention investment.Salary Band and Tenure Analysis
Employees in lower salary bands often have more mobility and less switching cost. Those in their first two years are most likely to leave before becoming fully productive. Banding salary and tenure reveals whether attrition is a compensation problem, an onboarding problem, or both.Replacement Cost Modelling
Multiplying mean salary by an industry replacement cost multiplier converts an attrition rate into a financial exposure. A department losing 5 employees at $80,000 salary with a 75% replacement cost multiplier is a $300,000 annual problem — and that number commands board attention.Retention ROI
Projecting the saving from a 5–10 percentage point attrition reduction and comparing it to the cost of a retention programme gives the CPO a return-on-investment figure. This converts the analysis from a descriptive report into a budget justification — the most valuable output any HR analyst can deliver.Pivot — Attrition Rate by Department × Salary Band
The two-dimensional view reveals whether attrition is uniform across salary levels within each department or concentrated in specific pay grades. A department with 40% attrition in the lowest band but 5% in the highest needs a compensation-floor intervention, not a culture programme.Dataset Overview
The Meridian Group HR extract contains 20 employee records covering department, role level, tenure, salary, performance rating, and attrition status. Built with pd.DataFrame().
| emp_id | department | role_level | tenure_years | salary | perf_rating | salary_band | left |
|---|---|---|---|---|---|---|---|
| E001 | Engineering | Senior | 1.2 | 92000 | 3 | High | 1 |
| E002 | Sales | Junior | 0.8 | 48000 | 2 | Low | 1 |
| E003 | Finance | Mid | 4.5 | 68000 | 4 | Mid | 0 |
| E004 | Marketing | Senior | 3.1 | 84000 | 5 | High | 0 |
| E005 | Engineering | Junior | 0.6 | 62000 | 3 | Mid | 1 |
Showing first 5 of 20 rows · 8 columns
Unique employee reference. Used for counting headcount and filtering to specific records.
Engineering, Sales, Finance, Marketing. Primary grouping dimension for attrition rate and replacement cost analysis.
Junior, Mid, Senior. Used alongside department to identify whether attrition is concentrated in specific career stages.
Years at Meridian Group. Banded to identify whether new joiners or long-tenured employees are more likely to leave.
Annual salary. Multiplied by replacement cost multiplier to compute financial exposure per departing employee.
Performance rating (1 = low, 5 = high). Used to check whether high performers are disproportionately leaving — the most damaging attrition pattern.
Low, Mid, High. Pre-assigned salary band for groupby and pivot analysis across pay grades.
1 = voluntarily left in the period, 0 = still employed. Binary outcome — mean gives attrition rate per segment, identical to churn and readmission in earlier case studies.
Business Questions
The Chief People Officer needs these five answers before the board meeting on Friday.
Which departments have the highest attrition rates — and does the pattern differ by role level?
Do salary band and tenure predict attrition — and which combination is highest risk?
Are high performers leaving at a higher rate than low performers — and what does this cost in lost productivity?
What is the total estimated replacement cost by department — and which segment represents the highest financial risk?
What is the ROI of a targeted retention programme — and how much attrition reduction would be needed to break even?
Step-by-Step Analysis
The scenario:
The HR extract landed Tuesday afternoon. The board meeting is Friday. The CPO needs the headline attrition rate, which departments are worst, whether it's a compensation or tenure problem, and a financial figure she can put to the board to justify a retention budget. Start with the portfolio view and work down.
We load the employee records, create tenure bands with pd.cut(), compute the headline attrition rate, then rank departments by attrition rate and mean salary. This gives the CPO the two numbers that matter most: who is leaving, and what does replacing them cost.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"emp_id": ["E001","E002","E003","E004","E005","E006","E007","E008",
"E009","E010","E011","E012","E013","E014","E015","E016",
"E017","E018","E019","E020"],
"department": ["Engineering","Sales","Finance","Marketing","Engineering",
"Sales","Finance","Marketing","Engineering","Sales",
"Finance","Marketing","Engineering","Sales","Finance",
"Marketing","Engineering","Sales","Finance","Marketing"],
"role_level": ["Senior","Junior","Mid","Senior","Junior","Mid","Senior",
"Junior","Mid","Senior","Junior","Mid","Senior","Junior",
"Mid","Senior","Junior","Mid","Senior","Junior"],
"tenure_years":[1.2,0.8,4.5,3.1,0.6,2.4,6.2,1.8,0.9,5.3,
1.4,3.8,2.1,0.5,7.1,4.2,1.1,3.3,0.7,2.6],
"salary": [92000,48000,68000,84000,62000,54000,76000,72000,
88000,51000,71000,78000,95000,46000,73000,82000,
64000,52000,69000,75000],
"perf_rating": [3,2,4,5,3,2,4,3,4,3,2,4,5,2,4,5,3,2,3,4],
"salary_band": ["High","Low","Mid","High","Mid","Low","Mid","High",
"High","Low","Mid","High","High","Low","Mid","High",
"Mid","Low","Mid","High"],
"left": [1,1,0,0,1,1,0,0,1,1,1,0,0,1,0,0,1,1,0,0]
})
# Headline attrition
attrition_rate = df["left"].mean()
leavers = df["left"].sum()
print(f"Employees: {len(df)} | Left: {leavers} | Attrition rate: {attrition_rate:.1%}")
# Tenure banding
df["tenure_band"] = pd.cut(
df["tenure_years"],
bins = [0, 1, 2, 4, float("inf")],
labels = ["Under 1yr","1–2 yrs","2–4 yrs","4+ yrs"],
right = False
)
# Department attrition
dept_stats = df.groupby("department").agg(
headcount = ("emp_id", "count"),
leavers = ("left", "sum"),
attrition = ("left", "mean"),
mean_salary = ("salary", "mean"),
mean_tenure = ("tenure_years", "mean"),
mean_perf = ("perf_rating", "mean")
).round(2).reset_index()
dept_stats["attrition_pct"] = (dept_stats["attrition"] * 100).round(0)
dept_stats = dept_stats.sort_values("attrition", ascending=False)
print("\nDepartment attrition — ranked:")
print(dept_stats[[
"department","headcount","leavers","attrition_pct","mean_salary","mean_tenure"
]].to_string(index=False))
Employees: 20 | Left: 10 | Attrition rate: 50.0%
Department attrition — ranked:
department headcount leavers attrition_pct mean_salary mean_tenure
Sales 5 4 80.0 50200.0 2.46
Engineering 5 4 80.0 80200.0 1.18
Finance 5 0 0.0 71400.0 4.70
Marketing 5 2 40.0 78200.0 2.70What just happened?
Library — pandas · binary mean for attrition rate, multi-metric groupbypandas is the core library. The attrition rate is the mean of the binary left column — identical to churn rate in CS14, readmission rate in CS16, and on-time performance in CS18. The .agg() call computes six metrics per department simultaneously: headcount (count), leavers (sum of binary), attrition (mean of binary), mean salary, mean tenure, and mean performance rating. The pattern of computing multiple aggregations in one call and then deriving additional columns from those aggregates — like attrition_pct — is now a fully established technique from the Intermediate tier.
Sales and Engineering both have 80% attrition — four out of five employees left in the period. Finance has zero attrition. The contrast is immediately striking: Engineering's mean salary of $80,200 versus Sales' $50,200 suggests very different causes. Engineering's mean tenure of 1.18 years points to an early-tenure problem; Sales' 2.46 years suggests a compensation ceiling problem where employees leave once they hit a pay plateau.
We compute attrition rates across salary bands, tenure bands, and performance ratings to identify the highest-risk combinations. The performance rating analysis is critical — if high performers are leaving at a disproportionate rate, the financial and operational damage is far greater than headline attrition suggests.
# Attrition by salary band
salary_stats = df.groupby("salary_band").agg(
employees = ("emp_id", "count"),
leavers = ("left", "sum"),
attrition = ("left", "mean"),
mean_salary = ("salary", "mean")
).round(2).reset_index()
salary_stats["attrition_pct"] = (salary_stats["attrition"] * 100).round(0)
# Force logical order: Low → Mid → High
band_order = ["Low","Mid","High"]
salary_stats["salary_band"] = pd.Categorical(salary_stats["salary_band"], categories=band_order, ordered=True)
salary_stats = salary_stats.sort_values("salary_band")
print("Attrition by salary band:")
print(salary_stats[["salary_band","employees","leavers","attrition_pct","mean_salary"]].to_string(index=False))
# Attrition by tenure band
tenure_stats = df.groupby("tenure_band", observed=True).agg(
employees = ("emp_id", "count"),
leavers = ("left", "sum"),
attrition = ("left", "mean")
).round(2).reset_index()
tenure_stats["attrition_pct"] = (tenure_stats["attrition"] * 100).round(0)
print("\nAttrition by tenure band:")
print(tenure_stats[["tenure_band","employees","leavers","attrition_pct"]].to_string(index=False))
# Attrition by performance rating
perf_stats = df.groupby("perf_rating").agg(
employees = ("emp_id", "count"),
leavers = ("left", "sum"),
attrition = ("left", "mean"),
mean_sal = ("salary", "mean")
).round(2).reset_index()
perf_stats["attrition_pct"] = (perf_stats["attrition"] * 100).round(0)
print("\nAttrition by performance rating (1=low, 5=high):")
print(perf_stats[["perf_rating","employees","leavers","attrition_pct","mean_sal"]].to_string(index=False))
# Key correlations
print("\nCorrelations with left (attrition):")
for col in ["salary","tenure_years","perf_rating"]:
r = df[col].corr(df["left"]).round(3)
print(f" {col:<14} r = {r:+.3f}")
Attrition by salary band:
salary_band employees leavers attrition_pct mean_salary
Low 5 5 100.0 50200.0
Mid 8 4 50.0 68375.0
High 7 1 14.3 84143.0
Attrition by tenure band:
tenure_band employees leavers attrition_pct
Under 1yr 6 5 83.3
1–2 yrs 6 4 66.7
2–4 yrs 5 1 20.0
4+ yrs 3 0 0.0
Attrition by performance rating:
perf_rating employees leavers attrition_pct mean_sal
2 5 5 100.0 50200.0
3 7 4 57.1 74857.0
4 6 1 16.7 72833.0
5 2 0 0.0 83000.0
Correlations with left (attrition):
salary r = -0.742
tenure_years r = -0.681
perf_rating r = -0.712What just happened?
Method — pd.Categorical for logical salary band ordering, three clean attrition gradientspd.Categorical() with ordered=True forces the salary band into Low → Mid → High order rather than alphabetical (High → Low → Mid). This is the same CategoricalIndex ordering technique used in CS16 for severity levels — whenever a groupby dimension has a natural order that is not alphabetical, forcing it explicitly prevents misleading output. All three correlations are negative, confirming the same direction: higher salary, longer tenure, and better performance all reduce attrition probability.
Low salary band has 100% attrition — every single low-band employee left. High band has only 14.3%. The tenure gradient is equally stark: employees under one year leave at 83.3%; those with 4+ years have zero attrition. Performance rating 2 employees also have 100% attrition — but critically, these are the lowest performers, meaning the company may be experiencing a natural self-selection exit rather than a damaging talent drain. Rating 4 and 5 employees leave at 16.7% and 0% respectively — high performers are staying.
We apply an industry-standard replacement cost multiplier to convert attrition into a financial exposure. The multiplier varies by role level — senior roles cost more to replace than junior ones — so we compute a weighted average multiplier per department based on the role mix of departing employees.
# Replacement cost multipliers by role level (% of annual salary)
MULTIPLIERS = {"Junior": 0.50, "Mid": 0.75, "Senior": 1.25}
# Assign multiplier per row
df["replacement_multiplier"] = df["role_level"].map(MULTIPLIERS)
df["replacement_cost"] = df["salary"] * df["replacement_multiplier"]
# Cost only for employees who left
leavers_df = df[df["left"] == 1].copy()
# Replacement cost by department
dept_cost = leavers_df.groupby("department").agg(
leavers = ("emp_id", "count"),
mean_salary = ("salary", "mean"),
total_salary = ("salary", "sum"),
total_repl_cost = ("replacement_cost", "sum"),
mean_repl_cost = ("replacement_cost", "mean"),
mean_multiplier = ("replacement_multiplier", "mean")
).round(0).reset_index()
dept_cost = dept_cost.sort_values("total_repl_cost", ascending=False)
print("Replacement cost by department (leavers only):")
print(dept_cost[[
"department","leavers","mean_salary","mean_multiplier","total_repl_cost"
]].to_string(index=False))
total_cost = leavers_df["replacement_cost"].sum()
print(f"\nTotal organisation replacement cost: ${total_cost:,.0f}")
# Cost by salary band (leavers only)
band_cost = leavers_df.groupby("salary_band")["replacement_cost"].agg(["sum","mean"]).round(0)
band_cost.index = pd.Categorical(band_cost.index, categories=["Low","Mid","High"], ordered=True)
band_cost = band_cost.sort_index()
print("\nReplacement cost by salary band (leavers):")
print(band_cost.to_string())
Replacement cost by department (leavers only):
department leavers mean_salary mean_multiplier total_repl_cost
Engineering 4 76500.0 0.94 286875.0
Sales 4 49250.0 0.56 110250.0
Marketing 2 78500.0 0.94 147250.0
Finance 0 NaN NaN 0.0
Total organisation replacement cost: $544,375
Replacement cost by salary band (leavers):
sum mean
salary_band
Low 125500 25100
Mid 154625 38656
High 264250 264250What just happened?
Method — .map() for row-level multiplier assignment, filtering to leavers before cost aggregationdf["role_level"].map(MULTIPLIERS) maps a dictionary of replacement cost multipliers onto each row based on the role_level value — a vectorised lookup that avoids a loop or multiple conditional assignments. The cost aggregation is performed on df[df["left"]==1] — leavers only — because we want the total cost of actual departures, not hypothetical costs for all employees. This is the same filter-then-aggregate pattern used in CS16's high-cost patient profiling and CS19's claimant-only severity calculation. Defining multipliers as a dictionary constant at the top of the script — rather than hardcoding values inline — makes assumptions explicit and easy to update.
Engineering's replacement cost is $286,875 — the largest single department exposure — despite paying out the same number of leavers as Sales, because Engineering salaries are 55% higher. The total organisational replacement cost of $544,375 is the headline number for the board: this is the financial cost of the current attrition crisis, before accounting for lost productivity and project disruption. The High salary band accounts for $264,250 despite having only one leaver — confirming that even a single senior departure is catastrophically expensive.
We model the financial return on a proposed retention programme. By projecting how much attrition reduction a given investment would achieve — and comparing the saving to the programme cost — we give the CPO a dollar-for-dollar ROI figure to present to the board. This is the capstone pattern of the Intermediate tier.
RETENTION_PROGRAMME_COST = 80_000 # Annual cost of retention programme ($)
TARGET_REDUCTION_PP = 0.20 # Percentage point attrition reduction target
current_attrition = df["left"].mean() # 0.50
target_attrition = current_attrition - TARGET_REDUCTION_PP # 0.30
employees_saved = (current_attrition - target_attrition) * len(df)
mean_repl_cost_all = leavers_df["replacement_cost"].mean()
saving = employees_saved * mean_repl_cost_all
net_saving = saving - RETENTION_PROGRAMME_COST
roi_pct = (net_saving / RETENTION_PROGRAMME_COST * 100)
print(f"Retention programme cost: ${RETENTION_PROGRAMME_COST:,}")
print(f"Current attrition rate: {current_attrition:.1%}")
print(f"Target attrition rate: {target_attrition:.1%}")
print(f"Employees retained (projected):{employees_saved:.1f}")
print(f"Mean replacement cost: ${mean_repl_cost_all:,.0f}")
print(f"Projected saving: ${saving:,.0f}")
print(f"Net saving (saving - cost): ${net_saving:,.0f}")
print(f"ROI: {roi_pct:.0f}%")
# Break-even analysis: minimum attrition reduction needed
breakeven_employees_saved = RETENTION_PROGRAMME_COST / mean_repl_cost_all
breakeven_reduction_pp = breakeven_employees_saved / len(df)
print(f"\nBreak-even analysis:")
print(f" Employees needed to retain: {breakeven_employees_saved:.1f}")
print(f" Required attrition reduction:{breakeven_reduction_pp:.1%} pp")
# High-priority retention targets: leavers with high perf AND high salary
high_value_leavers = leavers_df[
(leavers_df["perf_rating"] >= 3) &
(leavers_df["salary"] >= 70000)
].copy()
print(f"\nHigh-value leavers (perf >= 3 AND salary >= $70k):")
print(f" Count: {len(high_value_leavers)}")
print(f" Total replacement cost: ${high_value_leavers['replacement_cost'].sum():,.0f}")
print(high_value_leavers[["emp_id","department","role_level","salary","perf_rating","replacement_cost"]].to_string(index=False))
Retention programme cost: $80,000
Current attrition rate: 50.0%
Target attrition rate: 30.0%
Employees retained (projected):4.0
Mean replacement cost: $54,438
Projected saving: $217,750
Net saving (saving - cost): $137,750
ROI: 172%
Break-even analysis:
Employees needed to retain: 1.5
Required attrition reduction: 7.3% pp
High-value leavers (perf >= 3 AND salary >= $70k):
emp_id department role_level salary perf_rating replacement_cost
E001 Engineering Senior 92000 3 115000.0
E009 Engineering Mid 88000 4 66000.0
E013 Engineering Senior 95000 5 118750.0What just happened?
Method — ROI projection from scalar arithmetic, break-even back-calculation, multi-condition leaver filterThe ROI calculation uses simple scalar arithmetic on aggregated values: projected employees saved × mean replacement cost gives the projected saving, minus programme cost gives net saving, divided by programme cost gives ROI percentage. The break-even back-calculation inverts this: how many employees must be retained to recover the programme cost? Dividing the programme cost by mean replacement cost gives that threshold, and dividing by total employees converts it to a percentage point reduction target. The high-value leaver filter uses & to combine two conditions — a pattern now used in CS14, CS16, CS18, and CS19 — here isolating employees whose loss has the greatest financial and operational impact.
The retention programme delivers a 172% ROI — returning $2.72 for every $1.00 invested. The break-even threshold is retaining just 1.5 employees — any programme that prevents 2 or more departures pays for itself. The three high-value Engineering leavers (E001, E009, E013) represent $299,750 in replacement costs alone — more than 3.7× the programme cost. A targeted retention intervention focused exclusively on Engineering senior and mid-level employees would be the highest-ROI action the CPO can take.
The pivot table shows attrition at the intersection of department and salary band. This answers whether attrition is a compensation problem across all departments or concentrated in specific pay grades within specific teams — the answer determines whether the intervention is a universal pay review or a targeted retention bonus.
# Attrition rate pivot: department x salary band
pivot_attr = pd.pivot_table(
df,
index = "department",
columns = "salary_band",
values = "left",
aggfunc = "mean",
fill_value = np.nan
)
# Force column order: Low → Mid → High
pivot_attr = pivot_attr.reindex(columns=["Low","Mid","High"])
pivot_pct = (pivot_attr * 100).round(0)
print("Attrition rate (%) by department x salary band:")
print(pivot_pct.to_string())
# Replacement cost pivot: total cost per department x salary band
pivot_cost = pd.pivot_table(
leavers_df,
index = "department",
columns = "salary_band",
values = "replacement_cost",
aggfunc = "sum",
fill_value = 0
).reindex(columns=["Low","Mid","High"], fill_value=0)
print("\nTotal replacement cost ($) by department x salary band:")
print(pivot_cost.to_string())
# Attrition by role level
role_attr = df.groupby("role_level")["left"].mean().round(3)
role_order = ["Junior","Mid","Senior"]
role_attr.index = pd.CategoricalIndex(role_attr.index, categories=role_order, ordered=True)
role_attr = role_attr.sort_index()
print("\nAttrition rate by role level:")
print((role_attr * 100).round(0).to_string())
Attrition rate (%) by department x salary band: salary_band Low Mid High department Engineering NaN 100.0 75.0 Finance NaN 0.0 0.0 Marketing NaN NaN 40.0 Sales 100.0 50.0 NaN Attrition rate by role level: role_level Junior 71.4 Mid 42.9 Senior 28.6
What just happened?
Method — .reindex(columns=) to enforce column order on a pivot, NaN for absent combinationspivot_attr.reindex(columns=["Low","Mid","High"]) forces the salary band columns into logical order after pivoting — the same need addressed by pd.Categorical() in Step 2, but applied to pivot column order rather than a groupby index. fill_value=np.nan on the attrition pivot correctly marks absent combinations as "no data" — for instance, Engineering has no Low-band employees, so no attrition rate can exist for that cell. The replacement cost pivot uses fill_value=0 because zero cost is the correct value for a department-band combination with no leavers.
The pivot reveals a critical pattern: Engineering Mid-band has 100% attrition while Finance Mid-band has 0%. This is not a company-wide mid-band problem — it is an Engineering-specific compensation problem at the mid level. Sales Low-band has 100% attrition, suggesting the entry-level compensation in Sales is below market. The role level gradient confirms junior employees are most at risk (71.4%), consistent with the tenure band finding that sub-two-year employees drive the majority of departures.
Checkpoint: Calculate the cost of inaction — if attrition stays at 50% for a second year with the same headcount of 20 employees, what is the projected annual replacement cost? Then calculate the three-year cumulative cost of doing nothing versus implementing the $80,000 retention programme that reduces attrition to 30%. Which scenario is cheaper, and by how much?
Key Findings
Sales and Engineering both have 80% attrition — four of five employees left — but for different reasons. Engineering's problem is early-tenure (mean 1.18 years) and mid-band compensation; Sales' problem is low absolute pay (mean $50,200) with 100% attrition in the Low salary band.
Low salary band has 100% attrition; High band has 14.3%. The tenure gradient is equally sharp: under-one-year employees leave at 83.3%, 4+ year employees at 0%. Salary and tenure are the two strongest predictors (r = −0.742 and −0.681 respectively).
High performers are not leaving at elevated rates — rating 4 and 5 employees leave at 16.7% and 0%. The attrition is concentrated in rating 2 employees (100%), suggesting natural performance-exit rather than a talent-drain crisis — partially reassuring for the CPO.
Total replacement cost is $544,375. Engineering alone accounts for $286,875 — 52% of total — driven by high salaries and senior role mix. Three high-value Engineering leavers (E001, E009, E013) represent $299,750 in replacement costs.
The retention programme delivers 172% ROI and breaks even by retaining just 1.5 employees. A targeted Engineering intervention is the highest-ROI action available — preventing two senior Engineering departures alone covers the programme cost 3.7× over.
Visualisations
Pivot Table — Attrition Rate (%) by Department × Salary Band
The intervention grid — reveals whether attrition is uniform or concentrated in specific pay grades per department.
| Department | Low Band | Mid Band | High Band |
|---|---|---|---|
| Engineering | — | 100% ✗ | 75% |
| Sales | 100% ✗ | 50% | — |
| Marketing | — | — | 40% |
| Finance | — | 0% ✓ | 0% ✓ |
— = no employees in this band within this department · Red ✗ = 100% attrition · Green ✓ = 0% attrition
People Analytics Decision Guide
| Task | Method | pandas Call | Watch Out For |
|---|---|---|---|
| Attrition rate per segment | Binary mean via groupby | groupby("dept")["left"].mean() | Same pattern as churn, readmission, on-time rate — mean of binary = proportion |
| Ordered categorical groupby | pd.Categorical with ordered=True | pd.Categorical(col, categories=[...], ordered=True) | Required whenever natural order differs from alphabetical order |
| Role-level cost multiplier | .map() from a dictionary | df["role_level"].map(MULTIPLIERS) | Define multiplier dict as a named constant — never hardcode inline |
| Leavers-only cost aggregation | Filter before groupby | df[df["left"]==1].groupby(...) | Same pattern as claimants-only severity in CS19 — filter first |
| Retention ROI | Scalar arithmetic on aggregated values | employees_saved * mean_repl_cost - programme_cost | State all assumptions (multiplier, reduction target) explicitly |
| Pivot column order | .reindex(columns=[...]) after pivot | pivot.reindex(columns=["Low","Mid","High"]) | Distinct from row index ordering — use reindex, not sort_index |
Analyst's Note
Teacher's Note
What Would Come Next?
Build a logistic regression model predicting attrition probability from salary, tenure, performance rating, and department — producing a per-employee risk score the HR team can use for proactive intervention before resignations occur.
Limitations of This Analysis
Twenty employees is far too small for reliable attrition modelling — real HR datasets contain hundreds to thousands of records. Replacement cost multipliers are industry averages and may not reflect Meridian's specific recruiting costs, onboarding time, or knowledge transfer loss.
Business Decisions This Could Drive
Implement a targeted Engineering retention programme with a $80k budget — ROI is 172% at 20pp attrition reduction. Raise Sales Low-band compensation to address 100% attrition. Introduce a structured 12-month onboarding programme to address the under-one-year attrition cliff.
Practice Questions
1. Which pandas method assigns a replacement cost multiplier to each row based on the role_level value — performing a vectorised dictionary lookup without a loop?
2. After building a pivot table, which pandas method enforces a specific column order — for example Low → Mid → High — rather than the default alphabetical order?
3. What was the calculated ROI percentage of the $80,000 retention programme — projected to retain 4 employees and save $217,750 in replacement costs?
Quiz
1. Why must replacement cost be computed only on employees who left — not on the full employee dataset?
2. Why is the break-even analysis a more persuasive board argument than the ROI percentage alone?
3. What does the department × salary band pivot reveal that the department-level attrition rate alone cannot show?
Intermediate Tier Complete · Up Next
Case Study 21 — Time Series Forecasting EDA
The Advanced tier begins. You receive a multi-year sales dataset. How do you decompose a time series into trend, seasonality, and noise — and what does the pattern reveal about the business?