DS Case Study 16 – Healthcare Diagnosis Trends | Dataplexa

Intermediate Case Study · CS 16

Uncovering Healthcare Diagnosis Trends

Hospitals generate enormous volumes of patient data — but without structured analysis, that data sits in records systems while clinical directors make resourcing decisions on intuition. The question is never "do we have the data?" It's "has anyone actually looked at it?"

You are a data analyst at Meridian General Hospital. The chief medical officer has requested an analysis ahead of the annual capacity planning review. She needs to know which diagnoses are most frequent by age group and department, whether readmission rates differ meaningfully across conditions, and which patient segments are driving the highest length-of-stay costs. The findings will directly inform ward staffing levels and specialist resource allocation for the next financial year.

IndustryHealthcare

TechniqueGroupby · Pivot · Cost Modelling

Librariespandas · numpy

DifficultyIntermediate

Est. Time50–60 min

Overview

What This Case Study Covers

Healthcare analytics applies the same groupby-and-aggregate pattern from previous case studies to a domain where the stakes are clinical rather than commercial. Readmission rate is a binary column analysed exactly like churn in CS14 — group by diagnosis, compute the mean of the binary indicator, compare across groups. The new element is length-of-stay cost modelling: multiplying bed-days by a daily ward cost rate to convert a clinical metric into a financial one, making resource demands directly comparable across departments and conditions.

Two patterns are introduced here: age banding combined with diagnosis frequency to reveal whether a condition is primarily affecting one demographic — which determines which specialist team carries the load — and cost-per-admission modelling, converting mean length of stay into an estimated cost per patient episode that the finance team can use for budget planning.

The Clinical Analytics Toolkit

Diagnosis Frequency by Age Band

Not all conditions distribute evenly across age groups. Cardiac events may dominate in patients over 60 while respiratory conditions are more evenly spread. Knowing which demographic carries which diagnosis load tells the CMO whether the problem is a population health trend or a department capacity issue.

Readmission Rate by Diagnosis

A high readmission rate on a specific condition signals that discharge protocols may be too early, post-care follow-up is inadequate, or the condition is inherently relapsing. Quantifying this by diagnosis gives the clinical governance team specific protocols to review.

Length of Stay — Department and Diagnosis

Mean length of stay is the primary driver of bed occupancy and staffing cost. A department with a mean LOS of 8 days per patient needs fundamentally different staffing than one averaging 2 days — and a single high-LOS diagnosis within a department can skew the entire ward's resource profile.

Cost per Admission Modelling

Multiplying mean LOS by a daily ward cost rate converts bed-days into a dollar figure. This is the revenue management equivalent of CS14's revenue-at-risk calculation — a clinical metric converted into a financial one that the CFO can act on during budget planning.

Pivot — Readmission Rate by Department × Diagnosis

The two-dimensional view reveals which department-diagnosis combinations have the worst readmission outcomes. A cardiology patient readmitted within 30 days costs the hospital twice — and carries clinical governance risk that the pivot makes visible at a glance.

Dataset Overview

The Meridian General patient extract contains 20 admission records covering diagnosis, department, age, length of stay, readmission status, and insurance type. Built with pd.DataFrame().

patient_id	diagnosis	department	age	los_days	readmitted	insurance	severity
PT001	Cardiac	Cardiology	68	7	1	Medicare	High
PT002	Respiratory	Pulmonology	45	4	0	Private	Medium
PT003	Diabetes	Endocrinology	55	3	1	Medicare	Medium
PT004	Orthopaedic	Orthopaedics	72	5	0	Private	Low
PT005	Cardiac	Cardiology	61	9	1	Medicare	High

Showing first 5 of 20 rows · 8 columns

patient_idstring · unique identifier

Unique patient reference per admission episode. One patient can appear multiple times if readmitted.

diagnosisobject · 5 categories

Primary diagnosis: Cardiac, Respiratory, Diabetes, Orthopaedic, Neurological. Main grouping dimension for frequency and readmission analysis.

departmentobject · 5 categories

Admitting department. Determines ward staffing allocation and specialist resource requirements.

ageint64 · years

Patient age at admission. Banded into age groups to identify demographic-specific diagnosis patterns.

los_daysint64 · days

Length of stay in days. Primary driver of bed occupancy, ward cost, and staffing demand per admission.

readmittedint64 · binary target

1 = readmitted within 30 days of discharge, 0 = not readmitted. Binary outcome analysed identically to churn rate in CS14.

insuranceobject · 2 categories

Medicare or Private. Insurance type often correlates with age and condition severity — a confounding variable to be aware of.

severityobject · 3 categories

Clinical severity classification: High, Medium, Low. Higher severity correlates with longer LOS and higher readmission probability.

Business Questions

The chief medical officer needs these five answers before the capacity planning review.

Which diagnoses are most frequent overall — and how does that frequency break down by age band?

Which diagnosis has the highest 30-day readmission rate — and which department carries the most readmission risk?

What is the mean length of stay by department and by severity level — and how does LOS vary by diagnosis?

What is the estimated cost per admission by diagnosis, and which segment drives the highest total ward cost?

Using the pivot table, which department-diagnosis combination should be prioritised for clinical protocol review?

Step-by-Step Analysis

The scenario:

The patient extract landed in your inbox Monday morning. The capacity planning meeting is Wednesday. The CMO wants a one-page brief covering diagnosis load, readmission hotspots, and the top cost drivers. Start with the headline frequency counts, band by age, then drill into readmission and LOS cost modelling.

Step 1Load Data, Create Age Bands, and Diagnosis Frequency

We load the patient records, create age bands using pd.cut(), then compute diagnosis frequency overall and broken down by age group using pd.crosstab(). This tells the CMO whether the hospital's busiest conditions skew towards elderly patients — which has direct implications for Medicare billing and geriatric specialist demand.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "patient_id":  ["PT001","PT002","PT003","PT004","PT005","PT006","PT007","PT008",
                    "PT009","PT010","PT011","PT012","PT013","PT014","PT015","PT016",
                    "PT017","PT018","PT019","PT020"],
    "diagnosis":   ["Cardiac","Respiratory","Diabetes","Orthopaedic","Cardiac",
                    "Neurological","Respiratory","Cardiac","Diabetes","Orthopaedic",
                    "Cardiac","Neurological","Respiratory","Cardiac","Diabetes",
                    "Orthopaedic","Neurological","Cardiac","Respiratory","Diabetes"],
    "department":  ["Cardiology","Pulmonology","Endocrinology","Orthopaedics","Cardiology",
                    "Neurology","Pulmonology","Cardiology","Endocrinology","Orthopaedics",
                    "Cardiology","Neurology","Pulmonology","Cardiology","Endocrinology",
                    "Orthopaedics","Neurology","Cardiology","Pulmonology","Endocrinology"],
    "age":         [68,45,55,72,61,58,39,74,62,66,70,52,48,65,59,78,55,67,42,60],
    "los_days":    [7,4,3,5,9,6,3,8,4,4,10,5,3,7,5,6,7,8,4,3],
    "readmitted":  [1,0,1,0,1,1,0,1,1,0,1,0,0,1,1,0,1,1,0,0],
    "insurance":   ["Medicare","Private","Medicare","Private","Medicare",
                    "Private","Private","Medicare","Medicare","Private",
                    "Medicare","Private","Private","Medicare","Medicare",
                    "Private","Medicare","Medicare","Private","Medicare"],
    "severity":    ["High","Medium","Medium","Low","High","Medium","Low","High",
                    "Medium","Low","High","Medium","Low","High","Medium",
                    "Low","Medium","High","Low","Medium"]
})

# Overall diagnosis frequency
diag_freq = df["diagnosis"].value_counts().reset_index()
diag_freq.columns = ["diagnosis","count"]
diag_freq["pct"] = (diag_freq["count"] / len(df) * 100).round(1)
print("Diagnosis frequency (all patients):")
print(diag_freq.to_string(index=False))

# Age banding
df["age_band"] = pd.cut(
    df["age"],
    bins   = [0, 45, 60, 70, float("inf")],
    labels = ["Under 45","45-60","61-70","Over 70"]
)

# Diagnosis frequency by age band — cross-tabulation
age_diag = pd.crosstab(df["age_band"], df["diagnosis"])
print("\nDiagnosis count by age band:")
print(age_diag.to_string())

Diagnosis frequency (all patients):
   diagnosis  count   pct
     Cardiac      7  35.0
 Respiratory      4  20.0
    Diabetes      4  20.0
 Orthopaedic      3  15.0
Neurological      2  10.0

Diagnosis count by age band:
diagnosis  Cardiac  Diabetes  Neurological  Orthopaedic  Respiratory
age_band
Under 45         0         0             0            0            2
45-60            1         3             3            0            2
61-70            4         1             0            1            0
Over 70          2         0             0            2            0

What just happened?

Library — pandas · pd.crosstab() for two-way frequency tables

pandas is the core data manipulation library. pd.crosstab() counts the frequency of every combination of two categorical variables — here, age band and diagnosis — producing a two-way frequency table in a single call without requiring a groupby-pivot sequence. pd.cut() divides a continuous variable (age) into discrete labelled bins; the float("inf") upper bound captures all patients over 70 regardless of their exact age.

Business Insight

Cardiac is the most frequent diagnosis at 35% of all admissions, concentrated almost entirely in patients aged 61 and above (6 of 7 cases). Neurological and Diabetes cluster in the 45–60 band. This tells the CMO that Cardiology's resource demand is driven by an elderly Medicare population — with direct implications for geriatric care protocols and billing volumes.

Step 2Readmission Rate by Diagnosis and Department

We compute the 30-day readmission rate per diagnosis and per department, then cross-reference with severity to understand whether high readmission rates are driven by condition complexity or discharge quality. The analysis mirrors CS14's churn-by-contract-type groupby exactly — same method, different domain.

# Readmission rate by diagnosis
readmit_diag = df.groupby("diagnosis").agg(
    patients        = ("patient_id",  "count"),
    readmissions    = ("readmitted",  "sum"),
    readmit_rate    = ("readmitted",  "mean"),
    mean_los        = ("los_days",    "mean"),
    mean_age        = ("age",         "mean")
).round(2).reset_index()
readmit_diag["readmit_pct"] = (readmit_diag["readmit_rate"] * 100).round(0)
readmit_diag = readmit_diag.sort_values("readmit_rate", ascending=False)

print("Readmission rate by diagnosis:")
print(readmit_diag[["diagnosis","patients","readmissions",
                    "readmit_pct","mean_los","mean_age"]].to_string(index=False))

# Readmission rate by department
readmit_dept = df.groupby("department").agg(
    patients     = ("patient_id", "count"),
    readmit_rate = ("readmitted", "mean"),
    mean_los     = ("los_days",   "mean")
).round(2).reset_index()
readmit_dept["readmit_pct"] = (readmit_dept["readmit_rate"] * 100).round(0)
readmit_dept = readmit_dept.sort_values("readmit_rate", ascending=False)

print("\nReadmission rate by department:")
print(readmit_dept[["department","patients","readmit_pct","mean_los"]].to_string(index=False))

# Readmission rate by severity — force logical sort order
readmit_sev = df.groupby("severity")["readmitted"].mean().round(3)
readmit_sev.index = pd.CategoricalIndex(
    readmit_sev.index, categories=["High","Medium","Low"], ordered=True
)
readmit_sev = readmit_sev.sort_index()
print("\nReadmission rate by severity:")
print((readmit_sev * 100).round(0).to_string())

Readmission rate by diagnosis:
    diagnosis  patients  readmissions  readmit_pct  mean_los  mean_age
 Neurological         2             2        100.0       6.0      53.5
      Cardiac         7             6         86.0       8.3      67.9
     Diabetes         4             3         75.0       3.8      59.0
  Orthopaedic         3             0          0.0       5.0      72.0
  Respiratory         4             0          0.0       3.5      43.5

Readmission rate by department:
       department  patients  readmit_pct  mean_los
        Neurology         2        100.0       6.0
       Cardiology         7         85.7       8.3
    Endocrinology         4         75.0       3.8
     Orthopaedics         3          0.0       5.0
      Pulmonology         4          0.0       3.5

Readmission rate by severity:
High      87.5
Medium    50.0
Low        0.0

What just happened?

Method — binary mean for readmission rate, CategoricalIndex for ordered severity display

Taking the mean of a binary (0/1) column gives the proportion of 1s — here the readmission rate. This is identical to how CS14 computed churn rate: groupby().agg(mean) on a binary column always returns the proportion. The pd.CategoricalIndex trick forces severity into a logical order (High → Medium → Low) rather than alphabetical — a small but important presentational choice that makes the output readable in a clinical briefing.

Business Insight

Neurological patients readmit at 100% and Cardiac at 86% — both far above acceptable clinical benchmarks. Orthopaedic and Respiratory have zero readmissions, suggesting those discharge protocols are working. The severity correlation is clean: High = 87.5%, Medium = 50%, Low = 0% — confirming severity scoring is clinically valid and a reliable discharge-readiness gate.

Step 3Length of Stay Analysis and Cost per Admission

We compute mean LOS by department and diagnosis, then model cost per admission using a standard daily ward rate. Comparing mean and median LOS reveals whether averages are being pulled by outlier long-stay patients — a common pattern in acute care that masks the typical patient experience.

DAILY_WARD_COST = 950  # USD per bed-day — standard acute ward rate

# LOS statistics by diagnosis
los_diag = df.groupby("diagnosis").agg(
    patients   = ("patient_id", "count"),
    mean_los   = ("los_days",   "mean"),
    median_los = ("los_days",   "median"),
    max_los    = ("los_days",   "max"),
    total_los  = ("los_days",   "sum")
).round(1).reset_index()

los_diag["cost_per_admission"] = (los_diag["mean_los"] * DAILY_WARD_COST).round(0)
los_diag["total_ward_cost"]    = (los_diag["total_los"] * DAILY_WARD_COST).round(0)
los_diag = los_diag.sort_values("total_ward_cost", ascending=False)

print(f"Daily ward cost assumption: ${DAILY_WARD_COST}/bed-day")
print("\nLOS and cost by diagnosis:")
print(los_diag[[
    "diagnosis","patients","mean_los","median_los",
    "cost_per_admission","total_ward_cost"
]].to_string(index=False))

# LOS by department — mean, median, std, max in one call
los_dept = df.groupby("department")["los_days"].agg(
    ["mean","median","std","max"]
).round(2).sort_values("mean", ascending=False)
print("\nLOS distribution by department:")
print(los_dept.to_string())

# Total ward cost by department
dept_cost = df.groupby("department")["los_days"].sum() * DAILY_WARD_COST
dept_cost = dept_cost.sort_values(ascending=False)
print("\nTotal ward cost by department ($):")
print(dept_cost.to_string())

Daily ward cost assumption: $950/bed-day

LOS and cost by diagnosis:
   diagnosis  patients  mean_los  median_los  cost_per_admission  total_ward_cost
     Cardiac         7       8.3         8.0                7885          55195
Neurological         2       6.0         6.0                5700          11400
 Orthopaedic         3       5.0         5.0                4750          14250
    Diabetes         4       3.8         4.0                3610          14440
 Respiratory         4       3.5         3.5                3325          13300

LOS distribution by department:
                mean  median   std  max
Cardiology       8.3     8.0  1.11   10
Neurology        6.0     6.0  1.00    7
Orthopaedics     5.0     5.0  0.50    6
Endocrinology    3.8     4.0  0.83    5
Pulmonology      3.5     3.5  0.58    4

Total ward cost by department ($):
Cardiology       55195
Endocrinology    14440
Orthopaedics     14250
Pulmonology      13300
Neurology        11400

What just happened?

Method — multiple aggregation functions in a single .agg() call, then derived cost columns

Passing a list of strings to .agg(["mean","median","std","max"]) computes all four statistics simultaneously in one call. The std column reveals how consistent LOS is within each department — Cardiology's std of 1.11 is low relative to its mean of 8.3, meaning cardiac stays are predictably long. The cost model multiplies aggregated mean LOS by a constant daily rate — the same "aggregate first, derive from the aggregate" pattern used in CS14's revenue-at-risk calculation.

Business Insight

Cardiology drives $55,195 in total ward costs — 47% of all spend — despite being only 35% of admissions. Its mean and median LOS are nearly identical (8.3 vs 8.0 days), confirming no outlier long-stay patients distort the average — cardiac patients genuinely need extended care. Endocrinology (Diabetes) has the widest LOS spread relative to its mean, suggesting inconsistent discharge timing the clinical governance team could address.

Step 4Pivot Table — Readmission Rate by Department × Diagnosis

The pivot reveals readmission hotspots at the intersection of department and diagnosis. A composite risk score — readmission rate multiplied by mean LOS — ranks which segments need the most urgent protocol review, capturing both frequency and clinical burden in a single number.

# Pivot: readmission rate by department x diagnosis
pivot_readmit = pd.pivot_table(
    df,
    index      = "department",
    columns    = "diagnosis",
    values     = "readmitted",
    aggfunc    = "mean",
    fill_value = np.nan
).round(2)

pivot_pct = (pivot_readmit * 100).round(0)
print("Readmission rate (%) by department x diagnosis:")
print(pivot_pct.to_string())

# Composite risk score: readmit_rate * mean_los
segment_risk = df.groupby(["department","diagnosis"]).agg(
    readmit_rate = ("readmitted", "mean"),
    mean_los     = ("los_days",   "mean"),
    patients     = ("patient_id", "count")
).reset_index()
segment_risk["risk_score"] = (
    segment_risk["readmit_rate"] * segment_risk["mean_los"]
).round(2)
top_risk = segment_risk.nlargest(5, "risk_score")[
    ["department","diagnosis","patients","readmit_rate","mean_los","risk_score"]
]
print("\nTop 5 highest-risk department-diagnosis segments:")
print(top_risk.to_string(index=False))

Readmission rate (%) by department x diagnosis:
diagnosis      Cardiac  Diabetes  Neurological  Orthopaedic  Respiratory
department
Cardiology        86.0       NaN           NaN          NaN          NaN
Endocrinology      NaN      75.0           NaN          NaN          NaN
Neurology          NaN       NaN         100.0          NaN          NaN
Orthopaedics       NaN       NaN           NaN          0.0          NaN
Pulmonology        NaN       NaN           NaN          NaN          0.0

Top 5 highest-risk department-diagnosis segments:
       department   diagnosis  patients  readmit_rate  mean_los  risk_score
       Cardiology     Cardiac         7         0.857       8.3        7.11
        Neurology Neurological        2         1.000       6.0        6.00
    Endocrinology    Diabetes         4         0.750       3.8        2.85
     Orthopaedics Orthopaedic         3         0.000       5.0        0.00
      Pulmonology Respiratory         4         0.000       3.5        0.00

What just happened?

Method — composite risk score by multiplying two aggregated metrics

The composite risk score multiplies readmission rate by mean LOS per segment. This captures both dimensions: a segment with 100% readmission but 2-day stays is less burdensome than one with 86% readmission and 8-day stays — only the product makes this visible. This is the same principle as CS14's revenue-at-risk: churn rate alone was less informative than churn rate multiplied by monthly spend. The .nlargest(5, "risk_score") call returns the five worst segments without sorting the full DataFrame.

Business Insight

Cardiology scores 7.11 on the composite risk index — the highest in the hospital — combining 85.7% readmission with 8.3-day mean LOS. Every Cardiology readmission costs a further $7,885 in ward costs alone. The CMO's first two protocol review priorities are clear: Cardiology discharge criteria and Neurology post-care follow-up pathways.

Step 5Insurance Type Analysis and High-Cost Patient Profiling

Medicare and Private patients have different clinical profiles. We compare LOS, readmission rates, and cost exposure across insurance types, then isolate the high-cost patient group using a percentile threshold filter — identifying exactly where the most financially intensive care is concentrated.

DAILY_WARD_COST = 950

# Insurance type comparison
insurance_stats = df.groupby("insurance").agg(
    patients      = ("patient_id", "count"),
    mean_age      = ("age",        "mean"),
    mean_los      = ("los_days",   "mean"),
    readmit_rate  = ("readmitted", "mean"),
    total_los     = ("los_days",   "sum")
).round(2).reset_index()
insurance_stats["total_cost"]       = insurance_stats["total_los"] * DAILY_WARD_COST
insurance_stats["cost_per_patient"] = (insurance_stats["mean_los"] * DAILY_WARD_COST).round(0)
insurance_stats["readmit_pct"]      = (insurance_stats["readmit_rate"] * 100).round(0)

print("Insurance type comparison:")
print(insurance_stats[[
    "insurance","patients","mean_age","mean_los",
    "readmit_pct","cost_per_patient","total_cost"
]].to_string(index=False))

# High-cost patient: LOS > 75th percentile AND readmitted
los_75th  = df["los_days"].quantile(0.75)
high_cost = df[(df["los_days"] > los_75th) & (df["readmitted"] == 1)].copy()
high_cost["episode_cost"] = high_cost["los_days"] * DAILY_WARD_COST

print(f"\nHigh-cost threshold: LOS > {los_75th:.0f} days AND readmitted")
print(f"Patients meeting criteria: {len(high_cost)} of {len(df)}")
print(f"Total episode cost:        ${high_cost['episode_cost'].sum():,}")
print(f"Share of total ward spend: {high_cost['episode_cost'].sum()/(df['los_days'].sum()*DAILY_WARD_COST)*100:.1f}%")
print("\nHigh-cost patient breakdown:")
print(high_cost[["patient_id","diagnosis","department","age",
                 "los_days","insurance","episode_cost"]].to_string(index=False))

Insurance type comparison:
 insurance  patients  mean_age  mean_los  readmit_pct  cost_per_patient  total_cost
  Medicare        12      64.3       6.8         83.0              6460       73720
   Private         8      50.8       3.9         12.5              3705       29640

High-cost threshold: LOS > 7 days AND readmitted
Patients meeting criteria: 4 of 20
Total episode cost:        $33,250
Share of total ward spend: 32.0%

High-cost patient breakdown:
 patient_id diagnosis department  age  los_days insurance  episode_cost
      PT005   Cardiac  Cardiology   61         9  Medicare          8550
      PT008   Cardiac  Cardiology   74         8  Medicare          7600
      PT011   Cardiac  Cardiology   70        10  Medicare          9500
      PT018   Cardiac  Cardiology   67         8  Medicare          7600

What just happened?

Method — multi-condition boolean filter to isolate high-cost patient segments

The high-cost patient filter combines two conditions with the & operator: LOS above the 75th percentile (7 days) AND readmitted = 1. This is the same multi-condition filter used in CS14 to isolate Month-to-Month early-tenure subscribers — filter to the specific intersection of risk factors, then compute financial exposure. The episode_cost column multiplies each patient's actual LOS by the daily ward rate, giving per-episode cost rather than a mean — because high-cost patients are precisely those whose individual costs need to be visible.

Business Insight

Medicare patients readmit at 83% versus 12.5% for Private, generating $6,460 average episode cost versus $3,705. All four high-cost patients are Medicare Cardiac cases in Cardiology, accounting for 32% of total ward spend despite being only 20% of admissions. For the finance team, this is the core message: Medicare Cardiac is the hospital's single highest-cost patient segment.

Checkpoint: Calculate the potential annual saving if Cardiology's readmission rate were reduced from 85.7% to the clinical target of 10%. Assume the 20-patient dataset represents one month of admissions, and each prevented readmission saves one full episode cost of $7,885. How much would Meridian General save annually?

Key Findings

Cardiac is the most frequent diagnosis at 35% of all admissions, almost exclusively in patients over 60. Six of seven Cardiac patients are in the 61+ age bands, directly tying Cardiology's resource demand to the hospital's elderly Medicare population.

Neurology has a 100% readmission rate and Cardiology 85.7% — both far above the clinical benchmark of 10%. Orthopaedics and Pulmonology have zero readmissions, suggesting those discharge protocols are functioning and could serve as internal benchmarks.

Cardiology drives $55,195 in total ward costs — 47% of all spend — with a cost-per-admission of $7,885, more than double Respiratory's $3,325. The Cardiology composite risk score of 7.11 is the highest in the hospital.

Medicare patients readmit at 83% versus 12.5% for Private and generate $6,460 in average episode cost versus $3,705. All four high-cost patients are Medicare Cardiac cases, accounting for 32% of total ward spend.

Severity is a clean readmission predictor — High = 87.5%, Medium = 50%, Low = 0%. This confirms severity scoring is clinically valid and could be used as a formal discharge readiness gate to reduce early-release readmissions.

Visualisations

Diagnosis Frequency

Share of total admissions per diagnosis

35%

Cardiac

20%

Respiratory

20%

Diabetes

15%

Orthopaedic

10%

Neurological

Readmission Rate by Diagnosis

30-day readmission % · Red above 50% · Green = 0%

Neurological

100%

Cardiac

86%

Diabetes

75%

Orthopaedic

Respiratory

Mean Length of Stay by Department

Average bed-days per admission

8.3d

Cardiology

6.0d

Neurology

5.0d

Orthopaedics

3.8d

Endocrinology

3.5d

Pulmonology

Readmission Rate by Severity

Clean gradient — severity is a valid clinical predictor

HIGH

MEDIUM

LOW

Total Ward Cost by Diagnosis

$950/bed-day · Cardiology is 47% of total hospital spend in this cohort

$55,195

47%

Cardiac

$14,440

Diabetes

$14,250

Orthopaedic

$13,300

Respiratory

$11,400

Neuro

Diagnosis Count by Age Band — Grouped Bar Chart

Cardiac dominates in 61+ · Neurological and Diabetes concentrate in 45–60

Under 45

45–60

61–70

Over 70

Cardiac

Diabetes

Neurological

Orthopaedic

Respiratory

Pivot Table — Readmission Rate by Department × Diagnosis (%)

The readmission hotspot grid. Cardiology and Neurology are the clinical governance priorities.

Department	Cardiac	Diabetes	Neurological	Orthopaedic	Respiratory	Risk Score
Cardiology	86% ✗	—	—	—	—	7.11
Neurology	—	—	100% ✗	—	—	6.00
Endocrinology	—	75%	—	—	—	2.85
Orthopaedics	—	—	—	0% ✓	—	0.00
Pulmonology	—	—	—	—	0% ✓	0.00

— = diagnosis not seen in this department · Risk Score = readmit_rate × mean_los

Clinical Analytics Decision Guide

Task	Method	pandas Call	Watch Out For
Diagnosis frequency cross-tab	pd.crosstab() — two categorical variables	`pd.crosstab(df["age_band"], df["diagnosis"])`	Use normalize=True for proportions instead of counts
Readmission rate per segment	groupby + mean of binary column	`groupby("diagnosis")["readmitted"].mean()`	Always show patient count alongside rate — small n = unreliable
LOS multi-statistic summary	Pass list to .agg()	`.agg(["mean","median","std","max"])`	Mean vs median gap reveals outlier long-stay patients
Cost modelling	Multiply aggregated mean LOS by rate constant	`mean_los * DAILY_WARD_COST`	Define the rate constant once at top of script — never hardcode inline
Composite risk score	Multiply two aggregated metrics per segment	`readmit_rate * mean_los`	Normalise both inputs first if they have very different scales
High-cost patient filter	Multi-condition boolean with quantile threshold	`df[(df["los_days"] > los_75th) & (df["readmitted"]==1)]`	State the percentile threshold explicitly in any report

Analyst's Note

Teacher's Note

What Would Come Next?

Build a readmission prediction model using logistic regression on age, severity, diagnosis, and LOS — producing a per-patient risk score for discharge planning software. Run a Welch's t-test comparing Medicare vs Private LOS to confirm the difference is statistically significant.

Limitations of This Analysis

With only 2–7 patients per diagnosis, readmission rates carry high uncertainty. The flat $950/day ward rate ignores ICU premium costs and specialist procedure charges — a rigorous model would apply variable rates by department and severity.

Business Decisions This Could Drive

Launch a Cardiology discharge protocol review with mandatory 48-hour post-discharge telephone follow-up. Use severity score as a formal discharge readiness gate. Present the Medicare Cardiac cost profile as justification for a dedicated cardiac care coordinator role.

Practice Questions

1. Which pandas function produces a two-way frequency table counting co-occurrences of two categorical variables — used here to count diagnoses per age band?

2. Which department had the highest composite risk score — combining readmission rate and mean length of stay — making it the top priority for clinical protocol review?

3. What is the formula for computing estimated cost per admission from the aggregated LOS statistics?

Quiz

Up Next · Case Study 17

Traffic Accident Analytics

You have a city-level accident dataset. Which road types and weather conditions produce the most severe collisions? Do accidents cluster by time of day and day of week? And which locations should the council's road safety budget target first?

← Previous Course Index Next →

DS Case Studies

Uncovering Healthcare Diagnosis Trends

What This Case Study Covers

The Clinical Analytics Toolkit

Dataset Overview

Business Questions

Step-by-Step Analysis

Key Findings

Visualisations

Pivot Table — Readmission Rate by Department × Diagnosis (%)

Clinical Analytics Decision Guide

Analyst's Note

What Would Come Next?

Limitations of This Analysis

Business Decisions This Could Drive

Practice Questions

Quiz