DS Case Studies
Exploring Hospital Patient Records
Healthcare data is some of the most consequential data a analyst will ever work with. The numbers on the screen are not revenue figures — they are patients, bed occupancy rates, readmission risks. Getting this analysis wrong has real consequences for real people.
You are a data analyst at MedCore Hospital Group. The operations director has pulled you into a meeting: the hospital is planning its staffing and resource allocation for next quarter. She needs to know which departments are carrying the heaviest patient load, whether readmission rates differ by age group, and what the average length of stay looks like across the facility. Your dataset is a 12-record admissions export — a sample of the full system.
What This Case Study Covers
Hospital admissions data sits at the intersection of operations, finance, and patient care. The same dataset that tells a CFO about billing can tell a department head about staffing pressure, and tell a clinical director about discharge patterns. This case study focuses on the operational view — who is coming in, where they are going, how long they are staying, and whether they are coming back.
This case study covers four layers: department load analysis to identify which units carry the most admissions and longest stays, age group readmission rates to find whether older or younger patients return more frequently, length of stay distribution to understand the range and outliers in discharge patterns, and admission type breakdown to see how emergency vs elective vs urgent cases are distributed across departments.
The Hospital EDA Toolkit
Admissions Load by Department
Count admissions and sum total bed-days per department. A department with high admissions but short stays has a throughput problem. One with low admissions but long stays has a complexity problem. Both require different resource responses.Age Group Segmentation
Usepd.cut() to bin continuous age into labelled brackets. Age bands are more clinically meaningful than raw age and allow direct comparison of readmission rates across life stages.Length of Stay Analysis
Compute mean, median, and std of length of stay per department. A high std means highly variable stays — some patients are discharged quickly while others remain for extended periods, which creates unpredictable bed demand.Readmission Rate Analysis
Group by age band and compute the mean readmission flag. Readmission within 30 days is a key quality indicator — high rates suggest premature discharge or inadequate follow-up care after leaving the hospital.Admission Type Cross-tabulation
Cross-tabulate department against admission type. Emergency admissions are uncontrollable — they arrive when they arrive. Elective admissions can be scheduled and rescheduled. Knowing the mix per department is essential for resource planning.Dataset Overview
MedCore's admissions extract contains 12 patient records covering five departments, three admission types, and a 30-day readmission flag. Built directly with pd.DataFrame() — in production this would be a read from the hospital's EHR system.
| patient_id | age | department | admission_type | los_days | readmitted_30d | discharge_status |
|---|---|---|---|---|---|---|
| P001 | 67 | Cardiology | Emergency | 8 | 1 | Home |
| P002 | 34 | Orthopaedics | Elective | 3 | 0 | Home |
| P003 | 78 | General Medicine | Urgent | 11 | 1 | Rehab |
| P004 | 52 | Cardiology | Emergency | 6 | 0 | Home |
| P005 | 45 | Oncology | Elective | 14 | 0 | Home |
Showing first 5 of 12 rows · 7 columns
Unique patient record ID. Never analysed directly — used for counting and deduplication only.
Patient age in years. Will be binned into age groups using pd.cut() for the readmission analysis.
Clinical department the patient was admitted to. The primary grouping dimension for load and length-of-stay analysis.
Emergency, Elective, or Urgent. Determines how predictable the admission was — critical for resource scheduling.
Length of stay in days — from admission to discharge. The primary operational metric in this case study.
1 = patient readmitted within 30 days of discharge. 0 = not readmitted. Key quality-of-care indicator.
Where the patient went after discharge — Home, Rehab, or Deceased. Used for cross-tabulation with department.
Business Questions
The operations director needs these five answers to finalise the staffing plan. Each is answered by a specific step in the analysis below.
Which department has the highest total admissions and the longest average length of stay — and are they the same department?
Which age group has the highest 30-day readmission rate, and how large is the gap between the highest and lowest group?
What is the distribution of length of stay across the hospital — what is the mean, median, and which department has the most variable stays?
What proportion of admissions are Emergency vs Elective vs Urgent across the hospital?
Which department and admission type combination drives the longest average length of stay?
Step-by-Step Analysis
The scenario:
The export has just arrived in your inbox as a CSV. You have 48 hours to return a written summary to the operations director. Open your notebook and work through the data systematically — one question at a time, no jumping ahead.
Healthcare data often arrives with encoding issues, mixed date formats, and columns that look numeric but are stored as strings. The audit step catches all of this before any calculation runs.
import pandas as pd
import numpy as np
# Build the MedCore admissions dataset
df = pd.DataFrame({
"patient_id": ["P001","P002","P003","P004","P005","P006",
"P007","P008","P009","P010","P011","P012"],
"age": [67, 34, 78, 52, 45, 71, 29, 83, 58, 41, 65, 37],
"department": ["Cardiology","Orthopaedics","General Medicine","Cardiology",
"Oncology","General Medicine","Orthopaedics","General Medicine",
"Cardiology","Oncology","Cardiology","Orthopaedics"],
"admission_type": ["Emergency","Elective","Urgent","Emergency","Elective",
"Urgent","Elective","Emergency","Emergency","Elective",
"Urgent","Elective"],
"los_days": [8, 3, 11, 6, 14, 9, 2, 13, 7, 10, 5, 4],
"readmitted_30d": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
"discharge_status": ["Home","Home","Rehab","Home","Home","Rehab",
"Home","Rehab","Home","Home","Home","Home"]
})
# Inspect shape and types
print("Shape:", df.shape)
print("\nData types:")
print(df.dtypes)
# Missing value audit
print("\nMissing values:")
print(df.isnull().sum())
# Summary stats for the two key numeric columns
print("\nLength of Stay — summary:")
print(df["los_days"].describe().round(1))
print(f"\nOverall readmission rate: {df['readmitted_30d'].mean():.1%}")
print(f"Total bed-days used: {df['los_days'].sum()}")
Shape: (12, 7) Data types: patient_id object age int64 department object admission_type object los_days int64 readmitted_30d int64 discharge_status object dtype: object Missing values: patient_id 0 age 0 department 0 admission_type 0 los_days 0 readmitted_30d 0 discharge_status 0 dtype: int64 Length of Stay — summary: count 12.0 mean 7.7 std 3.8 min 2.0 25% 4.8 50% 7.5 75% 10.3 max 14.0 Overall readmission rate: 41.7% Total bed-days used: 92
What just happened?
Method — .describe() on a single columnWe called .describe() on just los_days rather than the full DataFrame — passing a single column keeps the output focused. The result gives us count, mean, std, min, quartiles, and max in one call. For healthcare data, the difference between mean (7.7) and median (7.5) tells us the distribution is fairly symmetric — no extreme outliers are pulling the average far from the centre.
The headline numbers are immediately concerning: the overall 30-day readmission rate is 41.7% — the NHS benchmark target is below 10%, and most well-run hospitals aim for below 15%. The hospital used 92 total bed-days across 12 admissions, an average of 7.7 days per stay. Length of stay has a std of 3.8 — nearly half the mean — indicating highly variable stays that will make bed planning difficult without department-level breakdown.
The operations director's first question: which departments are busiest and which hold patients the longest? High admissions and long stays together signal a department that needs additional staffing — but each alone tells a different story.
# Aggregate by department: admissions count, total bed-days, mean and std of LOS
dept = df.groupby("department").agg(
admissions = ("patient_id", "count"), # total patients admitted
total_bed_days= ("los_days", "sum"), # total bed-days consumed
mean_los = ("los_days", "mean"), # average length of stay
std_los = ("los_days", "std"), # variability in stay length
readmit_rate = ("readmitted_30d", "mean") # readmission rate per dept
).round(2).reset_index()
# Sort by total bed-days — the heaviest operational burden first
dept = dept.sort_values("total_bed_days", ascending=False)
print("Department performance summary:")
print(dept.to_string(index=False))
# Which department has the most variable LOS?
most_variable = dept.loc[dept["std_los"].idxmax(), "department"]
print(f"\nMost variable LOS: {most_variable} (std = {dept['std_los'].max():.2f} days)")
# Are the highest-admissions and longest-stay departments the same?
most_admissions = dept.loc[dept["admissions"].idxmax(), "department"]
longest_stay = dept.loc[dept["mean_los"].idxmax(), "department"]
print(f"Most admissions: {most_admissions}")
print(f"Longest avg stay: {longest_stay}")
print(f"Same department: {most_admissions == longest_stay}")
Department performance summary:
department admissions total_bed_days mean_los std_los readmit_rate
General Medicine 3 33 11.00 2.00 1.00
Cardiology 4 26 6.50 1.29 0.50
Orthopaedics 3 9 3.00 1.00 0.00
Oncology 2 24 12.00 2.83 0.00
Most variable LOS: Oncology (std = 2.83 days)
Most admissions: Cardiology
Longest avg stay: Oncology
Same department: FalseWhat just happened?
Method — groupby().agg() with named keywordsUsing the named keyword syntax in .agg() — new_col = ("source_col", "function") — lets us compute five different metrics from different source columns in a single grouped operation. The result is a clean, labelled summary table without any column renaming needed. This pattern is cleaner than chaining multiple separate .groupby() calls and merging the results.
.idxmax() returns the index label of the row with the highest value — not the value itself. After applying it, we pass the result to .loc[] to extract the department name from that row. This is the standard pattern for answering "which group is highest" questions cleanly without sorting the full table.
General Medicine has the highest total bed-days at 33 — and a readmission rate of 100%, meaning every patient admitted to General Medicine in this sample was readmitted within 30 days. That is a serious clinical flag. Oncology has the longest average stay at 12 days with the most variable LOS (std = 2.83) — some Oncology patients stay much longer than others, making bed planning extremely difficult. Critically, the most-admissions department (Cardiology) and longest-stay department (Oncology) are not the same — meaning staffing pressure and bed pressure need to be managed separately.
Continuous age is hard to analyse directly — there is no meaningful difference between a 67-year-old and a 68-year-old. Binning into age brackets using pd.cut() creates clinically meaningful groups that reveal patterns raw age would obscure.
# Bin age into clinical age groups using pd.cut()
# bins defines the edges of each bracket — right=True means right edge is included
# labels assigns a human-readable name to each bracket
df["age_group"] = pd.cut(
df["age"],
bins=[0, 40, 60, 75, 100],
labels=["Under 40", "40–60", "61–75", "Over 75"],
right=True
)
# Group by age group and compute key metrics
age_analysis = df.groupby("age_group", observed=True).agg(
patient_count = ("patient_id", "count"),
readmit_rate = ("readmitted_30d","mean"),
mean_los = ("los_days", "mean")
).round(2).reset_index()
print("Readmission rate by age group:")
print(age_analysis.to_string(index=False))
# Gap between highest and lowest readmission rate
max_rate = age_analysis["readmit_rate"].max()
min_rate = age_analysis["readmit_rate"].min()
max_grp = age_analysis.loc[age_analysis["readmit_rate"].idxmax(), "age_group"]
min_grp = age_analysis.loc[age_analysis["readmit_rate"].idxmin(), "age_group"]
print(f"\nHighest readmission: {max_grp} at {max_rate:.0%}")
print(f"Lowest readmission: {min_grp} at {min_rate:.0%}")
print(f"Gap: {(max_rate - min_rate):.0%} percentage points")
Readmission rate by age group:
age_group patient_count readmit_rate mean_los
Under 40 3 0.00 3.00
40–60 4 0.00 9.25
61–75 3 0.67 6.67
Over 75 2 1.00 12.00
Highest readmission: Over 75 at 100%
Lowest readmission: Under 40 at 0%
Gap: 100 percentage pointsWhat just happened?
Method — pd.cut() for binning continuous variablespd.cut() converts a continuous numeric column into ordered categorical bins. The bins parameter defines the boundary edges — here [0, 40, 60, 75, 100] creates four brackets. The labels parameter assigns readable names to each bracket. The right=True default means the right edge is included in each bin — so a patient aged exactly 60 falls into the "40–60" group, not the "61–75" group. The resulting column is a pandas Categorical dtype, which preserves the ordering of the groups for correct sorting.
When grouping by a Categorical column, pandas by default includes all possible category values even if some have no data — which produces empty rows. Passing observed=True suppresses empty groups, keeping only age bands that actually have patients in this dataset. This is a newer pandas behaviour — on older versions you may not need it.
The age pattern is stark. Patients over 75 have a 100% readmission rate — every single patient in that age group was readmitted within 30 days. The 61–75 group has a 67% rate. Under-40 and 40–60 patients had zero readmissions. The gap between the highest and lowest group is a full 100 percentage points. This is a clear clinical finding: the hospital's discharge and follow-up protocols are failing elderly patients. The operations director should flag this to the clinical director immediately — it is both a quality-of-care issue and a resource issue, since every readmission consumes additional bed-days.
Knowing the proportion of emergency vs elective vs urgent admissions tells the operations team what fraction of demand they can plan for — and what fraction will arrive unpredictably. We use pd.crosstab() to cross-tabulate department against admission type.
# Overall admission type split — what proportion of all admissions is each type?
type_split = df["admission_type"].value_counts(normalize=True).round(3) * 100
print("Admission type distribution (% of total):")
print(type_split.to_string())
# Cross-tabulate department vs admission type
# pd.crosstab() creates a frequency table showing how many rows fall into each cell
cross = pd.crosstab(df["department"], df["admission_type"])
print("\nDepartment × Admission Type (count):")
print(cross.to_string())
# Average LOS by admission type — which type stays longest?
los_by_type = df.groupby("admission_type")["los_days"].agg(["mean","std"]).round(1)
print("\nAverage LOS by admission type:")
print(los_by_type.to_string())
# Which dept + admission_type combination has the longest average LOS?
combo = df.groupby(["department","admission_type"])["los_days"].mean().round(1)
print("\nLOS by department + admission type:")
print(combo.sort_values(ascending=False).head(5).to_string())
Admission type distribution (% of total):
Elective 41.7
Emergency 33.3
Urgent 25.0
Name: admission_type, dtype: float64
Department × Admission Type (count):
admission_type Elective Emergency Urgent
department
Cardiology 0 3 1
General Medicine 0 1 2
Oncology 2 0 0
Orthopaedics 3 0 0
Average LOS by admission type:
mean std
Emergency 8.5 2.9
Urgent 8.3 3.8
Elective 6.6 4.5
LOS by department + admission type:
department admission_type
General Medicine Urgent 10.0
Oncology Elective 12.0
Cardiology Emergency 7.0
Urgent 5.0
Orthopaedics Elective 3.0
Name: los_days, dtype: float64What just happened?
Method — pd.crosstab()pd.crosstab(index, columns) builds a frequency table showing how many rows fall into each combination of two categorical variables. It is the fastest way to see the intersection of two dimensions without needing to write a groupby().size().unstack() chain. By default it counts rows — pass normalize=True to get proportions instead. Here we keep it as counts to show the raw staffing picture per department and admission type.
.value_counts(normalize=True) returns proportions (0–1) instead of raw counts. Multiplying by 100 converts to percentages. This is the fastest single-line way to compute the distribution of a categorical column — and more readable than computing the sum and dividing manually.
41.7% of admissions are Elective — meaning the hospital can plan for nearly half its demand in advance. However, the cross-tabulation reveals a structural issue: Cardiology receives zero elective admissions — all of its patients arrive as Emergency or Urgent, making Cardiology entirely reactive and impossible to staff predictably. Oncology and Orthopaedics are entirely elective, making them the most plannable departments. The longest single combination is Oncology Elective at 12.0 days — those beds are occupied for the longest periods of any group in the hospital.
Checkpoint: Try extending Step 3 by computing readmission rate broken down by both age group and department simultaneously — use df.groupby(["age_group","department"])["readmitted_30d"].mean(). Which department is driving the 100% readmission rate in the over-75 group? Is it the same department for the 61–75 group? This kind of two-dimensional drill-down is what turns an EDA into a briefing the clinical director can act on.
Key Findings
General Medicine has the highest total bed-days at 33 and a 100% readmission rate across all three of its patients in this sample. This is the single highest-priority department for the operations director to investigate — it has both the highest bed burden and the worst readmission performance.
Patients over 75 have a 100% readmission rate versus 0% for patients under 60. The gap is 100 percentage points — the most extreme possible. Current discharge and post-care protocols are clearly not adequate for elderly patients.
Oncology has the longest average length of stay at 12.0 days and the highest variability (std = 2.83). Despite having only 2 admissions, Oncology consumed 24 bed-days — making it the second-heaviest department by bed pressure per patient.
Cardiology is entirely reactive — 100% of its admissions are Emergency or Urgent, with zero elective cases. It handles the most admissions (4) of any department but cannot be staffed predictably. This makes Cardiology the operationally highest-risk department for the hospital.
Emergency and Urgent admissions stay significantly longer (8.5 and 8.3 days on average) than Elective admissions (6.6 days). Elective patients, despite staying shorter, have the highest LOS variability (std = 4.5) — some planned procedures take far longer than anticipated.
Visualisations
Healthcare EDA Decision Guide
Hospital data has a different shape to commercial datasets — the key metrics are operational and clinical, not financial. Here is the framework for approaching any hospital admissions EDA:
| Question Type | pandas Approach | Key Method | Watch Out For |
|---|---|---|---|
| Which dept is busiest? | Count admissions and sum bed-days | groupby().agg(count, sum) | High count ≠ high bed pressure — check both |
| Who readmits most? | Group by segment, mean the 0/1 flag | groupby().mean() | Small groups — 1 patient = 100% or 0% |
| Age group patterns? | Bin continuous age into brackets | pd.cut() | Bin edges must be clinically meaningful |
| What is the demand mix? | Count by admission type | value_counts(normalize=True) | Emergency = unplannable — treat separately |
| Two-way breakdown? | Cross-tabulate two categorical columns | pd.crosstab() | Sparse cells when dataset is small |
Analyst's Note
Teacher's Note
What Would Come Next?
A senior analyst would present the General Medicine 100% readmission finding to the clinical director as a patient safety signal requiring an immediate audit, then build a 30-day readmission prediction model to flag high-risk patients before discharge.
Limitations of This Analysis
Twelve records produce highly unstable percentages — 3 out of 3 patients is 100%, but one different patient would drop it to 67%. Meaningful clinical patterns require hundreds of admissions per department and data on diagnosis codes and social circumstances.
Business Decisions This Could Drive
A dedicated elderly discharge planning team for over-75 patients, a Cardiology staffing buffer for unpredictable emergency demand, and a review of General Medicine discharge protocols are the three evidence-based actions this analysis supports.
Practice Questions
1. Which department had both the highest total bed-days and the highest readmission rate in the MedCore dataset?
2. Which pandas function converts a continuous numeric column like age into labelled categorical brackets?
3. Which pandas function creates a frequency table showing the count of rows for every combination of two categorical variables?
Quiz
1. What does pd.cut(df["age"], bins=[0,40,60,100], labels=["Young","Middle","Senior"]) do?
2. Which department poses the highest staffing challenge for the operations director, and why?
3. Why do we pass observed=True when grouping by a Categorical column created with pd.cut()?
Up Next · Case Study 5
Retail Store Performance
You are handed 12 months of store-level transaction data. Which stores are over-performing, which are dragging the network down, and what does footfall vs conversion reveal that revenue alone misses?