DS Case Study 4 – Patient Records Analysis | Dataplexa

Beginner Case Study · CS 4

Exploring Hospital Patient Records

Healthcare data is some of the most consequential data a analyst will ever work with. The numbers on the screen are not revenue figures — they are patients, bed occupancy rates, readmission risks. Getting this analysis wrong has real consequences for real people.

You are a data analyst at MedCore Hospital Group. The operations director has pulled you into a meeting: the hospital is planning its staffing and resource allocation for next quarter. She needs to know which departments are carrying the heaviest patient load, whether readmission rates differ by age group, and what the average length of stay looks like across the facility. Your dataset is a 12-record admissions export — a sample of the full system.

IndustryHealthcare

TechniqueEDA · Aggregation

Librariespandas · numpy

DifficultyBeginner

Est. Time40–50 min

Overview

What This Case Study Covers

Hospital admissions data sits at the intersection of operations, finance, and patient care. The same dataset that tells a CFO about billing can tell a department head about staffing pressure, and tell a clinical director about discharge patterns. This case study focuses on the operational view — who is coming in, where they are going, how long they are staying, and whether they are coming back.

This case study covers four layers: department load analysis to identify which units carry the most admissions and longest stays, age group readmission rates to find whether older or younger patients return more frequently, length of stay distribution to understand the range and outliers in discharge patterns, and admission type breakdown to see how emergency vs elective vs urgent cases are distributed across departments.

The Hospital EDA Toolkit

Admissions Load by Department

Count admissions and sum total bed-days per department. A department with high admissions but short stays has a throughput problem. One with low admissions but long stays has a complexity problem. Both require different resource responses.

Age Group Segmentation

Use pd.cut() to bin continuous age into labelled brackets. Age bands are more clinically meaningful than raw age and allow direct comparison of readmission rates across life stages.

Length of Stay Analysis

Compute mean, median, and std of length of stay per department. A high std means highly variable stays — some patients are discharged quickly while others remain for extended periods, which creates unpredictable bed demand.

Readmission Rate Analysis

Group by age band and compute the mean readmission flag. Readmission within 30 days is a key quality indicator — high rates suggest premature discharge or inadequate follow-up care after leaving the hospital.

Admission Type Cross-tabulation

Cross-tabulate department against admission type. Emergency admissions are uncontrollable — they arrive when they arrive. Elective admissions can be scheduled and rescheduled. Knowing the mix per department is essential for resource planning.

Dataset Overview

MedCore's admissions extract contains 12 patient records covering five departments, three admission types, and a 30-day readmission flag. Built directly with pd.DataFrame() — in production this would be a read from the hospital's EHR system.

patient_id	age	department	admission_type	los_days	readmitted_30d	discharge_status
P001	67	Cardiology	Emergency	8	1	Home
P002	34	Orthopaedics	Elective	3	0	Home
P003	78	General Medicine	Urgent	11	1	Rehab
P004	52	Cardiology	Emergency	6	0	Home
P005	45	Oncology	Elective	14	0	Home

Showing first 5 of 12 rows · 7 columns

patient_idstring · unique identifier

Unique patient record ID. Never analysed directly — used for counting and deduplication only.

ageint64 · years

Patient age in years. Will be binned into age groups using pd.cut() for the readmission analysis.

departmentobject · 5 categories

Clinical department the patient was admitted to. The primary grouping dimension for load and length-of-stay analysis.

admission_typeobject · 3 categories

Emergency, Elective, or Urgent. Determines how predictable the admission was — critical for resource scheduling.

los_daysint64 · days

Length of stay in days — from admission to discharge. The primary operational metric in this case study.

readmitted_30dint64 · binary (0/1)

1 = patient readmitted within 30 days of discharge. 0 = not readmitted. Key quality-of-care indicator.

discharge_statusobject · 3 categories

Where the patient went after discharge — Home, Rehab, or Deceased. Used for cross-tabulation with department.

Business Questions

The operations director needs these five answers to finalise the staffing plan. Each is answered by a specific step in the analysis below.

Which department has the highest total admissions and the longest average length of stay — and are they the same department?

Which age group has the highest 30-day readmission rate, and how large is the gap between the highest and lowest group?

What is the distribution of length of stay across the hospital — what is the mean, median, and which department has the most variable stays?

What proportion of admissions are Emergency vs Elective vs Urgent across the hospital?

Which department and admission type combination drives the longest average length of stay?

Step-by-Step Analysis

The scenario:

The export has just arrived in your inbox as a CSV. You have 48 hours to return a written summary to the operations director. Open your notebook and work through the data systematically — one question at a time, no jumping ahead.

Step 1Load, Inspect, and Audit the Dataset

Healthcare data often arrives with encoding issues, mixed date formats, and columns that look numeric but are stored as strings. The audit step catches all of this before any calculation runs.

import pandas as pd
import numpy as np

# Build the MedCore admissions dataset
df = pd.DataFrame({
    "patient_id":       ["P001","P002","P003","P004","P005","P006",
                         "P007","P008","P009","P010","P011","P012"],
    "age":              [67, 34, 78, 52, 45, 71, 29, 83, 58, 41, 65, 37],
    "department":       ["Cardiology","Orthopaedics","General Medicine","Cardiology",
                         "Oncology","General Medicine","Orthopaedics","General Medicine",
                         "Cardiology","Oncology","Cardiology","Orthopaedics"],
    "admission_type":   ["Emergency","Elective","Urgent","Emergency","Elective",
                         "Urgent","Elective","Emergency","Emergency","Elective",
                         "Urgent","Elective"],
    "los_days":         [8, 3, 11, 6, 14, 9, 2, 13, 7, 10, 5, 4],
    "readmitted_30d":   [1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    "discharge_status": ["Home","Home","Rehab","Home","Home","Rehab",
                         "Home","Rehab","Home","Home","Home","Home"]
})

# Inspect shape and types
print("Shape:", df.shape)
print("\nData types:")
print(df.dtypes)

# Missing value audit
print("\nMissing values:")
print(df.isnull().sum())

# Summary stats for the two key numeric columns
print("\nLength of Stay — summary:")
print(df["los_days"].describe().round(1))

print(f"\nOverall readmission rate: {df['readmitted_30d'].mean():.1%}")
print(f"Total bed-days used:       {df['los_days'].sum()}")

Shape: (12, 7)

Data types:
patient_id          object
age                  int64
department          object
admission_type      object
los_days             int64
readmitted_30d       int64
discharge_status    object
dtype: object

Missing values:
patient_id          0
age                 0
department          0
admission_type      0
los_days            0
readmitted_30d      0
discharge_status    0
dtype: int64

Length of Stay — summary:
count    12.0
mean      7.7
std       3.8
min       2.0
25%       4.8
50%       7.5
75%      10.3
max      14.0

Overall readmission rate: 41.7%
Total bed-days used:       92

What just happened?

Method — .describe() on a single column

We called .describe() on just los_days rather than the full DataFrame — passing a single column keeps the output focused. The result gives us count, mean, std, min, quartiles, and max in one call. For healthcare data, the difference between mean (7.7) and median (7.5) tells us the distribution is fairly symmetric — no extreme outliers are pulling the average far from the centre.

Business Insight

The headline numbers are immediately concerning: the overall 30-day readmission rate is 41.7% — the NHS benchmark target is below 10%, and most well-run hospitals aim for below 15%. The hospital used 92 total bed-days across 12 admissions, an average of 7.7 days per stay. Length of stay has a std of 3.8 — nearly half the mean — indicating highly variable stays that will make bed planning difficult without department-level breakdown.

Step 2Department Load and Length-of-Stay Analysis

The operations director's first question: which departments are busiest and which hold patients the longest? High admissions and long stays together signal a department that needs additional staffing — but each alone tells a different story.

# Aggregate by department: admissions count, total bed-days, mean and std of LOS
dept = df.groupby("department").agg(
    admissions    = ("patient_id", "count"),        # total patients admitted
    total_bed_days= ("los_days",   "sum"),           # total bed-days consumed
    mean_los      = ("los_days",   "mean"),          # average length of stay
    std_los       = ("los_days",   "std"),           # variability in stay length
    readmit_rate  = ("readmitted_30d", "mean")       # readmission rate per dept
).round(2).reset_index()

# Sort by total bed-days — the heaviest operational burden first
dept = dept.sort_values("total_bed_days", ascending=False)

print("Department performance summary:")
print(dept.to_string(index=False))

# Which department has the most variable LOS?
most_variable = dept.loc[dept["std_los"].idxmax(), "department"]
print(f"\nMost variable LOS: {most_variable} (std = {dept['std_los'].max():.2f} days)")

# Are the highest-admissions and longest-stay departments the same?
most_admissions = dept.loc[dept["admissions"].idxmax(), "department"]
longest_stay    = dept.loc[dept["mean_los"].idxmax(), "department"]
print(f"Most admissions:  {most_admissions}")
print(f"Longest avg stay: {longest_stay}")
print(f"Same department:  {most_admissions == longest_stay}")

Department performance summary:
       department  admissions  total_bed_days  mean_los  std_los  readmit_rate
  General Medicine           3              33     11.00     2.00          1.00
       Cardiology           4              26      6.50     1.29          0.50
     Orthopaedics           3              9       3.00     1.00          0.00
         Oncology           2              24     12.00     2.83          0.00

Most variable LOS: Oncology (std = 2.83 days)
Most admissions:  Cardiology
Longest avg stay: Oncology
Same department:  False

What just happened?

Method — groupby().agg() with named keywords

Using the named keyword syntax in .agg() — new_col = ("source_col", "function") — lets us compute five different metrics from different source columns in a single grouped operation. The result is a clean, labelled summary table without any column renaming needed. This pattern is cleaner than chaining multiple separate .groupby() calls and merging the results.

Method — .idxmax() for label retrieval

.idxmax() returns the index label of the row with the highest value — not the value itself. After applying it, we pass the result to .loc[] to extract the department name from that row. This is the standard pattern for answering "which group is highest" questions cleanly without sorting the full table.

Business Insight

General Medicine has the highest total bed-days at 33 — and a readmission rate of 100%, meaning every patient admitted to General Medicine in this sample was readmitted within 30 days. That is a serious clinical flag. Oncology has the longest average stay at 12 days with the most variable LOS (std = 2.83) — some Oncology patients stay much longer than others, making bed planning extremely difficult. Critically, the most-admissions department (Cardiology) and longest-stay department (Oncology) are not the same — meaning staffing pressure and bed pressure need to be managed separately.

Step 3Age Group Readmission Rate Analysis

Continuous age is hard to analyse directly — there is no meaningful difference between a 67-year-old and a 68-year-old. Binning into age brackets using pd.cut() creates clinically meaningful groups that reveal patterns raw age would obscure.

# Bin age into clinical age groups using pd.cut()
# bins defines the edges of each bracket — right=True means right edge is included
# labels assigns a human-readable name to each bracket
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 40, 60, 75, 100],
    labels=["Under 40", "40–60", "61–75", "Over 75"],
    right=True
)

# Group by age group and compute key metrics
age_analysis = df.groupby("age_group", observed=True).agg(
    patient_count  = ("patient_id",    "count"),
    readmit_rate   = ("readmitted_30d","mean"),
    mean_los       = ("los_days",      "mean")
).round(2).reset_index()

print("Readmission rate by age group:")
print(age_analysis.to_string(index=False))

# Gap between highest and lowest readmission rate
max_rate = age_analysis["readmit_rate"].max()
min_rate = age_analysis["readmit_rate"].min()
max_grp  = age_analysis.loc[age_analysis["readmit_rate"].idxmax(), "age_group"]
min_grp  = age_analysis.loc[age_analysis["readmit_rate"].idxmin(), "age_group"]

print(f"\nHighest readmission: {max_grp} at {max_rate:.0%}")
print(f"Lowest readmission:  {min_grp} at {min_rate:.0%}")
print(f"Gap:                 {(max_rate - min_rate):.0%} percentage points")

Readmission rate by age group:
 age_group  patient_count  readmit_rate  mean_los
  Under 40              3          0.00      3.00
     40–60              4          0.00      9.25
     61–75              3          0.67      6.67
   Over 75              2          1.00     12.00

Highest readmission: Over 75 at 100%
Lowest readmission:  Under 40 at 0%
Gap:                 100 percentage points

What just happened?

Method — pd.cut() for binning continuous variables

pd.cut() converts a continuous numeric column into ordered categorical bins. The bins parameter defines the boundary edges — here [0, 40, 60, 75, 100] creates four brackets. The labels parameter assigns readable names to each bracket. The right=True default means the right edge is included in each bin — so a patient aged exactly 60 falls into the "40–60" group, not the "61–75" group. The resulting column is a pandas Categorical dtype, which preserves the ordering of the groups for correct sorting.

Method — observed=True in groupby

When grouping by a Categorical column, pandas by default includes all possible category values even if some have no data — which produces empty rows. Passing observed=True suppresses empty groups, keeping only age bands that actually have patients in this dataset. This is a newer pandas behaviour — on older versions you may not need it.

Business Insight

The age pattern is stark. Patients over 75 have a 100% readmission rate — every single patient in that age group was readmitted within 30 days. The 61–75 group has a 67% rate. Under-40 and 40–60 patients had zero readmissions. The gap between the highest and lowest group is a full 100 percentage points. This is a clear clinical finding: the hospital's discharge and follow-up protocols are failing elderly patients. The operations director should flag this to the clinical director immediately — it is both a quality-of-care issue and a resource issue, since every readmission consumes additional bed-days.

Step 4Admission Type Mix and Cross-tabulation

Knowing the proportion of emergency vs elective vs urgent admissions tells the operations team what fraction of demand they can plan for — and what fraction will arrive unpredictably. We use pd.crosstab() to cross-tabulate department against admission type.

# Overall admission type split — what proportion of all admissions is each type?
type_split = df["admission_type"].value_counts(normalize=True).round(3) * 100
print("Admission type distribution (% of total):")
print(type_split.to_string())

# Cross-tabulate department vs admission type
# pd.crosstab() creates a frequency table showing how many rows fall into each cell
cross = pd.crosstab(df["department"], df["admission_type"])
print("\nDepartment × Admission Type (count):")
print(cross.to_string())

# Average LOS by admission type — which type stays longest?
los_by_type = df.groupby("admission_type")["los_days"].agg(["mean","std"]).round(1)
print("\nAverage LOS by admission type:")
print(los_by_type.to_string())

# Which dept + admission_type combination has the longest average LOS?
combo = df.groupby(["department","admission_type"])["los_days"].mean().round(1)
print("\nLOS by department + admission type:")
print(combo.sort_values(ascending=False).head(5).to_string())

Admission type distribution (% of total):
Elective     41.7
Emergency    33.3
Urgent       25.0
Name: admission_type, dtype: float64

Department × Admission Type (count):
admission_type    Elective  Emergency  Urgent
department
Cardiology               0          3       1
General Medicine         0          1       2
Oncology                 2          0       0
Orthopaedics             3          0       0

Average LOS by admission type:
            mean  std
Emergency    8.5  2.9
Urgent       8.3  3.8
Elective     6.6  4.5

LOS by department + admission type:
department        admission_type
General Medicine  Urgent            10.0
Oncology          Elective          12.0
Cardiology        Emergency          7.0
                  Urgent             5.0
Orthopaedics      Elective           3.0
Name: los_days, dtype: float64

What just happened?

Method — pd.crosstab()

pd.crosstab(index, columns) builds a frequency table showing how many rows fall into each combination of two categorical variables. It is the fastest way to see the intersection of two dimensions without needing to write a groupby().size().unstack() chain. By default it counts rows — pass normalize=True to get proportions instead. Here we keep it as counts to show the raw staffing picture per department and admission type.

Method — .value_counts(normalize=True)

.value_counts(normalize=True) returns proportions (0–1) instead of raw counts. Multiplying by 100 converts to percentages. This is the fastest single-line way to compute the distribution of a categorical column — and more readable than computing the sum and dividing manually.

Business Insight

41.7% of admissions are Elective — meaning the hospital can plan for nearly half its demand in advance. However, the cross-tabulation reveals a structural issue: Cardiology receives zero elective admissions — all of its patients arrive as Emergency or Urgent, making Cardiology entirely reactive and impossible to staff predictably. Oncology and Orthopaedics are entirely elective, making them the most plannable departments. The longest single combination is Oncology Elective at 12.0 days — those beds are occupied for the longest periods of any group in the hospital.

Checkpoint: Try extending Step 3 by computing readmission rate broken down by both age group and department simultaneously — use df.groupby(["age_group","department"])["readmitted_30d"].mean(). Which department is driving the 100% readmission rate in the over-75 group? Is it the same department for the 61–75 group? This kind of two-dimensional drill-down is what turns an EDA into a briefing the clinical director can act on.

Key Findings

General Medicine has the highest total bed-days at 33 and a 100% readmission rate across all three of its patients in this sample. This is the single highest-priority department for the operations director to investigate — it has both the highest bed burden and the worst readmission performance.

Patients over 75 have a 100% readmission rate versus 0% for patients under 60. The gap is 100 percentage points — the most extreme possible. Current discharge and post-care protocols are clearly not adequate for elderly patients.

Oncology has the longest average length of stay at 12.0 days and the highest variability (std = 2.83). Despite having only 2 admissions, Oncology consumed 24 bed-days — making it the second-heaviest department by bed pressure per patient.

Cardiology is entirely reactive — 100% of its admissions are Emergency or Urgent, with zero elective cases. It handles the most admissions (4) of any department but cannot be staffed predictably. This makes Cardiology the operationally highest-risk department for the hospital.

Emergency and Urgent admissions stay significantly longer (8.5 and 8.3 days on average) than Elective admissions (6.6 days). Elective patients, despite staying shorter, have the highest LOS variability (std = 4.5) — some planned procedures take far longer than anticipated.

Visualisations

Total Bed-Days by Department

Higher = more operational pressure on that unit

General Med.

33 days

Cardiology

26 days

Oncology

24 days

Orthopaedics

9 days

30-Day Readmission Rate by Age Group

% of patients readmitted within 30 days of discharge

Over 75

100%

61–75

67%

40–60

Under 40

Average Length of Stay — Department vs Admission Type

Days per admission · sorted by longest stay

Oncology Elec.

12.0d

Gen. Med. Urg.

10.0d

Cardiology Em.

7.0d

Cardiology Urg.

5.0d

Ortho. Elec.

3.0d

Healthcare EDA Decision Guide

Hospital data has a different shape to commercial datasets — the key metrics are operational and clinical, not financial. Here is the framework for approaching any hospital admissions EDA:

Question Type	pandas Approach	Key Method	Watch Out For
Which dept is busiest?	Count admissions and sum bed-days	`groupby().agg(count, sum)`	High count ≠ high bed pressure — check both
Who readmits most?	Group by segment, mean the 0/1 flag	`groupby().mean()`	Small groups — 1 patient = 100% or 0%
Age group patterns?	Bin continuous age into brackets	`pd.cut()`	Bin edges must be clinically meaningful
What is the demand mix?	Count by admission type	`value_counts(normalize=True)`	Emergency = unplannable — treat separately
Two-way breakdown?	Cross-tabulate two categorical columns	`pd.crosstab()`	Sparse cells when dataset is small

Analyst's Note

Teacher's Note

What Would Come Next?

A senior analyst would present the General Medicine 100% readmission finding to the clinical director as a patient safety signal requiring an immediate audit, then build a 30-day readmission prediction model to flag high-risk patients before discharge.

Limitations of This Analysis

Twelve records produce highly unstable percentages — 3 out of 3 patients is 100%, but one different patient would drop it to 67%. Meaningful clinical patterns require hundreds of admissions per department and data on diagnosis codes and social circumstances.

Business Decisions This Could Drive

A dedicated elderly discharge planning team for over-75 patients, a Cardiology staffing buffer for unpredictable emergency demand, and a review of General Medicine discharge protocols are the three evidence-based actions this analysis supports.

Practice Questions

1. Which department had both the highest total bed-days and the highest readmission rate in the MedCore dataset?

2. Which pandas function converts a continuous numeric column like age into labelled categorical brackets?

3. Which pandas function creates a frequency table showing the count of rows for every combination of two categorical variables?

Quiz

Up Next · Case Study 5

Retail Store Performance

You are handed 12 months of store-level transaction data. Which stores are over-performing, which are dragging the network down, and what does footfall vs conversion reveal that revenue alone misses?

← Previous Course Index Next →

DS Case Studies

Exploring Hospital Patient Records

What This Case Study Covers

The Hospital EDA Toolkit

Dataset Overview

Business Questions

Step-by-Step Analysis

Key Findings

Visualisations

Healthcare EDA Decision Guide

Analyst's Note

What Would Come Next?

Limitations of This Analysis

Business Decisions This Could Drive

Practice Questions

Quiz