EDA Course
Boxplots & Violin Plots
A mean tells you the centre. A standard deviation tells you the spread. But neither tells you where the outliers are, how skewed the distribution is, or whether there are two clusters hiding inside your data. Between them, boxplots and violin plots show all of this at a glance, and they're especially powerful when comparing groups.
What a Boxplot Actually Shows
A boxplot packs five numbers into one shape. Once you know how to read it, you can extract more information from a boxplot in three seconds than from a table of statistics in thirty.
Anatomy of a Boxplot
The Dataset We'll Use
The scenario: You're an analyst at an NHS trust. The operations director wants to understand how long patients stay across three wards: Medical, Surgical, and Emergency. Do some wards have consistently longer stays? Are outliers pulling the averages up? She needs answers before presenting to the board.
import pandas as pd
import numpy as np
# 24 patients across 3 wards, length of stay in days
df = pd.DataFrame({
    'ward': ['Medical']*8 + ['Surgical']*8 + ['Emergency']*8,
    'los_days': [
        4, 6, 5, 7, 8, 5, 6, 21,         # Medical: one very long outlier (21 days)
        9, 11, 10, 13, 12, 14, 11, 10,   # Surgical: tight cluster, no outliers
        1, 2, 1, 3, 1, 2, 15, 18         # Emergency: mostly 1-3 days, two long outliers
    ]
})
# Quick stats to see what we're dealing with
print(df.groupby('ward')['los_days'].describe().round(1))
           count  mean  std  min   25%   50%   75%   max
ward
Emergency    8.0   5.4  6.9  1.0   1.0   2.0   6.0  18.0
Medical      8.0   7.8  5.5  4.0   5.0   6.0   7.2  21.0
Surgical     8.0  11.2  1.7  9.0  10.0  11.0  12.2  14.0
What just happened?
Three very different stories are hidden in these numbers. Surgical has the tightest spread (std=1.7), so stays there are predictable. Emergency's mean (5.4) is far above its median (2.0), which means outliers are inflating the average. Medical's maximum is 21 days when its 75th percentile is only about 7, so one patient is an extreme outlier. Boxplots make all three patterns hard to miss.
Step 1 — Calculate the Boxplot Numbers
Before any chart, let's compute every number a boxplot shows — Q1, median, Q3, the fences, and which values are outliers. These numbers are the analysis. The chart just makes them easier to communicate.
def boxplot_stats(series, label):
    # pandas interpolates linearly by default, so a quartile can fall between two observed values
    Q1, median, Q3 = series.quantile([0.25, 0.50, 0.75])
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR   # lower fence
    upper = Q3 + 1.5 * IQR   # upper fence
    outliers = series[(series < lower) | (series > upper)].tolist()
    print(f"  {label}")
    print(f"  Q1={Q1:g}  Median={median:g}  Q3={Q3:g}  IQR={IQR:g}")
    print(f"  Fences: [{lower:g}, {upper:g}]")
    print(f"  Outliers: {outliers if outliers else 'None'}\n")

print("=== BOXPLOT STATS BY WARD ===\n")
for ward, grp in df.groupby('ward'):
    boxplot_stats(grp['los_days'], ward)

=== BOXPLOT STATS BY WARD ===

  Emergency
  Q1=1  Median=2  Q3=6  IQR=5
  Fences: [-6.5, 13.5]
  Outliers: [15, 18]

  Medical
  Q1=5  Median=6  Q3=7.25  IQR=2.25
  Fences: [1.625, 10.625]
  Outliers: [21]

  Surgical
  Q1=10  Median=11  Q3=12.25  IQR=2.25
  Fences: [6.625, 15.625]
  Outliers: None
What just happened?
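pandas' .quantile() takes a list and returns all three percentiles at once; by default it interpolates linearly, so a quartile can fall between two observed values. The fences are simple arithmetic: Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. Emergency has two outliers (15 and 18 days), which are 7.5 and 9 times the median stay of 2 days. Medical has one (21 days). Surgical has none at all, making it the most predictable of the three wards.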
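One detail worth checking by hand: in a standard (Tukey-style) boxplot the whiskers do not extend to the fences themselves. They stop at the most extreme data point that still lies inside the fences. The sketch below rebuilds the lesson's data so it runs on its own; the helper name `whisker_ends` is made up for this illustration.

```python
import pandas as pd

# Same 24 stays as in the lesson, rebuilt here so the block is self-contained
df = pd.DataFrame({
    'ward': ['Medical']*8 + ['Surgical']*8 + ['Emergency']*8,
    'los_days': [4, 6, 5, 7, 8, 5, 6, 21,
                 9, 11, 10, 13, 12, 14, 11, 10,
                 1, 2, 1, 3, 1, 2, 15, 18],
})

def whisker_ends(series):
    """Return (low, high): the most extreme values still inside the 1.5*IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    inside = series[(series >= q1 - 1.5 * iqr) & (series <= q3 + 1.5 * iqr)]
    return int(inside.min()), int(inside.max())

# One (low, high) pair per ward
whiskers = {ward: whisker_ends(grp['los_days']) for ward, grp in df.groupby('ward')}
for ward, (low, high) in whiskers.items():
    print(f"{ward}: whiskers run from {low} to {high} days")
```

For Emergency the upper whisker stops at 3 days even though the upper fence is much higher, because no patient stayed between 3 and 15 days.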
The Three Wards — Visual Comparison
Here's what all three boxplots look like side by side — built directly from the numbers above:
Figure: Length of Stay by Ward (days)
Red dots = outliers. Surgical has none — its box sits highest because stays are genuinely longer, not because of outliers.
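A chart like this can be drawn with matplotlib. What follows is a minimal sketch, assuming matplotlib is installed; it is not necessarily how the lesson's own figure was produced. The red marker styling is chosen to match the caption, and the `Agg` backend line is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Same 24 stays as in the lesson, rebuilt here so the block is self-contained
df = pd.DataFrame({
    'ward': ['Medical']*8 + ['Surgical']*8 + ['Emergency']*8,
    'los_days': [4, 6, 5, 7, 8, 5, 6, 21,
                 9, 11, 10, 13, 12, 14, 11, 10,
                 1, 2, 1, 3, 1, 2, 15, 18],
})

wards = ['Medical', 'Surgical', 'Emergency']
data = [df.loc[df['ward'] == w, 'los_days'].to_numpy() for w in wards]

fig, ax = plt.subplots(figsize=(6, 4))
# Default whiskers use the same 1.5 * IQR rule as the manual calculation
parts = ax.boxplot(
    data,
    flierprops=dict(marker='o', markerfacecolor='red', markeredgecolor='red'),
)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(wards)
ax.set_ylabel('Length of stay (days)')
ax.set_title('Length of Stay by Ward (days)')
# fig.savefig('los_by_ward.png')  # hypothetical filename; uncomment to write the chart to disk

# The points matplotlib draws as "fliers" are the outliers
fliers = {w: sorted(line.get_ydata().tolist()) for w, line in zip(wards, parts['fliers'])}
print(fliers)
```

The printed fliers should match the outliers found in Step 1, which is a quick way to confirm that the chart and the hand calculation agree.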
Boxplot vs Violin — What the Violin Adds
A boxplot summarises a distribution with five numbers. A violin plot draws a smoothed estimate of the full distribution shape, mirrored on both sides: the wider the violin, the more patients at that stay length. It catches things boxplots hide: bimodal distributions (two humps), unusual gaps, and whether a distribution is truly symmetric or just happens to have a centred median.
Use a Boxplot when:
- Comparing many groups side by side
- You specifically need to see outliers
- Sample size is small (<30)
Use a Violin when:
- You suspect bimodality or unusual shape
- Distribution shape matters for your analysis
- Sample size is larger (n≥30 per group)
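To see why shape matters, here is a small sketch using made-up stay lengths (not the lesson's data) that form two clusters with nothing in between. The five boxplot numbers look perfectly ordinary, while a simple count per day, which is the kind of shape a violin would trace, exposes the gap.

```python
import numpy as np

# Made-up stays (days) for illustration only: two clusters and an empty middle
stays = np.array([2]*5 + [3]*10 + [4]*5 + [8]*5 + [9]*10 + [10]*5)

# The five numbers a boxplot would draw
q1, median, q3 = np.percentile(stays, [25, 50, 75])
print(f"min={stays.min()}  Q1={q1:g}  median={median:g}  Q3={q3:g}  max={stays.max()}")

# Count how many stays fall on each whole day and print a sideways histogram
counts = np.bincount(stays, minlength=int(stays.max()) + 1)
for day in range(int(stays.min()), int(stays.max()) + 1):
    print(f"{day:>2} days | {'#' * int(counts[day])}")
```

The median lands at 6 days, a stay length that no patient in this sample actually had. A boxplot would draw a tidy, symmetric box around it; the per-day counts show two separate humps.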
Step 2 — The Plain-English Findings
The most important output isn't the chart: it's what you tell the operations director. Here's the code that translates boxplot statistics into actionable findings.
print("=== BOARD BRIEFING: LENGTH-OF-STAY FINDINGS ===\n")
for ward, grp in df.groupby('ward'):
    s = grp['los_days']
    Q1 = s.quantile(0.25)
    median = s.median()
    Q3 = s.quantile(0.75)
    mean = s.mean()
    IQR = Q3 - Q1
    # Upper fence only: this briefing reports unusually long stays
    outliers = s[s > Q3 + 1.5*IQR].tolist()
    print(f"  {ward} Ward")
    print(f"  Typical stay: {Q1:.0f}–{Q3:.0f} days (median={median:.0f}d, mean={mean:.1f}d)")
    # Flag when the mean is more than 25% above the median
    if mean > median * 1.25:
        pct_inflated = (mean/median - 1)*100
        print(f"  ⚠ Mean is {pct_inflated:.0f}% above median; use median for planning, not mean")
    # Report outliers in plain English
    if outliers:
        print(f"  ⚠ {len(outliers)} outlier(s): {outliers} days, "
              f"up to {max(outliers)/median:.1f}x the median stay")
        print(f"  Action: investigate case complexity and flag for clinical review")
    else:
        print(f"  ✓ No outliers; this ward is suitable for capacity planning based on historical data")
    print()

=== BOARD BRIEFING: LENGTH-OF-STAY FINDINGS ===

  Emergency Ward
  Typical stay: 1–6 days (median=2d, mean=5.4d)
  ⚠ Mean is 169% above median; use median for planning, not mean
  ⚠ 2 outlier(s): [15, 18] days, up to 9.0x the median stay
  Action: investigate case complexity and flag for clinical review

  Medical Ward
  Typical stay: 5–7 days (median=6d, mean=7.8d)
  ⚠ Mean is 29% above median; use median for planning, not mean
  ⚠ 1 outlier(s): [21] days, up to 3.5x the median stay
  Action: investigate case complexity and flag for clinical review

  Surgical Ward
  Typical stay: 10–12 days (median=11d, mean=11.2d)
  ✓ No outliers; this ward is suitable for capacity planning based on historical data
What just happened?
pandas' .median(), .mean(), and .quantile() run inside a groupby loop. The 25% inflation check flags when the mean is misleading. The outlier list is generated with a single boolean filter.
The critical insight for the operations director: Emergency's mean (5.4 days) is about 169% above the median (2 days). If she plans bed capacity around the mean, she budgets roughly 2.7 times the typical stay. The median is the right number for a typical patient. Surgical's mean and median are nearly identical (11.2 vs 11), so either works there: the ward is genuinely predictable.
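The median's resistance to outliers can be checked directly. This short sketch takes the Emergency stays from the dataset, drops the two long stays, and compares both measures before and after. The variable names are chosen for this illustration.

```python
import numpy as np

emergency = np.array([1, 2, 1, 3, 1, 2, 15, 18])   # Emergency stays from the lesson
without_long = emergency[emergency < 15]            # same ward without the 15- and 18-day stays

for label, values in [("all 8 patients", emergency), ("without 15 and 18", without_long)]:
    print(f"{label:<18} mean={values.mean():.3f}  median={np.median(values):.3f}")
```

Removing two patients pulls the mean from about 5.4 days down to about 1.7, while the median only moves from 2 to 1.5. That stability is why the median is the safer summary when a few extreme stays are present.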
Teacher's Note
The boxplot is one of the best "is my mean trustworthy?" checks you have. If the median line sits far from the centre of the box, or if outlier dots appear above the upper whisker, the mean is being distorted. Always check the boxplot before reporting a mean to a stakeholder.
In a real notebook: seaborn.boxplot(x='ward', y='los_days', data=df) draws the full chart in one line. The statistics underneath are exactly what we computed manually — seaborn just draws the shape automatically.
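As a sketch of that notebook workflow, the block below draws a boxplot and a violin plot side by side, assuming seaborn and matplotlib are installed. The `cut=0` argument stops the violin's smoothed curve at the observed minimum and maximum, so it does not suggest negative stays. With only 8 patients per ward the violin shapes are rough and should be read with caution.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Same 24 stays as in the lesson, rebuilt here so the block is self-contained
df = pd.DataFrame({
    'ward': ['Medical']*8 + ['Surgical']*8 + ['Emergency']*8,
    'los_days': [4, 6, 5, 7, 8, 5, 6, 21,
                 9, 11, 10, 13, 12, 14, 11, 10,
                 1, 2, 1, 3, 1, 2, 15, 18],
})

# Two panels sharing the same y-axis so the wards line up across charts
fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sns.boxplot(data=df, x='ward', y='los_days', ax=ax_box)
sns.violinplot(data=df, x='ward', y='los_days', cut=0, ax=ax_violin)
ax_box.set_title('Boxplot')
ax_violin.set_title('Violin plot')
fig.tight_layout()
```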
Practice Questions
1. In a boxplot, which line inside the box represents the 50th percentile of the data?
2. You suspect a distribution might be bimodal (two humps). Which plot type would reveal the full shape and confirm this?
3. Emergency ward mean stay is 5.4 days, median is 2 days. Which measure should the operations director use for bed planning?
Quiz
1. In a boxplot, what visual feature tells you the data is right-skewed?
2. You're comparing salaries across departments. One department has two bands — junior staff around £25k and senior staff around £75k, nobody in between. Which chart reveals this most clearly?
3. Which ward is most suitable for straightforward capacity planning based on historical data?
Up Next · Lesson 30
Correlation Maps
Build and read a full correlation heatmap — and learn what it's actually telling you about your features before you model.