EDA Lesson 29 – Boxplots & Violin Plots | Dataplexa

Intermediate Level · Lesson 29

Boxplots & Violin Plots

A mean tells you the centre. A standard deviation tells you the spread. But neither tells you where the outliers are, how skewed the distribution is, or whether there are two clusters hiding inside your data. Boxplots and violin plots show all of this at once — and they're especially powerful when comparing groups.

What a Boxplot Actually Shows

A boxplot packs five numbers into one shape. Once you know how to read it, you can extract more information from a boxplot in three seconds than from a table of statistics in thirty.

Anatomy of a Boxplot

Box = middle 50% of data (Q1 to Q3 = IQR)

Line inside = median (50th percentile)

Whiskers = extend to the smallest and largest values that lie within 1.5×IQR of the box (below Q1 and above Q3)

Dots beyond whiskers = outliers

If median line is near Q1 (bottom of box) → right-skewed

If median line is centred → symmetric

If median line is near Q3 (top of box) → left-skewed
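To make the anatomy concrete, here is a minimal sketch that computes each part of a boxplot for one small sample. The function name `box_parts` and the sample values are illustrative assumptions, and the quartiles use NumPy's default linear interpolation. Note the difference between the fences (Q1 − 1.5×IQR and Q3 + 1.5×IQR) and the whiskers, which stop at the most extreme data points inside those fences.

```python
import numpy as np

def box_parts(values):
    """Return the pieces of a Tukey-style boxplot for a 1-D sample."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])  # linear interpolation by default
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at real data points, not at the fences themselves
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside.min(), inside.max()),
        "outliers": x[(x < lower_fence) | (x > upper_fence)].tolist(),
    }

sample = [2, 3, 4, 4, 5, 6, 7, 30]   # hypothetical values, not from the lesson's dataset
parts = box_parts(sample)
for name, value in parts.items():
    print(f"{name:>8}: {value}")
```

In this sample the upper fence is 10, but the upper whisker stops at 7 because that is the largest value inside the fence; 30 is drawn as an outlier dot. The median (4.5) sits below the middle of the box (3.75 to 6.25), which is the right-skew pattern described above.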

The Dataset We'll Use

The scenario: You're an analyst at an NHS trust. The operations director wants to understand how long patients stay across three wards — Medical, Surgical, and Emergency. Are some wards consistently longer? Are outliers pulling the averages up? She needs answers before presenting to the board.

import pandas as pd
import numpy as np

# 24 patients across 3 wards — length of stay in days
df = pd.DataFrame({
    'ward':     ['Medical']*8 + ['Surgical']*8 + ['Emergency']*8,
    'los_days': [
        4, 6, 5, 7, 8, 5, 6, 21,   # Medical — one very long outlier (21 days)
        9,11,10,13,12,14,11,10,    # Surgical — tight cluster, no outliers
        1, 2, 1, 3, 1, 2, 15, 18   # Emergency — mostly 1-3 days, two long outliers
    ]
})

# Quick stats to see what we're dealing with
print(df.groupby('ward')['los_days'].describe().round(1))

           count  mean  std  min   25%   50%   75%   max
ward
Emergency    8.0   5.4  6.9  1.0   1.0   2.0   6.0  18.0
Medical      8.0   7.8  5.5  4.0   5.0   6.0   7.2  21.0
Surgical     8.0  11.2  1.7  9.0  10.0  11.0  12.2  14.0

What just happened?

Three very different stories hidden in the numbers: Surgical has the tightest spread (std=1.7), so its stays are predictable. Emergency's mean (5.4) is far above its median (2.0), a sign that long stays are inflating the average. Medical's max is 21 while its 75th percentile is only about 7, so one patient is a massive outlier. Boxplots will make all three patterns impossible to miss.

Step 1 — Calculate the Boxplot Numbers

Before any chart, let's compute every number a boxplot shows — Q1, median, Q3, the fences, and which values are outliers. These numbers are the analysis. The chart just makes them easier to communicate.

def boxplot_stats(series, label):
    # Quartiles via pandas' default linear interpolation
    Q1, median, Q3 = series.quantile([0.25, 0.50, 0.75])
    IQR   = Q3 - Q1
    # Fences: anything beyond them counts as an outlier
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = series[(series < lower) | (series > upper)].tolist()

    # :g keeps fractional quartiles visible instead of rounding them away
    print(f"  {label}")
    print(f"    Q1={Q1:g}  Median={median:g}  Q3={Q3:g}  IQR={IQR:g}")
    print(f"    Fences: [{lower:g}, {upper:g}]")
    print(f"    Outliers: {outliers if outliers else 'None'}\n")

print("=== BOXPLOT STATS BY WARD ===\n")
for ward, grp in df.groupby('ward'):
    boxplot_stats(grp['los_days'], ward)

=== BOXPLOT STATS BY WARD ===

  Emergency
    Q1=1  Median=2  Q3=6  IQR=5
    Fences: [-6.5, 13.5]
    Outliers: [15, 18]

  Medical
    Q1=5  Median=6  Q3=7.25  IQR=2.25
    Fences: [1.625, 10.625]
    Outliers: [21]

  Surgical
    Q1=10  Median=11  Q3=12.25  IQR=2.25
    Fences: [6.625, 15.625]
    Outliers: None

What just happened?

pandas' .quantile() takes a list and returns all three percentiles at once (using linear interpolation by default). Fences are simple arithmetic. Emergency has two outliers (15 and 18 days), roughly 7.5× and 9× the median. Medical has one (21 days). Surgical has none at all, making it the most predictable ward in the hospital.
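With only eight patients per ward, the 75th percentile usually falls between two observed values, so the quartiles (and therefore the fences) depend on how that gap is interpolated. The sketch below assumes the Emergency values from the dataset above and compares the `interpolation` options of `Series.quantile`; `linear` is the pandas default.

```python
import pandas as pd

# Emergency ward stays from the lesson's dataset
emergency = pd.Series([1, 2, 1, 3, 1, 2, 15, 18])

# Q3 under each interpolation rule that Series.quantile supports
q3_by_rule = {
    rule: float(emergency.quantile(0.75, interpolation=rule))
    for rule in ["linear", "lower", "higher", "midpoint", "nearest"]
}
for rule, q3 in q3_by_rule.items():
    print(f"{rule:>8}: Q3 = {q3:g}")
```

The rules disagree so strongly here because Q3 falls in the 12-day gap between the 3-day and 15-day stays. On larger samples the differences are usually small, but on tiny groups it is worth stating which rule was used.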

The Three Wards — Visual Comparison

Here's what all three boxplots look like side by side — built directly from the numbers above:

Length of Stay by Ward (days)

Emergency: IQR 1–6d, median 2d, outliers at 15d and 18d

Medical: IQR 5–7.25d, median 6d, outlier at 21d

Surgical: IQR 10–12.25d, median 11d, no outliers

Red dots = outliers. Surgical has none — its box sits highest because stays are genuinely longer, not because of outliers.
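A minimal matplotlib sketch of this side-by-side chart, assuming the same length-of-stay values as the dataset above. The off-screen `Agg` backend and the output filename are assumptions for running it as a script; in a notebook you would normally drop both and let the figure display inline.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; not needed in a notebook
import matplotlib.pyplot as plt

stays = {
    "Emergency": [1, 2, 1, 3, 1, 2, 15, 18],
    "Medical":   [4, 6, 5, 7, 8, 5, 6, 21],
    "Surgical":  [9, 11, 10, 13, 12, 14, 11, 10],
}

fig, ax = plt.subplots(figsize=(6, 4))
# whis=1.5 is matplotlib's default: whiskers stop at the last point within 1.5×IQR
bp = ax.boxplot(list(stays.values()), whis=1.5,
                flierprops={"marker": "o", "markerfacecolor": "red"})
ax.set_xticklabels(list(stays.keys()))
ax.set_ylabel("Length of stay (days)")
ax.set_title("Length of Stay by Ward")
fig.savefig("los_boxplot.png", dpi=100)  # hypothetical output filename

# The outlier dots matplotlib drew, one list per ward
fliers = {ward: line.get_ydata().tolist()
          for ward, line in zip(stays, bp["fliers"])}
print(fliers)
```

Reading the `fliers` entry back out of the returned dictionary is a quick way to confirm that the dots on the chart match the outliers computed by hand.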

Boxplot vs Violin — What the Violin Adds

A boxplot summarises with five numbers. A violin plot wraps the full distribution shape around the outside — wider = more patients at that stay length. It catches things boxplots hide: bimodal distributions (two humps), unusual gaps, and whether a distribution is truly symmetric or just happens to have a centred median.

Use a Boxplot when:

Comparing many groups side by side
You specifically need to see outliers
Sample size is small (<30)

Use a Violin when:

You suspect bimodality or unusual shape
Distribution shape matters for your analysis
Sample size is larger (n≥30 per group)
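To see why bimodality is the classic case for a violin, consider this sketch with a hypothetical ward (the values are invented for illustration, not part of the lesson's dataset) that mixes short-stay and long-stay patients. The boxplot numbers look unremarkable, while a simple per-day count exposes the two humps that a violin's density outline would draw.

```python
import numpy as np

# Hypothetical ward: short stays (2-4 days) and long stays (10-12 days), 35 patients each
stays = np.array([2]*10 + [3]*15 + [4]*10 + [10]*10 + [11]*15 + [12]*10)

q1, median, q3 = np.percentile(stays, [25, 50, 75])
print(f"Boxplot view: Q1={q1:g}  median={median:g}  Q3={q3:g}")

# Text histogram: one row per day, bar length = number of patients
counts = np.bincount(stays)
for day in range(int(stays.min()), int(stays.max()) + 1):
    print(f"{day:>3}d | {'#' * int(counts[day])}")
```

The box runs from 3 to 11 days with a median of 7, yet no patient stayed between 5 and 9 days. A boxplot would show a wide, symmetric box; `seaborn.violinplot` or matplotlib's `Axes.violinplot` would show two bulges with a pinched waist in between.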

Step 2 — The Plain-English Findings

The most important output isn't the chart; it's what you tell the operations director. Here's the code that translates boxplot statistics into actionable findings.

print("=== BOARD BRIEFING: LENGTH-OF-STAY FINDINGS ===\n")

for ward, grp in df.groupby('ward'):
    s      = grp['los_days']
    Q1     = s.quantile(0.25)
    median = s.median()
    Q3     = s.quantile(0.75)
    mean   = s.mean()
    IQR    = Q3 - Q1
    outliers = s[s > Q3 + 1.5*IQR].tolist()   # long stays only: values above the upper fence

    print(f"  {ward} Ward")
    print(f"  Typical stay: {Q1:.0f}–{Q3:.0f} days  (median={median:.0f}d, mean={mean:.1f}d)")

    # Flag when mean is misleadingly high due to outliers
    if mean > median * 1.25:
        pct_inflated = (mean/median - 1)*100
        print(f"  ⚠  Mean is {pct_inflated:.0f}% above median — use median for planning, not mean")

    # Report outliers in plain English
    if outliers:
        print(f"  ⚠  {len(outliers)} outlier(s): {outliers} days — "
              f"{max(outliers)/median:.0f}x longer than typical stay")
        print(f"     Action: investigate case complexity and flag for clinical review")
    else:
        print(f"  ✓  No outliers — this ward is suitable for capacity planning based on historical data")

    print()

=== BOARD BRIEFING: LENGTH-OF-STAY FINDINGS ===

  Emergency Ward
  Typical stay: 1–6 days  (median=2d, mean=5.4d)
  ⚠  Mean is 169% above median — use median for planning, not mean
  ⚠  2 outlier(s): [15, 18] days — 9x longer than typical stay
     Action: investigate case complexity and flag for clinical review

  Medical Ward
  Typical stay: 5–7 days  (median=6d, mean=7.8d)
  ⚠  Mean is 29% above median — use median for planning, not mean
  ⚠  1 outlier(s): [21] days — 4x longer than typical stay
     Action: investigate case complexity and flag for clinical review

  Surgical Ward
  Typical stay: 10–12 days  (median=11d, mean=11.2d)
  ✓  No outliers — this ward is suitable for capacity planning based on historical data

What just happened?

pandas' .median(), .mean(), and .quantile() run inside a groupby loop. The 25% inflation check flags when the mean is misleading. The outlier list is generated with a single boolean filter.

The critical insight for the operations director: Emergency's mean (5.4 days) is 169% above the median (2 days). If she plans bed capacity around the mean, she assumes a typical stay about 2.7× longer than the median patient's. The median is the better planning number. Surgical's near-identical mean and median (11.2 vs 11) mean it's safe to plan with either, because the ward is genuinely predictable.
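A quick way to show a stakeholder why the median is the safer number is to recompute both statistics with and without the long stays. This is a minimal sketch assuming the Emergency values from the dataset above; the row labels and the cut-off of 15 days are illustrative choices.

```python
import pandas as pd

emergency = pd.Series([1, 2, 1, 3, 1, 2, 15, 18])   # Emergency ward, days
without_long = emergency[emergency < 15]             # drop the 15- and 18-day stays

summary = pd.DataFrame({
    "mean":   [emergency.mean(), without_long.mean()],
    "median": [emergency.median(), without_long.median()],
}, index=["all 8 patients", "without 2 long stays"])
print(summary.round(2))
```

Dropping two patients moves the mean from 5.375 to about 1.67 days, while the median only moves from 2 to 1.5. A statistic that swings that much on two cases is a poor basis for bed planning.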

Teacher's Note

The boxplot is one of the best "is my mean trustworthy?" checks you have. If the median line sits far from the centre of the box, or if outlier dots appear above the upper whisker, the mean is being distorted. Always check the boxplot before reporting a mean to a stakeholder.

In a real notebook: seaborn.boxplot(x='ward', y='los_days', data=df) draws the full chart in one line. The statistics underneath are exactly what we computed manually — seaborn just draws the shape automatically.

Practice Questions

1. In a boxplot, which line inside the box represents the 50th percentile of the data?

2. You suspect a distribution might be bimodal (two humps). Which plot type would reveal the full shape and confirm this?

3. Emergency ward mean stay is 5.4 days, median is 2 days. Which measure should the operations director use for bed planning?


Up Next · Lesson 30

Correlation Maps

Build and read a full correlation heatmap — and learn what it's actually telling you about your features before you model.
