EDA Course
Dispersion Measures — How Spread Out Is Your Data?
The average tells you where the centre is. But two datasets can have the exact same average and look completely different. Dispersion measures tell you the other half of the story — how spread out, how clustered, how wild your data actually is.
Same Average. Completely Different Data.
Here's something that surprises a lot of people. Consider two call centres. Both have an average call handling time of 6 minutes. Your manager looks at the report and says "they're identical — both performing the same."
But Centre A handles calls in 5, 6, 6, 6, 7 minutes — tight and consistent. Centre B handles calls in 1, 2, 6, 10, 11 minutes — all over the place. Same average. Wildly different reality. Centre B has agents who either rush customers off in under 2 minutes or keep them on hold for 11. That's a training problem, a quality problem, a customer satisfaction problem. And the average hid all of it.
Dispersion describes how spread out values are around the centre. The four main measures are range, variance, standard deviation, and interquartile range (IQR). Together with the mean or median, they give you the complete picture.
Range — The Simplest Spread
Range is max minus min. That's it. It tells you the full span of your data — from the smallest value to the largest. It's fast, easy, and useful for a first look. But it has one big weakness: it only cares about the two most extreme values and completely ignores everything in between.
The scenario: You manage delivery logistics for an online store. Your operations director wants to know the spread in delivery times this week. Is the range of delivery times tight (reliable service) or huge (inconsistent service that's frustrating customers)?
import pandas as pd
# Delivery times in hours for 10 orders this week
delivery_df = pd.DataFrame({
    'order_id': [4001, 4002, 4003, 4004, 4005, 4006, 4007, 4008, 4009, 4010],
    'customer': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'delivery_hours': [18, 22, 19, 48, 21, 20, 23, 19, 72, 21]  # most are 18-23hrs, two outliers
})
# .max() returns the largest value in the column
# .min() returns the smallest value in the column
max_time = delivery_df['delivery_hours'].max()
min_time = delivery_df['delivery_hours'].min()
# Range = max - min — the full span of delivery times
range_val = max_time - min_time
print(f"Fastest delivery : {min_time} hours")
print(f"Slowest delivery : {max_time} hours")
print(f"Range : {range_val} hours")
print(f"\nMean delivery : {delivery_df['delivery_hours'].mean():.1f} hours")
print(f"Median delivery : {delivery_df['delivery_hours'].median()} hours")
Fastest delivery : 18 hours
Slowest delivery : 72 hours
Range : 54 hours

Mean delivery : 28.3 hours
Median delivery : 21.0 hours
What just happened?
.max() and .min() are basic pandas Series methods. Range isn't a built-in pandas function — you calculate it manually as max - min. That's fine; it's a one-liner.
The range is 54 hours — that's enormous for a delivery service. Most orders arrive in 18–23 hours, but two orders (48 and 72 hours) are dragging the mean up to 28.3 while the median stays at 21. The range alone flagged that something is wrong. Two orders are serious outliers. A customer who waited 72 hours for a "next-day" delivery is not coming back.
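If you find yourself computing ranges often, NumPy's ptp ("peak to peak") does the same max-minus-min in one call. A quick sketch, reusing the same delivery numbers as a standalone Series (an illustration, not a new API in pandas itself):

```python
import numpy as np
import pandas as pd

# Same delivery times as above, as a standalone Series for illustration
hours = pd.Series([18, 22, 19, 48, 21, 20, 23, 19, 72, 21])

# The plain pandas one-liner from the lesson
range_pandas = hours.max() - hours.min()

# NumPy's ptp ("peak to peak") computes max - min in a single call
range_numpy = np.ptp(hours.to_numpy())

print(range_pandas, range_numpy)  # → 54 54
```

Both give the same answer; ptp just saves a little typing when you compute ranges for many columns.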
Variance — How Far Values Stray From the Mean
Variance measures the average squared distance of each value from the mean. The "squared" part sounds odd, but it serves two purposes: it makes all distances positive (so negative and positive deviations don't cancel out), and it penalises large deviations heavily.
The main downside of variance? The units are squared — so if your data is in hours, your variance is in hours². That's hard to interpret directly. That's why we almost always use standard deviation instead, which brings the units back to normal.
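It can help to compute variance by hand once, to see the "average squared distance" definition at work. A minimal sketch with made-up numbers (not from the lesson's datasets):

```python
import pandas as pd

# Made-up numbers, just to see the formula at work
s = pd.Series([2, 4, 6, 8])

mean = s.mean()                                  # 5.0
squared_devs = (s - mean) ** 2                   # 9.0, 1.0, 1.0, 9.0
manual_var = squared_devs.sum() / (len(s) - 1)   # divide by n-1: 20.0 / 3

# The manual result matches .var(), which divides by n-1 by default
print(round(manual_var, 2), round(s.var(), 2))   # → 6.67 6.67
```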
The scenario: You're comparing the consistency of two sales reps. Both close an average of 8 deals per week. But you suspect one is wildly inconsistent — feast-or-famine — while the other is steady and reliable. Variance will reveal the difference.
import pandas as pd
# Weekly deals closed by two sales reps over 8 weeks
sales_df = pd.DataFrame({
    'week': [1, 2, 3, 4, 5, 6, 7, 8],
    'rep_alice': [8, 7, 9, 8, 8, 9, 7, 8],  # very consistent — always 7-9
    'rep_bob': [2, 15, 3, 14, 1, 16, 4, 9]  # feast or famine — wild swings
})
# .mean() — confirm both reps have the same average
print(f"Alice's mean deals/week: {sales_df['rep_alice'].mean()}")
print(f"Bob's mean deals/week : {sales_df['rep_bob'].mean()}")
# .var() calculates variance — average of squared deviations from the mean
# the default ddof=1 gives sample variance (divides by n-1, not n), standard for real data
alice_var = sales_df['rep_alice'].var()
bob_var = sales_df['rep_bob'].var()
print(f"\nAlice's variance: {alice_var:.2f} deals²")
print(f"Bob's variance : {bob_var:.2f} deals²")
print(f"\nBob's variance is {bob_var/alice_var:.0f}x higher than Alice's")
Alice's mean deals/week: 8.0
Bob's mean deals/week : 8.0

Alice's variance: 0.57 deals²
Bob's variance : 39.43 deals²

Bob's variance is 69x higher than Alice's
What just happened?
.var() is a pandas Series method that computes the sample variance. The ddof=1 parameter is the default — it divides by n-1 rather than n, which gives a better estimate of population variance from a sample. You usually don't need to change this.
Same average — 8 deals per week each. But Alice's variance is 0.57 and Bob's is 39.43. Bob is 69× more volatile than Alice. If you're forecasting pipeline revenue, Alice's numbers are reliable. Bob's are a coin flip. The average hid all of that. Variance exposed it.
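If you ever do need the population version (divide by n instead of n-1), pass ddof=0 explicitly. A small sketch on assumed toy numbers:

```python
import pandas as pd

# Assumed toy numbers for illustration
s = pd.Series([2, 15, 3, 14])

# ddof=1 (the default): sample variance, divides by n-1
sample_var = s.var()
# ddof=0: population variance, divides by n; use only when the data
# really is the entire population, not a sample from it
population_var = s.var(ddof=0)

print(round(sample_var, 2), round(population_var, 2))  # → 48.33 36.25
```

The sample version is always slightly larger, and the gap shrinks as n grows.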
Standard Deviation — Variance in Human Units
Standard deviation is just the square root of variance. That one step brings the units back from "deals squared" to "deals" — back to something you can actually talk about with a manager or put in a report.
The rule of thumb for normally distributed data: about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. This is the famous 68-95-99.7 rule, and it makes standard deviation the most useful dispersion measure in day-to-day data work.
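You can check the 68-95-99.7 rule empirically by simulating normal data. A sketch using NumPy's random generator (the seed and sample size are arbitrary choices):

```python
import numpy as np

# 100,000 simulated values from a normal distribution (mean 0, std dev 1);
# the seed and sample size are arbitrary choices for this demo
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=100_000)

within_1 = np.mean(np.abs(data) < 1)  # fraction within 1 std dev of the mean
within_2 = np.mean(np.abs(data) < 2)  # fraction within 2 std devs
within_3 = np.mean(np.abs(data) < 3)  # fraction within 3 std devs

print(f"{within_1:.2f} {within_2:.2f} {within_3:.2f}")  # ≈ 0.68 0.95 1.00
```

With 100,000 draws the simulated fractions land within a fraction of a percent of the theoretical 68%, 95% and 99.7%.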
The scenario: You're an analyst at a manufacturing plant. A machine stamps metal parts, and each part should weigh exactly 500g. Quality control wants to know how consistent the machine is — how far off does it typically run?
import pandas as pd
import numpy as np
# Weight of 12 metal parts stamped by the machine (should be 500g each)
parts_df = pd.DataFrame({
    'part_id': range(1, 13),
    'weight_g': [499.2, 500.8, 498.7, 501.3, 499.9, 500.4,
                 500.1, 499.5, 501.0, 498.9, 500.6, 499.8]
})
# .std() calculates standard deviation — square root of variance
# This is in the same units as the data (grams), which makes it interpretable
std_dev = parts_df['weight_g'].std()
mean_wt = parts_df['weight_g'].mean()
print(f"Target weight : 500.0g")
print(f"Mean weight : {mean_wt:.2f}g")
print(f"Std deviation : {std_dev:.2f}g")
# The 68% rule: most parts should fall within mean ± 1 std dev
lower_1std = mean_wt - std_dev
upper_1std = mean_wt + std_dev
print(f"\n1 std dev range : {lower_1std:.2f}g to {upper_1std:.2f}g")
# Count how many parts actually fall within that range
within_1std = parts_df[(parts_df['weight_g'] >= lower_1std) &
                       (parts_df['weight_g'] <= upper_1std)].shape[0]
total_parts = len(parts_df)
print(f"Parts within range: {within_1std} out of {total_parts} ({within_1std/total_parts*100:.0f}%)")
Target weight : 500.0g
Mean weight : 500.02g
Std deviation : 0.84g

1 std dev range : 499.18g to 500.85g
Parts within range: 8 out of 12 (67%)
What just happened?
.std() is a pandas Series method — it computes the sample standard deviation (again, ddof=1 by default). We also used boolean indexing — filtering the DataFrame with a condition inside square brackets — to count how many parts fell within the expected range.
The machine's average weight is spot on at 500.02g. But the standard deviation of 0.84g means a typical part is about 0.84g away from the target. For precision manufacturing that might be acceptable — or it might not be, depending on the tolerance spec. 8 of the 12 parts (67%) fall within 1 standard deviation, close to the expected 68% for a normal distribution. The machine looks reasonably consistent.
IQR — The Outlier-Resistant Spread
The Interquartile Range (IQR) is the distance between the 25th percentile (Q1) and the 75th percentile (Q3). In plain English — it's the range of the middle 50% of your data. The bottom 25% and top 25% are completely ignored.
That makes IQR incredibly robust to outliers. One enormous value at the top won't move the IQR at all. This is why IQR is the measure used in box plots, and why it's the standard tool for outlier detection — which we cover in depth in Lesson 8.
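As a quick preview of that outlier-detection role, the standard fence is 1.5 × IQR beyond each quartile (Tukey's rule). A sketch on assumed toy wait times:

```python
import pandas as pd

# Assumed toy wait times with two obvious outliers
waits = pd.Series([4, 6, 5, 7, 3, 8, 5, 6, 45, 52])

q1, q3 = waits.quantile(0.25), waits.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag anything more than 1.5 * IQR beyond either quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = waits[(waits < lower_fence) | (waits > upper_fence)]

print(outliers.tolist())  # → [45, 52]
```

The two extreme waits get flagged while every normal wait passes — exactly the behaviour a box plot's whiskers rely on.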
The scenario: You're analysing customer wait times at a bank branch. A few customers waited an extremely long time due to a system outage. You want a spread measure that reflects the typical customer experience — not one that gets pulled around by the bad afternoon.
import pandas as pd
import numpy as np
# Customer wait times in minutes — most are normal, a few are long outliers
wait_df = pd.DataFrame({
    'customer_id': range(601, 621),  # 20 customers
    'wait_mins': [4, 6, 5, 7, 3, 8, 5, 6, 4, 7,
                  5, 6, 45, 3, 8, 4, 7, 52, 5, 6]  # 45 and 52 are outliers
})
# .quantile() gives you any percentile — 0.25 = 25th, 0.75 = 75th
q1 = wait_df['wait_mins'].quantile(0.25) # 25th percentile (lower quartile)
q3 = wait_df['wait_mins'].quantile(0.75) # 75th percentile (upper quartile)
iqr = q3 - q1 # IQR = Q3 - Q1
# Compare IQR vs standard deviation — see how outliers affect each differently
std_dev = wait_df['wait_mins'].std()
mean_wt = wait_df['wait_mins'].mean()
print(f"Q1 (25th percentile): {q1} mins")
print(f"Q3 (75th percentile): {q3} mins")
print(f"IQR : {iqr} mins ← middle 50% of customers")
print(f"\nMean wait : {mean_wt:.1f} mins ← pulled up by outliers")
print(f"Std deviation : {std_dev:.1f} mins ← also inflated by outliers")
print(f"Median wait : {wait_df['wait_mins'].median()} mins ← honest middle")
Q1 (25th percentile): 4.75 mins
Q3 (75th percentile): 7.0 mins
IQR : 2.25 mins ← middle 50% of customers

Mean wait : 9.8 mins ← pulled up by outliers
Std deviation : 13.4 mins ← also inflated by outliers
Median wait : 6.0 mins ← honest middle
What just happened?
.quantile() is a pandas method that returns the value at any given percentile. Pass 0.25 for Q1, 0.75 for Q3, 0.5 for the median. IQR is then a simple subtraction.
The two outliers (45 and 52 minutes) inflated the mean to 9.8 and the standard deviation to 13.4 — making it look like the branch routinely keeps customers waiting nearly 10 minutes. But the IQR is just 2.25 minutes — the middle 50% of customers waited between 4.75 and 7 minutes. That's the real story for 90% of your customers. Always pair IQR with the median when your data has outliers.
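One convenience worth knowing: .quantile() also accepts a list of percentiles and returns them all at once. A sketch using the same wait times as a standalone Series:

```python
import pandas as pd

# Same wait times as in the branch example, as a standalone Series
waits = pd.Series([4, 6, 5, 7, 3, 8, 5, 6, 4, 7,
                   5, 6, 45, 3, 8, 4, 7, 52, 5, 6])

# A list of percentiles returns all the cut points in one call,
# as a Series indexed by the percentiles you asked for
quartiles = waits.quantile([0.25, 0.5, 0.75])
print(quartiles.tolist())  # → [4.75, 6.0, 7.0]
```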
The Full Picture — All Four Together
Knowing the four measures individually is good. Knowing how to read them together is what makes you useful on a real project.
The scenario: You're a data analyst reviewing student test scores for a national education board. They want a complete dispersion profile — not just averages — before deciding which schools need extra support.
import pandas as pd
import numpy as np
# Test scores for two schools — same mean, very different distributions
scores_df = pd.DataFrame({
    'school_a': [62, 65, 68, 70, 71, 72, 73, 74, 76, 79],  # consistent performers
    'school_b': [30, 40, 55, 68, 72, 74, 80, 88, 92, 96]   # very wide spread
})
# Build a complete dispersion summary for both schools
def dispersion_summary(series):
    """Return a dict of all key dispersion measures for a given series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return {
        'Mean'    : round(series.mean(), 1),
        'Median'  : series.median(),
        'Std Dev' : round(series.std(), 1),
        'Variance': round(series.var(), 1),
        'Range'   : series.max() - series.min(),
        'IQR'     : round(q3 - q1, 1),
        'Min'     : series.min(),
        'Max'     : series.max()
    }
# Generate summary for each school and display side by side
summary_a = dispersion_summary(scores_df['school_a'])
summary_b = dispersion_summary(scores_df['school_b'])
comparison = pd.DataFrame({'School A': summary_a, 'School B': summary_b})
print(comparison)
          School A  School B
Mean          71.0      69.5
Median        71.5      73.0
Std Dev        5.1      21.9
Variance      25.6     481.2
Range         17.0      66.0
IQR            5.2      27.8
Min           62.0      30.0
Max           79.0      96.0
What just happened?
We wrote a reusable Python function that packages all four dispersion measures together — this is a real pattern you'll use constantly in professional EDA. Then we used pd.DataFrame() to display both schools side by side for easy comparison.
School A and School B have almost the same mean (~70). But School B's standard deviation is 21.9 vs School A's 5.1. School B's range is 66 vs School A's 17. School B has students scoring 30 and students scoring 96 — that school needs differentiated teaching, not a one-size approach. School A is consistent across the board. The mean said they were equal. Dispersion revealed they are completely different problems.
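For comparison, pandas ships two shortcuts that cover much of this summary without a custom function: .describe() and .agg(). A sketch on the same scores:

```python
import pandas as pd

scores_df = pd.DataFrame({
    'school_a': [62, 65, 68, 70, 71, 72, 73, 74, 76, 79],
    'school_b': [30, 40, 55, 68, 72, 74, 80, 88, 92, 96]
})

# .describe() bundles count, mean, std, min, quartiles and max per column
print(scores_df.describe())

# .agg() lets you pick exactly the measures you care about
print(scores_df.agg(['mean', 'median', 'std', 'var']))
```

A custom function is still worth writing when you want measures .describe() omits, like range and IQR, in one table.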
Visual — Spread at a Glance
Here's a side-by-side mockup showing how School A (tight cluster) and School B (wide spread) look visually when you plot their score ranges. Both centred near 70 — but completely different shapes.
[Figure: Score distribution — School A vs School B. White line = median position (~71). Both schools centred similarly — but School B's bar spans nearly the full range.]
Dispersion Measures — Quick Reference
| Measure | pandas method | Best for | Weakness |
|---|---|---|---|
| Range | .max() - .min() | Quick first look at full span | Sensitive to a single outlier |
| Variance | .var() | Comparing relative spread | Units are squared — hard to interpret |
| Std Dev | .std() | Most common — same units as data | Affected by outliers like the mean |
| IQR | .quantile(0.75) - .quantile(0.25) | Skewed data and outlier detection | Ignores the tails entirely |
Teacher's Note
The most important habit to build: never report a mean without also reporting a spread measure. A mean alone is a lie of omission. If you say "average delivery time is 6 hours" — fine. But is that 6 ± 0.5 hours or 6 ± 18 hours? Those are completely different businesses.
The standard professional format is: mean ± std dev for symmetric data, or median (IQR) for skewed data. Write it that way in every report, every analysis, every slide deck. Your stakeholders may not understand what standard deviation means — but they'll understand when you say "typically between X and Y" — which is exactly what mean ± std dev tells you.
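That "typically between X and Y" phrasing is easy to generate directly from the data. A sketch with assumed delivery numbers (illustration only):

```python
import pandas as pd

# Assumed delivery times in hours, for illustration only
hours = pd.Series([5.5, 6.0, 6.5, 5.8, 6.2, 6.1, 5.9, 6.0])

mean, std = hours.mean(), hours.std()
low, high = mean - std, mean + std

# mean ± std dev, phrased the way a stakeholder will actually read it
print(f"Average delivery: {mean:.1f}h (typically between {low:.1f}h and {high:.1f}h)")
# → Average delivery: 6.0h (typically between 5.7h and 6.3h)
```

For skewed data, swap in the median and quartiles: median wait, with half of all cases between Q1 and Q3.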
Practice Questions
1. Your dataset has several extreme outliers. Which dispersion measure is the most resistant to their effect?
2. Which pandas method calculates the standard deviation of a Series? (just the method name, no brackets)
3. Standard deviation is the square root of ______.
Quiz
1. You want to report how spread out customer ages are in a report that non-technical stakeholders will read. The ages are roughly normally distributed. Which measure should you use?
2. Team A has a standard deviation of 18.4 and Team B has a standard deviation of 2.1. Both have the same mean score. What does this tell you?
3. What is the main weakness of using range as a dispersion measure?
Up Next · Lesson 6
Missing Values
Missing data is in every real-world dataset you'll ever touch. Learn how to find it, classify it, and understand exactly how much damage it can do before you even start cleaning.