EDA Course
Distributions
Before you run a single statistical test or build a single model, you need to understand the shape of your data — because that shape determines which methods are valid, which assumptions hold, and where your analysis can go wrong.
The Shape of Data Tells a Story
Think about the heights of 10,000 adults. Most people cluster around 5'7"–5'9". Very few are 4'10" or 6'6". That natural clustering — a mountain of values in the middle, tapering off symmetrically on both sides — is what statisticians call a normal distribution. It's the most famous shape in statistics, but it's far from the only one you'll meet in real data.
Now think about annual incomes. Most people earn modest amounts, but a small number earn millions. That's a right-skewed distribution — a big pile on the left, a long tail dragging to the right. Or think about a die roll: every number from 1 to 6 has exactly the same chance. That's a uniform distribution. Each shape requires a different analytical approach.
🧭 Why Distributions Matter
Most statistical tests (t-tests, ANOVA, linear regression) assume your data — or, in regression's case, the residuals — is approximately normally distributed. Use them on heavily skewed data and your p-values become unreliable. Knowing the shape upfront tells you which tests are valid and which need to be swapped for non-parametric alternatives.
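If scipy is available in your environment (an assumption — this lesson's core stack is pandas, numpy, and matplotlib), the parametric/non-parametric swap mentioned above looks like this. A minimal sketch comparing a t-test with its rank-based alternative, the Mann-Whitney U test, on deliberately skewed synthetic data:

```python
import numpy as np
from scipy import stats  # assumed available — not part of this lesson's core stack

np.random.seed(0)
# Two right-skewed samples, e.g. session times for two user cohorts
group_a = np.random.exponential(scale=5, size=80)
group_b = np.random.exponential(scale=8, size=80)

# Parametric test — its p-value relies on (roughly) normal data
t_stat, t_p = stats.ttest_ind(group_a, group_b)
# Rank-based alternative — makes no normality assumption
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

print(f"t-test p-value:       {t_p:.4f}")
print(f"Mann-Whitney p-value: {u_p:.4f}")
```

The two tests can disagree on skewed data like this — which is exactly why knowing the shape first matters.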
The Four Shapes You'll See Most
🔔 Normal (Gaussian)
Symmetric bell curve. Mean = Median = Mode. Heights, test scores, measurement errors. Most statistical tools assume this shape.
➡️ Right-Skewed (Positive)
Long tail to the right. Mean > Median. Incomes, house prices, social media follower counts. Use median, not mean.
⬅️ Left-Skewed (Negative)
Long tail to the left. Mean < Median. Age at retirement, scores on easy exams. Less common in practice.
〰️ Bimodal / Uniform
Two peaks (bimodal) or flat (uniform). Bimodal suggests two sub-populations mixed together. Uniform = equal probability across range.
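The skewness fingerprints in the cards above can be checked numerically. A hedged sketch (synthetic data, not from this lesson's scenarios) that generates one column per shape and prints each skewness — note that bimodal data needs two groups mixed together, which a later section in this lesson covers:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
shapes = pd.DataFrame({
    'normal':     np.random.normal(loc=50, scale=10, size=1000),    # symmetric bell
    'right_skew': np.random.exponential(scale=10, size=1000),       # long right tail
    'left_skew':  100 - np.random.exponential(scale=10, size=1000), # mirrored → left tail
    'uniform':    np.random.uniform(low=0, high=100, size=1000),    # flat, no peak
})
# Skew near 0 for normal/uniform, clearly positive/negative for the skewed pair
print(shapes.skew().round(2))
```

Skew ≈ 0 for both the normal and uniform columns is a reminder that skewness alone can't distinguish every shape — only a plot can.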
Visualising Distributions with Histograms
The fastest way to see your distribution's shape is a histogram. It divides your column's range into equal-width bins and counts how many values fall into each bin. The resulting bar chart reveals the shape immediately — no maths required at first glance.
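The bin-and-count step can be seen directly with np.histogram — the same computation a histogram plot performs under the hood, shown here on a tiny hand-picked sample (illustrative values, not the lesson's dataset):

```python
import numpy as np

values = np.array([62.0, 65.5, 68.0, 70.5, 71.0, 72.5, 73.0, 75.5, 78.0, 84.0])
# Four equal-width bins spanning min..max — returns per-bin counts and bin edges
counts, edges = np.histogram(values, bins=4)
for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    # A text-mode histogram: one '#' per value in the bin
    print(f"{lo:5.1f}-{hi:5.1f} | {'#' * count} ({count})")
# counts come out as [2, 4, 3, 1] for this sample — the tallest bar is the middle bin
```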
The scenario: You're a data analyst at a healthcare company. Your team has collected patient resting heart rate data from a wellness app, and you've been asked to prepare an initial distribution report before the medical statistics team runs their analysis. They need to know whether heart rates are roughly normally distributed (which would validate their planned t-test comparisons between age groups) or whether the data is skewed (which would require them to switch to a Mann-Whitney U test instead). Your job is to produce histograms and basic distribution statistics for the dataset and write a brief interpretation for the medical team's report.
import pandas as pd # DataFrame construction and .describe() stats
import numpy as np # np.random for generating realistic sample data
import matplotlib.pyplot as plt # plotting engine — histograms, axis labels, layout
# Seed for reproducibility — same random numbers every run
np.random.seed(42)
# Generate 200 realistic resting heart rates — normally distributed
# Mean 72 bpm, std 10 bpm — a clinically reasonable spread
heart_rates = pd.DataFrame({
'bpm': np.random.normal(loc=72, scale=10, size=200).round(1)
})
# Print summary statistics first — mean, median, std, min, max
print("Heart rate summary statistics:")
print(heart_rates['bpm'].describe().round(2))
print()
# Also print skewness — close to 0 means symmetric (normal)
print(f"Skewness: {heart_rates['bpm'].skew():.3f}")
print(f"Mean: {heart_rates['bpm'].mean():.2f} bpm")
print(f"Median: {heart_rates['bpm'].median():.2f} bpm")
print()
# Plot the histogram — bins=20 gives a good resolution for 200 rows
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(heart_rates['bpm'], bins=20, color='#7dd3fc', edgecolor='white')
ax.axvline(heart_rates['bpm'].mean(), color='#1d4ed8', linestyle='--', label='Mean')
ax.axvline(heart_rates['bpm'].median(), color='#dc2626', linestyle=':', label='Median')
ax.set_xlabel('Heart Rate (bpm)')
ax.set_ylabel('Frequency')
ax.set_title('Resting Heart Rate Distribution — 200 Patients')
ax.legend()
plt.tight_layout()
plt.show()
Heart rate summary statistics:
count    200.00
mean      72.21
std        9.87
min       42.60
25%       65.60
50%       72.35
75%       78.72
max       99.50
Name: bpm, dtype: float64

Skewness: 0.041
Mean: 72.21 bpm
Median: 72.35 bpm

A histogram renders with a clear bell-curve shape centred around 72 bpm. The blue dashed mean line and red dotted median line sit almost exactly on top of each other — the hallmark of a symmetric, normal distribution. Bars taper symmetrically toward both tails, with no values below ~42 or above ~100.
💡 What just happened?
We used three libraries here. numpy generated the data with np.random.normal() — the standard tool for simulating normally distributed samples from a mean and standard deviation. pandas gave us .describe() and .skew() to quantify the shape numerically. matplotlib drew the histogram and overlaid vertical lines for the mean and median using axvline(). The skewness of 0.041 is extremely close to 0 — effectively symmetric. The mean (72.21) and median (72.35) are nearly identical, which is the textbook fingerprint of a normal distribution. The medical team can proceed with their t-test.
Spotting a Right-Skewed Distribution
Right-skewed data is everywhere in business and economics. Whenever a value can't go below zero but can theoretically climb very high — income, website session duration, loan amounts, bug fix times — you'll almost always see a right skew. The telltale sign is a large gap between the mean and the median: the mean gets pulled upward by the long tail, while the median stays near where most values actually live.
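The mean-pull effect is easiest to see on a tiny made-up example — nine modest values plus one extreme one (hypothetical numbers, chosen for illustration):

```python
import numpy as np

# Nine modest incomes plus one outlier, in thousands — hypothetical numbers
incomes = np.array([32, 35, 38, 40, 41, 43, 45, 48, 52, 900])
print(f"Mean:   {incomes.mean():.1f}k")     # 127.4k — dragged up by one value
print(f"Median: {np.median(incomes):.1f}k") # 42.0k — stays with the majority
```

One value out of ten triples the mean, while the median barely notices — the same mechanism, at scale, that separates mean and median in any right-skewed column.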
The scenario: You work at an e-learning platform. Your product team wants to understand how long users spend on a single lesson before closing the tab. The engagement team suspects the data is skewed — most users finish quickly, but a small group of highly engaged users spend very long sessions. Your manager has asked you to confirm this with a histogram and report the mean vs median gap, because the marketing team has been quoting the mean session time in their press releases, and the product team thinks it's misleading. You need to produce a clear visualisation and the key numbers to make the case.
import pandas as pd # describe(), mean(), median(), skew()
import numpy as np # np.random.exponential() for right-skewed data generation
import matplotlib.pyplot as plt # histogram with mean/median overlay
np.random.seed(7)
# Simulate lesson session durations in minutes
# Exponential distribution naturally produces right-skewed data
# scale=8 means average session ~8 min, but a long tail of power users
raw_sessions = np.random.exponential(scale=8, size=300)
# Clip to realistic bounds — sessions can't be negative or over 3 hours
sessions = pd.DataFrame({
'duration_min': np.clip(raw_sessions, 0.5, 120).round(1)
})
# Calculate and print the key numbers for the product team report
mean_val = sessions['duration_min'].mean()
median_val = sessions['duration_min'].median()
skew_val = sessions['duration_min'].skew()
print(f"Mean session duration: {mean_val:.2f} min ← what marketing quotes")
print(f"Median session duration: {median_val:.2f} min ← what most users actually do")
print(f"Skewness: {skew_val:.3f} ← positive = right tail")
print()
print(f"Mean is {mean_val - median_val:.1f} minutes HIGHER than median — skew inflating the average")
# Plot the distribution — right tail should be clearly visible
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(sessions['duration_min'], bins=30, color='#fca5a5', edgecolor='white')
ax.axvline(mean_val, color='#dc2626', linestyle='--', linewidth=2, label=f'Mean ({mean_val:.1f} min)')
ax.axvline(median_val, color='#1d4ed8', linestyle=':', linewidth=2, label=f'Median ({median_val:.1f} min)')
ax.set_xlabel('Session Duration (minutes)')
ax.set_ylabel('Number of Users')
ax.set_title('Lesson Session Duration — Right-Skewed Distribution')
ax.legend()
plt.tight_layout()
plt.show()
Mean session duration: 8.31 min ← what marketing quotes
Median session duration: 5.62 min ← what most users actually do
Skewness: 2.187 ← positive = right tail

Mean is 2.7 minutes HIGHER than median — skew inflating the average

A histogram renders. The tallest bars are clustered between 0–10 minutes, with a sharp drop-off and then a long, low tail extending out to ~90 minutes. The red dashed mean line sits noticeably to the right of the blue dotted median — the visual gap between them tells the skew story at a glance.
💡 What just happened?
numpy was used for np.random.exponential(), which is the standard function for generating right-skewed data — the exponential distribution naturally produces the "big pile near zero, long tail to the right" shape that mirrors real-world session durations, incomes, and wait times. np.clip() bounded the values to a realistic range. pandas computed the mean, median, and skew. matplotlib overlaid both reference lines on the histogram. The 2.7-minute gap between mean and median is the smoking gun — marketing has been quoting an average that the majority of users never experience. The median of 5.62 minutes is the honest number.
HTML Mockup — Distribution Shape Gallery
Here's a visual reference of the four core distribution shapes — built in pure HTML and CSS so you can see the silhouette of each type clearly.
Distribution Shape Gallery
🔔 Normal Distribution
Symmetric · Mean ≈ Median · Skew ≈ 0
➡️ Right-Skewed
Tail right · Mean > Median · Skew > 0
⬅️ Left-Skewed
Tail left · Mean < Median · Skew < 0
〰️ Bimodal
Two peaks · Suggests two sub-groups mixed
Detecting a Bimodal Distribution
A bimodal distribution has two distinct peaks. It almost always signals that two different sub-populations have been mixed together in the same column. On its own, the mean is completely meaningless for bimodal data — it falls right in the valley between the two peaks, describing nobody.
The scenario: You're a data analyst at a gym chain. The marketing team is planning a single campaign targeting the "average" member based on workout duration. Before they finalise the brief, you suspect the duration data might be bimodal — the gym serves both a morning crowd of dedicated athletes (60–90 minute sessions) and a lunchtime crowd of office workers (20–35 minutes). If you're right, a single "average" is meaningless and the marketing team should be running two separate campaigns. You need to prove it with a histogram and report both peaks clearly.
import pandas as pd # DataFrame, concat(), describe()
import numpy as np # np.random.normal() to simulate two separate groups
import matplotlib.pyplot as plt # histogram — bins reveal the two peaks
np.random.seed(21)
# Group 1: lunchtime office workers — shorter sessions centred at 28 min
office_crowd = np.random.normal(loc=28, scale=5, size=150)
# Group 2: morning athletes — longer sessions centred at 72 min
athletes = np.random.normal(loc=72, scale=8, size=120)
# Combine both groups into one DataFrame — as they appear in the real data
workouts = pd.DataFrame({
'duration_min': np.clip(
np.concatenate([office_crowd, athletes]), 5, 120
).round(1)
})
# The mean of bimodal data is deceptive — watch where it falls
mean_val = workouts['duration_min'].mean()
median_val = workouts['duration_min'].median()
print(f"Mean duration: {mean_val:.1f} min")
print(f"Median duration: {median_val:.1f} min")
print(f"Skewness: {workouts['duration_min'].skew():.3f}")
print("Warning: mean falls between two groups — it describes neither!")
# Plot — use enough bins to clearly show both peaks
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(workouts['duration_min'], bins=35, color='#c4b5fd', edgecolor='white')
ax.axvline(mean_val, color='#7c3aed', linestyle='--', linewidth=2,
label=f'Mean ({mean_val:.1f} min) — in the valley!')
ax.set_xlabel('Workout Duration (minutes)')
ax.set_ylabel('Number of Members')
ax.set_title('Workout Duration — Two Distinct Member Groups')
ax.legend()
plt.tight_layout()
plt.show()
Mean duration: 47.6 min
Median duration: 38.4 min
Skewness: 0.312
Warning: mean falls between two groups — it describes neither!

A histogram renders with two clear peaks — one around 28 minutes and a second around 72 minutes, with a visible valley between them at roughly 50 minutes. The purple dashed mean line sits at 47.6 minutes, right inside the valley — almost no member actually has a session that length. The skewness of 0.312 looks mild and would not raise red flags on its own, which is exactly why visually inspecting histograms is non-negotiable.
💡 What just happened?
numpy was used twice — np.random.normal() generated each group separately, and np.concatenate() merged the two arrays into one, simulating what happens in real data when two populations share the same column. matplotlib drew the histogram with enough bins (35) to resolve both peaks clearly — too few bins would have blurred them together. The most important insight here: the skewness is only 0.312, which looks almost normal by that metric alone. Only the histogram catches the bimodal shape. This is the strongest argument for always plotting before you calculate.
Reading .describe() for Distribution Clues
You won't always have time to plot everything. .describe() gives you eight quick numbers that together tell a distribution story — if you know how to read them. The gap between mean and median, the spread between 25th and 75th percentiles, and the relationship between std and the IQR all carry shape information.
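The std-vs-IQR clue deserves a concrete check. For normal data, std ≈ IQR / 1.349 (a standard rule of thumb); when std comes out clearly larger, heavy tails are inflating it. A sketch on synthetic right-skewed data (not one of this lesson's scenario datasets):

```python
import numpy as np
import pandas as pd

np.random.seed(3)
col = pd.Series(np.random.exponential(scale=4, size=2000))  # right-skewed sample

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
# For normal data these two numbers roughly agree; here std runs noticeably larger
print(f"std: {col.std():.2f}   IQR/1.349: {iqr / 1.349:.2f}")
```

A std well above IQR/1.349 is a tail warning you can read straight off .describe() — no plot needed for the first pass.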
The scenario: You've inherited a dataset of customer support ticket resolution times from a previous analyst who left no documentation. Your team lead needs a quick distribution summary of three key columns before a meeting in 20 minutes. You have no time to produce plots — you need to read the shape from numbers alone and report back verbally with confidence.
import pandas as pd # describe(), skew(), and kurtosis() for shape fingerprint
import numpy as np # data generation for realistic simulation
np.random.seed(99)
# Customer support ticket data — three columns with different shapes
tickets = pd.DataFrame({
# Resolution time (hours) — right-skewed: most fast, a few nightmares
'resolve_hours': np.clip(
np.random.exponential(scale=4, size=200), 0.5, 72
).round(1),
# Agent satisfaction score — normally distributed 1–5 scale
'satisfaction': np.clip(
np.random.normal(loc=3.5, scale=0.8, size=200), 1, 5
).round(1),
# Number of reply messages — integers, right-skewed
'reply_count': np.clip(
np.random.exponential(scale=3, size=200).astype(int), 1, 25
)
})
# .describe() gives count, mean, std, min, quartiles, max all at once
print("Full describe() output:")
print(tickets.describe().round(2))
print()
# Quick shape fingerprint — skewness for each column
print("Skewness per column (0 = symmetric, +ve = right tail, -ve = left tail):")
print(tickets.skew().round(3))
print()
# Mean vs median gap — big gap means skew is affecting the mean
for col in tickets.columns:
mean = tickets[col].mean()
med = tickets[col].median()
gap = abs(mean - med)
print(f"{col:16s} mean={mean:.2f} median={med:.2f} gap={gap:.2f}")
Full describe() output:
resolve_hours satisfaction reply_count
count 200.00 200.00 200.00
mean 3.84 3.47 2.88
std 4.01 0.76 2.91
min 0.50 1.10 1.00
25% 1.10 2.90 1.00
50% 2.40 3.50 2.00
75% 5.20 4.00 4.00
max 38.20 5.00 20.00
Skewness per column (0 = symmetric, +ve = right tail, -ve = left tail):
resolve_hours 2.841
satisfaction 0.042
reply_count 2.103
dtype: float64
resolve_hours mean=3.84 median=2.40 gap=1.44
satisfaction mean=3.47 median=3.50 gap=0.03
reply_count mean=2.88 median=2.00 gap=0.88
💡 What just happened?
pandas did all the heavy lifting here — .describe() runs on the whole DataFrame at once, returning an 8-row summary table for every numeric column simultaneously. .skew() is also a DataFrame-level method, computing skewness for every column in one call. The output tells a clear story: resolve_hours is heavily right-skewed (2.841) with a mean 1.44 hours higher than the median — you should report the median to your team lead, not the mean. satisfaction is nearly symmetric (skewness 0.042) — mean and median are almost identical. reply_count is also right-skewed, which makes sense: most tickets are resolved in 2 messages, but a few complex ones spiral into 20.
🍎 Teacher's Note
Here's your distribution reading checklist for any new dataset: first, run .describe() and look at the mean vs median gap for every column. Second, compute .skew() — anything above 1 or below –1 needs a closer look. Third, always plot at least one histogram per column before you trust a single number. A skewness of 0.3 looks innocent until a histogram reveals it's bimodal. Numbers describe; shapes reveal. Both are necessary.
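The first two checklist steps can be wrapped in a small helper. Below, shape_report is a hypothetical function written for this checklist — it is not a pandas method — that computes the mean-median gap and flags any column whose |skew| exceeds 1:

```python
import numpy as np
import pandas as pd

def shape_report(df: pd.DataFrame) -> pd.DataFrame:
    """Mean-median gap and a |skew| > 1 flag for every numeric column.

    Hypothetical helper for the reading checklist — not part of pandas.
    """
    numeric = df.select_dtypes(include='number')
    report = pd.DataFrame({
        'mean': numeric.mean(),
        'median': numeric.median(),
        'skew': numeric.skew(),
    })
    report['gap'] = (report['mean'] - report['median']).abs()
    report['flag'] = np.where(report['skew'].abs() > 1,
                              'inspect histogram', 'looks symmetric')
    return report.round(2)

np.random.seed(5)
demo = pd.DataFrame({
    'symmetric': np.random.normal(loc=50, scale=5, size=300),
    'skewed':    np.random.exponential(scale=6, size=300),
})
print(shape_report(demo))
```

Remember the bimodal caveat: "looks symmetric" only means the skew check passed — step three of the checklist, plotting a histogram, still applies.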
Practice Questions
1. Which distribution shape is symmetric, has mean equal to median, and is assumed by most classical statistical tests?
2. When a column has a skewness of 2.8 (right-skewed), which measure of central tendency should you report as the typical value?
3. A histogram shows two distinct peaks with a valley in the middle. What type of distribution is this?
Quiz
1. A salary column has a skewness of 3.1. Which statement about its mean and median is correct?
2. A column's skewness is 0.3 — close to zero. But you suspect a bimodal distribution. What should you do?
3. Which numpy function was used in this lesson to simulate a right-skewed distribution?
Up Next · Lesson 11
Normality Checks
Looking normal isn't the same as being normal — learn the statistical tests that formally confirm whether your data meets the normality assumption your models depend on.