EDA Lesson 26 – Visualizing Distributions | Dataplexa

Intermediate Level · Lesson 26

Visualising Distributions

Numbers tell you what a distribution looks like on paper. Charts show you what it actually looks like in your gut. Once you've seen the shape of your data visually, you can't unsee it — and that mental picture is what guides every decision you make downstream.

Why Visualise Distributions at All?

You already know how to calculate skewness, find outliers, and compare mean to median. So why do you still need a chart? Because numbers can lie by omission. Two datasets can share nearly identical means, standard deviations, and correlations, and still look completely different when plotted. The classic demonstration is Anscombe's Quartet: four small datasets with almost the same summary statistics (means, variances, correlation, and fitted regression line) but wildly different shapes.

Visualising distributions catches things that summary statistics hide: bimodal peaks, unusual gaps, floor effects (lots of zeros), ceiling effects (lots of values at the maximum), and clusters you'd never notice in a table.
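To make the "same statistics, different shape" point concrete, here is a minimal sketch. The two samples and the rescale helper are hypothetical (they are not the lesson's data or Anscombe's actual numbers); both samples are forced to the same mean and standard deviation, then binned.

```python
import numpy as np

def rescale(values, target_mean, target_std):
    """Shift and scale values to a chosen mean and (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return z * target_std + target_mean

# Hypothetical samples, not the lesson's data
two_clusters = rescale([0, 1, 2, 3, 4, 96, 97, 98, 99, 100], 50, 20)
single_peak = rescale([1, 3, 4, 5, 5, 5, 5, 6, 7, 9], 50, 20)

# Five equal-width bins from 0 to 100
clusters_counts = np.histogram(two_clusters, bins=5, range=(0, 100))[0].tolist()
peak_counts = np.histogram(single_peak, bins=5, range=(0, 100))[0].tolist()

for name, data, counts in [("two_clusters", two_clusters, clusters_counts),
                           ("single_peak", single_peak, peak_counts)]:
    print(f"{name:<13} mean={data.mean():.1f}  std={data.std():.1f}  bins={counts}")
```

The mean and standard deviation match, but the bin counts do not: one sample has an empty middle, the other piles up there.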

The Dataset We'll Use

The scenario: You're an analyst at an e-learning platform. Your product team wants to understand how students are engaging with a course before deciding whether to redesign it. You have data on 20 students — their completion percentage, quiz scores, days active, and number of videos watched. Let's explore each distribution visually.

import pandas as pd
import numpy as np

# 20 students — four numeric columns to explore
df = pd.DataFrame({
    'student_id':    range(1, 21),
    'completion_pct':[12, 85, 91, 7, 88, 14, 82, 95, 9, 87,
                      11, 90, 6, 84, 93, 10, 86, 8, 89, 92],
    'quiz_score':    [45, 78, 82, 41, 79, 48, 80, 88, 43, 76,
                      44, 85, 40, 77, 84, 46, 81, 42, 83, 86],
    'days_active':   [2, 18, 21, 1, 19, 3, 17, 24, 2, 20,
                      2, 22, 1, 18, 23, 3, 19, 1, 21, 22],
    'videos_watched':[3, 24, 27, 2, 25, 4, 23, 30, 2, 26,
                      3, 28, 1, 24, 29, 3, 25, 2, 27, 28]
})

# Quick look at the shape of each column
print(df[['completion_pct','quiz_score','days_active','videos_watched']].describe().round(1))

       completion_pct  quiz_score  days_active  videos_watched
count            20.0        20.0         20.0            20.0
mean             57.0        66.4         13.0            16.8
std              39.8        19.4          9.4            12.1
min               6.0        40.0          1.0             1.0
25%              10.8        44.8          2.0             3.0
50%              84.5        77.5         18.0            24.0
75%              89.2        82.2         21.0            27.0
max              95.0        88.0         24.0            30.0

What just happened?

Notice something odd in completion_pct: the mean is about 57% but the median (the 50% row) is 84.5%. That gap of more than 27 points is a red flag: a block of very low values is dragging the mean down. The standard deviation (39.8) is nearly half of the full 89-point range, which is what you get when values pile up at both ends rather than around the middle. There are clearly two very different groups in this data. Let's visualise it and see.
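A quick way to test whether the mean describes anyone is to count the students who sit near it. The sketch below repeats the completion_pct values from the DataFrame; the 20-point window is an arbitrary choice made only for illustration.

```python
import pandas as pd

# Same completion_pct values as the lesson's DataFrame
completion = pd.Series([12, 85, 91, 7, 88, 14, 82, 95, 9, 87,
                        11, 90, 6, 84, 93, 10, 86, 8, 89, 92])

mean = completion.mean()
median = completion.median()

# Mean-median gap, expressed in standard deviations
gap_in_std = (mean - median) / completion.std()

# How many students are within 20 points of the mean? (window is an assumption)
window = 20
near_mean = int(completion.between(mean - window, mean + window).sum())

print(f"Gap between mean and median: {gap_in_std:.2f} standard deviations")
print(f"Students within {window} points of the mean: {near_mean}")
```

If nobody falls inside a window that wide, the mean is describing an empty region of the data.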

Chart Type 1 — The Histogram

A histogram groups values into equal-width buckets and draws a bar for each bucket — taller bar means more values in that range. It's the most direct way to see the shape of a distribution. In Python, you'd use matplotlib or seaborn for real charts. Here we'll build a text version that shows exactly the same information.

def text_histogram(series, col_name, n_bins=8):
    """Prints a simple text histogram so you can see the distribution shape."""
    # pd.cut() divides the values into equal-width buckets
    buckets = pd.cut(series, bins=n_bins)
    counts  = buckets.value_counts().sort_index()

    print(f"Histogram: {col_name}  (n={len(series)})\n")
    max_count = counts.max()

    for bucket, count in counts.items():
        # Scale the bar to fit within 30 characters wide
        bar_len = int(count / max_count * 30)
        bar     = '█' * bar_len
        # Show the range of this bucket and how many students are in it
        print(f"  {str(bucket):<22}  {bar:<30}  {count}")
    print()

text_histogram(df['completion_pct'], 'completion_pct')

Histogram: completion_pct  (n=20)

  (5.911, 17.125]         █████████████████████           8
  (17.125, 28.25]                                         0
  (28.25, 39.375]                                         0
  (39.375, 50.5]                                          0
  (50.5, 61.625]                                          0
  (61.625, 72.75]                                         0
  (72.75, 83.875]         ██                              1
  (83.875, 95.0]          ██████████████████████████████  11

What just happened?

pandas' pd.cut() divides the column into 8 equal-width buckets. .value_counts().sort_index() counts how many students fall into each one. The text chart makes the pattern impossible to miss: 8 students are stuck below 17%, and a completely separate cluster of 12 students sits above 80%, with nobody in between. This is a textbook bimodal distribution, and the mean of about 57% is misleading because it describes nobody's actual experience.
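As a cross-check on the bucket counts, numpy's np.histogram() applies the same equal-width idea. This is a sketch, assuming the same 8 bins over the same completion_pct values; the two functions treat bucket edges slightly differently (pd.cut() closes buckets on the right by default, np.histogram() on the left), which only matters when a value lands exactly on an edge.

```python
import numpy as np
import pandas as pd

# Same completion_pct values as the lesson's DataFrame
completion = pd.Series([12, 85, 91, 7, 88, 14, 82, 95, 9, 87,
                        11, 90, 6, 84, 93, 10, 86, 8, 89, 92])

# np.histogram returns the count per bin and the 9 edges that bound the 8 bins
counts, edges = np.histogram(completion, bins=8)

# The lesson's approach, for comparison
cut_counts = pd.cut(completion, bins=8).value_counts().sort_index()

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.3f} to {right:7.3f}: {n}")

print("Counts agree with pd.cut:", cut_counts.tolist() == counts.tolist())
```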

The Bimodal Distribution — Visualised

Here's what this bimodal completion pattern looks like as a proper chart. Two completely separate groups — the "droppers" and the "completers" — with nothing in between.

[Figure: Course Completion % distribution. Histogram with bins 0–12%, 13–25%, 26–38%, 39–51%, 52–64%, 65–77%, 78–90%, and 91–100%. Legend: early droppers and active completers.]

The mean (about 57%) sits in the empty zone between the two groups; it represents nobody's actual experience.
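The empty zone between the two groups can also be located numerically. A minimal sketch, assuming the split should fall at the widest gap between neighbouring sorted values (a simple heuristic, not a formal clustering method):

```python
import numpy as np

# Same completion_pct values as the lesson's DataFrame
completion = np.array([12, 85, 91, 7, 88, 14, 82, 95, 9, 87,
                       11, 90, 6, 84, 93, 10, 86, 8, 89, 92])

sorted_vals = np.sort(completion)
gaps = np.diff(sorted_vals)        # distance between neighbouring values
i = int(np.argmax(gaps))           # position of the widest gap

low_edge = int(sorted_vals[i])
high_edge = int(sorted_vals[i + 1])
split_point = (low_edge + high_edge) / 2

n_below = int((completion < split_point).sum())
n_above = int((completion > split_point).sum())

print(f"Widest gap: {low_edge}% to {high_edge}%  (midpoint {split_point}%)")
print(f"Students below the gap: {n_below}, above: {n_above}")
```

The midpoint of that gap is a natural place to cut the data when you split it into groups.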

Chart Type 2 — Comparing Multiple Distributions

Once you've seen one distribution, the natural next step is to compare several. Are the two student groups (droppers vs completers) also different on quiz scores and days active? Let's split the data by completion group and compare the shapes side by side.

# Split students into two groups based on completion
# "Droppers" completed less than 20%
# "Completers" completed more than 75%
droppers    = df[df['completion_pct'] < 20]
completers  = df[df['completion_pct'] > 75]

print(f"Droppers:   {len(droppers)} students")
print(f"Completers: {len(completers)} students")
print()

# Compare the two groups on quiz score, days active, and videos watched
for col in ['quiz_score', 'days_active', 'videos_watched']:
    d_mean = droppers[col].mean()
    c_mean = completers[col].mean()
    d_med  = droppers[col].median()
    c_med  = completers[col].median()

    print(f"--- {col} ---")
    print(f"  Droppers:   mean={d_mean:.1f}  median={d_med:.1f}")
    print(f"  Completers: mean={c_mean:.1f}  median={c_med:.1f}")
    print(f"  Gap: completers score {c_mean - d_mean:+.1f} higher on average")
    print()

Droppers:   8 students
Completers: 12 students

--- quiz_score ---
  Droppers:   mean=43.6  median=43.5
  Completers: mean=81.6  median=81.5
  Gap: completers score +38.0 higher on average

--- days_active ---
  Droppers:   mean=1.9  median=2.0
  Completers: mean=20.3  median=20.5
  Gap: completers score +18.5 higher on average

--- videos_watched ---
  Droppers:   mean=2.5  median=2.5
  Completers: mean=26.3  median=26.5
  Gap: completers score +23.8 higher on average

What just happened?

pandas' boolean filtering, df[df['completion_pct'] < 20], selects only the rows where that condition is true. We created two sub-DataFrames (droppers and completers) and ran mean/median on each.

The gaps are enormous. Completers average about 82 on quizzes versus about 44 for droppers, nearly double. They're active for about 20 days versus 2, and they watch about 26 videos versus 2 or 3. These aren't subtle differences: they're two completely different types of student. The product team now knows exactly what to investigate: why do some students drop so early? What happens in the first two or three days?
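The same comparison can be written more compactly with groupby(). This is a sketch, not the lesson's method: the group column and the 50% cut-off are assumptions (any cut-off inside the empty zone gives the same two groups).

```python
import numpy as np
import pandas as pd

# Same values as the lesson's DataFrame
df = pd.DataFrame({
    'completion_pct': [12, 85, 91, 7, 88, 14, 82, 95, 9, 87,
                       11, 90, 6, 84, 93, 10, 86, 8, 89, 92],
    'quiz_score':     [45, 78, 82, 41, 79, 48, 80, 88, 43, 76,
                       44, 85, 40, 77, 84, 46, 81, 42, 83, 86],
    'days_active':    [2, 18, 21, 1, 19, 3, 17, 24, 2, 20,
                       2, 22, 1, 18, 23, 3, 19, 1, 21, 22],
    'videos_watched': [3, 24, 27, 2, 25, 4, 23, 30, 2, 26,
                       3, 28, 1, 24, 29, 3, 25, 2, 27, 28],
})

# Label each student (hypothetical column name and cut-off)
df['group'] = np.where(df['completion_pct'] < 50, 'dropper', 'completer')

# One table: mean and median of every metric, per group
metrics = ['quiz_score', 'days_active', 'videos_watched']
summary = df.groupby('group')[metrics].agg(['mean', 'median']).round(1)

print(df['group'].value_counts())
print(summary)
```

A labelled column also makes it easy to reuse the split later, for example when plotting each group separately.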

Chart Type 3 — The Percentile Chart

A percentile chart (also called a CDF — Cumulative Distribution Function) answers the question: "What percentage of students score below a given value?" It's the best tool for answering questions like "What score do the bottom 25% of students get?" or "How many students finish more than half the course?"

# Check specific percentiles for quiz scores
# "What score marks the bottom 10%? Bottom 25%? Median? Top 10%?"
percentiles = [10, 25, 50, 75, 90]

print("Quiz Score Percentiles:\n")
for p in percentiles:
    value = df['quiz_score'].quantile(p / 100)   # .quantile() takes a value from 0.0 to 1.0
    # Plain English: "p% of students scored BELOW this value"
    print(f"  {p:>3}th percentile:  {value:.0f} points  "
          f"({p}% of students scored below this)")

print()
# Key business questions answered from percentiles
below_50 = (df['completion_pct'] < 50).sum()
print(f"Students who completed less than half the course: {below_50} out of {len(df)} ({below_50/len(df)*100:.0f}%)")

Quiz Score Percentiles:

   10th percentile:  42 points  (10% of students scored below this)
   25th percentile:  45 points  (25% of students scored below this)
   50th percentile:  78 points  (50% of students scored below this)
   75th percentile:  82 points  (75% of students scored below this)
   90th percentile:  85 points  (90% of students scored below this)

Students who completed less than half the course: 8 out of 20 (40%)

What just happened?

pandas' .quantile() takes a value between 0.0 and 1.0, so .quantile(0.25) gives the 25th percentile. We divide p by 100 to convert from "25" to "0.25". When a percentile falls between two students, pandas interpolates between their scores by default, so the printed values are rounded.

The quiz scores confirm the bimodal story: the 25th percentile is 45 and the 50th is 78, a jump of 33 points in one step. That's not a smooth curve; there are two separate clusters. And 40% of students completed less than half the course, a damning metric for the product team.
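The percentile table is a handful of points read off the cumulative curve. The whole curve can be sketched as an empirical CDF; the ecdf helper below is a hypothetical name, and the score of 60 is an arbitrary example.

```python
import numpy as np

# Same quiz_score values as the lesson's DataFrame
quiz = np.array([45, 78, 82, 41, 79, 48, 80, 88, 43, 76,
                 44, 85, 40, 77, 84, 46, 81, 42, 83, 86])

def ecdf(values):
    """Return the sorted values and, for each, the share of observations at or below it."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf(quiz)

# Text version of the curve: one row per student, bar length = cumulative count
for k, (score, share) in enumerate(zip(x, y), start=1):
    print(f"{int(score):>3}  {'█' * k:<20}  {share:.0%}")

# Reading the curve at a chosen score
share_below_60 = float((quiz < 60).mean())
print(f"Share of students scoring below 60: {share_below_60:.0%}")
```

In the printed rows, the score column leaps across the empty zone while the cumulative share keeps climbing one student at a time; that leap is the gap between the two clusters.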

The Distribution Shape Visual Guide

Here are the four most common distribution shapes you'll encounter, and what each one tells you about your data:

Normal

Symmetric bell. Mean ≈ median. Safe to use as-is.

Right-Skewed

Long right tail. Mean > median. A log transform often helps if all values are positive.

Left-Skewed

Long left tail. Mean < median. Rare but worth flagging.

Bimodal

Two peaks. Mean is meaningless. Likely two subgroups — investigate.
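The right-skewed case can be illustrated with synthetic data. This sketch assumes a lognormal sample generated with a fixed seed (unrelated to the lesson's dataset) and checks what a log transform does to the skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic right-skewed sample (assumed lognormal, not the lesson's data)
right_skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000))
logged = np.log(right_skewed)

print(f"raw:    mean={right_skewed.mean():.1f}  "
      f"median={right_skewed.median():.1f}  skew={right_skewed.skew():.2f}")
print(f"logged: mean={logged.mean():.2f}  "
      f"median={logged.median():.2f}  skew={logged.skew():.2f}")
```

For the raw sample the mean sits well above the median; after the transform the two should nearly coincide and the skewness should drop close to zero.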

Putting It Together — A Distribution Report

Let's write one final function that summarises the distribution of every column with a shape diagnosis, key percentiles, and an action recommendation — the kind of output you'd include in an analysis document.

def distribution_report(dataframe, columns):
    """Diagnoses the shape of each column and recommends an action."""
    for col in columns:
        s      = dataframe[col].dropna()
        mean   = s.mean()
        median = s.median()
        skew   = s.skew()
        p10    = s.quantile(0.10)
        p90    = s.quantile(0.90)

        # Diagnose shape based on skewness
        if skew > 1:
            shape  = "Right-skewed ↗"
            action = "Consider log transform before modelling"
        elif skew < -1:
            shape  = "Left-skewed ↙"
            action = "Review low-end values — may need transformation"
        # Mean and median more than 10% of the median apart: likely two groups or heavy outliers
        elif abs(mean - median) > abs(median) * 0.1:
            shape  = "Bimodal or heavy outliers ⚡"
            action = "Split into subgroups and analyse separately"
        else:
            shape  = "Roughly symmetric ✓"
            action = "Safe to use as-is"

        print(f"{'='*50}")
        print(f"  Column: {col}")
        print(f"  Shape:  {shape}")
        print(f"  Mean={mean:.1f}  Median={median:.1f}  Skew={skew:.2f}")
        print(f"  10th pct={p10:.1f}  →  90th pct={p90:.1f}")
        print(f"  Action: {action}")
    print(f"{'='*50}")

distribution_report(df, ['completion_pct','quiz_score','days_active','videos_watched'])

==================================================
  Column: completion_pct
  Shape:  Bimodal or heavy outliers ⚡
  Mean=57.0  Median=84.5  Skew=-0.43
  10th pct=7.9  →  90th pct=92.1
  Action: Split into subgroups and analyse separately
==================================================
  Column: quiz_score
  Shape:  Bimodal or heavy outliers ⚡
  Mean=66.4  Median=77.5  Skew=-0.39
  10th pct=41.9  →  90th pct=85.1
  Action: Split into subgroups and analyse separately
==================================================
  Column: days_active
  Shape:  Bimodal or heavy outliers ⚡
  Mean=12.9  Median=18.0  Skew=-0.35
  10th pct=1.0  →  90th pct=22.1
  Action: Split into subgroups and analyse separately
==================================================
  Column: videos_watched
  Shape:  Bimodal or heavy outliers ⚡
  Mean=16.8  Median=24.0  Skew=-0.39
  10th pct=2.0  →  90th pct=28.1
  Action: Split into subgroups and analyse separately
==================================================

What just happened?

pandas' .skew(), .quantile(), .mean(), and .median() all run inside one function that packages the results into a readable report. The shape diagnosis is based on skewness and the mean-median gap — two simple checks that catch most distribution problems.

Every single column flags as bimodal and recommends splitting into subgroups. That's the right call — this data describes two fundamentally different types of student, and any model trained on the combined data would be confused from the start. The distribution report just saved the modelling team from a very expensive mistake.
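The mean-median gap is a rough signal. One complementary check, offered here as a sketch and not as part of the lesson's report, is Sarle's bimodality coefficient, which combines skewness and kurtosis; values above 5/9 may indicate more than one peak, although heavily skewed single-peak data can also exceed it. The function name and the single_peak sample are hypothetical.

```python
import pandas as pd

def bimodality_coefficient(series):
    """Sarle's bimodality coefficient with the finite-sample correction (needs n > 3)."""
    s = series.dropna()
    n = len(s)
    g = s.skew()   # sample skewness
    k = s.kurt()   # sample excess kurtosis
    return (g ** 2 + 1) / (k + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Same completion_pct values as the lesson's DataFrame
completion = pd.Series([12, 85, 91, 7, 88, 14, 82, 95, 9, 87,
                        11, 90, 6, 84, 93, 10, 86, 8, 89, 92])
# Hypothetical single-peaked sample for contrast
single_peak = pd.Series([1, 3, 4, 5, 5, 5, 5, 6, 7, 9])

THRESHOLD = 5 / 9  # value of the coefficient for a uniform distribution

bc_completion = bimodality_coefficient(completion)
bc_peak = bimodality_coefficient(single_peak)

for name, bc in [("completion_pct", bc_completion), ("single_peak", bc_peak)]:
    verdict = "possibly bimodal" if bc > THRESHOLD else "no sign of bimodality"
    print(f"{name:<15} BC={bc:.2f}  ->  {verdict}")
```

Treat the result as a prompt to plot the column, not as a verdict on its own.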

Teacher's Note

Never report a mean without first checking that the distribution is unimodal. The mean completion of about 57% that we saw here is technically correct, but it's completely misleading. It makes the course sound mediocre, when it actually has two extreme groups: students who love it and finish almost everything, and students who drop out within the first few days.

In Python, you'd use matplotlib.pyplot.hist() or seaborn.histplot() to draw real histograms. The text histogram and charts in this lesson show the same shape; a plotted version needs only a few lines, ending with plt.show() to display the figure. We'll use full seaborn visualisations from Lesson 29 onwards.
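For reference, a plotted version of the completion histogram might look like the sketch below. It assumes matplotlib is installed; the Agg backend and the output path are choices made so the script also runs without a display, and in a notebook you would call plt.show() instead of saving.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# Same completion_pct values as the lesson's DataFrame
completion = pd.Series([12, 85, 91, 7, 88, 14, 82, 95, 9, 87,
                        11, 90, 6, 84, 93, 10, 86, 8, 89, 92])

fig, ax = plt.subplots(figsize=(6, 3))

# ax.hist returns the bin counts, the bin edges, and the drawn bars
counts, edges, _ = ax.hist(completion, bins=8, edgecolor="white")

# Mark the mean so the empty zone around it is visible
ax.axvline(completion.mean(), linestyle="--", color="black", label="mean")
ax.set_xlabel("completion_pct")
ax.set_ylabel("students")
ax.set_title("Course completion")
ax.legend()

# Hypothetical output location
out_path = os.path.join(tempfile.gettempdir(), "completion_hist.png")
fig.savefig(out_path)
plt.close(fig)
```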

Practice Questions

1. A histogram shows two separate clusters of bars with a large empty gap between them. What is this distribution shape called?

2. Which pandas method returns the value at a specific percentile — for example, the value that 75% of the data falls below?

3. In a bimodal distribution, which measure of central tendency is most misleading — mean or median?

Up Next · Lesson 27

Visualising Relationships

Scatter plots, line charts, and pair plots — how to see the connections between two variables before you ever run a model on them.
