EDA Course
Normality Checks
Looking like a bell curve and actually being one are two different things — and the difference matters enormously when your statistical tests depend on the normality assumption holding true.
Why "Looks Normal" Isn't Good Enough
In Lesson 10 you learned to spot distribution shapes visually. That's essential — but a histogram can fool you. A dataset with 50 rows might look roughly bell-shaped just by chance, even though it's genuinely skewed. A dataset with 5,000 rows might look almost perfect on a histogram but fail a formal normality test because of a subtle systematic deviation the eye misses.
This matters because dozens of statistical methods — t-tests, ANOVA, Pearson correlation, linear regression diagnostics — carry an explicit assumption: the data (or the residuals) are normally distributed. Use these methods on data that violates the assumption, and your p-values, confidence intervals, and model coefficients become unreliable. Normality checks are the gate you pass through before trusting those results.
🧭 Three Tools, One Goal
You'll use three complementary approaches: the Shapiro-Wilk test (statistical hypothesis test), the Q-Q plot (visual alignment check), and the histogram + KDE overlay (distribution shape inspection). No single tool is definitive on its own — professionals use all three together before making a call.
Tool 1 — The Shapiro-Wilk Test
The Shapiro-Wilk test is the most widely used formal normality test. It works by comparing your data against what a perfectly normal distribution of the same size would look like, and returns two numbers: a W statistic (closer to 1.0 = more normal) and a p-value.
The interpretation follows standard hypothesis testing logic. Your null hypothesis is that the data is normally distributed. If the p-value is above 0.05, you fail to reject that hypothesis — the data is consistent with normality. If the p-value is below 0.05, you reject it — the data is significantly non-normal. Simple rule, powerful result.
p-value > 0.05
✅ Normal
Fail to reject null hypothesis. Data is consistent with a normal distribution. Safe to use parametric tests.
p-value ≤ 0.05
❌ Not Normal
Reject null hypothesis. Significant departure from normality. Switch to non-parametric alternatives.
The scenario: You're a junior data analyst at a pharmaceutical company. The clinical statistics team is planning to run a paired t-test comparing blood pressure readings before and after a new medication trial. The t-test is only valid if the difference in readings per patient is normally distributed. Your job is to verify this assumption on the trial dataset before the results go to the regulatory submission. If the data fails the normality check, the team will need to switch to a Wilcoxon signed-rank test instead. The regulatory body requires documentation of which normality test was used and what it returned — so your output needs to be clean, labelled, and unambiguous.
import pandas as pd # DataFrame and .describe() for context
import numpy as np # data simulation with np.random
from scipy import stats # scipy.stats.shapiro() — the Shapiro-Wilk test
# Note: scipy (Scientific Python) is a statistics and science library.
# scipy.stats contains dozens of hypothesis tests — Shapiro-Wilk is one of them.
np.random.seed(42)
# Simulate blood pressure differences (before - after medication) for 40 patients
# A realistic trial: most patients show moderate improvement, a few show none
bp_diff = pd.DataFrame({
    'patient': [f'P{i:03d}' for i in range(1, 41)],
    # BP reduction in mmHg — positive = improvement, normally distributed
    'bp_reduction': np.random.normal(loc=8.5, scale=4.2, size=40).round(1)
})
# Run Shapiro-Wilk — returns (W statistic, p-value)
stat, p_value = stats.shapiro(bp_diff['bp_reduction'])
# Print a clean, labelled output suitable for documentation
print("=" * 48)
print(" Shapiro-Wilk Normality Test")
print(" Column: bp_reduction (mmHg)")
print("=" * 48)
print(f" W statistic : {stat:.4f}")
print(f" p-value : {p_value:.4f}")
print()
# Interpret automatically — document the conclusion
alpha = 0.05
if p_value > alpha:
    print(f" RESULT: p={p_value:.4f} > {alpha} → Fail to reject H₀")
    print(" CONCLUSION: Data is consistent with normality.")
    print(" RECOMMENDATION: Paired t-test is appropriate.")
else:
    print(f" RESULT: p={p_value:.4f} ≤ {alpha} → Reject H₀")
    print(" CONCLUSION: Significant departure from normality.")
    print(" RECOMMENDATION: Use Wilcoxon signed-rank test instead.")
================================================
 Shapiro-Wilk Normality Test
 Column: bp_reduction (mmHg)
================================================
 W statistic : 0.9841
 p-value     : 0.8512

 RESULT: p=0.8512 > 0.05 → Fail to reject H₀
 CONCLUSION: Data is consistent with normality.
 RECOMMENDATION: Paired t-test is appropriate.
💡 What just happened?
We introduced scipy for the first time — specifically scipy.stats, which is the standard Python library for statistical hypothesis tests. We use it here because neither pandas nor numpy contains formal hypothesis tests — scipy is purpose-built for this. stats.shapiro() returns a tuple of two values: the W statistic (0.9841 — very close to 1.0, strongly suggesting normality) and the p-value (0.8512 — far above 0.05). The clinical team can proceed with their t-test with confidence, and this output block is the documentation their regulatory submission needs.
Tool 2 — The Q-Q Plot
A Quantile-Quantile (Q-Q) plot is the visual companion to the Shapiro-Wilk test. It plots your data's quantiles against the theoretical quantiles of a perfect normal distribution. If your data is normal, all the points will fall neatly along a straight diagonal line. Deviations from that line — S-curves, bent ends, or scattered points — reveal exactly where and how your data departs from normality.
The Q-Q plot is particularly useful for understanding where the non-normality lives. A single bent tail — the lower portion on the line, one end lifting away — tells you about skew or outliers on that side. An S-shape, with both tails off the line, tells you about heavy or light tails. The Shapiro-Wilk test tells you whether there's a problem — the Q-Q plot tells you what kind.
The scenario: You're a data analyst at an insurance company. The actuarial team is building a regression model to predict claim amounts and they need to verify that the model's residuals are normally distributed — a core assumption of ordinary least squares regression. They've sent you two columns to check: one from a healthy dataset they believe is normal, and one from a problematic column they suspect is skewed. You need to produce Q-Q plots for both and annotate your findings clearly so the actuarial team can decide which column needs transformation before modelling.
import pandas as pd # DataFrame structure
import numpy as np # data generation
import matplotlib.pyplot as plt # subplot layout and axis labels
from scipy import stats # stats.probplot() — draws the Q-Q plot data points
np.random.seed(55)
# Column A: well-behaved residuals — normally distributed
residuals_good = np.random.normal(loc=0, scale=15, size=150).round(2)
# Column B: claim amounts — right-skewed, exponential-like
residuals_skewed = np.random.exponential(scale=20, size=150).round(2)
# Create side-by-side Q-Q plots — one subplot per column
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# stats.probplot() computes the theoretical quantiles and plots them
# dist='norm' means compare against a normal distribution
# plot=ax tells it to draw on our matplotlib axis
stats.probplot(residuals_good, dist='norm', plot=axes[0])
stats.probplot(residuals_skewed, dist='norm', plot=axes[1])
# Label each subplot clearly for the actuarial team
axes[0].set_title('Q-Q Plot: Column A (Normal) — Points hug the line')
axes[1].set_title('Q-Q Plot: Column B (Skewed) — Upper tail lifts away')
plt.tight_layout()
plt.show()
# Also run Shapiro-Wilk on both to pair numbers with the visual
stat_a, p_a = stats.shapiro(residuals_good)
stat_b, p_b = stats.shapiro(residuals_skewed)
print(f"Column A — W: {stat_a:.4f}, p: {p_a:.4f} → {'Normal ✓' if p_a > 0.05 else 'NOT Normal ✗'}")
print(f"Column B — W: {stat_b:.4f}, p: {p_b:.6f} → {'Normal ✓' if p_b > 0.05 else 'NOT Normal ✗'}")
Two Q-Q plots render side by side.

Left plot (Column A — Normal): Points fall tightly along the red reference diagonal from bottom-left to top-right, with only very minor scatter at the extreme ends. This is what a passing Q-Q plot looks like.

Right plot (Column B — Skewed): Points follow the line in the lower-left portion, but the upper-right tail curves dramatically upward and away from the reference line. This upward bend in the upper tail is the visual signature of right skew — the largest values are much larger than a normal distribution would predict.

Column A — W: 0.9921, p: 0.6387 → Normal ✓
Column B — W: 0.7803, p: 0.000001 → NOT Normal ✗
💡 What just happened?
scipy.stats.probplot() is the function that generates Q-Q plot data — it computes theoretical normal quantiles and matches them to your data's sorted values, then passes everything to matplotlib to draw. We passed plot=axes[0] so it draws directly onto our subplot axis. The two tools together make the conclusion undeniable: Column A passes both the visual test (points on the line) and the statistical test (p = 0.64). Column B fails both — the upper tail curls sharply away from the line, and the Shapiro-Wilk p-value of 0.000001 is essentially zero. The actuarial team needs to log-transform Column B before using it in their regression model.
Visual Mockup — Reading a Q-Q Plot
Here's a visual reference guide showing exactly what the three most common Q-Q plot patterns look like, so you can diagnose them instantly in the wild.
Q-Q Plot Pattern Reference
✅ Normal — Points on the line
Points closely follow the diagonal. Both tails stay on the line. Normality confirmed.
❌ Right Skew — Upper tail lifts
Lower portion follows the line. Upper tail curves upward — extreme values are larger than normal predicts.
❌ Heavy Tails — Both ends lift
Both tails diverge from the line — an S-shape or lifted ends. Indicates outliers or a heavy-tailed distribution.
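The first two patterns were demonstrated in the scenario code above, but the heavy-tails pattern wasn't. Here's a small supplementary sketch (our own illustration, not part of the lesson's scenarios) using a Student's t distribution with 3 degrees of freedom — a classic symmetric but heavy-tailed shape. Calling stats.probplot() without the plot= argument returns the Q-Q coordinates as arrays instead of drawing, which lets you inspect the tail behaviour numerically:

```python
import numpy as np
from scipy import stats

np.random.seed(7)
# Student's t with 3 degrees of freedom: symmetric but heavy-tailed
heavy = np.random.standard_t(df=3, size=300)

# With no plot= argument, probplot returns the Q-Q coordinates:
# (theoretical quantiles, sorted sample values) plus a fitted line (slope, intercept, r)
(theo_q, sample_q), (slope, intercept, r) = stats.probplot(heavy, dist='norm')

# Standardise the sample quantiles so they are on the same scale as N(0, 1)
z = (sample_q - heavy.mean()) / heavy.std()

# With heavy tails, the extreme sample quantiles tend to overshoot the theoretical ones
print(f"Largest theoretical quantile : {theo_q[-1]:.2f}")
print(f"Largest sample quantile (z)  : {z[-1]:.2f}")

stat, p = stats.shapiro(heavy)
print(f"Shapiro-Wilk p-value         : {p:.6f}")
```

Plotting this sample with plot=ax, as in the scenario above, would show both ends of the point cloud lifting away from the reference line — the S-shape from the pattern reference.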
Tool 3 — Histogram with KDE Overlay
A KDE (Kernel Density Estimate) is a smoothed curve drawn on top of your histogram that estimates the true underlying shape of your distribution. Overlaying a theoretical normal curve on the same chart lets you see at a glance how well your data matches the ideal bell shape — and where it diverges.
The scenario: You're a data analyst at a logistics company. The operations team wants to run an ANOVA test comparing delivery times across three regional warehouses. Before the test can go ahead, you need to check whether the delivery time column for each region meets the normality assumption. The ops manager has specifically asked for a visual output — not a table of numbers — because they want to share it in a slide deck for the quarterly review. You'll produce a histogram with KDE overlay and a fitted normal curve so the comparison is immediately visible.
import pandas as pd # DataFrame and column access
import numpy as np # linspace for curve x-axis points
import matplotlib.pyplot as plt # histogram bars and curve overlay
import seaborn as sns # sns.kdeplot() — smooth KDE curve in one line
from scipy import stats # stats.norm.pdf() — theoretical normal curve
# Note: seaborn (sns) is a high-level visualisation library built on matplotlib.
# We use it here because sns.kdeplot() computes and draws a KDE in a single call —
# doing this manually in matplotlib would require bandwidth estimation code.
np.random.seed(33)
# Delivery times in hours from three warehouses
delivery = pd.DataFrame({
    'region': ['North'] * 80 + ['South'] * 80,
    # North: normally distributed delivery times centred at 24 hours
    # South: right-skewed — most fast but some very slow deliveries
    'hours': np.concatenate([
        np.random.normal(loc=24, scale=3, size=80),
        np.clip(np.random.exponential(scale=18, size=80), 4, 96)
    ]).round(1)
})
# Plot one histogram + KDE per region
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for region, color, ax in zip(['North', 'South'],
                             ['#7dd3fc', '#fca5a5'],
                             axes):
    subset = delivery[delivery['region'] == region]['hours']
    # Histogram bars — density=True scales to probability density
    ax.hist(subset, bins=20, color=color, edgecolor='white',
            density=True, alpha=0.6, label='Data')
    # KDE curve — the smoothed estimate of actual distribution shape
    sns.kdeplot(subset, ax=ax, color=color.replace('a5', '26'),
                linewidth=2, label='KDE')
    # Theoretical normal curve based on this data's mean and std
    x = np.linspace(subset.min(), subset.max(), 200)
    ax.plot(x, stats.norm.pdf(x, subset.mean(), subset.std()),
            color='#1e293b', linewidth=2, linestyle='--', label='Normal fit')
    # Shapiro-Wilk result as subtitle
    _, p = stats.shapiro(subset)
    verdict = 'Normal ✓' if p > 0.05 else 'NOT Normal ✗'
    ax.set_title(f'{region} Warehouse | Shapiro p={p:.3f} | {verdict}')
    ax.set_xlabel('Delivery Time (hours)')
    ax.set_ylabel('Density')
    ax.legend(fontsize=9)
plt.tight_layout()
plt.show()
Two histogram plots render side by side.

Left plot — North Warehouse | Shapiro p=0.412 | Normal ✓: Bars form a symmetric bell shape centred around 24 hours. The blue KDE curve closely hugs the dashed black normal fit curve — they overlap almost perfectly. The histogram, KDE, and theoretical normal curve are all telling the same story.

Right plot — South Warehouse | Shapiro p=0.000 | NOT Normal ✗: Bars pile up near 0–15 hours then taper with a long right tail. The red KDE curve veers significantly away from the dashed normal fit — the peak of the KDE is sharper and further left than the normal curve predicts. The mismatch between the KDE and the normal fit is clearly visible: South warehouse delivery times need transformation before ANOVA.
💡 What just happened?
Three libraries worked together here. seaborn was introduced for sns.kdeplot() — we use seaborn rather than matplotlib directly because computing a KDE manually requires bandwidth selection algorithms, whereas seaborn handles this automatically in one line. scipy.stats.norm.pdf() generated the theoretical normal curve — pdf stands for probability density function, and we fed it the sample's own mean and std to produce the "ideal" bell curve that fits this data. matplotlib drew the histogram bars and hosted all three curves on the same axis. The visual gap between the red KDE and the dashed normal fit on the South chart is what makes the non-normality undeniable to a slide-deck audience.
Putting It All Together — Full Normality Report
In practice, you'll want to run all three checks on each column and produce a clean summary. Here's a reusable function that does exactly that — wraps Shapiro-Wilk, skewness, and mean-median gap into one tidy report you can drop into any EDA workflow.
The scenario: You're a data analyst at an HR tech company preparing a dataset of employee performance scores for a research paper. The paper's statistical reviewer requires that all continuous variables pass a normality check before any parametric analysis is reported. You have four columns to check — exam score, productivity index, tenure in months, and training hours — and the reviewer wants a single clean table showing the test result, p-value, skewness, and mean-median gap for each one, along with a plain-English verdict.
import pandas as pd # DataFrame, output table formatting
import numpy as np # data generation for four columns
from scipy import stats # stats.shapiro() for each column
np.random.seed(77)
# HR employee performance dataset — four continuous columns to check
employees = pd.DataFrame({
    # exam_score: normally distributed 60–100 range
    'exam_score': np.clip(np.random.normal(78, 9, 120), 40, 100).round(1),
    # productivity: normally distributed index score
    'productivity': np.clip(np.random.normal(72, 12, 120), 20, 100).round(1),
    # tenure_months: right-skewed — many new hires, few long-tenured staff
    'tenure_months': np.clip(np.random.exponential(18, 120), 1, 120).round(0),
    # training_hours: right-skewed — most minimal training, a few power learners
    'training_hours': np.clip(np.random.exponential(10, 120), 0, 80).round(1),
})
# Reusable normality report function
def normality_report(df):
    results = []
    for col in df.columns:
        series = df[col].dropna()                    # remove NaN before testing
        stat, p = stats.shapiro(series)              # Shapiro-Wilk test
        skewness = series.skew()                     # measure of tail direction
        gap = abs(series.mean() - series.median())   # mean-median distance
        verdict = 'Normal ✓' if p > 0.05 else 'NOT Normal ✗'
        results.append({
            'Column': col,
            'W Stat': round(stat, 4),
            'p-value': round(p, 4),
            'Skewness': round(skewness, 3),
            'Mean-Median Gap': round(gap, 2),
            'Verdict': verdict
        })
    return pd.DataFrame(results).set_index('Column')
# Run report on all four columns
report = normality_report(employees)
print(report.to_string())
                W Stat  p-value  Skewness  Mean-Median Gap       Verdict
Column
exam_score      0.9912   0.6723     0.041             0.21      Normal ✓
productivity    0.9887   0.4851    -0.112             0.38      Normal ✓
tenure_months   0.8341   0.0000     1.872            10.64  NOT Normal ✗
training_hours  0.8209   0.0000     1.941             5.92  NOT Normal ✗
💡 What just happened?
pandas was used to build and format the results table — .set_index() makes the column name the row label for a cleaner output, and .to_string() prints the full DataFrame without truncation. scipy.stats.shapiro() ran on each column's Series after first dropping NaN values — always drop NaN before running a hypothesis test, since NaN values will either break shapiro or silently corrupt its result. The report tells a clear story: exam score and productivity are normally distributed (p > 0.05, skewness near zero, tiny mean-median gap). Tenure and training hours both fail hard — skewness above 1.8, enormous mean-median gaps, and p-values of essentially zero. The paper's reviewer will require a log transformation on those two columns before any parametric analysis proceeds.
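What does that log transformation look like in practice? Here's a small standalone sketch — our own illustration with its own random draw (seed 101), so the numbers won't match the report above exactly. np.log1p compresses the long right tail of a skewed column; note that it reduces skewness but does not guarantee the transformed column will pass Shapiro-Wilk:

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(101)  # standalone draw: numbers differ from the report above
# Right-skewed column in the same spirit as tenure_months
tenure = pd.Series(np.clip(np.random.exponential(18, 120), 1, 120)).round(0)

stat_raw, p_raw = stats.shapiro(tenure)

# log1p = log(1 + x): compresses the long right tail, and is safe even at x = 0
tenure_log = np.log1p(tenure)
stat_log, p_log = stats.shapiro(tenure_log)

print(f"Raw   | skewness: {tenure.skew():+.2f}, Shapiro p: {p_raw:.4f}")
print(f"log1p | skewness: {tenure_log.skew():+.2f}, Shapiro p: {p_log:.4f}")
```

Always re-run the normality report on the transformed column — a log transform of an exponential-like column can overshoot into left skew, in which case a milder transform such as np.sqrt may fit better.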
🍎 Teacher's Note
One important nuance: with very large samples (n > 5,000), the Shapiro-Wilk test becomes hypersensitive — it will reject normality for even tiny, practically insignificant deviations. In those cases, don't abandon the test; just weight the Q-Q plot and the mean-median gap more heavily in your overall judgement. The test is a tool, not a verdict machine. Also worth knowing: Shapiro-Wilk is most reliable for sample sizes between 3 and 5,000. For larger datasets, the D'Agostino-Pearson test (scipy.stats.normaltest()) is the preferred alternative.
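Here's a quick sketch of that large-sample alternative. stats.normaltest() runs the D'Agostino-Pearson test, which combines skewness and kurtosis into a single statistic and stays usable at sample sizes where Shapiro-Wilk becomes hypersensitive (the data here is simulated for illustration):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# 50,000 draws: beyond the n = 5,000 range where Shapiro-Wilk is most reliable
big = np.random.normal(loc=100, scale=15, size=50_000)

# D'Agostino-Pearson: combines skewness and kurtosis into one statistic
stat, p = stats.normaltest(big)
print(f"normaltest | statistic: {stat:.3f}, p-value: {p:.4f}")

# Same decision rule as Shapiro-Wilk
verdict = 'consistent with normality' if p > 0.05 else 'significantly non-normal'
print(f"Verdict: {verdict}")
```

The interpretation is identical to Shapiro-Wilk — null hypothesis of normality, reject below 0.05 — so the normality_report() function above could swap in stats.normaltest() for large columns with only a one-line change.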
Practice Questions
1. Which function from scipy.stats is used to run the Shapiro-Wilk normality test?
2. What is the standard p-value threshold used in the Shapiro-Wilk test to determine whether data is normally distributed?
3. Which scipy.stats function generates the data points for a Q-Q plot?
Quiz
1. A Shapiro-Wilk test returns W = 0.981 and p = 0.42. What does this mean?
2. On a Q-Q plot, what does a right-skewed distribution typically look like?
3. Which library was used in this lesson to draw a KDE overlay on a histogram, and why?
Up Next · Lesson 12
Skewness & Kurtosis
Go deeper into the two numbers that precisely quantify the shape of any distribution — and learn how to use them to make the right analytical decisions fast.