EDA Lesson 12 – Skewness & Kurtosis | Dataplexa
Beginner Level · Lesson 12

Skewness & Kurtosis

Mean, median, standard deviation — those describe the centre and spread of your data. But they say nothing about the shape. Skewness and kurtosis are the two numbers that complete the picture, and ignoring them has sent more than a few analysts to their manager with a model that didn't work.

Shape Is Information

Imagine two salary datasets. Both have a mean of £45,000 and a standard deviation of £12,000. Sound identical? They're not. One might have most employees clustered near £35,000 with a long tail of high earners dragging the mean up. The other might be perfectly symmetric. You'd make completely different business decisions depending on which one you're actually dealing with.

That's what skewness and kurtosis measure — the asymmetry and tail behaviour of a distribution. They are the third and fourth statistical moments of a dataset, building on the variance (second moment) you already know.

Skewness: Which Way Does It Lean?

Skewness measures how lopsided a distribution is. A perfectly symmetric distribution (like a textbook normal distribution) has a skewness of exactly 0. In practice, almost no real dataset hits zero — but the closer to zero, the more symmetric your data is.

Negative Skew (Left Skew)

Long tail on the left. Most values are high, but a few very low values drag the mean down. Example: exam scores where everyone did well but a handful bombed it.

Zero Skew (Symmetric)

Balanced on both sides. Mean ≈ median ≈ mode. The textbook normal distribution. Rare in real data, but something to aim for before modelling.

Positive Skew (Right Skew)

Long tail on the right. Most values are low, but a few extreme high values pull the mean up. Example: income data, house prices, social media followers.
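A quick way to see positive skew in numbers: with a right-skewed sample, the mean sits well above the median because it gets dragged toward the tail. Here's a minimal sketch using hypothetical follower counts (not part of this lesson's dataset):

```python
import pandas as pd

# Hypothetical follower counts: most accounts are small, one is huge (right skew)
followers = pd.Series([120, 340, 200, 150, 90, 410, 180, 95000, 220, 130])

print(f"mean:   {followers.mean():.0f}")    # dragged up by the one huge account
print(f"median: {followers.median():.0f}")  # stays with the bulk of the data
print(f"skew:   {followers.skew():.2f}")    # strongly positive
```

Notice the mean is hundreds of times the median here — a single extreme account does that.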

Skewness Visualised

Here's what each type of skew actually looks like as a distribution curve. The tail tells the story.

[Figure: three distribution curves. Negative skew: long tail stretching left, mean dragged below the median (skewness < 0). Symmetric: mean = median (skewness ≈ 0). Positive skew: long tail stretching right, mean dragged above the median (skewness > 0).]

Notice how the mean always chases the tail — it gets pulled toward extreme values, while the median stays closer to the bulk of the data.

Kurtosis: How Heavy Are the Tails?

Skewness is about lean. Kurtosis is about weight in the tails — are your extreme values more common or less common than you'd expect from a normal distribution?

The reference point for kurtosis is the normal distribution, which has a kurtosis of 3. Most libraries actually return excess kurtosis (also called Fisher's kurtosis), which subtracts 3 so the normal distribution baseline becomes 0. Pandas uses excess kurtosis by default — keep that in mind.
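The 3-versus-0 convention trips people up, so it's worth seeing side by side. scipy's kurtosis function exposes both via its fisher parameter. A sketch using a simulated normal sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)                    # seeded for reproducibility
sample = rng.normal(loc=0, scale=1, size=100_000)  # large sample from a normal distribution

raw    = stats.kurtosis(sample, fisher=False)  # Pearson convention: normal distribution ≈ 3
excess = stats.kurtosis(sample, fisher=True)   # Fisher (excess) convention: normal ≈ 0, scipy's default

print(f"raw kurtosis:    {raw:.3f}")     # close to 3
print(f"excess kurtosis: {excess:.3f}")  # close to 0 — the same convention pandas .kurt() uses
```

The two numbers always differ by exactly 3; the only question is which baseline your library reports.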

Platykurtic (Kurtosis < 0)

Thin tails, flat peak. Fewer extreme values than normal. The uniform distribution is the classic case. Think: the outcome of a fair die roll — every value from 1 to 6 is equally likely, and true extremes simply never occur.

Mesokurtic (Kurtosis ≈ 0)

Tail behaviour matches the normal distribution. The reference case. Most statistical models assume your residuals look like this.

Leptokurtic (Kurtosis > 0)

Fat tails, sharp peak. More extreme values than expected. Very common in financial data — stock returns have fat tails, which is why "once in a lifetime" market crashes happen surprisingly often.

Calculating Skewness & Kurtosis in Python

The scenario: You're a data analyst at a property tech startup. Your team has just pulled together a dataset of London flat rental prices (in £/month) across three different borough categories: central, mid-zone, and outer. Before your lead data scientist starts building a price prediction model, she's asked you to check the shape of each distribution. "Don't send me skewed data without flagging it," she said. You've got the data, you've got pandas — let's get it done.

import pandas as pd          # pandas: Python's core data table library — we use it to build and analyse our DataFrame
import numpy as np           # numpy: numerical Python library — we need it for np.nan and fast array operations

# Build a realistic dataset of monthly flat rentals (£/month) across three London zones
data = {
    'central':  [3200, 3450, 3100, 2950, 3600, 3300, 3150, 6800, 7200, 3250],  # central London — note the two expensive outliers
    'mid_zone': [1800, 1750, 1900, 1850, 1820, 1780, 1950, 1760, 1840, 1890],  # mid-zone — fairly tight cluster
    'outer':    [1100, 1050, 1200, 1080, 1120, 1090, 1060, 1150, 1070, 1130]   # outer boroughs — also clustered, lower rents
}

df = pd.DataFrame(data)      # turn the dictionary into a DataFrame for easy column-by-column analysis

# Calculate skewness for each zone
# pandas .skew() uses the adjusted (bias-corrected) Fisher-Pearson skewness formula
skewness = df.skew()         # returns a Series with one skewness value per column
print("=== SKEWNESS ===")
print(skewness.round(3))     # round to 3 decimal places for readability
print()

# Calculate kurtosis for each zone
# IMPORTANT: pandas .kurt() returns EXCESS kurtosis (normal distribution = 0, not 3)
kurtosis = df.kurt()         # returns a Series with one kurtosis value per column
print("=== EXCESS KURTOSIS ===")
print(kurtosis.round(3))

What just happened?

pandas is Python's core data table library. Here we used two of its statistical methods: .skew() and .kurt(). When called on a DataFrame (no column specified), both methods run across every column and return a Series — one value per column. That's the "apply across all columns" default behaviour.

numpy is Python's numerical powerhouse. We imported it here as standard practice — it's used later in the lesson and is almost always present alongside pandas in EDA work.

The central zone's skewness comes out at roughly 1.7, strongly positive and well past the 1.0 "highly skewed" threshold. The two expensive flats at £6,800 and £7,200 are pulling the right tail hard. A model trained on this raw data would overestimate how common high rents are.

The mid_zone is only mildly skewed (roughly 0.4). The outer zone lands around 0.9: still below the 1.0 line, but the single £1,200 flat gives it a noticeable right tail, so it deserves a second look before modelling.

The excess kurtosis for central comes out around 1.4, which is leptokurtic: the two extreme values create exactly the fat-tail behaviour kurtosis is designed to catch. Mid-zone's kurtosis is negative (platykurtic), meaning its values are tightly packed with even fewer extremes than a normal distribution would produce, while outer's sits within the near-normal band.

Interpreting the Numbers: Practical Thresholds

A skewness of 0.1 vs 0.2? Not a real difference. But skewness of 1.8? That's a data shape that will hurt your model. Here are the thresholds most working data scientists use as a starting point:

Metric           | Range                         | Interpretation               | Action
-----------------|-------------------------------|------------------------------|---------------
Skewness         | -0.5 to 0.5                   | Fairly symmetric             | ✓ Proceed
Skewness         | 0.5 to 1.0 (or -1.0 to -0.5)  | Moderately skewed            | ⚠ Review
Skewness         | > 1.0 or < -1.0               | Highly skewed                | ✗ Transform
Excess kurtosis  | -1 to 1                       | Near-normal tails            | ✓ Proceed
Excess kurtosis  | 1 to 3 (or -3 to -1)          | Noticeably heavy/light tails | ⚠ Review
Excess kurtosis  | > 3 or < -3                   | Extreme tail behaviour       | ✗ Investigate

These thresholds are guidelines, not hard rules. Context matters — a skewness of 1.2 in income data is normal and expected. The same value in measurement error data is a red flag.

Using scipy for More Precise Tests

The scenario: You've flagged the central zone data as skewed, but your lead data scientist pushes back: "How confident are you that it's actually skewed and not just random variation from a small sample?" You need a formal statistical test. She wants a p-value, not just a number.

import pandas as pd                           # pandas: core data library — DataFrame and column access
import numpy as np                            # numpy: numerical computing library — array handling
from scipy import stats                       # scipy.stats: scientific computing statistics module — contains formal hypothesis tests for skewness and kurtosis

# Recreate the central London rent data
central_rents = np.array([3200, 3450, 3100, 2950, 3600, 3300, 3150, 6800, 7200, 3250])  # 10 values — small sample but realistic

# scipy's skewtest: tests whether skewness is significantly different from zero
# Returns a statistic and a two-tailed p-value
# Null hypothesis: the data comes from a symmetric distribution (skewness = 0)
skew_stat, skew_p = stats.skewtest(central_rents)   # unpack into statistic and p-value
print(f"Skew test statistic: {skew_stat:.3f}")       # size of the skewness signal
print(f"Skew test p-value:   {skew_p:.4f}")          # probability of seeing this skewness by chance

print()                                              # blank line for readability

# scipy's kurtosistest: tests whether excess kurtosis is significantly different from zero
# Null hypothesis: the data has normal-distribution tail behaviour (excess kurtosis = 0)
# NOTE: scipy warns that kurtosistest is only valid for n >= 20; with 10 points it emits
#       a UserWarning and the result should be treated with caution
kurt_stat, kurt_p = stats.kurtosistest(central_rents)  # same structure as skewtest
print(f"Kurtosis test statistic: {kurt_stat:.3f}")
print(f"Kurtosis test p-value:   {kurt_p:.4f}")

print()
# Interpret results clearly for your stakeholder
alpha = 0.05                                            # standard significance threshold
if skew_p < alpha:
    print("Verdict: Statistically significant skewness detected — transformation recommended.")  # flag it
else:
    print("Verdict: No statistically significant skewness at the 0.05 level.")  # safe to proceed

What just happened?

scipy is Python's scientific computing library. It sits on top of numpy and provides statistical tests, distributions, signal processing, and optimisation routines. We specifically imported scipy.stats, which is the module dedicated to statistical functions. It's the go-to library when pandas' built-in statistics aren't enough and you need formal hypothesis testing.

stats.skewtest() runs D'Agostino's skewness test. It asks: "Given this sample size and this skewness value, what's the probability we'd see a result at least this extreme if the true population were actually symmetric?" For the central rents the p-value comes in below our 5% threshold, so we reject the null and flag the skewness as real.

The kurtosis test is less convincing here, and that's instructive: with only 10 data points it has very little power (hence scipy's n >= 20 warning), so its p-value doesn't clear the 0.05 bar even though the sample kurtosis is positive. The skewness evidence alone is enough to act on. Those two £6,800+ flats aren't noise; they're structurally changing the distribution.

Building a Distribution Shape Summary

The scenario: Your lead data scientist liked the test results, but now she wants a quick summary function she can drop into any future project. "If I give you a DataFrame, I want one clean table back with the shape diagnostics for every column — skewness, kurtosis, and a plain-English flag." You build it.

import pandas as pd    # pandas: our data table library — DataFrame creation and column operations
import numpy as np     # numpy: numerical library — used for array creation and NaN handling

# Rebuild the full rental dataset
data = {
    'central':  [3200, 3450, 3100, 2950, 3600, 3300, 3150, 6800, 7200, 3250],
    'mid_zone': [1800, 1750, 1900, 1850, 1820, 1780, 1950, 1760, 1840, 1890],
    'outer':    [1100, 1050, 1200, 1080, 1120, 1090, 1060, 1150, 1070, 1130]
}
df = pd.DataFrame(data)   # DataFrame gives us column-level .skew() and .kurt()

def shape_summary(dataframe):
    """Returns a DataFrame with skewness, kurtosis, and plain-English flags for each column."""
    results = []                                           # build a list of dicts, one per column
    for col in dataframe.select_dtypes(include='number').columns:  # only numeric columns
        skew_val = dataframe[col].skew()                   # pandas .skew() — Fisher-Pearson, unbiased
        kurt_val = dataframe[col].kurt()                   # pandas .kurt() — excess kurtosis (normal = 0)

        # Plain-English skewness label based on common thresholds
        if abs(skew_val) < 0.5:
            skew_flag = "Symmetric"
        elif abs(skew_val) < 1.0:
            skew_flag = "Moderate skew"
        else:
            skew_flag = "HIGH SKEW — consider transform"   # flag for the data scientist

        # Plain-English kurtosis label
        if abs(kurt_val) < 1.0:
            kurt_flag = "Normal tails"
        elif kurt_val > 1.0:
            kurt_flag = "Fat tails (leptokurtic)"          # more extreme values than expected
        else:
            kurt_flag = "Thin tails (platykurtic)"         # fewer extreme values than expected

        results.append({
            'column':    col,
            'skewness':  round(skew_val, 3),
            'kurtosis':  round(kurt_val, 3),
            'skew_flag': skew_flag,
            'kurt_flag': kurt_flag
        })

    return pd.DataFrame(results).set_index('column')      # column name as row index for readability

summary = shape_summary(df)    # run it
print(summary.to_string())     # print without truncation

What just happened?

pandas is doing the heavy lifting here. .select_dtypes(include='number') is a handy method that filters a DataFrame to only numeric columns — useful when your real-world datasets have mixed text and numbers. .skew() and .kurt() are called per column inside the loop, then the results are assembled into a new summary DataFrame using pd.DataFrame(results).
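The numeric filter matters as soon as text columns show up. A minimal sketch with a hypothetical mixed-type DataFrame (the 'borough' column is an illustration, not part of the lesson dataset):

```python
import pandas as pd

# Hypothetical mixed-type DataFrame: one text column alongside the numbers
mixed = pd.DataFrame({
    'borough':  ['Camden', 'Hackney', 'Bromley'],  # text column — no meaningful skewness
    'rent':     [3200, 1800, 1100],
    'bedrooms': [2, 1, 3]
})

numeric_only = mixed.select_dtypes(include='number')  # keeps only the numeric columns
print(list(numeric_only.columns))                     # ['rent', 'bedrooms']
```

Without that filter, shape_summary would try to compute skewness on borough names.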

numpy is imported here as part of the standard EDA toolkit. In this function it's not directly called, but it underpins the pandas calculations under the hood.

The output is the kind of table you'd actually put in a project notebook or attach to a Slack message for your lead. At a glance: the central zone needs a transformation before modelling, mid_zone is fine as-is, and outer picks up a moderate-skew flag worth a second look. That's a decision made in seconds, not an hour of squinting at histograms.

Why This Matters for Modelling

Most statistical models — linear regression, logistic regression, many machine learning algorithms — don't strictly require normal inputs, but they work best when features and residuals are roughly symmetric, and heavily skewed features are one of the most common reasons residuals end up skewed too. When you feed them heavily skewed data, they don't crash. They just produce subtly wrong predictions that take weeks to debug.

Common fixes for high skewness:

  • Log transform: np.log1p(x). Flattens right-skewed data dramatically; the go-to for income, prices, counts.
  • Square root transform: np.sqrt(x). Gentler than log; good for moderate skew.
  • Box-Cox transform: scipy.stats.boxcox(x). Finds the optimal power transformation automatically (requires strictly positive data).
  • Winsorization: cap extreme values at a percentile threshold instead of removing them.
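As a preview of Lesson 13, here's the log transform applied to this lesson's central-zone rents. A sketch showing that np.log1p reduces the skewness, though outlier-driven skew this severe usually needs more than one fix:

```python
import numpy as np
import pandas as pd

# Central-zone rents from earlier in the lesson
central = pd.Series([3200, 3450, 3100, 2950, 3600, 3300, 3150, 6800, 7200, 3250])

before = central.skew()            # raw skewness, well above the 1.0 threshold
after  = np.log1p(central).skew()  # skewness after log(1 + x)

print(f"skew before log1p: {before:.3f}")
print(f"skew after  log1p: {after:.3f}")  # smaller, but the two outliers still dominate
```

When a log transform alone doesn't do it, winsorization or Box-Cox are the next tools to reach for.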

We cover data transformation in detail in Lesson 13. For now, what matters is that you can detect the problem — which is exactly what this lesson gave you.

Teacher's Note

A quick watch-out: sample size matters a lot for skewness and kurtosis. With fewer than 50 data points, these measures are noisy — a skewness of 0.8 from 20 rows might be meaningless. Always check your sample size before drawing conclusions. As a rough rule: treat skewness as meaningful at n ≥ 30 and kurtosis at n ≥ 50. The scipy hypothesis tests help here because they bake sample size into the p-value automatically.
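You can see that noise directly with a quick simulation. This sketch draws repeated samples from a perfectly symmetric normal population (true skewness 0) and measures how much the sample skewness bounces around at two sample sizes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the simulation is reproducible

def skew_spread(n, reps=2000):
    """Std-dev of sample skewness across many samples of size n drawn from N(0, 1)."""
    skews = [pd.Series(rng.normal(size=n)).skew() for _ in range(reps)]
    return np.std(skews)

print(f"spread of skewness at n=20:  {skew_spread(20):.3f}")   # wide: estimates are noisy
print(f"spread of skewness at n=200: {skew_spread(200):.3f}")  # much tighter
```

Even though every sample comes from the same symmetric population, small samples routinely report skewness values that would look "moderate" on the threshold table above.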

Practice Questions

1. A salary dataset has most employees earning around £30,000, but a small number of executives earning over £500,000. What type of skew does this distribution have?



2. What is the excess kurtosis value of a perfectly normal distribution?



3. Which pandas method returns excess kurtosis for each column of a DataFrame — skew() or kurt()?



Quiz

1. A dataset has an excess kurtosis of 4.2. What does this tell you?


2. In a negatively skewed distribution, which statement is true?


3. Which function would you use to get a p-value for whether skewness is statistically significant?


Up Next · Lesson 13

Data Transformation

Log transforms, Box-Cox, scaling, and encoding — the toolkit for turning messy raw data into model-ready features.