EDA Lesson 23 – Numerical Exploration | Dataplexa
Intermediate Level · Lesson 23

Numerical Exploration

Numbers look simple — but a column full of numbers hides a lot of stories. Is the data bunched together or spread out? Are there a few extreme values pulling the average up? Is the shape lopsided? This lesson gives you a step-by-step process to read any numeric column like a pro.

What Are We Actually Looking For?

When you explore a numeric column, you're trying to answer five questions:

① Where is the centre?

What's the typical value? Mean, median, mode.

② How spread out is it?

Are values tightly packed or all over the place? Std dev, IQR.

③ What is the shape?

Symmetric bell curve, or lopsided? Skewness.

④ Are there outliers?

Extreme values that don't fit the pattern. IQR rule.

⑤ Is anything missing?

How many rows have no value at all? Missing rate.
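Each of these five questions maps to a short pandas expression. Here's a quick sketch using a small stand-in series (not the lesson's dataset), just to show which method answers which question:

```python
import pandas as pd

s = pd.Series([45, 120, 38, 210, 67])   # stand-in numeric column

centre  = s.median()                                   # ① centre
spread  = s.quantile(0.75) - s.quantile(0.25)          # ② spread (the IQR)
shape   = s.skew()                                     # ③ shape (positive = right skew)
fences  = (s.quantile(0.25) - 1.5 * spread,
           s.quantile(0.75) + 1.5 * spread)            # ④ outlier fences
missing = s.isna().mean()                              # ⑤ missing rate (fraction of NaNs)
```

We'll use every one of these in the steps below.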

The Dataset We'll Use

The scenario: You're a data analyst at a supermarket chain. Your manager wants a full numerical profile of customer spending behaviour before the loyalty team builds a segmentation model. You have data on 14 customers — their weekly spend, number of visits, basket size, and loyalty points. Let's explore each numeric column step by step.

First, create the dataset and take a peek:

import pandas as pd
import numpy as np

# Supermarket customer data — 14 customers, 4 numeric columns
df = pd.DataFrame({
    'customer_id':    range(1, 15),
    'weekly_spend':   [45, 120, 38, 210, 67, 89, 52, 340, 73, 95, 41, 88, 61, 780],
    #                                                              ^^^         ^^^
    #                  Customer 8 and 14 look unusually high — keep an eye on those

    'visits_pw':      [2, 4, 1, 5, 3, 3, 2, 6, 3, 4, 2, 3, 2, np.nan],
    #                                                              ^^^
    #                  Customer 14 has a missing visit count — np.nan means "no data"

    'avg_basket':     [22, 30, 38, 42, 22, 30, 26, 57, 24, 24, 21, 29, 31, 780],
    'loyalty_points': [120, 450, 95, 820, 200, 310, 140, 1200, 230, 380, 110, 290, 180, 2900]
})

print(df.to_string(index=False))

What just happened?

pandas creates and stores the data table. numpy supplies np.nan — a special value that means "this cell is empty." Even before running any analysis, you can eyeball the data: customer 14 has very high spend (£780), a huge basket (£780 avg), and 2,900 loyalty points. They look like a VIP or possibly a data entry error — we'll find out in the outlier step.

Step 1 — The Quick Summary

The fastest way to get an overview of all numeric columns at once is .describe(). It gives you count, mean, min, max, and the three quartiles (25%, 50%, 75%) in one table. Think of it as a medical check-up — one glance tells you where the problems might be.

# .describe() produces a statistical summary of every numeric column
# .T transposes the table — swaps rows and columns so it's easier to read
print(df.describe().T.round(1))

How to read this table

pandas' .describe() computes all those statistics at once. The .T flips the table so each column becomes a row — easier to scan when you have many features.

Two red flags jump out immediately: weekly_spend has a mean of £150 but a median (50%) of only £80.50, so the mean is being pulled up by high spenders. And avg_basket has a max of £780, clearly that same outlier customer. The visits_pw count is 13 (not 14): one customer's value is missing. These three signals come from one table in one second.
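That lower count is the only clue .describe() gives about missing data. If you want the missing counts spelled out per column, .isna().sum() does it directly. A quick sketch on a stand-in frame with the same visits and basket values:

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    'visits_pw':  [2, 4, 1, 5, 3, 3, 2, 6, 3, 4, 2, 3, 2, np.nan],
    'avg_basket': [22, 30, 38, 42, 22, 30, 26, 57, 24, 24, 21, 29, 31, 780],
})

# .isna() marks each empty cell True; .sum() counts the Trues per column
print(demo.isna().sum())
```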

Step 2 — Mean vs Median: Spotting the Skew Quickly

Here's a quick trick every analyst should know: when the mean is much higher than the median, the data is right-skewed. A few very high values are pulling the mean up, but the median (the middle value) stays put. It's like how one billionaire in a room raises the average wealth dramatically — but the median barely moves.
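A minimal sketch of the billionaire effect, with made-up wealth figures (in £k):

```python
import pandas as pd

room = pd.Series([30, 35, 40, 45, 50])          # five ordinary people
print(room.mean(), room.median())               # 40.0 40.0: mean equals median

room_vip = pd.Series([30, 35, 40, 45, 50, 1_000_000])   # a billionaire walks in
print(room_vip.mean(), room_vip.median())       # mean jumps to 166700.0; median only 42.5
```

One new value moved the mean by four orders of magnitude while the median barely shifted. Now let's run the same comparison on our real columns: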

numeric_cols = ['weekly_spend', 'visits_pw', 'avg_basket', 'loyalty_points']

print(f"{'Column':<18} {'Mean':>8} {'Median':>8}  {'Gap':>8}  Verdict")
print("-" * 62)

for col in numeric_cols:
    mean   = df[col].mean()    # .mean() adds up all values and divides by count
    median = df[col].median()  # .median() finds the exact middle value
    gap    = mean - median     # a big positive gap = right skew (high outliers pulling mean up)

    # Flag columns where mean is more than 20% higher than median
    if gap > median * 0.2:
        verdict = "⚠ Right-skewed — high outliers present"
    elif gap < -median * 0.2:
        verdict = "⚠ Left-skewed — low outliers present"
    else:
        verdict = "✓ Roughly symmetric"

    print(f"{col:<18} {mean:>8.1f} {median:>8.1f}  {gap:>+8.1f}  {verdict}")

What just happened?

pandas' .mean() and .median() are called on each column inside a loop. The "gap" check is just subtraction — no complex formula needed. Three columns are flagged as right-skewed, all driven by that one extreme customer (customer 14). Only visits_pw is clean. This tells us three columns will likely need a log transform before modelling.

Step 3 — Finding Outliers With the IQR Rule

The IQR rule is the most common way to detect outliers. Here's the idea: look at the middle 50% of your data (the IQR — InterQuartile Range). Any value that sits more than 1.5 × IQR below the bottom of that middle range, or above the top, is flagged as an outlier. Think of it as drawing a fence around the "normal" zone — anything outside the fence gets a flag.

The IQR fence in plain English:

① Find Q1 (the value at the 25% mark) and Q3 (the value at the 75% mark)
② IQR = Q3 − Q1 (the width of the middle 50%)
③ Lower fence = Q1 − 1.5 × IQR
④ Upper fence = Q3 + 1.5 × IQR
⑤ Any value outside those fences = outlier candidate

Let's apply this to the weekly_spend column:

col = 'weekly_spend'

# Step 1: Find Q1 and Q3
Q1 = df[col].quantile(0.25)   # the value that 25% of customers are below
Q3 = df[col].quantile(0.75)   # the value that 75% of customers are below

# Step 2: Calculate the IQR (the middle 50% range)
IQR = Q3 - Q1

# Step 3: Set the fences
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

print(f"Q1 = £{Q1}   Q3 = £{Q3}   IQR = £{IQR}")
print(f"Lower fence: £{lower_fence:.1f}   Upper fence: £{upper_fence:.1f}")
print()

# Step 4: Find customers outside the fences
outliers = df[(df[col] < lower_fence) | (df[col] > upper_fence)]
#              ^^^ below lower fence  OR  ^^^ above upper fence

print(f"Outlier customers (outside the fence):")
print(outliers[['customer_id', 'weekly_spend']])

What just happened?

pandas' .quantile(0.25) and .quantile(0.75) find Q1 and Q3. We then use simple maths to build the fences. The boolean filter (df[col] < lower_fence) | (df[col] > upper_fence) returns only the rows outside those fences. Customers 4 (£210), 8 (£340) and 14 (£780) are confirmed outliers: all three sit above the upper fence of £203, spending far more than the typical customer. Whether to remove them or keep them depends on the business context, but now we know who they are.

Step 4 — Checking All Columns for Outliers at Once

Running the IQR check column by column is fine for a quick look. But you want a summary table that covers all four numeric columns in one go. Here's a compact function that does exactly that:

def outlier_summary(dataframe, columns):
    """For each column, count how many outliers exist using the IQR rule."""
    rows = []
    for col in columns:
        s  = dataframe[col].dropna()   # ignore missing values
        Q1 = s.quantile(0.25)
        Q3 = s.quantile(0.75)
        IQR = Q3 - Q1
        n_out = ((s < Q1 - 1.5*IQR) | (s > Q3 + 1.5*IQR)).sum()
        rows.append({
            'column':    col,
            'n_outliers': int(n_out),
            'pct':       f"{n_out/len(s)*100:.0f}%",
            'max_value': round(s.max(), 1),
            'upper_fence': round(Q3 + 1.5*IQR, 1)
        })
    return pd.DataFrame(rows).set_index('column')

print(outlier_summary(df, numeric_cols))

What just happened?

pandas' .dropna() strips missing values first, so the NaN in visits_pw can't distort the percentage (len(s) counts only real values). The function runs the same IQR logic for every column and collects the results into a neat summary DataFrame. weekly_spend has 3 outliers (21% of rows), avg_basket and loyalty_points have 2 each (14%), and visits_pw has none. Those are high enough rates to investigate before modelling.

Step 5 — Distribution Shape: Binning Into Buckets

Numbers alone don't show shape. To see the shape, you need to group values into buckets and count how many fall in each one. This is called a histogram. We'll build a text version so you can see the pattern without needing a chart library.

# pd.cut() divides the column into equal-width buckets
# We use 5 buckets — you can change this to any number
spend_buckets = pd.cut(df['weekly_spend'], bins=5)

# Count how many customers fall into each bucket
bucket_counts = spend_buckets.value_counts().sort_index()

print("Weekly Spend Distribution:")
print()
for bucket, count in bucket_counts.items():
    bar = '█' * count              # one block per customer
    print(f"  {str(bucket):<25}  {bar}  ({count})")

What just happened?

pandas' pd.cut() divides the column into equal-width intervals and assigns each customer to a bucket. The text bar chart makes the shape immediately visible: 11 of 14 customers cluster in the lowest spending band (roughly £38–£186). The remaining three (£210, £340 and £780) trail off through the higher bands, with the top spender isolated in the highest one. This is a textbook right-skewed distribution.

Step 6 — The Full Numeric Profile in One Table

All the steps above together form a numeric profile — a single table that captures everything worth knowing about each column. This is the output you hand to your team before modelling begins.

def numeric_profile(dataframe, columns):
    """One-table summary of every numeric column — centre, spread, shape, outliers, missing."""
    rows = []
    for col in columns:
        s    = dataframe[col].dropna()
        Q1   = s.quantile(0.25)
        Q3   = s.quantile(0.75)
        IQR  = Q3 - Q1
        n_out = ((s < Q1-1.5*IQR) | (s > Q3+1.5*IQR)).sum()
        miss  = dataframe[col].isna().sum()

        # Skew flag: if mean is > 20% higher than median → right skew
        skew_flag = "Right ⚠" if s.mean() > s.median()*1.2 else (
                    "Left ⚠"  if s.mean() < s.median()*0.8 else "OK ✓")

        rows.append({
            'column':      col,
            'mean':        round(s.mean(), 1),
            'median':      round(s.median(), 1),
            'std':         round(s.std(), 1),
            'min':         round(s.min(), 1),
            'max':         round(s.max(), 1),
            'outliers':    int(n_out),
            'missing':     int(miss),
            'skew':        skew_flag
        })
    return pd.DataFrame(rows).set_index('column')

print(numeric_profile(df, numeric_cols))

What just happened?

pandas does all the statistical work inside a loop — mean, median, std, quantiles, boolean filters for outliers, and .isna().sum() for missing values. We collect every result into a dictionary and turn the list of dictionaries into a final DataFrame. The result: one clean table that answers all five questions (centre, spread, shape, outliers, missing) for every column at once. This is the file you'd attach to a Slack message to your data science team.

What the Numbers Are Telling Us

Before handing this off to the modelling team, here's the plain-English summary of findings:

weekly_spend, avg_basket, loyalty_points are all right-skewed. One customer (customer 14) is driving extreme values across all three columns. Consider a log transform before modelling, or treat this customer as a separate VIP segment.
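A log transform is one common fix for this kind of skew. Here's a sketch using numpy's log1p (the log of 1 + x, which stays safe at zero) on the weekly spend values:

```python
import numpy as np
import pandas as pd

spend = pd.Series([45, 120, 38, 210, 67, 89, 52, 340, 73, 95, 41, 88, 61, 780])

log_spend = np.log1p(spend)   # compresses the top end far more than the bottom

print(f"skew before: {spend.skew():.2f}")
print(f"skew after:  {log_spend.skew():.2f}")   # noticeably closer to symmetric
```

The transform doesn't delete the VIP customer; it just stops their £780 from dominating every distance calculation a model might make.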

visits_pw is symmetric and well-behaved. Mean (3.1) and median (3) are almost identical, and the IQR rule flags no outliers: even the heavy shopper who visits 6 times a week sits inside the upper fence of 7. 1 missing value to impute.
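For that missing value, filling with the column median is a common, outlier-safe choice. A sketch on the visits values alone:

```python
import pandas as pd
import numpy as np

visits = pd.Series([2, 4, 1, 5, 3, 3, 2, 6, 3, 4, 2, 3, 2, np.nan])

# .median() ignores NaN by default and is unaffected by extreme values
visits_filled = visits.fillna(visits.median())

print(visits_filled.isna().sum())   # 0: nothing missing any more
print(visits_filled.iloc[-1])       # 3.0: customer 14 gets the typical visit count
```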

🔍 Customer 14 shows up as an outlier in 3 out of 4 columns. This warrants a data quality check: it could be a high-value B2B customer, a data entry error, or a VIP household account. Investigate before modelling.

Teacher's Note

The mean is famous but fragile. One extreme value can pull it far from where most of your data actually sits. That's why you always compare mean to median. If they're close — great. If they're far apart — you have skew, and probably an outlier worth investigating.

Don't just flag outliers and remove them. Ask why they exist first. A £780 weekly spend could be a data error, or it could be your most profitable customer. Deleting them blindly could be a very expensive mistake.
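One safer pattern: keep every row and record the verdict in a flag column, so the decision travels with the data instead of silently disappearing. A sketch using the weekly spend data:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id':  range(1, 15),
    'weekly_spend': [45, 120, 38, 210, 67, 89, 52, 340, 73, 95, 41, 88, 61, 780],
})

Q1, Q3 = df['weekly_spend'].quantile([0.25, 0.75])
IQR = Q3 - Q1

# Flag instead of drop: no data is lost, and downstream users see the reasoning
df['spend_outlier'] = ((df['weekly_spend'] < Q1 - 1.5 * IQR) |
                       (df['weekly_spend'] > Q3 + 1.5 * IQR))

print(df.loc[df['spend_outlier'], ['customer_id', 'weekly_spend']])
```

Now the modelling team can choose to exclude, cap, or segment the flagged customers, and the choice stays auditable.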

Practice Questions

1. A column has a mean of £500 and a median of £120. Is this column left-skewed, right-skewed, or symmetric?



2. What is the name of the value you get by subtracting Q1 from Q3 — the width of the middle 50% of your data?



3. When calculating statistics on a column that contains missing values, which pandas method removes those missing values first so they don't break the calculation?



Quiz

1. The IQR rule flags one customer's weekly spend as an outlier. What is the best first action?


2. You want to report the "typical" weekly spend to your manager in a right-skewed dataset. Which measure should you use?


3. Which pandas function do you use to group continuous values into equal-width intervals so you can see the distribution shape?


Up Next · Lesson 24

Group-By Summaries

Learn to slice your data by any category and compare groups — the technique that turns raw numbers into business insights in minutes.