EDA Course
Numerical Exploration
Numbers look simple — but a column full of numbers hides a lot of stories. Is the data bunched together or spread out? Are there a few extreme values pulling the average up? Is the shape lopsided? This lesson gives you a step-by-step process to read any numeric column like a pro.
What Are We Actually Looking For?
When you explore a numeric column, you're trying to answer five questions:
① Where is the centre?
What's the typical value? Mean, median, mode.
② How spread out is it?
Are values tightly packed or all over the place? Std dev, IQR.
③ What is the shape?
Symmetric bell curve, or lopsided? Skewness.
④ Are there outliers?
Extreme values that don't fit the pattern. IQR rule.
⑤ Is anything missing?
How many rows have no value at all? Missing rate.
The Dataset We'll Use
The scenario: You're a data analyst at a supermarket chain. Your manager wants a full numerical profile of customer spending behaviour before the loyalty team builds a segmentation model. You have data on 14 customers — their weekly spend, number of visits, basket size, and loyalty points. Let's explore each numeric column step by step.
First, create the dataset and take a peek:
import pandas as pd
import numpy as np

# Supermarket customer data — 14 customers, 4 numeric columns
df = pd.DataFrame({
    'customer_id': range(1, 15),
    # Customers 8 and 14 look unusually high — keep an eye on those
    'weekly_spend': [45, 120, 38, 210, 67, 89, 52, 340, 73, 95, 41, 88, 61, 780],
    # Customer 14 has a missing visit count — np.nan means "no data"
    'visits_pw': [2, 4, 1, 5, 3, 3, 2, 6, 3, 4, 2, 3, 2, np.nan],
    'avg_basket': [22, 30, 38, 42, 22, 30, 26, 57, 24, 24, 21, 29, 31, 780],
    'loyalty_points': [120, 450, 95, 820, 200, 310, 140, 1200, 230, 380, 110, 290, 180, 2900]
})

print(df.to_string(index=False))
customer_id weekly_spend visits_pw avg_basket loyalty_points
1 45 2.0 22 120
2 120 4.0 30 450
3 38 1.0 38 95
4 210 5.0 42 820
5 67 3.0 22 200
6 89 3.0 30 310
7 52 2.0 26 140
8 340 6.0 57 1200
9 73 3.0 24 230
10 95 4.0 24 380
11 41 2.0 21 110
12 88 3.0 29 290
13 61 2.0 31 180
14 780 NaN 780 2900
What just happened?
pandas creates and stores the data table. numpy supplies np.nan — a special value that means "this cell is empty." Even before running any analysis, you can eyeball the data: customer 14 has very high spend (£780), a huge basket (£780 avg), and 2,900 loyalty points. They look like a VIP or possibly a data entry error — we'll find out in the outlier step.
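If np.nan is new to you, two behaviours are worth knowing before we go further: it spreads through ordinary arithmetic, but pandas' summary statistics skip it by default. A quick demonstration with made-up values:

```python
import numpy as np
import pandas as pd

print(np.nan + 1)              # arithmetic with NaN stays NaN
s = pd.Series([2, 4, np.nan])
print(s.mean())                # pandas skips NaN by default: (2 + 4) / 2 = 3.0
print(s.mean(skipna=False))    # force NaN to count — the result is NaN
print(s.isna())                # True only where data is missing
```

This is why the describe() counts in the next step can differ between columns — NaN cells are silently excluded.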
Step 1 — The Quick Summary
The fastest way to get an overview of all numeric columns at once is .describe(). It gives you count, mean, min, max, and the three quartiles (25%, 50%, 75%) in one table. Think of it as a medical check-up — one glance tells you where the problems might be.
# .describe() produces a statistical summary of every numeric column
# .T transposes the table — swaps rows and columns so it's easier to read
print(df.describe().T.round(1))
                count   mean    std   min    25%    50%    75%     max
customer_id      14.0    7.5    4.2   1.0    4.2    7.5   10.8    14.0
weekly_spend     14.0  149.9  198.8  38.0   54.2   80.5  113.8   780.0
visits_pw        13.0    3.1    1.4   1.0    2.0    3.0    4.0     6.0
avg_basket       14.0   84.0  200.6  21.0   24.0   29.5   36.2   780.0
loyalty_points   14.0  530.4  748.9  95.0  150.0  260.0  432.5  2900.0
How to read this table
pandas' .describe() computes all those statistics at once. The .T flips the table so each column becomes a row — easier to scan when you have many features.
Three red flags jump out immediately: weekly_spend has a mean of about £150 but a median (50%) of only £80.5 — the mean is being pulled up by high spenders. avg_basket has a max of £780, which is clearly that same outlier customer. And the visits_pw count is 13 (not 14) — one customer is missing a value. These three signals come from one table in one second.
Step 2 — Mean vs Median: Spotting the Skew Quickly
Here's a quick trick every analyst should know: when the mean is much higher than the median, the data is right-skewed. A few very high values are pulling the mean up, but the median (the middle value) stays put. It's like how one billionaire in a room raises the average wealth dramatically — but the median barely moves.
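You can see this robustness in a few lines (toy wealth numbers, not the supermarket data):

```python
import pandas as pd

wealth = pd.Series([30_000, 35_000, 40_000, 45_000, 50_000])
print(wealth.mean(), wealth.median())        # 40000.0 40000.0 — symmetric

# Add one billionaire to the room
room = pd.concat([wealth, pd.Series([1_000_000_000])], ignore_index=True)
print(room.mean())      # the mean explodes past 166 million
print(room.median())    # the median only shifts to 42500.0
```

Now let's run the same mean-vs-median comparison on every column of our dataset: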
numeric_cols = ['weekly_spend', 'visits_pw', 'avg_basket', 'loyalty_points']

print(f"{'Column':<18} {'Mean':>8} {'Median':>8} {'Gap':>8} Verdict")
print("-" * 62)
for col in numeric_cols:
    mean = df[col].mean()      # .mean() adds up all values and divides by count
    median = df[col].median()  # .median() finds the exact middle value
    gap = mean - median        # a big positive gap = right skew (high outliers pulling the mean up)
    # Flag columns where the mean is more than 20% away from the median
    if gap > median * 0.2:
        verdict = "⚠ Right-skewed — high outliers present"
    elif gap < -median * 0.2:
        verdict = "⚠ Left-skewed — low outliers present"
    else:
        verdict = "✓ Roughly symmetric"
    print(f"{col:<18} {mean:>8.1f} {median:>8.1f} {gap:>+8.1f} {verdict}")
Column                 Mean   Median      Gap Verdict
--------------------------------------------------------------
weekly_spend          149.9     80.5    +69.4 ⚠ Right-skewed — high outliers present
visits_pw               3.1      3.0     +0.1 ✓ Roughly symmetric
avg_basket             84.0     29.5    +54.5 ⚠ Right-skewed — high outliers present
loyalty_points        530.4    260.0   +270.4 ⚠ Right-skewed — high outliers present
What just happened?
pandas' .mean() and .median() are called on each column inside a loop. The "gap" check is just subtraction — no complex formula needed. Three columns are flagged as right-skewed, all driven by that one extreme customer (customer 14). Only visits_pw is clean. This tells us three columns will likely need a log transform before modelling.
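If you do reach for a log transform later, np.log1p (log of 1 + x, safe at zero) shrinks the mean–median gap dramatically. A quick sketch on the weekly_spend values (repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

spend = pd.Series([45, 120, 38, 210, 67, 89, 52, 340, 73, 95, 41, 88, 61, 780])
logged = np.log1p(spend)   # log(1 + x) — safe even if a value is 0

print(f"raw:    mean {spend.mean():.1f}  median {spend.median():.1f}")
print(f"logged: mean {logged.mean():.2f}  median {logged.median():.2f}")
# After the transform the mean and median sit much closer together —
# the 20% skew check above would now call this column roughly symmetric.
```

The transform doesn't delete the outliers; it compresses their leverage so they no longer dominate the column's statistics.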
Step 3 — Finding Outliers With the IQR Rule
The IQR rule is the most common way to detect outliers. Here's the idea: look at the middle 50% of your data (the IQR — InterQuartile Range). Any value that sits more than 1.5 × IQR below the bottom of that middle range, or above the top, is flagged as an outlier. Think of it as drawing a fence around the "normal" zone — anything outside the fence gets a flag.
The IQR fence in plain English: lower fence = Q1 − 1.5 × IQR, upper fence = Q3 + 1.5 × IQR. Anything below the lower fence or above the upper fence gets flagged.
Let's apply this to the weekly_spend column:
col = 'weekly_spend'
# Step 1: Find Q1 and Q3
Q1 = df[col].quantile(0.25) # the value that 25% of customers are below
Q3 = df[col].quantile(0.75) # the value that 75% of customers are below
# Step 2: Calculate the IQR (the middle 50% range)
IQR = Q3 - Q1
# Step 3: Set the fences
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
print(f"Q1 = £{Q1} Q3 = £{Q3} IQR = £{IQR}")
print(f"Lower fence: £{lower_fence:.1f} Upper fence: £{upper_fence:.1f}")
print()
# Step 4: Find customers outside the fences
outliers = df[(df[col] < lower_fence) | (df[col] > upper_fence)]
# ^^^ below lower fence OR ^^^ above upper fence
print(f"Outlier customers (outside the fence):")
print(outliers[['customer_id', 'weekly_spend']])
Q1 = £54.25 Q3 = £113.75 IQR = £59.5
Lower fence: £-35.0 Upper fence: £203.0
Outlier customers (outside the fence):
customer_id weekly_spend
7 8 340
13 14 780
What just happened?
pandas' .quantile(0.25) and .quantile(0.75) find Q1 and Q3. We then use simple maths to build the fences. The boolean filter (df[col] < lower_fence) | (df[col] > upper_fence) returns only the rows outside those fences. Customers 8 (£340) and 14 (£780) are confirmed outliers — both spend far more than the typical customer. Whether to remove them or keep them depends on the business context — but now we know who they are.
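One common middle ground — if you decide the outliers are genuine but don't want them to dominate a model — is to cap values at the upper fence instead of deleting rows. A sketch using .clip() (one option among several, not a universal fix; the spend values are repeated so the snippet runs on its own):

```python
import pandas as pd

spend = pd.Series([45, 120, 38, 210, 67, 89, 52, 340, 73, 95, 41, 88, 61, 780])
Q1, Q3 = spend.quantile(0.25), spend.quantile(0.75)
upper_fence = Q3 + 1.5 * (Q3 - Q1)

# .clip() caps anything above the fence at the fence — rows are kept, not deleted
capped = spend.clip(upper=upper_fence)
print(f"Before: max £{spend.max():.0f}   After capping: max £{capped.max():.2f}")
```

Capping preserves the ranking of customers ("these two spend the most") while taming their leverage on means and model coefficients.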
Step 4 — Checking All Columns for Outliers at Once
Running the IQR check column by column is fine for a quick look. But you want a summary table that covers all four numeric columns in one go. Here's a compact function that does exactly that:
def outlier_summary(dataframe, columns):
    """For each column, count how many outliers exist using the IQR rule."""
    rows = []
    for col in columns:
        s = dataframe[col].dropna()  # ignore missing values
        Q1 = s.quantile(0.25)
        Q3 = s.quantile(0.75)
        IQR = Q3 - Q1
        n_out = ((s < Q1 - 1.5*IQR) | (s > Q3 + 1.5*IQR)).sum()
        rows.append({
            'column': col,
            'n_outliers': int(n_out),
            'pct': f"{n_out/len(s)*100:.0f}%",
            'max_value': round(s.max(), 1),
            'upper_fence': round(Q3 + 1.5*IQR, 1)
        })
    return pd.DataFrame(rows).set_index('column')

print(outlier_summary(df, numeric_cols))
                n_outliers  pct  max_value  upper_fence
column
weekly_spend             2  14%      780.0        203.0
visits_pw                0   0%        6.0          7.0
avg_basket               2  14%      780.0         54.6
loyalty_points           2  14%     2900.0        856.2
What just happened?
pandas' .dropna() skips missing values before doing the IQR maths — otherwise the NaN in visits_pw would break the calculation. The function runs the same IQR logic for every column and collects the results into a neat summary DataFrame. Three columns each have 2 outliers (14% of rows) — a high enough rate to investigate before modelling. Only visits_pw is clean: even its heaviest shopper (6 visits a week) stays inside the fence.
Step 5 — Distribution Shape: Binning Into Buckets
Numbers alone don't show shape. To see the shape, you need to group values into buckets and count how many fall in each one. This is called a histogram. We'll build a text version so you can see the pattern without needing a chart library.
# pd.cut() divides the column into equal-width buckets
# We use 5 buckets — you can change this to any number
spend_buckets = pd.cut(df['weekly_spend'], bins=5)

# Count how many customers fall into each bucket
bucket_counts = spend_buckets.value_counts().sort_index()

print("Weekly Spend Distribution:")
print()
for bucket, count in bucket_counts.items():
    bar = '█' * count  # one block per customer
    print(f" {str(bucket):<25} {bar} ({count})")
Weekly Spend Distribution:

 (37.258, 186.4]           ███████████ (11)
 (186.4, 334.8]            █ (1)
 (334.8, 483.2]            █ (1)
 (483.2, 631.6]             (0)
 (631.6, 780.0]            █ (1)
What just happened?
pandas' pd.cut() divides the column into equal-width intervals and assigns each customer to a bucket. The text bar chart makes the shape immediately visible: 11 of 14 customers cluster in the lowest spending band (£38–£186). The remaining three sit far to the right, with the £780 customer isolated in the top band. This is a textbook right-skewed distribution.
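Equal-width buckets are sensitive to outliers — one extreme value stretches the bins so most of the data piles into the first one, exactly as above. An alternative worth knowing is pd.qcut, which builds equal-count buckets instead (spend values repeated so the snippet runs on its own):

```python
import pandas as pd

spend = pd.Series([45, 120, 38, 210, 67, 89, 52, 340, 73, 95, 41, 88, 61, 780])

# pd.qcut() splits on quantiles — each bucket holds roughly the same
# number of customers, so the bin edges adapt to the data
quartile_buckets = pd.qcut(spend, q=4)
print(quartile_buckets.value_counts().sort_index())
```

With quartile buckets you see where the cut-points fall (useful for defining spend tiers), whereas equal-width buckets are better for seeing the raw shape.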
Step 6 — The Full Numeric Profile in One Table
All the steps above together form a numeric profile — a single table that captures everything worth knowing about each column. This is the output you hand to your team before modelling begins.
def numeric_profile(dataframe, columns):
    """One-table summary of every numeric column — centre, spread, shape, outliers, missing."""
    rows = []
    for col in columns:
        s = dataframe[col].dropna()
        Q1 = s.quantile(0.25)
        Q3 = s.quantile(0.75)
        IQR = Q3 - Q1
        n_out = ((s < Q1 - 1.5*IQR) | (s > Q3 + 1.5*IQR)).sum()
        miss = dataframe[col].isna().sum()
        # Skew flag: if the mean is > 20% higher than the median → right skew
        skew_flag = "Right ⚠" if s.mean() > s.median()*1.2 else (
            "Left ⚠" if s.mean() < s.median()*0.8 else "OK ✓")
        rows.append({
            'column': col,
            'mean': round(s.mean(), 1),
            'median': round(s.median(), 1),
            'std': round(s.std(), 1),
            'min': round(s.min(), 1),
            'max': round(s.max(), 1),
            'outliers': int(n_out),
            'missing': int(miss),
            'skew': skew_flag
        })
    return pd.DataFrame(rows).set_index('column')

print(numeric_profile(df, numeric_cols))
                 mean  median    std   min     max  outliers  missing     skew
column
weekly_spend    149.9    80.5  198.8  38.0   780.0         2        0  Right ⚠
visits_pw         3.1     3.0    1.4   1.0     6.0         0        1     OK ✓
avg_basket       84.0    29.5  200.6  21.0   780.0         2        0  Right ⚠
loyalty_points  530.4   260.0  748.9  95.0  2900.0         2        0  Right ⚠
What just happened?
pandas does all the statistical work inside a loop — mean, median, std, quantiles, boolean filters for outliers, and .isna().sum() for missing values. We collect every result into a dictionary and turn the list of dictionaries into a final DataFrame. The result: one clean table that answers all five questions (centre, spread, shape, outliers, missing) for every column at once. This is the file you'd attach to a Slack message to your data science team.
What the Numbers Are Telling Us
Before handing this off to the modelling team, here's the plain-English summary of findings:
weekly_spend, avg_basket, loyalty_points are all right-skewed. One customer (customer 14) is driving extreme values across all three columns. Consider a log transform before modelling, or treat this customer as a separate VIP segment.
visits_pw is symmetric and well-behaved. Mean ≈ Median ≈ 3, and no IQR outliers — even the heavy shopper who visits 6 times a week stays inside the fence. 1 missing value to impute.
Customer 14 shows up as an outlier in 3 out of 4 columns. This warrants a data quality check — could be a high-value B2B customer, a data entry error, or a VIP household account. Investigate before modelling.
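For the missing visits_pw value, one simple option is to fill with the column median, which resists the outliers we just found. This is a sketch only — the right imputation depends on the modelling plan (the visits values are repeated so the snippet runs on its own):

```python
import pandas as pd
import numpy as np

visits = pd.Series([2, 4, 1, 5, 3, 3, 2, 6, 3, 4, 2, 3, 2, np.nan])

# .fillna() with the median — the median skips NaN and resists outliers
filled = visits.fillna(visits.median())
print(f"Median used for the fill: {visits.median()}")
print(f"Missing before: {visits.isna().sum()}   after: {filled.isna().sum()}")
```

Using the mean instead would work here too (visits_pw is symmetric), but the median habit is safer — it gives sensible fills even on skewed columns like weekly_spend.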
Teacher's Note
The mean is famous but fragile. One extreme value can pull it far from where most of your data actually sits. That's why you always compare mean to median. If they're close — great. If they're far apart — you have skew, and probably an outlier worth investigating.
Don't just flag outliers and remove them. Ask why they exist first. A £780 weekly spend could be a data error, or it could be your most profitable customer. Deleting them blindly could be a very expensive mistake.
Practice Questions
1. A column has a mean of £500 and a median of £120. Is this column left-skewed, right-skewed, or symmetric?
2. What is the name of the value you get by subtracting Q1 from Q3 — the width of the middle 50% of your data?
3. When calculating statistics on a column that contains missing values, which pandas method removes those missing values first so they don't break the calculation?
Quiz
1. The IQR rule flags one customer's weekly spend as an outlier. What is the best first action?
2. You want to report the "typical" weekly spend to your manager in a right-skewed dataset. Which measure should you use?
3. Which pandas function do you use to group continuous values into equal-width intervals so you can see the distribution shape?
Up Next · Lesson 24
Group-By Summaries
Learn to slice your data by any category and compare groups — the technique that turns raw numbers into business insights in minutes.