EDA Course
Central Tendency — Mean, Median and Mode
Everyone knows what an "average" is. But average is actually three different things — and picking the wrong one is one of the most common mistakes in data reporting. This lesson shows you which to use and when.
The Problem With "Just Use the Average"
Here's a classic scenario. A news headline reads: "Average salary at Tech Company X is $160,000." Sounds great. You apply. You get hired. Your salary is $52,000.
Were they lying? Technically, no. The CEO earns $4.2 million. Ten senior engineers earn $180,000 each. Forty other employees earn $45,000–$65,000. Taking those forty at about $55,000 each, the payroll comes to roughly $8.2 million across 51 people, so the mean is around $160,000. But it tells you almost nothing about what you will actually earn.
This is why central tendency — the study of what's "typical" in your data — requires you to pick the right measure, not just the most convenient one.
Central tendency means finding a single number that best represents the centre of your data. The three tools you have are the mean (arithmetic average), the median (middle value), and the mode (most frequent value). They each tell a different story about the same data.
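As a quick illustration of how the three measures can disagree on the same data, here is a minimal sketch using Python's standard-library statistics module. The five-value list is a made-up sample, not one of the lesson's datasets.

```python
import statistics

# Hypothetical sample, chosen so that one large value pulls the mean upward
values = [2, 3, 3, 5, 12]

mean_value = statistics.mean(values)      # sum of the values divided by how many there are
median_value = statistics.median(values)  # middle value of the sorted list
mode_value = statistics.mode(values)      # value that appears most often

print(f"Mean  : {mean_value}")
print(f"Median: {median_value}")
print(f"Mode  : {mode_value}")
```

The single large value (12) lifts the mean above both the median and the mode, which is the pattern the rest of this lesson keeps returning to.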
Mean — The Classic Average
The mean is what everyone calls "the average." Add up all values, divide by the count. Simple, fast, and genuinely useful — when your data doesn't have extreme outliers dragging it in one direction.
The scenario: You're analysing weekly order values for a small e-commerce store. Your manager wants to know the average order value to set a monthly revenue target. Let's calculate it and see what the mean actually tells us.
import pandas as pd
# Weekly order values for an e-commerce store — 10 orders this week
orders_df = pd.DataFrame({
'order_id': [3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010],
'customer': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
'order_value': [45.00, 120.00, 89.00, 34.00, 210.00, 67.00, 95.00, 52.00, 78.00, 110.00]
})
# .mean() calculates the arithmetic average — sum of all values divided by count
mean_order = orders_df['order_value'].mean()
print(f"Mean order value: ${mean_order:.2f}")
# Show what's happening under the hood — .mean() is just sum / count
total = orders_df['order_value'].sum() # add up every order value
count = orders_df['order_value'].count() # how many orders exist
manual = total / count # this is literally what .mean() computes
print(f"\nTotal revenue : ${total:.2f}")
print(f"Number of orders: {count}")
print(f"Manual mean : ${manual:.2f}") # same answer — no magic
Mean order value: $90.00

Total revenue : $900.00
Number of orders: 10
Manual mean : $90.00
What just happened?
.mean() is a pandas Series method. It computes the arithmetic mean — sum divided by count. We also showed the manual version using .sum() and .count() to confirm there's no magic under the hood.
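One detail worth knowing about the "sum divided by count" description: by default, pandas skips missing values in .sum(), .count() and .mean(), so the count is the number of non-missing entries rather than the number of rows. The sketch below uses a small made-up Series with one missing value to show the difference.

```python
import pandas as pd

# Hypothetical order values with one missing entry (None is stored as NaN)
order_values = pd.Series([10.0, 20.0, None, 30.0])

total = order_values.sum()          # NaN is skipped by default
non_missing = order_values.count()  # counts non-missing values only
rows = len(order_values)            # counts every row, including the NaN

print(f"mean()          : {order_values.mean()}")
print(f"sum() / count() : {total / non_missing}")  # matches mean()
print(f"sum() / len()   : {total / rows}")         # does not match mean()
```

If a column has gaps, dividing by len() instead of .count() will quietly understate the mean.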
The mean order value is $90.00. This is a fair number here — orders range from $34 to $210 with no wild extremes dragging it. The mean is a reasonable target to give your manager. But watch what happens when one big order enters the picture.
What One Outlier Does to the Mean
Same ten orders. But now a corporate client drops a $4,500 bulk order. One order. Watch how fast the mean breaks.
import pandas as pd
# Same orders — but MegaCorp places a huge $4,500 bulk order this week
orders_with_outlier = pd.DataFrame({
'order_id': [3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010, 3011],
'customer': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack', 'MegaCorp'],
'order_value': [45.00, 120.00, 89.00, 34.00, 210.00, 67.00, 95.00, 52.00, 78.00, 110.00, 4500.00]
})
# Calculate mean with the outlier included
mean_val = orders_with_outlier['order_value'].mean()
# Calculate median — the middle value, much less sensitive to outliers
median_val = orders_with_outlier['order_value'].median()
print(f"Mean order value : ${mean_val:.2f}")
print(f"Median order value: ${median_val:.2f}")
print(f"\nDifference : ${mean_val - median_val:.2f}")
print("\nThe mean jumped from $90 to $491 because of one order.")
print("The median barely moved — it's still near the original data.")
Mean order value : $490.91
Median order value: $89.00

Difference : $401.91

The mean jumped from $90 to $491 because of one order.
The median barely moved — it's still near the original data.
This is the core lesson.
One corporate order inflated the mean from $90 to $491. If you reported $491 as the "average order value" and set team targets based on it, every rep would miss in every month without a bulk order. The median moved only from $83.50 to $89, because it depends on the middle position, not on the size of the extremes. When your data has outliers, the median is almost always the more honest number to report.
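To put the before and after side by side, here is a small cross-check that uses only the standard-library statistics module instead of pandas. The order values are the lesson's own; the variable names are just for this sketch.

```python
import statistics

# The ten original order values from the lesson
orders = [45.00, 120.00, 89.00, 34.00, 210.00, 67.00, 95.00, 52.00, 78.00, 110.00]
# The same orders plus the $4,500 bulk order
orders_with_outlier = orders + [4500.00]

for label, data in [("Without outlier", orders), ("With outlier", orders_with_outlier)]:
    mean_value = statistics.mean(data)
    median_value = statistics.median(data)
    print(f"{label:<16} mean = ${mean_value:.2f}   median = ${median_value:.2f}")
```

The mean moves by about $400 while the median shifts by $5.50, from $83.50 to $89.00.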
Median — The Honest Middle
The median is the middle value when your data is sorted. With 11 values it's the 6th one. With 10 values it's the average of the 5th and 6th. It doesn't care how extreme the outer values are — it just finds the centre position.
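The position rule above can be written out by hand. This is a minimal sketch of the idea, not how pandas implements .median() internally; manual_median is a name made up for this example.

```python
def manual_median(values):
    """Return the median by sorting and picking the middle position(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value (the 6th of 11)
        return ordered[middle]
    # Even count: average the two middle values (the 5th and 6th of 10)
    return (ordered[middle - 1] + ordered[middle]) / 2

eleven_orders = [34, 45, 52, 67, 78, 89, 95, 110, 120, 210, 4500]
ten_orders = eleven_orders[:-1]  # drop the 4500 outlier

print(manual_median(eleven_orders))  # 89
print(manual_median(ten_orders))     # 83.5
```

Notice that the value 4500 never enters the calculation directly; only its position in the sorted list matters.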
The scenario: You're writing a housing market report for a property magazine. Some houses are modest starter homes. Two are absolute mansions. The median is going to tell a far more honest story to potential buyers than the mean would.
import pandas as pd
# House prices in a neighbourhood — mostly modest homes, two luxury outliers
housing_df = pd.DataFrame({
'property_id': [501, 502, 503, 504, 505, 506, 507, 508, 509, 510],
'address': ['12 Oak St', '45 Pine Ave', '8 Elm Rd', '33 Maple Dr', '19 Birch Ln',
'72 Cedar Ct', '5 Willow Way', '88 Ash Blvd', '61 Walnut Pl', '14 Spruce Rd'],
'price': [280000, 315000, 295000, 340000, 260000,
310000, 275000, 1850000, 290000, 2400000] # last two are luxury mansions
})
# Sort prices so you can see the distribution clearly
print("Sorted prices:", sorted(housing_df['price'].tolist()))
# .mean() is pulled toward the two mansions
mean_price = housing_df['price'].mean()
# .median() returns the middle of the sorted values (here, the average of the 5th and 6th)
median_price = housing_df['price'].median()
print(f"\nMean house price : ${mean_price:,.0f}")
print(f"Median house price: ${median_price:,.0f}")
print(f"\nA typical buyer should expect to pay around ${median_price:,.0f}")
print(f"The mean of ${mean_price:,.0f} is misleading — pulled up by two mansions.")
Sorted prices: [260000, 275000, 280000, 290000, 295000, 310000, 315000, 340000, 1850000, 2400000]

Mean house price : $661,500
Median house price: $302,500

A typical buyer should expect to pay around $302,500
The mean of $661,500 is misleading — pulled up by two mansions.
What just happened?
.median() is a pandas Series method that sorts all values and returns the middle one. With 10 values, it averaged the 5th ($295,000) and 6th ($310,000) positions to get $302,500.
The mean ($661,500) is more than double the median ($302,500) — two mansions dragged it up. A buyer searching for a home in this area gets completely misled by the mean. This is exactly why governments and real estate bodies report median home prices, not mean home prices. Now you know why.
Mean vs Median — A Visual
Here's a chart mockup showing where the mean and median land on the housing price data. The blue bars are regular homes. The red bars are the mansions. Notice how the mean gets dragged far to the right while the median stays planted near where most homes actually are.
[Figure: Housing prices, showing where the mean and median land]
Mode — The Most Popular Value
The mode is the value that appears most often. It's less about "centre" and more about "most common." It's the go-to measure for categorical data (product types, colours, sizes, rating labels), where a mean doesn't make sense and a median only works if the categories have a natural order.
The scenario: You run a clothing store and want to know which T-shirt size to stock most of. You can't average clothing sizes. "Medium" and "Large" don't have a mathematical midpoint. But you can absolutely find the most popular one.
import pandas as pd
# T-shirt orders for the past two weeks — each row is one order
tshirt_df = pd.DataFrame({
'order_id': range(4001, 4021), # 20 orders
'size': ['M', 'L', 'M', 'XL', 'S', 'M', 'L', 'M', 'S', 'M',
'L', 'M', 'XL', 'M', 'L', 'S', 'M', 'M', 'L', 'XL'],
'colour': ['Black', 'White', 'Black', 'Blue', 'Black', 'White', 'Black', 'Blue', 'Black', 'White',
'Black', 'Black', 'White', 'Black', 'Blue', 'Black', 'White', 'Black', 'Blue', 'Black']
})
# .mode() returns the most frequently occurring value(s) as a Series
# If several values tie, all of them come back in sorted order; [0] takes the first
most_common_size = tshirt_df['size'].mode()[0]
most_common_colour = tshirt_df['colour'].mode()[0]
print(f"Most ordered size : {most_common_size}")
print(f"Most ordered colour: {most_common_colour}")
# .value_counts() shows full frequency table — great companion to mode
print("\nSize breakdown:")
print(tshirt_df['size'].value_counts())
print("\nColour breakdown:")
print(tshirt_df['colour'].value_counts())
Most ordered size : M
Most ordered colour: Black

Size breakdown:
size
M     9
L     5
XL    3
S     3
Name: count, dtype: int64

Colour breakdown:
colour
Black    11
White     5
Blue      4
Name: count, dtype: int64
What just happened?
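.mode() is a pandas Series method that returns the most frequently occurring value. It returns a Series (not a single value) because there can be ties: two or more values appearing equally often. When there is a tie, all tied values are returned in sorted order, so indexing with [0] gives the first of them, not necessarily the only mode. Here there is a single clear winner in each column, so [0] is all we need.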
.value_counts() is the natural companion — it shows the full frequency breakdown sorted from most to least common. Together, .mode() and .value_counts() are how you understand any categorical column fast. Stock decision? Order extra Medium Black T-shirts. Done.
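Because .mode() can return more than one value, it is worth checking for ties before trusting [0]. The sketch below uses a made-up list of sizes in which M and L are tied; it is an illustration, not part of the T-shirt dataset above.

```python
import pandas as pd

# Hypothetical sizes where M and L are tied at two orders each
sizes = pd.Series(['M', 'L', 'M', 'L', 'S'])

modes = sizes.mode()                          # Series holding every tied value, in sorted order
print("All modes      :", modes.tolist())
print("Tie?           :", len(modes) > 1)
print("First via [0]  :", modes[0])           # first tied value alphabetically, not a unique winner
```

When the tie check comes back True, reporting the full .value_counts() table is usually clearer than quoting a single mode.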
All Three Together — The Full Picture
The scenario: You're a data analyst at a gym chain. Management wants a complete summary of member age data before designing a new marketing campaign. Here's how you run all three measures together and actually interpret what they're telling you.
import pandas as pd
# Gym member ages — realistic distribution with a few older long-term members
gym_df = pd.DataFrame({
'member_id': range(5001, 5021), # 20 members
'name': ['Alex', 'Bella', 'Carlos', 'Diana', 'Ethan', 'Fiona', 'George', 'Hana',
'Ivan', 'Julia', 'Kevin', 'Laura', 'Mike', 'Nina', 'Oscar', 'Petra',
'Quinn', 'Rosa', 'Sam', 'Tina'],
'age': [24, 28, 22, 31, 26, 29, 24, 35, 27, 24,
33, 28, 62, 25, 29, 31, 24, 27, 58, 30],
'membership': ['Monthly', 'Annual', 'Monthly', 'Annual', 'Monthly', 'Annual', 'Monthly', 'Annual',
'Monthly', 'Monthly', 'Annual', 'Monthly', 'Annual', 'Monthly', 'Monthly', 'Annual',
'Monthly', 'Monthly', 'Annual', 'Monthly']
})
# Calculate all three central tendency measures for age
mean_age = gym_df['age'].mean() # arithmetic average
median_age = gym_df['age'].median() # middle value when sorted
mode_age = gym_df['age'].mode()[0] # most frequently occurring age
print(f"Mean age : {mean_age:.2f}")
print(f"Median age: {median_age}")
print(f"Mode age : {mode_age}")
# The gap between mean and median signals skew
# (the 2-year threshold is a rule of thumb for this dataset, not a general cutoff)
gap = mean_age - median_age
print(f"\nMean - Median gap: {gap:.2f}")
if gap > 2:
    print("Positive gap → distribution is right-skewed (a few older members pulling mean up)")
elif gap < -2:
    print("Negative gap → distribution is left-skewed")
else:
    print("Small gap → distribution is roughly symmetric")
# Most common membership type: mode on a categorical column
print(f"\nMost common membership: {gym_df['membership'].mode()[0]}")
Mean age : 30.85
Median age: 28.0
Mode age : 24

Mean - Median gap: 2.85
Positive gap → distribution is right-skewed (a few older members pulling mean up)

Most common membership: Monthly
What just happened?
Three pandas methods (.mean(), .median(), .mode()) each reveal something different. The mean is 30.85, the median is 28, and the mode is 24. That spread is telling a real story.
Two older members (62 and 58) are pulling the mean up. The mode of 24 shows the single most common age. The median of 28 is the honest middle. For the marketing campaign? Target the 22–30 age group, because 14 of your 20 members fall in it. If you'd just reported the mean of about 31, the campaign messaging would have aimed older than most of your members.
When to Use Each One
Here's the decision table you'll reach for on every project. Print it. Tattoo it. Whatever works.
| Measure | Use when | Avoid when | Real example |
|---|---|---|---|
| Mean | Data is symmetric, no extreme outliers | Outliers exist (salaries, prices, durations) | Average exam score in a class |
| Median | Data is skewed or has outliers | You need totals or want to combine group results (medians can't be summed or averaged across groups) | Median household income, house prices |
| Mode | Categorical data, or finding most common value | Data is continuous with no repeating values | Most popular product, most common age |
Teacher's Note
A quick gut-check you can always do: if the mean and median are far apart relative to the spread of the data, the distribution is probably skewed. The bigger the gap, the stronger the skew tends to be. If they're close together, your data is likely fairly symmetric and the mean is usually safe to report.
Salary data? Almost always skewed right, so default to the median. Exam scores? Usually symmetric, so the mean is fine. House prices? Skewed right, so use the median. Test scores in a class with one genius and one struggling student? Check the gap first. This one habit, always comparing mean and median before reporting, separates careful analysts from everyone else.
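The gut-check can be made scale-free by dividing the gap by the standard deviation, so that dollars and years are judged on the same footing. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, which is a swapped-in formalisation and not something this lesson defines. The helper name mean_median_gap is made up, and no particular cutoff value is implied.

```python
import statistics

def mean_median_gap(values):
    """Return (mean, median, Pearson's second skewness coefficient)."""
    mean_value = statistics.mean(values)
    median_value = statistics.median(values)
    spread = statistics.stdev(values)  # sample standard deviation
    skew = 3 * (mean_value - median_value) / spread
    return mean_value, median_value, skew

# Datasets reused from earlier in the lesson
house_prices = [280000, 315000, 295000, 340000, 260000,
                310000, 275000, 1850000, 290000, 2400000]
order_values = [45.00, 120.00, 89.00, 34.00, 210.00,
                67.00, 95.00, 52.00, 78.00, 110.00]

for label, data in [("House prices", house_prices), ("Order values", order_values)]:
    mean_value, median_value, skew = mean_median_gap(data)
    print(f"{label}: mean={mean_value:,.2f}  median={median_value:,.2f}  skew={skew:.2f}")
```

A positive coefficient points to a right skew. The house prices score well above the order values, matching the earlier conclusion that the median is the safer number for the housing report.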
Practice Questions
1. You are reporting typical house prices in a city. A few luxury penthouses exist that cost 10× the average home. Which measure of central tendency should you report?
2. You want to find the most frequently ordered product in a sales dataset. Which measure do you use?
3. The mean of a salary dataset is $95,000 and the median is $62,000. The mean is higher than the median. This tells you the distribution is ______.
Quiz
1. A company reports that its mean employee salary is $110,000 but the median is $58,000. What is the most likely explanation?
2. In pandas, which code gives you the most frequently occurring value in a column called df['size']?
3. Which of the following tells you that a dataset is roughly symmetric with no major outliers?