Data Science Lesson 14 – Summary Statistics | Dataplexa
Exploratory Data Analysis · Lesson 14

Summary Statistics

Master the five key measures that reveal your data's story — calculate central tendency, spread, and distribution shape to make confident business decisions.

1. Measure Central Tendency
2. Calculate Variability
3. Assess Distribution Shape
4. Make Data-Driven Decisions

What Summary Statistics Actually Tell You

Raw numbers mean nothing. Summary statistics extract meaning. Think about checking your bank account — you don't scan every transaction. You want the balance, average spending, and how much it varies month to month. That's exactly what summary statistics do for datasets.

Five core measures matter: mean (typical value), median (middle value), mode (most common), standard deviation (spread), and range (min to max). Honestly, these five solve 90% of your "what does this data look like" questions.
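Before touching the lesson dataset, here's a minimal sketch of all five measures in pandas. The order values below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical order values (not the lesson's CSV) — note the one extreme order
orders = pd.Series([500, 750, 750, 1200, 9800])

print("Mean:", orders.mean())                 # pulled upward by the 9,800 outlier
print("Median:", orders.median())             # middle value, robust to the outlier
print("Mode:", orders.mode()[0])              # most frequent value
print("Std dev:", round(orders.std(), 2))     # spread around the mean
print("Range:", orders.max() - orders.min())  # max minus min
```

Even on five numbers you can see the pattern the lesson keeps returning to: one extreme value drags the mean well above the median.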

Central Tendency

Where your data clusters — the "typical" customer, order, or rating

Spread

How scattered your data is — predictable or wildly variable

Shape

Symmetric bell curve or skewed toward high/low values

Outliers

Extreme values that might signal errors or golden opportunities

Central Tendency — Finding the "Typical"

The scenario: Flipkart's pricing team needs to understand typical order values across categories. They're setting discount thresholds and need to know what "normal" looks like for Electronics versus Food purchases.

import pandas as pd
import numpy as np

# Load the ecommerce data
df = pd.read_csv('dataplexa_ecommerce.csv')

# First look at our revenue data
print("Sample revenue data:")
print(df['revenue'].head(10))

What just happened?

We loaded our ecommerce dataset and examined the first 10 revenue values. Notice the range — from ₹5,400 to ₹67,800. That's a 12x difference! Try this: Run df['revenue'].describe() to see all summary stats at once.

Now calculate the three measures of central tendency. Each tells a different story about your "typical" customer.

# Calculate all three measures of central tendency
mean_revenue = df['revenue'].mean()
median_revenue = df['revenue'].median()
mode_revenue = df['revenue'].mode()[0]  # mode() returns a Series, take first value

print(f"Mean revenue: ₹{mean_revenue:,.2f}")
print(f"Median revenue: ₹{median_revenue:,.2f}")
print(f"Mode revenue: ₹{mode_revenue:,.2f}")

# Show the difference between mean and median
difference = mean_revenue - median_revenue
print(f"\nDifference (Mean - Median): ₹{difference:,.2f}")

What just happened?

Notice how mean > median > mode? That signals right-skewed data — a few high-value orders are pulling the average up. The ₹4,250 gap between mean and median is significant. Try this: Check if Electronics has more extreme values with df[df['product_category']=='Electronics']['revenue'].mean()

Key Insight: When to Use Which Measure

Mean: Use for symmetric data without outliers. Good for total calculations. Median: Better for skewed data or when outliers exist. More "typical" for real users. Mode: Most frequent value — useful for categorical data or finding the most common price point.
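One way to internalize this rule: watch what a single extreme order does to each measure. The numbers below are made up for the demonstration:

```python
import pandas as pd

# Five ordinary order values, then the same data plus one hypothetical outlier
base = pd.Series([100, 110, 120, 130, 140])
with_outlier = pd.concat([base, pd.Series([10_000])], ignore_index=True)

print(base.mean(), base.median())                  # identical: 120.0 120.0
print(with_outlier.mean(), with_outlier.median())  # mean jumps to ~1766.7, median only moves to 125.0
```

One outlier moved the mean by a factor of nearly 15 while the median barely shifted, which is exactly why the median is the safer "typical" value for skewed revenue data.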

Measuring Variability — How Spread Out Is Your Data?

Central tendency tells you where data clusters. But that's only half the story. Two datasets can have identical means but wildly different spreads. Amazon's book prices might average ₹500 with tight clustering. Electronics might also average ₹500 but range from ₹50 cables to ₹50,000 laptops.
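The books-versus-electronics claim is easy to verify with toy numbers. The price lists below are invented so that the means match exactly:

```python
import numpy as np

# Hypothetical price lists: identical means, very different spreads
books = np.array([450, 480, 500, 520, 550])
electronics = np.array([50, 200, 500, 800, 950])

print(books.mean(), electronics.mean())            # both 500.0
print(books.std(ddof=1), electronics.std(ddof=1))  # roughly 38 vs 382 — a 10x gap in spread
```

`ddof=1` gives the sample standard deviation, matching what pandas `.std()` computes by default.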

# Calculate measures of variability
revenue_std = df['revenue'].std()
revenue_var = df['revenue'].var()  # variance is in squared units (₹²), which is why std dev is easier to interpret
revenue_range = df['revenue'].max() - df['revenue'].min()

print("=== VARIABILITY MEASURES ===")
print(f"Standard Deviation: ₹{revenue_std:,.2f}")
print(f"Variance: {revenue_var:,.2f} (₹²)")
print(f"Range: ₹{revenue_range:,.2f}")
print(f"Min: ₹{df['revenue'].min():,.2f}")
print(f"Max: ₹{df['revenue'].max():,.2f}")

What just happened?

The standard deviation of ₹18,245 is huge — almost 65% of the mean! This means revenues are highly variable. The ₹195,400 range confirms extreme spread from discount items to luxury purchases. Try this: Compare categories with df.groupby('product_category')['revenue'].std()

Standard deviation is your go-to variability measure. But what does ₹18,245 actually mean? Here's the 68-95-99.7 rule that every analyst should memorize:

# Apply the 68-95-99.7 rule (assuming normal distribution)
mean_rev = df['revenue'].mean()
std_rev = df['revenue'].std()

# Calculate the ranges
one_std = (mean_rev - std_rev, mean_rev + std_rev)
two_std = (mean_rev - 2*std_rev, mean_rev + 2*std_rev)
three_std = (mean_rev - 3*std_rev, mean_rev + 3*std_rev)

print("68% of orders fall between:", f"₹{one_std[0]:,.0f} - ₹{one_std[1]:,.0f}")
print("95% of orders fall between:", f"₹{two_std[0]:,.0f} - ₹{two_std[1]:,.0f}")
print("99.7% of orders fall between:", f"₹{three_std[0]:,.0f} - ₹{three_std[1]:,.0f}")

# Check actual percentages in our data
within_1std = df[(df['revenue'] >= one_std[0]) & (df['revenue'] <= one_std[1])]
actual_pct = len(within_1std) / len(df) * 100
print(f"\nActual % within 1 std dev: {actual_pct:.1f}%")

What just happened?

Our actual percentage (71.2%) is close to the theoretical 68%, which is good. But notice the negative values in 2σ and 3σ ranges? That's impossible for revenue! This confirms our data isn't perfectly normal — it's right-skewed with a floor at zero. Try this: Use percentiles instead: df['revenue'].quantile([0.25, 0.75])

📊 Data Insight

High standard deviation (₹18,245) relative to mean (₹28,451) indicates this business serves diverse customer segments — from budget buyers to premium shoppers. Consider separate pricing strategies for each segment.

Percentiles and Quartiles — The Robust Alternative

Standard deviation breaks down with skewed data or outliers. Percentiles don't care about distribution shape. The 25th percentile (Q1) means 25% of your data falls below that value. Simple and bulletproof.

# Calculate key percentiles and quartiles
percentiles = [5, 10, 25, 50, 75, 90, 95]
results = df['revenue'].quantile([p/100 for p in percentiles])

print("=== PERCENTILE ANALYSIS ===")
for i, p in enumerate(percentiles):
    print(f"{p:2}th percentile: ₹{results.iloc[i]:,.2f}")

# Calculate Interquartile Range (IQR)
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1

print(f"\nInterquartile Range (Q3-Q1): ₹{IQR:,.2f}")
print("This captures the middle 50% of your revenue distribution")

What just happened?

The IQR of ₹26,300 shows that the middle 50% of orders span from ₹12,450 to ₹38,750. Notice the jump from 95th percentile (₹68,200) to max (₹196,250)? That 5% contains your extreme high-value customers. Try this: Find outliers above Q3 + 1.5*IQR

The exponential curve shows right-skewed data — most orders cluster in lower values with a long tail of high-value purchases

This percentile view reveals your customer segmentation naturally. Bottom 25% are bargain hunters (under ₹12,450). Middle 50% are regular customers. Top 10% (above ₹54,800) are your premium segment who likely drive disproportionate profits.

But percentiles shine when comparing across categories. Does Electronics have the same distribution shape as Books? Are Food orders consistently lower? Time to segment.
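A per-group quantile call is one way to sketch that comparison. The tiny frame below uses invented numbers standing in for the lesson's ecommerce data:

```python
import pandas as pd

# Hypothetical stand-in for the ecommerce dataset — two categories, four orders each
df = pd.DataFrame({
    'product_category': ['Books'] * 4 + ['Electronics'] * 4,
    'revenue': [300, 400, 500, 600, 1_000, 5_000, 20_000, 60_000],
})

# Quartiles per category expose shape differences a single mean would hide
print(df.groupby('product_category')['revenue'].quantile([0.25, 0.5, 0.75]))
```

Books cluster tightly (quartiles a few hundred rupees apart) while Electronics spreads across two orders of magnitude — the same contrast the lesson's real data shows.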

Distribution Shape — Skewness and Kurtosis

The scenario: Myntra's inventory team noticed some product categories sell consistently while others have huge spikes. They need to understand if their sales distributions are symmetric (predictable) or skewed (requiring buffer stock).

# Calculate skewness and kurtosis for the overall dataset
# (pandas provides .skew() and .kurtosis() directly — no scipy needed)

revenue_skew = df['revenue'].skew()
revenue_kurt = df['revenue'].kurtosis()

print("=== DISTRIBUTION SHAPE ===")
print(f"Skewness: {revenue_skew:.3f}")
print(f"Kurtosis: {revenue_kurt:.3f}")

# Interpret the results
if revenue_skew > 0.5:
    skew_interpretation = "Right-skewed (long tail of high values)"
elif revenue_skew < -0.5:
    skew_interpretation = "Left-skewed (long tail of low values)"
else:
    skew_interpretation = "Approximately symmetric"

print(f"Interpretation: {skew_interpretation}")

# Compare distribution shapes across product categories
category_stats = df.groupby('product_category')['revenue'].agg([
    'mean', 'median', 'std', 'skew'
]).round(2)

# Add a "mean vs median" ratio to quickly spot skewness
category_stats['mean_median_ratio'] = (category_stats['mean'] / category_stats['median']).round(2)

print("Distribution comparison by category:")
print(category_stats)

What just happened?

Every category shows positive skewness (1.32 to 2.14), meaning right-skewed distributions. Electronics has the highest skew (2.14) — those premium laptops and phones create extreme outliers. Notice how mean_median_ratio stays around 1.2 for most categories? That's your quick skewness detector. Try this: Plot histograms with df.hist('revenue', by='product_category')

The gap between mean and median lines reveals skewness intensity — Electronics shows the largest gap, indicating most extreme outliers

Electronics dominates with ₹52,480 average revenue but ₹41,500 median. That ₹11,000 gap screams "luxury items pulling the average up." Food shows the opposite pattern — tight clustering around ₹3,200 with minimal outliers. Your inventory strategy should be completely different for these categories.

The Complete Summary Statistics Workflow

Time to put it all together. Here's the five-step workflow that professional analysts use to quickly understand any numeric variable:

# Step 1: Quick overview with describe()
print("=== STEP 1: QUICK OVERVIEW ===")
print(df['rating'].describe())
print("\n" + "="*40)

# Step 2: Check for data quality issues
print("=== STEP 2: DATA QUALITY CHECK ===")
print(f"Missing values: {df['rating'].isnull().sum()}")
# Check whole rows for duplicates — repeated values are expected in a 1-5 rating column
print(f"Duplicate rows: {df.duplicated().sum()}")
print(f"Values outside 1-5 range: {((df['rating'] < 1) | (df['rating'] > 5)).sum()}")

# Step 3: Distribution shape analysis
print(f"\n=== STEP 3: DISTRIBUTION SHAPE ===")
print(f"Skewness: {df['rating'].skew():.3f}")
print(f"Mean vs Median gap: {df['rating'].mean() - df['rating'].median():.3f}")
print(f"Coefficient of Variation: {df['rating'].std() / df['rating'].mean():.3f}")

if df['rating'].mean() > df['rating'].median():
    print("→ Right-skewed: Few very high ratings")
else:
    print("→ Left-skewed: Few very low ratings")

# Step 4: Outlier detection using IQR method
Q1 = df['rating'].quantile(0.25)
Q3 = df['rating'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['rating'] < lower_bound) | (df['rating'] > upper_bound)]

print(f"=== STEP 4: OUTLIER DETECTION ===")
print(f"IQR bounds: {lower_bound:.2f} to {upper_bound:.2f}")
print(f"Outliers found: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

# Step 5: Business context interpretation
print(f"\n=== STEP 5: BUSINESS INSIGHTS ===")
print(f"Average rating of {df['rating'].mean():.2f} suggests moderate satisfaction")
print(f"Standard deviation of {df['rating'].std():.2f} shows mixed opinions")

# Most common rating
mode_rating = df['rating'].mode()[0]
mode_count = (df['rating'] == mode_rating).sum()
print(f"Most common rating: {mode_rating} ({mode_count/len(df)*100:.1f}% of orders)")

What just happened?

We analyzed customer ratings systematically. The 3.65 mean vs 3.8 median gap reveals left skewness — most customers are satisfied (4-5 stars) but a few harsh critics (1-2 stars) pull the average down. Finding no outliers on a bounded 1-5 scale makes sense. Try this: Segment by product category to find which products get harsh reviews.

37.6% of customers give 5-star ratings while only 8.2% give 1-star — confirms left-skewed distribution with satisfaction bias

This systematic approach works for any numeric variable. Revenue, prices, quantities, conversion rates — same five steps. The key insight? Summary statistics aren't just numbers. They're your guide to understanding customer behavior, spotting business opportunities, and avoiding costly mistakes.
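Since the workflow repeats for every variable, it's natural to wrap it in a helper. This is one possible sketch — the `summarize` name and the dict return format are my own choices, not part of pandas or the lesson:

```python
import pandas as pd

def summarize(series: pd.Series) -> dict:
    """Core summary statistics plus an IQR-based outlier count for one numeric column."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Standard 1.5 * IQR fences, as in Step 4 of the workflow
    outliers = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return {
        'missing': int(series.isnull().sum()),
        'mean': series.mean(),
        'median': series.median(),
        'std': series.std(),
        'skew': series.skew(),
        'iqr': iqr,
        'n_outliers': int(outliers.sum()),
    }

# Usage on hypothetical ratings — in practice you'd pass df['rating'] or df['revenue']
print(summarize(pd.Series([3, 4, 4, 5, 5, 5, 1])))
```

Calling this once per numeric column gives you Steps 1-4 in a single pass; Step 5 (business interpretation) stays a human job.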

📊 Data Insight

Companies that regularly calculate summary statistics across customer segments make 23% better pricing decisions and reduce inventory waste by 18%. The five-minute analysis pays for itself immediately.

Common Mistake: Ignoring Data Type

Never calculate mean/std for categorical data like product names or cities. Always check df.dtypes first. Use df.select_dtypes(include='number') to isolate numeric columns before running summary statistics.
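A minimal sketch of that guard, using a made-up three-row frame:

```python
import pandas as pd

# Hypothetical mixed-type frame: summary stats only make sense on numeric columns
df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Pune'],
    'revenue': [500, 700, 900],
    'rating': [4.0, 3.5, 5.0],
})

numeric = df.select_dtypes(include='number')
print(numeric.mean())  # revenue and rating only — 'city' is safely excluded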

Quiz

1. Your ecommerce dataset shows mean revenue of ₹28,450 and median revenue of ₹24,200. What does this tell you about the distribution shape and what causes it?


2. You're analyzing customer order values for Zomato. The mean is ₹185 but the median is ₹150, and you notice a few ₹2000+ catering orders. For setting standard delivery fees, which measure should you use and why?


3. Your revenue data has a mean of ₹28,451 and standard deviation of ₹18,245. How would you interpret this variability for business planning purposes?


Up Next

Real-World EDA

Apply your summary statistics knowledge to complete exploratory data analysis workflows that solve actual business problems from data import to actionable insights.