Statistics · Lesson 21

Descriptive Statistics

Calculate key metrics like mean, median, and standard deviation to understand your data patterns before building models.

1. Import Data
2. Calculate Central Tendency
3. Measure Spread
4. Generate Summary Report

What Numbers Really Mean

Your dataset has 50,000 rows of customer orders. But what's the story? Descriptive statistics transform raw numbers into business insights. Think of it as your data's health checkup — revealing patterns invisible to the naked eye.

Every analyst at companies like Flipkart or Swiggy starts here. Why does this matter? Because you can't optimize what you don't understand. Descriptive statistics answer the fundamental question: "What happened in my data?"

Central Tendency: mean, median, mode — where your data clusters

Variability: standard deviation, range — how spread out values are

Shape: skewness, kurtosis — distribution patterns

Position: quartiles, percentiles — relative standings

Loading Your Data

The scenario: You're a data analyst at Myntra, and the marketing team needs customer age insights for their next campaign. They want to know the typical customer profile within 2 hours.

# Import essential libraries for data analysis
import pandas as pd
import numpy as np

# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')

# Quick look at our data structure
print(df.head())
print(f"Dataset shape: {df.shape}")

What just happened?

We loaded 50,000 customer orders spanning 12 columns. Notice the mix of data types — integers for customer_age, floats for revenue, and strings for city. Try this: Check df.dtypes to see all column types at once.
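The dtypes check suggested above can be sketched like this. The frame here is a tiny hypothetical stand-in, since dataplexa_ecommerce.csv itself isn't shown in the lesson:

```python
import pandas as pd

# Tiny stand-in for the real dataset (hypothetical values)
df = pd.DataFrame({
    "customer_age": [25, 42, 38],          # integers
    "revenue": [499.0, 1250.5, 3500.0],    # floats
    "city": ["Mumbai", "Delhi", "Pune"],   # strings (object dtype)
})

# One line per column, showing its dtype
print(df.dtypes)
```

The same one-liner works unchanged on the full dataset.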

Central Tendency Metrics

Central tendency tells you where your data "lives." But here's the catch — mean can lie. If you have outliers (like a customer buying a ₹5 lakh laptop), mean gets skewed. That's why we calculate multiple measures.

# Calculate mean (average) customer age
mean_age = df['customer_age'].mean()
print(f"Mean customer age: {mean_age:.1f} years")

# Find median (middle value when sorted)
median_age = df['customer_age'].median()
print(f"Median customer age: {median_age} years")

What just happened?

Mean age is 41.8 years and median is 42.0 years. They're very close, suggesting our age data is fairly balanced without extreme outliers. Try this: Calculate df['revenue'].mean() vs median() — you'll see bigger differences due to high-value purchases.
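The mean-versus-median gap is easy to demonstrate with made-up revenue figures (not the actual dataset): one extreme order is enough to drag the mean far from the median.

```python
import pandas as pd

# Hypothetical revenues: four ordinary orders plus one ₹5 lakh outlier
revenue = pd.Series([800, 1200, 1500, 2000, 500000])

print(f"Mean:   ₹{revenue.mean():,.0f}")    # pulled way up by the outlier
print(f"Median: ₹{revenue.median():,.0f}")  # middle value, unaffected
```

The mean lands at ₹1,01,100 while the median stays at ₹1,500 — the same effect you'll see on the real revenue column.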

# Find mode (most common value) for categorical data
mode_city = df['city'].mode()[0]
print(f"Most common customer city: {mode_city}")

# Mode for numerical data
mode_quantity = df['quantity'].mode()[0]
print(f"Most common order quantity: {mode_quantity} items")

What just happened?

Mumbai dominates our customer base and most orders are for 1 item. Mode works differently than mean/median — it shows frequency, not position. Try this: Use df['city'].value_counts() to see the count for each city.
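The value_counts() idea suggested above looks like this on a small hypothetical city column — the mode is simply the label on the first row of the frequency table:

```python
import pandas as pd

# Hypothetical city column
city = pd.Series(["Mumbai", "Delhi", "Mumbai", "Pune", "Mumbai", "Delhi"])

# Frequency table, most common first
print(city.value_counts())

# mode() returns the same top label
print("Mode:", city.mode()[0])
```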

Variability and Spread

Knowing the average isn't enough. Two datasets can have identical means but completely different patterns. Standard deviation reveals how tightly clustered your data points are around the mean.

# Calculate standard deviation for customer age
std_age = df['customer_age'].std()
print(f"Age standard deviation: {std_age:.1f} years")

# Calculate variance (standard deviation squared)
var_age = df['customer_age'].var()
print(f"Age variance: {var_age:.1f}")

# Find the range (max - min)
age_range = df['customer_age'].max() - df['customer_age'].min()
print(f"Age range: {age_range} years")

What just happened?

Standard deviation of 13.7 years means a typical customer falls within ±13.7 years of the mean (41.8). Because the age distribution is roughly normal, about 68% of customers fall between ages 28 and 56. The range of 47 years spans from 18 to 65. Try this: Compare revenue's standard deviation — it'll be much higher due to price variations.
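That "roughly 68% within one standard deviation" rule only holds for approximately normal distributions. You can sanity-check it on simulated ages (not the real data) using the lesson's mean and std:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated ages: normal with the lesson's mean (41.8) and std (13.7)
ages = rng.normal(41.8, 13.7, 50_000)

mean, std = ages.mean(), ages.std()
# Fraction of values falling within one std of the mean
within_one_std = ((ages > mean - std) & (ages < mean + std)).mean()
print(f"Share within ±1 std: {within_one_std:.1%}")  # close to 68%
```

For a heavily skewed column like revenue, this share can be very different, which is why the 68% shortcut shouldn't be applied there.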

📊 Data Insight

Age distribution is fairly normal with standard deviation of 13.7 years, indicating a diverse customer base spanning young adults to middle-aged shoppers without extreme age clustering.

Quartiles and Percentiles

Quartiles split your data into four equal parts. This is a goldmine for business decisions. The 75th percentile of revenue? That's your premium customer segment. The bottom 25%? Your budget-conscious buyers.

# Calculate quartiles for revenue data
q1 = df['revenue'].quantile(0.25)
q2 = df['revenue'].quantile(0.50)  # Same as median
q3 = df['revenue'].quantile(0.75)

print(f"Q1 (25th percentile): ₹{q1:.0f}")
print(f"Q2 (50th percentile): ₹{q2:.0f}")
print(f"Q3 (75th percentile): ₹{q3:.0f}")

What just happened?

25% of orders are below ₹1,250, 50% below ₹3,500, and 75% below ₹8,750. This reveals your customer segments clearly — budget shoppers, middle-tier, and premium buyers. Try this: Calculate the IQR (Interquartile Range) using q3 - q1 to measure spread without outliers.
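The IQR suggested above is just Q3 minus Q1. A minimal sketch on hypothetical revenue values:

```python
import pandas as pd

# Hypothetical revenues (not the actual dataset)
revenue = pd.Series([500, 1250, 2000, 3500, 5000, 8750, 20000])

q1 = revenue.quantile(0.25)
q3 = revenue.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50%, resistant to outliers

print(f"IQR: ₹{iqr:.0f}")
```

Unlike the full range, the IQR ignores the extremes entirely, so a single huge order can't inflate it.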

# Calculate specific percentiles for business insights
p90 = df['revenue'].quantile(0.90)
p95 = df['revenue'].quantile(0.95)
p99 = df['revenue'].quantile(0.99)

print(f"90th percentile: ₹{p90:.0f}")
print(f"95th percentile: ₹{p95:.0f}")  
print(f"99th percentile: ₹{p99:.0f}")
print(f"Maximum value: ₹{df['revenue'].max():.0f}")

What just happened?

The top 10% of customers spend above ₹18,500, while the top 1% exceed ₹89,200. Notice the jump from 99th percentile to maximum — that ₹185,000 order is likely an outlier. Try this: Filter data for orders above the 99th percentile to identify your VIP customers.
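Filtering above the 99th percentile, as suggested, is a one-line boolean mask. This sketch uses hypothetical right-skewed revenues (a lognormal sample, a common stand-in for ecommerce order values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed revenues: many small orders, few large ones
df = pd.DataFrame({"revenue": rng.lognormal(8, 1, 10_000)})

# Threshold, then keep only rows above it
p99 = df["revenue"].quantile(0.99)
vips = df[df["revenue"] > p99]

print(f"99th percentile threshold: ₹{p99:,.0f}")
print(f"VIP orders above it: {len(vips)}")  # roughly 1% of 10,000
```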

The describe() Power Move

Pandas has a secret weapon: describe(). One line gives you count, mean, std, min, quartiles, and max. Honestly, this is underrated — 90% of initial analysis starts here.

# Generate comprehensive statistics for numerical columns
desc_stats = df[['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']].describe()
print(desc_stats)

What just happened?

One command revealed everything: 50,000 complete records, average order ₹8,952, ratings clustering around 4.1/5, and quantity averaging 3.2 items. Notice how widely unit_price varies (std ₹4,821 vs mean ₹2,847) — a sign of a diverse product mix. Try this: Use df.describe(include='all') to include categorical columns too.
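With include='all', describe() adds unique, top (the mode), and freq rows for categorical columns. A minimal sketch on a hypothetical mixed-type frame:

```python
import pandas as pd

# Hypothetical mixed-type frame (numeric + categorical)
df = pd.DataFrame({
    "revenue": [500.0, 1200.0, 3500.0, 900.0],
    "city": ["Mumbai", "Delhi", "Mumbai", "Pune"],
})

# Numeric columns get mean/std/quartiles; categorical get unique/top/freq
summary = df.describe(include="all")
print(summary)
```

Rows that don't apply to a column (e.g. mean for city) are filled with NaN.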

Chart: Standard deviation exceeds the mean, indicating high revenue variability across customer segments.

The chart reveals a right-skewed distribution. Standard deviation (₹15.85k) is almost double the mean (₹8.95k), meaning you have many small orders and few large ones. This pattern is typical for ecommerce.

For business strategy, focus on the median (₹3.5k) rather than mean when setting inventory levels or marketing budgets. Why? Because 50% of your customers spend below ₹3,500 — that's your core market.

# Compare statistics by category to find patterns
category_stats = df.groupby('product_category')['revenue'].agg([
    'count',    # Number of orders
    'mean',     # Average revenue  
    'median',   # Middle value
    'std'       # Standard deviation
]).round(2)

print(category_stats)

What just happened?

Electronics dominates with a mean of ₹18.7k but the highest variability (std ₹12.8k). Food has the lowest variance — predictable pricing. Clothing has the most orders (11,567), making it volume-driven. Try this: Compute the ratio std/mean (the coefficient of variation) to compare relative variability across categories.
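The std/mean ratio (coefficient of variation) can be bolted onto the same groupby. A sketch on hypothetical category data:

```python
import pandas as pd

# Hypothetical per-category revenues (not the actual dataset)
df = pd.DataFrame({
    "product_category": ["Electronics"] * 4 + ["Food"] * 4,
    "revenue": [5000, 15000, 25000, 35000, 200, 220, 240, 260],
})

stats = df.groupby("product_category")["revenue"].agg(["mean", "std"])
stats["cv"] = stats["std"] / stats["mean"]  # relative variability

print(stats.round(3))
```

Because the CV is unitless, it lets you compare a high-revenue category like Electronics against a low-revenue one like Food on equal footing.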

📊 Data Insight

Electronics generates 2.1x higher average revenue (₹18.7k) than the next category, but Food orders are most predictable with lowest standard deviation of ₹1.6k.

Common Mistake

Never rely solely on mean for skewed data like revenue. Use median for central tendency and IQR for spread when you have outliers. Mean revenue might be ₹8,952, but median ₹3,500 better represents typical customer behavior.
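The robust summary recommended above — median for centre, IQR for spread — looks like this on hypothetical skewed revenues:

```python
import pandas as pd

# Hypothetical skewed revenues: many small orders, one huge one
revenue = pd.Series([900, 1200, 2500, 3500, 4000, 8000, 95000])

# Outlier-resistant summary: median + IQR
centre = revenue.median()
spread = revenue.quantile(0.75) - revenue.quantile(0.25)

print(f"Median: ₹{centre:.0f}, IQR: ₹{spread:.0f}")
print(f"Mean for comparison: ₹{revenue.mean():.0f}")  # inflated by the ₹95k order
```

The median and IQR barely move if you drop or double the ₹95k order, while the mean swings wildly — exactly why they're the safer pair for skewed data.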

Quiz

1. Your ecommerce dataset shows mean revenue of ₹8,952 and median revenue of ₹3,500. Why is median significantly lower than mean?


2. The revenue standard deviation is ₹15,847 while the mean is ₹8,952. What does this relationship tell you about customer behavior?


3. You need to identify customers in the top 25% by revenue for a VIP program. Which pandas method correctly finds this threshold?


Up Next

Probability Basics

Build on your descriptive statistics foundation to understand likelihood, distributions, and uncertainty quantification for predictive modeling.