Data Science
Descriptive Statistics
Calculate key metrics like mean, median, and standard deviation to understand your data patterns before building models.
What Numbers Really Mean
Your dataset has 50,000 rows of customer orders. But what's the story? Descriptive statistics transform raw numbers into business insights. Think of it as your data's health checkup — revealing patterns invisible to the naked eye.
Every analyst at companies like Flipkart or Swiggy starts here. Why does this matter? Because you can't optimize what you don't understand. Descriptive statistics answer the fundamental question: "What happened in my data?"
Central Tendency
Mean, median, mode — where your data clusters
Variability
Standard deviation, range — how spread out values are
Shape
Skewness, kurtosis — distribution patterns
Position
Quartiles, percentiles — relative standings
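All four families are one-liners in pandas. A minimal sketch on a made-up age sample (illustrative values only, not the dataset used below):

```python
import pandas as pd

# Tiny synthetic age sample to illustrate the four metric families
ages = pd.Series([22, 28, 29, 34, 41, 41, 55, 63])

print(ages.mean(), ages.median(), ages.mode()[0])   # central tendency
print(ages.std(), ages.max() - ages.min())          # variability
print(ages.skew(), ages.kurt())                     # shape
print(ages.quantile([0.25, 0.5, 0.75]))             # position
```

Skewness and kurtosis get less airtime than the others, but they're computed the same way: every one of these is a method on a pandas Series.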
Loading Your Data
The scenario: You're a data analyst at Myntra, and the marketing team needs customer age insights for their next campaign. They want to know the typical customer profile within 2 hours.
# Import essential libraries for data analysis
import pandas as pd
import numpy as np
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Quick look at our data structure
print(df.head())
print(f"Dataset shape: {df.shape}")

Output:
   order_id        date  customer_age  gender       city product_category  \
0      1001  2023-01-05            28    Male     Mumbai      Electronics
1      1002  2023-01-05            34  Female      Delhi         Clothing
2      1003  2023-01-06            22    Male  Bangalore             Food
3      1004  2023-01-06            41  Female    Chennai            Books
4      1005  2023-01-07            29    Male       Pune             Home

  product_name  quantity  unit_price  revenue  rating  returned
0   Smartphone         1     15000.0  15000.0     4.2     False
1      T-Shirt         2       800.0   1600.0     4.5     False
2        Pizza         1       450.0    450.0     4.1     False
3        Novel         3       350.0   1050.0     3.9     False
4     Lamp Set         1      2500.0   2500.0     4.7     False

Dataset shape: (50000, 12)
What just happened?
We loaded 50,000 customer orders spanning 12 columns. Notice the mix of data types — integers for customer_age, floats for revenue, and strings for city. Try this: Check df.dtypes to see all column types at once.
Central Tendency Metrics
Central tendency tells you where your data "lives." But here's the catch — mean can lie. If you have outliers (like a customer buying a ₹5 lakh laptop), mean gets skewed. That's why we calculate multiple measures.
# Calculate mean (average) customer age
mean_age = df['customer_age'].mean()
print(f"Mean customer age: {mean_age:.1f} years")
# Find median (middle value when sorted)
median_age = df['customer_age'].median()
print(f"Median customer age: {median_age} years")

Output:
Mean customer age: 41.8 years
Median customer age: 42.0 years
What just happened?
Mean age is 41.8 years and median is 42.0 years. They're very close, suggesting our age data is fairly balanced without extreme outliers. Try this: Calculate df['revenue'].mean() vs median() — you'll see bigger differences due to high-value purchases.
# Find mode (most common value) for categorical data
mode_city = df['city'].mode()[0]
print(f"Most common customer city: {mode_city}")
# Mode for numerical data
mode_quantity = df['quantity'].mode()[0]
print(f"Most common order quantity: {mode_quantity} items")

Output:
Most common customer city: Mumbai
Most common order quantity: 1 items
What just happened?
Mumbai dominates our customer base and most orders are for 1 item. Mode works differently than mean/median — it shows frequency, not position. Try this: Use df['city'].value_counts() to see the count for each city.
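That value_counts() suggestion looks like this; a sketch on a toy city column (synthetic values, not the real data):

```python
import pandas as pd

# Toy stand-in for df['city']
city = pd.Series(["Mumbai", "Delhi", "Mumbai", "Pune", "Mumbai", "Delhi"])

counts = city.value_counts()   # frequency of each value, highest first
print(counts)
print(counts.index[0])         # same value that mode() returns
```

Because value_counts() sorts by frequency, its first index label always matches mode() for the most common value, and you get every other city's count for free.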
Variability and Spread
Knowing the average isn't enough. Two datasets can have identical means but completely different patterns. Standard deviation reveals how tightly clustered your data points are around the mean.
# Calculate standard deviation for customer age
std_age = df['customer_age'].std()
print(f"Age standard deviation: {std_age:.1f} years")
# Calculate variance (standard deviation squared)
var_age = df['customer_age'].var()
print(f"Age variance: {var_age:.1f}")
# Find the range (max - min)
age_range = df['customer_age'].max() - df['customer_age'].min()
print(f"Age range: {age_range} years")

Output:
Age standard deviation: 13.7 years
Age variance: 187.7
Age range: 47 years
What just happened?
A standard deviation of 13.7 years measures the typical distance of ages from the mean (41.8). Since the distribution is roughly normal, about 68% of customers fall between ages 28 and 56. The range of 47 years spans from 18 to 65. Try this: Compare revenue's standard deviation — it'll be much higher due to price variations.
📊 Data Insight
Age distribution is fairly normal with standard deviation of 13.7 years, indicating a diverse customer base spanning young adults to middle-aged shoppers without extreme age clustering.
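That 68% figure is the empirical rule for roughly normal data, and you can verify it directly. A sketch on simulated ages (the real CSV isn't reproduced here, so the sample is drawn from a normal distribution with a similar mean and spread):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Simulated ages: normal with mean 42, sd 14, standing in for df['customer_age']
ages = pd.Series(rng.normal(loc=42, scale=14, size=50_000))

mean, std = ages.mean(), ages.std()
within_1sd = ages.between(mean - std, mean + std).mean()
print(f"Share within one standard deviation: {within_1sd:.0%}")  # about 68%
```

Running the same check on a real column is a quick normality sanity test: a share far from 68% hints at skew or heavy tails.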
Quartiles and Percentiles
Quartiles split your data into four equal parts. This is a goldmine for business decisions. The 75th percentile of revenue? That's your premium customer segment. Bottom 25%? Your budget-conscious buyers.
# Calculate quartiles for revenue data
q1 = df['revenue'].quantile(0.25)
q2 = df['revenue'].quantile(0.50) # Same as median
q3 = df['revenue'].quantile(0.75)
print(f"Q1 (25th percentile): ₹{q1:,.0f}")
print(f"Q2 (50th percentile): ₹{q2:,.0f}")
print(f"Q3 (75th percentile): ₹{q3:,.0f}")

Output:
Q1 (25th percentile): ₹1,250
Q2 (50th percentile): ₹3,500
Q3 (75th percentile): ₹8,750
What just happened?
25% of orders are below ₹1,250, 50% below ₹3,500, and 75% below ₹8,750. This reveals your customer segments clearly — budget shoppers, middle-tier, and premium buyers. Try this: Calculate the IQR (Interquartile Range) using q3 - q1 to measure spread without outliers.
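The IQR suggestion in code, on a tiny made-up revenue sample (the numbers here are illustrative, not from the dataset):

```python
import pandas as pd

# Made-up revenue sample standing in for df['revenue']
revenue = pd.Series([500, 900, 1250, 2000, 3500, 5200, 8750, 15000])

q1, q3 = revenue.quantile(0.25), revenue.quantile(0.75)
iqr = q3 - q1                       # spread of the middle 50%, robust to outliers
upper_fence = q3 + 1.5 * iqr        # common rule of thumb for flagging outliers
outliers = revenue[revenue > upper_fence]
print(f"IQR: {iqr:.1f}, outliers: {outliers.tolist()}")
```

The 1.5 × IQR fence is the same rule box plots use to draw their whiskers, so anything this filter catches would show up as a dot beyond the whisker.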
# Calculate specific percentiles for business insights
p90 = df['revenue'].quantile(0.90)
p95 = df['revenue'].quantile(0.95)
p99 = df['revenue'].quantile(0.99)
print(f"90th percentile: ₹{p90:,.0f}")
print(f"95th percentile: ₹{p95:,.0f}")
print(f"99th percentile: ₹{p99:,.0f}")
print(f"Maximum value: ₹{df['revenue'].max():,.0f}")

Output:
90th percentile: ₹18,500
95th percentile: ₹32,750
99th percentile: ₹89,200
Maximum value: ₹185,000
What just happened?
The top 10% of customers spend above ₹18,500, while the top 1% exceed ₹89,200. Notice the jump from 99th percentile to maximum — that ₹185,000 order is likely an outlier. Try this: Filter data for orders above the 99th percentile to identify your VIP customers.
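Filtering above the 99th percentile is a one-liner. A sketch using a lognormal stand-in for revenue, which mimics the right-skewed shape of real order values (synthetic data, not the actual dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic orders: lognormal revenue gives many small orders, a few large ones
df = pd.DataFrame({
    "order_id": np.arange(1, 1001),
    "revenue": rng.lognormal(mean=8, sigma=1, size=1000),
})

threshold = df["revenue"].quantile(0.99)
vips = df[df["revenue"] > threshold]   # the top ~1% of orders
print(f"99th percentile: ₹{threshold:,.0f}, VIP orders: {len(vips)}")
```

With 1,000 orders, filtering above the 99th percentile leaves roughly 10 rows — exactly the short VIP list the marketing team would want.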
The describe() Power Move
Pandas has a secret weapon: describe(). One line gives you count, mean, std, min, quartiles, and max. Honestly, this is underrated — 90% of initial analysis starts here.
# Generate comprehensive statistics for numerical columns
desc_stats = df[['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']].describe()
print(desc_stats)

Output:
       customer_age  quantity  unit_price   revenue   rating
count       50000.0   50000.0     50000.0   50000.0  50000.0
mean           41.8       3.2      2847.5    8952.4      4.1
std            13.7       2.8      4821.3   15847.2      0.7
min            18.0       1.0       250.0     500.0      1.0
25%            30.0       1.0       850.0    1250.0      3.6
50%            42.0       2.0      1750.0    3500.0      4.2
75%            53.0       5.0      3500.0    8750.0      4.8
max            65.0      10.0     35000.0  185000.0      5.0
What just happened?
One command revealed everything: 50,000 complete records, average order ₹8,952, ratings cluster around 4.1/5, and quantity averages 3.2 items. Notice how unit_price variance is huge (std=4821 vs mean=2847) — indicates diverse product mix. Try this: Use df.describe(include='all') to include categorical columns too.
Chart: revenue distribution. Standard deviation exceeds the mean, indicating high revenue variability across customer segments.
The chart reveals a right-skewed distribution. Standard deviation (₹15.85k) is almost double the mean (₹8.95k), meaning you have many small orders and few large ones. This pattern is typical for ecommerce.
For business strategy, focus on the median (₹3.5k) rather than mean when setting inventory levels or marketing budgets. Why? Because 50% of your customers spend below ₹3,500 — that's your core market.
# Compare statistics by category to find patterns
category_stats = df.groupby('product_category')['revenue'].agg([
    'count',   # Number of orders
    'mean',    # Average revenue
    'median',  # Middle value
    'std'      # Standard deviation
]).round(2)
print(category_stats)

Output:
                  count      mean    median       std
product_category
Books              8234   3247.82   2800.00   2134.56
Clothing          11567   5678.45   4200.00   3876.23
Electronics        9823  18756.34  15500.00  12845.67
Food               7891   2134.67   1850.00   1567.89
Home              12485   8945.23   6750.00   5432.10
What just happened?
Electronics dominates with mean ₹18.7k but highest variability (std=12.8k). Food has the lowest variance — predictable pricing. Clothing has most orders (11,567) making it volume-driven. Try this: Create ratio of std/mean to compare relative variability across categories.
📊 Data Insight
Electronics generates 2.1x higher average revenue (₹18.7k) than the next category, but Food orders are most predictable with lowest standard deviation of ₹1.6k.
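That std/mean ratio is the coefficient of variation, a unit-free measure of relative variability. A sketch on made-up per-category samples (illustrative numbers only):

```python
import pandas as pd

# Made-up revenue samples for two categories
df = pd.DataFrame({
    "product_category": ["Food"] * 4 + ["Electronics"] * 4,
    "revenue": [1800, 2000, 2200, 2400, 5000, 12000, 20000, 38000],
})

stats = df.groupby("product_category")["revenue"].agg(["mean", "std"])
stats["cv"] = stats["std"] / stats["mean"]   # relative variability
print(stats.round(2))
```

A higher cv means less predictable order values regardless of a category's price level, which lets you compare Food's ₹1.6k spread against Electronics' ₹12.8k on equal footing.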
Common Mistake
Never rely solely on the mean for skewed data like revenue. Use the median for central tendency and the IQR for spread when you have outliers. Mean revenue might be ₹8,952, but the median of ₹3,500 better represents typical customer behavior.
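You can watch this happen in miniature: add one huge order to a small sample and the mean jumps while the median barely moves (synthetic numbers):

```python
import pandas as pd

orders = pd.Series([800, 1200, 1500, 2000, 3500])
print(orders.mean(), orders.median())   # mean 1800.0, median 1500.0

# One ₹5 lakh order enters the sample
with_outlier = pd.concat([orders, pd.Series([500_000])], ignore_index=True)
print(with_outlier.mean(), with_outlier.median())   # mean explodes, median shifts slightly
```

One order multiplied the mean by more than 40x but moved the median from 1,500 to just 1,750 — that robustness is exactly why the median anchors decisions about skewed revenue.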
Quiz
1. Your ecommerce dataset shows mean revenue of ₹8,952 and median revenue of ₹3,500. Why is median significantly lower than mean?
2. The revenue standard deviation is ₹15,847 while the mean is ₹8,952. What does this relationship tell you about customer behavior?
3. You need to identify customers in the top 25% by revenue for a VIP program. Which pandas method correctly finds this threshold?
Up Next
Probability Basics
Build on your descriptive statistics foundation to understand likelihood, distributions, and uncertainty quantification for predictive modeling.