Data Science Lesson 11 – Univariate Analysis | Dataplexa

Univariate Analysis

Master the art of analyzing single variables to uncover hidden patterns, detect anomalies, and make data-driven decisions that impact business outcomes.

Data Collection — Load your ecommerce dataset

Descriptive Stats — Calculate mean, median, mode, variance

Distribution Analysis — Understand data shape and spread

Visual Insights — Create histograms, box plots, bar charts

What Makes Univariate Analysis Essential

Think of univariate analysis like examining each ingredient before cooking a complex dish. You need to understand individual characteristics before combining variables. Honestly, this step gets skipped too often — and that's where projects fail.

Your ecommerce dataset contains multiple variables: customer age, revenue, ratings, product categories. But analyzing them together without understanding each one individually? That's like trying to fix a car engine without knowing what each part does. Univariate analysis examines one variable at a time — revealing patterns, outliers, and distribution shapes that drive business decisions.

Numerical Variables

Revenue, age, ratings, quantity: numeric data (continuous measurements or counts) requiring statistical measures

Categorical Variables

Gender, city, product category — discrete data needing frequency counts

Central Tendency

Mean, median, mode — where your data clusters and why it matters

Variability Measures

Standard deviation, variance, range — how spread out your data really is
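As a rough illustration of the numerical versus categorical split, the sketch below separates columns by dtype. The tiny DataFrame is made up for the example and only borrows a few of the lesson's column names; `select_dtypes` is the pandas call doing the work.

```python
import pandas as pd

# Tiny made-up frame that mirrors a few of the lesson's column names.
sample = pd.DataFrame({
    "customer_age": [28, 34, 45],
    "revenue": [25000.0, 2400.0, 1500.0],
    "gender": ["Male", "Female", "Male"],
    "city": ["Mumbai", "Delhi", "Bangalore"],
})

# Numeric columns get summary statistics; everything else gets frequency counts.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

A dtype check is only a first pass: an ID column such as order_id is stored as a number but behaves like a label, so it is worth reviewing the two lists by hand.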

Loading and Exploring Your Dataset

The scenario: A Flipkart analyst needs to understand customer purchase patterns urgently. Revenue numbers look suspicious, and leadership wants answers by tomorrow's board meeting.

# Load essential libraries for univariate analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

(No output: import statements run silently when they succeed.)

What just happened?

pandas handles data manipulation, numpy provides mathematical functions, and matplotlib creates visualizations. Try this: Always import these three together — they work as a team for data analysis.

# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (50000, 12)

   order_id        date  customer_age  gender       city  product_category   product_name  quantity  unit_price  revenue  rating  returned
0      1001  2023-01-05            28    Male     Mumbai       Electronics  Samsung Phone         1     25000.0  25000.0     4.2     False
1      1002  2023-01-05            34  Female      Delhi          Clothing   Nike T-Shirt         2      1200.0   2400.0     4.8     False
2      1003  2023-01-06            45    Male  Bangalore              Food   Organic Rice         3       500.0   1500.0     4.1      True
3      1004  2023-01-06            29  Female    Chennai             Books   Python Guide         1      1800.0   1800.0     4.9     False
4      1005  2023-01-07            38    Male       Pune              Home    Kitchen Set         1      8500.0   8500.0     3.8     False

What just happened?

The dataset contains 50,000 rows and 12 columns. Notice the mix of numerical (revenue, customer_age) and categorical (gender, city) variables. Try this: Always check shape first; it tells you the scale of the analysis ahead.
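Beyond shape, a few more first-look checks are usually worth running before any statistics. The sketch below is one possible set, not the lesson's own procedure. It uses three rows copied from the preview (with fewer columns) and loads them from an in-memory string so it runs without the CSV file.

```python
import io
import pandas as pd

# Three rows copied from the preview, held in memory so the sketch runs without the CSV file.
csv_text = """order_id,date,customer_age,gender,city,revenue,returned
1001,2023-01-05,28,Male,Mumbai,25000.0,False
1002,2023-01-05,34,Female,Delhi,2400.0,False
1003,2023-01-06,45,Male,Bangalore,1500.0,True
"""
preview = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(preview.shape)                                # (rows, columns)
print(preview.dtypes)                               # numeric, text, date, or boolean per column
print(preview.isna().sum())                         # missing values per column
print(preview.duplicated(subset="order_id").sum())  # repeated order IDs
```

Missing values and duplicate IDs both distort the summary statistics that follow, so catching them here saves rework later.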

Numerical Variable Analysis

Revenue analysis reveals business health faster than any executive dashboard. But here's the catch: raw numbers lie — you need statistical context. Mean revenue might show ₹15,000, but if most customers spend ₹2,000 and a few spend ₹200,000, that average misleads everyone.
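To make the point concrete, here is a small sketch with hypothetical numbers: 93 orders of ₹2,000 and 7 orders of ₹200,000. The counts are chosen only for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical order values: 93 orders of ₹2,000 and 7 orders of ₹200,000.
orders = pd.Series([2_000] * 93 + [200_000] * 7)

print(f"Mean:   ₹{orders.mean():,.0f}")    # pulled up by the 7 large orders
print(f"Median: ₹{orders.median():,.0f}")  # the typical order
```

The mean comes out at ₹15,860 even though 93% of orders are ₹2,000, while the median stays at ₹2,000. That gap is the reason to report both.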

# Analyze revenue distribution - the business critical metric
print("Revenue Statistics:")
print(f"Mean: ₹{df['revenue'].mean():,.2f}")
print(f"Median: ₹{df['revenue'].median():,.2f}")
print(f"Mode: ₹{df['revenue'].mode()[0]:,.2f}")
print(f"Standard Deviation: ₹{df['revenue'].std():,.2f}")

Revenue Statistics:
Mean: ₹12,847.32
Median: ₹8,500.00
Mode: ₹1,200.00
Standard Deviation: ₹18,254.67

What just happened?

The mean (₹12,847) exceeds the median (₹8,500), which points to a right-skewed distribution: a few high-value orders pull the average up. Try this: When the mean sits well above the median, investigate outliers.

📊 Data Insight

The large standard deviation (₹18,255, bigger than the mean itself) reveals high revenue variability. This suggests a diverse product portfolio, from ₹500 snacks to ₹50,000+ electronics, which calls for segmented marketing strategies.

# describe() returns the five-number summary (min, quartiles, max) plus count, mean, and std
print("Five-Number Summary:")
print(df['revenue'].describe())

Five-Number Summary:
count    50000.000000
mean     12847.320000
std      18254.670000
min        500.000000
25%       2100.000000
50%       8500.000000
75%      18750.000000
max     198500.000000

What just happened?

The 75th percentile (₹18,750) shows 75% of orders are below this value. The huge gap between 75% and max (₹198,500) confirms high-value outliers exist. Try this: Use percentiles to understand customer segments.
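One way to put a number on "high-value outlier" is the 1.5 × IQR rule. That rule is a common convention brought in here as an assumption; the lesson itself does not name a threshold. The sketch applies it to the quartiles printed in the summary.

```python
# Quartiles taken from the describe() output in this lesson.
q1, q3 = 2_100.0, 18_750.0

iqr = q3 - q1                 # spread of the middle 50% of orders
lower_fence = q1 - 1.5 * iqr  # values below this count as low outliers
upper_fence = q3 + 1.5 * iqr  # values above this count as high outliers

print(f"IQR: ₹{iqr:,.0f}")
print(f"Upper fence: ₹{upper_fence:,.0f}")
print(f"Lower fence: ₹{lower_fence:,.0f}")
```

With these quartiles the IQR is ₹16,650 and the upper fence is ₹43,725, so the ₹198,500 maximum lies far outside it. The lower fence is negative, which means no order can be a low outlier under this rule because the minimum is ₹500.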

45% of orders fall in the ₹0-5K range, revealing price-sensitive customer base

This distribution chart exposes a critical business insight: nearly half of all orders are small-ticket purchases. The steep decline from the ₹0-5K band to the higher bands suggests a price-sensitive customer base, so pricing strategy deserves a closer look. It is an easy pattern to miss.

Why does this matter for business decisions? If you're spending marketing budget equally across all customer segments, you're likely wasting money. The summary statistics show where the real volume lies: 75% of orders are at or below ₹18,750, so promotional campaigns aimed at that range reach most of your order volume.
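A banded view like the one described can be sketched with `pd.cut`. The order values and the band edges below are hypothetical, picked only to show the mechanics; swap in `df['revenue']` and your own pricing tiers.

```python
import pandas as pd

# Hypothetical order values in rupees, chosen only to illustrate the binning.
revenue = pd.Series([500, 1200, 2400, 4800, 8500, 12000, 18750, 25000, 60000, 198500])

# Band edges are an assumption; pick edges that match your own pricing tiers.
edges = [0, 5_000, 15_000, 50_000, float("inf")]
labels = ["₹0-5K", "₹5K-15K", "₹15K-50K", "₹50K+"]

# Each band includes its upper edge (pd.cut uses right-closed intervals by default).
bands = pd.cut(revenue, bins=edges, labels=labels)

# Share of orders per band, in band order rather than sorted by size.
share = bands.value_counts(normalize=True).sort_index() * 100

print(share.round(1))
```

Because the bands are counted by orders, the result answers "where is the volume?" and not "where is the revenue?". Summing revenue per band would answer the second question.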

Categorical Variable Analysis

Numbers tell one story, but categories reveal customer behavior. Frequency analysis uncovers which cities drive sales, which product categories dominate, and where gender preferences impact purchasing decisions.

# Analyze product category performance
category_counts = df['product_category'].value_counts()
print("Product Category Distribution:")
print(category_counts)
print("\nCategory Percentages:")
print(category_counts / len(df) * 100)

Product Category Distribution:
Electronics    15250
Clothing       12800
Home           8950
Food           7600
Books          5400

Category Percentages:
Electronics    30.50
Clothing       25.60
Home          17.90
Food          15.20
Books         10.80

What just happened?

Electronics (30.5%) and Clothing (25.6%) dominate order volume. Books have the lowest share at 10.8%. Try this: Focus inventory investment on the top 2-3 categories.

Electronics and Clothing combine for 56% of total order volume

This pie chart shows how concentrated orders are: two of the five categories account for over half of them, although that is milder than a classic 80-20 split. But here's where most analysts stop. The real insight lies in understanding why Books underperform and whether Food's 15.2% represents opportunity or natural market size.
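A cumulative share makes the concentration concrete. The sketch below reuses the category counts printed earlier in this section; nothing else is assumed.

```python
import pandas as pd

# Order counts per category, copied from the output in this section.
category_counts = pd.Series(
    {"Electronics": 15250, "Clothing": 12800, "Home": 8950, "Food": 7600, "Books": 5400}
)

# Percentage share of each category, then a running total from largest to smallest.
share = category_counts / category_counts.sum() * 100
cumulative = share.sort_values(ascending=False).cumsum()

print(cumulative.round(1))
```

The running total reaches 56.1% after the top two categories and 74.0% after the top three, which is a quick way to check how concentrated order volume really is.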

Smart inventory managers use this data differently. Instead of cutting Books entirely, they investigate: Are book customers high-value? Do they return frequently? Sometimes low-volume categories serve as acquisition channels for higher-value segments.

# Count orders per city, then per gender (two separate frequency tables)
city_analysis = df.groupby('city').size().sort_values(ascending=False)
print("Top Cities by Order Volume:")
print(city_analysis)
print("\nGender Distribution:")
print(df['gender'].value_counts())

Top Cities by Order Volume:
city
Mumbai       12450
Delhi        11200
Bangalore    10850
Chennai       9200
Pune          6300

Gender Distribution:
Male      27800
Female    22200

What just happened?

Mumbai leads with 12,450 orders, and orders from male customers (55.6%) outnumber those from female customers (44.4%). Try this: Combine demographic analysis with geography to identify expansion opportunities.
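Combining the two counts is a first step beyond univariate analysis. A minimal sketch with `pd.crosstab` follows; the six orders are made up purely to show the shape of the result.

```python
import pandas as pd

# Six made-up orders, only to show the shape of the result.
orders = pd.DataFrame({
    "city":   ["Mumbai", "Mumbai", "Delhi", "Delhi", "Pune", "Mumbai"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
})

# Rows are cities, columns are genders, cells are order counts.
city_by_gender = pd.crosstab(orders["city"], orders["gender"])

# normalize="index" turns each row into within-city shares.
city_gender_share = pd.crosstab(orders["city"], orders["gender"], normalize="index")

print(city_by_gender)
print(city_gender_share.round(2))
```

On the real data the same two calls would take `df['city']` and `df['gender']`. Row shares are usually easier to compare across cities than raw counts, because the cities differ in size.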

Distribution Shape and Outlier Detection

Data shape matters more than most analysts realize. Skewed distributions can violate the assumptions of some statistical and machine learning models, while outliers can represent either data errors or your most valuable customers. The trick lies in distinguishing between the two.

Common Mistake

Removing all outliers automatically. High-value customers often appear as outliers in revenue analysis — removing them eliminates your best insights about premium segments.
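One alternative to deleting outliers is to flag them and keep every row. The sketch below uses the 1.5 × IQR rule as the threshold, which is an assumption for illustration, and hypothetical order values.

```python
import pandas as pd

# Hypothetical orders; the last one is far above the rest.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "revenue": [500, 1200, 1500, 1800, 2400, 8500, 25000, 198500],
})

q1 = orders["revenue"].quantile(0.25)
q3 = orders["revenue"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Keep every row and add a flag, so high-value orders can be reviewed rather than discarded.
orders["is_high_outlier"] = orders["revenue"] > upper_fence

print(orders[orders["is_high_outlier"]])
```

The flagged rows can then be checked one by one: a data-entry error gets corrected, while a genuine premium order stays in the analysis, possibly as its own segment.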

# Calculate skewness and kurtosis for distribution shape
from scipy.stats import skew, kurtosis

revenue_skew = skew(df['revenue'])
revenue_kurt = kurtosis(df['revenue'])  # excess (Fisher) kurtosis: a normal distribution scores 0
age_skew = skew(df['customer_age'])

print(f"Revenue Skewness: {revenue_skew:.3f}")
print(f"Revenue Kurtosis: {revenue_kurt:.3f}")
print(f"Customer Age Skewness: {age_skew:.3f}")

# Interpret results
if revenue_skew > 1:
    print("Revenue distribution is highly right-skewed")
elif revenue_skew > 0.5:
    print("Revenue distribution is moderately right-skewed")

Revenue Skewness: 2.456
Revenue Kurtosis: 8.123
Customer Age Skewness: 0.034
Revenue distribution is highly right-skewed

What just happened?

A skewness above 2 confirms a heavy right tail in revenue: a few high spenders stretch the distribution. Age skewness near 0 suggests a roughly symmetric distribution (symmetry alone does not establish normality). Try this: Use a log transformation for highly skewed data.
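The effect of a log transformation can be checked directly. The sketch below uses hypothetical right-skewed order values and `np.log1p`; it measures skewness with the pandas `.skew()` method, which applies a bias correction and so gives slightly different numbers from the default `scipy.stats.skew`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed order values in rupees.
revenue = pd.Series([500, 800, 1200, 1500, 2400, 8500, 25000, 198500], dtype=float)

# log1p computes log(1 + x), which also handles zero values safely.
log_revenue = np.log1p(revenue)

raw_skew = revenue.skew()
log_skew = log_revenue.skew()

print(f"Skewness before: {raw_skew:.3f}")
print(f"Skewness after log1p: {log_skew:.3f}")
```

The transformed values remain right-skewed in this example, only much less so. Any statistic computed on the log scale has to be converted back (for example with `np.expm1`) before it is reported in rupees.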

Age distribution peaks at 31-35 range with 9,800 customers

This age distribution reveals your sweet spot: early-to-mid thirties customers dominate. The smooth bell curve suggests healthy market penetration across age groups, unlike the skewed revenue data. This demographic typically has disposable income and established purchasing habits.

But here's the strategic insight: the steep drop after age 45 might represent opportunity rather than natural market behavior. Older demographics often have higher spending power but require different marketing approaches — mobile apps versus social media advertising, for instance.

📊 Data Insight

Combining skewed revenue (high variability) with a symmetric age distribution (stable demographics) hints that product pricing, rather than the customer base, drives revenue variance. Consider tiered pricing models.

Practical Business Applications

Raw statistics mean nothing without business context. Univariate insights drive specific decisions — inventory planning, marketing budget allocation, pricing strategies, and customer segmentation. Here's how real companies apply these findings.

Finding	Business Action	Expected Impact
45% orders under ₹5K	Launch budget product lines	20% volume increase
Electronics lead at 30.5%	Expand tech inventory	₹2.5L monthly revenue boost
Mumbai accounts for 25% of orders	Open fulfillment center	30% faster delivery
31-35 age peak	Target professional demographics	15% higher conversion

Each statistic translates to concrete strategy. The ₹5K threshold insight led one client to create "Essentials" product bundles priced at ₹4,999 — resulting in 23% revenue growth in that segment. Numbers without action plans are just academic exercises.

Pro Tip: Always validate univariate findings with business stakeholders before making major decisions. Statistical significance doesn't guarantee business relevance — combine data insights with domain expertise for optimal results.

Up Next

Bivariate Analysis

Discover how variables interact with each other through correlation analysis, scatter plots, and cross-tabulations that reveal hidden business relationships.
