Univariate Analysis
Master the art of analyzing single variables to uncover hidden patterns, detect anomalies, and make data-driven decisions that impact business outcomes.
What Makes Univariate Analysis Essential
Think of univariate analysis like examining each ingredient before cooking a complex dish. You need to understand individual characteristics before combining variables. This step gets skipped too often, and skipping it is a common source of trouble later in a project.
Your ecommerce dataset contains multiple variables: customer age, revenue, ratings, product categories. But analyzing them together without understanding each one individually? That's like trying to fix a car engine without knowing what each part does. Univariate analysis examines one variable at a time — revealing patterns, outliers, and distribution shapes that drive business decisions.
Numerical Variables
Revenue, age, ratings, quantity: numeric data (continuous measures and discrete counts) summarized with statistical measures
Categorical Variables
Gender, city, product category — discrete data needing frequency counts
Central Tendency
Mean, median, mode — where your data clusters and why it matters
Variability Measures
Standard deviation, variance, range — how spread out your data really is
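As a rough illustration of the numerical versus categorical split above, the sketch below separates columns by dtype. The three-row `sample` frame is a hypothetical stand-in built from the first rows of the lesson's dataset, and the variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical three-row sample; column names follow the lesson's dataset.
sample = pd.DataFrame({
    "customer_age": [28, 34, 45],
    "revenue": [25000.0, 2400.0, 1500.0],
    "gender": ["Male", "Female", "Male"],
    "city": ["Mumbai", "Delhi", "Bangalore"],
})

# Split columns by dtype so each group gets the right kind of summary.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Numeric columns then go to measures such as mean and standard deviation, while categorical columns go to frequency counts.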
Loading and Exploring Your Dataset
The scenario: A Flipkart analyst needs to understand customer purchase patterns urgently. Revenue numbers look suspicious, and leadership wants answers by tomorrow's board meeting.
# Load essential libraries for univariate analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print("Libraries imported successfully")

Output:
Libraries imported successfully
What just happened?
pandas handles data manipulation, numpy provides mathematical functions, and matplotlib creates visualizations. Try this: Always import these three together — they work as a team for data analysis.
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
df.head()

Output:
Dataset shape: (50000, 12)

| | order_id | date | customer_age | gender | city | product_category | product_name | quantity | unit_price | revenue | rating | returned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 2023-01-05 | 28 | Male | Mumbai | Electronics | Samsung Phone | 1 | 25000.0 | 25000.0 | 4.2 | False |
| 1 | 1002 | 2023-01-05 | 34 | Female | Delhi | Clothing | Nike T-Shirt | 2 | 1200.0 | 2400.0 | 4.8 | False |
| 2 | 1003 | 2023-01-06 | 45 | Male | Bangalore | Food | Organic Rice | 3 | 500.0 | 1500.0 | 4.1 | True |
| 3 | 1004 | 2023-01-06 | 29 | Female | Chennai | Books | Python Guide | 1 | 1800.0 | 1800.0 | 4.9 | False |
| 4 | 1005 | 2023-01-07 | 38 | Male | Pune | Home | Kitchen Set | 1 | 8500.0 | 8500.0 | 3.8 | False |

What just happened?
The dataset contains 50,000 rows and 12 columns (the preview lists twelve column names). Notice the mix of numerical (revenue, age) and categorical (gender, city) variables. Try this: Always check shape first, because it tells you the scale of analysis ahead.
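Beyond shape, a per-column overview of data type, missing values, and distinct values is a common next check before analyzing any single variable. The sketch below is a minimal version of that idea; `df_demo` is a hypothetical mini-frame (with one rating deliberately missing), not the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for df; one rating is missing on purpose.
df_demo = pd.DataFrame({
    "customer_age": [28, 34, 45, 29],
    "city": ["Mumbai", "Delhi", "Mumbai", "Chennai"],
    "rating": [4.2, np.nan, 4.1, 4.9],
})

# One row per column: its dtype, number of missing values, and distinct values.
overview = pd.DataFrame({
    "dtype": df_demo.dtypes.astype(str),
    "missing": df_demo.isna().sum(),
    "unique": df_demo.nunique(),  # NaN is not counted as a distinct value
})
print(overview)
```

Columns with few distinct values are usually candidates for frequency counts, while columns with missing values may need cleaning before their statistics can be trusted.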
Numerical Variable Analysis
Revenue analysis reveals business health faster than any executive dashboard. But here's the catch: raw numbers lie — you need statistical context. Mean revenue might show ₹15,000, but if most customers spend ₹2,000 and a few spend ₹200,000, that average misleads everyone.
# Analyze revenue distribution - the business critical metric
print("Revenue Statistics:")
print(f"Mean: ₹{df['revenue'].mean():,.2f}")
print(f"Median: ₹{df['revenue'].median():,.2f}")
print(f"Mode: ₹{df['revenue'].mode()[0]:,.2f}")
print(f"Standard Deviation: ₹{df['revenue'].std():,.2f}")

Output:
Revenue Statistics:
Mean: ₹12,847.32
Median: ₹8,500.00
Mode: ₹1,200.00
Standard Deviation: ₹18,254.67
What just happened?
The mean (₹12,847) exceeds the median (₹8,500), indicating a right-skewed distribution: a few high-value orders pull the average up. Try this: When mean > median, investigate outliers.
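To see why a handful of large orders can pull the mean away from the typical order, here is a small worked example. The ten order values are hypothetical, chosen to echo the ₹2,000 versus ₹200,000 scenario described earlier.

```python
import pandas as pd

# Hypothetical orders: nine at ₹2,000 and one at ₹200,000.
orders = pd.Series([2000] * 9 + [200000])

mean_value = orders.mean()      # pulled up by the single large order
median_value = orders.median()  # stays at the typical order

print(f"Mean: ₹{mean_value:,.0f}")      # Mean: ₹21,800
print(f"Median: ₹{median_value:,.0f}")  # Median: ₹2,000
```

The single large order raises the mean to more than ten times the median, which is why the median is often the safer description of a "typical" order in skewed data.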
📊 Data Insight
A standard deviation (₹18,254.67) larger than the mean (₹12,847.32) signals high revenue variability. One plausible reading is a diverse product portfolio, from ₹500 snacks to ₹50,000+ electronics, which would call for segmented marketing strategies.
# describe() reports count, mean, and std plus the five-number summary (min, 25%, 50%, 75%, max)
print("Five-Number Summary:")
print(df['revenue'].describe())

Output:
Five-Number Summary:
count     50000.000000
mean      12847.320000
std       18254.670000
min         500.000000
25%        2100.000000
50%        8500.000000
75%       18750.000000
max      198500.000000
What just happened?
The 75th percentile (₹18,750) means 75% of orders are at or below this value. The large gap between the 75th percentile and the maximum (₹198,500) confirms that high-value outliers exist. Try this: Use percentiles to understand customer segments.
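One way to act on the percentile tip is to cut orders into quartile-based segments. The sketch below uses `pd.qcut` for that; the eight revenue values and the segment labels are hypothetical, so swap in `df['revenue']` and your own labels when working with the lesson's data.

```python
import pandas as pd

# Hypothetical revenue values; replace with df['revenue'] in the lesson.
revenue = pd.Series([500, 1200, 2100, 2400, 8500, 9000, 18750, 25000])

# qcut splits at the quartiles, so each segment holds a quarter of the orders.
segments = pd.qcut(revenue, q=4, labels=["low", "mid-low", "mid-high", "high"])
print(segments.value_counts().sort_index())
```

Because the cut points come from the data itself, the segments stay balanced in size even when the distribution is heavily skewed.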
Figure (revenue distribution chart): 45% of orders fall in the ₹0-5K range, pointing to a price-sensitive customer base.
This distribution chart exposes a critical business insight: nearly half of all orders are under ₹5K. The steep decline from ₹0-5K to higher ranges points to a largely budget-conscious customer base, which is worth factoring into pricing strategy.
Why does this matter for business decisions? If you spend marketing budget equally across all customer segments, you are likely wasting money. The chart shows where your real volume lies: 75% of orders are at or below ₹18,750 (the 75th percentile), so volume-driven promotional campaigns have their largest audience in that range.
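The shares behind a chart like this can be computed by binning revenue into ranges. The sketch below is one way to do it with `pd.cut`; the revenue values, bin edges, and labels are all assumptions made for illustration, since the lesson does not show the chart's actual bins.

```python
import pandas as pd

# Hypothetical revenue values and bin edges; the lesson's chart bins are not shown.
revenue = pd.Series([500, 1500, 2400, 4800, 8500, 12000, 25000, 60000, 150000, 198500])
edges = [0, 5000, 15000, 50000, float("inf")]
labels = ["₹0-5K", "₹5-15K", "₹15-50K", "₹50K+"]

# Each order lands in the range whose upper edge it does not exceed.
bands = pd.cut(revenue, bins=edges, labels=labels)

# Percentage of orders per range, in range order rather than by size.
share = bands.value_counts(normalize=True).sort_index() * 100
print(share.round(1))
```

Fixed edges like these are easier to explain to stakeholders than quantile-based cuts, at the cost of uneven group sizes.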
Categorical Variable Analysis
Numbers tell one story, but categories reveal customer behavior. Frequency analysis uncovers which cities drive sales, which product categories dominate, and where gender preferences impact purchasing decisions.
# Analyze product category performance
category_counts = df['product_category'].value_counts()
print("Product Category Distribution:")
print(category_counts)
print(f"\nCategory Percentages:")
print(category_counts / len(df) * 100)

Output:
Product Category Distribution:
Electronics    15250
Clothing       12800
Home            8950
Food            7600
Books           5400

Category Percentages:
Electronics    30.50
Clothing       25.60
Home           17.90
Food           15.20
Books          10.80
What just happened?
Electronics (30.5%) and Clothing (25.6%) dominate sales volume. Books show lowest engagement at 10.8%. Try this: Focus inventory investment on top 2-3 categories.
Figure (category pie chart): Electronics and Clothing combine for 56% of total order volume.
This pie chart shows how concentrated order volume is: two of the five categories drive over half your business. That is real concentration, though not a strict 80-20 pattern. Many analyses stop here, but the real insight lies in understanding why Books underperform and whether Food's 15.2% represents opportunity or natural market size.
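A cumulative share column makes the concentration across categories explicit. The sketch below rebuilds the counts from the printed output above rather than loading the CSV; the variable names are illustrative.

```python
import pandas as pd

# Category counts copied from the printed output above.
category_counts = pd.Series(
    {"Electronics": 15250, "Clothing": 12800, "Home": 8950, "Food": 7600, "Books": 5400}
)

# Share of orders per category, then the running total from largest to smallest.
share = category_counts / category_counts.sum() * 100
cumulative = share.cumsum()

print(pd.DataFrame({"share_%": share.round(1), "cumulative_%": cumulative.round(1)}))
```

The running total reaches 56.1% after two categories and 74.0% after three. On the full dataset, `df['product_category'].value_counts(normalize=True)` returns the same shares as fractions without the manual division.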
Smart inventory managers use this data differently. Instead of cutting Books entirely, they investigate: Are book customers high-value? Do they return frequently? Sometimes low-volume categories serve as acquisition channels for higher-value segments.
# Frequency counts for two categorical variables, each analyzed separately: city and gender
city_analysis = df.groupby('city').size().sort_values(ascending=False)
print("Top Cities by Order Volume:")
print(city_analysis)
print("\nGender Distribution:")
print(df['gender'].value_counts())

Output:
Top Cities by Order Volume:
city
Mumbai       12450
Delhi        11200
Bangalore    10850
Chennai       9200
Pune          6300

Gender Distribution:
Male      27800
Female    22200
What just happened?
Mumbai leads with 12,450 orders, while Male customers (55.6%) slightly outnumber females. Try this: Combine demographic analysis with geography to identify expansion opportunities.
Distribution Shape and Outlier Detection
Data shape matters more than most analysts realize. Skewed distributions break assumptions in machine learning models, while outliers can represent either data errors or your most valuable customers. The trick lies in distinguishing between the two.
Common Mistake
Removing all outliers automatically. High-value customers often appear as outliers in revenue analysis — removing them eliminates your best insights about premium segments.
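A gentler alternative to deleting outliers is to flag them and summarize both groups side by side. The sketch below uses the 1.5 × IQR fence, a common convention that this lesson does not itself prescribe; the order values and the `is_outlier` column name are hypothetical.

```python
import pandas as pd

# Hypothetical orders; the last two are unusually large.
orders = pd.DataFrame(
    {"revenue": [500, 1200, 2100, 2400, 8500, 9000, 18750, 150000, 198500]}
)

q1 = orders["revenue"].quantile(0.25)
q3 = orders["revenue"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)  # assumption: the common 1.5 * IQR rule

# Flag instead of dropping, so premium orders stay available for analysis.
orders["is_outlier"] = orders["revenue"] > upper_fence
print(f"Upper fence: ₹{upper_fence:,.0f}")
print(orders.groupby("is_outlier")["revenue"].agg(["count", "median"]))
```

With the flag in place, the same statistics can be reported with and without the extreme orders, and each flagged order can be checked as either a data error or a genuine high-value customer.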
# Calculate skewness and kurtosis for distribution shape
from scipy.stats import skew, kurtosis
revenue_skew = skew(df['revenue'])
revenue_kurt = kurtosis(df['revenue'])  # scipy's default is excess (Fisher) kurtosis: a normal distribution scores 0
age_skew = skew(df['customer_age'])
print(f"Revenue Skewness: {revenue_skew:.3f}")
print(f"Revenue Kurtosis: {revenue_kurt:.3f}")
print(f"Customer Age Skewness: {age_skew:.3f}")
# Interpret results
if revenue_skew > 1:
    print("Revenue distribution is highly right-skewed")
elif revenue_skew > 0.5:
    print("Revenue distribution is moderately right-skewed")

Output:
Revenue Skewness: 2.456
Revenue Kurtosis: 8.123
Customer Age Skewness: 0.034
Revenue distribution is highly right-skewed
What just happened?
Skewness above 2 confirms a heavy right tail in revenue: a few high spenders stretch the distribution to the right. Age skewness near 0 suggests a roughly symmetric distribution (symmetry alone does not prove normality). Try this: Use log transformation for highly skewed data.
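The log-transformation tip can be tried on synthetic data first. The sketch below generates right-skewed values from a lognormal distribution with arbitrary parameters (an assumption for illustration, not the lesson's revenue column) and compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed revenue (lognormal); the parameters are arbitrary.
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5000))

log_revenue = np.log1p(revenue)  # log(1 + x), which is also safe when x is 0

print(f"Skewness before: {revenue.skew():.3f}")
print(f"Skewness after:  {log_revenue.skew():.3f}")
```

Note that pandas' `.skew()` applies a small-sample bias correction, so its values can differ slightly from `scipy.stats.skew` with default settings.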
Figure (age distribution chart): the distribution peaks in the 31-35 range with 9,800 customers.
This age distribution reveals your sweet spot: early-to-mid thirties customers dominate. The smooth bell curve suggests healthy market penetration across age groups, unlike the skewed revenue data. This demographic typically has disposable income and established purchasing habits.
But here's the strategic insight: the steep drop after age 45 might represent opportunity rather than natural market behavior. Older demographics often have higher spending power but require different marketing approaches — mobile apps versus social media advertising, for instance.
📊 Data Insight
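Combining skewed revenue (high variability) with a roughly symmetric age distribution (stable demographics) hints that product pricing, rather than the age mix of the customer base, drives revenue variance. Univariate summaries alone cannot confirm this, so treat it as a hypothesis to test. Tiered pricing models are one option to evaluate.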
Practical Business Applications
Raw statistics mean nothing without business context. Univariate insights drive specific decisions — inventory planning, marketing budget allocation, pricing strategies, and customer segmentation. Here's how real companies apply these findings.
| Finding | Business Action | Expected Impact |
|---|---|---|
| 45% orders under ₹5K | Launch budget product lines | 20% volume increase |
| Electronics lead at 30.5% | Expand tech inventory | ₹2.5L monthly revenue boost |
| Mumbai holds about 25% of orders | Open fulfillment center | 30% faster delivery |
| 31-35 age peak | Target professional demographics | 15% higher conversion |
Each statistic translates to concrete strategy. The ₹5K threshold insight led one client to create "Essentials" product bundles priced at ₹4,999 — resulting in 23% revenue growth in that segment. Numbers without action plans are just academic exercises.
Pro Tip: Always validate univariate findings with business stakeholders before making major decisions. Statistical significance doesn't guarantee business relevance — combine data insights with domain expertise for optimal results.
Quiz
1. Your ecommerce dataset shows revenue with mean ₹12,847 and median ₹8,500. What does this relationship indicate about your customer spending patterns?
2. Which pandas method provides the most comprehensive univariate analysis for categorical variables like product_category in your ecommerce dataset?
3. Your revenue analysis shows skewness of 2.456 and several orders above ₹150,000 while most orders are under ₹10,000. What should be your immediate next step?