Data Science
EDA Project
Build a complete exploratory data analysis from raw ecommerce data to actionable business insights using multiple visualization techniques.
The EDA Pipeline
Real EDA projects follow a predictable sequence. You start with data inspection, move through quality checks, then dive into univariate analysis, followed by relationships and correlations. The final step transforms insights into business recommendations.
Most analysts skip the inspection phase. Big mistake. Spending 30 minutes understanding your data structure saves hours debugging weird visualizations later. Trust me on this one.
Data Inspection Phase
The scenario: You're analyzing Flipkart's customer behavior data. The product team needs insights on which categories drive the highest revenue and customer satisfaction scores.
# Load essential libraries for data analysis
import pandas as pd
import numpy as np
# Read the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# First look at our data structure
print(f"Dataset shape: {df.shape}")
df.head()Dataset shape: (15000, 11) order_id date customer_age gender city product_category product_name quantity unit_price revenue rating returned 0 1001 2023-01-15 28 Male Mumbai Electronics iPhone 14 1 85000.0 85000.0 4.2 False 1 1002 2023-01-15 34 Female Delhi Clothing Saree Silk 2 3500.0 7000.0 4.8 False 2 1003 2023-01-16 22 Male Bangalore Food Organic Rice 5 450.0 2250.0 4.1 False
What just happened?
We loaded 15,000 records with 11 columns. Notice revenue ranges from ₹2,250 to ₹85,000, and rating varies from 4.1 to 4.8. Try this: Check if any orders have quantity > 1 but unit_price doesn't match total revenue.
# Check data types and missing values
df.info()
# Get summary statistics for numerical columns
print("\nNumerical Summary:")
df.describe()RangeIndex: 15000 entries, 0 to 14999 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 15000 non-null int64 1 date 15000 non-null object 2 customer_age 15000 non-null int64 3 gender 15000 non-null object 4 city 15000 non-null object 5 product_category 15000 non-null object 6 product_name 15000 non-null object 7 quantity 15000 non-null int64 8 unit_price 15000 non-null float64 9 revenue 15000 non-null float64 10 rating 15000 non-null float64 11 returned 15000 non-null bool Numerical Summary: customer_age quantity unit_price revenue rating count 15000.00 15000.00 15000.00 15000.00 15000.00 mean 41.50 3.24 12487.50 35642.80 3.48 std 13.77 2.89 18742.30 52847.20 1.15 min 18.00 1.00 450.00 2250.00 1.00 25% 29.00 1.00 2800.00 8400.00 2.50 50% 42.00 3.00 6500.00 19500.00 3.50 75% 54.00 5.00 18000.00 54000.00 4.40 max 65.00 10.00 95000.00 285000.00 5.00
What just happened?
Perfect — no missing values! Average customer age is 41.5 years, mean revenue per order is ₹35,643. The std of ₹52,847 for revenue suggests high variability. Try this: Calculate the coefficient of variation (std/mean) to quantify this spread.
Category Performance Analysis
Now we dig into business questions. Which product categories generate the most revenue? Are customers satisfied across all categories? Bar charts reveal category dominance while scatter plots show revenue-rating relationships.
# Calculate total revenue by product category
category_revenue = df.groupby('product_category')['revenue'].sum()
# Convert to lakhs for easier reading
category_revenue_lakhs = category_revenue / 100000
# Sort from highest to lowest
category_revenue_lakhs = category_revenue_lakhs.sort_values(ascending=False)
print("Revenue by Category (in Lakhs INR):")
category_revenue_lakhsRevenue by Category (in Lakhs INR): product_category Electronics 1842.5 Clothing 651.8 Home 423.7 Food 298.4 Books 147.2 Name: revenue, dtype: float64
What just happened?
Electronics dominates with ₹1,842.5 lakhs — that's 54% of total revenue! Books generate only ₹147.2 lakhs. groupby('product_category')['revenue'].sum() aggregated 15,000 rows into 5 category totals. Try this: Calculate average order value per category.
Electronics generates 12.5x more revenue than Books category
The revenue gap is massive. Electronics pulls in nearly ₹18.5 crores while Books struggle at ₹1.5 crores. This suggests either higher-priced items in Electronics or significantly more volume. The middle categories (Clothing, Home, Food) cluster between ₹3-7 crores.
Business decision: Invest more marketing budget in Electronics since it's clearly the revenue driver. But don't ignore Books — there might be untapped potential or it could be a strategic loss leader.
Customer Satisfaction Patterns
# Calculate average rating by category
category_ratings = df.groupby('product_category')['rating'].mean()
# Round to 2 decimal places for cleaner display
category_ratings = category_ratings.round(2)
# Sort by rating descending
category_ratings = category_ratings.sort_values(ascending=False)
print("Average Customer Rating by Category:")
category_ratingsAverage Customer Rating by Category: product_category Books 3.92 Food 3.67 Home 3.52 Clothing 3.31 Electronics 3.24 Name: rating, dtype: float64
What just happened?
Plot twist! Books has the highest satisfaction (3.92) while Electronics — our revenue champion — ranks lowest (3.24). .round(2) cleaned up the decimal display. Try this: Calculate rating standard deviation to see consistency.
Inverse relationship: highest revenue category has lowest satisfaction scores
Here's the fascinating insight — revenue and satisfaction move in opposite directions. Books customers are happiest (3.92/5) but contribute least revenue. Electronics customers are least satisfied (3.24/5) but drive most sales.
This pattern suggests price sensitivity or expectation mismatch. Higher-priced Electronics purchases create higher expectations. When products don't deliver, ratings suffer. Books are low-risk purchases with lower expectations.
📊 Data Insight
Electronics: ₹1,842L revenue but 3.24 rating. Books: ₹147L revenue but 3.92 rating. Revenue leaders aren't satisfaction leaders — investigate Electronics quality issues immediately.
Age Demographics & Spending Power
# Create age groups for better analysis
def categorize_age(age):
if age < 25:
return "Gen Z (18-24)"
elif age < 35:
return "Millennials (25-34)"
elif age < 50:
return "Gen X (35-49)"
else:
return "Boomers (50+)"
# Apply age categorization
df['age_group'] = df['customer_age'].apply(categorize_age)
df['age_group'].value_counts()Gen X (35-49) 4125 Boomers (50+) 3952 Millennials (25-34) 3894 Gen Z (18-24) 3029 Name: age_group, dtype: int64
What just happened?
We created 4 age segments from continuous age data. Gen X dominates (4,125 customers) while Gen Z is smallest (3,029). The apply() function ran our custom categorization on all 15,000 rows. Try this: Check if age distribution matches India's population demographics.
# Calculate average spending by age group
age_spending = df.groupby('age_group')['revenue'].agg(['mean', 'count', 'sum'])
# Clean up the column names
age_spending.columns = ['avg_order_value', 'total_orders', 'total_revenue']
# Round average for readability
age_spending['avg_order_value'] = age_spending['avg_order_value'].round(0)
print("Spending Patterns by Age Group:")
age_spendingSpending Patterns by Age Group:
avg_order_value total_orders total_revenue
age_group
Boomers (50+) 41250 3952 163020000.0
Gen X (35-49) 39874 4125 164477500.0
Gen Z (18-24) 28456 3029 86205624.0
Millennials (25-34) 31247 3894 121709718.0What just happened?
Boomers spend highest per order (₹41,250) while Gen Z spends least (₹28,456). The agg(['mean', 'count', 'sum']) calculated three metrics simultaneously. Try this: Calculate spending per customer by dividing total_revenue by total_orders.
Gen X and Boomers contribute 61% of total revenue despite being 54% of customers
The doughnut reveals spending power concentration. Gen X and Boomers combined account for ₹327.5 crores (61% of revenue) from 8,077 customers. Meanwhile, younger demographics generate ₹208 crores from 6,923 customers.
Marketing insight: Older customers have higher disposable income and purchase more expensive items. But Gen Z represents future growth — their spending habits today predict long-term value. Focus retention on Boomers, acquisition on Gen Z.
📊 Data Insight
Age drives spending: Boomers average ₹41,250 per order vs Gen Z's ₹28,456. The ₹12,794 gap suggests different product preferences or financial capacity. Target premium products to 35+ demographics.
Geographic Revenue Hotspots
City-wise analysis reveals market penetration and regional preferences. Mumbai and Delhi typically dominate Indian ecommerce, but let's verify with our data. Geographic clustering helps optimize logistics and marketing spend.
# Analyze city performance with return rates
city_analysis = df.groupby('city').agg({
'revenue': ['sum', 'mean', 'count'],
'returned': 'mean',
'rating': 'mean'
}).round(2)
# Flatten column names for easier reading
city_analysis.columns = ['total_revenue', 'avg_order_value', 'order_count', 'return_rate', 'avg_rating']
# Sort by total revenue
city_analysis = city_analysis.sort_values('total_revenue', ascending=False)
city_analysistotal_revenue avg_order_value order_count return_rate avg_rating city Mumbai 143628750.0 47876.25 3000 0.18 3.42 Delhi 126891250.0 42297.08 3000 0.22 3.51 Bangalore 115433750.0 38477.92 3000 0.19 3.49 Chennai 106778750.0 35592.92 3000 0.21 3.46 Pune 41912500.0 13970.83 3000 0.20 3.52
What just happened?
Mumbai leads with ₹143.6 crores total revenue and ₹47,876 average order value. Pune lags significantly at ₹13,971 per order. The agg() with dictionary calculated multiple metrics per city. Try this: Calculate revenue per capita using city population data.
Return Rate Alert
Delhi shows highest return rate at 22% despite strong revenue. Mumbai has lowest returns (18%) AND highest revenue — the sweet spot. Fix Delhi's fulfillment issues to capture lost revenue from returns.
Mumbai's dominance is clear — ₹143.6 crores revenue with lowest return rate (18%). Delhi generates ₹126.9 crores but suffers from 22% returns. Pune dramatically underperforms at ₹41.9 crores despite equal customer count.
The 3.4x revenue gap between Mumbai and Pune suggests vastly different purchasing power or product preferences. Mumbai customers spend ₹47,876 per order vs Pune's ₹13,971. This isn't just volume — it's fundamentally different market segments.
Business Recommendations
Immediate Actions
• Investigate Electronics quality issues (lowest ratings)
• Fix Delhi fulfillment to reduce 22% return rate
• Expand Pune market with targeted campaigns
Strategic Moves
• Double Electronics marketing spend (54% revenue)
• Create premium products for Boomers/Gen X
• Develop Gen Z acquisition strategy
The data tells three stories: Electronics drives revenue but hurts satisfaction. Older customers spend significantly more than younger ones. Geographic markets show massive variation in spending power.
EDA isn't about creating pretty charts — it's about extracting actionable insights that change business decisions. These patterns suggest specific product, marketing, and operational improvements worth millions in additional revenue.
📊 Data Insight
Complete EDA revealed: Electronics needs quality improvement despite ₹18.4C revenue. Mumbai outperforms Pune by 3.4x per order. Boomers spend 45% more than Gen Z. Each insight drives specific business actions.
Quiz
1. Based on the EDA analysis, what is the key insight about Electronics category performance?
2. Which city shows the best combination of high revenue and low return rates?
3. Which pandas method correctly calculates total revenue by product category?
Up Next
Predictive Modeling
Transform your EDA insights into predictive models that forecast customer behavior and business outcomes using machine learning algorithms.