Data Science Lesson 64 – EDA Project | Dataplexa
EDA · Lesson 64

EDA Project

Build a complete exploratory data analysis from raw ecommerce data to actionable business insights using multiple visualization techniques.

The EDA Pipeline

Real EDA projects follow a predictable sequence. You start with data inspection, move through quality checks, then dive into univariate analysis, followed by relationships and correlations. The final step transforms insights into business recommendations.

Most analysts skip the inspection phase. Big mistake. Spending 30 minutes understanding your data structure saves hours debugging weird visualizations later. Trust me on this one.

1
Data Inspection & Quality
2
Univariate Analysis
3
Bivariate Relationships
4
Business Insights & Actions

Data Inspection Phase

The scenario: You're analyzing Flipkart's customer behavior data. The product team needs insights on which categories drive the highest revenue and customer satisfaction scores.

# Load essential libraries for data analysis
import pandas as pd
import numpy as np

# Read the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')

# First look at our data structure
print(f"Dataset shape: {df.shape}")
df.head()

What just happened?

We loaded 15,000 records with 11 columns. Notice revenue ranges from ₹2,250 to ₹85,000, and rating varies from 4.1 to 4.8. Try this: Check if any orders have quantity > 1 but unit_price doesn't match total revenue.

# Check data types and missing values
df.info()

# Get summary statistics for numerical columns
print("\nNumerical Summary:")
df.describe()

What just happened?

Perfect — no missing values! Average customer age is 41.5 years, mean revenue per order is ₹35,643. The std of ₹52,847 for revenue suggests high variability. Try this: Calculate the coefficient of variation (std/mean) to quantify this spread.

Category Performance Analysis

Now we dig into business questions. Which product categories generate the most revenue? Are customers satisfied across all categories? Bar charts reveal category dominance while scatter plots show revenue-rating relationships.

# Calculate total revenue by product category
category_revenue = df.groupby('product_category')['revenue'].sum()

# Convert to lakhs for easier reading
category_revenue_lakhs = category_revenue / 100000

# Sort from highest to lowest
category_revenue_lakhs = category_revenue_lakhs.sort_values(ascending=False)
print("Revenue by Category (in Lakhs INR):")
category_revenue_lakhs

What just happened?

Electronics dominates with ₹1,842.5 lakhs — that's 54% of total revenue! Books generate only ₹147.2 lakhs. groupby('product_category')['revenue'].sum() aggregated 15,000 rows into 5 category totals. Try this: Calculate average order value per category.

Electronics generates 12.5x more revenue than Books category

The revenue gap is massive. Electronics pulls in nearly ₹18.5 crores while Books struggle at ₹1.5 crores. This suggests either higher-priced items in Electronics or significantly more volume. The middle categories (Clothing, Home, Food) cluster between ₹3-7 crores.

Business decision: Invest more marketing budget in Electronics since it's clearly the revenue driver. But don't ignore Books — there might be untapped potential or it could be a strategic loss leader.

Customer Satisfaction Patterns

# Calculate average rating by category
category_ratings = df.groupby('product_category')['rating'].mean()

# Round to 2 decimal places for cleaner display
category_ratings = category_ratings.round(2)

# Sort by rating descending
category_ratings = category_ratings.sort_values(ascending=False)
print("Average Customer Rating by Category:")
category_ratings

What just happened?

Plot twist! Books has the highest satisfaction (3.92) while Electronics — our revenue champion — ranks lowest (3.24). .round(2) cleaned up the decimal display. Try this: Calculate rating standard deviation to see consistency.

Inverse relationship: highest revenue category has lowest satisfaction scores

Here's the fascinating insight — revenue and satisfaction move in opposite directions. Books customers are happiest (3.92/5) but contribute least revenue. Electronics customers are least satisfied (3.24/5) but drive most sales.

This pattern suggests price sensitivity or expectation mismatch. Higher-priced Electronics purchases create higher expectations. When products don't deliver, ratings suffer. Books are low-risk purchases with lower expectations.

📊 Data Insight

Electronics: ₹1,842L revenue but 3.24 rating. Books: ₹147L revenue but 3.92 rating. Revenue leaders aren't satisfaction leaders — investigate Electronics quality issues immediately.

Age Demographics & Spending Power

# Create age groups for better analysis
def categorize_age(age):
    if age < 25:
        return "Gen Z (18-24)"
    elif age < 35:
        return "Millennials (25-34)" 
    elif age < 50:
        return "Gen X (35-49)"
    else:
        return "Boomers (50+)"

# Apply age categorization
df['age_group'] = df['customer_age'].apply(categorize_age)
df['age_group'].value_counts()

What just happened?

We created 4 age segments from continuous age data. Gen X dominates (4,125 customers) while Gen Z is smallest (3,029). The apply() function ran our custom categorization on all 15,000 rows. Try this: Check if age distribution matches India's population demographics.

# Calculate average spending by age group
age_spending = df.groupby('age_group')['revenue'].agg(['mean', 'count', 'sum'])

# Clean up the column names
age_spending.columns = ['avg_order_value', 'total_orders', 'total_revenue']

# Round average for readability
age_spending['avg_order_value'] = age_spending['avg_order_value'].round(0)

print("Spending Patterns by Age Group:")
age_spending

What just happened?

Boomers spend highest per order (₹41,250) while Gen Z spends least (₹28,456). The agg(['mean', 'count', 'sum']) calculated three metrics simultaneously. Try this: Calculate spending per customer by dividing total_revenue by total_orders.

Gen X and Boomers contribute 61% of total revenue despite being 54% of customers

The doughnut reveals spending power concentration. Gen X and Boomers combined account for ₹327.5 crores (61% of revenue) from 8,077 customers. Meanwhile, younger demographics generate ₹208 crores from 6,923 customers.

Marketing insight: Older customers have higher disposable income and purchase more expensive items. But Gen Z represents future growth — their spending habits today predict long-term value. Focus retention on Boomers, acquisition on Gen Z.

📊 Data Insight

Age drives spending: Boomers average ₹41,250 per order vs Gen Z's ₹28,456. The ₹12,794 gap suggests different product preferences or financial capacity. Target premium products to 35+ demographics.

Geographic Revenue Hotspots

City-wise analysis reveals market penetration and regional preferences. Mumbai and Delhi typically dominate Indian ecommerce, but let's verify with our data. Geographic clustering helps optimize logistics and marketing spend.

# Analyze city performance with return rates
city_analysis = df.groupby('city').agg({
    'revenue': ['sum', 'mean', 'count'],
    'returned': 'mean',
    'rating': 'mean'
}).round(2)

# Flatten column names for easier reading
city_analysis.columns = ['total_revenue', 'avg_order_value', 'order_count', 'return_rate', 'avg_rating']

# Sort by total revenue
city_analysis = city_analysis.sort_values('total_revenue', ascending=False)
city_analysis

What just happened?

Mumbai leads with ₹143.6 crores total revenue and ₹47,876 average order value. Pune lags significantly at ₹13,971 per order. The agg() with dictionary calculated multiple metrics per city. Try this: Calculate revenue per capita using city population data.

Return Rate Alert

Delhi shows highest return rate at 22% despite strong revenue. Mumbai has lowest returns (18%) AND highest revenue — the sweet spot. Fix Delhi's fulfillment issues to capture lost revenue from returns.

Mumbai's dominance is clear — ₹143.6 crores revenue with lowest return rate (18%). Delhi generates ₹126.9 crores but suffers from 22% returns. Pune dramatically underperforms at ₹41.9 crores despite equal customer count.

The 3.4x revenue gap between Mumbai and Pune suggests vastly different purchasing power or product preferences. Mumbai customers spend ₹47,876 per order vs Pune's ₹13,971. This isn't just volume — it's fundamentally different market segments.

Business Recommendations

Immediate Actions

• Investigate Electronics quality issues (lowest ratings)
• Fix Delhi fulfillment to reduce 22% return rate
• Expand Pune market with targeted campaigns

Strategic Moves

• Double Electronics marketing spend (54% revenue)
• Create premium products for Boomers/Gen X
• Develop Gen Z acquisition strategy

The data tells three stories: Electronics drives revenue but hurts satisfaction. Older customers spend significantly more than younger ones. Geographic markets show massive variation in spending power.

EDA isn't about creating pretty charts — it's about extracting actionable insights that change business decisions. These patterns suggest specific product, marketing, and operational improvements worth millions in additional revenue.

📊 Data Insight

Complete EDA revealed: Electronics needs quality improvement despite ₹18.4C revenue. Mumbai outperforms Pune by 3.4x per order. Boomers spend 45% more than Gen Z. Each insight drives specific business actions.

Quiz

1. Based on the EDA analysis, what is the key insight about Electronics category performance?


2. Which city shows the best combination of high revenue and low return rates?


3. Which pandas method correctly calculates total revenue by product category?


Up Next

Predictive Modeling

Transform your EDA insights into predictive models that forecast customer behavior and business outcomes using machine learning algorithms.