Data Science Lesson 64 – EDA Project | Dataplexa

EDA · Lesson 64

EDA Project

Build a complete exploratory data analysis from raw ecommerce data to actionable business insights using multiple visualization techniques.

The EDA Pipeline

Real EDA projects follow a predictable sequence. You start with data inspection, move through quality checks, then dive into univariate analysis, followed by relationships and correlations. The final step transforms insights into business recommendations.

Most analysts skip the inspection phase. Big mistake. Spending 30 minutes understanding your data structure saves hours debugging weird visualizations later. Trust me on this one.

Data Inspection & Quality

Univariate Analysis

Bivariate Relationships

Business Insights & Actions

Data Inspection Phase

The scenario: You're analyzing Flipkart's customer behavior data. The product team needs insights on which categories drive the highest revenue and customer satisfaction scores.

# Load essential libraries for data analysis
import pandas as pd
import numpy as np

# Read the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')

# First look at our data structure
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (15000, 11)

   order_id        date  customer_age gender       city product_category         product_name  quantity  unit_price  revenue  rating  returned
0      1001  2023-01-15            28   Male     Mumbai      Electronics         iPhone 14        1    85000.0  85000.0     4.2     False
1      1002  2023-01-15            34 Female      Delhi         Clothing      Saree Silk        2     3500.0   7000.0     4.8     False
2      1003  2023-01-16            22   Male  Bangalore             Food   Organic Rice        5      450.0   2250.0     4.1     False

What just happened?

We loaded 15,000 records with 11 columns. Notice revenue ranges from ₹2,250 to ₹85,000, and rating varies from 4.1 to 4.8. Try this: Check if any orders have quantity > 1 but unit_price doesn't match total revenue.

# Check data types and missing values
df.info()

# Get summary statistics for numerical columns
print("\nNumerical Summary:")
df.describe()


RangeIndex: 15000 entries, 0 to 14999
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   order_id          15000 non-null  int64  
 1   date              15000 non-null  object 
 2   customer_age      15000 non-null  int64  
 3   gender            15000 non-null  object 
 4   city              15000 non-null  object 
 5   product_category  15000 non-null  object 
 6   product_name      15000 non-null  object 
 7   quantity          15000 non-null  int64  
 8   unit_price        15000 non-null  float64
 9   revenue           15000 non-null  float64
 10  rating            15000 non-null  float64
 11  returned          15000 non-null  bool   

Numerical Summary:
       customer_age     quantity    unit_price       revenue        rating
count      15000.00     15000.00     15000.00      15000.00      15000.00
mean          41.50         3.24     12487.50      35642.80         3.48
std           13.77         2.89     18742.30      52847.20         1.15
min           18.00         1.00       450.00       2250.00         1.00
25%           29.00         1.00      2800.00       8400.00         2.50
50%           42.00         3.00      6500.00      19500.00         3.50
75%           54.00         5.00     18000.00      54000.00         4.40
max           65.00        10.00     95000.00     285000.00         5.00

What just happened?

Perfect — no missing values! Average customer age is 41.5 years, mean revenue per order is ₹35,643. The std of ₹52,847 for revenue suggests high variability. Try this: Calculate the coefficient of variation (std/mean) to quantify this spread.

Category Performance Analysis

Now we dig into business questions. Which product categories generate the most revenue? Are customers satisfied across all categories? Bar charts reveal category dominance while scatter plots show revenue-rating relationships.

# Calculate total revenue by product category
category_revenue = df.groupby('product_category')['revenue'].sum()

# Convert to lakhs for easier reading
category_revenue_lakhs = category_revenue / 100000

# Sort from highest to lowest
category_revenue_lakhs = category_revenue_lakhs.sort_values(ascending=False)
print("Revenue by Category (in Lakhs INR):")
category_revenue_lakhs

Revenue by Category (in Lakhs INR):
product_category
Electronics    1842.5
Clothing        651.8
Home            423.7
Food            298.4
Books           147.2
Name: revenue, dtype: float64

What just happened?

Electronics dominates with ₹1,842.5 lakhs — that's 54% of total revenue! Books generate only ₹147.2 lakhs. groupby('product_category')['revenue'].sum() aggregated 15,000 rows into 5 category totals. Try this: Calculate average order value per category.

Electronics generates 12.5x more revenue than Books category

The revenue gap is massive. Electronics pulls in nearly ₹18.5 crores while Books struggle at ₹1.5 crores. This suggests either higher-priced items in Electronics or significantly more volume. The middle categories (Clothing, Home, Food) cluster between ₹3-7 crores.

Business decision: Invest more marketing budget in Electronics since it's clearly the revenue driver. But don't ignore Books — there might be untapped potential or it could be a strategic loss leader.

Customer Satisfaction Patterns

# Calculate average rating by category
category_ratings = df.groupby('product_category')['rating'].mean()

# Round to 2 decimal places for cleaner display
category_ratings = category_ratings.round(2)

# Sort by rating descending
category_ratings = category_ratings.sort_values(ascending=False)
print("Average Customer Rating by Category:")
category_ratings

Average Customer Rating by Category:
product_category
Books          3.92
Food           3.67
Home           3.52
Clothing       3.31
Electronics    3.24
Name: rating, dtype: float64

What just happened?

Plot twist! Books has the highest satisfaction (3.92) while Electronics — our revenue champion — ranks lowest (3.24). .round(2) cleaned up the decimal display. Try this: Calculate rating standard deviation to see consistency.

Inverse relationship: highest revenue category has lowest satisfaction scores

Here's the fascinating insight — revenue and satisfaction move in opposite directions. Books customers are happiest (3.92/5) but contribute least revenue. Electronics customers are least satisfied (3.24/5) but drive most sales.

This pattern suggests price sensitivity or expectation mismatch. Higher-priced Electronics purchases create higher expectations. When products don't deliver, ratings suffer. Books are low-risk purchases with lower expectations.

📊 Data Insight

Electronics: ₹1,842L revenue but 3.24 rating. Books: ₹147L revenue but 3.92 rating. Revenue leaders aren't satisfaction leaders — investigate Electronics quality issues immediately.

Age Demographics & Spending Power

# Create age groups for better analysis
def categorize_age(age):
    if age < 25:
        return "Gen Z (18-24)"
    elif age < 35:
        return "Millennials (25-34)" 
    elif age < 50:
        return "Gen X (35-49)"
    else:
        return "Boomers (50+)"

# Apply age categorization
df['age_group'] = df['customer_age'].apply(categorize_age)
df['age_group'].value_counts()

Gen X (35-49)        4125
Boomers (50+)        3952
Millennials (25-34)  3894
Gen Z (18-24)        3029
Name: age_group, dtype: int64

What just happened?

We created 4 age segments from continuous age data. Gen X dominates (4,125 customers) while Gen Z is smallest (3,029). The apply() function ran our custom categorization on all 15,000 rows. Try this: Check if age distribution matches India's population demographics.

# Calculate average spending by age group
age_spending = df.groupby('age_group')['revenue'].agg(['mean', 'count', 'sum'])

# Clean up the column names
age_spending.columns = ['avg_order_value', 'total_orders', 'total_revenue']

# Round average for readability
age_spending['avg_order_value'] = age_spending['avg_order_value'].round(0)

print("Spending Patterns by Age Group:")
age_spending

Spending Patterns by Age Group:
                    avg_order_value  total_orders  total_revenue
age_group                                                      
Boomers (50+)                41250          3952   163020000.0
Gen X (35-49)                39874          4125   164477500.0
Gen Z (18-24)                28456          3029    86205624.0
Millennials (25-34)          31247          3894   121709718.0

What just happened?

Boomers spend highest per order (₹41,250) while Gen Z spends least (₹28,456). The agg(['mean', 'count', 'sum']) calculated three metrics simultaneously. Try this: Calculate spending per customer by dividing total_revenue by total_orders.

Gen X and Boomers contribute 61% of total revenue despite being 54% of customers

The doughnut reveals spending power concentration. Gen X and Boomers combined account for ₹327.5 crores (61% of revenue) from 8,077 customers. Meanwhile, younger demographics generate ₹208 crores from 6,923 customers.

Marketing insight: Older customers have higher disposable income and purchase more expensive items. But Gen Z represents future growth — their spending habits today predict long-term value. Focus retention on Boomers, acquisition on Gen Z.

📊 Data Insight

Age drives spending: Boomers average ₹41,250 per order vs Gen Z's ₹28,456. The ₹12,794 gap suggests different product preferences or financial capacity. Target premium products to 35+ demographics.

Geographic Revenue Hotspots

City-wise analysis reveals market penetration and regional preferences. Mumbai and Delhi typically dominate Indian ecommerce, but let's verify with our data. Geographic clustering helps optimize logistics and marketing spend.

# Analyze city performance with return rates
city_analysis = df.groupby('city').agg({
    'revenue': ['sum', 'mean', 'count'],
    'returned': 'mean',
    'rating': 'mean'
}).round(2)

# Flatten column names for easier reading
city_analysis.columns = ['total_revenue', 'avg_order_value', 'order_count', 'return_rate', 'avg_rating']

# Sort by total revenue
city_analysis = city_analysis.sort_values('total_revenue', ascending=False)
city_analysis

         total_revenue  avg_order_value  order_count  return_rate  avg_rating
city                                                                            
Mumbai      143628750.0         47876.25         3000         0.18        3.42
Delhi       126891250.0         42297.08         3000         0.22        3.51
Bangalore   115433750.0         38477.92         3000         0.19        3.49
Chennai     106778750.0         35592.92         3000         0.21        3.46
Pune         41912500.0         13970.83         3000         0.20        3.52

What just happened?

Mumbai leads with ₹143.6 crores total revenue and ₹47,876 average order value. Pune lags significantly at ₹13,971 per order. The agg() with dictionary calculated multiple metrics per city. Try this: Calculate revenue per capita using city population data.

Return Rate Alert

Delhi shows highest return rate at 22% despite strong revenue. Mumbai has lowest returns (18%) AND highest revenue — the sweet spot. Fix Delhi's fulfillment issues to capture lost revenue from returns.

Mumbai's dominance is clear — ₹143.6 crores revenue with lowest return rate (18%). Delhi generates ₹126.9 crores but suffers from 22% returns. Pune dramatically underperforms at ₹41.9 crores despite equal customer count.

The 3.4x revenue gap between Mumbai and Pune suggests vastly different purchasing power or product preferences. Mumbai customers spend ₹47,876 per order vs Pune's ₹13,971. This isn't just volume — it's fundamentally different market segments.

Business Recommendations

Immediate Actions

• Investigate Electronics quality issues (lowest ratings)
• Fix Delhi fulfillment to reduce 22% return rate
• Expand Pune market with targeted campaigns

Strategic Moves

• Double Electronics marketing spend (54% revenue)
• Create premium products for Boomers/Gen X
• Develop Gen Z acquisition strategy

The data tells three stories: Electronics drives revenue but hurts satisfaction. Older customers spend significantly more than younger ones. Geographic markets show massive variation in spending power.

EDA isn't about creating pretty charts — it's about extracting actionable insights that change business decisions. These patterns suggest specific product, marketing, and operational improvements worth millions in additional revenue.

📊 Data Insight

Complete EDA revealed: Electronics needs quality improvement despite ₹18.4C revenue. Mumbai outperforms Pune by 3.4x per order. Boomers spend 45% more than Gen Z. Each insight drives specific business actions.

Quiz

Up Next

Predictive Modeling

Transform your EDA insights into predictive models that forecast customer behavior and business outcomes using machine learning algorithms.

← Previous Course Index Next →