Data Science Lesson 15 – Real-World EDA | Dataplexa
Data Analysis · Lesson 15

Real-World EDA

Apply systematic EDA techniques to uncover business insights, identify data quality issues, and build the foundation for machine learning models using real e-commerce data.

1. Data Quality Assessment
2. Business Pattern Discovery
3. Correlation Analysis
4. Actionable Recommendations

The Reality Gap

Most EDA tutorials show you perfect datasets. Clean columns. No missing values. Logical data types. That's like learning to drive in an empty parking lot — technically correct but utterly unrealistic.

Real data science projects fail not because of complex algorithms but because analysts skip thorough EDA. They miss outliers that break models. They overlook seasonal patterns that drive revenue. They ignore data quality issues that corrupt results.

Here's what can happen at a company like Flipkart when a data scientist rushes EDA: a pricing algorithm trained on data with undetected duplicates recommends selling electronics at 20% below cost, and revenue drops ₹2 crores before anyone notices. That's why systematic EDA matters.
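Undetected duplicates like the ones in this scenario are cheap to check for. The sketch below uses a tiny made-up frame; the column names (`order_id`, `unit_price`) and the values are illustrative assumptions, not the lesson's dataset.

```python
import pandas as pd

# Toy transactions (hypothetical values); the second and third rows are identical
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "unit_price": [499.0, 1299.0, 1299.0, 250.0],
})

# Count rows that exactly repeat an earlier row across every column
n_duplicates = int(orders.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Drop the repeats, keeping the first occurrence of each row
deduped = orders.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(orders)}, after: {len(deduped)}")
```

Whether a repeated row is a true duplicate or a legitimate repeat purchase depends on the data, so it is worth inspecting the flagged rows before dropping them.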

Academic EDA: Perfect data, obvious patterns, clean visualizations

Production EDA: Messy data, hidden outliers, business context required

Time Available: 2-3 hours max before stakeholders want insights

Stakes: Wrong insights = failed campaigns worth millions

The First 10 Minutes

The scenario: You're a data analyst at Myntra. The growth team needs insights on customer purchase patterns by 2 PM for tomorrow's leadership meeting. You have e-commerce transaction data. Clock starts now.

# STEP 1: Load and get immediate data overview
import pandas as pd
import numpy as np

# Load the dataset - first thing every morning
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")

What just happened?

The .shape output tells us we have 50,000 transactions with 11 columns. That's substantial enough for meaningful patterns but small enough to process quickly. Try this: Always check shape first — it determines your entire analysis strategy.

# STEP 2: Quick health check - what are we dealing with?
print("=== DATA HEALTH CHECK ===")
print("Missing values per column:")
print(df.isnull().sum())
print("\nData types:")
print(df.dtypes)

What just happened?

Great news: zero missing values! The date column is object type, not datetime — we'll need to convert that. Notice returned is already boolean — someone cleaned this data well. Try this: Run this health check on every dataset before diving deeper.
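The date conversion mentioned above could look like the following sketch. It runs on a small stand-in frame, and the column name `date` is an assumption based on how the lesson refers to the column.

```python
import pandas as pd

# Toy frame standing in for the real data; 'date' is an assumed column name
sample = pd.DataFrame({"date": ["2024-01-15", "2024-02-03", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising an error
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")

print(sample.dtypes)
print(f"Unparseable dates: {sample['date'].isna().sum()}")
```

Counting the NaT values after conversion is a quick way to see how many date strings failed to parse.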

Common Mistake: Trusting Zero Missing Values

Just because .isnull().sum() shows zeros doesn't mean the data is clean. Missing values often hide as empty strings, "Unknown", -999, or 0. Always check for these disguised nulls in categorical and numeric columns.
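One way to look for disguised nulls is sketched below. The frame and both placeholder lists are hypothetical; which values count as "missing" depends on your own data, and 0 is left out here because it is often a legitimate value.

```python
import pandas as pd

# Toy frame with placeholder values (hypothetical data)
sample = pd.DataFrame({
    "city": ["Mumbai", "", "Unknown", "Delhi"],
    "customer_age": [34, -999, 27, 41],
})

# Placeholder lists are assumptions; adjust them to your own data
text_placeholders = ["", "Unknown", "N/A"]
numeric_placeholders = [-999]

# Boolean masks marking the disguised nulls in each column
city_mask = sample["city"].str.strip().isin(text_placeholders)
age_mask = sample["customer_age"].isin(numeric_placeholders)

disguised = pd.Series({"city": city_mask.sum(), "customer_age": age_mask.sum()})
print(disguised)

# Replace the placeholders with real NaN so isnull() can see them
cleaned = sample.copy()
cleaned["city"] = sample["city"].mask(city_mask)
cleaned["customer_age"] = sample["customer_age"].mask(age_mask)
print(cleaned.isnull().sum())
```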

Revenue Pattern Discovery

Business stakeholders care about money first. Everything else is secondary. Start with revenue analysis — it gets attention and reveals the most actionable patterns.

# Revenue by category - where does the money come from?
category_revenue = df.groupby('product_category')['revenue'].agg(['sum', 'mean', 'count'])
category_revenue.columns = ['Total_Revenue', 'Avg_Order_Value', 'Order_Count']
category_revenue = category_revenue.sort_values('Total_Revenue', ascending=False)
print("Revenue by Product Category:")
print(category_revenue)

📊 Data Insight

Electronics drives 40.5% of total revenue despite having fewer orders than Books and Food. Electronics AOV is ₹7,249 vs Food's ₹3,924, an 85% difference that suggests a premium pricing strategy.
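A revenue-share figure like the one above can be computed by dividing each category total by the grand total. This is a minimal sketch on made-up numbers that reuses the lesson's column names; the values are not from the real dataset.

```python
import pandas as pd

# Toy orders (hypothetical values), using the lesson's column names
orders = pd.DataFrame({
    "product_category": ["Electronics", "Electronics", "Clothing", "Books", "Food"],
    "revenue": [6000.0, 4000.0, 5000.0, 3000.0, 2000.0],
})

category_totals = orders.groupby("product_category")["revenue"].sum()

# Share of total revenue per category, in percent
revenue_share = (category_totals / category_totals.sum() * 100).round(1)
revenue_share = revenue_share.sort_values(ascending=False)
print(revenue_share)
```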

Electronics and Clothing account for 68% of total platform revenue

This chart shows a Pareto-style concentration: two of the categories generate roughly two-thirds of revenue. But look deeper: Electronics has high AOV with moderate volume. Clothing has lower AOV but massive volume. That suggests different optimization strategies for each.

For business decisions: Electronics inventory should focus on premium products. Clothing should optimize for volume and conversion rates. Books and Food are underperforming — investigate whether it's pricing, selection, or marketing issues.

Customer Segmentation Analysis

# Age group analysis - who are our customers?
df['age_group'] = pd.cut(df['customer_age'], 
                        bins=[17, 25, 35, 45, 65], 
                        labels=['18-25', '26-35', '36-45', '46-65'])

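# observed=False keeps empty age groups in the output and silences the default-change warning in recent pandas versions
age_analysis = df.groupby('age_group', observed=False)['revenue'].agg(['sum', 'mean', 'count'])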
age_analysis.columns = ['Total_Revenue', 'AOV', 'Orders']
print("Customer Age Group Analysis:")
print(age_analysis)

What just happened?

The pd.cut() function assigned each customer to one of the age brackets we defined. Notice AOV increases with age: older customers spend ₹387 more per order than younger ones. The 26-35 group has the highest volume but 46-65 has the highest AOV. Try this: Always segment customers by value, not just demographics.
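It helps to know how pd.cut() treats the bin edges. With the default settings each bin includes its right edge and excludes its left edge, and any value outside the outermost edges becomes NaN. The ages below are made up to hit those boundaries.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or outside the bin edges
ages = pd.Series([17, 18, 25, 26, 65, 70])

age_group = pd.cut(
    ages,
    bins=[17, 25, 35, 45, 65],
    labels=["18-25", "26-35", "36-45", "46-65"],
)

print(pd.DataFrame({"age": ages, "age_group": age_group}))

# Ages 17 (on the excluded left edge) and 70 (above the last edge) get no group
print(f"Unassigned ages: {age_group.isna().sum()}")
```

Checking the count of unassigned rows after binning shows whether any customers silently dropped out of the grouped analysis.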

# City performance - geographic insights
city_performance = df.groupby('city').agg({
    'revenue': ['sum', 'mean'],
    'rating': 'mean',
    'returned': lambda x: (x.sum() / len(x)) * 100
}).round(2)

city_performance.columns = ['Total_Revenue', 'AOV', 'Avg_Rating', 'Return_Rate_%']
city_performance = city_performance.sort_values('Total_Revenue', ascending=False)
print("City Performance Analysis:")
print(city_performance)

Mumbai and Delhi contribute 31.7% of total platform revenue

📊 Data Insight

Chennai has the highest return rate at 16.42% despite similar AOV to other cities. Delhi has the best customer satisfaction (3.82 rating) and lowest returns (14.67%). This suggests operational or quality issues specific to Chennai.
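The same city table can also be built with pandas named aggregation, which attaches column names at aggregation time instead of renaming them by position afterwards. This is an alternative sketch on toy data, not the lesson's original code; the values and the `Return_Rate_Pct` name are assumptions.

```python
import pandas as pd

# Toy orders (hypothetical values), using the lesson's column names
orders = pd.DataFrame({
    "city": ["Chennai", "Chennai", "Delhi", "Delhi"],
    "revenue": [4000.0, 6000.0, 5000.0, 7000.0],
    "rating": [3, 4, 4, 5],
    "returned": [True, False, False, False],
})

city_performance = (
    orders.groupby("city")
    .agg(
        Total_Revenue=("revenue", "sum"),
        AOV=("revenue", "mean"),
        Avg_Rating=("rating", "mean"),
        # The mean of a boolean column is the fraction of True values
        Return_Rate_Pct=("returned", lambda s: s.mean() * 100),
    )
    .round(2)
    .sort_values("Total_Revenue", ascending=False)
)
print(city_performance)
```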

Outlier Investigation

Outliers either represent your biggest opportunities or your biggest problems. Never ignore them. In e-commerce, they might be bulk orders, data entry errors, or VIP customers.

# Revenue outliers - find the extreme values
revenue_stats = df['revenue'].describe()
print("Revenue Distribution:")
print(revenue_stats)

# Calculate IQR for outlier detection
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
outlier_threshold = Q3 + 1.5 * IQR

outliers = df[df['revenue'] > outlier_threshold]
print(f"\nFound {len(outliers)} revenue outliers above ₹{outlier_threshold:.2f}")
print("Top 5 revenue outliers:")
print(outliers.nlargest(5, 'revenue')[['revenue', 'product_category', 'quantity', 'unit_price']])

What just happened?

The IQR method found 3,847 outliers (7.7% of data). The highest order is ₹1.98 lakhs — legitimate for premium electronics with high quantity. Notice unit_price values suggest these are genuine premium purchases, not data errors. Try this: Always examine outliers manually before removing them.
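The IQR rule has a lower fence as well as an upper one, which matters for columns where unusually small values are suspicious. The helper below is a hypothetical sketch (the name `iqr_bounds` and the sample values are made up) that returns both fences.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Toy revenue values with one obvious extreme
revenue = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 1000.0])

lower, upper = iqr_bounds(revenue)
flagged = revenue[(revenue < lower) | (revenue > upper)]

print(f"Fences: {lower:.2f} to {upper:.2f}")
print(flagged)
```

For revenue the lower fence is often negative and flags nothing, so the upper fence alone is usually enough there.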

Most orders fall in the ₹2k-6k range, with a long tail of high-value purchases

This distribution reveals typical e-commerce behavior: mass market in the middle, premium segment in the tail. The ₹15k+ segment represents only 2.5% of orders but likely 15% or more of revenue. These customers need different treatment: personalized service, exclusive offers, priority support.
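The revenue share of the high-value tail is stated above as an estimate, and it can be measured directly. This sketch compares the share of orders with the share of revenue above a threshold; the values are made up and the ₹15,000 cut-off mirrors the segment discussed above.

```python
import pandas as pd

# Toy order values (hypothetical)
revenue = pd.Series([3000.0, 4000.0, 5000.0, 2000.0, 16000.0])
threshold = 15_000

high_value = revenue >= threshold

# Percentage of orders in the segment vs. percentage of revenue it contributes
order_share = high_value.mean() * 100
revenue_share = revenue[high_value].sum() / revenue.sum() * 100

print(f"Orders at or above ₹{threshold:,}: {order_share:.1f}% of orders")
print(f"Revenue from those orders: {revenue_share:.1f}% of revenue")
```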

Correlation and Relationships

# Correlation analysis between key metrics
correlation_data = df[['customer_age', 'quantity', 'unit_price', 'revenue', 'rating']].corr()
print("Correlation Matrix:")
print(correlation_data.round(3))

# Rating vs Revenue relationship
rating_revenue = df.groupby('rating')['revenue'].agg(['mean', 'count'])
print("\nRevenue by Rating:")
print(rating_revenue.round(2))

Surprising Discovery

Rating has almost zero correlation with revenue (-0.023). Higher-rated products don't generate more revenue per order. This challenges the assumption that customer satisfaction directly drives sales value. The correlation between unit_price and revenue (0.923) dominates everything else.

This finding has major business implications. Customer satisfaction (ratings) doesn't correlate with order value, but age does, weakly: older customers spend more (0.156 correlation). Quantity and unit price drive revenue in different ways: quantity has a moderately strong correlation (0.687) while unit price has a very strong one (0.923). Keep in mind that if revenue is calculated as quantity × unit_price, both of these correlations are partly built into the data rather than discovered.
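The default .corr() is Pearson correlation, which only measures linear relationships, so a near-zero value does not rule out every kind of pattern. A rank-based (Spearman) check is a useful second look. The sketch below computes it as Pearson correlation on ranks, using made-up data where the relationship is monotonic but not linear.

```python
import pandas as pd

# Toy data (hypothetical): revenue rises with unit_price, but not in a straight line
sample = pd.DataFrame({
    "unit_price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue": [1.0, 8.0, 27.0, 64.0, 125.0],
})

# Pearson: strength of the linear relationship
pearson = sample["unit_price"].corr(sample["revenue"])

# Spearman: Pearson correlation computed on the ranks of each column
spearman = sample["unit_price"].rank().corr(sample["revenue"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the two measures disagree strongly on a pair of columns, a scatter plot of that pair is usually the next step.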

For strategy: Focus on selling higher-priced items rather than pushing for better ratings. Target older demographics with premium products. Don't assume happy customers spend more — they might just be satisfied with cheaper items.

Pro Tip: Always question business assumptions with data. "Customer satisfaction drives revenue" sounds logical, but in this dataset the correlation analysis shows no linear link between ratings and order value. Question everything, measure everything, act on data not intuition.

EDA Summary and Next Steps

After 45 minutes of systematic EDA, here's what we discovered that changes everything:

Before EDA (Assumptions)

  • All categories perform similarly
  • Younger customers are key segment
  • High ratings = high revenue
  • Geographic differences are minor

After EDA (Reality)

  • Electronics drives 40.5% of revenue
  • Older customers spend ₹387 more per order
  • Rating-revenue correlation is near zero
  • Chennai's return rate is about 12% higher than Delhi's (16.42% vs 14.67%)

These insights directly inform feature engineering decisions. We need age-based customer segments, category-specific pricing models, and city-level quality adjustments. Revenue outliers suggest bulk buyer segments worth targeting.

The correlation analysis revealed that unit price drives revenue more than anything else. For machine learning models, focus on price optimization rather than satisfaction metrics. Older customers and Electronics category should get separate treatment.

Real-world EDA isn't about perfect visualizations. It's about finding the 3-4 insights that change your strategy. In under an hour, this dataset overturned all four of the assumptions we started with.

Quiz

1. Based on the revenue analysis, why should Myntra prioritize Electronics inventory over Food and Books?


2. What does the correlation analysis reveal about the relationship between customer ratings and revenue?


3. What actionable insight does the city performance analysis provide for operations teams?


Up Next

Feature Creation

Transform your EDA insights into engineered features that boost model performance by creating age segments, revenue tiers, and interaction variables.