Data Science Lesson 28 – Seaborn | Dataplexa

Data Science · Lesson 28

Seaborn

Transform raw datasets into publication-ready statistical visualizations with Python's most elegant plotting library.

Why Seaborn Exists

Matplotlib makes everything possible but little of it easy. A labeled, color-coded scatter plot can take a dozen or more lines. Seaborn fixes this: it's built on Matplotlib but thinks like a statistician, so a single function call often replaces that boilerplate.

Why does this matter? Because most data exploration comes down to the same few plots: distributions, correlations, comparisons between groups. Seaborn has functions named for exactly these tasks. No more wrestling with axis formatting: it handles common statistical visualization patterns for you.

Seaborn automatically calculates statistical summaries, handles categorical data elegantly, and produces publication-quality plots with sensible defaults. It's what matplotlib becomes when it grows up.

Setting Up Your Environment

The scenario: You're a data analyst at Swiggy. The product team wants beautiful charts for their quarterly review. Raw matplotlib charts won't impress anyone — you need professional-grade statistical plots.

# Import the essential libraries for statistical visualization
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load our ecommerce dataset for analysis
df = pd.read_csv('dataplexa_ecommerce.csv')

# Set the visual style - this affects all future plots
sns.set_style("whitegrid")

# Check first few rows to understand our data structure
print(df.head())

   order_id        date  customer_age gender        city product_category       product_name  quantity  unit_price    revenue  rating  returned
0      1001  2023-01-05            28   Male      Mumbai      Electronics         Smartphone         1    25000.00  25000.00     4.5     False
1      1002  2023-01-05            34 Female       Delhi         Clothing        Summer Dress       2      899.99   1799.98     4.2     False
2      1003  2023-01-06            42   Male   Bangalore             Food      Organic Coffee       3      450.00   1350.00     4.8     False
3      1004  2023-01-07            29 Female     Chennai            Books    Python Handbook       1      799.00    799.00     4.6     False
4      1005  2023-01-08            38   Male        Pune             Home      Ceiling Fan        1     3200.00   3200.00     4.1     False

What just happened?

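Nothing is drawn yet. pd.read_csv() loaded the data, and sns.set_style("whitegrid") set the look of every plot that follows. When you do plot, Seaborn reads column names straight from the DataFrame and treats numeric and categorical columns differently. The whitegrid style adds subtle gridlines that help readers estimate values without cluttering the plot. Try this: Change to "darkgrid" to get a light gray background with white gridlines.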
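To see what a style actually changes, it can help to inspect its settings directly. The sketch below uses sns.axes_style(), which returns the parameters for a named style without applying them. The list of style names and the printed fields are picked for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import seaborn as sns

# Collect the settings behind each built-in style without applying them
style_names = ["whitegrid", "darkgrid", "white", "dark", "ticks"]
style_params = {name: sns.axes_style(name) for name in style_names}

# Print the two settings that differ most visibly between styles
for name, params in style_params.items():
    print(f"{name:10s} background={params['axes.facecolor']} grid={params['axes.grid']}")

# A style can also be applied temporarily with a context manager,
# leaving the global setting untouched
with sns.axes_style("darkgrid"):
    pass  # plots created inside this block would use darkgrid
```

The context-manager form is handy when one chart in a report needs a different look from the rest.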

Distribution Plots That Actually Work

Histograms can mislead. Their shape changes with bin size. Seaborn's distribution plots tell a fuller story by layering a smooth density estimate over the bars. This is where Seaborn truly shines: it thinks statistically, not just visually.

The scenario: Swiggy's finance team needs to understand revenue distribution patterns. Are most orders small with few large ones? Or is it evenly spread? Traditional histograms won't cut it.

# Create a figure with proper size for presentation
plt.figure(figsize=(10, 6))

# Create a distribution plot with kernel density estimate
sns.histplot(data=df, x='revenue', kde=True, bins=30)

# Add proper labels and title
plt.title('Revenue Distribution - Swiggy Orders', fontsize=16, fontweight='bold')
plt.xlabel('Revenue (INR)', fontsize=12)
plt.ylabel('Number of Orders', fontsize=12)

# Show the plot
plt.show()

[A distribution plot showing revenue on x-axis (0 to 200000 INR) with histogram bars and a smooth KDE curve overlay. Most orders cluster in the 500-5000 range with a long tail toward higher values.]

What just happened?

The kde=True parameter adds a smooth curve, a kernel density estimate of the underlying distribution. It can reveal patterns that binning hides, though the curve depends on its bandwidth just as the bars depend on bin size. Try this: Use stat='density' to normalize the y-axis for comparing datasets of different sizes.

Most Swiggy orders fall in the ₹500-5K range, indicating a healthy mix of small frequent orders with occasional high-value purchases

This distribution reveals a classic e-commerce pattern — high frequency of smaller orders with a long tail of premium purchases. The smooth curve shows what bins might fragment. For business decisions, this suggests focusing marketing spend on the ₹2K-10K segment where volume and value intersect.

Relationship Plots That Tell Stories

Scatter plots show relationships. But raw matplotlib scatter plots are just dots on a page. Seaborn adds regression lines, confidence intervals, and categorical coloring automatically. This transforms correlation hunting from guesswork to insight.

# Create a scatter plot with regression line and confidence interval
plt.figure(figsize=(10, 6))

# Plot relationship between customer age and revenue with category colors
sns.scatterplot(data=df, x='customer_age', y='revenue', 
                hue='product_category', size='rating', alpha=0.7)

# Add a regression line to see overall trend
sns.regplot(data=df, x='customer_age', y='revenue', 
            scatter=False, color='red', ci=95)

[A scatter plot with customer_age (18-65) on x-axis and revenue (500-200000) on y-axis. Points are colored by product category and sized by rating. A red regression line with gray confidence interval shows slight positive correlation.]

# Add labels and show the complete plot
plt.title('Customer Age vs Revenue by Product Category', fontsize=16, fontweight='bold')
plt.xlabel('Customer Age (years)', fontsize=12)
plt.ylabel('Revenue (INR)', fontsize=12)

# Position legend outside plot area for clarity
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

[Complete scatter plot showing weak positive correlation between age and spending, with Electronics purchases (blue dots) appearing across all age ranges but clustering in higher revenue zones, while Food purchases (green) stay in lower revenue ranges.]

What just happened?

The hue parameter automatically assigns colors by category while size scales points by rating. The ci=95 shows the 95% confidence interval around the trend line. Try this: Use style parameter to vary point shapes by another variable.
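A trend line is only as useful as your understanding of what it is. As a rough check, the sketch below compares the line drawn by sns.regplot() with an ordinary least-squares fit from np.polyfit() and reports the Pearson correlation. The toy DataFrame and its slope of 40 are invented for illustration, and ci=None is used only to skip the bootstrap step.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy data with a known linear trend plus noise (made up for illustration)
rng = np.random.default_rng(1)
age = rng.integers(18, 66, size=200)
revenue = 500 + 40 * age + rng.normal(0, 800, size=200)
toy = pd.DataFrame({"customer_age": age, "revenue": revenue})

fig, ax = plt.subplots()
sns.regplot(data=toy, x="customer_age", y="revenue",
            scatter=False, ci=None, ax=ax)

# Recover the slope of the drawn line from its two endpoints
line_x, line_y = ax.lines[0].get_data()
plotted_slope = (line_y[-1] - line_y[0]) / (line_x[-1] - line_x[0])

# Fit the same straight line directly with least squares
ols_slope, ols_intercept = np.polyfit(toy["customer_age"], toy["revenue"], deg=1)

# Pearson r summarizes how tightly the points follow that line
pearson_r = toy["customer_age"].corr(toy["revenue"])
print(f"plotted slope={plotted_slope:.2f}, OLS slope={ols_slope:.2f}, r={pearson_r:.2f}")
```

The slope tells you how much revenue changes per year of age; the correlation tells you how much scatter surrounds that trend. A chart can show a clear line and still have a weak correlation.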

Electronics purchases dominate high-revenue transactions across all age groups, while Food and Clothing cluster in lower ranges

The scatter plot reveals distinct spending patterns by category. Electronics buyers span all ages but generate premium revenue. Food purchases stay consistently low-value regardless of customer age. This insight drives category-specific marketing strategies: target Electronics broadly, but focus Food campaigns on frequency rather than value.

Category Comparisons Made Simple

Comparing groups takes extra work in matplotlib. Box plots need the data split into one array per group. Bar plots require aggregation first. Seaborn handles this automatically with categorical plot functions that understand your data structure.

# Create side-by-side comparison of categories
plt.figure(figsize=(12, 6))

# Plot 1: Average revenue by product category
plt.subplot(1, 2, 1)
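# errorbar=('ci', 95) draws 95% confidence intervals (seaborn 0.12+; replaces the deprecated ci=95)
sns.barplot(data=df, x='product_category', y='revenue', errorbar=('ci', 95))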
plt.title('Average Revenue by Category', fontweight='bold')
plt.xticks(rotation=45)

# Plot 2: Rating distribution by category  
plt.subplot(1, 2, 2)
sns.boxplot(data=df, x='product_category', y='rating')
plt.title('Rating Distribution by Category', fontweight='bold')
plt.xticks(rotation=45)

[Two subplots side by side. Left: Bar chart showing Electronics (~₹28K), Home (~₹12K), Clothing (~₹2.5K), Books (~₹1.8K), Food (~₹1.1K) with error bars. Right: Box plots showing rating distributions - Electronics and Books have higher median ratings around 4.5, while others cluster around 4.2.]

# Apply tight layout and display both plots
plt.tight_layout()
plt.show()

# Get exact statistics for business reporting
category_stats = df.groupby('product_category').agg({
    'revenue': ['mean', 'median', 'count'],
    'rating': ['mean', 'std']
}).round(2)

print("\nDetailed Category Statistics:")
print(category_stats)

                   revenue                  rating
                      mean    median count    mean   std
product_category
Books              1847.33    799.00   425    4.52  0.31
Clothing           2456.78   1899.99   380    4.18  0.42
Electronics       28450.25  25000.00   310    4.47  0.38
Food               1127.89   1100.00   445    4.21  0.35
Home              11825.50   3200.00   440    4.15  0.44

What just happened?

Seaborn automatically calculated means for the bar plot and quartiles for the box plot. The error bars on the bar plot are 95% confidence intervals, estimated by bootstrapping: they show how uncertain each mean is, not whether a difference is statistically significant. Try this: Use sns.violinplot() to see full distribution shapes instead of just quartiles.
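To make the error bars on a bar plot less of a black box, here is a minimal sketch of a percentile bootstrap for a group mean. It uses only NumPy and pandas. The bootstrap_ci helper and the toy data are hypothetical, written for this illustration; Seaborn's internal procedure may differ in its details.

```python
import numpy as np
import pandas as pd

# Toy data: two categories with very different revenue levels (made up)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "product_category": ["Food"] * 50 + ["Electronics"] * 50,
    "revenue": np.concatenate([
        rng.normal(1100, 300, 50),
        rng.normal(28000, 9000, 50),
    ]),
})

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean of `values` (illustrative helper)."""
    boot_rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Resample with replacement n_boot times and record each resample's mean
    idx = boot_rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Cut off equal tails on both sides to get the requested coverage
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

summary = {}
for category, group in toy.groupby("product_category"):
    low, high = bootstrap_ci(group["revenue"])
    summary[category] = (low, group["revenue"].mean(), high)
    print(f"{category}: mean={summary[category][1]:.0f}, 95% CI=({low:.0f}, {high:.0f})")
```

A narrow interval means the average is pinned down well by the data; a wide one means more orders are needed before trusting the bar's height.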

Electronics dominates revenue per order (₹28K average) while Food maintains the lowest ticket size (₹1.1K average)

The category comparison exposes a revenue quality paradox. Electronics generates 25x more revenue per order than Food, but maintains similar customer satisfaction ratings. This suggests pricing tolerance varies dramatically by category — customers expect to pay premium for Electronics but want value in Food purchases.

📊 Data Insight

Electronics averages ₹28,450 per order with 4.47/5 rating, while Food averages ₹1,128 with 4.21/5 rating. The 2,400% revenue difference suggests completely different purchasing psychology by category.

Seaborn vs Matplotlib: When to Use Which

Seaborn excels at statistical exploration. Matplotlib dominates custom visualization. The trick is knowing which tool fits your current problem.

Choose Seaborn When:

• Exploring data relationships
• Need statistical summaries
• Working with categorical data
• Want publication-ready defaults
• Time is limited

Choose Matplotlib When:

• Building custom chart types
• Need precise control over elements
• Adding custom interactivity (widgets, event handling)
• Non-standard statistical plots
• Performance is critical

Common Mistake: Using the Wrong Tool

Don't force Seaborn into custom visualization needs — its strength is statistical patterns, not artistic freedom. Conversely, don't write 20 matplotlib lines for a simple correlation plot when sns.scatterplot() does it in one.

Chart Selection Guide

Data Type	Seaborn Function	Best For
Single Continuous	`histplot(), kdeplot()`	Distribution shape, outliers
Two Continuous	`scatterplot(), regplot()`	Correlations, trends
Categorical vs Continuous	`boxplot(), barplot()`	Group comparisons
Time Series	`lineplot()`	Trends over time

Up Next

Plotly

Transform your static Seaborn plots into interactive dashboards that stakeholders can explore and customize in real-time.
