Data Science
Seaborn
Transform raw datasets into publication-ready statistical visualizations with Python's most elegant plotting library.
Why Seaborn Exists
Matplotlib makes everything possible but little of it easy. A scatter plot with per-category colors, a legend, and a fitted trend line can take a dozen or more lines. Seaborn fixes this: it's built on Matplotlib but thinks like a statistician, so one call often replaces that whole block.
Why does this matter? Because most data exploration comes down to the same few plots: distributions, correlations, comparisons between groups. Seaborn has functions named exactly for these tasks. Instead of wrestling with axis formatting, you let it handle the common statistical visualization patterns.
Seaborn automatically calculates statistical summaries, handles categorical data elegantly, and produces publication-quality plots with sensible defaults. It's what Matplotlib becomes when it grows up.
Setting Up Your Environment
The scenario: You're a data analyst at Swiggy. The product team wants beautiful charts for their quarterly review. Raw matplotlib charts won't impress anyone — you need professional-grade statistical plots.
# Import the essential libraries for statistical visualization
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Load our ecommerce dataset for analysis
df = pd.read_csv('dataplexa_ecommerce.csv')
# Set the visual style - this affects all future plots
sns.set_style("whitegrid")
# Check first few rows to understand our data structure
print(df.head())
   order_id        date  customer_age  gender       city product_category     product_name  quantity  unit_price   revenue  rating  returned
0      1001  2023-01-05            28    Male     Mumbai      Electronics       Smartphone         1    25000.00  25000.00     4.5     False
1      1002  2023-01-05            34  Female      Delhi         Clothing     Summer Dress         2      899.99   1799.98     4.2     False
2      1003  2023-01-06            42    Male  Bangalore             Food   Organic Coffee         3      450.00   1350.00     4.8     False
3      1004  2023-01-07            29  Female    Chennai            Books  Python Handbook         1      799.00    799.00     4.6     False
4      1005  2023-01-08            38    Male       Pune             Home      Ceiling Fan         1     3200.00   3200.00     4.1     False
What just happened?
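Nothing is plotted yet: sns.set_style("whitegrid") only sets the look of every figure drawn afterwards, and df.head() confirms the column names and types you will pass to Seaborn. The whitegrid style adds subtle gridlines that help readers estimate values without cluttering the plot. Try this: Change to "darkgrid" for a gray background with white gridlines, or to "white" or "dark" to drop the grid entirely.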
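If you don't have dataplexa_ecommerce.csv at hand, a stand-in DataFrame with the same column names lets you run the rest of the examples. The sketch below is only an assumption about what such data could look like: every value is randomly generated, the per-category base prices are made up for illustration, and the product_name column is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataplexa_ecommerce.csv: same column names as the
# printed head, but all values are randomly generated (product_name omitted).
rng = np.random.default_rng(seed=42)
n_orders = 500

categories = ["Electronics", "Clothing", "Food", "Books", "Home"]
# Assumed typical unit price per category, for illustration only
base_price = {"Electronics": 25000, "Clothing": 900, "Food": 450,
              "Books": 800, "Home": 3200}

category = rng.choice(categories, size=n_orders)
quantity = rng.integers(1, 4, size=n_orders)  # 1 to 3 items per order
# Log-normal noise keeps prices positive and right-skewed
noise = rng.lognormal(mean=0.0, sigma=0.3, size=n_orders)
unit_price = (np.array([base_price[c] for c in category]) * noise).round(2)

df = pd.DataFrame({
    "order_id": np.arange(1001, 1001 + n_orders),
    "date": pd.Timestamp("2023-01-05")
            + pd.to_timedelta(rng.integers(0, 90, size=n_orders), unit="D"),
    "customer_age": rng.integers(18, 66, size=n_orders),
    "gender": rng.choice(["Male", "Female"], size=n_orders),
    "city": rng.choice(["Mumbai", "Delhi", "Bangalore", "Chennai", "Pune"],
                       size=n_orders),
    "product_category": category,
    "quantity": quantity,
    "unit_price": unit_price,
    "revenue": (unit_price * quantity).round(2),
    "rating": rng.uniform(3.5, 5.0, size=n_orders).round(1),
    "returned": rng.random(n_orders) < 0.05,
})

print(df.head())
```

Because the seed is fixed, the same rows come back on every run, which makes later plots reproducible.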
Distribution Plots That Actually Work
Histograms can mislead: the same data changes shape as the bin size changes. Seaborn's distribution plots overlay a kernel density estimate on the bars, so you see the raw counts and a smoothed shape together. This is where Seaborn truly shines, because it thinks statistically, not just visually.
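To see the bin-size effect in numbers, here is a minimal sketch that bins one sample three different ways with NumPy. The log-normal "revenue" values are synthetic and assumed purely for illustration; they are not the tutorial's dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic right-skewed "revenue" values (assumed data)
revenue = rng.lognormal(mean=7.5, sigma=1.0, size=1000)

results = {}
for bins in (5, 30, 100):
    counts, edges = np.histogram(revenue, bins=bins)
    results[bins] = counts
    share = counts.max() / counts.sum()
    print(f"bins={bins:>3}: tallest bar holds {share:.0%} of all orders")
```

The data never changes, yet the share of orders in the tallest bar shrinks as the bins get narrower, which is why a single histogram can give a skewed impression.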
The scenario: Swiggy's finance team needs to understand revenue distribution patterns. Are most orders small with few large ones? Or is it evenly spread? Traditional histograms won't cut it.
# Create a figure with proper size for presentation
plt.figure(figsize=(10, 6))
# Create a distribution plot with kernel density estimate
sns.histplot(data=df, x='revenue', kde=True, bins=30)
# Add proper labels and title
plt.title('Revenue Distribution - Swiggy Orders', fontsize=16, fontweight='bold')
plt.xlabel('Revenue (INR)', fontsize=12)
plt.ylabel('Number of Orders', fontsize=12)
# Show the plot
plt.show()
[Figure: distribution plot with revenue on the x-axis (0 to 200000 INR), histogram bars, and a smooth KDE curve overlay. Most orders cluster in the 500-5000 range with a long tail toward higher values.]
What just happened?
The kde=True parameter overlays a smooth kernel density estimate of the underlying distribution, which can reveal patterns that binning hides. Try this: Use stat='density' to normalize the y-axis when comparing datasets of different sizes, and relabel the y-axis, since it no longer shows order counts.
Most Swiggy orders fall in the ₹500-5K range, indicating a healthy mix of small frequent orders with occasional high-value purchases
This distribution reveals a classic e-commerce pattern — high frequency of smaller orders with a long tail of premium purchases. The smooth curve shows what bins might fragment. For business decisions, this suggests focusing marketing spend on the ₹2K-10K segment where volume and value intersect.
Relationship Plots That Tell Stories
Scatter plots show relationships. But raw matplotlib scatter plots are just dots on a page. Seaborn adds regression lines, confidence intervals, and categorical coloring automatically. This transforms correlation hunting from guesswork to insight.
# Create a scatter plot with regression line and confidence interval
plt.figure(figsize=(10, 6))
# Plot relationship between customer age and revenue with category colors
sns.scatterplot(data=df, x='customer_age', y='revenue',
                hue='product_category', size='rating', alpha=0.7)
# Add a regression line to see overall trend
sns.regplot(data=df, x='customer_age', y='revenue',
            scatter=False, color='red', ci=95)
# Add labels and show the complete plot
plt.title('Customer Age vs Revenue by Product Category', fontsize=16, fontweight='bold')
plt.xlabel('Customer Age (years)', fontsize=12)
plt.ylabel('Revenue (INR)', fontsize=12)
# Position legend outside plot area for clarity
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
[Figure: scatter plot with customer_age (18-65) on the x-axis and revenue (500-200000) on the y-axis. Points are colored by product category and sized by rating. A red regression line with a confidence band shows a weak positive correlation between age and spending. Electronics purchases (blue) appear across all age ranges but cluster in higher revenue zones, while Food purchases (green) stay in lower revenue ranges.]
What just happened?
The hue parameter automatically assigns colors by category while size scales points by rating. Setting ci=95 draws the 95% confidence interval around the trend line. Try this: Use the style parameter to vary point shapes by another variable.
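What does a 95% confidence interval around a trend line mean in practice? Seaborn estimates the band by bootstrapping, that is, by refitting the regression on resampled data. The sketch below applies the same idea to the slope alone, using NumPy and synthetic age and revenue values. It is a simplified illustration under those assumptions, not a reproduction of Seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
# Synthetic data with a known upward trend (assumed, for illustration)
age = rng.integers(18, 66, size=200).astype(float)
revenue = 2000 + 50 * age + rng.normal(0, 1500, size=200)

# Point estimate: ordinary least-squares line through all observations
slope, intercept = np.polyfit(age, revenue, deg=1)

# Bootstrap: resample rows with replacement and refit the line each time
n_boot = 1000
boot_slopes = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, len(age), size=len(age))
    boot_slopes[i] = np.polyfit(age[idx], revenue[idx], deg=1)[0]

# The middle 95% of the refitted slopes forms the confidence interval
ci_low, ci_high = np.percentile(boot_slopes, [2.5, 97.5])
print(f"slope = {slope:.1f}, 95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
```

A narrow interval means the trend is stable across resamples. An interval that straddles zero means the data cannot tell an upward trend from a flat one.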
Electronics purchases dominate high-revenue transactions across all age groups, while Food and Clothing cluster in lower ranges
The scatter plot reveals two distinct spending patterns. Electronics buyers span all ages but generate premium revenue. Food purchases stay consistently low-value regardless of customer age. This insight drives category-specific marketing strategies: target Electronics broadly, but focus Food campaigns on frequency rather than value.
Category Comparisons Made Simple
Comparing groups is painful in Matplotlib. Box plots need the data split into one array per group first. Bar plots require aggregation first. Seaborn handles this automatically with categorical plot functions that understand your data structure.
# Create side-by-side comparison of categories
plt.figure(figsize=(12, 6))
# Plot 1: Average revenue by product category
plt.subplot(1, 2, 1)
# errorbar=('ci', 95) is the seaborn 0.12+ spelling of the deprecated ci=95
sns.barplot(data=df, x='product_category', y='revenue', errorbar=('ci', 95))
plt.title('Average Revenue by Category', fontweight='bold')
plt.xticks(rotation=45)
# Plot 2: Rating distribution by category
plt.subplot(1, 2, 2)
sns.boxplot(data=df, x='product_category', y='rating')
plt.title('Rating Distribution by Category', fontweight='bold')
plt.xticks(rotation=45)
# Apply tight layout and display both plots
plt.tight_layout()
plt.show()
[Figure: two subplots side by side. Left: bar chart showing Electronics (~₹28K), Home (~₹12K), Clothing (~₹2.5K), Books (~₹1.8K), Food (~₹1.1K) with error bars. Right: box plots of rating distributions, where Electronics and Books have higher median ratings around 4.5 and the others cluster around 4.2.]
# Get exact statistics for business reporting
category_stats = df.groupby('product_category').agg({
    'revenue': ['mean', 'median', 'count'],
    'rating': ['mean', 'std']
}).round(2)
print("\nDetailed Category Statistics:")
print(category_stats)
                   revenue                  rating
                      mean    median count    mean   std
product_category
Books              1847.33    799.00   425    4.52  0.31
Clothing           2456.78   1899.99   380    4.18  0.42
Electronics       28450.25  25000.00   310    4.47  0.38
Food               1127.89   1100.00   445    4.21  0.35
Home              11825.50   3200.00   440    4.15  0.44
What just happened?
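Seaborn automatically calculated means for the bar plot and quartiles for the box plot. The 95% confidence interval (ci=95 in older releases, errorbar=('ci', 95) from seaborn 0.12 onward) draws error bars that show how uncertain each mean is; on its own it is not a significance test. Try this: Use sns.violinplot() to see full distribution shapes instead of just quartiles.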
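To check that the bar heights really are per-category means, you can compare them with a manual groupby. This is a small sketch on synthetic data; the three category names are reused only for illustration, and order= is passed so the bars line up with the manual result.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=3)
# Synthetic orders (assumed data)
df_demo = pd.DataFrame({
    "product_category": rng.choice(["Books", "Food", "Home"], size=300),
    "revenue": rng.lognormal(7.0, 0.6, size=300),
})

order = ["Books", "Food", "Home"]
# The aggregation you would otherwise do by hand before plotting
manual_means = (df_demo.groupby("product_category")["revenue"]
                .mean()
                .reindex(order))

fig, ax = plt.subplots()
# barplot aggregates with the mean by default and adds error bars itself
sns.barplot(data=df_demo, x="product_category", y="revenue",
            order=order, ax=ax)

bar_heights = [patch.get_height() for patch in ax.patches]
for name, manual, drawn in zip(order, manual_means, bar_heights):
    print(f"{name:<6} groupby mean = {manual:9.2f}   bar height = {drawn:9.2f}")
```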
Electronics dominates revenue per order (₹28K average) while Food maintains the lowest ticket size (₹1.1K average)
The category comparison exposes a revenue quality paradox. Electronics generates 25x more revenue per order than Food, but maintains similar customer satisfaction ratings. This suggests pricing tolerance varies dramatically by category — customers expect to pay premium for Electronics but want value in Food purchases.
📊 Data Insight
Electronics averages ₹28,450 per order with 4.47/5 rating, while Food averages ₹1,128 with 4.21/5 rating. The 2,400% revenue difference suggests completely different purchasing psychology by category.
Seaborn vs Matplotlib: When to Use Which
Seaborn excels at statistical exploration. Matplotlib dominates custom visualization. The trick is knowing which tool fits your current problem.
Choose Seaborn When:
• Exploring data relationships
• Need statistical summaries
• Working with categorical data
• Want publication-ready defaults
• Time is limited
Choose Matplotlib When:
• Building custom chart types
• Need precise control over elements
• Embedding plots in GUI applications or custom interactive tools
• Non-standard statistical plots
• Performance is critical
Common Mistake: Using the Wrong Tool
Don't force Seaborn into custom visualization needs — its strength is statistical patterns, not artistic freedom. Conversely, don't write 20 matplotlib lines for a simple correlation plot when sns.scatterplot() does it in one.
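As a rough side-by-side, the sketch below draws the same category-colored scatter plot twice: once with a Matplotlib loop over groups, once with a single sns.scatterplot() call. The data is synthetic and the column names simply mirror the ones used earlier.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=4)
# Synthetic orders (assumed data)
df_demo = pd.DataFrame({
    "customer_age": rng.integers(18, 66, size=150),
    "revenue": rng.lognormal(7.5, 0.8, size=150),
    "product_category": rng.choice(["Electronics", "Food", "Books"], size=150),
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# Matplotlib: split the data yourself, plot each group, label everything
for name, group in df_demo.groupby("product_category"):
    ax_mpl.scatter(group["customer_age"], group["revenue"],
                   label=name, alpha=0.7)
ax_mpl.set_xlabel("customer_age")
ax_mpl.set_ylabel("revenue")
ax_mpl.legend(title="product_category")
ax_mpl.set_title("Matplotlib")

# Seaborn: grouping, colors, axis labels, and legend come from one call
sns.scatterplot(data=df_demo, x="customer_age", y="revenue",
                hue="product_category", alpha=0.7, ax=ax_sns)
ax_sns.set_title("Seaborn")

fig.tight_layout()
# Call plt.show() to display the figure in an interactive session
```

Both panels show the same points. The difference is how much bookkeeping you write yourself, which grows quickly once you add size or style encodings.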
Chart Selection Guide
| Data Type | Seaborn Function | Best For |
|---|---|---|
| Single Continuous | histplot(), kdeplot() | Distribution shape, outliers |
| Two Continuous | scatterplot(), regplot() | Correlations, trends |
| Categorical vs Continuous | boxplot(), barplot() | Group comparisons |
| Time Series | lineplot() | Trends over time |
Quiz
1. Your Flipkart team needs to analyze the relationship between customer age and order value, colored by product category. Why is sns.scatterplot() better than plt.scatter() for this task?
2. When analyzing Zomato order revenue distribution, what does the kde=True parameter in sns.histplot() accomplish?
3. Your Paytm analytics team debates whether to use Seaborn or Matplotlib for their quarterly business review charts. What's the best decision framework?
Up Next
Plotly
Transform your static Seaborn plots into interactive dashboards that stakeholders can explore and customize in real-time.