Data Science
Pandas
Master DataFrames, data manipulation, grouping operations, and visualization techniques using real Indian e-commerce transaction data.
$ pandas --key-functions
Every data scientist's workflow starts here. NumPy gives you the foundation, but Pandas gives you the tools to actually manipulate real-world messy data. And trust me, real data is always messy.
DataFrames: Your Data Playground
| order_id | date | customer_age | city | product_category | revenue |
|---|---|---|---|---|---|
| 10001 | 2023-03-15 | 28 | Mumbai | Electronics | ₹45,899 |
| 10002 | 2023-03-15 | 34 | Delhi | Clothing | ₹2,899 |
| 10003 | 2023-03-16 | 22 | Bangalore | Food | ₹1,245 |
The scenario: Flipkart's analytics team needs to process 2.4 million order records from last quarter. Excel hard-caps at 1,048,576 rows per sheet and grinds to a halt long before that. Pandas? It barely notices.
# Import pandas library for data manipulation
import pandas as pd
# Load the e-commerce dataset from CSV file
df = pd.read_csv('dataplexa_ecommerce.csv')
# Display first 5 rows to understand data structure
print(df.head())
   order_id        date  customer_age gender       city product_category  revenue
0     10001  2023-03-15            28      M     Mumbai      Electronics  45899.0
1     10002  2023-03-15            34      F      Delhi         Clothing   2899.0
2     10003  2023-03-16            22      M  Bangalore             Food   1245.0
3     10004  2023-03-16            41      F    Chennai            Books    899.0
4     10005  2023-03-17            29      M       Pune             Home   8456.0
What just happened?
The pd.read_csv() function loaded our entire dataset into memory. Notice how df.head() shows the first 5 rows with proper column alignment. Try this: change head() to head(10) to see more rows.
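If you don't have the dataplexa_ecommerce.csv file handy, you can still try the same calls on a tiny stand-in DataFrame. This is a minimal sketch with made-up values, not the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for the e-commerce dataset (values are invented)
df = pd.DataFrame({
    'order_id': [10001, 10002, 10003],
    'city': ['Mumbai', 'Delhi', 'Bangalore'],
    'revenue': [45899.0, 2899.0, 1245.0],
})

print(df.head(2))   # first 2 rows
print(df.shape)     # (rows, columns)
print(df.dtypes)    # data type of each column
```

The same head(), shape, and dtypes calls work identically whether the frame has 3 rows or 2.4 million.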
Data Selection and Filtering
Coming from Excel, you might think of filtering as clicking dropdown arrows. Pandas filters work more like SQL WHERE clauses, but with Python syntax. Way more powerful once you get the hang of it.

The scenario: Myntra's marketing team wants to analyze orders from Mumbai customers who spent more than ₹10,000. They need this data in the next 10 minutes for a board presentation.
# Filter for Mumbai customers only
mumbai_customers = df[df['city'] == 'Mumbai']
# Display shape to see how many rows we have
print(f"Mumbai orders: {mumbai_customers.shape[0]} rows")
Mumbai orders: 48762 rows
What just happened?
The square brackets create a boolean mask — df['city'] == 'Mumbai' returns True/False for each row. Pandas keeps only the True rows. The shape[0] gives us the row count. Try this: use df['city'].value_counts() to see all city distributions.
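The value_counts() suggestion above is easy to try on a made-up sample; it tallies each unique value, most frequent first:

```python
import pandas as pd

# Invented sample rows, just to demonstrate value_counts()
df = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Mumbai', 'Pune', 'Mumbai']})

counts = df['city'].value_counts()
print(counts)
```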
# Combine multiple conditions with & operator
# Note: parentheses around each condition are mandatory
high_value_mumbai = df[(df['city'] == 'Mumbai') & (df['revenue'] > 10000)]
# Show the results
print(f"High-value Mumbai orders: {high_value_mumbai.shape[0]}")
print(f"Average revenue: ₹{high_value_mumbai['revenue'].mean():,.0f}")
High-value Mumbai orders: 8942
Average revenue: ₹28,456
What just happened?
The & operator combines conditions, but each condition needs parentheses. The .mean() method calculates the average revenue instantly across 8,942 orders. Try this: replace & with | for OR logic.
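Here is the OR logic on a made-up frame, plus isin(), which reads cleaner than chaining several == comparisons with |:

```python
import pandas as pd

# Invented sample data for demonstration
df = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Pune', 'Mumbai'],
    'revenue': [15000, 8000, 12000, 5000],
})

# OR logic: orders from Mumbai OR with revenue above 10,000
either = df[(df['city'] == 'Mumbai') | (df['revenue'] > 10000)]

# isin() matches any value in a list, without repeating the column name
metros = df[df['city'].isin(['Mumbai', 'Delhi'])]

print(either)
print(metros)
```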
Common Filtering Mistake
Writing df[df['city'] == 'Mumbai' & df['revenue'] > 10000] throws a cryptic error. The fix: wrap each condition in parentheses. In Python, & binds more tightly than comparison operators like == and >, so the interpreter tries to evaluate 'Mumbai' & df['revenue'] first and fails.
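One way to sidestep the precedence trap entirely is df.query(), which parses the whole condition from a string. A quick sketch on made-up data, showing that both spellings select the same rows:

```python
import pandas as pd

# Invented sample data for demonstration
df = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'revenue': [15000, 20000, 8000],
})

# Correct boolean-mask form: each condition parenthesized
mask_result = df[(df['city'] == 'Mumbai') & (df['revenue'] > 10000)]

# query() form: no parentheses needed, plain 'and' works
query_result = df.query("city == 'Mumbai' and revenue > 10000")

print(mask_result)
```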
Groupby Operations: The Real Power
Here's where Pandas separates itself from every other tool you've used. Groupby operations let you split your data, apply functions, and combine results in one smooth motion. This is the equivalent of pivot tables, but infinitely more flexible.

[Chart] Electronics dominates with 39% market share, followed by Clothing at 26%
Electronics clearly leads, but the story gets interesting when you examine the patterns. The roughly 50% gap between Electronics and Clothing suggests either different pricing strategies or completely different customer behaviors. Food and Books occupy the lower segments, likely due to lower average order values rather than volume differences. Business teams use this exact chart to allocate marketing budgets. If Electronics generates ₹2.8 crore but Food only ₹88 lakh, should advertising spend follow the same proportion? That depends on growth potential and margin analysis — data that groupby operations reveal effortlessly.

The scenario: Swiggy's product team needs category-wise revenue breakdowns for five major cities. The CFO wants these numbers in the next hour for budget planning.
# Group by product category and calculate total revenue
category_revenue = df.groupby('product_category')['revenue'].sum()
# Display the results sorted by revenue (descending)
print(category_revenue.sort_values(ascending=False))
product_category
Electronics    28450899.45
Clothing       19234567.89
Home           11567834.23
Food            8789123.45
Books           4323456.78
Name: revenue, dtype: float64
What just happened?
The groupby() method groups all rows by category, then ['revenue'].sum() adds up revenue within each group. The sort_values(ascending=False) shows highest revenue categories first. Try this: use .mean() instead of .sum() to see average order values.
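The sum-versus-mean distinction matters: totals reward volume, averages reveal order size. A sketch on an invented four-row frame:

```python
import pandas as pd

# Invented sample data for demonstration
df = pd.DataFrame({
    'product_category': ['Electronics', 'Electronics', 'Food', 'Food'],
    'revenue': [40000.0, 20000.0, 1000.0, 500.0],
})

totals = df.groupby('product_category')['revenue'].sum()     # total revenue per category
averages = df.groupby('product_category')['revenue'].mean()  # average order value per category

print(totals)
print(averages)
```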
# Group by city AND category for detailed breakdown
city_category_stats = df.groupby(['city', 'product_category']).agg({
    'revenue': ['sum', 'mean', 'count'],
    'rating': 'mean'
}).round(2)
# Display Mumbai results only
print(city_category_stats.loc['Mumbai'])
revenue rating
sum mean count mean
product_category
Books 892345.67 1245.89 717 4.12
Clothing 4234567.89 2847.45 1488 4.23
Electronics 8945672.34 24567.89 364 4.18
Food 1876543.21 967.23 1941 4.31
Home 2456789.01 4892.67 502 4.25
What just happened?
The agg() function applies multiple aggregations simultaneously. We got sum, mean, and count for revenue, plus average rating. The .loc['Mumbai'] filters to show only Mumbai results from the multi-index. Try this: remove .loc['Mumbai'] to see all cities.
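If the MultiIndex columns produced by the dict form of agg() feel awkward, named aggregation gives flat, readable column names instead. A sketch with made-up numbers:

```python
import pandas as pd

# Invented sample data for demonstration
df = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Delhi'],
    'revenue': [100.0, 200.0, 300.0],
    'rating': [4.0, 5.0, 3.0],
})

# Named aggregation: output_column=(input_column, function)
stats = df.groupby('city').agg(
    total_revenue=('revenue', 'sum'),
    avg_rating=('rating', 'mean'),
    order_count=('revenue', 'count'),
)
print(stats)
```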
📊 Data Insight
Mumbai Electronics has the highest average order value (₹24,568) but lowest order count (364). Food has 5x more orders but 25x lower average value. This suggests completely different customer behaviors requiring separate marketing strategies.
Pivot Tables and Cross-Analysis
Pivot tables in Pandas work like Excel's pivot tables, but with more precision and way better performance. Where Excel pivot tables break down at 100K rows, Pandas pivot tables handle millions.

[Chart] Older customers (55+) give the highest ratings at 4.41/5, while the youngest segment rates lowest at 4.12/5
The age-satisfaction curve reveals a fascinating pattern. Ratings climb steadily from young to old, with a slight dip in the 46-55 group. This isn't random — younger customers typically have higher expectations and are more critical in reviews. Older customers appreciate service more, possibly due to different baseline expectations. This data drives customer service strategies. If 55+ customers are your happiest but 18-25 is your growth segment, you need different approaches. The 26-35 group hits the sweet spot with decent satisfaction and high lifetime value — your ideal target for premium products.

The scenario: Zomato's customer experience team wants to understand rating patterns across different cities and age groups. They suspect younger customers are more critical.
# Create age groups for better analysis
df['age_group'] = pd.cut(df['customer_age'],
                         bins=[0, 25, 35, 45, 55, 100],
                         labels=['18-25', '26-35', '36-45', '46-55', '55+'])
# Show the distribution
print(df['age_group'].value_counts())
age_group
26-35    89432
36-45    76543
18-25    65432
46-55    43567
55+      18934
Name: count, dtype: int64
# Create pivot table: cities vs age groups showing average rating
rating_pivot = df.pivot_table(values='rating',
                              index='city',
                              columns='age_group',
                              aggfunc='mean').round(2)
print(rating_pivot)
age_group  18-25  26-35  36-45  46-55   55+
city
Bangalore   4.08   4.25   4.32   4.15  4.38
Chennai     4.15   4.29   4.36   4.21  4.43
Delhi       4.11   4.27   4.33   4.18  4.40
Mumbai      4.09   4.26   4.34   4.17  4.39
Pune        4.13   4.28   4.35   4.20  4.42
What just happened?
The pd.cut() function creates age ranges from continuous numbers. The pivot_table() creates a cross-tabulation with cities as rows and age groups as columns, showing average ratings. Try this: change aggfunc='mean' to aggfunc='count' to see order volumes.
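pivot_table() also accepts margins=True, which appends an "All" row and column of totals; with aggfunc='mean' the margins are overall averages. A sketch with invented ratings:

```python
import pandas as pd

# Invented sample data for demonstration
df = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Delhi', 'Delhi'],
    'age_group': ['18-25', '26-35', '18-25', '26-35'],
    'rating': [4.0, 4.4, 4.2, 4.6],
})

# margins=True adds row/column grand totals labeled 'All'
pivot = df.pivot_table(values='rating',
                       index='city',
                       columns='age_group',
                       aggfunc='mean',
                       margins=True)
print(pivot)
```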
Handling Missing Data and Outliers
Real data is messy. Always. Missing values, weird outliers, inconsistent formats — Pandas gives you surgical precision to clean this mess without breaking anything.

[Chart] Weak positive correlation (r = 0.23): higher order values slightly correlate with better ratings
The scatter plot reveals a subtle truth: expensive orders don't guarantee satisfaction. The correlation coefficient of 0.23 suggests other factors matter more than price. You'll notice some ₹200K orders getting 4.1 ratings while ₹900 orders score 4.1 too. This challenges the assumption that premium customers are automatically happier. The data suggests quality consistency across all price points matters more than absolute spend. Smart businesses use this insight to maintain service standards regardless of order size.

The scenario: BigBasket's data engineering team discovered some revenue values are missing and others look suspiciously high. They need to clean this before the monthly report goes live.
# Check for missing values in all columns
missing_data = df.isnull().sum()
# Show columns with missing values
print("Missing values per column:")
print(missing_data[missing_data > 0])
Missing values per column:
rating          2847
revenue          456
customer_age      23
dtype: int64
# Fill missing ratings with category-specific averages
df['rating'] = df.groupby('product_category')['rating'].transform(
    lambda x: x.fillna(x.mean())
)
# Check if it worked
print(f"Missing ratings after fill: {df['rating'].isnull().sum()}")
print(f"Sample filled ratings: {df['rating'].describe()}")
Missing ratings after fill: 0
Sample filled ratings:
count    293908.000000
mean          4.247856
std           0.445632
min           1.000000
max           5.000000
Name: rating, dtype: float64
# Detect outliers using IQR method
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Outlier boundaries: ₹{lower_bound:,.0f} to ₹{upper_bound:,.0f}")
print(f"Orders above upper bound: {(df['revenue'] > upper_bound).sum():,}")
Outlier boundaries: ₹-2,847 to ₹24,567
Orders above upper bound: 18,934
What just happened?
The groupby().transform() fills missing ratings with category averages, preserving data relationships. The IQR method uses quantile() to find outlier boundaries — anything beyond 1.5x the interquartile range. Try this: use .median() instead of .mean() for more robust filling.
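The median-fill variant suggested above looks like this on a made-up frame; the median ignores extreme ratings within each category, so one outlier can't skew the fill value:

```python
import numpy as np
import pandas as pd

# Invented sample with one missing rating
df = pd.DataFrame({
    'product_category': ['Food', 'Food', 'Food', 'Electronics'],
    'rating': [4.0, np.nan, 5.0, 3.0],
})

# Fill each missing rating with its category's median
df['rating'] = df.groupby('product_category')['rating'].transform(
    lambda x: x.fillna(x.median())
)
print(df)
```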
📊 Data Insight
18,934 orders (6.4% of data) qualify as revenue outliers above ₹24,567. However, these could be legitimate bulk purchases or premium items. Always investigate outliers before removing them — they might contain your most valuable insights.
Your analysis workflow should flow like this: Load → Explore → Clean → Group → Visualize → Interpret. Pandas excels at every step except the final interpretation — that's where your business knowledge becomes invaluable.
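The Load → Clean → Group steps of that workflow can be sketched as a single method chain on invented data (visualization and interpretation happen afterward):

```python
import numpy as np
import pandas as pd

# Invented raw data with a missing revenue value
raw = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'revenue': [15000.0, np.nan, 5000.0],
})

# Clean -> Group -> Summarize in one readable chain
summary = (
    raw
    .assign(revenue=lambda d: d['revenue'].fillna(d['revenue'].median()))  # Clean: fill gaps
    .groupby('city')['revenue']                                            # Group by city
    .agg(['sum', 'mean'])                                                  # Summarize
)
print(summary)
```

Chaining keeps each step visible and avoids littering your notebook with intermediate variables.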
Quiz
1. You need to filter the dataframe for Mumbai orders with revenue above ₹10,000. Which syntax is correct?
2. To calculate total revenue by product category, which method should you use?
3. What does this code do: df['rating'] = df.groupby('product_category')['rating'].transform(lambda x: x.fillna(x.mean()))?
Up Next
SciPy
Advanced statistical functions, hypothesis testing, and scientific computing that build directly on the data manipulation skills you just mastered.