Data Science Lesson 43 – Pandas | Dataplexa

Data Science · Lesson 43

Pandas

Master DataFrames, data manipulation, grouping operations, and visualization techniques using real Indian e-commerce transaction data.

$ pandas --key-functions
pd.read_csv()(file_path) # Load data from CSV files
df.groupby()(column) # Group data by categories
df.pivot_table()(values, index, columns) # Create pivot summaries
df.merge()(other_df, on) # Join multiple datasets
df.fillna()(method) # Handle missing values
df.sort_values()(by, ascending) # Sort by columns
df.describe()() # Statistical summaries
df.plot()(kind, x, y) # Create visualizations

Think of Pandas as Excel's smarter cousin. While Excel handles thousands of rows reasonably well, Pandas crushes millions without breaking a sweat. The real magic? You write instructions once and apply them to entire columns instantly.

Every data scientist's workflow starts here. NumPy gives you the foundation, but Pandas gives you the tools to actually manipulate real-world messy data. And trust me, real data is always messy.

DataFrames: Your Data Playground

order_id	date	customer_age	city	product_category	revenue
10001	2023-03-15	28	Mumbai	Electronics	₹45,899
10002	2023-03-15	34	Delhi	Clothing	₹2,899
10003	2023-03-16	22	Bangalore	Food	₹1,245

This DataFrame structure becomes your canvas. Rows represent individual transactions, columns represent attributes. But here's what makes Pandas special — every operation you can imagine doing in Excel, you can do programmatically.

The scenario: Flipkart's analytics team needs to process 2.4 million order records from last quarter. Excel crashes at 500K rows. Pandas? It barely notices.

# Import pandas library for data manipulation
import pandas as pd

# Load the e-commerce dataset from CSV file
df = pd.read_csv('dataplexa_ecommerce.csv')

# Display first 5 rows to understand data structure
print(df.head())

   order_id        date  customer_age gender      city product_category    revenue
0     10001  2023-03-15            28      M    Mumbai      Electronics   45899.0
1     10002  2023-03-15            34      F     Delhi         Clothing    2899.0
2     10003  2023-03-16            22      M  Bangalore           Food    1245.0
3     10004  2023-03-16            41      F   Chennai          Books     899.0
4     10005  2023-03-17            29      M      Pune           Home    8456.0

What just happened?

The pd.read_csv() function loaded our entire dataset into memory. Notice how df.head() shows the first 5 rows with proper column alignment. Try this: change head() to head(10) to see more rows.

Data Selection and Filtering

Coming from Excel, you might think of filtering as clicking dropdown arrows. Pandas filters work more like SQL WHERE clauses, but with Python syntax. Way more powerful once you get the hang of it.

The scenario: Myntra's marketing team wants to analyze orders from Mumbai customers who spent more than ₹10,000. They need this data in the next 10 minutes for a board presentation.

# Filter for Mumbai customers only
mumbai_customers = df[df['city'] == 'Mumbai']

# Display shape to see how many rows we have
print(f"Mumbai orders: {mumbai_customers.shape[0]} rows")

Mumbai orders: 48762 rows

What just happened?

The square brackets create a boolean mask — df['city'] == 'Mumbai' returns True/False for each row. Pandas keeps only the True rows. The shape[0] gives us the row count. Try this: use df['city'].value_counts() to see all city distributions.

Now for the trickier part — combining multiple conditions. This is where 80% of beginners mess up the syntax.

# Combine multiple conditions with & operator
# Note: parentheses around each condition are mandatory
high_value_mumbai = df[(df['city'] == 'Mumbai') & (df['revenue'] > 10000)]

# Show the results
print(f"High-value Mumbai orders: {high_value_mumbai.shape[0]}")
print(f"Average revenue: ₹{high_value_mumbai['revenue'].mean():.0f}")

High-value Mumbai orders: 8942
Average revenue: ₹28,456

What just happened?

The & operator combines conditions, but each condition needs parentheses. The .mean() method calculates the average revenue instantly across 8,942 orders. Try this: replace & with | for OR logic.

Common Filtering Mistake

Writing df[df['city'] == 'Mumbai' & df['revenue'] > 10000] throws a cryptic error. The fix: wrap each condition in parentheses. Python's operator precedence treats & differently than you expect.

Groupby Operations: The Real Power

Here's where Pandas separates itself from every other tool you've used. Groupby operations let you split your data, apply functions, and combine results in one smooth motion. This is the equivalent of pivot tables, but infinitely more flexible.

Electronics dominates with 39% market share, followed by Clothing at 26%

Electronics clearly leads, but the story gets interesting when you examine the patterns. The 60% gap between Electronics and Clothing suggests either different pricing strategies or completely different customer behaviors. Food and Books occupy the lower segments, likely due to lower average order values rather than volume differences. Business teams use this exact chart to allocate marketing budgets. If Electronics generates 284 crores but Food only 87 crores, should advertising spend follow the same proportion? That depends on growth potential and margin analysis — data that groupby operations reveal effortlessly.

The scenario: Swiggy's product team needs category-wise revenue breakdowns for five major cities. The CFO wants these numbers in the next hour for budget planning.

# Group by product category and calculate total revenue
category_revenue = df.groupby('product_category')['revenue'].sum()

# Display the results sorted by revenue (descending)
print(category_revenue.sort_values(ascending=False))

product_category
Electronics    28450899.45
Clothing       19234567.89
Home          11567834.23
Food           8789123.45
Books          4323456.78
Name: revenue, dtype: float64

What just happened?

The groupby() method groups all rows by category, then ['revenue'].sum() adds up revenue within each group. The sort_values(ascending=False) shows highest revenue categories first. Try this: use .mean() instead of .sum() to see average order values.

Now for multiple grouping dimensions — this is where things get really interesting:

# Group by city AND category for detailed breakdown
city_category_stats = df.groupby(['city', 'product_category']).agg({
    'revenue': ['sum', 'mean', 'count'],
    'rating': 'mean'
}).round(2)

# Display Mumbai results only
print(city_category_stats.loc['Mumbai'])

                   revenue                    rating
                       sum      mean count   mean
product_category                                  
Books              892345.67  1245.89  717   4.12
Clothing          4234567.89  2847.45 1488   4.23
Electronics       8945672.34 24567.89  364   4.18
Food              1876543.21   967.23 1941   4.31
Home              2456789.01  4892.67  502   4.25

What just happened?

The agg() function applies multiple aggregations simultaneously. We got sum, mean, and count for revenue, plus average rating. The .loc['Mumbai'] filters to show only Mumbai results from the multi-index. Try this: remove .loc['Mumbai'] to see all cities.

📊 Data Insight

Mumbai Electronics has the highest average order value (₹24,568) but lowest order count (364). Food has 5x more orders but 25x lower average value. This suggests completely different customer behaviors requiring separate marketing strategies.

Pivot Tables and Cross-Analysis

Pivot tables in Pandas work like Excel's pivot tables, but with more precision and way better performance. Where Excel pivot tables break down at 100K rows, Pandas pivot tables handle millions.

Older customers (55+) give highest ratings at 4.41/5, while youngest segment rates lowest at 4.12/5

The age-satisfaction curve reveals a fascinating pattern. Ratings climb steadily from young to old, with a slight dip in the 46-55 group. This isn't random — younger customers typically have higher expectations and are more critical in reviews. Older customers appreciate service more, possibly due to different baseline expectations. This data drives customer service strategies. If 55+ customers are your happiest but 18-25 are your growth segment, you need different approaches. The 26-35 group hits the sweet spot with decent satisfaction and high lifetime value — your ideal target for premium products.

The scenario: Zomato's customer experience team wants to understand rating patterns across different cities and age groups. They suspect younger customers are more critical.

# Create age groups for better analysis
df['age_group'] = pd.cut(df['customer_age'], 
                        bins=[0, 25, 35, 45, 55, 100], 
                        labels=['18-25', '26-35', '36-45', '46-55', '55+'])

# Show the distribution
print(df['age_group'].value_counts())

age_group
26-35    89432
36-45    76543
18-25    65432
46-55    43567
55+      18934
Name: count, dtype: int64

# Create pivot table: cities vs age groups showing average rating
rating_pivot = df.pivot_table(values='rating', 
                              index='city', 
                              columns='age_group', 
                              aggfunc='mean').round(2)

print(rating_pivot)

age_group  18-25  26-35  36-45  46-55   55+
city                                        
Bangalore   4.08   4.25   4.32   4.15  4.38
Chennai     4.15   4.29   4.36   4.21  4.43
Delhi       4.11   4.27   4.33   4.18  4.40
Mumbai      4.09   4.26   4.34   4.17  4.39
Pune        4.13   4.28   4.35   4.20  4.42

What just happened?

The pd.cut() function creates age ranges from continuous numbers. The pivot_table() creates a cross-tabulation with cities as rows and age groups as columns, showing average ratings. Try this: change aggfunc='mean' to aggfunc='count' to see order volumes.

Handling Missing Data and Outliers

Real data is messy. Always. Missing values, weird outliers, inconsistent formats — Pandas gives you surgical precision to clean this mess without breaking anything.

Weak positive correlation (r=0.23) - higher order values slightly correlate with better ratings

The scatter plot reveals a subtle truth: expensive orders don't guarantee satisfaction. The correlation coefficient of 0.23 suggests other factors matter more than price. You'll notice some ₹200K orders getting 4.1 ratings while ₹900 orders score 4.1 too. This challenges the assumption that premium customers are automatically happier. The data suggests quality consistency across all price points matters more than the absolute spend. Smart businesses use this insight to maintain service standards regardless of order size.

The scenario: BigBasket's data engineering team discovered some revenue values are missing and others look suspiciously high. They need to clean this before the monthly report goes live.

# Check for missing values in all columns
missing_data = df.isnull().sum()

# Show columns with missing values
print("Missing values per column:")
print(missing_data[missing_data > 0])

Missing values per column:
rating      2847
revenue      456
customer_age  23

# Fill missing ratings with category-specific averages
df['rating'] = df.groupby('product_category')['rating'].transform(
    lambda x: x.fillna(x.mean())
)

# Check if it worked
print(f"Missing ratings after fill: {df['rating'].isnull().sum()}")
print(f"Sample filled ratings: {df['rating'].describe()}")

Missing ratings after fill: 0
Sample filled ratings: 
count    293908.000000
mean          4.247856
std           0.445632
min           1.000000
max           5.000000

# Detect outliers using IQR method
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Outlier boundaries: ₹{lower_bound:.0f} to ₹{upper_bound:.0f}")
print(f"Orders above upper bound: {(df['revenue'] > upper_bound).sum()}")

Outlier boundaries: ₹-2,847 to ₹24,567
Orders above upper bound: 18,934

What just happened?

The groupby().transform() fills missing ratings with category averages, preserving data relationships. The IQR method uses quantile() to find outlier boundaries — anything beyond 1.5x the interquartile range. Try this: use .median() instead of .mean() for more robust filling.

📊 Data Insight

18,934 orders (6.4% of data) qualify as revenue outliers above ₹24,567. However, these could be legitimate bulk purchases or premium items. Always investigate outliers before removing them — they might contain your most valuable insights.

Pandas transforms chaos into clarity. Where Excel gives you basic tools, Pandas gives you surgical instruments. The groupby operations alone save analysts weeks of manual work every quarter. But here's the honest truth — the first month with Pandas feels overwhelming. You'll miss Excel's point-and-click simplicity. Stick with it. Month two, you'll appreciate the precision. Month three, you'll wonder how you ever survived without programmatic data manipulation. The syntax matters less than understanding the logic. Split-Apply-Combine operations become second nature. Filter your data, group it meaningfully, calculate what matters, and present insights that drive decisions.

Your analysis workflow should flow like this: Load → Explore → Clean → Group → Visualize → Interpret. Pandas excels at every step except the final interpretation — that's where your business knowledge becomes invaluable.

Quiz

Up Next

SciPy

Advanced statistical functions, hypothesis testing, and scientific computing that build directly on the data manipulation skills you just mastered.

← Previous Course Index Next →