Data Science
Missing Values
Detect, understand, and handle missing data with pandas methods that preserve insights while maintaining statistical validity.
This lesson covers
Detection Methods · Imputation Strategies · Pattern Analysis · Impact Assessment
The Missing Data Pipeline
Missing values are everywhere. Customer surveys with skipped questions. Sensor failures in IoT devices. Website forms abandoned halfway. Every dataset has gaps — the question is what to do about them.
The scenario: Flipkart's analytics team found their product recommendation model accuracy dropped 15% overnight. Turns out — missing rating data wasn't handled properly.
import pandas as pd
import numpy as np
# Load ecommerce data
df = pd.read_csv('dataplexa_ecommerce.csv')
# Quick overview
print(f"Dataset shape: {df.shape}")
print("\nMissing values count:")
print(df.isnull().sum())
print("\nMissing percentage:")
print((df.isnull().sum() / len(df)) * 100)

Dataset shape: (15000, 12)

Missing values count:
order_id              0
date                  0
customer_age        234
gender                0
city                 89
product_category      0
product_name        445
quantity              0
unit_price          178
revenue               0
rating             1205
returned             67

Missing percentage:
order_id            0.00
date                0.00
customer_age        1.56
gender              0.00
city                0.59
product_category    0.00
product_name        2.97
quantity            0.00
unit_price          1.19
revenue             0.00
rating              8.03
returned            0.45
What just happened?
isnull().sum() counts missing values per column. rating has 1,205 missing values (8.03% of data). product_name missing 445 entries. Try this: Run df.info() to see data types alongside missing counts.
📊 Data Insight
Rating data is missing for 8.03% of orders — that's 1,205 customer reviews lost. If ratings drive recommendations, this gap directly impacts revenue.
Missing Data Types
Not all missing data is created equal. Three types matter: MCAR (Missing Completely At Random), MAR (Missing At Random), and MNAR (Missing Not At Random). Understanding which type you're dealing with determines your strategy.
MCAR - Random
Server crashes during checkout. Missing completely by chance. Safe to drop or impute.
MAR - Conditional
Younger customers skip age field more often. Missing depends on other variables.
MNAR - Systematic
Unhappy customers don't leave ratings. Missing value itself carries information.
Mixed Reality
Most datasets combine all three types. Requires careful analysis per column.
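That per-column analysis can be wrapped in a small helper. A minimal sketch — `missing_summary` and the demo frame are illustrative names, not part of the lesson's dataset:

```python
import pandas as pd
import numpy as np

def missing_summary(df):
    """Per-column missing count, percentage, and dtype in one table."""
    return pd.DataFrame({
        'missing': df.isnull().sum(),
        'pct': (df.isnull().mean() * 100).round(2),
        'dtype': df.dtypes,
    }).sort_values('missing', ascending=False)

# Tiny demo frame standing in for the ecommerce data
demo = pd.DataFrame({
    'rating': [4.0, np.nan, 3.5, np.nan],
    'city': ['Delhi', 'Pune', None, 'Mumbai'],
    'order_id': [1, 2, 3, 4],
})
print(missing_summary(demo))
```

Sorting by missing count puts the worst columns on top, which is usually where the type-of-missingness investigation should start.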
The scenario: Swiggy's data team noticed delivery ratings missing more often for orders above ₹800. That's MAR — missingness correlates with order value.
# Analyze missing patterns
print("Missing rating by revenue range:")
df['revenue_bucket'] = pd.cut(df['revenue'],
                              bins=[0, 5000, 20000, 50000, 200000],
                              labels=['Low', 'Medium', 'High', 'Premium'])
missing_by_revenue = df.groupby('revenue_bucket')['rating'].apply(lambda x: x.isnull().sum())
print(missing_by_revenue)
# Missing by city
print("\nMissing rating by city:")
print(df.groupby('city')['rating'].apply(lambda x: x.isnull().sum()))

Missing rating by revenue range:
revenue_bucket
Low        234
Medium     456
High       398
Premium    117
Name: rating, dtype: int64

Missing rating by city:
city
Bangalore    287
Chennai      198
Delhi        345
Mumbai       289
Pune          86
Name: rating, dtype: int64
What just happened?
pd.cut() creates revenue buckets for analysis. Medium revenue orders have most missing ratings (456). Delhi leads missing ratings (345), while Pune has fewest (86). Try this: Create age buckets and check missing patterns there too.
Detection Strategies
Pandas makes missing value detection straightforward. But the real skill is interpreting patterns. Why is this data missing? Does the pattern tell a story?
Rating data shows the highest missing count — 1,205 missing reviews could indicate customer satisfaction issues
Rating dominates missing values. Product names and customer ages follow. This pattern suggests data collection issues during the checkout flow — customers abandon forms before completing reviews.
Missing returned values (67) might indicate processing delays. Orders too recent to determine return status. That's temporal missingness — time will fill these gaps naturally.
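One way to sanity-check that hunch: compare the missing rate for `returned` between recent and older orders. A minimal sketch on made-up dates — the 14-day cutoff is an assumption, not something the lesson's data specifies:

```python
import pandas as pd
import numpy as np

# Hypothetical orders: the newest ones have no return status yet
orders = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-02-20', '2023-03-28', '2023-03-30']),
    'returned': [False, True, np.nan, np.nan],
})

cutoff = orders['date'].max() - pd.Timedelta(days=14)
recent = orders['date'] > cutoff

# If missingness concentrates in recent orders, it's temporal, not a data-quality bug
print("Missing 'returned', recent orders:", orders.loc[recent, 'returned'].isnull().mean())
print("Missing 'returned', older orders: ", orders.loc[~recent, 'returned'].isnull().mean())
```

If the recent rate is high and the older rate is near zero, waiting (or excluding the newest orders from return-rate metrics) beats imputing.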
❌ Before Analysis
→ Drop all rows with NaN
→ Lose 12% of dataset
→ Bias results heavily
✅ After Analysis
→ Impute by product category
→ Keep 100% of data
→ Preserve relationships
Imputation Methods
Five approaches dominate: drop, mean/median fill, forward fill, interpolation, and advanced imputation. Each works best for different data types and missing patterns.
Key Concept
Imputation means filling missing values with estimated ones. The goal: maintain statistical properties while preserving real relationships in data. Poor imputation creates fake patterns worse than missing values.
The scenario: Zomato's ML team needs to handle missing ratings for their recommendation engine. Different imputation methods for different use cases.
# Method 1: Simple imputation
df_simple = df.copy()
# Mean for numerical, mode for categorical
df_simple['customer_age'] = df_simple['customer_age'].fillna(df_simple['customer_age'].mean())
df_simple['city'] = df_simple['city'].fillna(df_simple['city'].mode()[0])
# Check results
print("Age missing before:", df['customer_age'].isnull().sum())
print("Age missing after:", df_simple['customer_age'].isnull().sum())
print(f"Mean age used: {df['customer_age'].mean():.1f}")
print(f"Most common city: {df['city'].mode()[0]}")
Age missing before: 234
Age missing after: 0
Mean age used: 34.2
Most common city: Mumbai
What just happened?
fillna(mean()) replaced 234 missing ages with 34.2. mode()[0] filled missing cities with "Mumbai". All missing values gone, but variance reduced. Try this: Compare df['customer_age'].std() before and after imputation.
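That variance shrinkage is easy to demonstrate on a toy series (the numbers below are illustrative, not drawn from the lesson's dataset):

```python
import pandas as pd
import numpy as np

ages = pd.Series([22, 28, np.nan, 45, np.nan, 60, 35])

# Mean imputation fills every gap at the center of the distribution...
filled = ages.fillna(ages.mean())

# ...so the mean is unchanged but the spread shrinks
print(f"std before: {ages.std():.2f}")
print(f"std after:  {filled.std():.2f}")
```

This is the core trade-off of mean imputation: it preserves the average while understating variability, which biases any downstream statistic that depends on spread.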
# Method 2: Group-based imputation (smarter)
df_group = df.copy()
# Fill rating by product category median
category_medians = df_group.groupby('product_category')['rating'].median()
print("Rating median by category:")
print(category_medians)
# Apply group-wise imputation
for category in df_group['product_category'].unique():
    mask = (df_group['product_category'] == category) & (df_group['rating'].isnull())
    df_group.loc[mask, 'rating'] = category_medians[category]
print(f"\nRating missing after group imputation: {df_group['rating'].isnull().sum()}")
Rating median by category:
product_category
Books          4.2
Clothing       3.8
Electronics    4.0
Food           3.9
Home           4.1
Name: rating, dtype: float64

Rating missing after group imputation: 0
What just happened?
groupby().median() calculates median rating per category. Electronics gets 4.0, Clothing 3.8. Missing ratings filled with category-appropriate values. Try this: Try group imputation by city or age groups.
Group-based imputation preserves category-specific rating patterns better than global mean replacement
The line chart shows subtle but important differences. Original ratings (teal) versus imputed values (blue). Books maintained higher ratings, Electronics stayed stable. Group-wise imputation respects real customer behavior patterns.
Notice how imputed ratings don't perfectly match originals? That's expected. Perfect matches would indicate overfitting. Small differences show the imputation learned general patterns, not specific outliers.
Advanced Techniques
Simple imputation works 80% of the time. The remaining 20% needs advanced methods: KNN imputation, multiple imputation, or even predictive models. When revenue depends on accuracy, invest in sophisticated approaches.
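As a sketch of one such method, scikit-learn's `KNNImputer` fills each gap using the rows most similar on the remaining features. This assumes scikit-learn is installed; the toy matrix is invented for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix: [age, revenue, rating], one rating missing
X = np.array([
    [25, 1200, 4.0],
    [26, 1150, 4.2],
    [55, 9000, 3.1],
    [24, 1300, np.nan],   # similar to the first two rows
])

# Fill the gap with the average rating of the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[-1, -1])
```

Unlike a global mean, KNN imputation adapts to each record's neighborhood, so a young low-spend customer gets a rating estimated from other young low-spend customers. In practice, scale the features first so one column doesn't dominate the distance.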
# Method 3: Forward/backward fill for time series
df_sorted = df.sort_values('date')
# Forward fill - use last known value
df_sorted['rating_ffill'] = df_sorted['rating'].ffill()
# Backward fill - use next known value
df_sorted['rating_bfill'] = df_sorted['rating'].bfill()
# Interpolate - linear estimation between points
df_sorted['rating_interp'] = df_sorted['rating'].interpolate()
print("Comparison of methods:")
sample_idx = df_sorted[df_sorted['rating'].isnull()].index[:3]
comparison = df_sorted.loc[sample_idx, ['date', 'rating', 'rating_ffill', 'rating_bfill', 'rating_interp']]
print(comparison)
Comparison of methods:
           date  rating  rating_ffill  rating_bfill  rating_interp
156  2023-01-15     NaN           4.2           3.8           4.00
234  2023-01-28     NaN           3.8           4.1           3.95
445  2023-02-12     NaN           4.1           3.7           3.90

What just happened?
ffill uses previous value (4.2), bfill uses next value (3.8). interpolate() estimates middle ground (4.0). Each method assumes different data behavior. Try this: Check which method preserves variance best with .std().
Common Mistake: Using fillna(0) for missing ratings. Zero isn't a valid rating — it creates fake data that breaks analysis. Use median, mean, or drop instead.
85.6% of records are complete — strong foundation for analysis with targeted imputation for specific gaps
The doughnut chart reveals the big picture. Most data (85.6%) is complete. Missing ratings represent the largest gap — focus imputation efforts there for maximum impact.
Missing "other" (627 records) combines product_name, unit_price, and returned fields. Multiple columns missing simultaneously often indicates systematic data collection failures. Consider investigating the root cause.
Pro Tip: Create a "missing_count" column showing how many fields each record is missing. Records with multiple missing values often have data quality issues worth investigating separately.
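That tip is one line of pandas. A quick sketch on a throwaway frame (`df_demo` is a stand-in, not the lesson's dataset):

```python
import pandas as pd
import numpy as np

df_demo = pd.DataFrame({
    'rating': [4.0, np.nan, np.nan],
    'product_name': ['A', None, 'C'],
    'unit_price': [10.0, np.nan, 20.0],
})

# How many fields each record is missing
df_demo['missing_count'] = df_demo.isnull().sum(axis=1)
print(df_demo['missing_count'].tolist())  # [0, 3, 1]

# Records missing two or more fields deserve a closer look
suspect = df_demo[df_demo['missing_count'] >= 2]
```

Records where several fields are missing at once are the ones most likely to come from a single upstream failure, so reviewing `suspect` separately often finds the root cause faster than column-by-column analysis.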
Business Impact Assessment
Every missing value decision affects business outcomes. Drop too much data? Lose insights. Impute poorly? Generate false patterns. The goal isn't perfect data — it's actionable insights.
The scenario: HDFC Bank's credit scoring model. Missing income data for 3% of loan applications. Drop them or impute? Each choice changes approval rates and risk profiles.
# Impact analysis - revenue at risk
revenue_missing_rating = df[df['rating'].isnull()]['revenue'].sum()
total_revenue = df['revenue'].sum()
revenue_at_risk_pct = (revenue_missing_rating / total_revenue) * 100
print(f"Total revenue: ₹{total_revenue/100000:.1f}L")
print(f"Revenue from orders with missing ratings: ₹{revenue_missing_rating/100000:.1f}L")
print(f"Revenue at risk: {revenue_at_risk_pct:.1f}%")
# Category breakdown of missing ratings
missing_revenue_by_category = df[df['rating'].isnull()].groupby('product_category')['revenue'].sum()
print(f"\nRevenue at risk by category:")
for category, revenue in missing_revenue_by_category.items():
print(f"{category}: ₹{revenue/100000:.1f}L")
Total revenue: ₹892.3L
Revenue from orders with missing ratings: ₹67.8L
Revenue at risk: 7.6%

Revenue at risk by category:
Books: ₹8.2L
Clothing: ₹18.9L
Electronics: ₹24.1L
Food: ₹9.7L
Home: ₹6.9L
What just happened?
₹67.8L revenue (7.6% of total) comes from orders with missing ratings. Electronics leads at ₹24.1L, Clothing follows with ₹18.9L. If recommendation algorithms need ratings, this missing data directly impacts ₹67.8L worth of future sales. Try this: Calculate average order value for missing vs non-missing rating orders.
📊 Data Insight
Missing ratings affect ₹67.8L in revenue. Electronics and Clothing dominate the risk. Imputing these ratings could improve recommendation accuracy for 7.6% of business value.
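The suggested comparison of average order value takes two masks. A sketch on invented numbers:

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    'rating':  [4.0, np.nan, 3.5, np.nan],
    'revenue': [500, 2000, 800, 1800],
})

missing = demo['rating'].isnull()
aov_missing = demo.loc[missing, 'revenue'].mean()
aov_rated = demo.loc[~missing, 'revenue'].mean()

print(f"AOV, missing rating: ₹{aov_missing:.0f}")
print(f"AOV, rated orders:   ₹{aov_rated:.0f}")
```

A large gap between the two averages is evidence the missingness is MAR or MNAR rather than random, which argues against simple drop-or-mean handling.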
Where to Practice
Reading is step one. Real learning happens when you run the code. Open any platform alongside Dataplexa and try every code block as you go.
Best workflow: Dataplexa on one side, Kaggle or Colab on the other. Read here, run code there immediately.
Quiz
1. Your ecommerce dataset has 8% missing ratings. Electronics products average 4.0 stars, Clothing averages 3.8 stars. What's the best approach to handle missing ratings for recommendation algorithms?
2. You discover that customer_age is missing more frequently for orders from Mumbai (15% missing) compared to Delhi (3% missing). What type of missing data pattern is this?
3. Which pandas code correctly calculates the median rating for each product category to use for group-based imputation?
Up Next
Outliers
Learn to identify and handle extreme values that can skew analysis and break machine learning models.