Data Science Lesson 5 – Missing Values | Dataplexa
Data Analysis · Lesson 5

Missing Values

Detect, understand, and handle missing data with pandas methods that preserve insights while maintaining statistical validity.

This lesson covers

Detection Methods · Imputation Strategies · Pattern Analysis · Impact Assessment

The Missing Data Pipeline

Missing values are everywhere. Customer surveys with skipped questions. Sensor failures in IoT devices. Website forms abandoned halfway. Every dataset has gaps — the question is what to do about them.

1. Detect
2. Analyze Pattern
3. Choose Strategy
4. Apply & Validate

The scenario: Flipkart's analytics team found their product recommendation model's accuracy dropped 15% overnight. The cause: missing rating data hadn't been handled properly.

import pandas as pd
import numpy as np

# Load ecommerce data
df = pd.read_csv('dataplexa_ecommerce.csv')

# Quick overview
print(f"Dataset shape: {df.shape}")
print("\nMissing values count:")
print(df.isnull().sum())
print("\nMissing percentage:")
print((df.isnull().sum() / len(df)) * 100)

What just happened?

isnull().sum() counts missing values per column. rating has 1,205 missing values (8.03% of the data). product_name is missing 445 entries. Try this: Run df.info() to see data types alongside non-null counts.

📊 Data Insight

Rating data is missing for 8.03% of orders — that's 1,205 customer reviews lost. If ratings drive recommendations, this gap directly impacts revenue.

Missing Data Types

Not all missing data is created equal. Three types matter: MCAR (Missing Completely At Random), MAR (Missing At Random), and MNAR (Missing Not At Random). Understanding which type you're dealing with determines your strategy.

MCAR - Random

Server crashes during checkout. Missing completely by chance. Safe to drop or impute.

MAR - Conditional

Younger customers skip age field more often. Missing depends on other variables.

MNAR - Systematic

Unhappy customers don't leave ratings. Missing value itself carries information.

Mixed Reality

Most datasets combine all three types. Requires careful analysis per column.
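To see MAR in miniature, here's a minimal sketch with synthetic data (the cities and missingness rates below are invented for illustration): we deliberately make customer_age go missing more often for one city, then recover the pattern by grouping the missingness flag by the other variable.

```python
import numpy as np
import pandas as pd

# Synthetic orders: age goes missing more often for one city (a MAR pattern)
rng = np.random.default_rng(42)
n = 1000
city = rng.choice(["Mumbai", "Delhi"], size=n)
age = rng.integers(18, 60, size=n).astype(float)

# Simulate MAR: Mumbai orders lose the age field 15% of the time, Delhi 3%
drop_prob = np.where(city == "Mumbai", 0.15, 0.03)
age[rng.random(n) < drop_prob] = np.nan

orders = pd.DataFrame({"city": city, "customer_age": age})

# Group the missingness flag itself by city to expose the dependence
rates = orders["customer_age"].isnull().groupby(orders["city"]).mean()
print(rates)  # Mumbai's rate should be noticeably higher than Delhi's
```

If the missingness rate varies strongly with another column, you're not looking at MCAR, and a global drop or fill would bias that column's distribution.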

The scenario: Swiggy's data team noticed delivery ratings missing more often for orders above ₹800. That's MAR — missingness correlates with order value.

# Analyze missing patterns
print("Missing rating by revenue range:")
df['revenue_bucket'] = pd.cut(df['revenue'], 
                             bins=[0, 5000, 20000, 50000, 200000], 
                             labels=['Low', 'Medium', 'High', 'Premium'])

missing_by_revenue = df.groupby('revenue_bucket', observed=False)['rating'].apply(lambda x: x.isnull().sum())
print(missing_by_revenue)

# Missing by city
print("\nMissing rating by city:")
print(df.groupby('city')['rating'].apply(lambda x: x.isnull().sum()))

What just happened?

pd.cut() creates revenue buckets for analysis. Medium-revenue orders have the most missing ratings (456). Delhi leads in missing ratings (345), while Pune has the fewest (86). Try this: Create age buckets and check missing patterns there too.

Detection Strategies

Pandas makes missing value detection straightforward. But the real skill is interpreting patterns. Why is this data missing? Does the pattern tell a story?

Rating data shows the highest missing count — 1,205 missing reviews could indicate customer satisfaction issues

Rating dominates missing values. Product names and customer ages follow. This pattern suggests data collection issues during the checkout flow — customers abandon forms before completing reviews.

Missing returned values (67) might indicate processing delays. Orders too recent to determine return status. That's temporal missingness — time will fill these gaps naturally.
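A small helper makes this kind of detection repeatable. Here's a sketch of a reusable summary function (the toy frame and column names are made up for illustration) that combines the count and percentage views from earlier into one table, worst column first:

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct": (counts / len(frame) * 100).round(2),
    })
    # Only report columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with gaps in two of three columns
toy = pd.DataFrame({
    "rating": [4.0, np.nan, 3.5, np.nan],
    "city": ["Delhi", "Pune", None, "Mumbai"],
    "order_id": [1, 2, 3, 4],
})
print(missing_summary(toy))
```

Run it on every new dataset before anything else — the sorted table tells you instantly where to focus.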

❌ Before Analysis

"We have missing data"
→ Drop all rows with NaN
→ Lose 12% of dataset
→ Bias results heavily

✅ After Analysis

"Rating missing = MAR pattern"
→ Impute by product category
→ Keep 100% of data
→ Preserve relationships

Imputation Methods

Five approaches dominate: drop, mean/median fill, forward fill, interpolation, and advanced imputation. Each works best for different data types and missing patterns.

Key Concept

Imputation means filling missing values with estimated ones. The goal: maintain statistical properties while preserving real relationships in data. Poor imputation creates fake patterns worse than missing values.
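That warning about statistical properties is easy to verify. Here's a minimal sketch with a synthetic age series showing that mean imputation shrinks the spread: every filled value lands exactly at the mean, so the standard deviation can only go down.

```python
import numpy as np
import pandas as pd

# Synthetic ages with three gaps
ages = pd.Series([22, 25, np.nan, 40, np.nan, 55, 31, np.nan, 28, 47], dtype=float)

std_before = ages.std()              # NaN values are skipped
filled = ages.fillna(ages.mean())    # every gap becomes the mean
std_after = filled.std()

print(f"std before: {std_before:.2f}, std after: {std_after:.2f}")
```

This is the variance reduction the lesson keeps mentioning — models trained on mean-imputed columns see artificially "tighter" data than reality.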

The scenario: Zomato's ML team needs to handle missing ratings for their recommendation engine. Different imputation methods for different use cases.

# Method 1: Simple imputation
df_simple = df.copy()

# Mean for numerical, mode for categorical
df_simple['customer_age'] = df_simple['customer_age'].fillna(df_simple['customer_age'].mean())
df_simple['city'] = df_simple['city'].fillna(df_simple['city'].mode()[0])

# Check results
print("Age missing before:", df['customer_age'].isnull().sum())
print("Age missing after:", df_simple['customer_age'].isnull().sum())
print(f"Mean age used: {df['customer_age'].mean():.1f}")
print(f"Most common city: {df['city'].mode()[0]}")

What just happened?

fillna(mean()) replaced 234 missing ages with 34.2. mode()[0] filled missing cities with "Mumbai". All missing values gone, but variance reduced. Try this: Compare df['customer_age'].std() before and after imputation.

# Method 2: Group-based imputation (smarter)
df_group = df.copy()

# Fill rating by product category median
category_medians = df_group.groupby('product_category')['rating'].median()
print("Rating median by category:")
print(category_medians)

# Apply group-wise imputation
for category in df_group['product_category'].unique():
    mask = (df_group['product_category'] == category) & (df_group['rating'].isnull())
    df_group.loc[mask, 'rating'] = category_medians[category]

print(f"\nRating missing after group imputation: {df_group['rating'].isnull().sum()}")

What just happened?

groupby().median() calculates median rating per category. Electronics gets 4.0, Clothing 3.8. Missing ratings filled with category-appropriate values. Try this: Try group imputation by city or age groups.
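The explicit loop above can be collapsed into one pandas idiom: groupby().transform('median') broadcasts each group's median back to the full row length, so fillna can consume it directly. A minimal sketch (the categories and ratings below are invented for illustration):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "product_category": ["Electronics", "Electronics", "Clothing", "Clothing", "Clothing"],
    "rating": [4.5, np.nan, 3.5, 4.1, np.nan],
})

# transform('median') returns a series aligned to the original rows,
# so each NaN is filled with its own category's median -- no loop needed
df_demo["rating"] = df_demo["rating"].fillna(
    df_demo.groupby("product_category")["rating"].transform("median")
)
print(df_demo)
```

Same result as the loop, one line, and it handles any number of categories without indexing by hand.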

Group-based imputation preserves category-specific rating patterns better than global mean replacement

The line chart shows subtle but important differences. Original ratings (teal) versus imputed values (blue). Books maintained higher ratings, Electronics stayed stable. Group-wise imputation respects real customer behavior patterns.

Notice how imputed ratings don't perfectly match originals? That's expected. Perfect matches would indicate overfitting. Small differences show the imputation learned general patterns, not specific outliers.

Advanced Techniques

Simple imputation works 80% of the time. The remaining 20% needs advanced methods: KNN imputation, multiple imputation, or even predictive models. When revenue depends on accuracy, invest in sophisticated approaches.
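As a sketch of the KNN idea (the toy feature matrix below is made up for illustration), scikit-learn's KNNImputer fills each gap using the rows most similar on the remaining features, rather than a single global or group statistic:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: columns stand in for [customer_age, revenue_k, rating]
X = np.array([
    [25.0, 12.0, 4.5],
    [27.0, 11.0, np.nan],   # neighbours: the two similar budget rows
    [26.0, 13.0, 4.3],
    [52.0, 90.0, 2.1],
    [50.0, 95.0, np.nan],   # a premium-looking row with no rating
])

# Each missing value is the average of the 2 nearest rows
# (distance measured on the columns that are present)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The second row's missing rating is filled from the two nearby budget rows, so it lands near their ratings — a context-aware estimate a global mean could never give you.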

# Method 3: Forward/backward fill for time series
df_sorted = df.sort_values('date')

# Forward fill - use last known value
df_sorted['rating_ffill'] = df_sorted['rating'].ffill()

# Backward fill - use next known value
df_sorted['rating_bfill'] = df_sorted['rating'].bfill()

# Interpolate - linear estimation between points
df_sorted['rating_interp'] = df_sorted['rating'].interpolate()

print("Comparison of methods:")
sample_idx = df_sorted[df_sorted['rating'].isnull()].index[:3]  # first three rows missing a rating
comparison = df_sorted.loc[sample_idx, ['date', 'rating', 'rating_ffill', 'rating_bfill', 'rating_interp']]
print(comparison)

What just happened?

ffill uses previous value (4.2), bfill uses next value (3.8). interpolate() estimates middle ground (4.0). Each method assumes different data behavior. Try this: Check which method preserves variance best with .std().
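You can see all three behaviours on a single gap. A minimal sketch with one missing rating between two known ones (the values match the example numbers above):

```python
import numpy as np
import pandas as pd

# One gap, three different assumptions about what fills it
s = pd.Series([4.2, np.nan, 3.8])

print(s.ffill().tolist())        # carries 4.2 forward into the gap
print(s.bfill().tolist())        # borrows 3.8 from the next point
print(s.interpolate().tolist())  # middle value lands at about 4.0
```

ffill assumes the world didn't change since the last observation, bfill assumes the gap already looked like the future, and interpolate assumes a smooth trend between the two — pick the one that matches how your data actually behaves.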

Common Mistake: Using fillna(0) for missing ratings. Zero isn't a valid rating — it creates fake data that breaks analysis. Use median, mean, or drop instead.

85.6% of records are complete — strong foundation for analysis with targeted imputation for specific gaps

The doughnut chart reveals the big picture. Most data (85.6%) is complete. Missing ratings represent the largest gap — focus imputation efforts there for maximum impact.

Missing "other" (627 records) combines product_name, unit_price, and returned fields. Multiple columns missing simultaneously often indicates systematic data collection failures. Consider investigating the root cause.

Pro Tip: Create a "missing_count" column showing how many fields each record is missing. Records with multiple missing values often have data quality issues worth investigating separately.
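The pro tip above is a one-liner: isnull().sum() with axis=1 counts gaps per row instead of per column. A sketch with a made-up three-column frame:

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "product_name": ["Phone", None, None],
    "unit_price": [499.0, np.nan, 250.0],
    "returned": [False, None, None],
})

# axis=1 sums across columns, giving one gap count per record
orders["missing_count"] = orders.isnull().sum(axis=1)
print(orders["missing_count"].tolist())  # → [0, 3, 2]
```

Then `orders[orders["missing_count"] >= 2]` isolates the suspicious multi-gap records for a separate data-quality investigation.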

Business Impact Assessment

Every missing value decision affects business outcomes. Drop too much data? Lose insights. Impute poorly? Generate false patterns. The goal isn't perfect data — it's actionable insights.

The scenario: HDFC Bank's credit scoring model. Missing income data for 3% of loan applications. Drop them or impute? Each choice changes approval rates and risk profiles.

# Impact analysis - revenue at risk
revenue_missing_rating = df[df['rating'].isnull()]['revenue'].sum()
total_revenue = df['revenue'].sum()
revenue_at_risk_pct = (revenue_missing_rating / total_revenue) * 100

print(f"Total revenue: ₹{total_revenue/100000:.1f}L")
print(f"Revenue from orders with missing ratings: ₹{revenue_missing_rating/100000:.1f}L")  
print(f"Revenue at risk: {revenue_at_risk_pct:.1f}%")

# Category breakdown of missing ratings
missing_revenue_by_category = df[df['rating'].isnull()].groupby('product_category')['revenue'].sum()
print("\nRevenue at risk by category:")
for category, revenue in missing_revenue_by_category.items():
    print(f"{category}: ₹{revenue/100000:.1f}L")

What just happened?

₹67.8L revenue (7.6% of total) comes from orders with missing ratings. Electronics leads at ₹24.1L, Clothing follows with ₹18.9L. If recommendation algorithms need ratings, this missing data directly impacts ₹67.8L worth of future sales. Try this: Calculate average order value for missing vs non-missing rating orders.
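The "Try this" above takes one groupby: group revenue by the missingness flag itself. A sketch with a tiny invented frame (real column names, made-up numbers):

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "rating": [4.0, np.nan, 3.5, np.nan, 5.0],
    "revenue": [500.0, 1200.0, 300.0, 900.0, 400.0],
})

# True = orders missing a rating, False = orders with one
avg_by_missing = orders.groupby(orders["rating"].isnull())["revenue"].mean()
print(avg_by_missing)
```

If the average order value differs sharply between the two groups, the missingness is tied to spend — more evidence of MAR, and a reason not to simply drop those rows.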

📊 Data Insight

Missing ratings affect ₹67.8L in revenue. Electronics and Clothing dominate the risk. Imputing these ratings could improve recommendation accuracy for 7.6% of business value.

Where to Practice

Reading is step one. Real learning happens when you run the code. Open any platform alongside Dataplexa and try every code block as you go.

Best workflow: Dataplexa on one side, Kaggle or Colab on the other. Read here, run code there immediately.

Quiz

1. Your ecommerce dataset has 8% missing ratings. Electronics products average 4.0 stars, Clothing averages 3.8 stars. What's the best approach to handle missing ratings for recommendation algorithms?


2. You discover that customer_age is missing more frequently for orders from Mumbai (15% missing) compared to Delhi (3% missing). What type of missing data pattern is this?


3. Which pandas code correctly calculates the median rating for each product category to use for group-based imputation?


Up Next

Outliers

Learn to identify and handle extreme values that can skew analysis and break machine learning models.