EDA Course
Handling Missing Values
Finding missing values is only half the battle — the real skill is deciding what to do about them, because the wrong choice can silently wreck your analysis before it even starts.
You Have a Gap. Now What?
Imagine you're reviewing a customer survey where 200 people answered every question — but 30 people skipped the income field. You have three options: pretend those rows don't exist, guess what the missing values probably are, or use a method that fills them in mathematically.
None of these is always right. The best strategy depends on why the data is missing and how much of it is gone. Drop too aggressively and you lose signal. Fill carelessly and you introduce bias. Let's walk through your real options.
🧭 The Three Strategies — Every missing-value decision falls into one of these buckets: drop it, fill it with a constant, or fill it intelligently. Everything else is a variation of these.
Strategy 1 — Dropping Missing Rows or Columns
dropna() removes any row (or column) that contains at least one NaN. It's the nuclear option — fast, simple, and sometimes the right call. If less than 5% of your data is missing at random, dropping is usually fine. If you're dropping 30% of your dataset, stop and ask why.
The scenario: A gym membership dataset has a few rows where members didn't record their weight. We'll explore different ways to drop them.
import pandas as pd
import numpy as np
# Build a gym membership dataset with some missing values
df = pd.DataFrame({
'member': ['Alice', 'Bob', 'Carla', 'Dan', 'Eva', 'Frank', 'Gina', 'Hiro'],
# Age column — two members skipped this field
'age': [28, 35, np.nan, 42, 31, np.nan, 29, 38],
# Weight column — three members skipped this field
'weight_kg':[62, np.nan, 75, 88, np.nan, 71, 58, np.nan],
# Membership tier — fully complete column
'tier': ['Gold', 'Silver', 'Gold', 'Bronze', 'Silver', 'Gold', 'Bronze', 'Silver']
})
# Drop any row that has at least one missing value
df_dropped_any = df.dropna()
print("Rows after dropping any NaN:", len(df_dropped_any))
# Drop rows only if ALL values in that row are missing (rarely useful)
df_dropped_all = df.dropna(how='all')
print("Rows after dropping all-NaN rows:", len(df_dropped_all))
# Drop rows only if weight_kg is missing — keep rows with missing age
df_dropped_weight = df.dropna(subset=['weight_kg'])
print("Rows after dropping missing weight:", len(df_dropped_weight))
# Drop columns where too many values are missing
threshold = int(len(df) * 0.7) # require at least 70% of rows non-null (int() rounds down)
df_col_dropped = df.dropna(axis=1, thresh=threshold)
print("Columns after threshold drop:", df_col_dropped.columns.tolist())
Rows after dropping any NaN: 3
Rows after dropping all-NaN rows: 8
Rows after dropping missing weight: 5
Columns after threshold drop: ['member', 'age', 'weight_kg', 'tier']
💡 What just happened?
We went from 8 rows to just 3 by dropping every row with any missing value — that's a 62% loss of data for only a few missing fields. The subset parameter is your friend here: it lets you be surgical, only dropping rows where the column you actually need is empty. The thresh parameter does the same for columns.
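Before committing to any dropna() call, it's worth quantifying how much of each column is actually missing. Here's a minimal sketch of that audit, reusing a slice of the gym data (the 30% cutoff is just an illustrative rule of thumb, not a pandas default):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'member': ['Alice', 'Bob', 'Carla', 'Dan'],
    'age': [28, 35, np.nan, 42],
    'weight_kg': [62, np.nan, 75, np.nan],
})

# Fraction of missing values per column: the mean of the boolean NaN mask
missing_frac = df.isnull().mean()
print(missing_frac)

# Flag columns with more than 30% missing — dropping rows on these is costly
heavy = missing_frac[missing_frac > 0.30].index.tolist()
print("Columns too sparse to drop rows on:", heavy)  # ['weight_kg'] — 50% missing
```

Running this kind of check first tells you whether dropping rows will cost you 5% of your data or half of it.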
Strategy 2 — Filling with a Constant or Statistical Value
fillna() is the most commonly used imputation method. You can fill with a fixed value (like 0 or "Unknown"), or with a summary statistic from the column itself — mean, median, or mode. Each choice carries assumptions you should be aware of.
Mean Fill
Best for symmetric, normally distributed data. Sensitive to outliers — one extreme value shifts the mean and affects your fills.
Median Fill
Best for skewed data or when outliers are present. The median is robust and doesn't get pulled by extreme values.
Mode Fill
Best for categorical columns. Fills missing labels with the most frequently seen value — logical for text fields.
The scenario: A retail sales dataset is missing purchase amounts and product categories. We'll apply different fill strategies to each.
import pandas as pd
import numpy as np
# Retail sales dataset — some amounts and categories are missing
sales = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
# Purchase amounts — two orders didn't record the value
'amount': [45.0, np.nan, 120.0, 33.0, np.nan, 89.0, 200.0, 55.0],
# Product category — one order is missing its category
'category': ['Electronics', 'Clothing', 'Electronics', np.nan,
'Clothing', 'Clothing', 'Electronics', 'Clothing']
})
# Fill numeric column with mean — good for roughly symmetric data
mean_amount = sales['amount'].mean()
sales['amount_mean_filled'] = sales['amount'].fillna(mean_amount)
print("Mean fill value:", round(mean_amount, 2))
# Fill numeric column with median — better when skewed
median_amount = sales['amount'].median()
sales['amount_median_filled'] = sales['amount'].fillna(median_amount)
print("Median fill value:", median_amount)
# Fill categorical column with mode — most frequent category
mode_category = sales['category'].mode()[0] # mode() returns a Series, grab first item
sales['category_filled'] = sales['category'].fillna(mode_category)
print("Mode fill value:", mode_category)
# Compare the two numeric fill strategies side by side
print(sales[['order_id', 'amount', 'amount_mean_filled', 'amount_median_filled']])
Mean fill value: 90.33
Median fill value: 72.0
Mode fill value: Clothing
   order_id  amount  amount_mean_filled  amount_median_filled
0      1001    45.0           45.000000                  45.0
1      1002     NaN           90.333333                  72.0
2      1003   120.0          120.000000                 120.0
3      1004    33.0           33.000000                  33.0
4      1005     NaN           90.333333                  72.0
5      1006    89.0           89.000000                  89.0
6      1007   200.0          200.000000                 200.0
7      1008    55.0           55.000000                  55.0
💡 What just happened?
Notice how the mean (90.33) and median (72.0) give very different fills for the same two missing rows. That 200.0 outlier in row 6 dragged the mean about 18 points above the median. For this skewed dataset, median fill is the more conservative, defensible choice. For the category column, "Clothing" was the mode — it appeared 4 times versus 3 for "Electronics", so it won.
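You can measure the outlier's pull directly by recomputing both statistics with and without it. A quick check, using the same six amounts as above:

```python
import pandas as pd

amounts = pd.Series([45.0, 120.0, 33.0, 89.0, 200.0, 55.0])

print("Mean with outlier:   ", round(amounts.mean(), 2))    # 90.33
print("Median with outlier: ", amounts.median())            # 72.0

# Drop the 200.0 outlier and recompute
no_outlier = amounts[amounts < 200]
print("Mean without outlier:", round(no_outlier.mean(), 2)) # 68.4
print("Median without:      ", no_outlier.median())         # 55.0
```

Removing the single 200.0 value moves the mean by almost 22 points, which is exactly why mean fill is risky on skewed data.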
Strategy 3 — Forward Fill and Backward Fill
When your data has a natural order — like time series, survey responses, or sequential records — you can fill gaps using neighboring values. ffill() carries the last known value forward. bfill() pulls the next known value backward. This is especially powerful for sensor readings, stock prices, or daily logs.
The scenario: A temperature logger recorded hourly readings, but the sensor dropped out for two hours. We need to fill those gaps without inventing unrealistic values.
import pandas as pd
import numpy as np
# Hourly temperature log — sensor dropped out at 14:00 and 15:00
temps = pd.DataFrame({
'hour': [10, 11, 12, 13, 14, 15, 16, 17],
# Celsius readings — two hours are missing in the middle
'temp_c': [21.2, 22.8, 24.1, 25.5, np.nan, np.nan, 26.3, 25.0]
})
# Forward fill — carry the last known reading forward into the gaps
temps['temp_ffill'] = temps['temp_c'].ffill()
# Backward fill — pull the next known reading back into the gaps
temps['temp_bfill'] = temps['temp_c'].bfill()
# Linear interpolation — smoothly connects the two known points
temps['temp_interpolated'] = temps['temp_c'].interpolate(method='linear')
# Show all four columns to compare the strategies
print(temps)
   hour  temp_c  temp_ffill  temp_bfill  temp_interpolated
0    10    21.2        21.2        21.2          21.200000
1    11    22.8        22.8        22.8          22.800000
2    12    24.1        24.1        24.1          24.100000
3    13    25.5        25.5        25.5          25.500000
4    14     NaN        25.5        26.3          25.766667
5    15     NaN        25.5        26.3          26.033333
6    16    26.3        26.3        26.3          26.300000
7    17    25.0        25.0        25.0          25.000000
💡 What just happened?
All three strategies filled the two NaN rows, but notice how interpolation gives the most physically realistic result: the temperature was rising from 25.5° to 26.3°, so 25.77° and 26.03° are sensible intermediate estimates. Forward fill assumes nothing changed (both gaps become 25.5°), while backward fill assumes the future reading applied retroactively (both gaps become 26.3°). For time series, interpolation is almost always the better default.
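One caveat: ffill(), bfill(), and interpolate() will happily bridge arbitrarily long gaps, and carrying a reading across a six-hour outage is rarely physically plausible. All three methods accept a limit parameter that caps how many consecutive NaNs get filled. The sketch below uses a hypothetical longer outage, not the dataset above:

```python
import pandas as pd
import numpy as np

# A longer sensor outage: three consecutive missing readings
temp_c = pd.Series([21.2, 22.8, np.nan, np.nan, np.nan, 26.3])

# Forward fill at most one gap; longer outages stay NaN for manual review
filled = temp_c.ffill(limit=1)
print(filled)  # the first NaN becomes 22.8; the next two remain NaN

# interpolate() accepts the same limit parameter
interp = temp_c.interpolate(method='linear', limit=2)
print(interp)  # fills 23.675 and 24.55, leaves the third gap as NaN
```

Leaving long gaps visibly unfilled is often safer than silently inventing a day of data.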
Visual Comparison — Before and After Imputation
Here's the same temperature data laid out as a side-by-side table — showing exactly where the gaps were and how each strategy fills them differently.
Temperature Log — Fill Strategy Comparison
hour   temp_c   ffill   bfill   interpolated
 13     25.5    25.5    25.5      25.500
 14      ⚠      25.5    26.3      25.767
 15      ⚠      25.5    26.3      26.033
 16     26.3    26.3    26.3      26.300
⚠ = gap hours that were missing. The other cells in those rows show imputed values.
Filling Different Columns Differently
In real datasets, you'll almost never use the same strategy for every column. A good EDA workflow fills each column based on its type and business meaning. Here's how to do that cleanly in one pass.
The scenario: An HR dataset has employees with missing salary, department, and performance score fields — each needing a different approach.
import pandas as pd
import numpy as np
# HR dataset — three columns, three different missing value problems
hr = pd.DataFrame({
'employee': ['Priya', 'Tom', 'Sara', 'Luis', 'Mei', 'James', 'Olu', 'Nina'],
# Salary — two NaNs, skewed distribution so use median
'salary': [72000, 85000, np.nan, 91000, 68000, np.nan, 78000, 250000],
# Department — categorical, one entry unknown, use mode
'dept': ['Eng', 'Eng', 'HR', 'Eng', np.nan, 'HR', 'Eng', 'Finance'],
# Performance score — sequential reviews, use forward fill
'perf_score': [4.2, 3.8, 4.5, np.nan, 3.9, 4.1, np.nan, 4.7]
})
# Work on a copy so the original stays clean for comparison
hr_clean = hr.copy()
# Salary: use median because the 250,000 outlier skews the mean
hr_clean['salary'] = hr_clean['salary'].fillna(hr_clean['salary'].median())
# Department: use mode — most employees are in Engineering
hr_clean['dept'] = hr_clean['dept'].fillna(hr_clean['dept'].mode()[0])
# Performance score: forward fill — carry last known review score
hr_clean['perf_score'] = hr_clean['perf_score'].ffill()
# Confirm no missing values remain
print("Missing values after cleaning:")
print(hr_clean.isnull().sum())
print()
print(hr_clean)
Missing values after cleaning:
employee      0
salary        0
dept          0
perf_score    0
dtype: int64

  employee    salary     dept  perf_score
0    Priya   72000.0      Eng         4.2
1      Tom   85000.0      Eng         3.8
2     Sara   81500.0       HR         4.5
3     Luis   91000.0      Eng         4.5
4      Mei   68000.0      Eng         3.9
5    James   81500.0       HR         4.1
6      Olu   78000.0      Eng         4.1
7     Nina  250000.0  Finance         4.7
💡 What just happened?
Every column is now clean, and each was handled with a strategy that fits its data type. The median salary (81,500) is conservative despite Nina's 250k outlier — the mean would have been about 107k, which is misleading as a typical fill. Mei's missing department became "Eng" (the mode). Luis's missing performance score became 4.5, forward-filled from Sara above him. Zero missing values remain.
🍎 Teacher's Note
One thing beginners often skip: always work on a copy (df.copy()) before imputing. That way you can always compare your cleaned version against the original. Also, after any imputation step, immediately run .isnull().sum() to confirm you got everything — silent misses happen more than you'd expect, especially if a column has NaN and an empty string mixed together.
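The empty-string trap the note mentions is easy to reproduce: isnull() only sees true NaN, so blank strings slip straight through your checks. A defensive pattern is to normalize them to NaN before counting (a sketch with made-up data; replace() is standard pandas):

```python
import pandas as pd
import numpy as np

# One true NaN and one empty string — only the NaN is "missing" to pandas
df = pd.DataFrame({'dept': ['Eng', '', 'HR', np.nan]})

print(df['dept'].isnull().sum())  # 1 — the empty string is invisible to isnull()

# Normalize empty strings to NaN, then recount
df['dept'] = df['dept'].replace('', np.nan)
print(df['dept'].isnull().sum())  # 2 — now both gaps are visible
```

Running this normalization step before any fillna() call ensures your imputation actually reaches every gap.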
Practice Questions
1. Which pandas method is used to replace NaN values with a specified value or statistic?
2. When a numeric column contains extreme outliers, which measure of central tendency is the safer choice for filling missing values?
3. What pandas method carries the last known value forward to fill subsequent missing entries in a sequential dataset?
Quiz
1. You have a DataFrame with missing values in several columns. You only want to drop rows where the 'price' column is missing. Which call does this correctly?
2. For a time series with two consecutive missing readings between known values, which method produces the most physically realistic fill?
3. To fill a missing categorical column with the most frequently occurring value, what should you pass into fillna()?
Up Next · Lesson 8
Detecting Outliers
Learn how to spot values that are unusually far from the rest of your data — and why they can be your most valuable or most dangerous rows.