EDA Lesson 7 – Handling Missing Values | Dataplexa
Beginner Level · Lesson 7

Handling Missing Values

Finding missing values is only half the battle — the real skill is deciding what to do about them, because the wrong choice can silently wreck your analysis before it even starts.

You Have a Gap. Now What?

Imagine you're reviewing a customer survey where 200 people answered every question — but 30 people skipped the income field. You have three options: pretend those rows don't exist, guess what the missing values probably are, or use a method that fills them in mathematically.

None of these is always right. The best strategy depends on why the data is missing and how much of it is gone. Drop too aggressively and you lose signal. Fill carelessly and you introduce bias. Let's walk through your real options.

🧭 The Three Strategies — Every missing-value decision falls into one of these buckets: drop it, fill it with a constant, or fill it intelligently. Everything else is a variation of these.

Strategy 1 — Dropping Missing Rows or Columns

dropna() removes any row (or column) that contains at least one NaN. It's the nuclear option — fast, simple, and sometimes the right call. If less than 5% of your data is missing at random, dropping is usually fine. If you're dropping 30% of your dataset, stop and ask why.
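Before committing to a drop, it helps to quantify how much is actually missing per column. A minimal sketch (the column names here are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with a little missingness in each column
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 7.0, 8.0],
})

# isnull() gives a boolean mask; mean() turns it into a fraction per column
missing_frac = df.isnull().mean()
print(missing_frac)   # a → 0.25, b → 0.50
```

If every column sits under that ~5% line, dropping is usually safe; if one column is mostly empty, it deserves its own decision.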

The scenario: A gym membership dataset has a few rows where members didn't record their weight. We'll explore different ways to drop them.

import pandas as pd
import numpy as np

# Build a gym membership dataset with some missing values
df = pd.DataFrame({
    'member':   ['Alice', 'Bob', 'Carla', 'Dan', 'Eva', 'Frank', 'Gina', 'Hiro'],
    # Age column — two members skipped this field
    'age':      [28, 35, np.nan, 42, 31, np.nan, 29, 38],
    # Weight column — three members skipped this field
    'weight_kg':[62, np.nan, 75, 88, np.nan, 71, 58, np.nan],
    # Membership tier — fully complete column
    'tier':     ['Gold', 'Silver', 'Gold', 'Bronze', 'Silver', 'Gold', 'Bronze', 'Silver']
})

# Drop any row that has at least one missing value
df_dropped_any = df.dropna()
print("Rows after dropping any NaN:", len(df_dropped_any))

# Drop rows only if ALL values in that row are missing (rarely useful)
df_dropped_all = df.dropna(how='all')
print("Rows after dropping all-NaN rows:", len(df_dropped_all))

# Drop rows only if weight_kg is missing — keep rows with missing age
df_dropped_weight = df.dropna(subset=['weight_kg'])
print("Rows after dropping missing weight:", len(df_dropped_weight))

# Drop columns where more than 30% of values are missing
threshold = int(np.ceil(len(df) * 0.7))   # column must have at least 70% of values filled
df_col_dropped = df.dropna(axis=1, thresh=threshold)
print("Columns after threshold drop:", df_col_dropped.columns.tolist())

💡 What just happened?

We went from 8 rows to just 3 by dropping every row with any missing value: a 62.5% loss of data for only a few missing fields. The subset parameter is your friend here: it lets you be surgical, only dropping rows where the column you actually need is empty. The thresh parameter plays the same role by count, keeping only rows or columns with at least that many non-missing values.

Strategy 2 — Filling with a Constant or Statistical Value

fillna() is the most commonly used imputation method. You can fill with a fixed value (like 0 or "Unknown"), or with a summary statistic from the column itself — mean, median, or mode. Each choice carries assumptions you should be aware of.

Mean Fill

Best for symmetric, normally distributed data. Sensitive to outliers — one extreme value shifts the mean and affects your fills.

Median Fill

Best for skewed data or when outliers are present. The median is robust and doesn't get pulled by extreme values.

Mode Fill

Best for categorical columns. Fills missing labels with the most frequently seen value — logical for text fields.

The scenario: A retail sales dataset is missing purchase amounts and product categories. We'll apply different fill strategies to each.

import pandas as pd
import numpy as np

# Retail sales dataset — some amounts and categories are missing
sales = pd.DataFrame({
    'order_id':  [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    # Purchase amounts — two orders didn't record the value
    'amount':    [45.0, np.nan, 120.0, 33.0, np.nan, 89.0, 200.0, 55.0],
    # Product category — one order is missing its category
    'category':  ['Electronics', 'Clothing', 'Electronics', np.nan,
                  'Clothing', 'Clothing', 'Electronics', 'Clothing']
})

# Fill numeric column with mean — good for roughly symmetric data
mean_amount = sales['amount'].mean()
sales['amount_mean_filled'] = sales['amount'].fillna(mean_amount)
print("Mean fill value:", round(mean_amount, 2))

# Fill numeric column with median — better when skewed
median_amount = sales['amount'].median()
sales['amount_median_filled'] = sales['amount'].fillna(median_amount)
print("Median fill value:", median_amount)

# Fill categorical column with mode — most frequent category
mode_category = sales['category'].mode()[0]    # mode() returns a Series, grab first item
sales['category_filled'] = sales['category'].fillna(mode_category)
print("Mode fill value:", mode_category)

# Compare the two numeric fill strategies side by side
print(sales[['order_id', 'amount', 'amount_mean_filled', 'amount_median_filled']])

💡 What just happened?

Notice how the mean (90.33) and median (72.0) give very different fills for the same two missing rows. That single 200.0 outlier pulled the mean 18 points above the median. For this skewed dataset, median fill is the more conservative, defensible choice. For the category column, "Clothing" was the mode: it appeared 4 times versus 3 for "Electronics", so it won.

Strategy 3 — Forward Fill and Backward Fill

When your data has a natural order — like time series, survey responses, or sequential records — you can fill gaps using neighboring values. ffill() carries the last known value forward. bfill() pulls the next known value backward. This is especially powerful for sensor readings, stock prices, or daily logs.

The scenario: A temperature logger recorded hourly readings, but the sensor dropped out for two hours. We need to fill those gaps without inventing unrealistic values.

import pandas as pd
import numpy as np

# Hourly temperature log — sensor dropped out at 14:00 and 15:00
temps = pd.DataFrame({
    'hour':  [10, 11, 12, 13, 14, 15, 16, 17],
    # Celsius readings — two hours are missing in the middle
    'temp_c': [21.2, 22.8, 24.1, 25.5, np.nan, np.nan, 26.3, 25.0]
})

# Forward fill — carry the last known reading forward into the gaps
temps['temp_ffill'] = temps['temp_c'].ffill()

# Backward fill — pull the next known reading back into the gaps
temps['temp_bfill'] = temps['temp_c'].bfill()

# Linear interpolation — smoothly connects the two known points
temps['temp_interpolated'] = temps['temp_c'].interpolate(method='linear')

# Show all four columns to compare the strategies
print(temps)

💡 What just happened?

All three strategies filled the two NaN rows, but notice how interpolation gives the most physically realistic result: the temperature was rising from 25.5° to 26.3°, so 25.77° and 26.03° are sensible intermediate estimates. Forward fill assumes nothing changed (both gaps become 25.5°), while backward fill assumes the future reading applied retroactively (both gaps become 26.3°). For time series, interpolation is almost always the better default.
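One caveat before leaning on these defaults: ffill, bfill, and interpolate will happily bridge gaps of any length, even when the sensor was down for hours. The limit parameter caps how many consecutive NaNs get filled, leaving longer outages visibly missing. A short sketch:

```python
import pandas as pd
import numpy as np

# One reading, then a three-step outage, then one more reading
readings = pd.Series([21.0, np.nan, np.nan, np.nan, 24.0])

# Only carry the last known value forward into the first missing slot
capped = readings.ffill(limit=1)
print(capped.tolist())   # [21.0, 21.0, nan, nan, 24.0]
```

The remaining NaNs are a feature, not a bug: they flag a gap too long to paper over with a neighboring value.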

Visual Comparison — Before and After Imputation

Here's a visual representation of our temperature dataset — showing exactly where the gaps were and how each strategy fills them differently.

Temperature Log — Fill Strategy Comparison

Hour      10     11     12     13     14 ⚠   15 ⚠   16     17
Original  21.2   22.8   24.1   25.5   NaN    NaN    26.3   25.0
ffill     21.2   22.8   24.1   25.5   25.5   25.5   26.3   25.0
bfill     21.2   22.8   24.1   25.5   26.3   26.3   26.3   25.0
interp    21.2   22.8   24.1   25.5   25.77  26.03  26.3   25.0

⚠ = gap hours that were missing; the values beneath them in each fill row are imputed.

Filling Different Columns Differently

In real datasets, you'll almost never use the same strategy for every column. A good EDA workflow fills each column based on its type and business meaning. Here's how to do that cleanly in one pass.

The scenario: An HR dataset has employees with missing salary, department, and performance score fields — each needing a different approach.

import pandas as pd
import numpy as np

# HR dataset — three columns, three different missing value problems
hr = pd.DataFrame({
    'employee':   ['Priya', 'Tom', 'Sara', 'Luis', 'Mei', 'James', 'Olu', 'Nina'],
    # Salary — two NaNs, skewed distribution so use median
    'salary':     [72000, 85000, np.nan, 91000, 68000, np.nan, 78000, 250000],
    # Department — categorical, one entry unknown, use mode
    'dept':       ['Eng', 'Eng', 'HR', 'Eng', np.nan, 'HR', 'Eng', 'Finance'],
    # Performance score — sequential reviews, use forward fill
    'perf_score': [4.2, 3.8, 4.5, np.nan, 3.9, 4.1, np.nan, 4.7]
})

# Work on a copy so the original stays clean for comparison
hr_clean = hr.copy()

# Salary: use median because the 250,000 outlier skews the mean
hr_clean['salary'] = hr_clean['salary'].fillna(hr_clean['salary'].median())

# Department: use mode — most employees are in Engineering
hr_clean['dept'] = hr_clean['dept'].fillna(hr_clean['dept'].mode()[0])

# Performance score: forward fill — carry last known review score
hr_clean['perf_score'] = hr_clean['perf_score'].ffill()

# Confirm no missing values remain
print("Missing values after cleaning:")
print(hr_clean.isnull().sum())
print()
print(hr_clean)

💡 What just happened?

Every column is now clean, and each was handled with a strategy that fits its data type. The median salary (81,500) is conservative despite Nina's 250k outlier: the mean would have been roughly 107,000, which is misleading for a typical fill. Mei's missing department became "Eng" (the mode). Luis's missing performance score became 4.5, forward-filled from Sara above him, and Olu's became 4.1 from James. Zero missing values remain.
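A related pattern worth knowing: fillna() also accepts a dictionary mapping column names to fill values, so the constant-value fills above can collapse into a single call (forward fill still needs its own step, since it isn't a single value). A sketch on a smaller, made-up frame:

```python
import pandas as pd
import numpy as np

hr = pd.DataFrame({
    'salary': [72000, np.nan, 91000, np.nan],
    'dept':   ['Eng', np.nan, 'HR', 'Eng'],
})

# One fillna call, per-column fill values via a dict
hr_clean = hr.fillna({
    'salary': hr['salary'].median(),   # robust to outliers
    'dept':   hr['dept'].mode()[0],    # most frequent label
})
print(hr_clean.isnull().sum().sum())   # 0 — nothing left missing
```

This keeps the whole imputation plan in one visible place, which makes it easier to review and harder to forget a column.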

🍎 Teacher's Note

One thing beginners often skip: work on a copy (df.copy()) before imputing, so you can always compare your cleaned version against the original. Also, after any imputation step, immediately run .isnull().sum() to confirm you got everything — silent misses happen more often than you'd expect, especially if a column has NaN and empty strings mixed together.
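That mixed NaN/empty-string trap is easy to reproduce: isnull() only sees real NaN values, so empty strings slip through the count silently. Converting them with replace() first makes the count honest. A small sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series(['Gold', '', 'Silver', np.nan])

print(s.isnull().sum())          # 1 — the empty string is NOT counted as missing

# Normalise empty strings to NaN before counting or filling
s_fixed = s.replace('', np.nan)
print(s_fixed.isnull().sum())    # 2 — now both gaps are visible
```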

Practice Questions

1. Which pandas method is used to replace NaN values with a specified value or statistic?



2. When a numeric column contains extreme outliers, which measure of central tendency is the safer choice for filling missing values?



3. What pandas method carries the last known value forward to fill subsequent missing entries in a sequential dataset?



Quiz

1. You have a DataFrame with missing values in several columns. You only want to drop rows where the 'price' column is missing. Which call does this correctly?


2. For a time series with two consecutive missing readings between known values, which method produces the most physically realistic fill?


3. To fill a missing categorical column with the most frequently occurring value, what should you pass into fillna()?


Up Next · Lesson 8

Detecting Outliers

Learn how to spot values that are unusually far from the rest of your data — and why they can be your most valuable or most dangerous rows.