EDA Course
Missing Values — The Holes in Your Data
Missing data is not an edge case. It's not a beginner problem. It's in every real-world dataset you will ever touch — and if you don't handle it deliberately, it handles you.
Why Data Goes Missing in the First Place
Before you fix missing values, you need to understand why they're missing. Because the reason matters — a lot. There are three completely different situations, and they require different responses.
Missing Completely at Random
A survey respondent accidentally skipped a question. The missingness has nothing to do with the data itself. Safe to drop or fill.
Missing at Random
Income data is missing more for younger respondents — not random, but explainable by another column. Can be imputed carefully.
Missing Not at Random
High earners deliberately skip the salary question. The missingness is related to the value itself. Most dangerous — needs domain knowledge.
In most beginner and intermediate EDA you won't formally classify every missing column into MCAR/MAR/MNAR. But the habit of asking "why is this missing?" before just filling it in — that habit is what separates careful analysts from careless ones.
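One quick, practical probe for the MAR case is to compare another column across the rows with and without the missing value. This sketch uses a small hypothetical survey frame (the `survey` DataFrame and its values are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical survey frame: income tends to be missing for younger respondents
survey = pd.DataFrame({
    'age':    [22, 24, 26, 31, 38, 45, 52, 58, 63, 70],
    'income': [np.nan, np.nan, 28000, np.nan, 41000, 52000, 61000, 58000, 64000, 47000]
})

# Compare the mean age of rows WITH income vs rows WITHOUT income.
# If the two groups look very different, the missingness is probably not MCAR.
print(survey.groupby(survey['income'].isnull())['age'].mean())
```

Here the rows missing income skew much younger than the rest, which is exactly the MAR pattern described above. It's not a formal test, but it's a thirty-second sanity check worth running.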
How Pandas Represents Missing Values
Pandas uses two main representations for missing data. NaN — short for "Not a Number" — is a float value used in numeric columns. None is Python's null object, and it shows up in object (string) columns. When you put None into a numeric column, pandas silently converts it to NaN. In practice you'll see both, and detection methods like .isnull() treat them identically.
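A tiny demonstration of the two representations and how .isnull() handles them:

```python
import pandas as pd
import numpy as np

# None in a numeric column is converted to NaN; in an object column it stays as None
s_num = pd.Series([1.5, None, np.nan])   # dtype becomes float64, None -> NaN
s_obj = pd.Series(['a', None, 'b'])      # dtype stays object, None preserved

# .isnull() flags both representations the same way
print(s_num.isnull().tolist())   # [False, True, True]
print(s_obj.isnull().tolist())   # [False, True, False]
```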
But here's the sneaky part — there are also disguised missing values that pandas doesn't flag automatically. Things like the string "N/A", a blank space " ", the number 0 used to mean "no data", or -999 used as a placeholder. These look fine to pandas. Only you can catch them.
The dangerous ones: A column where missing salary is stored as 0 will pass every null check. Your mean salary calculation will be dragged way down. No error. No warning. Just quietly wrong numbers going into your report.
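Here's that trap in miniature — the salary figures below are invented for illustration, but the mechanics are exactly as described:

```python
import pandas as pd
import numpy as np

# Hypothetical HR extract where missing salaries were stored as 0
salaries = pd.Series([52000, 61000, 0, 58000, 0, 49000], name='salary')

print("Null check finds:", salaries.isnull().sum())   # 0 — the zeros pass silently
print("Mean WITH zeros :", salaries.mean())

# Treat 0 as missing, then recompute — the mean jumps substantially
cleaned = salaries.replace(0, np.nan)
print("Mean without them:", cleaned.mean())
```

The null check reports zero problems, yet the two hidden placeholders drag the mean from 55,000 down to about 36,667. That's the "quietly wrong numbers" failure mode in action.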
Detecting Missing Values — The Full Toolkit
The scenario: You've just received a patient health dataset from a hospital. Before any analysis — average age, blood pressure trends, medication effectiveness — you need to know exactly what's missing and how much. Here's the complete detection routine.
import pandas as pd
import numpy as np
# Patient health dataset — realistic mix of NaN, None and present values
patient_df = pd.DataFrame({
'patient_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
'name': ['Alice', 'Bob', None, 'Dave', 'Eve', 'Frank', 'Grace', None, 'Ivy', 'Jack'],
'age': [34, 45, 29, np.nan, 52, 38, np.nan, 41, 27, np.nan],
'blood_type': ['A+', 'B-', 'O+', 'A+', None, 'AB+', 'O-', 'B+', None, 'A-'],
'weight_kg': [68.5, 82.0, 74.3, np.nan, 91.2, 70.0, np.nan, 85.5, 63.0, 77.8],
'smoker': [False, True, False, True, None, False, True, None, False, True],
'diagnosis': ['Healthy', 'Hypertension', 'Healthy', 'Diabetes', 'Healthy',
'Healthy', None, 'Hypertension', 'Healthy', None]
})
# Method 1: .isnull() — True wherever a value is missing
# .sum() counts the True values per column
print("=== Missing value COUNT per column ===")
print(patient_df.isnull().sum())
# Method 2: Calculate missing percentage — more useful than raw count
print("\n=== Missing value PERCENTAGE per column ===")
missing_pct = (patient_df.isnull().sum() / len(patient_df) * 100).round(1)
print(missing_pct)
=== Missing value COUNT per column ===
patient_id    0
name          2
age           3
blood_type    2
weight_kg     2
smoker        2
diagnosis     2
dtype: int64

=== Missing value PERCENTAGE per column ===
patient_id     0.0
name          20.0
age           30.0
blood_type    20.0
weight_kg     20.0
smoker        20.0
diagnosis     20.0
dtype: float64
What just happened?
.isnull() is a pandas method that scans every cell and returns a boolean DataFrame — True where the value is missing, False where it isn't. Chaining .sum() collapses each column into a count of True values.
The age column is 30% missing — that's the most critical one to deal with. Everything else is 20% missing. In a 10-row dataset that's 2–3 records per column. In a 100,000-row dataset that's 20,000 missing values. The percentage framing makes that reality much clearer than a raw count does.
Building a Proper Missing Value Summary
The scenario: Your team lead asks for a clean report on data quality before the project kicks off. Just printing raw counts isn't enough — you want a tidy summary table that shows count, percentage, and dtype all together so anyone can read it at a glance.
import pandas as pd
import numpy as np
# Rebuild the patient dataset
patient_df = pd.DataFrame({
'patient_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
'name': ['Alice', 'Bob', None, 'Dave', 'Eve', 'Frank', 'Grace', None, 'Ivy', 'Jack'],
'age': [34, 45, 29, np.nan, 52, 38, np.nan, 41, 27, np.nan],
'blood_type': ['A+', 'B-', 'O+', 'A+', None, 'AB+', 'O-', 'B+', None, 'A-'],
'weight_kg': [68.5, 82.0, 74.3, np.nan, 91.2, 70.0, np.nan, 85.5, 63.0, 77.8],
'smoker': [False, True, False, True, None, False, True, None, False, True],
'diagnosis': ['Healthy', 'Hypertension', 'Healthy', 'Diabetes', 'Healthy',
'Healthy', None, 'Hypertension', 'Healthy', None]
})
# Build a clean summary table — count, percentage, and dtype side by side
missing_summary = pd.DataFrame({
'missing_count': patient_df.isnull().sum(), # raw count
'missing_pct': (patient_df.isnull().sum() / len(patient_df) * 100).round(1), # percentage
'dtype': patient_df.dtypes # column type
})
# Sort by most missing first — makes it easy to prioritise
missing_summary = missing_summary.sort_values('missing_pct', ascending=False)
# Only show columns that have at least one missing value
print(missing_summary[missing_summary['missing_count'] > 0])
            missing_count  missing_pct    dtype
age                     3         30.0  float64
name                    2         20.0   object
blood_type              2         20.0   object
weight_kg               2         20.0  float64
smoker                  2         20.0   object
diagnosis               2         20.0   object
What just happened?
We built a summary DataFrame by hand — combining .isnull().sum(), percentage calculation, and .dtypes into one tidy table. Then we used .sort_values() — a pandas method — to rank columns by how much data is missing. Most critical problem at the top.
age jumps to the top — it's 30% missing and it's a float64, meaning it's a numeric column that will affect averages and correlations. A missing value in a text column like name is less analytically dangerous than one in a numeric column like age. The dtype column in your summary gives you that context immediately.
Spotting Disguised Missing Values
This is the one pandas can't do for you. Disguised missing values look like valid data — but they're actually placeholders for "we don't know." You need domain knowledge and a suspicious mind to find them.
The scenario: A retail dataset arrives from the legacy ERP system. The developer who built it years ago used 0 for unknown prices, "Unknown" for unknown categories, and -1 for unknown customer ages. None of these will show up in your .isnull() check.
import pandas as pd
import numpy as np
# Legacy retail dataset — disguised missing values hiding as valid-looking data
retail_df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106, 107, 108],
'product_name':['Laptop', 'Mouse', 'Keyboard', 'Unknown', 'Monitor', 'Cable', 'Unknown', 'Headset'],
'price': [999.99, 29.99, 79.99, 0.00, 349.99, 0.00, 15.99, 89.99], # 0 means unknown
'customer_age':[-1, 34, 28, -1, 45, 31, -1, 27], # -1 means age not recorded
'stock': [50, 200, 150, np.nan, 75, 300, 100, np.nan] # real NaN values
})
# Step 1: Standard null check — only catches real NaN/None
print("=== Standard null check ===")
print(retail_df.isnull().sum())
# Step 2: Check for known placeholder values specific to this dataset
print("\n=== Checking for disguised missing values ===")
# Price = 0 is suspicious for a retail product — probably means 'unknown'
zero_price = (retail_df['price'] == 0).sum()
print(f"Prices equal to 0 : {zero_price}")
# -1 in age is a classic legacy system placeholder
neg_age = (retail_df['customer_age'] == -1).sum()
print(f"Ages equal to -1 : {neg_age}")
# 'Unknown' string in product name — sometimes people type this literally
unknown_name = (retail_df['product_name'] == 'Unknown').sum()
print(f"Product names 'Unknown': {unknown_name}")
# Step 3: Replace all disguised values with real NaN so pandas can handle them properly
retail_df['price'] = retail_df['price'].replace(0, np.nan)
retail_df['customer_age'] = retail_df['customer_age'].replace(-1, np.nan)
retail_df['product_name'] = retail_df['product_name'].replace('Unknown', np.nan)
# Now re-run the standard null check — everything surfaces
print("\n=== Null check AFTER replacing disguised values ===")
print(retail_df.isnull().sum())
=== Standard null check ===
product_id      0
product_name    0
price           0
customer_age    0
stock           2
dtype: int64

=== Checking for disguised missing values ===
Prices equal to 0 : 2
Ages equal to -1 : 3
Product names 'Unknown': 2

=== Null check AFTER replacing disguised values ===
product_id      0
product_name    2
price           2
customer_age    3
stock           2
dtype: int64
What just happened?
The first null check found only 2 missing values in stock. After manual investigation and .replace() — a pandas method that swaps specific values for np.nan — the picture changed completely. 7 additional missing values surfaced that were invisible to the standard check.
This is why you always ask the data provider: "Are there any placeholder values I should know about?" If they say no — be suspicious anyway. Legacy systems are full of this. Always look at .value_counts() on suspicious columns and ask yourself whether any values seem like they don't belong.
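The .value_counts() habit mentioned above looks like this in practice — the column below is a hypothetical legacy export, invented to show the pattern:

```python
import pandas as pd

# Hypothetical age column from a legacy export
ages = pd.Series([34, 28, -1, 45, -1, 31, -1, 27, 52, -1], name='customer_age')

# .value_counts() surfaces suspicious repeated values immediately:
# a real age distribution rarely has one value dominating like this
print(ages.value_counts())

# -1 appearing 4 times out of 10 is a strong placeholder signal
print("Share of -1:", (ages == -1).mean())
```

Any single value that repeats far more often than its neighbours — especially 0, -1, -999, 999, or a round sentinel like 9999 — deserves a question to whoever produced the data.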
Visualising the Missing Value Pattern
Numbers are useful. But a visual of which rows and columns are missing, and whether the gaps cluster together, can reveal patterns that a table never would. Here's a mockup of the patient dataset's missing value pattern — in practice you'd generate something similar with a plotting library.
Missing value heatmap — patient_df (✓ = present, ✗ = missing)
| | name | age | blood_type | weight_kg | smoker | diagnosis |
|---|---|---|---|---|---|---|
| Patient 1001 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Patient 1002 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Patient 1003 | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Patient 1004 | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ |
| Patient 1005 | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ |
| Patient 1006 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Patient 1007 | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |
| Patient 1008 | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ |
| Patient 1009 | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
| Patient 1010 | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
The heatmap reveals something a table of counts doesn't: Patient 1007 is missing age, weight_kg, and diagnosis all at once. That row might be worth investigating or dropping entirely — three missing values in one record is a quality flag, not a coincidence.
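The row-wise view behind the heatmap can be computed directly — .isnull().sum() with axis=1 counts gaps per row instead of per column. This sketch reuses the patient dataset from above to flag rows like Patient 1007:

```python
import pandas as pd
import numpy as np

# Patient dataset from earlier in the lesson
patient_df = pd.DataFrame({
    'patient_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'name': ['Alice', 'Bob', None, 'Dave', 'Eve', 'Frank', 'Grace', None, 'Ivy', 'Jack'],
    'age': [34, 45, 29, np.nan, 52, 38, np.nan, 41, 27, np.nan],
    'blood_type': ['A+', 'B-', 'O+', 'A+', None, 'AB+', 'O-', 'B+', None, 'A-'],
    'weight_kg': [68.5, 82.0, 74.3, np.nan, 91.2, 70.0, np.nan, 85.5, 63.0, 77.8],
    'smoker': [False, True, False, True, None, False, True, None, False, True],
    'diagnosis': ['Healthy', 'Hypertension', 'Healthy', 'Diabetes', 'Healthy',
                  'Healthy', None, 'Hypertension', 'Healthy', None]
})

# Count missing values per ROW — axis=1 sums across columns
row_missing = patient_df.isnull().sum(axis=1)

# Rows with 3+ gaps deserve a closer look before any imputation
print(patient_df.loc[row_missing >= 3, ['patient_id']])
```

Only Patient 1007 crosses the three-gap threshold, matching what the heatmap showed at a glance.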
How Much Missing is Too Much?
There's no universal rule. But here's a practical framework most analysts use as a starting point.
Safe to impute (under roughly 5% missing)
Small enough that filling with mean/median/mode won't distort your analysis.
Impute carefully (roughly 5–30% missing)
Consider smarter imputation methods. Check if missingness correlates with other columns.
Proceed with caution (above roughly 30% missing)
Consider dropping the column or flagging it. Any imputation at this level introduces significant bias.
These are guidelines, not gospel. A column that's 40% missing but critical to your business question is still worth keeping — just flag it clearly in any report or model you build with it.
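A framework like this can be turned into a quick triage step with pd.cut. The cut points here (5% and 30%) and the missing_pct values are illustrative assumptions — adjust them to your own domain:

```python
import pandas as pd

# Hypothetical missing-percentage summary for a handful of columns
missing_pct = pd.Series({'age': 3.0, 'income': 18.0, 'notes': 55.0, 'city': 7.5})

# Bucket each column by missing percentage using illustrative thresholds
buckets = pd.cut(missing_pct,
                 bins=[-0.1, 5, 30, 100],
                 labels=['safe to impute', 'impute carefully', 'proceed with caution'])
print(pd.DataFrame({'missing_pct': missing_pct, 'action': buckets}))
```

The point isn't the exact numbers — it's having an explicit, written-down policy so two analysts looking at the same column reach the same decision.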
Teacher's Note
One mistake I see constantly: people drop missing values immediately without thinking. They run .dropna() on the whole dataset and carry on. If your missingness is not random — if the people who didn't answer the income question are systematically different from those who did — then dropping them introduces bias into every analysis you do afterwards. You haven't cleaned your data. You've quietly skewed it.
Always ask why values are missing before deciding what to do with them. We cover the actual handling strategies — dropping, filling, imputing — in the very next lesson.
Practice Questions
1. Which pandas method returns a boolean DataFrame showing True wherever a value is missing?
2. A column uses -1 as a placeholder for unknown ages. To convert all -1 values to NaN, you use df['age'].______(-1, np.nan). What goes in the blank?
3. High-income respondents consistently skip the salary question in a survey. The missingness is related to the value of the missing data itself. This is called ______.
Quiz
1. A column stores the value 0 to represent unknown prices. What is the problem with this?
2. Why is blindly running .dropna() on a dataset potentially dangerous?
3. When building a missing value summary table, why should you sort columns by missing percentage descending?
Up Next · Lesson 7
Handling Missing Values
Now that you can find every gap, it's time to decide what to do about it — drop, fill, or impute. The wrong choice can quietly ruin your analysis.