EDA Course
Missing Values — The Holes in Your Data
Missing data is not an edge case. It's not a beginner problem. It's in every real-world dataset you will ever touch — and if you don't handle it deliberately, it handles you.
Why Data Goes Missing in the First Place
Before you fix missing values, you need to understand why they're missing. Because the reason matters — a lot. There are three completely different situations, and they require different responses.
Missing Completely at Random
A survey respondent accidentally skipped a question. The missingness has nothing to do with the data itself. Safe to drop or fill.
Missing at Random
Income data is missing more for younger respondents — not random, but explainable by another column. Can be imputed carefully.
Missing Not at Random
High earners deliberately skip the salary question. The missingness is related to the value itself. Most dangerous — needs domain knowledge.
In most beginner and intermediate EDA you won't formally classify every missing column into MCAR/MAR/MNAR. But the habit of asking "why is this missing?" before just filling it in — that habit is what separates careful analysts from careless ones.
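One quick, practical probe for the MAR case is to compare another column across the rows with and without the missing value. This sketch uses a small hypothetical survey frame (the `survey` DataFrame and its values are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical survey frame: income tends to be missing for younger respondents
survey = pd.DataFrame({
    'age':    [22, 24, 26, 31, 38, 45, 52, 58, 63, 70],
    'income': [np.nan, np.nan, 28000, np.nan, 41000, 52000, 61000, 58000, 64000, 47000]
})

# Compare the mean age of rows WITH income vs rows WITHOUT income.
# If the two groups look very different, the missingness is probably not MCAR.
print(survey.groupby(survey['income'].isnull())['age'].mean())
```

Here the rows missing income skew much younger than the rest, which is exactly the MAR pattern described above. It's not a formal test, but it's a thirty-second sanity check worth running.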
How Pandas Represents Missing Values
Pandas uses two main representations for missing data. NaN — short for "Not a Number" — is a float value used in numeric columns. None is Python's null object, and it shows up in object (string) columns. When you put None into a numeric column, pandas silently converts it to NaN. In practice you'll see both, and detection methods like .isnull() treat them identically.
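A tiny demonstration of the two representations and how .isnull() handles them:

```python
import pandas as pd
import numpy as np

# None in a numeric column is converted to NaN; in an object column it stays as None
s_num = pd.Series([1.5, None, np.nan])   # dtype becomes float64, None -> NaN
s_obj = pd.Series(['a', None, 'b'])      # dtype stays object, None preserved

# .isnull() flags both representations the same way
print(s_num.isnull().tolist())   # [False, True, True]
print(s_obj.isnull().tolist())   # [False, True, False]
```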
But here's the sneaky part — there are also disguised missing values that pandas doesn't flag automatically. Things like the string "N/A", a blank space " ", the number 0 used to mean "no data", or -999 used as a placeholder. These look fine to pandas. Only you can catch them.
The dangerous ones: A column where missing salary is stored as 0 will pass every null check. Your mean salary calculation will be dragged way down. No error. No warning. Just quietly wrong numbers going into your report.
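Here's that trap in miniature — the salary figures below are invented for illustration, but the mechanics are exactly as described:

```python
import pandas as pd
import numpy as np

# Hypothetical HR extract where missing salaries were stored as 0
salaries = pd.Series([52000, 61000, 0, 58000, 0, 49000], name='salary')

print("Null check finds:", salaries.isnull().sum())   # 0 — the zeros pass silently
print("Mean WITH zeros :", salaries.mean())

# Treat 0 as missing, then recompute — the mean jumps substantially
cleaned = salaries.replace(0, np.nan)
print("Mean without them:", cleaned.mean())
```

The null check reports zero problems, yet the two hidden placeholders drag the mean from 55,000 down to about 36,667. That's the "quietly wrong numbers" failure mode in action.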
Detecting Missing Values — The Full Toolkit
The scenario: You've just received a patient health dataset from a hospital. Before any analysis — average age, blood pressure trends, medication effectiveness — you need to know exactly what's missing and how much. Here's the complete detection routine.
import pandas as pd
import numpy as np
# Patient health dataset — realistic mix of NaN, None and present values
patient_df = pd.DataFrame({
'patient_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
'name': ['Alice', 'Bob', None, 'Dave', 'Eve', 'Frank', 'Grace', None, 'Ivy', 'Jack'],
'age': [34, 45, 29, np.nan, 52, 38, np.nan, 41, 27, np.nan],
'blood_type': ['A+', 'B-', 'O+', 'A+', None, 'AB+', 'O-', 'B+', None, 'A-'],
'weight_kg': [68.5, 82.0, 74.3, np.nan, 91.2, 70.0, np.nan, 85.5, 63.0, 77.8],
'smoker': [False, True, False, True, None, False, True, None, False, True],
'diagnosis': ['Healthy', 'Hypertension', 'Healthy', 'Diabetes', 'Healthy',
'Healthy', None, 'Hypertension', 'Healthy', None]
})
# Method 1: .isnull() — True wherever a value is missing
# .sum() counts the True values per column
print("=== Missing value COUNT per column ===")
print(patient_df.isnull().sum())
# Method 2: Calculate missing percentage — more useful than raw count
print("\n=== Missing value PERCENTAGE per column ===")
missing_pct = (patient_df.isnull().sum() / len(patient_df) * 100).round(1)
print(missing_pct)
=== Missing value COUNT per column ===
patient_id    0
name          2
age           3
blood_type    2
weight_kg     2
smoker        2
diagnosis     2
dtype: int64

=== Missing value PERCENTAGE per column ===
patient_id     0.0
name          20.0
age           30.0
blood_type    20.0
weight_kg     20.0
smoker        20.0
diagnosis     20.0
dtype: float64
What just happened?
.isnull() is a pandas method that scans every cell and returns a boolean DataFrame — True where the value is missing, False where it isn't. Chaining .sum() collapses each column into a count of True values.
The age column is 30% missing — that's the most critical one to deal with. Everything else is 20% missing. In a 10-row dataset that's 2–3 records per column. In a 100,000-row dataset that's 20,000 missing values. The percentage framing makes that reality much clearer than a raw count does.
Building a Proper Missing Value Summary
The scenario: Your team lead asks for a clean report on data quality before the project kicks off. Just printing raw counts isn't enough — you want a tidy summary table that shows count, percentage, and dtype all together so anyone can read it at a glance.
import pandas as pd
import numpy as np
# Rebuild the patient dataset
patient_df = pd.DataFrame({
'patient_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
'name': ['Alice', 'Bob', None, 'Dave', 'Eve', 'Frank', 'Grace', None, 'Ivy', 'Jack'],
'age': [34, 45, 29, np.nan, 52, 38, np.nan, 41, 27, np.nan],
'blood_type': ['A+', 'B-', 'O+', 'A+', None, 'AB+', 'O-', 'B+', None, 'A-'],
'weight_kg': [68.5, 82.0, 74.3, np.nan, 91.2, 70.0, np.nan, 85.5, 63.0, 77.8],
'smoker': [False, True, False, True, None, False, True, None, False, True],
'diagnosis': ['Healthy', 'Hypertension', 'Healthy', 'Diabetes', 'Healthy',
'Healthy', None, 'Hypertension', 'Healthy', None]
})
# Build a clean summary table — count, percentage, and dtype side by side
missing_summary = pd.DataFrame({
'missing_count': patient_df.isnull().sum(), # raw count
'missing_pct': (patient_df.isnull().sum() / len(patient_df) * 100).round(1), # percentage
'dtype': patient_df.dtypes # column type
})
# Sort by most missing first — makes it easy to prioritise
missing_summary = missing_summary.sort_values('missing_pct', ascending=False)
# Only show columns that have at least one missing value
print(missing_summary[missing_summary['missing_count'] > 0])
            missing_count  missing_pct    dtype
age                     3         30.0  float64
name                    2         20.0   object
blood_type              2         20.0   object
weight_kg               2         20.0  float64
smoker                  2         20.0   object
diagnosis               2         20.0   object
What just happened?
We built a summary DataFrame by hand — combining .isnull().sum(), percentage calculation, and .dtypes into one tidy table. Then we used .sort_values() — a pandas method — to rank columns by how much data is missing. Most critical problem at the top.
age jumps to the top — it's 30% missing and it's a float64, meaning it's a numeric column that will affect averages and correlations. A missing value in a text column like name is less analytically dangerous than one in a numeric column like age. The dtype column in your summary gives you that context immediately.
Spotting Disguised Missing Values
This is the one pandas can't do for you. Disguised missing values look like valid data — but they're actually placeholders for "we don't know." You need domain knowledge and a suspicious mind to find them.
The scenario: A retail dataset arrives from the legacy ERP system. The developer who built it years ago used 0 for unknown prices, "Unknown" for unknown categories, and -1 for unknown customer ages. None of these will show up in your .isnull() check.
import pandas as pd
import numpy as np
# Legacy retail dataset — disguised missing values hiding as valid-looking data
retail_df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106, 107, 108],
'product_name':['Laptop', 'Mouse', 'Keyboard', 'Unknown', 'Monitor', 'Cable', 'Unknown', 'Headset'],
'price': [999.99, 29.99, 79.99, 0.00, 349.99, 0.00, 15.99, 89.99], # 0 means unknown
'customer_age':[-1, 34, 28, -1, 45, 31, -1, 27], # -1 means age not recorded
'stock': [50, 200, 150, np.nan, 75, 300, 100, np.nan] # real NaN values
})
# Step 1: Standard null check — only catches real NaN/None
print("=== Standard null check ===")
print(retail_df.isnull().sum())
# Step 2: Check for known placeholder values specific to this dataset
print("\n=== Checking for disguised missing values ===")
# Price = 0 is suspicious for a retail product — probably means 'unknown'
zero_price = (retail_df['price'] == 0).sum()
print(f"Prices equal to 0 : {zero_price}")
# -1 in age is a classic legacy system placeholder
neg_age = (retail_df['customer_age'] == -1).sum()
print(f"Ages equal to -1 : {neg_age}")
# 'Unknown' string in product name — sometimes people type this literally
unknown_name = (retail_df['product_name'] == 'Unknown').sum()
print(f"Product names 'Unknown': {unknown_name}")
# Step 3: Replace all disguised values with real NaN so pandas can handle them properly
retail_df['price'] = retail_df['price'].replace(0, np.nan)
retail_df['customer_age'] = retail_df['customer_age'].replace(-1, np.nan)
retail_df['product_name'] = retail_df['product_name'].replace('Unknown', np.nan)
# Now re-run the standard null check — everything surfaces
print("\n=== Null check AFTER replacing disguised values ===")
print(retail_df.isnull().sum())
=== Standard null check ===
product_id      0
product_name    0
price           0
customer_age    0
stock           2
dtype: int64

=== Checking for disguised missing values ===
Prices equal to 0 : 2
Ages equal to -1 : 3
Product names 'Unknown': 2

=== Null check AFTER replacing disguised values ===
product_id      0
product_name    2
price           2
customer_age    3
stock           2
dtype: int64
What just happened?
The first null check found only 2 missing values in stock. After manual investigation and .replace() — a pandas method that swaps specific values for np.nan — the picture changed completely. 7 additional missing values surfaced that were invisible to the standard check.
This is why you always ask the data provider: "Are there any placeholder values I should know about?" If they say no — be suspicious anyway. Legacy systems are full of this. Always look at .value_counts() on suspicious columns and ask yourself whether any values seem like they don't belong.
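The .value_counts() habit mentioned above looks like this in practice — the column below is a hypothetical legacy export, invented to show the pattern:

```python
import pandas as pd

# Hypothetical age column from a legacy export
ages = pd.Series([34, 28, -1, 45, -1, 31, -1, 27, 52, -1], name='customer_age')

# .value_counts() surfaces suspicious repeated values immediately:
# a real age distribution rarely has one value dominating like this
print(ages.value_counts())

# -1 appearing 4 times out of 10 is a strong placeholder signal
print("Share of -1:", (ages == -1).mean())
```

Any single value that repeats far more often than its neighbours — especially 0, -1, -999, 999, or a round sentinel like 9999 — deserves a question to whoever produced the data.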
Visualising the Missing Value Pattern
Numbers are useful. But a visual of which rows and columns are missing, and whether the gaps cluster together, can reveal patterns that a table never would. Here's a mockup of the patient dataset's missing value pattern — in practice you'd generate something similar with a plotting library.
Missing value heatmap — patient_df (✓ = present, ✗ = missing)
| | name | age | blood_type | weight_kg | smoker | diagnosis |
|---|---|---|---|---|---|---|
| Patient 1001 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Patient 1002 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Patient 1003 | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Patient 1004 | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ |
| Patient 1005 | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ |
| Patient 1006 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Patient 1007 | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |
| Patient 1008 | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ |
| Patient 1009 | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
| Patient 1010 | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
The heatmap reveals something a table of counts doesn't: Patient 1007 is missing age, weight_kg, and diagnosis all at once. That row might be worth investigating or dropping entirely — three missing values in one record is a quality flag, not a coincidence.
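The row-wise view behind the heatmap can be computed directly — .isnull().sum() with axis=1 counts gaps per row instead of per column. This sketch reuses the patient dataset from above to flag rows like Patient 1007:

```python
import pandas as pd
import numpy as np

# Patient dataset from earlier in the lesson
patient_df = pd.DataFrame({
    'patient_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'name': ['Alice', 'Bob', None, 'Dave', 'Eve', 'Frank', 'Grace', None, 'Ivy', 'Jack'],
    'age': [34, 45, 29, np.nan, 52, 38, np.nan, 41, 27, np.nan],
    'blood_type': ['A+', 'B-', 'O+', 'A+', None, 'AB+', 'O-', 'B+', None, 'A-'],
    'weight_kg': [68.5, 82.0, 74.3, np.nan, 91.2, 70.0, np.nan, 85.5, 63.0, 77.8],
    'smoker': [False, True, False, True, None, False, True, None, False, True],
    'diagnosis': ['Healthy', 'Hypertension', 'Healthy', 'Diabetes', 'Healthy',
                  'Healthy', None, 'Hypertension', 'Healthy', None]
})

# Count missing values per ROW — axis=1 sums across columns
row_missing = patient_df.isnull().sum(axis=1)

# Rows with 3+ gaps deserve a closer look before any imputation
print(patient_df.loc[row_missing >= 3, ['patient_id']])
```

Only Patient 1007 crosses the three-gap threshold, matching what the heatmap showed at a glance.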
How Much Missing is Too Much?
There's no universal rule. But here's a practical framework most analysts use as a starting point.
Safe to impute (under roughly 5% missing)
Small enough that filling with mean/median/mode won't distort your analysis.
Impute carefully (roughly 5–30% missing)
Consider smarter imputation methods. Check if missingness correlates with other columns.
Proceed with caution (above roughly 30% missing)
Consider dropping the column or flagging it. Any imputation at this level introduces significant bias.
These are guidelines, not gospel. A column that's 40% missing but critical to your business question is still worth keeping — just flag it clearly in any report or model you build with it.
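A framework like this can be turned into a quick triage step with pd.cut. The cut points here (5% and 30%) and the missing_pct values are illustrative assumptions — adjust them to your own domain:

```python
import pandas as pd

# Hypothetical missing-percentage summary for a handful of columns
missing_pct = pd.Series({'age': 3.0, 'income': 18.0, 'notes': 55.0, 'city': 7.5})

# Bucket each column by missing percentage using illustrative thresholds
buckets = pd.cut(missing_pct,
                 bins=[-0.1, 5, 30, 100],
                 labels=['safe to impute', 'impute carefully', 'proceed with caution'])
print(pd.DataFrame({'missing_pct': missing_pct, 'action': buckets}))
```

The point isn't the exact numbers — it's having an explicit, written-down policy so two analysts looking at the same column reach the same decision.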
Teacher's Note
One mistake I see constantly: people drop missing values immediately without thinking. They run .dropna() on the whole dataset and carry on. If your missingness is not random — if the people who didn't answer the income question are systematically different from those who did — then dropping them introduces bias into every analysis you do afterwards. You haven't cleaned your data. You've quietly skewed it.
Always ask why values are missing before deciding what to do with them. We cover the actual handling strategies — dropping, filling, imputing — in the very next lesson.
Practice Questions
1. Which pandas method returns a boolean DataFrame showing True wherever a value is missing?
2. A column uses -1 as a placeholder for unknown ages. To convert all -1 values to NaN, you use df['age'].______(-1, np.nan). What goes in the blank?
3. High-income respondents consistently skip the salary question in a survey. The missingness is related to the value of the missing data itself. This is called ______.
Quiz
1. A column stores the value 0 to represent unknown prices. What is the problem with this?
2. Why is blindly running .dropna() on a dataset potentially dangerous?
3. When building a missing value summary table, why should you sort columns by missing percentage descending?
Up Next · Lesson 7
Handling Missing Values
Now that you can find every gap, it's time to decide what to do about it — drop, fill, or impute. The wrong choice can quietly ruin your analysis.