Handling Missing Values in NumPy
In real-world datasets, missing values are very common. They can appear due to data collection errors, incomplete records, or system issues.
NumPy provides efficient tools to detect, analyze, and handle missing values properly.
What Are Missing Values?
Missing values represent unavailable or undefined data. In NumPy, missing values are usually represented as:
- np.nan: Not a Number (for floating-point data)
- None: object-type missing values
Handling missing values correctly is critical for accurate analysis.
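The two representations behave differently in practice: np.isnan() does not accept object arrays, so None has to be detected another way. The sketch below is one possible approach, not the only one; the names raw, none_mask, and as_float are chosen here for illustration.

```python
import numpy as np

# An array that mixes numbers and None is stored with dtype=object.
raw = np.array([1.5, None, 3.0], dtype=object)

# np.isnan() rejects object arrays, so test for None element by element.
none_mask = np.array([value is None for value in raw])

# Map None to np.nan to get a float array that NaN-aware functions can use.
as_float = np.array(
    [np.nan if value is None else value for value in raw],
    dtype=float,
)

print(none_mask)
print(as_float)
```

Converting to a float array early means the rest of this lesson's techniques apply unchanged.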
Creating an Array with Missing Values
Let us create a NumPy array that contains missing values.
import numpy as np
data = np.array([10, 20, np.nan, 40, np.nan, 60])
print(data)
Output:
[10. 20. nan 40. nan 60.]
Checking for Missing Values
Use np.isnan() to identify missing values.
missing_mask = np.isnan(data)
print(missing_mask)
Output:
[False False  True False  True False]
Each True indicates a missing value.
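A common mistake is to compare against np.nan with ==. Because NaN is not equal to anything, including itself, that comparison never matches. A minimal illustration (the variable names equality_mask and isnan_mask are chosen here):

```python
import numpy as np

data = np.array([10, 20, np.nan, 40, np.nan, 60])

# NaN compares unequal to every value, including NaN itself,
# so an equality test never finds it.
equality_mask = data == np.nan

# np.isnan() is the reliable check for floating-point data.
isnan_mask = np.isnan(data)

print(equality_mask.any())  # False
print(isnan_mask.sum())     # 2
```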
Counting Missing Values
Because True counts as 1, summing the boolean mask with np.sum() gives the number of missing values.
missing_count = np.sum(np.isnan(data))
print(missing_count)
Output:
2
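Beyond the raw count, it is often useful to know what share of the data is missing and where the gaps are. The following sketch shows one way to get both from the same mask; the variable names are illustrative.

```python
import numpy as np

data = np.array([10, 20, np.nan, 40, np.nan, 60])
mask = np.isnan(data)

# Equivalent to np.sum(mask): counts the True entries.
missing_count = np.count_nonzero(mask)

# The mean of a boolean mask is the fraction of True entries (2 out of 6 here).
missing_fraction = mask.mean()

# Indices of the missing entries.
missing_positions = np.flatnonzero(mask)

print(missing_count, missing_fraction, missing_positions)
```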
Removing Missing Values
To remove missing values, use boolean indexing.
clean_data = data[~np.isnan(data)]
print(clean_data)
Output:
[10. 20. 40. 60.]
Boolean indexing returns a new, shorter array that contains only the valid entries; the original data array is left unchanged.
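When other arrays are aligned with data element by element, the same mask should be applied to each of them so the entries stay paired. A small sketch, assuming a hypothetical companion array called labels:

```python
import numpy as np

data = np.array([10, 20, np.nan, 40, np.nan, 60])
# Hypothetical companion array: one label per measurement.
labels = np.array(["a", "b", "c", "d", "e", "f"])

# Build the mask once and reuse it for every aligned array.
keep = ~np.isnan(data)
clean_data = data[keep]
clean_labels = labels[keep]

print(clean_data)
print(clean_labels)
```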
Replacing Missing Values
Instead of removing data, you may want to replace missing values.
Replacing with a Fixed Value
filled_data = np.nan_to_num(data, nan=0)
print(filled_data)
Output:
[10. 20.  0. 40.  0. 60.]
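One detail to be aware of: by default np.nan_to_num() also replaces positive and negative infinity with very large finite numbers. If only NaN should change, assigning through a mask is a possible alternative. The array with_inf below is made up for this illustration.

```python
import numpy as np

# Illustrative array containing both an infinity and a NaN.
with_inf = np.array([1.0, np.inf, np.nan])

# nan_to_num replaces NaN with 0 and, by default, +inf with the
# largest finite float for the dtype.
converted = np.nan_to_num(with_inf, nan=0)

# Mask assignment on a copy changes only the NaN entries.
filled = with_inf.copy()
filled[np.isnan(filled)] = 0

print(converted)
print(filled)
```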
Replacing with Mean Value
A common strategy is replacing missing values with the mean.
mean_value = np.nanmean(data)
filled_mean = np.where(np.isnan(data), mean_value, data)
print(filled_mean)
Output:
[10.  20.  32.5 40.  32.5 60. ]
Mean imputation leaves the array's mean unchanged, but it reduces the variance because every filled entry sits exactly at the center of the data.
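It can be worth checking how a fill strategy affects the spread of the data. The sketch below compares the standard deviation before and after mean filling, and shows np.nanmedian() as an alternative fill value that is less sensitive to outliers. Treat it as an illustration rather than a recommendation for any particular dataset.

```python
import numpy as np

data = np.array([10, 20, np.nan, 40, np.nan, 60])
mask = np.isnan(data)

# Mean fill, as in the lesson.
filled_mean = np.where(mask, np.nanmean(data), data)

# Median fill: same pattern, different statistic.
median_value = np.nanmedian(data)
filled_median = np.where(mask, median_value, data)

# Spread of the observed values versus the mean-filled array.
std_before = np.nanstd(data)
std_after = np.std(filled_mean)

print(median_value)
print(std_before, std_after)
```

The mean-filled array has a smaller standard deviation than the observed values alone, which is why filled data should not be used to estimate variability.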
Handling Missing Values in 2D Arrays
Missing values often appear in tabular data.
matrix = np.array([
[1, 2, np.nan],
[4, np.nan, 6],
[7, 8, 9]
])
print(matrix)
Calculate column-wise means while ignoring missing values:
col_means = np.nanmean(matrix, axis=0)
print(col_means)
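The column means can then be used to fill each gap with the mean of its own column, or rows with gaps can be dropped entirely. A minimal sketch of both options, using the same matrix (the names filled and complete_rows are chosen here):

```python
import numpy as np

matrix = np.array([
    [1, 2, np.nan],
    [4, np.nan, 6],
    [7, 8, 9],
])

# Column means over the valid entries only: 4.0, 5.0, 7.5
col_means = np.nanmean(matrix, axis=0)

# col_means has shape (3,), so it broadcasts across rows and each
# NaN is replaced by the mean of the column it sits in.
filled = np.where(np.isnan(matrix), col_means, matrix)

# Alternative: keep only the rows that contain no NaN at all.
complete_rows = matrix[~np.isnan(matrix).any(axis=1)]

print(filled)
print(complete_rows)
```

Dropping rows is simpler but discards the valid values in those rows; here it keeps only one of the three rows.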
Best Practices
- Always detect missing values first
- Decide whether to remove or replace
- Choose replacement statistics (such as the mean) carefully, since filling changes the variance of the data
- Document how missing values are handled
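These practices can be combined in a small helper that detects first, applies one chosen strategy, and returns a short report that can be logged. The function fill_missing and its strategy names are hypothetical, not part of NumPy, and the sketch assumes 1D floating-point input.

```python
import numpy as np

def fill_missing(values, strategy="mean"):
    """Hypothetical helper: return (result, report) for 1D numeric data."""
    values = np.asarray(values, dtype=float)

    # Detect first.
    mask = np.isnan(values)

    # Then remove or replace, depending on the chosen strategy.
    if strategy == "drop":
        result = values[~mask]
    elif strategy == "mean":
        result = np.where(mask, np.nanmean(values), values)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Record what was done so it can be documented.
    report = {
        "strategy": strategy,
        "n_missing": int(mask.sum()),
        "n_total": int(values.size),
    }
    return result, report

filled, report = fill_missing([10, 20, np.nan, 40, np.nan, 60])
print(filled)
print(report)
```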
Practice Exercise
Task
- Create a NumPy array with missing values
- Count missing values
- Remove missing values
- Replace missing values with the mean
What’s Next?
In the next lesson, you will learn about performance optimization techniques in NumPy to make your code faster and more efficient.