NumPy Lesson 21 – Handling Missing Values | Dataplexa

Handling Missing Values in NumPy

In real-world datasets, missing values are very common. They can appear due to data collection errors, incomplete records, or system issues.

NumPy provides efficient tools to detect, analyze, and handle missing values properly.


What Are Missing Values?

Missing values represent unavailable or undefined data. In NumPy, missing values are usually represented as:

  • np.nan – Not a Number (for floating-point data)
  • None – Python's null object; an array containing it gets dtype=object, and np.isnan() does not work on object arrays

Handling missing values correctly is critical for accurate analysis.
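The difference between the two representations is easiest to see from the resulting dtypes. The snippet below is a small sketch; the variable names are illustrative only.

```python
import numpy as np

# np.nan is a float, so an array containing it is promoted to a float dtype.
float_arr = np.array([1, 2, np.nan])
print(float_arr.dtype)          # float64

# None cannot be stored in a numeric array, so NumPy falls back to dtype=object.
obj_arr = np.array([1, 2, None])
print(obj_arr.dtype)            # object

# NaN never compares equal to anything, including itself,
# which is why == cannot be used to find it.
print(np.nan == np.nan)         # False

# Requesting a float dtype up front turns None into nan,
# which makes the data usable with np.isnan() and the nan* functions.
converted = np.array([1, 2, None], dtype=float)
print(converted)                # [ 1.  2. nan]
```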


Creating an Array with Missing Values

Let us create a NumPy array that contains missing values.

import numpy as np

data = np.array([10, 20, np.nan, 40, np.nan, 60])
print(data)

Output:

[10. 20. nan 40. nan 60.]

Checking for Missing Values

Use np.isnan() to identify missing values.

missing_mask = np.isnan(data)
print(missing_mask)

Output:

[False False  True False  True False]

Each True indicates a missing value.
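If you need the positions of the missing entries rather than the mask itself, one possible approach is np.flatnonzero(), as sketched below (missing_idx is just an illustrative name).

```python
import numpy as np

data = np.array([10, 20, np.nan, 40, np.nan, 60])
missing_mask = np.isnan(data)

# Indices at which the mask is True, i.e. where values are missing
missing_idx = np.flatnonzero(missing_mask)
print(missing_idx)              # [2 4]

# Quick yes/no check before running any statistics
print(missing_mask.any())       # True
```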


Counting Missing Values

Because True counts as 1 when summed, applying np.sum() to the mask gives the number of missing values.

missing_count = np.sum(np.isnan(data))
print(missing_count)

Output:

2
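The same mask also gives the share of missing values, which is often more informative than the raw count. A minimal sketch, using np.count_nonzero() as an alternative to np.sum():

```python
import numpy as np

data = np.array([10, 20, np.nan, 40, np.nan, 60])
mask = np.isnan(data)

# count_nonzero counts the True entries in the mask
missing_count = np.count_nonzero(mask)
print(missing_count)                    # 2

# The mean of a boolean mask is the fraction of True entries
missing_fraction = mask.mean()
print(f"{missing_fraction:.1%}")        # 33.3%
```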

Removing Missing Values

To remove missing values, use boolean indexing with the inverted mask (~ flips True and False, so only the valid entries are selected).

clean_data = data[~np.isnan(data)]
print(clean_data)

Output:

[10. 20. 40. 60.]

This returns a new, shorter array that contains only the valid entries. The original data array is left unchanged.
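On a 2D array, the same indexing flattens the result to one dimension, which is rarely what you want for tabular data. A common alternative, sketched here with an illustrative table array, is to drop every row that contains at least one missing value.

```python
import numpy as np

table = np.array([
    [1, 2, np.nan],
    [4, 5, 6],
    [7, np.nan, 9],
])

# Element-wise filtering loses the 2D shape
flat = table[~np.isnan(table)]
print(flat)                     # [1. 2. 4. 5. 6. 7. 9.]

# Keep only the rows in which no value is NaN
rows_ok = ~np.isnan(table).any(axis=1)
complete_rows = table[rows_ok]
print(complete_rows)            # [[4. 5. 6.]]
```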


Replacing Missing Values

Instead of removing data, you may want to replace missing values.

Replacing with a Fixed Value

filled_data = np.nan_to_num(data, nan=0)
print(filled_data)

Output:

[10. 20.  0. 40.  0. 60.]
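Two details are worth knowing about np.nan_to_num(): the nan keyword requires NumPy 1.17 or later, and by default the function also replaces infinities with very large finite numbers. If you only want to touch NaN, assigning through a mask is one alternative. The values array below is made up for illustration.

```python
import numpy as np

values = np.array([1.0, np.nan, np.inf])

# nan_to_num replaces NaN with 0.0 and, by default, +inf with the
# largest representable float
converted = np.nan_to_num(values, nan=0.0)
print(np.isfinite(converted).all())     # True

# Mask assignment on a copy changes only the NaN entries
filled = values.copy()
filled[np.isnan(filled)] = 0.0
print(filled)                           # [ 1.  0. inf]
```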

Replacing with Mean Value

A common strategy is replacing missing values with the mean.

mean_value = np.nanmean(data)
filled_mean = np.where(np.isnan(data), mean_value, data)
print(filled_mean)

Output:

[10.  20.  32.5 40.  32.5 60. ]

This preserves the mean of the observed values, but it reduces the variance, because every filled entry sits exactly at the mean.
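Mean imputation has a side effect that is easy to check: the mean stays the same, while the spread shrinks. The sketch below compares the statistics of the observed values with those of the filled array.

```python
import numpy as np

data = np.array([10, 20, np.nan, 40, np.nan, 60])
mean_value = np.nanmean(data)
filled_mean = np.where(np.isnan(data), mean_value, data)

# The mean is unchanged by the fill
print(np.nanmean(data), filled_mean.mean())

# The standard deviation drops, because the filled entries
# add no deviation from the mean
std_observed = np.nanstd(data)
std_filled = filled_mean.std()
print(std_filled < std_observed)        # True
```

If the data contains outliers, np.nanmedian() can be used in place of np.nanmean() in the same pattern.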


Handling Missing Values in 2D Arrays

Missing values often appear in tabular data.

matrix = np.array([
    [1, 2, np.nan],
    [4, np.nan, 6],
    [7, 8, 9]
])

print(matrix)

Output:

[[ 1.  2. nan]
 [ 4. nan  6.]
 [ 7.  8.  9.]]

Calculate column-wise means while ignoring missing values:

col_means = np.nanmean(matrix, axis=0)
print(col_means)

Output:

[4.  5.  7.5]
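The column means can also serve as fill values. In the sketch below, np.where() broadcasts the 1D array of means across the rows, so each missing entry receives the mean of its own column. Note that a column consisting entirely of NaN would produce a RuntimeWarning and a nan mean, which this example does not handle.

```python
import numpy as np

matrix = np.array([
    [1, 2, np.nan],
    [4, np.nan, 6],
    [7, 8, 9],
])

# One mean per column, ignoring NaN
col_means = np.nanmean(matrix, axis=0)
print(col_means)                # [4.  5.  7.5]

# col_means has shape (3,) and is broadcast along the rows,
# so position [i, j] is filled with col_means[j]
filled_matrix = np.where(np.isnan(matrix), col_means, matrix)
print(filled_matrix)
```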

Best Practices

  • Always detect missing values first
  • Decide whether to remove or replace
  • Use NaN-aware functions such as np.nanmean() and np.nansum(), because np.mean() and np.sum() return nan if any value is missing
  • Document how missing values are handled
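One way to put these practices together is a small helper that detects first, applies an explicitly chosen strategy, and returns the number of affected values so it can be recorded. The function handle_missing() and its strategy parameter are hypothetical and not part of NumPy; this is a sketch for 1D float data only.

```python
import numpy as np

def handle_missing(arr, strategy="drop"):
    """Hypothetical helper: return (result, number_of_missing_values)."""
    arr = np.asarray(arr, dtype=float)
    mask = np.isnan(arr)                # detect first
    n_missing = int(mask.sum())

    if strategy == "drop":
        result = arr[~mask]             # remove missing entries
    elif strategy == "mean":
        result = np.where(mask, np.nanmean(arr), arr)   # replace with the mean
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return result, n_missing

data = np.array([10, 20, np.nan, 40, np.nan, 60])

cleaned, n_dropped = handle_missing(data, strategy="drop")
print(n_dropped, cleaned)

filled, n_filled = handle_missing(data, strategy="mean")
print(n_filled, filled)
```

Returning the count alongside the result makes it easy to log how many values were removed or replaced.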

Practice Exercise

Task

  • Create a NumPy array with missing values
  • Count missing values
  • Remove missing values
  • Replace missing values with the mean

What’s Next?

In the next lesson, you will learn about performance optimization techniques in NumPy to make your code faster and more efficient.