
Data Cleaning in R

In this lesson, you will learn one of the most important skills in data analysis: data cleaning.

Real-world data is rarely perfect. It often contains missing values, duplicates, incorrect formats, or inconsistent entries. Cleaning the data first makes the analysis that follows more accurate and reliable.


Why Is Data Cleaning Important?

If data is not cleaned properly, analysis results can be misleading or incorrect.

Data cleaning helps you:

  • Improve data quality
  • Remove errors and inconsistencies
  • Prepare data for analysis and visualization
  • Make models more accurate

Understanding Missing Values

In R, missing values are represented using NA.

Missing values can occur due to data entry errors or incomplete information.

data <- c(10, 20, NA, 40)
data

Checking for Missing Values

Before cleaning, you should identify missing values in your data.

is.na(data)

To count missing values:

sum(is.na(data))
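
The same checks extend to data frames. The following is a minimal sketch, assuming a small made-up vector `scores` and data frame `survey` (neither is part of the lesson's data), that locates missing values and counts them per column:

```r
# Hypothetical example data, invented for illustration
scores <- c(10, 20, NA, 40)
survey <- data.frame(
  name = c("Alex", "John", NA),
  age  = c(25, NA, NA)
)

# Positions of the missing values in a vector
na_positions <- which(is.na(scores))
na_positions

# is.na() on a data frame returns a logical matrix,
# so colSums() counts the missing values in each column
na_per_column <- colSums(is.na(survey))
na_per_column

# anyNA() is a quick yes/no check
anyNA(scores)
```

Counting per column is usually more informative than a single total, because it shows which variables need attention.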

Removing Missing Values

You can remove missing values using the na.omit() function.

clean_data <- na.omit(data)
clean_data

For a vector, this drops the NA elements. For a data frame, it drops every row that contains at least one missing value.
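
As a point of comparison, here is a small sketch of two alternatives to na.omit() on a vector. The variable names `kept` and `omitted` are illustrative only:

```r
data <- c(10, 20, NA, 40)

# Logical indexing keeps only the non-missing elements
# and returns a plain numeric vector
kept <- data[!is.na(data)]
kept

# na.omit() returns the same values but also records
# which positions were dropped in the "na.action" attribute
omitted <- na.omit(data)
attr(omitted, "na.action")

# Many summary functions can skip NA directly
mean(data, na.rm = TRUE)
```

If you only need a summary such as a mean or a sum, passing na.rm = TRUE avoids creating a cleaned copy at all.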


Replacing Missing Values

Instead of removing data, you may want to replace missing values.

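A common approach is replacing missing values with the mean of the non-missing values. The na.rm = TRUE argument tells mean() to ignore NA; without it, the mean itself would be NA.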

data[is.na(data)] <- mean(data, na.rm = TRUE)
data
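
The same idea can be applied to one column of a data frame. This is a sketch under assumed data: the data frame `people` and the flag column `age_imputed` are hypothetical names chosen for illustration.

```r
# Hypothetical data frame, invented for illustration
people <- data.frame(
  name = c("Alex", "John", "Maria", "Sam"),
  age  = c(25, NA, 31, 40)
)

# Mean of the observed ages: (25 + 31 + 40) / 3 = 32
age_mean <- mean(people$age, na.rm = TRUE)

# Record which rows are about to be filled in, then overwrite them
people$age_imputed <- is.na(people$age)
people$age[is.na(people$age)] <- age_mean
people
```

Keeping a flag column makes it possible to tell observed values from filled-in ones later. When a column contains outliers, median(x, na.rm = TRUE) is a common alternative to the mean.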

Removing Duplicate Rows

Duplicate data can affect summaries and analysis.

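R provides the duplicated() function to identify duplicates. It returns TRUE for every element of a vector (or row of a data frame) that repeats an earlier one, so the first occurrence is not marked.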

values <- c(1, 2, 2, 3, 4, 4)
duplicated(values)

To remove duplicates, keeping the first occurrence of each value:

unique(values)
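
Both functions also work row by row on data frames. A minimal sketch, assuming a made-up data frame `orders` with one repeated row:

```r
# Hypothetical data frame with one repeated row
orders <- data.frame(
  id     = c(1, 2, 2, 3),
  amount = c(50, 75, 75, 20)
)

# duplicated() marks the second and later copies of a row
dup_rows <- duplicated(orders)
dup_rows

# Keep only the rows that are not marked as duplicates
orders_unique <- orders[!dup_rows, ]
orders_unique

# Check for duplicates on a single column only
duplicated(orders$id)
```

Indexing with !duplicated() gives the same rows as unique(orders), but the logical vector can also be used to inspect the duplicates before dropping them.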

Fixing Incorrect Data Types

Sometimes numeric data is imported as text.

Convert such data to the correct type before analysis, because arithmetic and summaries such as mean() do not work on character vectors.

values <- c("10", "20", "30")
values <- as.numeric(values)
values
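
Conversion does not always succeed cleanly. The sketch below, using a made-up vector `raw`, shows what happens when an entry cannot be parsed as a number and how to find such entries. It also shows the usual way to convert a factor, which is an assumption about how your data might have been imported:

```r
# Hypothetical imported column with one non-numeric entry
raw <- c("10", "20", "abc", "30")

# as.numeric() returns NA (with a warning) for text it cannot parse;
# suppressWarnings() silences the warning because we check for NA ourselves
converted <- suppressWarnings(as.numeric(raw))
converted

# Entries that were present before conversion but are NA afterwards
failed <- raw[is.na(converted) & !is.na(raw)]
failed

# Factors store integer codes, so convert through character first
f <- factor(c("10", "20", "30"))
as.numeric(as.character(f))
```

Calling as.numeric() directly on a factor returns its internal codes (1, 2, 3 here) rather than the values it displays, which is why the as.character() step matters.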

Cleaning Data Frames

Data cleaning is often done on data frames.

Below is a simple example of cleaning a data frame.

df <- data.frame(
  name = c("Alex", "John", "Alex", NA),
  age = c(25, NA, 25, 30)
)

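# Drop rows with any missing value (rows 2 and 4)
df <- na.omit(df)
# Drop repeated rows, keeping the first occurrence
df <- unique(df)
df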
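
Dropping every incomplete row can discard more data than intended. The following sketch reuses the same example data to show complete.cases(), a more selective filter on a single column, and a simple before-and-after row count. The variable names are illustrative:

```r
df <- data.frame(
  name = c("Alex", "John", "Alex", NA),
  age  = c(25, NA, 25, 30)
)

# complete.cases() is TRUE for rows with no missing values
complete_rows <- complete.cases(df)
complete_rows

# Keep rows where name is present, even if age is missing
named <- df[!is.na(df$name), ]
nrow(named)

# Track how many rows the full cleaning removes
n_before  <- nrow(df)
n_cleaned <- nrow(unique(na.omit(df)))
cat("Rows before:", n_before, "after:", n_cleaned, "\n")
```

Here the full cleaning keeps only one of the four rows, which is worth knowing before deciding whether to remove or replace missing values.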

Best Practices for Data Cleaning

Good data cleaning habits save time and prevent mistakes.

  • Always inspect data before analysis
  • Handle missing values carefully
  • Remove duplicates where necessary
  • Check data types after import
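
The checklist above can be sketched as a short inspection routine. The data frame `imported` is hypothetical and stands in for a freshly loaded file:

```r
# Hypothetical data standing in for a freshly imported file
imported <- data.frame(
  name = c("Alex", "John", "Alex", NA),
  age  = c("25", NA, "25", "30"),  # numbers stored as text
  stringsAsFactors = FALSE
)

str(imported)                          # column types and a preview of values
summary(imported)                      # per-column summaries
na_counts <- colSums(is.na(imported))  # missing values per column
na_counts
n_dups <- sum(duplicated(imported))    # number of repeated rows
n_dups
col_types <- sapply(imported, class)   # type of each column
col_types
```

Running checks like these before any cleaning step shows which of the techniques in this lesson the data actually needs.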

📝 Practice Exercises


Exercise 1

Create a vector with missing values and remove them.

Exercise 2

Replace missing values in a numeric vector with the mean.

Exercise 3

Create a vector with duplicate values and remove duplicates.

Exercise 4

Convert a character vector of numbers into numeric type.


✅ Practice Answers


Answer 1

values <- c(5, NA, 10, NA)
na.omit(values)

Answer 2

values <- c(10, 20, NA, 40)
values[is.na(values)] <- mean(values, na.rm = TRUE)
values

Answer 3

numbers <- c(1, 2, 2, 3, 4)
unique(numbers)

Answer 4

chars <- c("5", "10", "15")
as.numeric(chars)

What’s Next?

In the next lesson, you will learn how to explore and summarize data to understand patterns and trends.

Exploratory data analysis is the foundation of meaningful insights.