Data Cleaning in R
In this lesson, you will learn one of the most important skills in data analysis: data cleaning.
Real-world data is rarely perfect. It often contains missing values, duplicates, incorrect formats, or inconsistent entries. Cleaning the data first makes the analysis that follows more accurate and reliable.
Why Is Data Cleaning Important?
If data is not cleaned properly, analysis results can be misleading or incorrect.
Data cleaning helps you:
- Improve data quality
- Remove errors and inconsistencies
- Prepare data for analysis and visualization
- Make models more accurate
Understanding Missing Values
In R, missing values are represented using NA.
Missing values can occur due to data entry errors or incomplete information.
data <- c(10, 20, NA, 40)
data
Checking for Missing Values
Before cleaning, you should identify missing values in your data.
is.na(data)
To count missing values:
sum(is.na(data))
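The same check extends to data frames. The sketch below assumes a small, made-up data frame named survey (it is not part of this lesson's data) and counts the missing values in each column.

```r
# Hypothetical data frame used only for illustration
survey <- data.frame(
  id    = c(1, 2, 3, 4),
  score = c(90, NA, 75, NA),
  city  = c("Oslo", "Lima", NA, "Cairo")
)

# is.na() on a data frame returns a logical matrix;
# colSums() adds up the TRUE values in each column
na_per_column <- colSums(is.na(survey))
na_per_column

# anyNA() gives a quick yes/no answer for the whole object
anyNA(survey)
```

Counting per column shows where the gaps are concentrated, which helps when deciding whether to remove or replace them.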
Removing Missing Values
You can remove missing values using the na.omit() function.
clean_data <- na.omit(data)
clean_data
For a vector like the one above, na.omit() drops the NA elements. For a data frame, it drops every row that contains at least one missing value.
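One detail worth knowing: na.omit() attaches an extra attribute that records which positions were dropped. If a plain vector is preferred, logical indexing with is.na() is a possible alternative. The variable names omitted and kept below are chosen only for this sketch.

```r
data <- c(10, 20, NA, 40)

# na.omit() drops the NA but records the removed position
# in an "na.action" attribute
omitted <- na.omit(data)
attributes(omitted)

# Logical indexing keeps only the non-missing elements
# and returns a plain numeric vector
kept <- data[!is.na(data)]
kept
```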
Replacing Missing Values
Instead of removing data, you may want to replace missing values.
A common approach for numeric data is to replace each missing value with the mean of the observed values.
data[is.na(data)] <- mean(data, na.rm = TRUE)
data
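The mean is sensitive to extreme values, so the median is sometimes used instead. This is a variation on the approach above, not a recommendation from this lesson. The sketch works on a copy, named imputed here for illustration, so the original vector stays unchanged.

```r
data <- c(10, 20, NA, 40)

# Work on a copy so the original values stay available
imputed <- data

# median() also accepts na.rm = TRUE
imputed[is.na(imputed)] <- median(imputed, na.rm = TRUE)
imputed
```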
Removing Duplicate Rows
Duplicate data can affect summaries and analysis.
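R provides the duplicated() function to identify duplicates. It returns TRUE for each element that repeats an earlier one, so the first occurrence is never flagged.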
values <- c(1, 2, 2, 3, 4, 4)
duplicated(values)
To remove duplicates and keep one copy of each value:
unique(values)
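The example above uses a vector, but duplicated() also works row by row on a data frame. The sketch below assumes a made-up data frame named people and keeps only the rows that are not flagged.

```r
# Hypothetical data frame with one repeated row
people <- data.frame(
  name = c("Alex", "John", "Alex"),
  age  = c(25, 31, 25)
)

# duplicated() flags a row only if an identical row appeared earlier
duplicated(people)

# Keep the rows that are not flagged
deduped <- people[!duplicated(people), ]
deduped
```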
Fixing Incorrect Data Types
Sometimes numeric data is imported as text.
You must convert data into the correct type for analysis.
values <- c("10", "20", "30")
values <- as.numeric(values)
values
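Conversion does not always go this smoothly. When an entry cannot be parsed as a number, as.numeric() returns NA for it and emits a warning. The sketch below uses a made-up vector named raw to show this, along with one possible fix for thousands separators.

```r
raw <- c("10", "20", "abc", "1,500")

# Entries that cannot be parsed become NA (R also emits a warning,
# silenced here with suppressWarnings())
converted <- suppressWarnings(as.numeric(raw))
converted

# Strip thousands separators before converting;
# fixed = TRUE treats "," as a literal character
no_commas <- gsub(",", "", raw, fixed = TRUE)
converted2 <- suppressWarnings(as.numeric(no_commas))
converted2
```

It is a good habit to count the NA values after a conversion, since new ones point to entries that need a closer look.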
Cleaning Data Frames
Data cleaning is often done on data frames.
Below is a simple example of cleaning a data frame.
df <- data.frame(
  name = c("Alex", "John", "Alex", NA),
  age  = c(25, NA, 25, 30)
)
df <- na.omit(df)
df <- unique(df)
df
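To see what each step removes, it can help to keep the intermediate results and compare row counts. The sketch below repeats the same cleaning with complete.cases(), which selects the same rows as na.omit() here. The names complete_rows, no_missing, and cleaned are chosen only for illustration.

```r
df <- data.frame(
  name = c("Alex", "John", "Alex", NA),
  age  = c(25, NA, 25, 30)
)

# complete.cases() is TRUE for rows with no missing values
complete_rows <- complete.cases(df)
complete_rows

# Keep only the complete rows
no_missing <- df[complete_rows, ]

# Drop the repeated "Alex" row
cleaned <- unique(no_missing)

# Row counts before and after each step
c(original = nrow(df), no_missing = nrow(no_missing), cleaned = nrow(cleaned))
```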
Best Practices for Data Cleaning
Good data cleaning habits save time and prevent mistakes.
- Always inspect data before analysis
- Handle missing values carefully
- Remove duplicates where necessary
- Check data types after import
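A quick inspection pass might look like the sketch below. It assumes a made-up data frame named imported that stands in for freshly loaded data, and it uses only base R functions.

```r
# Hypothetical data frame standing in for freshly imported data
imported <- data.frame(
  id    = c("1", "2", "2", "4"),
  score = c(88, 75, 75, NA),
  stringsAsFactors = FALSE
)

str(imported)               # structure: id was read as character
sapply(imported, class)     # the class of each column
summary(imported)           # per-column summaries, including NA counts
sum(duplicated(imported))   # number of fully duplicated rows
```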
📝 Practice Exercises
Exercise 1
Create a vector with missing values and remove them.
Exercise 2
Replace missing values in a numeric vector with the mean.
Exercise 3
Create a vector with duplicate values and remove duplicates.
Exercise 4
Convert a character vector of numbers into numeric type.
✅ Practice Answers
Answer 1
values <- c(5, NA, 10, NA)
na.omit(values)
Answer 2
values <- c(10, 20, NA, 40)
values[is.na(values)] <- mean(values, na.rm = TRUE)
values
Answer 3
numbers <- c(1, 2, 2, 3, 4)
unique(numbers)
Answer 4
chars <- c("5", "10", "15")
as.numeric(chars)
What’s Next?
In the next lesson, you will learn how to explore and summarize data to understand patterns and trends.
Exploratory data analysis is the foundation of meaningful insights.