Exploratory Data Analysis (EDA) in R
In this lesson, you will learn how to explore and understand data before applying advanced analysis or modeling techniques.
Exploratory Data Analysis (EDA) helps you discover patterns, detect anomalies, and gain insights using simple summaries and visual checks.
What Is Exploratory Data Analysis?
Exploratory Data Analysis is the process of examining data to understand its main characteristics.
Instead of jumping directly into predictions or models, EDA allows you to ask basic questions about the data and get meaningful answers.
Why Is EDA Important?
EDA plays a critical role in data analysis because it helps you:
- Understand data structure and size
- Identify missing or incorrect values
- Detect outliers and unusual patterns
- Choose appropriate analysis techniques
Viewing the Data
The first step in EDA is simply looking at the data.
R provides several functions to quickly inspect datasets.
head(data)
tail(data)
By default, head() shows the first six rows and tail() shows the last six. Pass a second argument, such as head(data, 10), to change the number of rows.
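As a minimal sketch, the example below uses the built-in mtcars dataset as a stand-in for data (an assumption made for illustration; any data frame would work) and asks for three rows instead of the default.

```r
# Assumption: the built-in mtcars dataset stands in for `data`.
data <- mtcars

first_rows <- head(data, n = 3)  # first three rows instead of the default six
last_rows  <- tail(data, n = 3)  # last three rows

print(first_rows)
print(last_rows)
```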
Understanding Data Structure
Knowing the structure of the dataset helps identify column types and formats.
str(data)
This displays the number of observations and variables, followed by each column's name, data type, and first few values.
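If you only need the column types, one possible shortcut is to apply class() to every column. The sketch below assumes the built-in iris dataset as a stand-in for data.

```r
# Assumption: the built-in iris dataset stands in for `data`.
data <- iris

# Full structure: dimensions, then one line per column.
str(data)

# Compact alternative: a named vector with the class of each column.
column_types <- sapply(data, class)
print(column_types)
```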
Summary Statistics
Summary statistics provide a quick overview of numeric and categorical variables.
summary(data)
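For numeric columns, this shows the minimum, first quartile, median, mean, third quartile, and maximum. Factor columns get a count for each level, and columns with missing values also report the number of NAs.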
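To see both kinds of output side by side, here is a small sketch that assumes the built-in iris dataset as a stand-in for data. It has four numeric columns and one factor column.

```r
# Assumption: the built-in iris dataset stands in for `data`.
data <- iris

# Whole data frame: numeric columns get quartiles and the mean,
# the factor column Species gets a count per level.
print(summary(data))

# A single numeric column returns six named values.
length_summary <- summary(data$Sepal.Length)
print(names(length_summary))
```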
Checking Dataset Dimensions
Knowing how many rows and columns a dataset has helps you confirm that it loaded completely and judge how demanding later analysis will be.
dim(data)
nrow(data)
ncol(data)
dim() returns both counts as a vector (rows first, then columns), while nrow() and ncol() return each count on its own.
Exploring Individual Columns
You can analyze individual columns to understand distributions and values.
mean(data$age)
median(data$age)
range(data$age)
This helps identify unusual or extreme values. If the column contains missing values, these functions return NA unless you pass na.rm = TRUE.
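The sketch below shows how a single missing value affects these functions. The ages are made up for illustration and are not from any real dataset.

```r
# Hypothetical data: five ages, one of them missing.
data <- data.frame(age = c(23, 35, NA, 41, 31))

# With the default settings, one NA makes the result NA.
print(mean(data$age))

# na.rm = TRUE drops missing values before calculating.
mean_age   <- mean(data$age, na.rm = TRUE)
median_age <- median(data$age, na.rm = TRUE)
age_range  <- range(data$age, na.rm = TRUE)

print(mean_age)    # 32.5
print(median_age)  # 33
print(age_range)   # 23 41
```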
Frequency Tables for Categorical Data
Categorical variables can be explored using frequency counts.
table(data$gender)
This shows how many times each category appears. Missing values are left out of the counts by default.
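Counts are often easier to read alongside proportions, and it can help to include missing values in the table. The following sketch uses a made-up gender column for illustration.

```r
# Hypothetical data: six records, one with a missing gender.
data <- data.frame(gender = c("F", "M", "F", NA, "F", "M"))

# useNA = "ifany" adds a count for NA when missing values exist.
counts <- table(data$gender, useNA = "ifany")
print(counts)

# prop.table() converts counts (here without NA) into proportions.
proportions <- prop.table(table(data$gender))
print(proportions)
```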
Detecting Missing Values
EDA also involves checking for missing values.
sum(is.na(data))
This returns the total number of missing values across all columns. Knowing how much data is missing helps you decide on a cleaning strategy.
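A total count does not tell you where the gaps are. As a sketch, the example below breaks the count down by column, assuming the built-in airquality dataset as a stand-in for data because it contains missing values.

```r
# Assumption: the built-in airquality dataset stands in for `data`.
data <- airquality

# Total number of missing values in the whole data frame.
total_missing <- sum(is.na(data))

# Missing values per column.
missing_per_column <- colSums(is.na(data))

# Number of rows with no missing values at all.
complete_rows <- sum(complete.cases(data))

print(total_missing)
print(missing_per_column)
print(complete_rows)
```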
Identifying Outliers
Outliers are values that are unusually high or low compared to the rest of the data.
A simple way to inspect outliers is by using summary statistics.
summary(data$salary)
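A minimum or maximum that lies far from the quartiles suggests an outlier. Outliers may indicate data-entry errors or genuinely important observations, so investigate them before removing anything.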
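One common rule of thumb, which this lesson does not prescribe, flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to made-up salaries. Both the data and the 1.5 multiplier are assumptions chosen for illustration.

```r
# Hypothetical data: six typical salaries and one extreme value.
data <- data.frame(
  salary = c(42000, 45000, 47000, 50000, 52000, 55000, 250000)
)

# First and third quartiles of the column.
q <- quantile(data$salary, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Fences: 1.5 * IQR below the first and above the third quartile.
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged for a closer look.
outliers <- data$salary[data$salary < lower | data$salary > upper]
print(outliers)
```

Flagged values are candidates for inspection, not automatic deletions.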
EDA Workflow Example
A basic EDA process often follows these steps:
- Load the dataset
- Inspect structure and size
- Check summaries and missing values
- Explore individual variables
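The steps above can be sketched as one short script. It assumes the built-in airquality dataset in place of a file you would load yourself, for example with read.csv(). The column choices are illustrative only.

```r
# Step 1: load the dataset.
# Assumption: built-in airquality stands in for your own file.
data <- airquality

# Step 2: inspect structure and size.
str(data)
dimensions <- dim(data)
print(dimensions)

# Step 3: check summaries and missing values.
print(summary(data))
missing_per_column <- colSums(is.na(data))
print(missing_per_column)

# Step 4: explore individual variables.
mean_temp <- mean(data$Temp, na.rm = TRUE)  # numeric column
month_counts <- table(data$Month)           # month treated as a category
print(mean_temp)
print(month_counts)
```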
📝 Practice Exercises
Exercise 1
Load a dataset and display the first and last five rows.
Exercise 2
Check the structure and dimensions of the dataset.
Exercise 3
Generate summary statistics for all columns.
Exercise 4
Identify missing values in the dataset.
✅ Practice Answers
Answer 1
head(data, 5)
tail(data, 5)
Answer 2
str(data)
dim(data)
Answer 3
summary(data)
Answer 4
sum(is.na(data))
What’s Next?
In the next lesson, you will learn how to manipulate and transform data efficiently using modern R tools.
This will allow you to prepare data for deeper analysis and visualization.