dplyr: Data Manipulation in R

In this lesson, you will learn how to manipulate and transform data using the dplyr package.

dplyr makes working with data frames simple, readable, and efficient. It is one of the most widely used tools in real-world R data analysis.


What Is dplyr?

dplyr is an R package, part of the tidyverse, designed to help you work with structured data such as data frames.

It provides easy-to-understand functions for selecting columns, filtering rows, sorting data, creating new variables, and summarizing information.


Loading the dplyr Package

Before using dplyr, you must install it once and then load it in every new R session.

install.packages("dplyr")
library(dplyr)
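Because installation is only needed once, scripts often guard the install step. The following is a minimal sketch of that pattern, assuming a CRAN mirror is reachable if the package is missing; requireNamespace() is a base R function that reports whether a package is installed without attaching it.

```r
# Install dplyr only when it is not already available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load dplyr for the current session
library(dplyr)
```

This way, rerunning the script does not download the package again.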

Once loaded, dplyr functions are available for use.


Using a Sample Dataset

Let’s create a simple data frame to understand dplyr operations.

data <- data.frame(
  name = c("Alex", "Emma", "John", "Sophia"),
  age = c(25, 30, 28, 35),
  score = c(88, 92, 79, 95)
)

data

Selecting Columns with select()

The select() function is used to choose specific columns from a data frame.

This is useful when you want to focus only on relevant information.

select(data, name, score)
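Besides listing the columns to keep, select() can also drop columns or match them by name. The sketch below rebuilds the lesson's sample data frame and shows both forms; the variable names without_age and s_columns are only illustrative.

```r
library(dplyr)

# Same sample data frame as in the lesson
data <- data.frame(
  name = c("Alex", "Emma", "John", "Sophia"),
  age = c(25, 30, 28, 35),
  score = c(88, 92, 79, 95)
)

# Drop a column by prefixing its name with a minus sign
without_age <- select(data, -age)

# Keep columns whose names start with "s" using the starts_with() helper
s_columns <- select(data, starts_with("s"))

names(without_age)  # "name" "score"
names(s_columns)    # "score"
```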

Filtering Rows with filter()

The filter() function is used to extract rows that meet a condition.

This helps narrow down data based on rules.

filter(data, age > 28)
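Conditions can also be combined. As a small sketch on the same sample data, separating conditions with a comma keeps rows that satisfy all of them, while the | operator keeps rows that satisfy at least one. The result names both and either are only illustrative.

```r
library(dplyr)

# Same sample data frame as in the lesson
data <- data.frame(
  name = c("Alex", "Emma", "John", "Sophia"),
  age = c(25, 30, 28, 35),
  score = c(88, 92, 79, 95)
)

# Comma-separated conditions must all be TRUE (logical AND)
both <- filter(data, age > 26, score >= 90)

# The | operator keeps rows where either condition is TRUE (logical OR)
either <- filter(data, age < 26 | score > 94)

both$name    # "Emma" "Sophia"
either$name  # "Alex" "Sophia"
```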

Sorting Data with arrange()

The arrange() function sorts rows based on one or more columns.

By default, sorting is done in ascending order.

arrange(data, score)

To sort in descending order:

arrange(data, desc(score))
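Sorting by more than one column works the same way: later columns break ties in earlier ones. The sample data has no repeated values, so this sketch adds a hypothetical team column (not part of the lesson's dataset) purely to illustrate the idea.

```r
library(dplyr)

# Sample data plus a hypothetical 'team' column added for illustration
data_with_team <- data.frame(
  name = c("Alex", "Emma", "John", "Sophia"),
  team = c("A", "B", "A", "B"),
  score = c(88, 92, 79, 95)
)

# Sort by team first, then by score from highest to lowest within each team
sorted <- arrange(data_with_team, team, desc(score))

sorted$name  # "Alex" "John" "Sophia" "Emma"
```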

Creating New Columns with mutate()

The mutate() function allows you to create new columns based on existing data.

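This is commonly used for calculations and transformations. Like the other dplyr functions, mutate() returns a new data frame and leaves the original unchanged, so assign the result to a variable if you want to keep it.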

mutate(data, bonus = score + 5)
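A single mutate() call can create several columns, and a later column may refer to one created earlier in the same call. The sketch below illustrates this on the sample data; the passed column and its threshold of 90 are assumptions made up for the example.

```r
library(dplyr)

# Same sample data frame as in the lesson
data <- data.frame(
  name = c("Alex", "Emma", "John", "Sophia"),
  age = c(25, 30, 28, 35),
  score = c(88, 92, 79, 95)
)

# 'passed' uses 'bonus', which is created earlier in the same call
with_bonus <- mutate(
  data,
  bonus = score + 5,
  passed = bonus >= 90   # hypothetical pass mark, for illustration only
)

with_bonus$bonus   # 93 97 84 100
with_bonus$passed  # TRUE TRUE FALSE TRUE

# The original data frame still has only its three columns
names(data)        # "name" "age" "score"
```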

Summarizing Data with summarise()

The summarise() function calculates summary statistics.

For an ungrouped data frame, it reduces the dataset to a single row of values. For grouped data, it returns one row per group.

summarise(data, average_score = mean(score))
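Several statistics can be computed in one summarise() call. Here is a brief sketch on the sample data that also uses n(), a dplyr helper that counts rows; the column names in the result are only illustrative.

```r
library(dplyr)

# Same sample data frame as in the lesson
data <- data.frame(
  name = c("Alex", "Emma", "John", "Sophia"),
  age = c(25, 30, 28, 35),
  score = c(88, 92, 79, 95)
)

# Each named argument becomes one column in the one-row result
stats <- summarise(
  data,
  average_score = mean(score),
  max_score = max(score),
  n_rows = n()
)

stats$average_score  # 88.5
stats$max_score      # 95
stats$n_rows         # 4
```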

Grouping Data with group_by()

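The group_by() function groups rows that share the same value in one or more columns.

It is usually combined with summarise(), which then returns one row per group. In the sample data every age is unique, so each group in the example below contains a single row and avg_score equals that row's score. Grouping becomes more useful when a column has repeated values.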

grouped_data <- group_by(data, age)
summarise(grouped_data, avg_score = mean(score))
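To see grouping with more than one row per group, the following sketch adds a hypothetical team column to the sample data (an assumption for illustration, not part of the lesson's dataset) and computes one summary row per team.

```r
library(dplyr)

# Sample data plus a hypothetical 'team' column added for illustration
data_with_team <- data.frame(
  name = c("Alex", "Emma", "John", "Sophia"),
  team = c("A", "B", "A", "B"),
  score = c(88, 92, 79, 95)
)

# Group rows by team, then summarise each group separately
grouped <- group_by(data_with_team, team)
team_summary <- summarise(
  grouped,
  avg_score = mean(score),
  members = n()
)

team_summary$team       # "A" "B"
team_summary$avg_score  # 83.5 93.5
team_summary$members    # 2 2
```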

The Pipe Operator (%>%)

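The pipe operator allows you to chain multiple operations together. dplyr makes %>% available as soon as the package is loaded.

It passes the result on its left as the first argument of the function on its right, so a sequence of steps reads from top to bottom instead of as nested calls. R 4.1 and later also include a built-in pipe, |>, which behaves similarly for simple chains like the one below.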

data %>%
  filter(score > 85) %>%
  select(name, score)
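To make the readability benefit concrete, here is a sketch of a longer chain on the sample data, followed by the same steps written as nested calls. Both forms are intended to give the same result; the variable names top_bonus and nested are only illustrative.

```r
library(dplyr)

# Same sample data frame as in the lesson
data <- data.frame(
  name = c("Alex", "Emma", "John", "Sophia"),
  age = c(25, 30, 28, 35),
  score = c(88, 92, 79, 95)
)

# Piped version: each step reads in the order it is applied
top_bonus <- data %>%
  filter(score > 85) %>%
  mutate(bonus = score + 5) %>%
  arrange(desc(bonus)) %>%
  select(name, bonus)

# Nested version of the same steps: must be read from the inside out
nested <- select(
  arrange(mutate(filter(data, score > 85), bonus = score + 5), desc(bonus)),
  name, bonus
)

top_bonus$name   # "Sophia" "Emma" "Alex"
top_bonus$bonus  # 100 97 93
```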

Why dplyr Is Important

dplyr simplifies complex data operations into readable steps.

It is widely used in data analysis, reporting, and data science projects.


📝 Practice Exercises


Exercise 1

Select only the name and age columns from the dataset.

Exercise 2

Filter rows where the score is greater than 90.

Exercise 3

Create a new column that increases age by 1.

Exercise 4

Calculate the average score for the dataset.


✅ Practice Answers


Answer 1

select(data, name, age)

Answer 2

filter(data, score > 90)

Answer 3

mutate(data, age_next = age + 1)

Answer 4

summarise(data, mean_score = mean(score))

What’s Next?

In the next lesson, you will learn how to reshape and organize data using the tidyr package.

This will help you prepare data for analysis and visualization.