Data Merging | Dataplexa

Data Merging in R

In this lesson, you will learn how to combine multiple datasets in R.

In real-world projects, data rarely comes in a single file. You often need to merge information from different sources such as sales data, customer data, or survey results.


What Is Data Merging?

Data merging is the process of combining two or more datasets based on a common column.

This common column is usually called a key and is present in all datasets being merged.

Examples of keys include:

  • Employee ID
  • Customer ID
  • Order number

Sample Datasets

Let us start with two simple data frames.

employees <- data.frame(
  emp_id = c(1, 2, 3, 4),
  name = c("Alex", "Brian", "Chris", "Diana")
)

salaries <- data.frame(
  emp_id = c(1, 2, 4),
  salary = c(50000, 60000, 55000)
)

employees
salaries

The merge() Function

R provides the merge() function to combine datasets.

By default, it performs an inner join.

merge(employees, salaries, by = "emp_id")

This keeps only the rows where the key exists in both datasets.


Inner Join

An inner join returns only matching rows from both tables.

If a record does not exist in one of the datasets, it is excluded.


Left Join

A left join keeps all rows from the first dataset and adds matching data from the second.

Missing values are filled with NA.

merge(
  employees,
  salaries,
  by = "emp_id",
  all.x = TRUE
)

Right Join

A right join keeps all rows from the second dataset.

Rows missing in the first dataset are filled with NA.

merge(
  employees,
  salaries,
  by = "emp_id",
  all.y = TRUE
)

Full Outer Join

A full join keeps all rows from both datasets.

If a match is not found, missing values are filled with NA.

merge(
  employees,
  salaries,
  by = "emp_id",
  all = TRUE
)

Merging on Different Column Names

Sometimes, key columns have different names in different datasets.

You can specify them using by.x and by.y.

merge(
  employees,
  salaries,
  by.x = "emp_id",
  by.y = "emp_id"
)

Why Data Merging Is Important

  • Combines related information from different sources
  • Creates complete datasets for analysis
  • Essential for reporting and dashboards
  • Used heavily in business and research workflows

📝 Practice Exercises


Exercise 1

Perform an inner join between the two datasets.

Exercise 2

Perform a left join and identify missing values.

Exercise 3

Perform a full outer join and observe the result.

Exercise 4

Explain the difference between inner join and left join.


✅ Practice Answers


Answer 1

merge(employees, salaries, by = "emp_id")

Answer 2

merge(employees, salaries, by = "emp_id", all.x = TRUE)

Answer 3

merge(employees, salaries, by = "emp_id", all = TRUE)

Answer 4

An inner join returns only matching records, while a left join keeps all records from the first dataset and fills missing values with NA.


What’s Next?

In the next lesson, you will learn about Data Visualization Overview in R.

This will help you visually understand patterns and insights in your data.