Merging DataFrames in Pandas
In real-world data analysis, information is often spread across multiple datasets. To perform meaningful analysis, these datasets must be combined.
In this lesson, you will learn how to merge DataFrames in Pandas using different join techniques.
Why Do We Merge DataFrames?
Merging allows you to:
- Combine related data from different sources
- Add missing details to a dataset
- Build complete analytical tables
This is very common in business, analytics, and data science projects.
Loading the Main Dataset
We continue using the Pandas sales dataset.
import pandas as pd
sales = pd.read_csv("dataplexa_pandas_sales.csv")
Creating a Second DataFrame
Let’s create a second DataFrame containing product category information.
product_info = pd.DataFrame({
"product": ["Laptop", "Phone", "Tablet"],
"category": ["Electronics", "Electronics", "Electronics"]
})
This table contains additional information not present in the sales data.
Basic Merge Using merge()
The merge() function is used to combine DataFrames.
Let’s merge the sales data with product information using the
product column.
merged_df = pd.merge(
sales,
product_info,
on="product"
)
merged_df.head()
Understanding Join Types
Pandas supports several types of joins:
- Inner Join – keeps matching rows only
- Left Join – keeps all left DataFrame rows
- Right Join – keeps all right DataFrame rows
- Outer Join – keeps all rows from both DataFrames
Inner Join
An inner join returns rows that exist in both DataFrames.
pd.merge(
sales,
product_info,
on="product",
how="inner"
)
Left Join
A left join keeps all rows from the left DataFrame and adds matching data from the right DataFrame.
pd.merge(
sales,
product_info,
on="product",
how="left"
)
Right Join
A right join keeps all rows from the right DataFrame.
pd.merge(
sales,
product_info,
on="product",
how="right"
)
Outer Join
An outer join keeps all rows from both DataFrames.
Missing values are filled with NaN.
pd.merge(
sales,
product_info,
on="product",
how="outer"
)
Merging on Different Column Names
Sometimes column names differ between DataFrames.
Example:
pd.merge(
sales,
product_info,
left_on="product",
right_on="product"
)
Merging on Index
You can also merge using DataFrame indexes.
sales.set_index("product").merge(
product_info.set_index("product"),
left_index=True,
right_index=True
)
Common Merge Issues
- Duplicate column names
- Missing keys
- Unexpected NaN values
Always inspect your merged DataFrame using head() and info().
Practice Exercise
Try the following:
- Create a second DataFrame with region descriptions
- Merge it with the sales dataset
- Try inner, left, and outer joins
What’s Next?
In the next lesson, you will learn about concatenating DataFrames, which allows you to stack datasets vertically or horizontally.