Data Science Lesson 10 – Merging & Joining | Dataplexa

Data Science · Lesson 10

Merging & Joining

Master pandas merge operations to combine multiple DataFrames into unified datasets for comprehensive analysis.

Identify Common Columns

Choose Join Type

Execute Merge

Validate Results

Imagine you have customer data in one spreadsheet and order data in another. Your CEO asks for revenue by customer age group tomorrow morning. You need to combine both datasets. This is exactly what data merging solves — connecting separate DataFrames based on shared information. Honestly, merging is underrated until you realize 80% of real data science projects involve combining data from multiple sources. Sales from Salesforce, customer demographics from your CRM, product info from inventory systems. Nothing lives in isolation.

Why Merge Data?

Real business data splits across systems. Your e-commerce platform stores order details. Your payment processor tracks transaction status. Customer service logs complaints. Each system captures different pieces of the same customer journey.

Complete Picture

Combine customer demographics with purchase behavior for targeted marketing campaigns

Cross-System Analysis

Link inventory data with sales data to identify which products drive the highest margins

Time Series Enrichment

Add external data like weather or holidays to explain seasonal sales patterns

Data Validation

Cross-reference financial records with operational data to catch discrepancies early

The scenario: Flipkart's data team needs to analyze customer lifetime value. They have order history in one table and customer demographics in another. Without merging, they can't segment customers effectively.

import pandas as pd

# Create sample customer data
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'customer_age': [25, 34, 29, 42, 31],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Pune']
})

print("Customer Data:")
print(customers.head())

Customer Data:
   customer_id  customer_age  gender       city
0          101            25    Male     Mumbai
1          102            34  Female      Delhi
2          103            29    Male  Bangalore
3          104            42  Female    Chennai
4          105            31    Male       Pune

What just happened?

We created a customer DataFrame with customer_id as the unique identifier. This will be our key column for joining. Try this: Create your own DataFrame with different customer IDs to see how merging works.

Now we create order data with the same customer IDs but different information:

# Create order data - notice some customer_ids overlap
orders = pd.DataFrame({
    'order_id': [2001, 2002, 2003, 2004, 2005, 2006],
    'customer_id': [101, 102, 103, 106, 107, 101],
    'product_category': ['Electronics', 'Clothing', 'Books', 'Food', 'Home', 'Electronics'],
    'revenue': [45000, 2800, 1200, 850, 15600, 32000]
})

print("Order Data:")
print(orders.head())

Order Data:
   order_id  customer_id product_category  revenue
0      2001          101     Electronics    45000
1      2002          102        Clothing     2800
2      2003          103           Books     1200
3      2004          106            Food      850
4      2005          107            Home    15600
5      2006          101     Electronics    32000

What just happened?

Notice customers 106 and 107 exist in orders but not in customers. Customer 101 has two orders. This mismatch is exactly why we need different join types. Try this: Count unique customers in each DataFrame.

Types of Joins

Coming from SQL? Pandas uses identical terminology. Four main join types solve different business questions. The 90% of cases use inner and left joins — the other 10% trips everyone up when they need outer joins for data auditing.

Join Type	Records Kept	Business Use Case	Pandas Parameter
Inner Join	Only matching records	Active customers with recent orders	`how='inner'`
Left Join	All left + matching right	All customers, with/without orders	`how='left'`
Right Join	All right + matching left	All orders, even from unknown customers	`how='right'`
Outer Join	All records from both	Complete audit of all data	`how='outer'`

Why does this matter? Because your business question determines the join type. "Which customers haven't ordered recently?" needs a left join. "Show me revenue for existing customers only" needs inner join.

Inner Join: Only the Matches

Inner join keeps only customers who appear in both DataFrames. Perfect for analyzing active customers with known demographics.

# Inner join - only customers with orders AND known demographics
inner_result = pd.merge(customers, orders, on='customer_id', how='inner')

print("Inner Join Result:")
print(inner_result)
print(f"\nRecords: {len(inner_result)} (from {len(customers)} customers, {len(orders)} orders)")

Inner Join Result:
   customer_id  customer_age  gender       city  order_id product_category  revenue
0          101            25    Male     Mumbai      2001     Electronics    45000
1          101            25    Male     Mumbai      2006     Electronics    32000
2          102            34  Female      Delhi      2002        Clothing     2800
3          103            29    Male  Bangalore      2003           Books     1200

Records: 4 (from 5 customers, 6 orders)

What just happened?

We lost customers 104 and 105 (no orders) and orders from customers 106 and 107 (no demographics). Customer 101 appears twice because they have two orders. Try this: Count total revenue in this result.

Notice customer 101 appears twice — once per order. This is expected behavior when one customer has multiple orders.

Left Join: Keep All Customers

Left join keeps every customer, even those without orders. Missing values fill with NaN. Perfect for identifying customers who haven't purchased recently.

# Left join - keep all customers, add order info where available
left_result = pd.merge(customers, orders, on='customer_id', how='left')

print("Left Join Result:")
print(left_result)
print(f"\nCustomers without orders:")
print(left_result[left_result['order_id'].isna()][['customer_id', 'city', 'customer_age']])

Left Join Result:
   customer_id  customer_age  gender       city  order_id product_category  revenue
0          101            25    Male     Mumbai    2001.0     Electronics  45000.0
1          101            25    Male     Mumbai    2006.0     Electronics  32000.0
2          102            34  Female      Delhi    2002.0        Clothing   2800.0
3          103            29    Male  Bangalore    2003.0           Books   1200.0
4          104            42  Female    Chennai       NaN             NaN      NaN
5          105            31    Male       Pune       NaN             NaN      NaN

Customers without orders:
   customer_id       city  customer_age
4          104    Chennai            42
5          105       Pune            31

What just happened?

All 5 customers remain, but customers 104 and 105 show NaN for order columns. The second query filters for customers with no orders — perfect for re-engagement campaigns. Try this: Calculate the percentage of customers without orders.

📊 Data Insight

40% of customers (2 out of 5) haven't made any orders. This suggests either new customer acquisition without conversion or customer churn that needs immediate attention.

Right & Outer Joins

Right join keeps all orders, even from unknown customers. Outer join keeps everything. Both useful for data auditing and finding system gaps.

# Right join - keep all orders, add customer info where available
right_result = pd.merge(customers, orders, on='customer_id', how='right')

print("Right Join - Orders from unknown customers:")
unknown_customers = right_result[right_result['customer_age'].isna()]
print(unknown_customers[['customer_id', 'order_id', 'product_category', 'revenue']])

Right Join - Orders from unknown customers:
   customer_id  order_id product_category  revenue
3          106      2004             Food      850
4          107      2005             Home    15600

# Outer join - everything from both DataFrames
outer_result = pd.merge(customers, orders, on='customer_id', how='outer')

print(f"Outer Join Summary:")
print(f"Total records: {len(outer_result)}")
print(f"Customers without orders: {outer_result['order_id'].isna().sum()}")
print(f"Orders from unknown customers: {outer_result['customer_age'].isna().sum()}")
print(f"Complete records: {len(outer_result) - outer_result['order_id'].isna().sum() - outer_result['customer_age'].isna().sum()}")

Outer Join Summary:
Total records: 8
Customers without orders: 2
Orders from unknown customers: 2
Complete records: 4

What just happened?

Right join revealed orders worth ₹16,450 from customers not in our database — potential data sync issues. Outer join shows the complete picture: 8 total records with data gaps on both sides. Try this: Calculate what percentage of total revenue comes from unknown customers.

Advanced Merge Scenarios

Real data is messier. Different column names, multiple join keys, handling duplicates. The scenario: Zomato wants to analyze delivery performance but restaurant data uses restaurant_id while delivery data uses rest_id.

# Different column names for the same data
restaurant_info = pd.DataFrame({
    'restaurant_id': [501, 502, 503],
    'cuisine_type': ['Indian', 'Chinese', 'Italian'],
    'city': ['Mumbai', 'Delhi', 'Bangalore']
})

delivery_data = pd.DataFrame({
    'rest_id': [501, 501, 502, 503, 503],
    'delivery_time': [25, 32, 18, 45, 28],
    'order_value': [650, 420, 890, 1200, 580]
})

print("Restaurant Info:")
print(restaurant_info)

Restaurant Info:
   restaurant_id cuisine_type       city
0            501       Indian     Mumbai
1            502      Chinese      Delhi
2            503      Italian  Bangalore

# Merge with different column names using left_on and right_on
merged_restaurants = pd.merge(
    restaurant_info, 
    delivery_data, 
    left_on='restaurant_id', 
    right_on='rest_id', 
    how='left'
)

print("Merged Restaurant Performance:")
print(merged_restaurants.head())

# Calculate average delivery time by cuisine
avg_delivery = merged_restaurants.groupby('cuisine_type')['delivery_time'].mean()
print(f"\nAverage delivery time by cuisine:")
print(avg_delivery)

Merged Restaurant Performance:
   restaurant_id cuisine_type       city  rest_id  delivery_time  order_value
0            501       Indian     Mumbai      501             25          650
1            501       Indian     Mumbai      501             32          420
2            502      Chinese      Delhi      502             18          890
3            503      Italian  Bangalore      503             45         1200
4            503      Italian  Bangalore      503             28          580

Average delivery time by cuisine:
cuisine_type
Chinese    18.0
Indian     28.5
Italian    36.5
Name: delivery_time, dtype: float64

What just happened?

We used left_on and right_on to join columns with different names. Both original ID columns remain in the result. Chinese restaurants deliver fastest (18 min), Italian slowest (36.5 min). Try this: Drop the redundant rest_id column.

Common Mistake: Forgetting to Drop Redundant Columns

When using left_on and right_on, both columns remain. Always drop one: merged_df.drop('rest_id', axis=1)

Multiple Key Joins

Sometimes you need multiple columns to uniquely identify records. The scenario: HDFC Bank analyzes transactions but needs both account number and transaction date to avoid duplicates.

# Multi-key join example - join on multiple columns
account_details = pd.DataFrame({
    'account_id': [1001, 1002, 1003, 1001, 1002],
    'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16', '2023-01-16'],
    'customer_age': [28, 35, 42, 28, 35],
    'account_type': ['Savings', 'Current', 'Savings', 'Savings', 'Current']
})

transaction_fees = pd.DataFrame({
    'account_id': [1001, 1002, 1003, 1001],
    'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16'],
    'fee_charged': [50, 100, 25, 50],
    'transaction_count': [3, 8, 1, 2]
})

# Join on both account_id AND date
complete_data = pd.merge(
    account_details, 
    transaction_fees, 
    on=['account_id', 'date'], 
    how='left'
)

print("Multi-key Join Result:")
print(complete_data)

Multi-key Join Result:
   account_id        date  customer_age account_type  fee_charged  transaction_count
0        1001  2023-01-15            28      Savings         50.0                3.0
1        1002  2023-01-15            35      Current        100.0                8.0
2        1003  2023-01-15            42      Savings         25.0                1.0
3        1001  2023-01-16            28      Savings         50.0                2.0
4        1002  2023-01-16            35      Current          NaN                NaN

What just happened?

Account 1002 on 2023-01-16 shows NaN because no transaction fees exist for that specific account-date combination. This prevents incorrect matching across different dates. Try this: Count how many account-date combinations have missing fee data.

Inner join is most restrictive, outer join most inclusive - choose based on your analysis needs

The chart shows how different join types affect your dataset size. Inner join gives you clean, complete records but loses data. Outer join preserves everything but requires handling missing values. Business impact: If you're calculating customer lifetime value, inner join gives accurate numbers but underestimates total customers. Left join reveals the full customer base including inactive ones.

Pro Tip: Always check your join results with len() before and after. If numbers don't make sense, you probably chose the wrong join type or have duplicate keys.

Quiz

Up Next

Univariate Analysis

Now that you can combine datasets, learn to analyze individual variables to uncover patterns and distributions in your merged data.

← Previous Course Index Next →