Data Science Lesson 10 – Merging & Joining | Dataplexa
Data Science · Lesson 10

Merging & Joining

Master pandas merge operations to combine multiple DataFrames into unified datasets for comprehensive analysis.

1
Identify Common Columns
2
Choose Join Type
3
Execute Merge
4
Validate Results
Imagine you have customer data in one spreadsheet and order data in another. Your CEO asks for revenue by customer age group tomorrow morning. You need to combine both datasets. This is exactly what data merging solves — connecting separate DataFrames based on shared information. Honestly, merging is underrated until you realize 80% of real data science projects involve combining data from multiple sources. Sales from Salesforce, customer demographics from your CRM, product info from inventory systems. Nothing lives in isolation.

Why Merge Data?

Real business data splits across systems. Your e-commerce platform stores order details. Your payment processor tracks transaction status. Customer service logs complaints. Each system captures different pieces of the same customer journey.

Complete Picture

Combine customer demographics with purchase behavior for targeted marketing campaigns

Cross-System Analysis

Link inventory data with sales data to identify which products drive the highest margins

Time Series Enrichment

Add external data like weather or holidays to explain seasonal sales patterns

Data Validation

Cross-reference financial records with operational data to catch discrepancies early

The scenario: Flipkart's data team needs to analyze customer lifetime value. They have order history in one table and customer demographics in another. Without merging, they can't segment customers effectively.
import pandas as pd

# Create sample customer data
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'customer_age': [25, 34, 29, 42, 31],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Pune']
})

print("Customer Data:")
print(customers.head())

What just happened?

We created a customer DataFrame with customer_id as the unique identifier. This will be our key column for joining. Try this: Create your own DataFrame with different customer IDs to see how merging works.

Now we create order data with the same customer IDs but different information:
# Create order data - notice some customer_ids overlap
orders = pd.DataFrame({
    'order_id': [2001, 2002, 2003, 2004, 2005, 2006],
    'customer_id': [101, 102, 103, 106, 107, 101],
    'product_category': ['Electronics', 'Clothing', 'Books', 'Food', 'Home', 'Electronics'],
    'revenue': [45000, 2800, 1200, 850, 15600, 32000]
})

print("Order Data:")
print(orders.head())

What just happened?

Notice customers 106 and 107 exist in orders but not in customers. Customer 101 has two orders. This mismatch is exactly why we need different join types. Try this: Count unique customers in each DataFrame.

Types of Joins

Coming from SQL? Pandas uses identical terminology. Four main join types solve different business questions. The 90% of cases use inner and left joins — the other 10% trips everyone up when they need outer joins for data auditing.
Join Type Records Kept Business Use Case Pandas Parameter
Inner Join Only matching records Active customers with recent orders how='inner'
Left Join All left + matching right All customers, with/without orders how='left'
Right Join All right + matching left All orders, even from unknown customers how='right'
Outer Join All records from both Complete audit of all data how='outer'
Why does this matter? Because your business question determines the join type. "Which customers haven't ordered recently?" needs a left join. "Show me revenue for existing customers only" needs inner join.

Inner Join: Only the Matches

Inner join keeps only customers who appear in both DataFrames. Perfect for analyzing active customers with known demographics.
# Inner join - only customers with orders AND known demographics
inner_result = pd.merge(customers, orders, on='customer_id', how='inner')

print("Inner Join Result:")
print(inner_result)
print(f"\nRecords: {len(inner_result)} (from {len(customers)} customers, {len(orders)} orders)")

What just happened?

We lost customers 104 and 105 (no orders) and orders from customers 106 and 107 (no demographics). Customer 101 appears twice because they have two orders. Try this: Count total revenue in this result.

Notice customer 101 appears twice — once per order. This is expected behavior when one customer has multiple orders.

Left Join: Keep All Customers

Left join keeps every customer, even those without orders. Missing values fill with NaN. Perfect for identifying customers who haven't purchased recently.
# Left join - keep all customers, add order info where available
left_result = pd.merge(customers, orders, on='customer_id', how='left')

print("Left Join Result:")
print(left_result)
print(f"\nCustomers without orders:")
print(left_result[left_result['order_id'].isna()][['customer_id', 'city', 'customer_age']])

What just happened?

All 5 customers remain, but customers 104 and 105 show NaN for order columns. The second query filters for customers with no orders — perfect for re-engagement campaigns. Try this: Calculate the percentage of customers without orders.

📊 Data Insight

40% of customers (2 out of 5) haven't made any orders. This suggests either new customer acquisition without conversion or customer churn that needs immediate attention.

Right & Outer Joins

Right join keeps all orders, even from unknown customers. Outer join keeps everything. Both useful for data auditing and finding system gaps.
# Right join - keep all orders, add customer info where available
right_result = pd.merge(customers, orders, on='customer_id', how='right')

print("Right Join - Orders from unknown customers:")
unknown_customers = right_result[right_result['customer_age'].isna()]
print(unknown_customers[['customer_id', 'order_id', 'product_category', 'revenue']])
# Outer join - everything from both DataFrames
outer_result = pd.merge(customers, orders, on='customer_id', how='outer')

print(f"Outer Join Summary:")
print(f"Total records: {len(outer_result)}")
print(f"Customers without orders: {outer_result['order_id'].isna().sum()}")
print(f"Orders from unknown customers: {outer_result['customer_age'].isna().sum()}")
print(f"Complete records: {len(outer_result) - outer_result['order_id'].isna().sum() - outer_result['customer_age'].isna().sum()}")

What just happened?

Right join revealed orders worth ₹16,450 from customers not in our database — potential data sync issues. Outer join shows the complete picture: 8 total records with data gaps on both sides. Try this: Calculate what percentage of total revenue comes from unknown customers.

Advanced Merge Scenarios

Real data is messier. Different column names, multiple join keys, handling duplicates. The scenario: Zomato wants to analyze delivery performance but restaurant data uses restaurant_id while delivery data uses rest_id.
# Different column names for the same data
restaurant_info = pd.DataFrame({
    'restaurant_id': [501, 502, 503],
    'cuisine_type': ['Indian', 'Chinese', 'Italian'],
    'city': ['Mumbai', 'Delhi', 'Bangalore']
})

delivery_data = pd.DataFrame({
    'rest_id': [501, 501, 502, 503, 503],
    'delivery_time': [25, 32, 18, 45, 28],
    'order_value': [650, 420, 890, 1200, 580]
})

print("Restaurant Info:")
print(restaurant_info)
# Merge with different column names using left_on and right_on
merged_restaurants = pd.merge(
    restaurant_info, 
    delivery_data, 
    left_on='restaurant_id', 
    right_on='rest_id', 
    how='left'
)

print("Merged Restaurant Performance:")
print(merged_restaurants.head())

# Calculate average delivery time by cuisine
avg_delivery = merged_restaurants.groupby('cuisine_type')['delivery_time'].mean()
print(f"\nAverage delivery time by cuisine:")
print(avg_delivery)

What just happened?

We used left_on and right_on to join columns with different names. Both original ID columns remain in the result. Chinese restaurants deliver fastest (18 min), Italian slowest (36.5 min). Try this: Drop the redundant rest_id column.

Common Mistake: Forgetting to Drop Redundant Columns

When using left_on and right_on, both columns remain. Always drop one: merged_df.drop('rest_id', axis=1)

Multiple Key Joins

Sometimes you need multiple columns to uniquely identify records. The scenario: HDFC Bank analyzes transactions but needs both account number and transaction date to avoid duplicates.
# Multi-key join example - join on multiple columns
account_details = pd.DataFrame({
    'account_id': [1001, 1002, 1003, 1001, 1002],
    'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16', '2023-01-16'],
    'customer_age': [28, 35, 42, 28, 35],
    'account_type': ['Savings', 'Current', 'Savings', 'Savings', 'Current']
})

transaction_fees = pd.DataFrame({
    'account_id': [1001, 1002, 1003, 1001],
    'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16'],
    'fee_charged': [50, 100, 25, 50],
    'transaction_count': [3, 8, 1, 2]
})

# Join on both account_id AND date
complete_data = pd.merge(
    account_details, 
    transaction_fees, 
    on=['account_id', 'date'], 
    how='left'
)

print("Multi-key Join Result:")
print(complete_data)

What just happened?

Account 1002 on 2023-01-16 shows NaN because no transaction fees exist for that specific account-date combination. This prevents incorrect matching across different dates. Try this: Count how many account-date combinations have missing fee data.

Inner join is most restrictive, outer join most inclusive - choose based on your analysis needs

The chart shows how different join types affect your dataset size. Inner join gives you clean, complete records but loses data. Outer join preserves everything but requires handling missing values. Business impact: If you're calculating customer lifetime value, inner join gives accurate numbers but underestimates total customers. Left join reveals the full customer base including inactive ones.

Pro Tip: Always check your join results with len() before and after. If numbers don't make sense, you probably chose the wrong join type or have duplicate keys.

Quiz

1. Flipkart wants to find customers who haven't ordered in the last 30 days for a re-engagement campaign. They have a customers DataFrame and a recent_orders DataFrame. Which join type should they use?


2. When using pd.merge(df1, df2, left_on='restaurant_id', right_on='rest_id'), what happens to the join columns?


3. You have a customers DataFrame and orders DataFrame, both with customer_id and date columns. You want to join them only when both customer_id AND date match exactly. How do you do this?


Up Next

Univariate Analysis

Now that you can combine datasets, learn to analyze individual variables to uncover patterns and distributions in your merged data.