Data Science
Merging & Joining
Master pandas merge operations to combine multiple DataFrames into unified datasets for comprehensive analysis.
Why Merge Data?
Real business data splits across systems. Your e-commerce platform stores order details. Your payment processor tracks transaction status. Customer service logs complaints. Each system captures a different piece of the same customer journey.
Complete Picture
Combine customer demographics with purchase behavior for targeted marketing campaigns
Cross-System Analysis
Link inventory data with sales data to identify which products drive the highest margins
Time Series Enrichment
Add external data like weather or holidays to explain seasonal sales patterns
Data Validation
Cross-reference financial records with operational data to catch discrepancies early
import pandas as pd
# Create sample customer data
customers = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'customer_age': [25, 34, 29, 42, 31],
'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Pune']
})
print("Customer Data:")
print(customers.head())

Customer Data:
   customer_id  customer_age  gender       city
0          101            25    Male     Mumbai
1          102            34  Female      Delhi
2          103            29    Male  Bangalore
3          104            42  Female    Chennai
4          105            31    Male       Pune
What just happened?
We created a customer DataFrame with customer_id as the unique identifier. This will be our key column for joining. Try this: Create your own DataFrame with different customer IDs to see how merging works.
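Before any merge, it is worth confirming that the key column really is unique on the dimension side. A minimal sketch, rebuilding the sample data so it runs on its own:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'customer_age': [25, 34, 29, 42, 31],
})

# A join key should uniquely identify each row in this table
is_unique = customers['customer_id'].is_unique
print(is_unique)  # True
```

If `is_unique` is False, a later merge can silently multiply rows.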
# Create order data - notice some customer_ids overlap
orders = pd.DataFrame({
'order_id': [2001, 2002, 2003, 2004, 2005, 2006],
'customer_id': [101, 102, 103, 106, 107, 101],
'product_category': ['Electronics', 'Clothing', 'Books', 'Food', 'Home', 'Electronics'],
'revenue': [45000, 2800, 1200, 850, 15600, 32000]
})
print("Order Data:")
print(orders)

Order Data:
   order_id  customer_id product_category  revenue
0      2001          101      Electronics    45000
1      2002          102         Clothing     2800
2      2003          103            Books     1200
3      2004          106             Food      850
4      2005          107             Home    15600
5      2006          101      Electronics    32000
What just happened?
Notice customers 106 and 107 exist in orders but not in customers. Customer 101 has two orders. This mismatch is exactly why we need different join types. Try this: Count unique customers in each DataFrame.
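The "Try this" above can be done with `nunique()`. A minimal sketch, rebuilding just the key columns so it runs standalone:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [101, 102, 103, 104, 105]})
orders = pd.DataFrame({'customer_id': [101, 102, 103, 106, 107, 101]})

# Count distinct customers on each side of the future join
unique_in_customers = customers['customer_id'].nunique()
unique_in_orders = orders['customer_id'].nunique()
print(unique_in_customers, unique_in_orders)  # 5 5
```

Both sides have 5 distinct customers, but only 3 of them (101, 102, 103) appear in both.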
Types of Joins
Coming from SQL? Pandas uses identical terminology. Four main join types answer different business questions. Inner and left joins cover roughly 90% of cases; the remaining 10%, typically data audits, is where outer joins come in.

| Join Type | Records Kept | Business Use Case | Pandas Parameter |
|---|---|---|---|
| Inner Join | Only matching records | Active customers with recent orders | how='inner' |
| Left Join | All left + matching right | All customers, with/without orders | how='left' |
| Right Join | All right + matching left | All orders, even from unknown customers | how='right' |
| Outer Join | All records from both | Complete audit of all data | how='outer' |
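The four `how` values in the table can be compared directly by running each join on the same pair of DataFrames and counting the rows. A minimal sketch using the customer and order samples from this lesson:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [101, 102, 103, 104, 105],
                          'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Pune']})
orders = pd.DataFrame({'order_id': [2001, 2002, 2003, 2004, 2005, 2006],
                       'customer_id': [101, 102, 103, 106, 107, 101]})

# Row count of each join type on the same data
counts = {how: len(pd.merge(customers, orders, on='customer_id', how=how))
          for how in ['inner', 'left', 'right', 'outer']}
print(counts)  # {'inner': 4, 'left': 6, 'right': 6, 'outer': 8}
```

Same data, four different result sizes: that difference is the whole decision.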
Inner Join: Only the Matches
Inner join keeps only customers who appear in both DataFrames. Perfect for analyzing active customers with known demographics.
# Inner join - only customers with orders AND known demographics
inner_result = pd.merge(customers, orders, on='customer_id', how='inner')
print("Inner Join Result:")
print(inner_result)
print(f"\nRecords: {len(inner_result)} (from {len(customers)} customers, {len(orders)} orders)")

Inner Join Result:
   customer_id  customer_age  gender       city  order_id product_category  revenue
0          101            25    Male     Mumbai      2001      Electronics    45000
1          101            25    Male     Mumbai      2006      Electronics    32000
2          102            34  Female      Delhi      2002         Clothing     2800
3          103            29    Male  Bangalore      2003            Books     1200

Records: 4 (from 5 customers, 6 orders)
What just happened?
We lost customers 104 and 105 (no orders) and orders from customers 106 and 107 (no demographics). Customer 101 appears twice because they have two orders. Try this: Count total revenue in this result.
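The revenue "Try this" is a one-line sum on the joined result. A standalone sketch with just the columns needed:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [101, 102, 103, 104, 105]})
orders = pd.DataFrame({'customer_id': [101, 102, 103, 106, 107, 101],
                       'revenue': [45000, 2800, 1200, 850, 15600, 32000]})

inner_result = pd.merge(customers, orders, on='customer_id', how='inner')
# Revenue from matched records only - orders 2004 and 2005 are excluded
total_revenue = inner_result['revenue'].sum()
print(total_revenue)  # 81000
```

Note the inner join silently drops ₹16,450 of revenue from customers 106 and 107.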
Left Join: Keep All Customers
Left join keeps every customer, even those without orders. Missing values fill with NaN. Perfect for identifying customers who haven't purchased recently.
# Left join - keep all customers, add order info where available
left_result = pd.merge(customers, orders, on='customer_id', how='left')
print("Left Join Result:")
print(left_result)
print(f"\nCustomers without orders:")
print(left_result[left_result['order_id'].isna()][['customer_id', 'city', 'customer_age']])

Left Join Result:
   customer_id  customer_age  gender       city  order_id product_category  revenue
0          101            25    Male     Mumbai    2001.0      Electronics  45000.0
1          101            25    Male     Mumbai    2006.0      Electronics  32000.0
2          102            34  Female      Delhi    2002.0         Clothing   2800.0
3          103            29    Male  Bangalore    2003.0            Books   1200.0
4          104            42  Female    Chennai       NaN              NaN      NaN
5          105            31    Male       Pune       NaN              NaN      NaN

Customers without orders:
   customer_id       city  customer_age
4          104    Chennai            42
5          105       Pune            31
What just happened?
All 5 customers remain, but customers 104 and 105 show NaN for order columns. The second query filters for customers with no orders — perfect for re-engagement campaigns. Try this: Calculate the percentage of customers without orders.
📊 Data Insight
40% of customers (2 out of 5) haven't made any orders. This suggests either new customer acquisition without conversion or customer churn that needs immediate attention.
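The 40% figure above can be computed from the left-join result. A standalone sketch:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [101, 102, 103, 104, 105]})
orders = pd.DataFrame({'order_id': [2001, 2002, 2003, 2004, 2005, 2006],
                       'customer_id': [101, 102, 103, 106, 107, 101]})

left_result = pd.merge(customers, orders, on='customer_id', how='left')
# Distinct customers with no matching order, as a share of all customers
no_orders = left_result[left_result['order_id'].isna()]['customer_id'].nunique()
pct = no_orders / customers['customer_id'].nunique() * 100
print(pct)  # 40.0
```

Counting distinct customers (`nunique()`) rather than rows matters here: customer 101 appears twice in the joined result.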
Right & Outer Joins
Right join keeps all orders, even from unknown customers. Outer join keeps everything. Both are useful for data auditing and finding system gaps.
# Right join - keep all orders, add customer info where available
right_result = pd.merge(customers, orders, on='customer_id', how='right')
print("Right Join - Orders from unknown customers:")
unknown_customers = right_result[right_result['customer_age'].isna()]
print(unknown_customers[['customer_id', 'order_id', 'product_category', 'revenue']])

Right Join - Orders from unknown customers:
   customer_id  order_id product_category  revenue
3          106      2004             Food      850
4          107      2005             Home    15600
# Outer join - everything from both DataFrames
outer_result = pd.merge(customers, orders, on='customer_id', how='outer')
print(f"Outer Join Summary:")
print(f"Total records: {len(outer_result)}")
print(f"Customers without orders: {outer_result['order_id'].isna().sum()}")
print(f"Orders from unknown customers: {outer_result['customer_age'].isna().sum()}")
print(f"Complete records: {len(outer_result) - outer_result['order_id'].isna().sum() - outer_result['customer_age'].isna().sum()}")

Outer Join Summary:
Total records: 8
Customers without orders: 2
Orders from unknown customers: 2
Complete records: 4
What just happened?
Right join revealed orders worth ₹16,450 from customers not in our database — potential data sync issues. Outer join shows the complete picture: 8 total records with data gaps on both sides. Try this: Calculate what percentage of total revenue comes from unknown customers.
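The revenue-share "Try this" follows from the right join. A standalone sketch:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [101, 102, 103, 104, 105],
                          'customer_age': [25, 34, 29, 42, 31]})
orders = pd.DataFrame({'customer_id': [101, 102, 103, 106, 107, 101],
                       'revenue': [45000, 2800, 1200, 850, 15600, 32000]})

right_result = pd.merge(customers, orders, on='customer_id', how='right')
# NaN customer_age marks orders from customers missing on the left side
unknown_revenue = right_result.loc[right_result['customer_age'].isna(), 'revenue'].sum()
pct_unknown = unknown_revenue / right_result['revenue'].sum() * 100
print(round(pct_unknown, 1))
```

Roughly 17% of revenue (₹16,450 of ₹97,450) traces back to customers with no demographic record, a sizable blind spot for any per-customer analysis.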
Advanced Merge Scenarios
Real data is messier: different column names, multiple join keys, duplicate handling. The scenario: Zomato wants to analyze delivery performance, but restaurant data uses restaurant_id while delivery data uses rest_id.
# Different column names for the same data
restaurant_info = pd.DataFrame({
'restaurant_id': [501, 502, 503],
'cuisine_type': ['Indian', 'Chinese', 'Italian'],
'city': ['Mumbai', 'Delhi', 'Bangalore']
})
delivery_data = pd.DataFrame({
'rest_id': [501, 501, 502, 503, 503],
'delivery_time': [25, 32, 18, 45, 28],
'order_value': [650, 420, 890, 1200, 580]
})
print("Restaurant Info:")
print(restaurant_info)

Restaurant Info:
   restaurant_id cuisine_type       city
0            501       Indian     Mumbai
1            502      Chinese      Delhi
2            503      Italian  Bangalore
# Merge with different column names using left_on and right_on
merged_restaurants = pd.merge(
restaurant_info,
delivery_data,
left_on='restaurant_id',
right_on='rest_id',
how='left'
)
print("Merged Restaurant Performance:")
print(merged_restaurants.head())
# Calculate average delivery time by cuisine
avg_delivery = merged_restaurants.groupby('cuisine_type')['delivery_time'].mean()
print(f"\nAverage delivery time by cuisine:")
print(avg_delivery)

Merged Restaurant Performance:
   restaurant_id cuisine_type       city  rest_id  delivery_time  order_value
0            501       Indian     Mumbai      501             25          650
1            501       Indian     Mumbai      501             32          420
2            502      Chinese      Delhi      502             18          890
3            503      Italian  Bangalore      503             45         1200
4            503      Italian  Bangalore      503             28          580

Average delivery time by cuisine:
cuisine_type
Chinese    18.0
Indian     28.5
Italian    36.5
Name: delivery_time, dtype: float64
What just happened?
We used left_on and right_on to join columns with different names. Both original ID columns remain in the result. Chinese restaurants deliver fastest (18 min), Italian slowest (36.5 min). Try this: Drop the redundant rest_id column.
Common Mistake: Forgetting to Drop Redundant Columns
When using left_on and right_on, both columns remain. Always drop one: merged_df.drop('rest_id', axis=1)
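The cleanup above is a single `drop` call right after the merge. A standalone sketch:

```python
import pandas as pd

restaurant_info = pd.DataFrame({'restaurant_id': [501, 502, 503],
                                'cuisine_type': ['Indian', 'Chinese', 'Italian']})
delivery_data = pd.DataFrame({'rest_id': [501, 501, 502, 503, 503],
                              'delivery_time': [25, 32, 18, 45, 28]})

merged = pd.merge(restaurant_info, delivery_data,
                  left_on='restaurant_id', right_on='rest_id', how='left')
# rest_id duplicates restaurant_id after the join - keep only one
merged = merged.drop(columns='rest_id')
print(list(merged.columns))  # ['restaurant_id', 'cuisine_type', 'delivery_time']
```

`drop(columns='rest_id')` is equivalent to `drop('rest_id', axis=1)` and reads more clearly.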
Multiple Key Joins
Sometimes you need multiple columns to uniquely identify records. The scenario: HDFC Bank analyzes transactions but needs both account number and transaction date to avoid duplicate matches.
# Multi-key join example - join on multiple columns
account_details = pd.DataFrame({
'account_id': [1001, 1002, 1003, 1001, 1002],
'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16', '2023-01-16'],
'customer_age': [28, 35, 42, 28, 35],
'account_type': ['Savings', 'Current', 'Savings', 'Savings', 'Current']
})
transaction_fees = pd.DataFrame({
'account_id': [1001, 1002, 1003, 1001],
'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16'],
'fee_charged': [50, 100, 25, 50],
'transaction_count': [3, 8, 1, 2]
})
# Join on both account_id AND date
complete_data = pd.merge(
account_details,
transaction_fees,
on=['account_id', 'date'],
how='left'
)
print("Multi-key Join Result:")
print(complete_data)

Multi-key Join Result:
   account_id        date  customer_age account_type  fee_charged  transaction_count
0        1001  2023-01-15            28      Savings         50.0                3.0
1        1002  2023-01-15            35      Current        100.0                8.0
2        1003  2023-01-15            42      Savings         25.0                1.0
3        1001  2023-01-16            28      Savings         50.0                2.0
4        1002  2023-01-16            35      Current          NaN                NaN
What just happened?
Account 1002 on 2023-01-16 shows NaN because no transaction fees exist for that specific account-date combination. This prevents incorrect matching across different dates. Try this: Count how many account-date combinations have missing fee data.
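Counting the gaps is a one-liner on the fee column. A standalone sketch, rebuilding just the key columns:

```python
import pandas as pd

account_details = pd.DataFrame({
    'account_id': [1001, 1002, 1003, 1001, 1002],
    'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16', '2023-01-16'],
})
transaction_fees = pd.DataFrame({
    'account_id': [1001, 1002, 1003, 1001],
    'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16'],
    'fee_charged': [50, 100, 25, 50],
})

complete_data = pd.merge(account_details, transaction_fees,
                         on=['account_id', 'date'], how='left')
# NaN in fee_charged marks account-date pairs with no fee record
missing_fees = complete_data['fee_charged'].isna().sum()
print(missing_fees)  # 1
```

Only account 1002 on 2023-01-16 lacks fee data, exactly the row the multi-key join refused to mismatch.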
Inner join is most restrictive, outer join most inclusive - choose based on your analysis needs
The chart shows how different join types affect your dataset size. Inner join gives you clean, complete records but loses data. Outer join preserves everything but requires handling missing values. Business impact: if you're calculating customer lifetime value, inner join gives accurate numbers but underestimates total customers. Left join reveals the full customer base, including inactive ones.
Pro Tip: Always check your join results with len() before and after merging. If the numbers don't make sense, you probably chose the wrong join type or have duplicate keys.
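Beyond comparing len() manually, pandas can audit a merge for you: indicator=True adds a _merge column recording where each row came from, and validate raises an error if the key relationship is not what you expect. A minimal sketch using this lesson's sample data:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [101, 102, 103, 104, 105]})
orders = pd.DataFrame({'order_id': [2001, 2002, 2003, 2004, 2005, 2006],
                       'customer_id': [101, 102, 103, 106, 107, 101]})

audit = pd.merge(customers, orders, on='customer_id', how='outer',
                 indicator=True,          # adds a _merge column
                 validate='one_to_many')  # errors if customer_id repeats on the left
print(audit['_merge'].value_counts())
```

Here `both`, `left_only`, and `right_only` counts (4, 2, 2) recover the outer-join summary from earlier in one step.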
Quiz
1. Flipkart wants to find customers who haven't ordered in the last 30 days for a re-engagement campaign. They have a customers DataFrame and a recent_orders DataFrame. Which join type should they use?
2. When using pd.merge(df1, df2, left_on='restaurant_id', right_on='rest_id'), what happens to the join columns?
3. You have a customers DataFrame and orders DataFrame, both with customer_id and date columns. You want to join them only when both customer_id AND date match exactly. How do you do this?
Up Next
Univariate Analysis
Now that you can combine datasets, learn to analyze individual variables to uncover patterns and distributions in your merged data.