Data Science Lesson 33 – Joins | Dataplexa
SQL · Lesson 33

Joins

Combine data from multiple tables using INNER, LEFT, RIGHT, and FULL JOINs to solve complex business queries

1. Identify Relationship: find common columns between tables
2. Choose Join Type: INNER, LEFT, RIGHT, or FULL
3. Execute Query: get combined results

Why Joins Matter

Real business data splits across multiple tables. Your e-commerce orders sit in one table, customer details in another, product information in a third. Nobody puts everything in one massive table anymore — that's a nightmare to maintain.

Think about it this way. Flipkart doesn't store your name and address inside every order record. They store your customer ID once, then reference it. Smart design, but now you need joins to answer basic questions like "which customers from Mumbai bought electronics?"

Joins solve the fundamental problem of data normalization. You avoid repeating data, keep storage efficient, but combine information when needed. The catch? You need to understand which join type gives you the right results.
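The "Mumbai customers who bought electronics" question can be sketched as a join plus a filter. This is a minimal illustration with made-up toy data, not the lesson's tables:

```python
import pandas as pd

# Toy tables, illustrative values only
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 3],
    'product_category': ['Electronics', 'Electronics', 'Food'],
})

# Join the tables on the shared ID, then filter the combined result
joined = pd.merge(orders, customers, on='customer_id', how='inner')
mumbai_electronics = joined[
    (joined['city'] == 'Mumbai') & (joined['product_category'] == 'Electronics')
]
print(mumbai_electronics['customer_id'].tolist())  # [1]
```

The join reassembles what normalization split apart: the order knows the customer ID, the customer row knows the city, and the merge lets you query both at once.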

Normalized Database

Customers table, Orders table, Products table — linked by IDs

Business Questions

Need customer names with order amounts — requires joining tables

The Four Join Types

Every join type serves a specific purpose. The wrong choice gives you wrong results — and honestly, this trips up most analysts I've worked with. Here's the breakdown:

| Join Type | What You Get | Use Case | Risk |
|---|---|---|---|
| INNER | Only matching records | Orders with valid customers | Loses incomplete data |
| LEFT | All left table + matches | All customers, with/without orders | NULL values in right columns |
| RIGHT | All right table + matches | All orders, with/without customer info | NULL values in left columns |
| FULL | Everything from both tables | Complete picture analysis | Many NULL values |
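In pandas, all four types map to the same `pd.merge` call via the `how` parameter. Here's a quick sketch on toy data (not the lesson's tables) showing how row counts change, plus `indicator=True`, which labels where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'l_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r_val': ['x', 'y', 'z']})

# Same data, four join types, four different row counts
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(left, right, on='key', how=how)
    print(how, len(result))
# inner 2, left 3, right 3, outer 4

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
flagged = pd.merge(left, right, on='key', how='outer', indicator=True)
print(flagged['_merge'].value_counts())
```

The `_merge` column is a handy debugging tool for every join type below: it tells you exactly which rows matched and which didn't.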

INNER JOIN - The Safe Choice

INNER JOIN gives you the intersection. Only records that exist in both tables make it through. This is your go-to for clean, reliable results.

The scenario: Zomato's data team needs to analyze order patterns, but only wants complete records where they have both order details and valid customer information.
# Create a customers table to join with our orders
import pandas as pd

# Customer demographics data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6],
    'customer_name': ['Raj Sharma', 'Priya Patel', 'Amit Singh', 
                      'Sneha Gupta', 'Rohit Kumar', 'Anita Joshi'],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Pune', 'Mumbai'],
    'signup_date': ['2023-01-15', '2023-02-10', '2023-01-20', 
                    '2023-03-05', '2023-02-28', '2023-01-30']
})
print("Customers data:")
print(customers)

What just happened?

We created a customers DataFrame with 6 customers across major Indian cities. Each customer has a unique ID that we'll use to join with orders. Try this: Add more customers or change the cities to match your business needs.

Now simulate some orders data with customer IDs:
# Orders data with customer references
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer_id': [1, 2, 3, 1, 7],  # Note: customer_id 7 doesn't exist in customers
    'product_category': ['Electronics', 'Clothing', 'Food', 'Books', 'Electronics'],
    'revenue': [15000, 3500, 1200, 800, 25000],
    'date': ['2023-03-01', '2023-03-02', '2023-03-03', '2023-03-04', '2023-03-05']
})
print("Orders data:")
print(orders)

What just happened?

Notice that customer_id 7 in order 1005 doesn't exist in our customers table. This is realistic — sometimes you have orphaned records. INNER JOIN will handle this. Try this: Add more orders or create mismatched customer IDs to see how joins behave.

Now perform the INNER JOIN:
# INNER JOIN - only matching customer_ids
inner_result = pd.merge(orders, customers, on='customer_id', how='inner')
print(f"Orders before join: {len(orders)}")
print(f"Results after INNER JOIN: {len(inner_result)}")
print("\nJoined data:")
print(inner_result)

What just happened?

We lost one record! Order 1005 with customer_id 7 disappeared because that customer doesn't exist in our customers table. INNER JOIN only keeps perfect matches. Try this: Add customer_id 7 to the customers table and rerun the join.

INNER JOIN filtered out 1 order with invalid customer reference

The chart shows the fundamental characteristic of INNER JOIN — it reduces your dataset to only valid combinations. This matters for revenue calculations where you need clean, reliable data. But if you're doing customer retention analysis, you might want to keep all customers even if they haven't ordered recently.
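One way to catch this silent row loss before committing to an INNER JOIN is to left-join with `indicator=True` first and inspect the unmatched rows. A sketch using a simplified slice of the tables above:

```python
import pandas as pd

# Simplified slice of the lesson's tables: one valid and one orphaned order
orders = pd.DataFrame({'order_id': [1001, 1005], 'customer_id': [1, 7]})
customers = pd.DataFrame({'customer_id': [1, 2],
                          'customer_name': ['Raj Sharma', 'Priya Patel']})

# Left-join with indicator to see which orders an INNER JOIN would drop
check = pd.merge(orders, customers, on='customer_id', how='left', indicator=True)
orphans = check[check['_merge'] == 'left_only']
print(f"Orders that an INNER JOIN would drop: {len(orphans)}")
# Orders that an INNER JOIN would drop: 1
```

Running this kind of check before a revenue report tells you up front how much data the join will discard, instead of finding out after the numbers look wrong.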

LEFT JOIN - Keep Everything From Left

LEFT JOIN preserves every record from your first (left) table and adds matching data from the right table. No matches? You get NULL values, but the record survives.

The scenario: Paytm wants to analyze all customers, including those who haven't made recent purchases, to identify potential reactivation targets.
# LEFT JOIN - keep all customers, add order info where available
left_result = pd.merge(customers, orders, on='customer_id', how='left')
print(f"Customers: {len(customers)}")
print(f"Results after LEFT JOIN: {len(left_result)}")
print("\nAll customers with order data:")
print(left_result[['customer_name', 'city', 'order_id', 'revenue']])

What just happened?

We kept all 6 customers! Notice Sneha, Rohit, and Anita have NaN for order_id and revenue — they haven't ordered anything. But Raj appears twice because he made 2 orders. Try this: Filter for customers with NaN orders to find reactivation targets.

Identify customers who need reactivation:
# Find customers with no recent orders
no_orders = left_result[left_result['order_id'].isna()]
print("Customers needing reactivation:")
print(no_orders[['customer_name', 'city', 'signup_date']])

# Calculate reactivation opportunity
reactivation_rate = len(no_orders) / len(customers) * 100
print(f"\nReactivation opportunity: {reactivation_rate:.1f}% of customer base")

What just happened?

We identified 3 out of 6 customers (50%) who signed up but never ordered. This is pure gold for marketing teams — reactivating an existing customer is generally far cheaper than acquiring a new one. Try this: Add signup_date analysis to prioritize recent signups first.

📊 Data Insight

LEFT JOIN revealed a 50% customer activation opportunity — roughly INR 20,000 in potential revenue if the 3 inactive customers spent like the average active customer (INR 20,500 across 3 active customers, about INR 6,833 each)

RIGHT JOIN - Mirror of LEFT JOIN

RIGHT JOIN does the opposite — keeps all records from the right table. Honestly, most people just swap table order and use LEFT JOIN instead. But sometimes your query structure makes RIGHT JOIN cleaner.
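The "just swap table order" claim is easy to verify. A minimal sketch on toy data (not the lesson's tables) showing that a RIGHT JOIN equals a LEFT JOIN with the tables swapped, apart from column order:

```python
import pandas as pd

a = pd.DataFrame({'key': [1, 2], 'a_val': ['p', 'q']})
b = pd.DataFrame({'key': [2, 3], 'b_val': ['x', 'y']})

# RIGHT JOIN of (a, b) vs LEFT JOIN of (b, a)
right_join = pd.merge(a, b, on='key', how='right')
swapped_left = pd.merge(b, a, on='key', how='left')

# Sort columns alphabetically so only the content is compared
same = right_join.sort_index(axis=1).equals(swapped_left.sort_index(axis=1))
print(same)  # True
```

This is why RIGHT JOIN is rare in practice: the LEFT JOIN version usually reads more naturally, with the table you want to preserve listed first.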

The scenario: OLA wants to analyze all ride bookings, even those with incomplete driver information due to system glitches.
# RIGHT JOIN - keep all orders, add customer info where available  
right_result = pd.merge(customers, orders, on='customer_id', how='right')
print(f"Orders: {len(orders)}")
print(f"Results after RIGHT JOIN: {len(right_result)}")
print("\nAll orders with customer data:")
print(right_result[['order_id', 'customer_name', 'revenue', 'product_category']])

What just happened?

We kept all 5 orders, including order 1005 which has customer_name = NaN because customer_id 7 doesn't exist. This order still represents INR 25,000 revenue that we can't ignore in financial reports. Try this: Calculate total revenue including orphaned orders.

FULL OUTER JOIN - Everything Combined

FULL OUTER JOIN is the kitchen sink approach. You get every record from both tables, whether they match or not. Lots of NULL values, but complete visibility.

The scenario: BigBasket's executive team wants a complete customer and order analysis for quarterly board presentation — no data can be missing.
# FULL OUTER JOIN - everything from both tables
full_result = pd.merge(customers, orders, on='customer_id', how='outer')
print(f"Customers: {len(customers)}, Orders: {len(orders)}")
print(f"Results after FULL OUTER JOIN: {len(full_result)}")
print("\nComplete dataset:")
print(full_result[['customer_name', 'city', 'order_id', 'revenue']].head(8))
# Analyze the complete picture
total_customers = full_result['customer_name'].notna().sum()
total_orders = full_result['order_id'].notna().sum()
total_revenue = full_result['revenue'].sum()

print(f"Complete business metrics:")
print(f"• Valid customers: {total_customers}")
print(f"• Total orders: {total_orders}")  
print(f"• Total revenue: INR {total_revenue:,.0f}")
print(f"• Orphaned orders: {full_result['customer_name'].isna().sum()}")

What just happened?

We captured everything — 6 customers, 5 orders, INR 45,500 total revenue including that orphaned INR 25,000 order. Row 7 shows an order with no customer info, and rows 4-6 show customers with no orders. Try this: Use this for complete data quality audits.

Complete view shows data quality issues: 50% of customers inactive, 20% of orders orphaned

This doughnut chart reveals critical business insights. Half your customers are inactive, and you have data integrity issues with orphaned orders. The 4 matched records represent your core active business, but the 3 inactive customers and 1 orphaned order need attention.
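The matched/inactive/orphaned split described above doesn't have to be counted by eye. A sketch that rebuilds the lesson's key columns inline and computes the breakdown with `indicator=True`:

```python
import pandas as pd

# The lesson's tables, reduced to the columns needed for the audit
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6],
    'customer_name': ['Raj Sharma', 'Priya Patel', 'Amit Singh',
                      'Sneha Gupta', 'Rohit Kumar', 'Anita Joshi'],
})
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer_id': [1, 2, 3, 1, 7],
})

# Full outer join with provenance flags for a data quality audit
audit = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
counts = audit['_merge'].value_counts()
print(counts)
# both = matched orders, left_only = inactive customers, right_only = orphaned orders
```

The counts reproduce the chart's segments: 4 matched order rows, 3 customers with no orders, and 1 orphaned order.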

Multiple Column Joins

Real business scenarios often require joining on multiple columns. Think customer_id AND date for tracking daily behavior, or product_id AND store_id for inventory management.

The scenario: Myntra needs to track customer behavior by both customer and date to identify shopping patterns and seasonal trends.
# Create customer activity log 
activity_log = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'date': ['2023-03-01', '2023-03-04', '2023-03-02', '2023-03-06', '2023-03-03'],
    'page_views': [15, 8, 22, 12, 5],
    'time_spent_minutes': [45, 20, 67, 30, 15]
})

# Orders already carry a date column, so use them as-is
orders_with_date = orders.copy()
print("Orders and activity data ready for multi-column join")
# Join on both customer_id AND date to match exact daily behavior
multi_join = pd.merge(orders_with_date, activity_log, 
                     on=['customer_id', 'date'], how='inner')

print("Orders matched with same-day website activity:")
print(multi_join[['customer_id', 'date', 'revenue', 'page_views', 'time_spent_minutes']])

# Calculate conversion insight
avg_time_to_buy = multi_join['time_spent_minutes'].mean()
print(f"\nAverage time spent before purchase: {avg_time_to_buy:.1f} minutes")

What just happened?

Multi-column join revealed same-day purchase behavior. Notice that page views don't map neatly to revenue: the INR 3,500 order came after 22 views, the INR 15,000 order after just 15, and the INR 1,200 order took only 5 views — probably a targeted buyer. Try this: Analyze time_spent vs revenue correlation.

No clear correlation between browsing time and purchase value - customer intent varies

The scatter plot reveals interesting customer behavior patterns. There's no linear relationship between time spent and purchase value. The high-value purchase (INR 15,000) happened with moderate browsing time (45 minutes), suggesting focused, high-intent customers. This insight helps optimize the user experience for different customer types.
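The "Try this" suggestion above can be done with pandas' built-in Pearson correlation. A sketch that rebuilds the four matched rows from the multi-column join inline:

```python
import pandas as pd

# The four matched order/activity rows from the multi-column join
matched = pd.DataFrame({
    'revenue': [15000, 3500, 1200, 800],
    'time_spent_minutes': [45, 67, 15, 20],
})

# Series.corr defaults to the Pearson correlation coefficient
corr = matched['revenue'].corr(matched['time_spent_minutes'])
print(f"Pearson correlation: {corr:.2f}")
```

On these four points the coefficient comes out weakly positive (around 0.4), consistent with the "no clear correlation" reading above — and in any case, four data points are far too few for a reliable estimate.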

Common Join Mistake

Wrong: Using INNER JOIN when you need LEFT JOIN for customer analysis — you lose customers without recent orders. Fix: Always ask "Do I need ALL records from the primary table?" If yes, use LEFT JOIN.

Join Performance Tips

Joins can be slow on large datasets. Here's what actually works in production environments — not the theoretical stuff you read in textbooks.

✅ Fast Joins

  • Join on indexed columns (usually IDs)
  • Filter data BEFORE joining
  • Use INNER JOIN when possible
  • Sort large tables first

⚠️ Slow Joins

  • Joining on text columns
  • Multiple LEFT JOINs in sequence
  • No indexes on join keys
  • Cartesian products (missing ON clause)

The biggest performance killer? Joining two large tables without proper indexing. I've seen queries go from 30 seconds to 0.3 seconds just by adding the right index. And always remember — INNER JOIN is usually fastest because it produces the smallest result set.

Pro Tip: When joining tables with millions of rows, filter each table to only relevant records first. Joining 100K rows with 50K rows beats joining 10M with 5M every single time.
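The pro tip above can be sketched in pandas with synthetic data (sizes and column names are made up; actual timings will vary). Filtering each side first gives the same answer as joining everything and filtering afterwards, while merging far fewer rows:

```python
import pandas as pd
import numpy as np

# Synthetic data: 100K orders referencing 10K customers
rng = np.random.default_rng(42)
big_orders = pd.DataFrame({
    'customer_id': rng.integers(1, 10001, size=100_000),
    'city_id': rng.integers(1, 51, size=100_000),
    'revenue': rng.integers(100, 5000, size=100_000),
})
big_customers = pd.DataFrame({
    'customer_id': range(1, 10001),
    'segment': ['premium' if i % 10 == 0 else 'regular' for i in range(1, 10001)],
})

# Approach 1: filter each table to relevant rows BEFORE joining
premium = big_customers[big_customers['segment'] == 'premium']
city_1 = big_orders[big_orders['city_id'] == 1]
small_join = pd.merge(city_1, premium, on='customer_id', how='inner')

# Approach 2: join everything, then filter (same result, much more work)
full_join = pd.merge(big_orders, big_customers, on='customer_id', how='inner')
big_then_filter = full_join[
    (full_join['city_id'] == 1) & (full_join['segment'] == 'premium')
]
print(len(small_join) == len(big_then_filter))  # True
```

The results match row for row, but approach 1 merges a couple of thousand rows instead of 100K — the same reason pre-filtering pays off so dramatically in SQL.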

Quiz

1. Swiggy wants to analyze all customers including those who haven't ordered recently. They need customer names, signup dates, and order information where available. Which join should they use?


2. You have a customers table with 10 records and an orders table with 5 records. Some orders have customer_id values that don't exist in the customers table. How many records will a RIGHT JOIN return?


3. A junior analyst writes: SELECT * FROM customers JOIN orders. The customers table has 1000 records, orders has 500 records, but the result shows 500,000 records. What went wrong?


Up Next

Aggregations

Master GROUP BY, COUNT, SUM, and AVG to transform joined data into business insights and executive dashboards.