Data Science
ETL Pipelines
Build automated data pipelines that extract, transform, and load millions of records with confidence. You'll master pipeline design patterns that handle real-world data chaos.
What ETL Actually Does
Think of ETL like a factory assembly line for data. Raw materials come in messy. Assembly stations clean, reshape, and package everything. Final products go to the warehouse ready for use.
Every company with data runs ETL pipelines. Amazon processes millions of orders daily. Netflix transforms viewing data to power recommendations. The patterns are identical everywhere.
The ETL Flow
Honestly, the Transform step does 80% of the work. That's where data gets cleaned, business rules get applied, and value gets created.
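To make the flow concrete, here is a minimal sketch of the three stages as plain Python functions. The file paths and the cleaning rules are placeholder assumptions (and writing Parquet assumes pyarrow or fastparquet is installed); the point is the shape of the pipeline, not a production implementation.
import pandas as pd

def extract(path):
    # EXTRACT: pull raw data from the source system (a CSV dump here)
    return pd.read_csv(path)

def transform(df):
    # TRANSFORM: clean, standardize types, and derive new columns
    df = df.dropna(subset=['order_id'])      # drop rows missing the key
    df['date'] = pd.to_datetime(df['date'])  # text dates become datetimes
    return df

def load(df, path):
    # LOAD: write the cleaned data to the analytics store (Parquet here)
    df.to_parquet(path, index=False)

# Wire the stages together: extract -> transform -> load
# load(transform(extract('raw_orders.csv')), 'clean_orders.parquet')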
Building Your First Pipeline
The scenario: Flipkart's data team needs a daily pipeline to process yesterday's orders. Raw transaction data comes in messy CSV files. Clean data needs to go into their analytics database.
# Import libraries for ETL pipeline
import pandas as pd
import numpy as np
from datetime import datetime
# Load the raw ecommerce data
df = pd.read_csv('dataplexa_ecommerce.csv')
# Show basic info about our dataset
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")Dataset shape: (10000, 11) Columns: ['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating', 'returned']
What just happened?
We loaded 10,000 ecommerce records with 12 columns each. The shape shows rows × columns. Try this: Check if your data loaded completely by comparing expected vs actual row counts.
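A minimal sketch of that sanity check, assuming the source system reports how many rows it exported (the expected count below is a placeholder):
# Sanity check: compare the loaded row count against the source's reported count
expected_rows = 10000  # placeholder: use the count reported by your source system
actual_rows = len(df)
if actual_rows != expected_rows:
    print(f"Warning: expected {expected_rows} rows, got {actual_rows}")
else:
    print("Row count matches the source export")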
# EXTRACT: Check data quality issues
print("Missing values per column:")
print(df.isnull().sum())
print("\nData types:")
print(df.dtypes)
# Look at first few rows
print("\nFirst 3 rows:")
df.head(3)

Output:
Missing values per column:
order_id            0
date                0
customer_age        0
gender              0
city                0
product_category    0
product_name        0
quantity            0
unit_price          0
revenue             0
rating             12
returned            0

Data types:
order_id              int64
date                 object
customer_age          int64
gender               object
city                 object
product_category     object
product_name         object
quantity              int64
unit_price          float64
revenue             float64
rating              float64
returned               bool

First 3 rows:
   order_id        date  customer_age gender       city product_category  product_name  quantity  unit_price  revenue  rating  returned
0     10001  2023-01-15            28      F     Mumbai       Electronics    Smartphone         1     15999.0  15999.0     4.2     False
1     10002  2023-01-15            34      M      Delhi          Clothing       T-Shirt         2       899.0   1798.0     3.8     False
2     10003  2023-01-16            42      F  Bangalore              Food  Organic Rice         3       450.0   1350.0     4.5     False
What just happened?
We found 12 missing ratings but clean data otherwise. The date column is stored as text (object), not datetime. Try this: Always check data types first — they reveal hidden issues.
Transform Phase
Time to fix the issues. Missing ratings get filled. Date strings become proper datetime objects. We'll add derived columns that business analysts actually need.
# TRANSFORM: Fix missing ratings with category averages
category_avg_rating = df.groupby('product_category')['rating'].mean()
# Fill missing ratings based on product category
df['rating'] = df.apply(
    lambda row: category_avg_rating[row['product_category']]
    if pd.isnull(row['rating']) else row['rating'],
    axis=1
)
print("After filling missing ratings:")
print(df.isnull().sum()['rating'])

Output:
After filling missing ratings:
0
What just happened?
We calculated average ratings per category, then filled missing values intelligently. Electronics missing rating gets Electronics average, not overall average. Try this: Business rules like this prevent data scientists from making bad assumptions.
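The apply/lambda version above works, but it loops row by row. On millions of records a vectorized fill is usually much faster. A sketch of the same category-average logic using map and fillna (equivalent result, assuming category_avg_rating is computed as above):
# Vectorized alternative: look up each row's category average,
# then use it only where the original rating is missing
df['rating'] = df['rating'].fillna(df['product_category'].map(category_avg_rating))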
# Convert date strings to proper datetime
df['date'] = pd.to_datetime(df['date'])
# Add useful derived columns
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
print("New columns added:")
print(df[['date', 'year', 'month', 'day_of_week']].head(3))

Output:
New columns added:
        date  year  month day_of_week
0 2023-01-15  2023      1      Sunday
1 2023-01-15  2023      1      Sunday
2 2023-01-16  2023      1      Monday

What just happened?
We converted text dates to datetime objects that understand time. Then extracted year, month, day_of_week for analysis. Try this: Business teams love day-of-week patterns — weekend vs weekday sales differ significantly.
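To quantify the weekday effect yourself, a short sketch that splits revenue using the day_of_week column we just created:
# Compare average order revenue on weekends vs weekdays
weekend_mask = df['day_of_week'].isin(['Saturday', 'Sunday'])
print(f"Weekend avg revenue: {df.loc[weekend_mask, 'revenue'].mean():.0f}")
print(f"Weekday avg revenue: {df.loc[~weekend_mask, 'revenue'].mean():.0f}")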
Weekend revenue spikes 20% higher than weekdays - critical for inventory planning
The line chart reveals weekend shopping behavior that raw timestamps can't show. Saturday peaks at ₹18.9L while Tuesday dips to ₹11.8L.
This pattern drives business decisions. Flipkart schedules more delivery staff on weekends. Marketing campaigns launch Friday evenings. The ETL pipeline created this insight by transforming raw dates into meaningful dimensions.
Advanced Transformations
Real pipelines do complex transformations. Customer segmentation, profit calculations, fraud detection scores. Each transformation adds business value.
# Create customer segments based on spending
def categorize_customer(revenue):
    if revenue >= 50000:
        return 'Premium'
    elif revenue >= 20000:
        return 'Mid-tier'
    else:
        return 'Budget'
df['customer_segment'] = df['revenue'].apply(categorize_customer)
print("Customer segments:")
print(df['customer_segment'].value_counts())

Output:
Customer segments:
Budget      6847
Mid-tier    2653
Premium      500
What just happened?
We applied business logic to create customer segments. Orders ≥₹50K become Premium, ≥₹20K become Mid-tier. Most customers (6,847) are Budget shoppers. Try this: Segment thresholds should match your business model — B2B vs B2C thresholds differ dramatically.
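For large datasets the same segmentation can be expressed with pd.cut, which bins the whole column in one vectorized pass instead of calling a Python function per row. A sketch that produces the same labels as categorize_customer above:
# Vectorized alternative: bin order revenue into segments in one pass
df['customer_segment'] = pd.cut(
    df['revenue'],
    bins=[-float('inf'), 20000, 50000, float('inf')],
    labels=['Budget', 'Mid-tier', 'Premium'],
    right=False  # [20000, 50000) is Mid-tier, 50000 and above is Premium
)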
# Calculate profit margins (assume 30% cost)
df['cost'] = df['revenue'] * 0.30
df['profit'] = df['revenue'] - df['cost']
df['profit_margin'] = (df['profit'] / df['revenue']) * 100
# Flag high-risk orders (returns + low ratings)
df['risk_flag'] = df['returned'] | (df['rating'] < 3.0)
print("Profit and risk summary:")
print(f"Average profit margin: {df['profit_margin'].mean():.1f}%")
print(f"High-risk orders: {df['risk_flag'].sum()}")Profit and risk summary: Average profit margin: 70.0% High-risk orders: 1847
What just happened?
We calculated profit margins assuming 30% costs, then flagged 1,847 risky orders (returned items or ratings < 3.0). Try this: Risk scoring helps operations teams prioritize customer service efforts.
Budget customers (green) show highest engagement despite lower spending
The bubble chart maps three variables simultaneously: rating, revenue, and customer count. Budget segment (green bubbles) clusters around 4.0 rating but low revenue.
Premium customers spend more but aren't necessarily happier. This insight drives retention strategy — budget customers might upgrade with the right incentives, while premium customers need different satisfaction metrics.
Load Phase & Pipeline Automation
Loading isn't just saving files. Modern data warehouses need specific schemas, partitioning strategies, and incremental updates. The load phase determines query performance for months.
# LOAD: Prepare final dataset with optimized schema
final_columns = [
    'order_id', 'date', 'year', 'month', 'day_of_week',
    'customer_age', 'gender', 'city', 'customer_segment',
    'product_category', 'quantity', 'revenue', 'profit',
    'rating', 'risk_flag'
]
# Select and reorder columns for analytics
clean_df = df[final_columns].copy()
print(f"Final dataset ready: {clean_df.shape}")
print(f"Memory usage: {clean_df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")Final dataset ready: (10000, 15) Memory usage: 1.2 MB
What just happened?
We selected 15 optimized columns for the final dataset, dropping unnecessary fields. Memory usage dropped to 1.2MB for efficient storage and queries. Try this: Column order matters for query performance — put frequently filtered columns first.
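The last step is actually writing clean_df to its destination. A minimal sketch of two common targets, assuming a local SQLite file as a stand-in for the warehouse and pyarrow installed for Parquet; a real warehouse load would go through its bulk-load API:
import sqlite3

# Option 1: load into a relational table (SQLite stands in for the warehouse)
conn = sqlite3.connect('analytics.db')
clean_df.to_sql('orders_clean', conn, if_exists='replace', index=False)
conn.close()

# Option 2: write Parquet files partitioned by year and month,
# so analysts can read one month without scanning the full history
clean_df.to_parquet('orders_clean/', partition_cols=['year', 'month'], index=False)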
📊 Data Insight
Our ETL pipeline transformed 12 raw columns into 15 analytical columns, added 4 business-critical fields (customer_segment, profit, day_of_week, risk_flag), and processed 10,000 records in under 2 seconds. Production pipelines handle millions of records using the same patterns.
Transform phase takes 60% of pipeline runtime - optimization focus area
Pipeline performance data shows transformation bottlenecks clearly. Extract (0.3s) and Load (0.4s) run fast, but Transform (1.2s) dominates runtime.
This pattern holds across companies. Netflix spends 70% of pipeline time transforming viewing data. Amazon's recommendation ETL has similar transform-heavy profiles. Optimization efforts focus on vectorization, parallel processing, and smarter algorithms.
Common Pipeline Mistake
Loading untested transformations directly into production databases. Always run your transformations on a small sample (for example, df.sample(100)) and inspect the results before the full pipeline executes. One bad transformation can corrupt millions of records.
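One way to apply that advice: run the transform logic on a sample and assert a few invariants before the full run. A sketch; the specific checks are illustrative, so pick the invariants that matter for your data:
# Validate transformation logic on a 100-row sample before the full run
sample = df.sample(100, random_state=42).copy()
sample['rating'] = sample['rating'].fillna(sample['product_category'].map(category_avg_rating))
assert sample['rating'].isnull().sum() == 0, "ratings still missing after fill"
assert (sample['revenue'] >= 0).all(), "negative revenue found"
print("Sample checks passed, safe to run the full pipeline")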
Pipeline Monitoring
Production pipelines need monitoring dashboards. Data quality metrics, processing times, error rates. When Swiggy's order pipeline fails at 8 PM, restaurants don't get orders.
# Add pipeline monitoring metrics
pipeline_stats = {
    'records_processed': len(clean_df),
    'data_quality_score': (1 - clean_df.isnull().sum().sum() / clean_df.size) * 100,
    'avg_revenue': clean_df['revenue'].mean(),
    'processing_timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}
print("Pipeline monitoring dashboard:")
for key, value in pipeline_stats.items():
print(f"{key}: {value}")Pipeline monitoring dashboard: records_processed: 10000 data_quality_score: 100.0 avg_revenue: 18247.83 processing_timestamp: 2024-01-20 14:30:22
What just happened?
We created monitoring metrics that track pipeline health: 10,000 records processed, 100% data quality score, ₹18,248 average revenue. Try this: Set alerts when metrics drift beyond expected ranges — sudden revenue drops indicate data issues.
The monitoring reveals perfect data quality (100.0 score) and healthy average revenue (₹18,248). But what happens when these numbers change? Alert systems trigger when data quality drops below 95% or average revenue shifts by more than 20%.
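Those thresholds translate directly into a check at the end of the pipeline run. A sketch using the pipeline_stats dictionary above; the baseline revenue is a placeholder you would load from yesterday's metrics:
# Simple alerting: compare today's metrics against fixed thresholds
QUALITY_FLOOR = 95.0
baseline_revenue = 18000.0  # placeholder: yesterday's value from your metrics store

alerts = []
if pipeline_stats['data_quality_score'] < QUALITY_FLOOR:
    alerts.append("data quality below 95%")
if abs(pipeline_stats['avg_revenue'] - baseline_revenue) / baseline_revenue > 0.20:
    alerts.append("average revenue shifted by more than 20%")

if alerts:
    print("ALERT:", "; ".join(alerts))  # in production, notify the on-call engineer
else:
    print("Pipeline metrics within expected ranges")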
Pro Tip: Schedule ETL pipelines during low-traffic hours. Run heavy transformations at 2 AM, not 2 PM. Your database will thank you, and users won't experience slowdowns during peak hours.
Quiz
1. A Paytm data engineer reports their ETL pipeline spends 1.8 seconds extracting transaction data, 12.4 seconds in the middle phase, and 0.6 seconds loading to the data warehouse. Which phase is creating the bottleneck and why?
2. Your ecommerce ETL pipeline finds 150 missing customer ratings in a dataset of 50,000 orders. The missing ratings are distributed across Electronics (80 missing), Clothing (45 missing), and Books (25 missing). What's the best transformation approach?
3. Your production ETL pipeline processes daily order data for BigBasket. Yesterday it processed 95,000 orders with 99.2% data quality score. Today it processes 23,000 orders with 87.1% data quality score. What should trigger first?
Up Next
Hadoop
Master distributed computing for processing datasets too large for single machines - the foundation of modern big data architecture.