Data Science
ETL Pipelines
Build automated data pipelines that extract, transform, and load millions of records with confidence. You'll master pipeline design patterns that handle real-world data chaos.
What ETL Actually Does
Think of ETL like a factory assembly line for data. Raw materials come in messy. Assembly stations clean, reshape, and package everything. Final products go to the warehouse ready for use.
Every company with data runs ETL pipelines. Amazon processes millions of orders daily. Netflix transforms viewing data to power recommendations. The patterns are identical everywhere.
The ETL Flow
Honestly, the Transform step does 80% of the work. That's where data gets cleaned, business rules get applied, and value gets created.
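To make the flow concrete, here is a minimal sketch of the three stages as plain Python functions. The file paths and the cleaning rules are placeholder assumptions (and writing Parquet assumes pyarrow or fastparquet is installed); the point is the shape of the pipeline, not a production implementation.
import pandas as pd

def extract(path):
    # EXTRACT: pull raw data from the source system (a CSV dump here)
    return pd.read_csv(path)

def transform(df):
    # TRANSFORM: clean, standardize types, and derive new columns
    df = df.dropna(subset=['order_id'])      # drop rows missing the key
    df['date'] = pd.to_datetime(df['date'])  # text dates become datetimes
    return df

def load(df, path):
    # LOAD: write the cleaned data to the analytics store (Parquet here)
    df.to_parquet(path, index=False)

# Wire the stages together: extract -> transform -> load
# load(transform(extract('raw_orders.csv')), 'clean_orders.parquet')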
Building Your First Pipeline
The scenario: Flipkart's data team needs a daily pipeline to process yesterday's orders. Raw transaction data comes in messy CSV files. Clean data needs to go into their analytics database.
# Import libraries for ETL pipeline
import pandas as pd
import numpy as np
from datetime import datetime
# Load the raw ecommerce data
df = pd.read_csv('dataplexa_ecommerce.csv')
# Show basic info about our dataset
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")Dataset shape: (10000, 11) Columns: ['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating', 'returned']
What just happened?
We loaded 10,000 ecommerce records with 12 columns each. The shape shows rows × columns. Try this: Check if your data loaded completely by comparing expected vs actual row counts.
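A minimal sketch of that sanity check, assuming the source system reports how many rows it exported (the expected count below is a placeholder):
# Sanity check: compare the loaded row count against the source's reported count
expected_rows = 10000  # placeholder: use the count reported by your source system
actual_rows = len(df)
if actual_rows != expected_rows:
    print(f"Warning: expected {expected_rows} rows, got {actual_rows}")
else:
    print("Row count matches the source export")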
# EXTRACT: Check data quality issues
print("Missing values per column:")
print(df.isnull().sum())
print("\nData types:")
print(df.dtypes)
# Look at first few rows
print("\nFirst 3 rows:")
df.head(3)

Output:
Missing values per column:
order_id            0
date                0
customer_age        0
gender              0
city                0
product_category    0
product_name        0
quantity            0
unit_price          0
revenue             0
rating             12
returned            0

Data types:
order_id              int64
date                 object
customer_age          int64
gender               object
city                 object
product_category     object
product_name         object
quantity              int64
unit_price          float64
revenue             float64
rating              float64
returned               bool

First 3 rows:
   order_id        date  customer_age gender       city product_category  product_name  quantity  unit_price  revenue  rating  returned
0     10001  2023-01-15            28      F     Mumbai       Electronics    Smartphone         1     15999.0  15999.0     4.2     False
1     10002  2023-01-15            34      M      Delhi          Clothing       T-Shirt         2       899.0   1798.0     3.8     False
2     10003  2023-01-16            42      F  Bangalore              Food  Organic Rice         3       450.0   1350.0     4.5     False
What just happened?
We found 12 missing ratings but clean data otherwise. The date column is stored as text (object), not datetime. Try this: Always check data types first — they reveal hidden issues.
Transform Phase
Time to fix the issues. Missing ratings get filled. Date strings become proper datetime objects. We'll add derived columns that business analysts actually need.
# TRANSFORM: Fix missing ratings with category averages
category_avg_rating = df.groupby('product_category')['rating'].mean()
# Fill missing ratings based on product category
df['rating'] = df.apply(
    lambda row: category_avg_rating[row['product_category']]
    if pd.isnull(row['rating']) else row['rating'],
    axis=1
)
print("After filling missing ratings:")
print(df.isnull().sum()['rating'])

Output:
After filling missing ratings:
0
What just happened?
We calculated average ratings per category, then filled missing values intelligently. Electronics missing rating gets Electronics average, not overall average. Try this: Business rules like this prevent data scientists from making bad assumptions.
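The apply/lambda version above works, but it loops row by row. On millions of records a vectorized fill is usually much faster. A sketch of the same category-average logic using map and fillna (equivalent result, assuming category_avg_rating is computed as above):
# Vectorized alternative: look up each row's category average,
# then use it only where the original rating is missing
df['rating'] = df['rating'].fillna(df['product_category'].map(category_avg_rating))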
# Convert date strings to proper datetime
df['date'] = pd.to_datetime(df['date'])
# Add useful derived columns
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
print("New columns added:")
print(df[['date', 'year', 'month', 'day_of_week']].head(3))

Output:
New columns added:
        date  year  month day_of_week
0 2023-01-15  2023      1      Sunday
1 2023-01-15  2023      1      Sunday
2 2023-01-16  2023      1      Monday

What just happened?
We converted text dates to datetime objects that understand time. Then extracted year, month, day_of_week for analysis. Try this: Business teams love day-of-week patterns — weekend vs weekday sales differ significantly.
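To quantify the weekday effect yourself, a short sketch that splits revenue using the day_of_week column we just created:
# Compare average order revenue on weekends vs weekdays
weekend_mask = df['day_of_week'].isin(['Saturday', 'Sunday'])
print(f"Weekend avg revenue: {df.loc[weekend_mask, 'revenue'].mean():.0f}")
print(f"Weekday avg revenue: {df.loc[~weekend_mask, 'revenue'].mean():.0f}")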
Weekend revenue spikes 20% higher than weekdays - critical for inventory planning
The line chart reveals weekend shopping behavior that raw timestamps can't show. Saturday peaks at ₹18.9L while Tuesday dips to ₹11.8L.
This pattern drives business decisions. Flipkart schedules more delivery staff on weekends. Marketing campaigns launch Friday evenings. The ETL pipeline created this insight by transforming raw dates into meaningful dimensions.
Advanced Transformations
Real pipelines do complex transformations. Customer segmentation, profit calculations, fraud detection scores. Each transformation adds business value.
# Create customer segments based on spending
def categorize_customer(revenue):
    if revenue >= 50000:
        return 'Premium'
    elif revenue >= 20000:
        return 'Mid-tier'
    else:
        return 'Budget'
df['customer_segment'] = df['revenue'].apply(categorize_customer)
print("Customer segments:")
print(df['customer_segment'].value_counts())

Output:
Customer segments:
Budget      6847
Mid-tier    2653
Premium      500
What just happened?
We applied business logic to create customer segments. Orders ≥₹50K become Premium, ≥₹20K become Mid-tier. Most customers (6,847) are Budget shoppers. Try this: Segment thresholds should match your business model — B2B vs B2C thresholds differ dramatically.
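For large datasets the same segmentation can be expressed with pd.cut, which bins the whole column in one vectorized pass instead of calling a Python function per row. A sketch that produces the same labels as categorize_customer above:
# Vectorized alternative: bin order revenue into segments in one pass
df['customer_segment'] = pd.cut(
    df['revenue'],
    bins=[-float('inf'), 20000, 50000, float('inf')],
    labels=['Budget', 'Mid-tier', 'Premium'],
    right=False  # [20000, 50000) is Mid-tier, 50000 and above is Premium
)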
# Calculate profit margins (assume 30% cost)
df['cost'] = df['revenue'] * 0.30
df['profit'] = df['revenue'] - df['cost']
df['profit_margin'] = (df['profit'] / df['revenue']) * 100
# Flag high-risk orders (returns + low ratings)
df['risk_flag'] = df['returned'] | (df['rating'] < 3.0)
print("Profit and risk summary:")
print(f"Average profit margin: {df['profit_margin'].mean():.1f}%")
print(f"High-risk orders: {df['risk_flag'].sum()}")Profit and risk summary: Average profit margin: 70.0% High-risk orders: 1847
What just happened?
We calculated profit margins assuming 30% costs, then flagged 1,847 risky orders (returned items or ratings < 3.0). Try this: Risk scoring helps operations teams prioritize customer service efforts.
Budget customers (green) show highest engagement despite lower spending
The bubble chart maps three variables simultaneously: rating, revenue, and customer count. Budget segment (green bubbles) clusters around 4.0 rating but low revenue.
Premium customers spend more but aren't necessarily happier. This insight drives retention strategy — budget customers might upgrade with the right incentives, while premium customers need different satisfaction metrics.
Load Phase & Pipeline Automation
Loading isn't just saving files. Modern data warehouses need specific schemas, partitioning strategies, and incremental updates. The load phase determines query performance for months.
# LOAD: Prepare final dataset with optimized schema
final_columns = [
    'order_id', 'date', 'year', 'month', 'day_of_week',
    'customer_age', 'gender', 'city', 'customer_segment',
    'product_category', 'quantity', 'revenue', 'profit',
    'rating', 'risk_flag'
]
# Select and reorder columns for analytics
clean_df = df[final_columns].copy()
print(f"Final dataset ready: {clean_df.shape}")
print(f"Memory usage: {clean_df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")Final dataset ready: (10000, 15) Memory usage: 1.2 MB
What just happened?
We selected 15 optimized columns for the final dataset, dropping unnecessary fields. Memory usage dropped to 1.2MB for efficient storage and queries. Try this: Column order matters for query performance — put frequently filtered columns first.
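The last step is actually writing clean_df to its destination. A minimal sketch of two common targets, assuming a local SQLite file as a stand-in for the warehouse and pyarrow installed for Parquet; a real warehouse load would go through its bulk-load API:
import sqlite3

# Option 1: load into a relational table (SQLite stands in for the warehouse)
conn = sqlite3.connect('analytics.db')
clean_df.to_sql('orders_clean', conn, if_exists='replace', index=False)
conn.close()

# Option 2: write Parquet files partitioned by year and month,
# so analysts can read one month without scanning the full history
clean_df.to_parquet('orders_clean/', partition_cols=['year', 'month'], index=False)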
📊 Data Insight
Our ETL pipeline transformed 12 raw columns into 15 analytical columns, added 4 business-critical fields (customer_segment, profit, day_of_week, risk_flag), and processed 10,000 records in under 2 seconds. Production pipelines handle millions of records using the same patterns.
Transform phase takes 60% of pipeline runtime - optimization focus area
Pipeline performance data shows transformation bottlenecks clearly. Extract (0.3s) and Load (0.4s) run fast, but Transform (1.2s) dominates runtime.
This pattern holds across companies. Netflix spends 70% of pipeline time transforming viewing data. Amazon's recommendation ETL has similar transform-heavy profiles. Optimization efforts focus on vectorization, parallel processing, and smarter algorithms.
Common Pipeline Mistake
Loading untested transformations directly into production databases. Always run your transformations on a small sample (for example, df.sample(100)) and inspect the results before the full pipeline executes. One bad transformation can corrupt millions of records.
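One way to apply that advice: run the transform logic on a sample and assert a few invariants before the full run. A sketch; the specific checks are illustrative, so pick the invariants that matter for your data:
# Validate transformation logic on a 100-row sample before the full run
sample = df.sample(100, random_state=42).copy()
sample['rating'] = sample['rating'].fillna(sample['product_category'].map(category_avg_rating))
assert sample['rating'].isnull().sum() == 0, "ratings still missing after fill"
assert (sample['revenue'] >= 0).all(), "negative revenue found"
print("Sample checks passed, safe to run the full pipeline")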
Pipeline Monitoring
Production pipelines need monitoring dashboards. Data quality metrics, processing times, error rates. When Swiggy's order pipeline fails at 8 PM, restaurants don't get orders.
# Add pipeline monitoring metrics
pipeline_stats = {
    'records_processed': len(clean_df),
    'data_quality_score': (1 - clean_df.isnull().sum().sum() / clean_df.size) * 100,
    'avg_revenue': clean_df['revenue'].mean(),
    'processing_timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}
print("Pipeline monitoring dashboard:")
for key, value in pipeline_stats.items():
print(f"{key}: {value}")Pipeline monitoring dashboard: records_processed: 10000 data_quality_score: 100.0 avg_revenue: 18247.83 processing_timestamp: 2024-01-20 14:30:22
What just happened?
We created monitoring metrics that track pipeline health: 10,000 records processed, 100% data quality score, ₹18,248 average revenue. Try this: Set alerts when metrics drift beyond expected ranges — sudden revenue drops indicate data issues.
The monitoring reveals perfect data quality (100.0 score) and healthy average revenue (₹18,248). But what happens when these numbers change? Alert systems trigger when data quality drops below 95% or average revenue shifts by more than 20%.
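Those thresholds translate directly into a check at the end of the pipeline run. A sketch using the pipeline_stats dictionary above; the baseline revenue is a placeholder you would load from yesterday's metrics:
# Simple alerting: compare today's metrics against fixed thresholds
QUALITY_FLOOR = 95.0
baseline_revenue = 18000.0  # placeholder: yesterday's value from your metrics store

alerts = []
if pipeline_stats['data_quality_score'] < QUALITY_FLOOR:
    alerts.append("data quality below 95%")
if abs(pipeline_stats['avg_revenue'] - baseline_revenue) / baseline_revenue > 0.20:
    alerts.append("average revenue shifted by more than 20%")

if alerts:
    print("ALERT:", "; ".join(alerts))  # in production, notify the on-call engineer
else:
    print("Pipeline metrics within expected ranges")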
Pro Tip: Schedule ETL pipelines during low-traffic hours. Run heavy transformations at 2 AM, not 2 PM. Your database will thank you, and users won't experience slowdowns during peak hours.
Quiz
1. A Paytm data engineer reports their ETL pipeline spends 1.8 seconds extracting transaction data, 12.4 seconds in the middle phase, and 0.6 seconds loading to the data warehouse. Which phase is creating the bottleneck and why?
2. Your ecommerce ETL pipeline finds 150 missing customer ratings in a dataset of 50,000 orders. The missing ratings are distributed across Electronics (80 missing), Clothing (45 missing), and Books (25 missing). What's the best transformation approach?
3. Your production ETL pipeline processes daily order data for BigBasket. Yesterday it processed 95,000 orders with 99.2% data quality score. Today it processes 23,000 orders with 87.1% data quality score. What should trigger first?
Up Next
Hadoop
Master distributed computing for processing datasets too large for single machines - the foundation of modern big data architecture.