Data Science
Batch vs Streaming
Choose the right data processing approach for your business needs - build batch systems that handle massive historical datasets or streaming pipelines that react to live events in milliseconds.
Orders, clicks, payments flood your systems at unpredictable rates
Handle it now (streaming) or save for later (batch)?
Real-time alerts vs cost-efficient bulk analysis
The Core Difference
Think of batch processing like a monthly bank statement. Your bank doesn't print a new statement every time you make a purchase. Instead, they collect all transactions for 30 days, then process everything at once. That's batch - efficient, but not immediate.
Streaming processing works like fraud detection. The moment you swipe your card in a foreign country, algorithms analyze that transaction against your spending patterns. If it's suspicious, they block it instantly. That's streaming - immediate response to each data point.
Process data in large chunks
Higher latency, lower cost
Perfect for reports & analytics
Process data as it arrives
Lower latency, higher cost
Essential for real-time alerts
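The contrast above can be sketched in a few lines of Python. This is an illustrative toy (the event amounts are made up, not from the lesson's dataset): batch collects everything first and processes once, streaming reacts to each event as it arrives.

```python
# Hypothetical event amounts - illustration only
events = [120, 80, 200, 50]

# Batch style: collect everything first, then process once
batch_total = sum(events)
print(f"Batch total after collecting all events: {batch_total}")

# Streaming style: react to each event the moment it arrives
running_total = 0
for amount in events:
    running_total += amount
    print(f"Event {amount} processed, running total: {running_total}")
```

Both paths end at the same total; the difference is *when* you learn it - once at the end, or continuously along the way.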
When Batch Makes Sense
Batch processing shines when you need to analyze large historical datasets efficiently. Most companies run batch jobs overnight when servers aren't busy serving customers.
The scenario: You're a data analyst at BigBasket analyzing customer buying patterns from the past year to plan next quarter's inventory.
# Load entire historical dataset - typical batch approach
import pandas as pd
import numpy as np
# Read 12 months of e-commerce data at once
df = pd.read_csv('dataplexa_ecommerce.csv')
# Check the size of our batch dataset
print(f"Total records: {len(df):,}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

Total records: 847,293
Date range: 2023-01-01 to 2023-12-31
What just happened?
We loaded 847,293 records in one go - that's batch processing. The system waits to collect all data before analysis begins. Try this: Time how long this takes vs processing one day at a time.
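To try the timing comparison without the lesson's CSV on disk, here is a self-contained sketch using an in-memory file. The data is synthetic (30 hypothetical rows); the point is the pattern: `pd.read_csv` loads everything at once, while `chunksize` processes a few rows at a time.

```python
import io
import time
import pandas as pd

# Synthetic 30-day CSV held in memory - stands in for the real file
csv_data = "date,revenue\n" + "\n".join(
    f"2023-01-{d:02d},{d * 100}" for d in range(1, 31)
)

# Batch style: load the whole dataset in one go
start = time.perf_counter()
df = pd.read_csv(io.StringIO(csv_data))
total = df['revenue'].sum()
batch_time = time.perf_counter() - start

# Chunked style: process 5 rows at a time
start = time.perf_counter()
chunk_total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=5):
    chunk_total += chunk['revenue'].sum()
chunk_time = time.perf_counter() - start

print(f"One-shot total: {total}, chunked total: {chunk_total}")
print(f"Load-at-once: {batch_time:.4f}s, chunked: {chunk_time:.4f}s")
```

On a dataset this tiny the timings are noise, but swap in the 847k-row file and the overhead of repeated reads becomes visible.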
# Process entire year's data to find seasonal trends
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
# Aggregate revenue by month - batch calculation
monthly_revenue = df.groupby('month')['revenue'].sum() / 100000
# Show top 3 revenue months
top_months = monthly_revenue.nlargest(3)
print("Top revenue months (₹ Lakhs):")
print(top_months)

Top revenue months (₹ Lakhs):
month
11    892.4
12    873.2
10    841.7
Name: revenue, dtype: float64
What just happened?
The batch process revealed November (₹892.4L) as peak season - festival shopping! This analysis needed the complete dataset to spot patterns. Try this: Run the same analysis on just one month's data to see why batch is better for trends.
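Here is the "try this" made concrete. The monthly figures below are illustrative (only October-December match the lesson's output; the rest are assumed): with the full year you spot the peak instantly, while a single month gives you a number with no context.

```python
# Illustrative monthly revenue in ₹ Lakhs (Oct-Dec from the lesson,
# other months assumed for the sketch)
monthly_lakhs = {
    1: 540.0, 2: 520.0, 3: 555.0, 4: 560.0, 5: 570.0, 6: 565.0,
    7: 580.0, 8: 600.0, 9: 640.0, 10: 841.7, 11: 892.4, 12: 873.2,
}

# Full-year (batch) view: the festival-season peak jumps out
peak_month = max(monthly_lakhs, key=monthly_lakhs.get)
print(f"Peak month across the year: {peak_month} (₹{monthly_lakhs[peak_month]}L)")

# One-month view: a single number with nothing to compare against
june_only = monthly_lakhs[6]
print(f"June alone: ₹{june_only}L - high or low? Impossible to say.")
```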
Batch processing reveals clear seasonal patterns - October to December drive 60% more revenue than summer months
This chart shows why batch processing rocks for strategic planning. You can see the gradual build-up to festival season, helping inventory managers prepare 2-3 months ahead. Streaming data would give you daily fluctuations but miss this bigger picture.
The key insight? Revenue jumps 47% during October-December. That's the kind of pattern you need historical data to confirm - streaming can't provide this context.
When Streaming Is Essential
Streaming processing becomes critical when delay costs money or safety. Think fraud detection, stock trading, or real-time personalization. Every millisecond matters.
The scenario: You're building a fraud detection system for Paytm that needs to flag suspicious transactions within 100 milliseconds.
# Simulate streaming transaction processing
from datetime import datetime
import time
# Function to process each transaction as it arrives
def process_streaming_transaction(order_data):
    # Check transaction in real-time - no waiting for batch
    transaction_time = datetime.now()
    # Flag high-risk transactions immediately
    if order_data['revenue'] > 50000:  # Transactions above ₹50k
        risk_score = "HIGH"
    else:
        risk_score = "LOW"
    return {
        'order_id': order_data['order_id'],
        'processed_at': transaction_time,
        'risk_score': risk_score
    }

Function created: process_streaming_transaction
Ready to process transactions in real-time
Processing latency: < 50ms per transaction
# Simulate 5 incoming transactions - streaming style
sample_transactions = [
    {'order_id': 12001, 'revenue': 75000},   # High risk
    {'order_id': 12002, 'revenue': 2500},    # Low risk
    {'order_id': 12003, 'revenue': 125000},  # High risk
    {'order_id': 12004, 'revenue': 890},     # Low risk
    {'order_id': 12005, 'revenue': 67000}    # High risk
]
# Process each transaction immediately as it arrives
for transaction in sample_transactions:
    result = process_streaming_transaction(transaction)
    print(f"Order {result['order_id']}: {result['risk_score']} risk")
    time.sleep(0.05)  # Simulate small processing delay

Order 12001: HIGH risk
Order 12002: LOW risk
Order 12003: HIGH risk
Order 12004: LOW risk
Order 12005: HIGH risk
What just happened?
Each transaction got flagged within 50ms of arrival. Orders 12001, 12003, and 12005 triggered immediate alerts for amounts above ₹50k. Try this: Increase the sleep time to see how latency affects user experience.
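If you want to actually see the latency rather than take the 50ms figure on faith, wrap the check in `time.perf_counter`. This sketch reuses the same ₹50k threshold rule as the lesson; `check_transaction` is a simplified stand-in, not the real function above.

```python
import time

def check_transaction(order):
    # Same threshold rule as the lesson: above ₹50k flags HIGH
    return "HIGH" if order['revenue'] > 50000 else "LOW"

orders = [{'order_id': 12001, 'revenue': 75000},
          {'order_id': 12002, 'revenue': 2500}]

for order in orders:
    start = time.perf_counter()
    risk = check_transaction(order)
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"Order {order['order_id']}: {risk} risk in {latency_ms:.3f} ms")
```

A pure in-memory check like this runs in microseconds; in production the network hop and feature lookups dominate the latency budget.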
Streaming analysis caught 3 high-risk transactions worth ₹2.67L in real-time - preventing potential fraud before completion
📊 Data Insight
Streaming processing flagged 60% of transactions (3 of 5) as high-risk, each well within the 50ms latency budget. In batch mode, these ₹2.67L transactions would have completed before any fraud check ran.
Comparing Processing Patterns
Here's the brutal truth: most companies need both approaches. You stream critical events and batch everything else. The key is knowing which processing pattern fits your use case.
| Factor | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data Volume | Large chunks (GBs-TBs) | Small records (KBs-MBs) |
| Cost | Lower (scheduled resources) | Higher (always-on infrastructure) |
| Use Cases | Reports, ML training, ETL | Alerts, personalization, fraud detection |
# Hybrid approach - batch + streaming for different needs
batch_cost_per_gb = 0.50 # ₹0.50 per GB processed
stream_cost_per_hour = 25 # ₹25 per hour for streaming infrastructure
# Calculate cost for processing 100GB of transaction data
daily_data_volume = 100 # GB
# Batch: Process once at night
batch_daily_cost = daily_data_volume * batch_cost_per_gb
print(f"Batch processing cost: ₹{batch_daily_cost} per day")
# Streaming: Process 24/7
stream_daily_cost = 24 * stream_cost_per_hour
print(f"Stream processing cost: ₹{stream_daily_cost} per day")
# Cost difference
cost_difference = stream_daily_cost - batch_daily_cost
print(f"Streaming costs ₹{cost_difference} more per day ({cost_difference/batch_daily_cost*100:.0f}% increase)")

Batch processing cost: ₹50.0 per day
Stream processing cost: ₹600 per day
Streaming costs ₹550 more per day (1100% increase)
What just happened?
Streaming costs ₹550 more per day - that's ₹20,000 monthly! But for fraud prevention, this cost saves millions in potential losses. Try this: Calculate ROI by estimating fraud prevented vs infrastructure cost.
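Here is the "try this" ROI estimate sketched out. The streaming cost comes from the comparison above; the fraud volume and recovery rate are loud assumptions for illustration, not real figures.

```python
# Assumptions: the ₹2.67L flagged in the demo recurs monthly,
# and 80% of flagged fraud is actually prevented
stream_monthly_cost = 600 * 30        # ₹600/day from the cost comparison
fraud_caught_per_month = 267000       # ₹2.67L, assumed monthly
recovery_rate = 0.8                   # assumed prevention rate

fraud_prevented = fraud_caught_per_month * recovery_rate
roi = (fraud_prevented - stream_monthly_cost) / stream_monthly_cost
print(f"Monthly streaming cost: ₹{stream_monthly_cost:,}")
print(f"Fraud prevented: ₹{fraud_prevented:,.0f}")
print(f"Return: {roi:.1f}x the infrastructure spend")
```

Even under these rough assumptions the infrastructure pays for itself many times over - which is exactly why fraud detection is the canonical streaming use case.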
Streaming processing costs 12x more than batch but delivers instant results - choose based on business value, not just cost
Common Mistake: Processing Everything in Real-time
Many teams stream all data because it feels "modern." Reality check: Monthly reports don't need millisecond latency. Stream only what requires immediate action - alerts, recommendations, fraud detection. Batch the rest to save 80% on infrastructure costs.
Choosing Your Architecture
The decision framework is straightforward. Ask three questions: How fast do you need the answer? How much can you spend? What's the business impact of delay?
Choose Batch When:
- Processing historical data
- Cost optimization is priority
- Delay of hours/days is acceptable
- Large dataset analysis needed
💡 Perfect for: Reports, ML model training, data warehousing
Choose Stream When:
- Real-time alerts required
- User experience depends on speed
- Fraud/anomaly detection needed
- Live personalization essential
⚡ Perfect for: Fraud alerts, recommendations, monitoring
# Decision framework - categorize your use cases
use_cases = [
    {"task": "Daily sales report", "max_delay": "24 hours", "recommended": "batch"},
    {"task": "Fraud detection", "max_delay": "100 ms", "recommended": "stream"},
    {"task": "Customer segmentation", "max_delay": "1 week", "recommended": "batch"},
    {"task": "Product recommendations", "max_delay": "2 seconds", "recommended": "stream"},
    {"task": "Inventory planning", "max_delay": "1 day", "recommended": "batch"}
]
# Categorize by processing type
batch_tasks = [task for task in use_cases if task["recommended"] == "batch"]
stream_tasks = [task for task in use_cases if task["recommended"] == "stream"]
print(f"Batch processing tasks: {len(batch_tasks)}")
print(f"Stream processing tasks: {len(stream_tasks)}")
print("\nBatch tasks:")
for task in batch_tasks:
    print(f"- {task['task']} (delay OK: {task['max_delay']})")
print("\nStream tasks:")
for task in stream_tasks:
    print(f"- {task['task']} (delay OK: {task['max_delay']})")

Batch processing tasks: 3
Stream processing tasks: 2

Batch tasks:
- Daily sales report (delay OK: 24 hours)
- Customer segmentation (delay OK: 1 week)
- Inventory planning (delay OK: 1 day)

Stream tasks:
- Fraud detection (delay OK: 100 ms)
- Product recommendations (delay OK: 2 seconds)
What just happened?
We classified 5 common data tasks - 60% fit batch processing. Most business analytics can tolerate hours or days of delay, making batch the cost-effective choice. Try this: List your company's top 10 data tasks and categorize them.
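You can turn the "how fast do you need the answer" question into a one-line helper for categorizing your own task list. The one-hour cutoff here is an assumption for the sketch, not a rule from the lesson - pick the threshold that matches your cost tolerance.

```python
def recommend_processing(max_delay_seconds):
    # Assumption: if the business can wait an hour or more,
    # batch is usually the cheaper choice
    return "batch" if max_delay_seconds >= 3600 else "stream"

print(recommend_processing(24 * 3600))  # daily report
print(recommend_processing(0.1))        # fraud check
```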
The pattern is clear. Batch dominates for analytical workloads - reports, forecasting, customer segmentation. Stream wins for operational workloads - alerts, personalization, real-time monitoring.
But here's what experienced data engineers know: Lambda architecture combines both. You stream critical events for immediate response, then batch the same data later for deep analysis. Best of both worlds, if you can handle the complexity.
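The Lambda idea fits in a few lines: every event passes through a fast per-event check *and* lands in a buffer for later batch analysis. This is a toy sketch with made-up orders, standing in for a real speed layer (e.g. a stream processor) and batch layer (e.g. a warehouse job).

```python
# Hypothetical incoming orders - illustration only
events = [{'order_id': 1, 'revenue': 75000},
          {'order_id': 2, 'revenue': 2500},
          {'order_id': 3, 'revenue': 125000}]

batch_buffer = []

# Speed layer: instant per-event decision (same ₹50k rule as earlier)
for event in events:
    if event['revenue'] > 50000:
        print(f"ALERT: order {event['order_id']} flagged in real time")
    batch_buffer.append(event)  # every event also lands in the batch store

# Batch layer: deep analysis later, over the accumulated data
total = sum(e['revenue'] for e in batch_buffer)
print(f"Batch layer aggregates {len(batch_buffer)} events, ₹{total:,} total")
```

The same event serves two masters: the speed layer answers "act now?", the batch layer answers "what's the pattern?" - at the cost of maintaining two pipelines over one data stream.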
Quiz
1. Your e-commerce company needs to generate monthly customer behavior reports for the marketing team. The reports analyze purchase patterns across 500,000 customers over the past 30 days. Which processing approach should you choose and why?
2. A payment gateway processes credit card transactions and needs to detect fraudulent activities. If they use streaming processing, what is the typical maximum latency they can achieve for fraud detection alerts?
3. Your company has both real-time fraud detection needs and monthly business reporting requirements. What's the most cost-effective architecture strategy?
Up Next
ETL Pipelines
Now that you understand when to process data in batches versus streams, learn how to build ETL pipelines that Extract, Transform, and Load data efficiently using both approaches.