Data Engineering · Lesson 55

Batch vs Streaming

Choose the right data processing approach for your business needs - build batch systems that handle massive historical datasets or streaming pipelines that react to live events in milliseconds.

1. Data Arrives - Orders, clicks, payments flood your systems at unpredictable rates
2. Processing Decision - Handle it now (streaming) or save for later (batch)?
3. Business Impact - Real-time alerts vs cost-efficient bulk analysis

The Core Difference

Think of batch processing like a monthly bank statement. Your bank doesn't print a new statement every time you make a purchase. Instead, they collect all transactions for 30 days, then process everything at once. That's batch - efficient, but not immediate.

Streaming processing works like fraud detection. The moment you swipe your card in a foreign country, algorithms analyze that transaction against your spending patterns. If it's suspicious, they block it instantly. That's streaming - immediate response to each data point.

Batch Processing
  • Process data in large chunks
  • Higher latency, lower cost
  • Perfect for reports & analytics

Stream Processing
  • Process data as it arrives
  • Lower latency, higher cost
  • Essential for real-time alerts

When Batch Makes Sense

Batch processing shines when you need to analyze large historical datasets efficiently. Most companies run batch jobs overnight when servers aren't busy serving customers.

The scenario: You're a data analyst at BigBasket analyzing customer buying patterns from the past year to plan next quarter's inventory.
# Load entire historical dataset - typical batch approach
import pandas as pd

# Read 12 months of e-commerce data at once
df = pd.read_csv('dataplexa_ecommerce.csv')

# Check the size of our batch dataset
print(f"Total records: {len(df):,}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

What just happened?

We loaded 847,293 records in one go - that's batch processing. The system waits to collect all data before analysis begins. Try this: Time how long this takes vs processing one day at a time.
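
If you want to try that timing experiment, here's a minimal sketch. It assumes the same dataplexa_ecommerce.csv file, and the 10,000-row chunk size is an arbitrary stand-in for day-sized pieces:

# Time one bulk load against a chunked, piece-by-piece read
import time
import pandas as pd

# Bulk: load everything at once (classic batch)
start = time.perf_counter()
df_bulk = pd.read_csv('dataplexa_ecommerce.csv')
bulk_seconds = time.perf_counter() - start

# Chunked: read in small pieces, like processing one day at a time
start = time.perf_counter()
rows = sum(len(chunk) for chunk in pd.read_csv('dataplexa_ecommerce.csv', chunksize=10000))
chunked_seconds = time.perf_counter() - start

print(f"Bulk load: {bulk_seconds:.2f}s for {len(df_bulk):,} rows")
print(f"Chunked load: {chunked_seconds:.2f}s for {rows:,} rows")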

Now batch process this data to find seasonal patterns:
# Process entire year's data to find seasonal trends
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month

# Aggregate revenue by month - batch calculation
monthly_revenue = df.groupby('month')['revenue'].sum() / 100000

# Show top 3 revenue months
top_months = monthly_revenue.nlargest(3)
print("Top revenue months (₹ Lakhs):")
print(top_months)

What just happened?

The batch process revealed November (₹892.4L) as peak season - festival shopping! This analysis needed the complete dataset to spot patterns. Try this: Run the same analysis on just one month's data to see why batch is better for trends.
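
To see why, a quick sketch like this reuses the df from above but keeps only one month - June here is an arbitrary pick:

# Rerun the revenue sum on a single month - the trend disappears
june = df[df['month'] == 6]
june_revenue = june['revenue'].sum() / 100000

print(f"June revenue (₹ Lakhs): {june_revenue:.1f}")
# One number, no comparison: the seasonal pattern only emerges
# when the full year is batched together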

Batch processing reveals clear seasonal patterns - October to December drive 60% more revenue than summer months

This chart shows why batch processing rocks for strategic planning. You can see the gradual build-up to festival season, helping inventory managers prepare 2-3 months ahead. Streaming data would give you daily fluctuations but miss this bigger picture.

The key insight? Revenue jumps 47% during October-December. That's the kind of pattern you need historical data to confirm - streaming can't provide this context.
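
If you want to verify an uplift figure like that yourself, here's a small sketch reusing monthly_revenue from above; treating April-June as "summer" is an assumption about the dataset:

# Compare the festival quarter against summer months
festival_avg = monthly_revenue.loc[[10, 11, 12]].mean()  # Oct-Dec
summer_avg = monthly_revenue.loc[[4, 5, 6]].mean()       # Apr-Jun (assumed summer)

uplift = (festival_avg - summer_avg) / summer_avg * 100
print(f"Festival quarter uplift over summer: {uplift:.0f}%")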

When Streaming Is Essential

Streaming processing becomes critical when delay costs money or safety. Think fraud detection, stock trading, or real-time personalization. Every millisecond matters.

The scenario: You're building a fraud detection system for Paytm that needs to flag suspicious transactions within 100 milliseconds.
# Simulate streaming transaction processing
from datetime import datetime
import time

# Function to process each transaction as it arrives
def process_streaming_transaction(order_data):
    # Check transaction in real-time - no waiting for batch
    transaction_time = datetime.now()
    
    # Flag high-risk transactions immediately
    if order_data['revenue'] > 50000:  # Transactions above ₹50k
        risk_score = "HIGH"
    else:
        risk_score = "LOW"
        
    return {
        'order_id': order_data['order_id'],
        'processed_at': transaction_time,
        'risk_score': risk_score
    }

# Simulate 5 incoming transactions - streaming style
sample_transactions = [
    {'order_id': 12001, 'revenue': 75000},  # High risk
    {'order_id': 12002, 'revenue': 2500},   # Low risk  
    {'order_id': 12003, 'revenue': 125000}, # High risk
    {'order_id': 12004, 'revenue': 890},    # Low risk
    {'order_id': 12005, 'revenue': 67000}   # High risk
]

# Process each transaction immediately as it arrives
for transaction in sample_transactions:
    result = process_streaming_transaction(transaction)
    print(f"Order {result['order_id']}: {result['risk_score']} risk")
    time.sleep(0.05)  # Simulate small processing delay

What just happened?

Each transaction was flagged within its simulated 50ms processing window - comfortably inside the 100ms budget. Orders 12001, 12003, and 12005 triggered immediate alerts for amounts above ₹50k. Try this: Increase the sleep time to see how latency affects user experience.
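
Rather than trusting the print order, you can time each call directly. A minimal sketch, using only the standard-library time module and the function defined above:

# Measure per-transaction latency against the 100ms budget
import time

for transaction in sample_transactions:
    start = time.perf_counter()
    result = process_streaming_transaction(transaction)
    latency_ms = (time.perf_counter() - start) * 1000
    verdict = "within budget" if latency_ms < 100 else "too slow"
    print(f"Order {result['order_id']}: {latency_ms:.3f}ms ({verdict})")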

Streaming analysis caught 3 high-risk transactions worth ₹2.67L in real-time - preventing potential fraud before completion

📊 Data Insight

Streaming processing flagged 60% of transactions as high-risk within 50ms average latency. In batch mode, these ₹2.67L transactions would have completed before any fraud check ran.

Comparing Processing Patterns

Here's the brutal truth: most companies need both approaches. You stream critical events and batch everything else. The key is knowing which processing pattern fits your use case.

Factor      | Batch Processing            | Stream Processing
Latency     | Minutes to hours            | Milliseconds to seconds
Data Volume | Large chunks (GBs-TBs)      | Small records (KBs-MBs)
Cost        | Lower (scheduled resources) | Higher (always-on infrastructure)
Use Cases   | Reports, ML training, ETL   | Alerts, personalization, fraud detection

The scenario: HDFC Bank runs both systems - batch for monthly statements, streaming for transaction alerts.
# Hybrid approach - batch + streaming for different needs
batch_cost_per_gb = 0.50   # ₹0.50 per GB processed
stream_cost_per_hour = 25  # ₹25 per hour for streaming infrastructure

# Calculate cost for processing 100GB of transaction data
daily_data_volume = 100  # GB

# Batch: Process once at night
batch_daily_cost = daily_data_volume * batch_cost_per_gb
print(f"Batch processing cost: ₹{batch_daily_cost} per day")

# Streaming: Process 24/7
stream_daily_cost = 24 * stream_cost_per_hour  
print(f"Stream processing cost: ₹{stream_daily_cost} per day")

# Cost difference
cost_difference = stream_daily_cost - batch_daily_cost
print(f"Streaming costs ₹{cost_difference} more per day ({cost_difference/batch_daily_cost*100:.0f}% increase)")

What just happened?

Streaming costs ₹550 more per day - roughly ₹16,500 every month. But for fraud prevention, that cost can save millions in potential losses. Try this: Calculate ROI by estimating fraud prevented vs infrastructure cost.
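
Here's one way to run that ROI estimate, reusing cost_difference from the code above; the fraud-prevented figure is a made-up assumption for illustration, not a real number:

# Back-of-the-envelope ROI - fraud figure below is an assumption
extra_monthly_cost = cost_difference * 30    # ₹550/day -> ₹16,500/month
assumed_fraud_prevented = 2500000            # Assume ₹25L of fraud stopped per month

roi = assumed_fraud_prevented / extra_monthly_cost
print(f"Extra streaming spend: ₹{extra_monthly_cost:,}/month")
print(f"Assumed fraud prevented: ₹{assumed_fraud_prevented:,}/month")
print(f"Every ₹1 of infrastructure protects about ₹{roi:.0f}")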

Streaming processing costs 12x more than batch but delivers instant results - choose based on business value, not just cost

Common Mistake: Processing Everything in Real-time

Many teams stream all data because it feels "modern." Reality check: Monthly reports don't need millisecond latency. Stream only what requires immediate action - alerts, recommendations, fraud detection. Batch the rest to save 80% on infrastructure costs.
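
To put a rough number on those savings, this sketch reuses the cost rates from the HDFC example; the assumption that only 20% of data needs streaming (and that streaming cost scales linearly with that share) is illustrative:

# All-streaming vs hybrid split - 20/80 split and linear scaling assumed
all_stream = 24 * stream_cost_per_hour                     # Stream 100%: ₹600/day
hybrid = (24 * stream_cost_per_hour * 0.20                 # Stream the critical 20%
          + daily_data_volume * 0.80 * batch_cost_per_gb)  # Batch the remaining 80%

savings_pct = (all_stream - hybrid) / all_stream * 100
print(f"All-streaming: ₹{all_stream}/day, hybrid: ₹{hybrid:.0f}/day")
print(f"Hybrid saves about {savings_pct:.0f}%")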

Choosing Your Architecture

The decision framework is straightforward. Ask three questions: How fast do you need the answer? How much can you spend? What's the business impact of delay?

Choose Batch When:

  • Processing historical data
  • Cost optimization is priority
  • Delay of hours/days is acceptable
  • Large dataset analysis needed

💡 Perfect for: Reports, ML model training, data warehousing

Choose Stream When:

  • Real-time alerts required
  • User experience depends on speed
  • Fraud/anomaly detection needed
  • Live personalization essential

⚡ Perfect for: Fraud alerts, recommendations, monitoring

# Decision framework - categorize your use cases
use_cases = [
    {"task": "Daily sales report", "max_delay": "24 hours", "recommended": "batch"},
    {"task": "Fraud detection", "max_delay": "100 ms", "recommended": "stream"},
    {"task": "Customer segmentation", "max_delay": "1 week", "recommended": "batch"},
    {"task": "Product recommendations", "max_delay": "2 seconds", "recommended": "stream"},
    {"task": "Inventory planning", "max_delay": "1 day", "recommended": "batch"}
]

# Categorize by processing type
batch_tasks = [task for task in use_cases if task["recommended"] == "batch"]
stream_tasks = [task for task in use_cases if task["recommended"] == "stream"]

print(f"Batch processing tasks: {len(batch_tasks)}")
print(f"Stream processing tasks: {len(stream_tasks)}")

print("\nBatch tasks:")
for task in batch_tasks:
    print(f"- {task['task']} (delay OK: {task['max_delay']})")
    
print("\nStream tasks:")  
for task in stream_tasks:
    print(f"- {task['task']} (delay OK: {task['max_delay']})")

What just happened?

We classified 5 common data tasks - 60% fit batch processing. Most business analytics can tolerate hours or days of delay, making batch the cost-effective choice. Try this: List your company's top 10 data tasks and categorize them.
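
When you do that exercise, a tiny helper like this can make the call mechanical; the one-hour cutoff is an assumed rule of thumb, not a standard:

# Recommend batch or stream from a latency budget - cutoff is a rule of thumb
def recommend(max_delay_seconds):
    return "batch" if max_delay_seconds >= 3600 else "stream"

print(recommend(24 * 3600))  # Daily sales report -> batch
print(recommend(0.1))        # Fraud detection    -> stream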

The pattern is clear. Batch dominates for analytical workloads - reports, forecasting, customer segmentation. Stream wins for operational workloads - alerts, personalization, real-time monitoring.

But here's what experienced data engineers know: Lambda architecture combines both. You stream critical events for immediate response, then batch the same data later for deep analysis. Best of both worlds, if you can handle the complexity.
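
As a toy illustration of the Lambda idea - simplified far below what real systems built on tools like Kafka and Spark do - this sketch sends each event through a speed layer immediately and also banks it for a later batch pass:

# Toy Lambda architecture: every event hits both layers
batch_buffer = []  # Batch layer input: events stored for later deep analysis

def speed_layer(event):
    # Streaming path: shallow check, instant reaction
    if event['revenue'] > 50000:
        print(f"ALERT: order {event['order_id']} flagged in real time")

for event in sample_transactions:   # Reusing the transactions from earlier
    speed_layer(event)              # React now
    batch_buffer.append(event)      # Reprocess tonight with full context

total_lakhs = sum(e['revenue'] for e in batch_buffer) / 100000
print(f"Batch layer holds {len(batch_buffer)} events worth ₹{total_lakhs:.2f}L")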

Quiz

1. Your e-commerce company needs to generate monthly customer behavior reports for the marketing team. The reports analyze purchase patterns across 500,000 customers over the past 30 days. Which processing approach should you choose and why?


2. A payment gateway processes credit card transactions and needs to detect fraudulent activities. With stream processing, what latency can they typically achieve for fraud detection alerts?


3. Your company has both real-time fraud detection needs and monthly business reporting requirements. What's the most cost-effective architecture strategy?


Up Next

ETL Pipelines

Now that you understand when to process data in batches versus streams, learn how to build ETL pipelines that Extract, Transform, and Load data efficiently using both approaches.