Data Science Lesson 58 – Spark Basics | Dataplexa
Big Data · Lesson 58

Spark Basics

Master Apache Spark fundamentals, RDDs, and distributed computing concepts through practical examples with real e-commerce data processing scenarios.

Apache Spark changed everything for big data processing. Where Hadoop MapReduce took hours, Spark delivers results in minutes. The secret? In-memory computing and lazy evaluation that make data engineers productive again.
1. Spark Context Creation
2. RDD Operations
3. Data Processing
4. Results & Actions

What Makes Spark Special

Spark runs computations up to 100x faster than Hadoop MapReduce. The magic happens through four core principles that data engineers at Flipkart and Swiggy rely on daily.

In-Memory Computing: keeps data in RAM between operations instead of writing to disk repeatedly.
Lazy Evaluation: builds the execution plan first, optimizes it, then runs everything at once.
Fault Tolerance: automatically recovers lost data using lineage information.
Multi-Language APIs: works seamlessly with Python, Scala, Java, R, and SQL.

The real breakthrough is Resilient Distributed Datasets (RDDs). Think of RDDs as smart lists that automatically spread across multiple machines. When one machine fails, Spark rebuilds just the lost pieces.

Setting Up Your First Spark Session

The scenario: You're a data analyst at Zomato processing millions of daily orders. The CSV files are too big for pandas, and you need distributed computing power.
# Import PySpark libraries for distributed computing
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Configure Spark with application name and memory settings
conf = SparkConf().setAppName("ZomatoAnalysis")
conf.set("spark.executor.memory", "2g")
conf.set("spark.driver.memory", "1g")

What just happened?

We configured Spark to use 2GB of memory per executor (the worker processes) and 1GB for the driver program. The driver coordinates everything while executors do the actual work. Try this: Increase executor memory to 4g if you have RAM available.
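If you want to double-check the settings before building the session, SparkConf can read them back. A small optional check (the expected values assume the configuration above):

# Optional sanity check: read back the values we just set
print(conf.get("spark.executor.memory"))   # expected: 2g
print(conf.get("spark.driver.memory"))     # expected: 1g

# toDebugString() lists every configured key-value pair, one per line
print(conf.toDebugString())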

# Create SparkSession - the entry point for DataFrame API
spark = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()

# Get the underlying SparkContext for RDD operations
sc = spark.sparkContext

What just happened?

Spark created a web UI at localhost:4040 where you can monitor job progress. The SparkContext handles low-level RDD operations while SparkSession provides DataFrame APIs. Try this: Visit the web UI to see cluster information.
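A few one-liners are useful for confirming what the session is actually running on. This is a minimal sketch; the exact values depend on your machine and Spark version:

# Inspect the running session (values vary by environment)
print(spark.version)           # installed Spark version
print(sc.master)               # e.g. local[*] on a laptop
print(sc.defaultParallelism)   # default number of partitions Spark will use
print(sc.uiWebUrl)             # address of the web UI mentioned above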

Creating and Manipulating RDDs

RDDs are the foundation of everything in Spark. Every DataFrame, every SQL query eventually becomes RDD operations under the hood. Here's how they work with real data.
# Create RDD from a Python list of e-commerce data
orders_data = [
    "1001,2023-01-15,Electronics,Smartphone,45000.0",
    "1002,2023-01-15,Clothing,T-Shirt,899.0", 
    "1003,2023-01-16,Food,Pizza,450.0"
]

# Parallelize creates RDD distributed across cluster nodes
orders_rdd = sc.parallelize(orders_data)

What just happened?

Spark split our data into partitions automatically (4 in this run - the default usually matches the number of CPU cores on a local machine). Each partition can be processed on a different machine simultaneously. The RDD object doesn't hold computed results yet - it's just a blueprint for the computation. Try this: Use orders_rdd.getNumPartitions() to check the partition count.
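You can also check and control the partition count yourself. A quick sketch on our three-row sample (orders_rdd_2 is just an illustrative name; the default count depends on your CPU cores):

# How many partitions did Spark choose?
print(orders_rdd.getNumPartitions())

# Set the partition count explicitly when parallelizing
orders_rdd_2 = sc.parallelize(orders_data, 2)

# glom() groups the rows by partition - fine to collect on tiny data
print(orders_rdd_2.glom().collect())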

# Transformation: split each line by comma to create structured data
split_orders = orders_rdd.map(lambda line: line.split(','))

# Action: collect brings data back to driver - triggers actual computation
result = split_orders.collect()
print("First order:", result[0])

What just happened?

The map() transformation created a new RDD but didn't execute yet. Only when we called collect() did Spark actually process the data and return results. This is lazy evaluation in action. Try this: Replace collect() with take(2) to get only first 2 records.

Pro Tip: Never use collect() on huge datasets - it brings ALL data to one machine. Use take(n) or sample() for large data exploration.
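On our three-row RDD either approach is safe, but the pattern below is what you would reach for on real data. Note that sample() is itself a transformation, so it stays lazy until you call an action like take():

# Driver-safe alternatives to collect()
preview = split_orders.take(2)   # only the first 2 records come back
print(preview)

# Random ~50% sample without replacement (seed makes it repeatable)
sampled = split_orders.sample(withReplacement=False, fraction=0.5, seed=42)
print(sampled.take(2))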

Transformations vs Actions

Understanding the difference between transformations and actions is crucial. Transformations are lazy (create execution plans), actions are eager (execute immediately). Here's the complete breakdown:
Type           | Operation | Purpose                           | Example Use
Transformation | map()     | Apply function to each element    | Parse CSV strings
Transformation | filter()  | Keep elements matching condition  | High-value orders only
Action         | collect() | Return all elements to driver     | Get final results
Action         | count()   | Count total elements              | Total order count
The scenario: A data engineer at OYO needs to analyze booking patterns. They're filtering millions of records to find high-value bookings above ₹5000.
# Extract revenue from each order (5th column, index 4)
revenue_rdd = split_orders.map(lambda order: float(order[4]))

# Filter for high-value orders above ₹5000
high_value_orders = revenue_rdd.filter(lambda revenue: revenue > 5000)

# Count how many high-value orders exist
high_value_count = high_value_orders.count()

What just happened?

We chained two transformations, map() → filter(), but nothing executed until count(). Spark optimized the entire pipeline and found only 1 smartphone order above ₹5000. Try this: Use high_value_orders.take(1) to see the revenue value that passed the filter.
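Because both steps are lazy, all Spark has recorded so far is the lineage of the pipeline. You can peek at that plan before running another action; in PySpark toDebugString() returns bytes, hence the decode():

# Inspect the lineage Spark recorded for the chained transformations
print(high_value_orders.toDebugString().decode())

# take(1) triggers execution and returns the one revenue value that passed
print(high_value_orders.take(1))   # expected: [45000.0]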

Electronics dominates with highest average order values, making it perfect for high-value filtering operations

Electronics clearly drives revenue with ₹28,400 average orders - nearly 23x higher than clothing. This pattern explains why most e-commerce companies focus their recommendation engines on electronics first. But here's the business insight: while electronics has high individual order values, the volume is typically lower than clothing or food categories. Smart companies like Amazon India optimize differently for each category.

Working with Key-Value Pairs

Many Spark operations work on key-value pairs - think SQL GROUP BY operations. The reduceByKey() function is your friend for aggregations like calculating total revenue per product category.
# Create key-value pairs: (category, revenue)
category_revenue = split_orders.map(
    lambda order: (order[2], float(order[4]))  # category, revenue
)

# Show first few key-value pairs
sample_pairs = category_revenue.take(3)
print("Category-Revenue pairs:", sample_pairs)
# Aggregate revenue by category using reduceByKey
total_by_category = category_revenue.reduceByKey(lambda a, b: a + b)

# Sort by revenue descending to see top categories
sorted_categories = total_by_category.sortBy(lambda x: x[1], ascending=False)

# Get results
category_totals = sorted_categories.collect()
print("Revenue by category:", category_totals)

What just happened?

The reduceByKey() combined all revenues for each category efficiently across the cluster. Then sortBy() ranked categories by total revenue. Electronics leads with ₹45,000 total revenue. Try this: Use groupByKey() instead to see the difference in performance.
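For comparison, here is the same aggregation written with groupByKey(). It gives identical totals on this sample, but it shuffles every individual value across the network before summing, while reduceByKey() pre-combines values inside each partition first - which is why reduceByKey() is usually preferred:

# Same totals, heavier shuffle: all values move before the sum happens
grouped_totals = category_revenue.groupByKey().mapValues(sum)
print(grouped_totals.collect())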

📊 Data Insight

Electronics generates roughly 97% of total revenue despite being only 33% of orders. This concentration suggests optimizing the electronics supply chain could drive significant business impact.

Loading Real Data Files

Real data comes from files, not hardcoded lists. Spark excels at reading massive CSV files distributed across multiple machines. Here's how HDFC Bank processes daily transaction logs.
# Load CSV file into RDD - Spark handles file splitting automatically
file_rdd = sc.textFile("dataplexa_ecommerce.csv")

# Skip header row (first line contains column names)
header = file_rdd.first()
data_rdd = file_rdd.filter(lambda line: line != header)

# Check how many data rows we have
row_count = data_rdd.count()
print(f"Loaded {row_count} orders from CSV")
# Parse CSV rows into structured format
def parse_order(line):
    fields = line.split(',')
    return {
        'order_id': int(fields[0]),
        'date': fields[1], 
        'city': fields[4],
        'category': fields[5],
        'revenue': float(fields[8])
    }

# Apply parsing to create structured RDD
structured_rdd = data_rdd.map(parse_order)

What just happened?

Spark automatically split our 12.8 MB file into multiple partitions for parallel processing (8 in this run; the exact count depends on the file system and Spark's input-split settings). The parse_order() function converted each CSV line into a Python dictionary with proper data types. Try this: Use structured_rdd.take(2) to see the parsed format.
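Real CSV exports usually contain a few malformed rows. A common defensive pattern - sketched here with a hypothetical safe_parse() wrapper - is to drop lines that fail to parse instead of letting one bad row crash the job:

# Hypothetical defensive parser: return None for rows that fail to parse
def safe_parse(line):
    try:
        return parse_order(line)
    except (ValueError, IndexError):
        return None

# Keep only the rows that parsed cleanly
clean_rdd = data_rdd.map(safe_parse).filter(lambda order: order is not None)
print(clean_rdd.count())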

Mumbai leads with 37% of all orders, followed by Delhi and Bangalore as major e-commerce hubs

The city distribution reveals a classic Indian e-commerce pattern. Mumbai accounts for 37% of orders, likely due to higher disposable income and digital adoption rates. Delhi and Bangalore together contribute another 46% of volume. This geographic concentration has massive implications for logistics optimization. Companies like BigBasket focus their dark stores in these three cities first, knowing they'll capture 83% of total order volume.
# Complex analysis: Find top cities by electronics revenue
electronics_orders = structured_rdd.filter(
    lambda order: order['category'] == 'Electronics'
)

# Create city-revenue pairs and aggregate
city_electronics_revenue = electronics_orders.map(
    lambda order: (order['city'], order['revenue'])
).reduceByKey(lambda a, b: a + b)

# Get top 3 cities for electronics
top_cities = city_electronics_revenue.top(3, key=lambda x: x[1])
print("Top electronics cities:", top_cities)

What just happened?

We chained multiple operations: filter → map → reduceByKey → top. Mumbai dominates electronics with ₹12.54 crores revenue - nearly 44% of total electronics sales. The top() action sorted by revenue values. Try this: Change the filter to 'Clothing' to see different city patterns.
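If you plan to run several category analyses against the same parsed data, caching the RDD avoids re-reading and re-parsing the CSV on every action. A minimal sketch:

# Keep the parsed RDD in memory across multiple analyses
structured_rdd.cache()

# The first action materializes the cache; later filters reuse it
electronics_count = structured_rdd.filter(lambda o: o['category'] == 'Electronics').count()
clothing_count = structured_rdd.filter(lambda o: o['category'] == 'Clothing').count()

# Release the memory when you are done
structured_rdd.unpersist()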

Electronics orders grow consistently from 8.5K to 12.2K monthly, showing 43% growth trajectory

The trend analysis shows electronics orders growing 43% over the period, outpacing overall order volume. This divergence suggests customers are upgrading to higher-value purchases over time - exactly what companies like Myntra see during digital adoption waves. But here's what most miss: the gap between total and electronics orders is widening. This means other categories are stagnating or declining, which creates both risk and opportunity for portfolio optimization.

Common Mistake

Using collect() on large RDDs crashes the driver with an OutOfMemoryError. Use take(n) or sample() to inspect a subset, or write results to external storage (for example with saveAsTextFile()) instead of bringing everything back to one machine.
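Instead of collecting, let the executors write the results out themselves. The output path below is just a placeholder (the directory must not already exist):

# Write results from the executors - nothing funnels through the driver
sorted_categories.saveAsTextFile("output/category_totals")

# Or take a small, driver-safe preview for inspection
print(sorted_categories.take(5))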

Quiz

1. A Paytm data engineer creates an RDD with 5 transformations but notices no computation happens. Why doesn't Spark process the data immediately?


2. You're analyzing Swiggy order data with millions of records. What's the key difference between reduceByKey() and groupByKey() for calculating total revenue per restaurant?


3. Your Spark job processes 50GB of Flipkart transaction data but crashes with OutOfMemoryError when you call collect() on the final RDD. What's the best fix?


Up Next

Spark DataFrames

Build on RDD fundamentals to master Spark's high-level DataFrame API with built-in optimizations and SQL support for faster development.