Data Science Lesson 59 – Spark DataFrames | Dataplexa
Big Data · Lesson 59

Spark DataFrames

Transform massive datasets with distributed computing power using Spark's DataFrame API for lightning-fast analysis across clusters.

1. Initialize Spark Session
2. Create DataFrame from Data Sources
3. Apply Transformations (Lazy Evaluation)
4. Execute Actions (Trigger Computation)

Why Spark DataFrames Matter

A Pandas DataFrame hits a wall around 10GB of data because everything has to fit in a single machine's memory. Your machine runs out of RAM. Game over. Spark DataFrames solve this by distributing data across multiple machines and processing chunks in parallel.

Think of it like this — instead of one person counting a million coins, you give 100 people 10,000 coins each. They count simultaneously, then report back. Spark automatically handles the distribution, fault tolerance, and result aggregation.

What is a Spark DataFrame?

A distributed collection of data organized into named columns. Unlike Pandas DataFrames that live in one machine's memory, Spark DataFrames are spread across multiple worker nodes in a cluster. They support SQL queries, have automatic optimization, and can handle petabytes of data.

The secret sauce is lazy evaluation. Spark doesn't actually process your data until you call an action like show() or collect(). Instead, it builds an execution plan and optimizes it before running.

Setting Up Your First Spark Session

The scenario: You're a data engineer at Flipkart analyzing 50GB of daily transaction logs. Pandas crashes instantly. Time to fire up Spark.

# Import PySpark SQL module for DataFrame operations
from pyspark.sql import SparkSession

# Create Spark session with custom configuration
spark = SparkSession.builder \
    .appName("Flipkart_Transaction_Analysis") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

What just happened?

The SparkSession is your entry point to all Spark functionality. The adaptive configs enable automatic optimization based on runtime statistics. Try this: Check your Spark UI at http://localhost:4040 to monitor jobs.
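A quick sanity check, assuming the session above came up cleanly: the session object can tell you the Spark version, where its web UI lives, and whether the config values actually took effect.

# Confirm the session is alive and which Spark version it runs
print(f"Spark version: {spark.version}")

# The driver exposes its web UI address (typically http://localhost:4040)
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")

# Verify that the adaptive execution setting was applied
print(spark.conf.get("spark.sql.adaptive.enabled"))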

Creating DataFrames from Data Sources

Spark DataFrames can read from CSV, JSON, Parquet, databases, and even streaming sources. The API stays consistent regardless of data size or location.

# Read CSV file into Spark DataFrame
# inferSchema automatically detects column data types
df = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .csv("dataplexa_ecommerce.csv")

# Display basic information about the DataFrame
print(f"Total rows: {df.count()}")
print(f"Number of columns: {len(df.columns)}")

What just happened?

Spark read the entire CSV and inferred data types automatically. The count() triggered actual data processing — that's an action, not a transformation. Try this: Skip inferSchema for faster loading if you know your schema.
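If you do know the schema, a hand-written one might look like the sketch below. The column list is an assumption based on the fields this lesson uses, so adjust it to match your actual CSV.

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, BooleanType)

# Declaring the schema up front lets Spark skip the extra inference pass
# (column list is assumed from the fields used in this lesson)
schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("city", StringType(), True),
    StructField("product_category", StringType(), True),
    StructField("revenue", DoubleType(), True),
    StructField("rating", DoubleType(), True),
    StructField("returned", BooleanType(), True),
])

df = spark.read.option("header", "true") \
    .schema(schema) \
    .csv("dataplexa_ecommerce.csv")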

# Check the inferred schema structure
df.printSchema()

# Show first 5 rows of data
df.show(5)

What just happened?

The printSchema() output shows Spark correctly identified integers, doubles, strings, and booleans. Notice the pipe-prefixed tree layout; that is how Spark displays the schema, and the tree nests deeper when columns contain structs or arrays. Try this: Use df.show(20, truncate=False) to see full column values.

Essential DataFrame Operations

DataFrame operations fall into two categories: transformations (lazy, return new DataFrames) and actions (eager, trigger computation).

Transformations

select(), filter(), groupBy(), join(), orderBy() — Build execution plan

Actions

show(), collect(), count(), write() — Execute computations

The scenario: Swiggy's analytics team needs to find high-value customers in Electronics who might churn. They want customers spending over ₹20,000 with ratings below 3.0.

# Filter for Electronics category only
# This is a transformation - builds query plan, doesn't execute yet  
electronics_df = df.filter(df.product_category == "Electronics")

# Chain multiple transformations together
# Select specific columns we need for analysis
risk_customers = electronics_df.select("order_id", "customer_age", "gender", 
                                       "city", "revenue", "rating") \
                               .filter("revenue > 20000") \
                               .filter("rating < 3.0")
# NOW we trigger execution with an action
# Show the risky high-value customers
risk_customers.show(10)

# Get count of risky customers for executive reporting
risky_count = risk_customers.count()
print(f"Found {risky_count} high-value customers at risk of churning")

What just happened?

The transformations built an optimized execution plan. Only when show() ran did Spark actually process the data. Found 47 customers spending ₹20K+ but rating below 3.0 — prime targets for retention campaigns. Try this: Use explain() to see Spark's execution plan.
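To follow that suggestion, call explain() on the DataFrame before any action runs. It only prints the plan, so no data gets processed.

# Print the physical plan Spark intends to run for this query
risk_customers.explain()

# Pass True to also see the parsed, analyzed, and optimized logical plans
risk_customers.explain(True)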

Aggregations and GroupBy Operations

This is where Spark really shines. Aggregating terabytes of data across hundreds of partitions, automatically parallelized. GroupBy operations distribute computation across the entire cluster.

# Import SQL functions for aggregations
from pyspark.sql import functions as F

# Calculate revenue metrics by product category
# Multiple aggregations in single groupBy operation
category_metrics = df.groupBy("product_category") \
                     .agg(F.sum("revenue").alias("total_revenue"),
                          F.avg("revenue").alias("avg_order_value"),
                          F.count("*").alias("order_count"),
                          F.avg("rating").alias("avg_rating")) \
                     .orderBy(F.desc("total_revenue"))
# Execute the aggregation and show results
category_metrics.show()

# Round the decimal values for better readability  
category_metrics.select("product_category",
                       F.round("total_revenue", 2).alias("total_revenue"),
                       F.round("avg_order_value", 2).alias("avg_order_value"), 
                       "order_count",
                       F.round("avg_rating", 2).alias("avg_rating")).show()

📊 Data Insight

Electronics dominates with ₹2.85 crores total revenue despite having fewer orders (1,652) than Books (3,500). High-ticket electronics drive ₹17,234 average order value vs Books at ₹1,234. Books maintain highest satisfaction at 4.68 rating.

Electronics and Clothing capture 67% of total revenue despite representing only 33% of order volume

Electronics clearly leads revenue generation at ₹2.85 crores, nearly 50% more than Clothing. This pattern suggests customers view electronics as high-value purchases worth spending more on.

The inverse relationship between order count and revenue per category reveals customer behavior insights. Books have the most orders but lowest revenue — frequent, low-value purchases. Electronics have fewer orders but massive revenue impact — infrequent, high-value purchases.

Spark SQL Integration

Here's the killer feature: you can run SQL queries directly on DataFrames. No need to learn new syntax if you're already comfortable with SQL.

# Register DataFrame as temporary SQL table
df.createOrReplaceTempView("ecommerce_transactions")

# Run complex SQL query using Spark SQL
sql_query = """
SELECT city,
       COUNT(*) as total_orders,
       AVG(revenue) as avg_revenue,
       SUM(CASE WHEN returned = true THEN 1 ELSE 0 END) as return_count,
       ROUND(SUM(CASE WHEN returned = true THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as return_rate
FROM ecommerce_transactions 
WHERE product_category = 'Electronics'
GROUP BY city
ORDER BY return_rate DESC
"""
# Execute the SQL query
city_returns = spark.sql(sql_query)

# Display results
city_returns.show()

# Save results for further analysis
city_returns.coalesce(1).write.mode("overwrite").csv("city_electronics_analysis")

What just happened?

The SQL query revealed Pune has the highest electronics return rate at 15.77% despite decent average revenue. coalesce(1) merged all partitions before writing — use sparingly as it can bottleneck performance. Try this: Use partitionBy("city") for better write performance.

Pune shows 34% higher return rates than Chennai, indicating potential logistics or quality issues
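If you follow that partitionBy suggestion, note that it pays off most on large outputs, so here is a sketch applied to the full transactions DataFrame rather than the small per-city aggregate. The output path is illustrative.

# Write one subfolder per city; later jobs that filter on a single city
# only scan that city's files (path name is just an example)
df.write \
  .mode("overwrite") \
  .partitionBy("city") \
  .parquet("ecommerce_by_city")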

Common Performance Mistake

Using collect() to bring an entire DataFrame back to the driver node kills performance and can crash the driver on large datasets. Instead use show(), take(n), or write the results out to external storage. Never collect DataFrames over 1GB.
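A short sketch of the safer alternatives (the output path below is just an example):

# Preview rows without pulling the whole DataFrame to the driver
df.show(20)

# take(n) ships only n Row objects back to the driver
sample_rows = df.take(5)
print(sample_rows)

# If another system needs the full result, write it out instead of collecting
df.write.mode("overwrite").parquet("full_export")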

Performance Optimization Tips

Spark performance depends heavily on partitioning strategy and data formats. Poor partitioning can make your cluster crawl even with hundreds of cores.

# Check current partitioning
print(f"Number of partitions: {df.rdd.getNumPartitions()}")

# Repartition based on frequently filtered column
# This optimizes joins and filters on product_category
optimized_df = df.repartition(8, "product_category")

# Cache frequently accessed DataFrames in memory
optimized_df.cache()
print("DataFrame cached in cluster memory")


💡 Pro Tip: Use Parquet format instead of CSV for 10x faster reads. Parquet stores columnar data with built-in compression and predicate pushdown. Your 50GB CSV becomes a 5GB Parquet with much faster query performance.
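A minimal sketch of that CSV-to-Parquet conversion, with illustrative file paths:

# One-time conversion: write the loaded CSV data back out as Parquet
df.write.mode("overwrite").parquet("dataplexa_ecommerce_parquet")

# From then on, read the Parquet copy: the schema travels with the files,
# so there is no inference pass and only the columns you ask for get scanned
df_fast = spark.read.parquet("dataplexa_ecommerce_parquet")
df_fast.select("product_category", "revenue").show(5)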

Quiz

1. Your Flipkart analytics team chains 10 DataFrame transformations but the job completes instantly. What's happening?


2. You're analyzing Zomato's 100GB order dataset and frequently filter by city. Your queries are slow despite having 64 CPU cores. What's the best optimization?


3. Which Spark DataFrame operations will actually process your Swiggy transaction data and use cluster resources?


Up Next

Distributed Systems

Master the architecture behind Spark's distributed computing model and learn how data flows across clusters for fault-tolerant processing.