
Redshift

Connect Python to Amazon's powerhouse data warehouse, analyze petabyte-scale datasets, and optimize queries that would crash your laptop.

# Redshift toolkit at a glance
psycopg2.connect(host, port, database)  # Connect to cluster (Python)
pd.read_sql(query, connection)          # Query to DataFrame (Python)
df.to_sql('table', con, method)         # Upload DataFrame (Python)
COPY table FROM 's3://bucket/...'       # Bulk load from S3 (SQL)
VACUUM table                            # Reclaim space (SQL)
ANALYZE table                           # Update statistics (SQL)
EXPLAIN query                           # Show query plan (SQL)
DISTKEY(column)                         # Distribute data (CREATE TABLE attribute)

What Makes Redshift Different

Redshift isn't just another database. It's Amazon's columnar data warehouse, which stores data differently from traditional row-based systems. Think of Excel, where you can instantly sum an entire column instead of walking through it row by row.

The magic happens with compression. Your 500GB CSV becomes 50GB in Redshift because similar data compresses beautifully. Phone numbers, dates, categories - they shrink dramatically when stored column-wise.

Row Storage (MySQL)

Reads entire rows even for one column analysis. Great for transactions.

Column Storage (Redshift)

Reads only needed columns. Perfect for analytics and aggregations.
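You can feel the row-versus-column difference even in plain NumPy. Here's a toy sketch - purely a memory-locality illustration, not how Redshift works internally - showing that summing one column out of ten is faster when that column sits contiguously in memory:

import time
import numpy as np

# 10 million "rows" with 10 "columns" of random floats
data = np.random.rand(10_000_000, 10)

row_major = np.ascontiguousarray(data)  # C order: each ROW is contiguous
col_major = np.asfortranarray(data)     # Fortran order: each COLUMN is contiguous

start = time.time()
row_major[:, 3].sum()                   # strided reads, skipping across every row
print(f"row layout:    {time.time() - start:.3f}s")

start = time.time()
col_major[:, 3].sum()                   # one sequential read of the column
print(f"column layout: {time.time() - start:.3f}s")

The column layout wins because the CPU streams through adjacent memory; Redshift applies the same idea at disk-block scale, plus per-column compression.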

Connecting from Python

The scenario: You're a data analyst at Flipkart, and the finance team needs urgent analysis of last quarter's revenue. The data lives in Redshift, and Excel crashed trying to open the 50GB file.

# pip install psycopg2-binary first; Redshift speaks the PostgreSQL protocol
import psycopg2
import pandas as pd
import numpy as np

# Connection details - typically from environment variables
HOST = 'flipkart-cluster.abc123.us-east-1.redshift.amazonaws.com'
PORT = 5439  # Standard Redshift port
DATABASE = 'analytics'

What just happened?

We imported psycopg2 because Redshift uses the PostgreSQL wire protocol under the hood. HOST points to your cluster endpoint, and PORT 5439 is Redshift's default. Try this: Check your AWS console for the exact endpoint URL.

Now establish the actual connection. This is where authentication happens - usually through IAM roles in production, but we'll show the basic approach:

# Create the connection object
conn = psycopg2.connect(
    host=HOST,
    port=PORT,
    database=DATABASE,
    user='analytics_user',  # Your Redshift username
    password='secure_password'  # Better: use environment variable
)

# Test connection with a simple query
test_query = "SELECT CURRENT_TIMESTAMP, VERSION()"
result = pd.read_sql(test_query, conn)

What just happened?

The connection worked! CURRENT_TIMESTAMP shows we're connected, and VERSION() confirms PostgreSQL under the hood (Redshift's foundation). Notice the timestamp is UTC - Redshift stores everything in UTC. Try this: Run SELECT COUNT(*) FROM information_schema.tables to see how many tables you can reach.
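One hygiene note before moving on: hardcoded passwords don't belong in scripts. A minimal sketch using environment variables - the REDSHIFT_* names are placeholders you'd set yourself, and production setups usually prefer IAM-based temporary credentials:

import os
import psycopg2

# Credentials come from the environment, not the source file
conn = psycopg2.connect(
    host=os.environ['REDSHIFT_HOST'],
    port=int(os.environ.get('REDSHIFT_PORT', 5439)),  # default Redshift port
    database=os.environ['REDSHIFT_DB'],
    user=os.environ['REDSHIFT_USER'],
    password=os.environ['REDSHIFT_PASSWORD']
)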

Querying Massive Datasets

Here's where Redshift shines. Aggregating millions of rows happens in seconds, not minutes. But you need to think differently about query structure.

# Query massive ecommerce data - 50M rows, but Redshift handles it
revenue_query = """
SELECT 
    product_category,
    COUNT(*) as order_count,
    SUM(revenue) as total_revenue,
    AVG(rating) as avg_rating
FROM ecommerce_orders 
WHERE date >= '2023-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC
"""
# Execute the query and load into DataFrame
import time
start_time = time.time()

df_revenue = pd.read_sql(revenue_query, conn)

end_time = time.time()
print(f"Query executed in {end_time - start_time:.2f} seconds")
print(f"Retrieved {len(df_revenue)} product categories")
df_revenue.head()

What just happened?

Redshift scanned 50M+ rows in 3.47 seconds thanks to columnar storage. It only read the 4 columns we needed, not all 11 columns in the table. The GROUP BY happened across distributed nodes in parallel. Try this: Add EXPLAIN before SELECT to see the query execution plan.

Electronics dominates with ₹2,847 crores, nearly 50% more than clothing - mobile phones driving the surge

The chart reveals Electronics generating 47% of total revenue despite representing only 37% of orders. This suggests higher average order values in electronics - think smartphones and laptops versus books and clothing.

Books show the highest rating (4.4) but lowest revenue, indicating customer satisfaction doesn't directly translate to revenue volume. The business insight? Double down on electronics inventory and consider premium positioning for high-rated categories.
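One caveat before the next scenario: aggregates come back tiny, but raw row-level pulls can blow past your laptop's memory even when Redshift handles them fine. pandas can stream a result set in chunks; a quick sketch (the order_id column here is assumed for illustration):

# Stream a large result in 100k-row chunks instead of one giant DataFrame
detail_query = "SELECT order_id, revenue FROM ecommerce_orders WHERE date >= '2023-01-01'"

total_revenue = 0.0
for chunk in pd.read_sql(detail_query, conn, chunksize=100_000):
    total_revenue += chunk['revenue'].sum()  # process each chunk, then let it go

print(f"Total revenue: {total_revenue:,.2f}")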

Time Series Analysis at Scale

The scenario: Swiggy needs to identify daily order patterns for the last 90 days to optimize delivery staffing. This query would take 20 minutes in MySQL but finishes in seconds on Redshift.

# Daily order trends with window functions - Redshift speciality
daily_trends_query = """
SELECT 
    date,
    COUNT(*) as daily_orders,
    SUM(revenue) as daily_revenue,
    AVG(COUNT(*)) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as seven_day_avg_orders
FROM ecommerce_orders 
WHERE date >= CURRENT_DATE - 90
GROUP BY date
ORDER BY date
"""
# Execute and process the time series data
df_daily = pd.read_sql(daily_trends_query, conn)

# Convert date column to datetime for proper plotting
df_daily['date'] = pd.to_datetime(df_daily['date'])

# Show recent trends
print(f"Latest 7 days of data:")
print(df_daily.tail(7))

# Calculate key metrics
peak_day = df_daily.loc[df_daily['daily_orders'].idxmax()]
print(f"\nPeak orders: {peak_day['daily_orders']} on {peak_day['date'].date()}")

What just happened?

The window function calculated rolling 7-day averages without needing self-joins. Peak orders hit 67,892 on Sep 15 - likely a festival sale. The recent 7-day average hovers around 50,235 orders. Try this: Add LAG(COUNT(*), 1) OVER (ORDER BY date) to compare day-over-day changes (the daily_orders alias can't be referenced in the same SELECT).
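If you want to convince yourself the SQL window did what you expect, recompute the same rolling average client-side with pandas (check_avg is just a scratch column name):

# ROWS BETWEEN 6 PRECEDING AND CURRENT ROW averages whatever rows exist so far,
# which is what min_periods=1 mimics on the pandas side
df_daily['check_avg'] = df_daily['daily_orders'].rolling(window=7, min_periods=1).mean()

print(df_daily[['date', 'seven_day_avg_orders', 'check_avg']].tail())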

Festival spike in Week 5 created staffing chaos - smooth 7-day average helps predict resource needs

📊 Data Insight

The festival spike (Week 5) shows 31% increase over baseline, but the 7-day rolling average smooths out daily volatility. Swiggy can now staff 15% above rolling average during normal periods and scale to 35% during predicted spikes.

Optimization and Performance

Redshift performance isn't automatic. Distribution keys and sort keys make the difference between 3-second queries and 3-minute disasters. Here's a mistake that trips up even experienced teams:

# Check table distribution - this reveals performance bottlenecks
# (svv_table_info exposes diststyle and sortkey1; the distkey, if any,
# appears inside diststyle, e.g. KEY(customer_id))
distribution_query = """
SELECT 
    "schema" as table_schema,
    "table" as table_name,
    diststyle,
    sortkey1
FROM svv_table_info 
WHERE "schema" = 'public' 
  AND "table" = 'ecommerce_orders'
"""

df_distribution = pd.read_sql(distribution_query, conn)
print("Current table structure:")
print(df_distribution)

Common Mistake: EVEN Distribution

EVEN distribution spreads data randomly, forcing cross-node joins for customer analysis. Fix: Use DISTKEY(customer_id) to keep related orders on the same nodes, reducing network shuffling by 80% - see the DDL sketch below.
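Distribution and sort keys are declared when the table is created. A hypothetical DDL sketch - the column names and types are illustrative, not the real schema:

# Recreate the table with explicit distribution and sort keys
create_sql = """
CREATE TABLE ecommerce_orders_tuned (
    order_id         BIGINT,
    customer_id      BIGINT,
    date             DATE,
    product_category VARCHAR(50),
    revenue          DECIMAL(12,2),
    rating           DECIMAL(2,1)
)
DISTKEY (customer_id)                      -- co-locate each customer's orders on one node
COMPOUND SORTKEY (date, product_category)  -- prune blocks on date and category filters
"""

cursor = conn.cursor()
cursor.execute(create_sql)
conn.commit()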

# Analyze query performance with EXPLAIN
explain_query = """
EXPLAIN
SELECT customer_age, COUNT(*) as order_count
FROM ecommerce_orders 
WHERE date >= '2023-01-01'
  AND product_category = 'Electronics'
GROUP BY customer_age
ORDER BY order_count DESC
"""

# Get query execution plan  
df_explain = pd.read_sql(explain_query, conn)
print("Query execution plan:")
for idx, row in df_explain.iterrows():
    print(f"{idx+1}. {row['QUERY PLAN']}")

What just happened?

The XN Seq Scan means a full table scan - expensive! Cost=42,187 is high because Redshift has no secondary indexes: it skips disk blocks using zone maps driven by sort keys, and product_category isn't in the sort key. The sort key on date helps, but we need a compound sort key. Try this: Create a sort key on (date, product_category) for 10x faster filtering.

85% of query cost comes from sequential scanning - a proper sort key could reduce this to under 5,000 units

The doughnut chart exposes the real culprit: sequential scanning consumes 85% of query resources. This happens when Redshift can't use sort keys or distribution keys effectively.

Network transfer (6%) indicates cross-node data movement. With proper distribution keys, related data stays on the same nodes, eliminating most network overhead. The lesson? Spend time on table design upfront to avoid performance headaches later.

Pro Tip: Run ANALYZE table_name after large data loads to update statistics. Stale statistics cause the query planner to make bad decisions, turning 3-second queries into 3-minute nightmares.
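Running maintenance from Python has one wrinkle: Redshift won't run VACUUM inside a transaction block, so flip psycopg2 into autocommit first. A sketch:

# VACUUM refuses to run inside a transaction, so enable autocommit temporarily
conn.autocommit = True
cursor = conn.cursor()
cursor.execute("VACUUM ecommerce_orders")   # reclaim space and re-sort rows
cursor.execute("ANALYZE ecommerce_orders")  # refresh the planner's statistics
conn.autocommit = False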

Loading Data at Scale

The scenario: HDFC Bank generates 100GB of transaction logs daily. Loading this through pandas would take 6 hours. COPY command from S3 does it in 12 minutes.

# Bulk load from S3 - the fastest way to get data into Redshift
copy_command = """
COPY ecommerce_orders 
FROM 's3://hdfc-data-lake/transactions/2023/10/'
IAM_ROLE 'arn:aws:iam::123456:role/RedshiftCopyRole'
CSV
GZIP
DELIMITER ','
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
"""

# Execute the COPY command
cursor = conn.cursor()
cursor.execute(copy_command)
conn.commit()

What just happened?

COPY loaded 4.57M rows in 11 minutes by processing 847 files in parallel across all Redshift nodes. The GZIP compression saved 87% bandwidth, and IAM_ROLE handled authentication securely. Try this: Split large files into 100-200MB chunks for optimal parallel processing.
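When a COPY rejects rows, the first place to look is the STL_LOAD_ERRORS system table, which records the offending file, line number, column, and reason; a quick sketch:

# Inspect the most recent load failures Redshift recorded
errors_query = """
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10
"""
print(pd.read_sql(errors_query, conn))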

Quiz

1. Your Redshift query joining customer orders with customer profiles takes 45 seconds. The explain plan shows massive network transfer costs. What's the most effective optimization?


2. You need to load 50GB of daily transaction data into Redshift. Using pd.DataFrame.to_sql() takes 4 hours. What's the fastest alternative approach?


3. This query runs in 2 seconds on Redshift but would take 15 minutes in MySQL: SELECT date, COUNT(*), AVG(COUNT(*)) OVER (ORDER BY date ROWS 6 PRECEDING) FROM orders GROUP BY date. Why is Redshift so much faster?


Up Next

NumPy

Master the foundational array library that powers pandas, scikit-learn, and every data science workflow you'll build.