
Redshift

Connect Python to Amazon's powerhouse data warehouse, analyze petabyte-scale datasets, and optimize queries that would crash your laptop.

# Redshift toolkit at a glance
psycopg2.connect(host, port, database)  # Connect to cluster (Python)
pd.read_sql(query, connection)          # Query to DataFrame (Python)
df.to_sql('table', con, method)         # Upload DataFrame (Python)
COPY table FROM 's3://bucket/...'       # Bulk load from S3 (SQL)
VACUUM table                            # Reclaim space (SQL)
ANALYZE table                           # Update statistics (SQL)
EXPLAIN query                           # Show query plan (SQL)
DISTKEY(column)                         # Distribute data (CREATE TABLE attribute)

What Makes Redshift Different

Redshift isn't just another database. It's Amazon's columnar data warehouse, which stores data differently from traditional row-based systems. Think of Excel, where you can instantly sum an entire column instead of walking through it row by row.

The magic happens with compression. Your 500GB CSV becomes 50GB in Redshift because similar data compresses beautifully. Phone numbers, dates, categories - they shrink dramatically when stored column-wise.

Row Storage (MySQL)

Reads entire rows even for one column analysis. Great for transactions.

Column Storage (Redshift)

Reads only needed columns. Perfect for analytics and aggregations.
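You can feel the row-versus-column difference even in plain NumPy. Here's a toy sketch - purely a memory-locality illustration, not how Redshift works internally - showing that summing one column out of ten is faster when that column sits contiguously in memory:

import time
import numpy as np

# 10 million "rows" with 10 "columns" of random floats
data = np.random.rand(10_000_000, 10)

row_major = np.ascontiguousarray(data)  # C order: each ROW is contiguous
col_major = np.asfortranarray(data)     # Fortran order: each COLUMN is contiguous

start = time.time()
row_major[:, 3].sum()                   # strided reads, skipping across every row
print(f"row layout:    {time.time() - start:.3f}s")

start = time.time()
col_major[:, 3].sum()                   # one sequential read of the column
print(f"column layout: {time.time() - start:.3f}s")

The column layout wins because the CPU streams through adjacent memory; Redshift applies the same idea at disk-block scale, plus per-column compression.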

Connecting from Python

The scenario: You're a data analyst at Flipkart, and the finance team needs urgent analysis of last quarter's revenue. The data lives in Redshift, and Excel crashed trying to open the 50GB file.

# pip install psycopg2-binary first; Redshift speaks the PostgreSQL protocol
import psycopg2
import pandas as pd
import numpy as np

# Connection details - typically from environment variables
HOST = 'flipkart-cluster.abc123.us-east-1.redshift.amazonaws.com'
PORT = 5439  # Standard Redshift port
DATABASE = 'analytics'

What just happened?

We imported psycopg2 because Redshift uses the PostgreSQL wire protocol under the hood. HOST points to your cluster endpoint, and PORT 5439 is Redshift's default. Try this: Check your AWS console for the exact endpoint URL.

Now establish the actual connection. This is where authentication happens - usually through IAM roles in production, but we'll show the basic approach:

# Create the connection object
conn = psycopg2.connect(
    host=HOST,
    port=PORT,
    database=DATABASE,
    user='analytics_user',  # Your Redshift username
    password='secure_password'  # Better: use environment variable
)

# Test connection with a simple query
test_query = "SELECT CURRENT_TIMESTAMP, VERSION()"
result = pd.read_sql(test_query, conn)

What just happened?

The connection worked! CURRENT_TIMESTAMP shows we're connected, and VERSION() confirms PostgreSQL under the hood (Redshift's foundation). Notice the timestamp is UTC - Redshift stores everything in UTC. Try this: Run SELECT COUNT(*) FROM information_schema.tables to see how many tables you can reach.
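One hygiene note before moving on: hardcoded passwords don't belong in scripts. A minimal sketch using environment variables - the REDSHIFT_* names are placeholders you'd set yourself, and production setups usually prefer IAM-based temporary credentials:

import os
import psycopg2

# Credentials come from the environment, not the source file
conn = psycopg2.connect(
    host=os.environ['REDSHIFT_HOST'],
    port=int(os.environ.get('REDSHIFT_PORT', 5439)),  # default Redshift port
    database=os.environ['REDSHIFT_DB'],
    user=os.environ['REDSHIFT_USER'],
    password=os.environ['REDSHIFT_PASSWORD']
)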

Querying Massive Datasets

Here's where Redshift shines. Aggregating millions of rows happens in seconds, not minutes. But you need to think differently about query structure.

# Query massive ecommerce data - 50M rows, but Redshift handles it
revenue_query = """
SELECT 
    product_category,
    COUNT(*) as order_count,
    SUM(revenue) as total_revenue,
    AVG(rating) as avg_rating
FROM ecommerce_orders 
WHERE date >= '2023-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC
"""
# Execute the query and load into DataFrame
import time
start_time = time.time()

df_revenue = pd.read_sql(revenue_query, conn)

end_time = time.time()
print(f"Query executed in {end_time - start_time:.2f} seconds")
print(f"Retrieved {len(df_revenue)} product categories")
df_revenue.head()

What just happened?

Redshift scanned 50M+ rows in 3.47 seconds thanks to columnar storage. It only read the 4 columns we needed, not all 11 columns in the table. The GROUP BY happened across distributed nodes in parallel. Try this: Add EXPLAIN before SELECT to see the query execution plan.

Electronics dominates with ₹2,847 crores, nearly 50% more than clothing - mobile phones driving the surge

The chart reveals Electronics generating 47% of total revenue despite representing only 37% of orders. This suggests higher average order values in electronics - think smartphones and laptops versus books and clothing.

Books show the highest rating (4.4) but lowest revenue, indicating customer satisfaction doesn't directly translate to revenue volume. The business insight? Double down on electronics inventory and consider premium positioning for high-rated categories.
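One caveat before the next scenario: aggregates come back tiny, but raw row-level pulls can blow past your laptop's memory even when Redshift handles them fine. pandas can stream a result set in chunks; a quick sketch (the order_id column here is assumed for illustration):

# Stream a large result in 100k-row chunks instead of one giant DataFrame
detail_query = "SELECT order_id, revenue FROM ecommerce_orders WHERE date >= '2023-01-01'"

total_revenue = 0.0
for chunk in pd.read_sql(detail_query, conn, chunksize=100_000):
    total_revenue += chunk['revenue'].sum()  # process each chunk, then let it go

print(f"Total revenue: {total_revenue:,.2f}")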

Time Series Analysis at Scale

The scenario: Swiggy needs to identify daily order patterns for the last 90 days to optimize delivery staffing. This query would take 20 minutes in MySQL but finishes in seconds on Redshift.

# Daily order trends with window functions - Redshift speciality
daily_trends_query = """
SELECT 
    date,
    COUNT(*) as daily_orders,
    SUM(revenue) as daily_revenue,
    AVG(COUNT(*)) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as seven_day_avg_orders
FROM ecommerce_orders 
WHERE date >= CURRENT_DATE - 90
GROUP BY date
ORDER BY date
"""
# Execute and process the time series data
df_daily = pd.read_sql(daily_trends_query, conn)

# Convert date column to datetime for proper plotting
df_daily['date'] = pd.to_datetime(df_daily['date'])

# Show recent trends
print(f"Latest 7 days of data:")
print(df_daily.tail(7))

# Calculate key metrics
peak_day = df_daily.loc[df_daily['daily_orders'].idxmax()]
print(f"\nPeak orders: {peak_day['daily_orders']} on {peak_day['date'].date()}")

What just happened?

The window function calculated rolling 7-day averages without needing self-joins. Peak orders hit 67,892 on Sep 15 - likely a festival sale. The recent 7-day average hovers around 50,235 orders. Try this: Add LAG(COUNT(*), 1) OVER (ORDER BY date) to compare day-over-day changes (the daily_orders alias can't be referenced in the same SELECT).
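If you want to convince yourself the SQL window did what you expect, recompute the same rolling average client-side with pandas (check_avg is just a scratch column name):

# ROWS BETWEEN 6 PRECEDING AND CURRENT ROW averages whatever rows exist so far,
# which is what min_periods=1 mimics on the pandas side
df_daily['check_avg'] = df_daily['daily_orders'].rolling(window=7, min_periods=1).mean()

print(df_daily[['date', 'seven_day_avg_orders', 'check_avg']].tail())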

Festival spike in Week 5 created staffing chaos - smooth 7-day average helps predict resource needs

📊 Data Insight

The festival spike (Week 5) shows 31% increase over baseline, but the 7-day rolling average smooths out daily volatility. Swiggy can now staff 15% above rolling average during normal periods and scale to 35% during predicted spikes.

Optimization and Performance

Redshift performance isn't automatic. Distribution keys and sort keys make the difference between 3-second queries and 3-minute disasters. Here's a mistake that trips up even experienced teams:

# Check table distribution - this reveals performance bottlenecks
# (svv_table_info exposes diststyle and sortkey1; the distkey, if any,
# appears inside diststyle, e.g. KEY(customer_id))
distribution_query = """
SELECT 
    "schema" as table_schema,
    "table" as table_name,
    diststyle,
    sortkey1
FROM svv_table_info 
WHERE "schema" = 'public' 
  AND "table" = 'ecommerce_orders'
"""

df_distribution = pd.read_sql(distribution_query, conn)
print("Current table structure:")
print(df_distribution)

Common Mistake: EVEN Distribution

EVEN distribution spreads data randomly, forcing cross-node joins for customer analysis. Fix: Use DISTKEY(customer_id) to keep related orders on the same nodes, reducing network shuffling by 80% - see the DDL sketch below.
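Distribution and sort keys are declared when the table is created. A hypothetical DDL sketch - the column names and types are illustrative, not the real schema:

# Recreate the table with explicit distribution and sort keys
create_sql = """
CREATE TABLE ecommerce_orders_tuned (
    order_id         BIGINT,
    customer_id      BIGINT,
    date             DATE,
    product_category VARCHAR(50),
    revenue          DECIMAL(12,2),
    rating           DECIMAL(2,1)
)
DISTKEY (customer_id)                      -- co-locate each customer's orders on one node
COMPOUND SORTKEY (date, product_category)  -- prune blocks on date and category filters
"""

cursor = conn.cursor()
cursor.execute(create_sql)
conn.commit()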

# Analyze query performance with EXPLAIN
explain_query = """
EXPLAIN
SELECT customer_age, COUNT(*) as order_count
FROM ecommerce_orders 
WHERE date >= '2023-01-01'
  AND product_category = 'Electronics'
GROUP BY customer_age
ORDER BY order_count DESC
"""

# Get query execution plan  
df_explain = pd.read_sql(explain_query, conn)
print("Query execution plan:")
for idx, row in df_explain.iterrows():
    print(f"{idx+1}. {row['QUERY PLAN']}")

What just happened?

The XN Seq Scan means a full table scan - expensive! Cost=42,187 is high because Redshift has no secondary indexes: it skips disk blocks using zone maps driven by sort keys, and product_category isn't in the sort key. The sort key on date helps, but we need a compound sort key. Try this: Create a sort key on (date, product_category) for 10x faster filtering.

85% of query cost comes from sequential scanning - a proper sort key could reduce this to under 5,000 units

The doughnut chart exposes the real culprit: sequential scanning consumes 85% of query resources. This happens when Redshift can't use sort keys or distribution keys effectively.

Network transfer (6%) indicates cross-node data movement. With proper distribution keys, related data stays on the same nodes, eliminating most network overhead. The lesson? Spend time on table design upfront to avoid performance headaches later.

Pro Tip: Run ANALYZE table_name after large data loads to update statistics. Stale statistics cause the query planner to make bad decisions, turning 3-second queries into 3-minute nightmares.
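Running maintenance from Python has one wrinkle: Redshift won't run VACUUM inside a transaction block, so flip psycopg2 into autocommit first. A sketch:

# VACUUM refuses to run inside a transaction, so enable autocommit temporarily
conn.autocommit = True
cursor = conn.cursor()
cursor.execute("VACUUM ecommerce_orders")   # reclaim space and re-sort rows
cursor.execute("ANALYZE ecommerce_orders")  # refresh the planner's statistics
conn.autocommit = False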

Loading Data at Scale

The scenario: HDFC Bank generates 100GB of transaction logs daily. Loading this through pandas would take 6 hours. COPY command from S3 does it in 12 minutes.

# Bulk load from S3 - the fastest way to get data into Redshift
copy_command = """
COPY ecommerce_orders 
FROM 's3://hdfc-data-lake/transactions/2023/10/'
IAM_ROLE 'arn:aws:iam::123456:role/RedshiftCopyRole'
CSV
GZIP
DELIMITER ','
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
"""

# Execute the COPY command
cursor = conn.cursor()
cursor.execute(copy_command)
conn.commit()

What just happened?

COPY loaded 4.57M rows in 11 minutes by processing 847 files in parallel across all Redshift nodes. The GZIP compression saved 87% bandwidth, and IAM_ROLE handled authentication securely. Try this: Split large files into 100-200MB chunks for optimal parallel processing.
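When a COPY rejects rows, the first place to look is the STL_LOAD_ERRORS system table, which records the offending file, line number, column, and reason; a quick sketch:

# Inspect the most recent load failures Redshift recorded
errors_query = """
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10
"""
print(pd.read_sql(errors_query, conn))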

Quiz

1. Your Redshift query joining customer orders with customer profiles takes 45 seconds. The explain plan shows massive network transfer costs. What's the most effective optimization?


2. You need to load 50GB of daily transaction data into Redshift. Using pd.DataFrame.to_sql() takes 4 hours. What's the fastest alternative approach?


3. This query runs in 2 seconds on Redshift but would take 15 minutes in MySQL: SELECT date, COUNT(*), AVG(COUNT(*)) OVER (ORDER BY date ROWS 6 PRECEDING) FROM orders GROUP BY date. Why is Redshift so much faster?


Up Next

NumPy

Master the foundational array library that powers pandas, scikit-learn, and every data science workflow you'll build.