Data Science
Redshift
Connect Python to Amazon's powerhouse data warehouse, analyze petabyte-scale datasets, and optimize queries that would crash your laptop.
What Makes Redshift Different
Redshift isn't just another database. It's Amazon's columnar data warehouse, which stores data differently from traditional row-based systems. Think of Excel, where you can instantly sum an entire column instead of walking through it row by row.
The magic happens with compression. Your 500GB CSV becomes 50GB in Redshift because similar data compresses beautifully. Phone numbers, dates, categories - they shrink dramatically when stored column-wise.
Row Storage (MySQL)
Reads entire rows even for one column analysis. Great for transactions.
Column Storage (Redshift)
Reads only needed columns. Perfect for analytics and aggregations.
Connecting from Python
The scenario: You're a data analyst at Flipkart, and the finance team needs urgent analysis of last quarter's revenue. The data lives in Redshift, and Excel crashed trying to open the 50GB file.
# psycopg2 is the PostgreSQL adapter (install with: pip install psycopg2-binary) - Redshift speaks the PostgreSQL protocol
import psycopg2
import pandas as pd
import numpy as np
# Connection details - typically from environment variables
HOST = 'flipkart-cluster.abc123.us-east-1.redshift.amazonaws.com'
PORT = 5439 # Standard Redshift port
DATABASE = 'analytics'
Successfully imported psycopg2 2.9.3
pandas version: 1.5.2
numpy version: 1.24.1
Connection variables set
What just happened?
We imported psycopg2 because Redshift uses PostgreSQL protocol under the hood. The HOST points to your cluster endpoint, and PORT 5439 is Redshift's default. Try this: Check your AWS console for the exact endpoint URL.
Now establish the actual connection. This is where authentication happens - usually through IAM roles in production, but we'll show the basic approach:
# Create the connection object
conn = psycopg2.connect(
host=HOST,
port=PORT,
database=DATABASE,
user='analytics_user', # Your Redshift username
password='secure_password' # Better: use environment variable
)
# Test connection with a simple query
test_query = "SELECT CURRENT_TIMESTAMP, VERSION()"
result = pd.read_sql(test_query, conn)
print(result)
  current_timestamp                   version
0 2023-10-15 14:23:17.453126+00:00    PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2
Connection successful to Flipkart analytics cluster
What just happened?
The connection worked! CURRENT_TIMESTAMP shows we're connected and VERSION() confirms it's PostgreSQL-compatible (Redshift's foundation). Notice the timestamp is in UTC - Redshift stores everything in UTC. Try this: Run SELECT COUNT(*) FROM information_schema.tables to count the tables available to you.
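The hard-coded password above is fine for a demo, but in practice you'd keep credentials out of the script. A minimal sketch, assuming you export REDSHIFT_USER and REDSHIFT_PASSWORD in your shell (the variable names are just examples, not anything Redshift requires):
# Read credentials from the environment instead of hard-coding them
import os
conn = psycopg2.connect(
    host=HOST,
    port=PORT,
    database=DATABASE,
    user=os.environ['REDSHIFT_USER'],         # hypothetical env variable name
    password=os.environ['REDSHIFT_PASSWORD']  # hypothetical env variable name
)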
Querying Massive Datasets
Here's where Redshift shines. Aggregating millions of rows happens in seconds, not minutes. But you need to think differently about query structure.
# Query massive ecommerce data - 50M rows, but Redshift handles it
revenue_query = """
SELECT
product_category,
COUNT(*) as order_count,
SUM(revenue) as total_revenue,
AVG(rating) as avg_rating
FROM ecommerce_orders
WHERE date >= '2023-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC
"""Query prepared - ready to execute against 50M+ order records Estimated scan: 2.1GB compressed data Query will use columnar compression for fast aggregation
# Execute the query and load into DataFrame
import time
start_time = time.time()
df_revenue = pd.read_sql(revenue_query, conn)
end_time = time.time()
print(f"Query executed in {end_time - start_time:.2f} seconds")
print(f"Retrieved {len(df_revenue)} product categories")
df_revenue.head()
Query executed in 3.47 seconds
Retrieved 5 product categories
product_category order_count total_revenue avg_rating
0 Electronics 8234567 284752341.23 4.2
1 Clothing 6891245 192847562.89 4.1
2 Home 3456789 115234789.45 4.3
3 Food 2134567 87452341.67 4.0
4 Books 1567890 43298765.12 4.4
What just happened?
Redshift scanned 50M+ rows in 3.47 seconds thanks to columnar storage. It only read the 4 columns we needed, not all 11 columns in the table. The GROUP BY happened across distributed nodes in parallel. Try this: Add EXPLAIN before SELECT to see the query execution plan.
Electronics dominates revenue, nearly 50% more than Clothing - mobile phones driving the surge
The chart shows Electronics generating roughly 39% of total revenue from only 37% of orders. This points to higher average order values in electronics - think smartphones and laptops versus books and clothing.
Books show the highest rating (4.4) but lowest revenue, indicating customer satisfaction doesn't directly translate to revenue volume. The business insight? Double down on electronics inventory and consider premium positioning for high-rated categories.
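If you want to recreate a chart like the one above, here's a minimal matplotlib sketch using the df_revenue DataFrame from the query. It assumes the revenue column is in rupees, so dividing by 1e7 converts to crores:
# Bar chart of revenue by category - a quick local visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4))
plt.bar(df_revenue['product_category'], df_revenue['total_revenue'].astype(float) / 1e7)
plt.ylabel('Revenue (crores)')
plt.title('Quarterly revenue by product category')
plt.tight_layout()
plt.show()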
Time Series Analysis at Scale
The scenario: Swiggy needs to identify daily order patterns for the last 90 days to optimize delivery staffing. This query would take 20 minutes in MySQL but runs instantly in Redshift.
# Daily order trends with window functions - Redshift speciality
daily_trends_query = """
SELECT
date,
COUNT(*) as daily_orders,
SUM(revenue) as daily_revenue,
AVG(COUNT(*)) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as seven_day_avg_orders
FROM ecommerce_orders
WHERE date >= CURRENT_DATE - 90
GROUP BY date
ORDER BY date
"""Window function query prepared Will calculate 7-day rolling average for 90 days of data Estimated processing: 4.5M daily records across 90 days
# Execute and process the time series data
df_daily = pd.read_sql(daily_trends_query, conn)
# Convert date column to datetime for proper plotting
df_daily['date'] = pd.to_datetime(df_daily['date'])
# Show recent trends
print(f"Latest 7 days of data:")
print(df_daily.tail(7))
# Calculate key metrics
peak_day = df_daily.loc[df_daily['daily_orders'].idxmax()]
print(f"\nPeak orders: {peak_day['daily_orders']} on {peak_day['date'].date()}")Latest 7 days of data:
date daily_orders daily_revenue seven_day_avg_orders
83 2023-10-09 52341 4532177.45 48234.7
84 2023-10-10 47892 4123456.78 47892.3
85 2023-10-11 49234 4287659.23 49102.1
86 2023-10-12 51567 4456789.12 50234.8
87 2023-10-13 54123 4789123.45 51234.2
88 2023-10-14 49876 4345678.90 50567.9
89 2023-10-15 48234 4198765.43 50234.5
Peak orders: 67,892 on 2023-09-15
What just happened?
The window function calculated rolling 7-day averages without needing self-joins. Peak orders hit 67,892 on Sep 15 - likely a festival sale. The recent 7-day average hovers around 50,235 orders. Try this: Add a LAG() window function to compare day-over-day changes, as sketched below.
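Here's a sketch of that day-over-day comparison. Note that SQL won't let you reference the daily_orders alias inside a window function in the same SELECT list, so the COUNT(*) is repeated:
# Day-over-day change in order volume using LAG
dod_query = """
SELECT
date,
COUNT(*) as daily_orders,
COUNT(*) - LAG(COUNT(*), 1) OVER (ORDER BY date) as change_vs_yesterday
FROM ecommerce_orders
WHERE date >= CURRENT_DATE - 90
GROUP BY date
ORDER BY date
"""
df_dod = pd.read_sql(dod_query, conn)
print(df_dod.tail(7))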
Festival spike in Week 5 created staffing chaos - smooth 7-day average helps predict resource needs
📊 Data Insight
The festival spike (Week 5) shows 31% increase over baseline, but the 7-day rolling average smooths out daily volatility. Swiggy can now staff 15% above rolling average during normal periods and scale to 35% during predicted spikes.
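To turn that rule into numbers, here's a quick sketch using the df_daily DataFrame, treating order volume as a proxy for delivery capacity. The 15% and 35% buffers are the figures quoted above, not official ones:
# Staffing targets derived from the 7-day rolling average
avg = df_daily['seven_day_avg_orders'].astype(float)   # cast in case the driver returns Decimals
df_daily['baseline_capacity'] = avg * 1.15              # normal days: +15%
df_daily['spike_capacity'] = avg * 1.35                 # predicted spikes: +35%
print(df_daily[['date', 'seven_day_avg_orders', 'baseline_capacity', 'spike_capacity']].tail(7))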
Optimization and Performance
Redshift performance isn't automatic. Distribution keys and sort keys make the difference between 3-second queries and 3-minute disasters. Here's what many teams get wrong:
# Check table distribution - this reveals performance bottlenecks
distribution_query = """
SELECT
"schema" as table_schema,
"table" as table_name,
diststyle,
sortkey1
FROM svv_table_info
WHERE "schema" = 'public'
AND "table" = 'ecommerce_orders'
"""
df_distribution = pd.read_sql(distribution_query, conn)
print("Current table structure:")
print(df_distribution)
Current table structure:
  table_schema        table_name diststyle sortkey1
0       public  ecommerce_orders      EVEN     date
Table uses EVEN distribution (not optimized)
No distribution key defined - data spread randomly across nodes
Sort key on 'date' helps with time-range queries
Common Mistake: EVEN Distribution
EVEN distribution spreads data randomly, forcing cross-node joins for customer analysis. Fix: Use DISTKEY(customer_id) to keep related orders on the same nodes, reducing network shuffling by 80%.
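Here's a sketch of what a keyed table definition could look like. The column names and types are guesses based on the queries in this lesson, so adapt them to your actual schema:
# Hypothetical table definition with explicit distribution and sort keys
create_keyed = """
CREATE TABLE ecommerce_orders_keyed (
order_id BIGINT,
customer_id BIGINT,
product_category VARCHAR(50),
revenue DECIMAL(12,2),
rating DECIMAL(3,1),
date DATE
)
DISTKEY (customer_id)
COMPOUND SORTKEY (date, product_category)
"""
cursor = conn.cursor()
cursor.execute(create_keyed)
conn.commit()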
# Analyze query performance with EXPLAIN
explain_query = """
EXPLAIN
SELECT customer_age, COUNT(*) as order_count
FROM ecommerce_orders
WHERE date >= '2023-01-01'
AND product_category = 'Electronics'
GROUP BY customer_age
ORDER BY order_count DESC
"""
# Get query execution plan
df_explain = pd.read_sql(explain_query, conn)
print("Query execution plan:")
for idx, row in df_explain.iterrows():
    print(f"{idx+1}. {row['QUERY PLAN']}")
Query execution plan:
1. XN HashAggregate (cost=45234.56..45242.89 rows=833 width=12)
2.   -> XN Seq Scan on ecommerce_orders (cost=0.00..42187.50 rows=609412 width=8)
3.      Filter: ((date >= '2023-01-01'::date) AND (product_category = 'Electronics'::text))
Cost breakdown:
- Sequential scan: 42,187 cost units (expensive!)
- Hash aggregate: 3,047 cost units
- Total estimated rows: 609,412
- Network distribution: Cross-node shuffling detected
What just happened?
The XN Seq Scan means a full table scan - expensive! Cost=42,187 is high because product_category isn't part of the sort key, so Redshift can't skip blocks when filtering on it (Redshift has no traditional indexes). The sort key on date helps, but we need a compound sort key. Try this: Create a compound sort key on (date, product_category) for much faster filtering, as sketched below.
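If recreating the table isn't practical, Redshift can change the sort key in place. A sketch of that suggestion (expect the background re-sort to take a while on a large table):
# Change the sort key in place - sketch of the 'Try this' suggestion
conn.autocommit = True   # some ALTER TABLE operations can't run inside a transaction block
cursor = conn.cursor()
cursor.execute("ALTER TABLE ecommerce_orders ALTER COMPOUND SORTKEY (date, product_category)")
conn.autocommit = False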
85% of query cost comes from sequential scanning - proper sort and distribution keys could reduce this to under 5,000 units
The doughnut chart exposes the real culprit: sequential scanning consumes 85% of query resources. This happens when Redshift can't use sort keys or distribution keys effectively.
Network transfer (6%) indicates cross-node data movement. With proper distribution keys, related data stays on the same nodes, eliminating most network overhead. The lesson? Spend time on table design upfront to avoid performance headaches later.
Pro Tip: Run ANALYZE on the table after large data loads to update statistics. Stale statistics cause the query planner to make bad decisions, turning 3-second queries into 3-minute nightmares.
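A quick sketch of that housekeeping from the same connection. VACUUM is included because it re-sorts rows and reclaims space after heavy loads, but it can't run inside a transaction block, so autocommit is switched on first:
# Refresh planner statistics after a big load
cursor = conn.cursor()
cursor.execute("ANALYZE ecommerce_orders")
conn.commit()
# VACUUM can't run inside a transaction block, so enable autocommit first
conn.autocommit = True
cursor.execute("VACUUM ecommerce_orders")
conn.autocommit = False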
Loading Data at Scale
The scenario: HDFC Bank generates 100GB of transaction logs daily. Loading this through pandas would take 6 hours. COPY command from S3 does it in 12 minutes.
# Bulk load from S3 - the fastest way to get data into Redshift
copy_command = """
COPY ecommerce_orders
FROM 's3://hdfc-data-lake/transactions/2023/10/'
IAM_ROLE 'arn:aws:iam::123456:role/RedshiftCopyRole'
CSV
GZIP
DELIMITER ','
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
"""
# Execute the COPY command
cursor = conn.cursor()
cursor.execute(copy_command)
conn.commit()
COPY command executing...
INFO: Load started at 2023-10-15 14:30:23
INFO: Loaded 4,567,892 rows from 847 files
INFO: Total compressed size: 12.4 GB
INFO: Total uncompressed size: 98.7 GB
INFO: Load completed at 2023-10-15 14:42:17
COPY COMPLETED
Rows loaded: 4,567,892
Time taken: 11 minutes 54 seconds
What just happened?
COPY loaded 4.57M rows in 11 minutes by processing 847 files in parallel across all Redshift nodes. The GZIP compression saved 87% bandwidth, and IAM_ROLE handled authentication securely. Try this: Split large files into 100-200MB chunks for optimal parallel processing.
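If a COPY ever loads fewer rows than expected, Redshift records the details in the stl_load_errors system table. A quick sanity check after a load:
# Inspect recent load errors - an empty result means the load was clean
errors_query = """
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10
"""
df_errors = pd.read_sql(errors_query, conn)
print(df_errors)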
Quiz
1. Your Redshift query joining customer orders with customer profiles takes 45 seconds. The explain plan shows massive network transfer costs. What's the most effective optimization?
2. You need to load 50GB of daily transaction data into Redshift. Using pd.DataFrame.to_sql() takes 4 hours. What's the fastest alternative approach?
3. This query runs in 2 seconds on Redshift but would take 15 minutes in MySQL: SELECT date, COUNT(*), AVG(COUNT(*)) OVER (ORDER BY date ROWS 6 PRECEDING) FROM orders GROUP BY date. Why is Redshift so much faster?
Up Next
NumPy
Master the foundational array library that powers pandas, scikit-learn, and every data science workflow you'll build.