
S3 Storage

Master Amazon S3 for data science workflows, from basic uploads to automated analytics pipelines with real ecommerce data.

Think of S3 as the world's largest filing cabinet. But instead of drawers and folders, you get buckets and objects. Every data scientist ends up here eventually — your local machine runs out of space around the 50GB mark, and suddenly you're googling "how to store big datasets."

Amazon S3 stores 100+ trillion objects globally. Data scientists love it because it scales practically without limit, costs pennies per GB, and integrates with every analytics tool imaginable. The part that trips everyone up? Understanding the difference between buckets, prefixes, and actual folder structures.

S3 Architecture Breakdown

S3 architecture follows four core concepts. Buckets hold everything. Objects are your actual files. Keys are the file paths that look like folders but aren't really folders. Access control decides who can see what.

1. Buckets - Global Namespace
2. Objects - Your Actual Data
3. Keys - File Path Structure
4. Access Control - Who Sees What
Bucket names must be globally unique across all AWS accounts. If someone already took my-data-bucket, you're out of luck. Data scientists typically use company prefixes: flipkart-analytics-prod or zomato-ml-models-2024.

S3 Object Anatomy

Every S3 object has metadata, version history, and storage class. The key data/2024/ecommerce/january/sales.csv looks like nested folders but it's actually one long filename. This distinction matters for performance optimization.
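To see why a key is not a folder, here is a minimal sketch (the bucket name and prefix are hypothetical) of how tools simulate folders on top of the flat namespace using Prefix and Delimiter:

import boto3

s3_client = boto3.client('s3')

# Hypothetical bucket - substitute your own
bucket = 'flipkart-analytics-prod'

# Delimiter='/' makes S3 group keys that share a prefix, which is how
# consoles render "folders" on top of a flat key namespace
response = s3_client.list_objects_v2(
    Bucket=bucket,
    Prefix='data/2024/ecommerce/',
    Delimiter='/'
)

# CommonPrefixes are the simulated "subfolders"; Contents are real objects
for cp in response.get('CommonPrefixes', []):
    print('folder-like prefix:', cp['Prefix'])
for obj in response.get('Contents', []):
    print('object key:', obj['Key'])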

Storage Classes and Costs

S3 offers six main storage classes. Each balances cost against access speed. Standard costs ₹1.84 per GB monthly. Archive classes drop to as little as ₹0.07 per GB, but retrieval slows as the tier gets colder: Glacier Instant returns objects in milliseconds, while Deep Archive can take hours.

Frequently Accessed

Standard: ₹1.84/GB/month
Standard-IA: ₹0.92/GB/month
One Zone-IA: ₹0.74/GB/month

Archive Storage

Glacier Instant: ₹0.33/GB/month
Glacier Flexible: ₹0.28/GB/month
Deep Archive: ₹0.07/GB/month

The sweet spot for most data science workloads? Standard-IA for datasets you access weekly, Glacier Instant for monthly model retraining data. Deep Archive works for compliance backups you'll hopefully never need.

Standard storage costs 26x more than Deep Archive but provides instant access

Storage class selection depends on access patterns, not file importance. Your most critical model weights might sit in Deep Archive if you only retrain quarterly. Meanwhile, daily ETL temp files belong in Standard despite being completely replaceable.

Smart business decision: set up lifecycle policies that automatically move objects to cheaper storage after 30/90/365 days. This single configuration can cut storage costs by 80% without any code changes.
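A minimal sketch of such a 30/90/365-day lifecycle rule with boto3 (the bucket name and raw-data/ prefix are placeholders; client setup is covered in the next section):

import boto3

s3_client = boto3.client('s3')
bucket_name = 'swiggy-analytics-prod'  # placeholder - use your own bucket

# One rule: tier raw data down to cheaper storage as it ages
s3_client.put_bucket_lifecycle_configuration(
    Bucket=bucket_name,
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'tier-down-raw-data',
            'Filter': {'Prefix': 'raw-data/'},
            'Status': 'Enabled',
            'Transitions': [
                {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                {'Days': 90, 'StorageClass': 'GLACIER_IR'},
                {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'},
            ],
        }]
    },
)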

Uploading Data with boto3

The scenario: Swiggy's data team needs to upload daily order data from their Mumbai servers to S3 for machine learning pipeline processing. The CSV files range from 500MB to 2GB each.
# boto3 is the AWS SDK for Python - it handles authentication and API calls (install with: pip install boto3)
import boto3
import pandas as pd
from datetime import datetime

# Create S3 client - uses credentials from AWS CLI or environment variables
s3_client = boto3.client('s3')

# Load sample ecommerce data that we want to upload
df = pd.read_csv('dataplexa_ecommerce.csv')

What just happened?

We imported boto3, the official AWS SDK that handles all S3 operations. The client automatically detects AWS credentials from your local AWS CLI setup or environment variables. Try this: run aws configure in terminal to set up authentication first.

# Check current data shape and memory usage
print(f"Dataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

# Preview the data we're about to upload
print(df.head(3))

What just happened?

We verified our dataset is 12.84MB with 15,000 orders spanning 2023. The memory_usage(deep=True) gives actual memory footprint including string data. Try this: check file size with import os; os.path.getsize('file.csv') to compare.

# Create bucket name with timestamp for uniqueness
bucket_name = f'swiggy-analytics-{datetime.now().strftime("%Y%m%d")}'
region = 'ap-south-1'  # Mumbai region for Indian data

# Create the bucket - this is a one-time setup
try:
    s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={'LocationConstraint': region}
    )
    print(f"✓ Created bucket: {bucket_name}")
except Exception as e:
    print(f"Bucket creation issue: {e}")

What just happened?

We created a bucket in ap-south-1 (Mumbai) for fastest access from India. The LocationConstraint is required for non-US regions. Try this: check existing buckets with s3_client.list_buckets().

# Save DataFrame to local CSV first - upload_file() expects a file on disk
local_file = 'swiggy_orders_today.csv'
df.to_csv(local_file, index=False)

# Define S3 key (the "path" where file will live in bucket)
s3_key = f'raw-data/2024/march/orders_{datetime.now().strftime("%Y%m%d")}.csv'

# Upload file to S3 with metadata
s3_client.upload_file(
    Filename=local_file,
    Bucket=bucket_name,
    Key=s3_key,
    ExtraArgs={'StorageClass': 'STANDARD_IA', 'ServerSideEncryption': 'AES256'}
)

What just happened?

We uploaded our file using STANDARD_IA storage class (50% cheaper than Standard) and enabled server-side encryption. The key structure creates a logical folder hierarchy. Try this: use upload_fileobj() to upload DataFrame directly without saving to disk first.
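Here is a minimal sketch of that upload_fileobj() variant, reusing df, s3_client, bucket_name, and s3_key from the snippets above:

import io

# Serialize the DataFrame into an in-memory buffer and stream it to S3
buffer = io.BytesIO(df.to_csv(index=False).encode('utf-8'))

s3_client.upload_fileobj(
    Fileobj=buffer,
    Bucket=bucket_name,
    Key=s3_key,
    ExtraArgs={'StorageClass': 'STANDARD_IA', 'ServerSideEncryption': 'AES256'}
)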

Reading Data from S3

The scenario: Flipkart's ML team needs to read order data from S3 for real-time recommendation model training. The data gets updated every hour and they need the latest version instantly.
# List all objects in our bucket to see what's available
response = s3_client.list_objects_v2(Bucket=bucket_name)

# Extract and display file information
if 'Contents' in response:
    for obj in response['Contents']:
        size_mb = obj['Size'] / 1024**2
        modified = obj['LastModified'].strftime('%Y-%m-%d %H:%M')
        print(f"File: {obj['Key']}")
        print(f"Size: {size_mb:.2f} MB | Modified: {modified}")
        print("-" * 50)

What just happened?

We listed bucket contents using list_objects_v2 (the newer, faster API). Each object shows key, size in bytes, and modification timestamp. Try this: add Prefix='raw-data/' parameter to filter results like a folder search.

# Read the S3 object straight into memory, then into pandas
import io

# Download S3 object directly into memory buffer
obj = s3_client.get_object(Bucket=bucket_name, Key=s3_key)
csv_data = obj['Body'].read()

# Convert bytes to DataFrame without touching disk
df_from_s3 = pd.read_csv(io.StringIO(csv_data.decode('utf-8')))

print(f"Loaded from S3: {df_from_s3.shape}")
print(df_from_s3.head(2))

What just happened?

We read S3 data directly into memory using get_object() and io.StringIO(). This avoids disk I/O and is faster for files under 100MB. Try this: for larger files, use download_file() method instead to manage memory usage.
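For those larger files, a minimal sketch of the download_file() approach (the local path is illustrative), combined with chunked reading so memory stays bounded:

# Stream the object to disk instead of holding it all in RAM
s3_client.download_file(
    Bucket=bucket_name,
    Key=s3_key,
    Filename='/tmp/swiggy_orders_large.csv'
)

# Process the CSV in 100k-row chunks to keep memory usage flat
for chunk in pd.read_csv('/tmp/swiggy_orders_large.csv', chunksize=100_000):
    print(chunk.shape)  # replace with real per-chunk processing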

📊 Data Insight

Reading 2.15MB CSV from S3 ap-south-1 to Mumbai takes 1.2 seconds. Same file from us-east-1 would take 3.8 seconds due to network latency. Region selection can improve data pipeline performance by 3x.

S3 Security and Access Control

S3 security operates on three levels: bucket policies, IAM roles, and object ACLs. Most S3 data leaks happen because someone made a bucket public when they only meant to share it with their team. The default is private, so keep it that way.

78% of S3 buckets remain private, but 3% allow dangerous public write access

Bucket policies use JSON to define who can do what. Think of them as building security — they control entry to the entire bucket. IAM roles are like employee badges — they define what each user/service can access across AWS. Object ACLs are rarely used anymore.

The safest approach: block all public access by default, then create specific IAM roles for your data science team. Use pre-signed URLs to share individual files temporarily without exposing the entire bucket.
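A quick sketch of that pre-signed URL approach, reusing bucket_name and s3_key from the upload example:

# Generate a time-limited link to one object without opening the bucket
url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': bucket_name, 'Key': s3_key},
    ExpiresIn=3600  # the link expires after one hour
)
print(url)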

Security Mistake: Public Bucket

Never set "Effect": "Allow", "Principal": "*" in bucket policies. This makes your entire bucket publicly readable. Use IAM roles with specific permissions instead: s3:GetObject for read-only access.

Performance Optimization

The scenario: OYO's data engineering team processes 10GB of booking data daily. Their current S3 setup takes 45 minutes to download all files sequentially. They need this down to under 10 minutes for real-time dashboards.

S3 performance depends on request patterns and file organization. Single large files transfer faster than many small files. Parallel downloads beat sequential every time. The magic number? 5-10MB chunks for optimal throughput.

Peak transfer speeds occur around 100MB file sizes - smaller files waste time on connection overhead
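A minimal sketch of the parallel-download idea for OYO's case, assuming the bucket_name and raw-data/ prefix conventions used earlier (boto3 clients are safe to share across threads):

from concurrent.futures import ThreadPoolExecutor

def download_one(key):
    # Each worker writes its object to a flat local filename
    local_path = key.replace('/', '_')
    s3_client.download_file(Bucket=bucket_name, Key=key, Filename=local_path)
    return local_path

# Collect today's keys, then fetch them concurrently instead of one by one
listing = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='raw-data/')
keys = [obj['Key'] for obj in listing.get('Contents', [])]

with ThreadPoolExecutor(max_workers=10) as pool:
    for path in pool.map(download_one, keys):
        print('downloaded', path)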

File naming affects performance too. Sequential names like data_001.csv, data_002.csv concentrate every request on the same prefix, creating hotspots on S3's distributed system. Better: spread keys across multiple prefixes with random hash prefixes or date-based paths like 2024/03/15/14/32/orders.csv.

For massive datasets, consider multipart uploads and Transfer Acceleration. Multipart automatically handles files over 100MB in parallel chunks. Transfer Acceleration uses CloudFront edge locations for 50-500% speed improvements, especially from India to US regions.
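One way to tune multipart behavior from boto3 is the transfer manager's TransferConfig; the thresholds and filenames below are illustrative, not official recommendations:

from boto3.s3.transfer import TransferConfig

# Split anything over 100 MB into 8 MB parts and push 10 parts at a time
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=10,
)

s3_client.upload_file(
    Filename='bookings_full_day.csv',  # hypothetical 10 GB daily export
    Bucket=bucket_name,
    Key='raw-data/2024/march/bookings_full_day.csv',
    Config=config,
)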

Pro tip: Enable S3 Transfer Acceleration for buckets accessed from multiple countries. It adds roughly ₹3 per GB transferred but can double international transfer speeds. Essential for distributed data science teams.

Integration with Analytics Tools

S3 integrates natively with every major data science tool. Jupyter notebooks, Apache Spark, Tableau, Power BI — they all speak S3 fluently. The key advantage? Your data stays in one place while different tools access it as needed.

Popular integrations include Athena for SQL queries directly on S3 data, QuickSight for dashboards, and SageMaker for ML training. Each tool can read the same S3 bucket without duplicating data or managing complex ETL pipelines.

Why does this integration matter for data scientists? Because you can prototype in Jupyter, scale analysis with Spark, train models in SageMaker, and create executive dashboards in QuickSight — all reading from the same S3 source. No more "which version of the data are we using?" conversations.
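For example, the Jupyter end of that workflow can be a one-liner, assuming the s3fs package is installed and reusing bucket_name and s3_key from the upload section:

import pandas as pd

# pandas delegates s3:// paths to s3fs, so the same key feeds every tool
df = pd.read_csv(f's3://{bucket_name}/{s3_key}')
print(df.shape)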

📊 Data Insight

Teams using S3 as their central data lake report 40% faster project delivery. Single source of truth eliminates data synchronization bugs and version conflicts that typically consume 15-20% of data science project time.

Quiz

1. Your team analyzes customer data weekly but retrains ML models monthly. Historical data from last year is accessed quarterly for compliance. What's the optimal S3 storage class strategy?


2. You need to read a 50MB CSV file from S3 into pandas DataFrame every 5 minutes for real-time analysis. What's the fastest approach?


3. Your data pipeline creates 1000 files per hour with sequential names like data_001.csv, data_002.csv. Upload performance is degrading as volume increases. What's the issue and solution?


Up Next

Cloud Compute

Scale your data processing with EC2, Lambda, and container services that automatically handle the computational demands of big data analytics.