Data Science Lesson 40 – BigQuery | Dataplexa
Cloud Data Warehousing · Lesson 40

BigQuery

Query petabytes of data in seconds using Google's serverless data warehouse, then analyze patterns and trends with Python integration.

What Makes BigQuery Different

BigQuery runs on Google's infrastructure — the same system powering YouTube analytics and Gmail search. Your queries get distributed across thousands of machines automatically. No servers to manage, no indexes to create, no performance tuning required.

Think of it like this: traditional databases are like having one very smart person solve a puzzle. BigQuery is like having 1000 people each solve one piece simultaneously. That's why you can scan 100 billion rows in under 30 seconds.

Traditional Database

Single machine processing
Minutes for complex queries
Fixed storage limits

BigQuery

Distributed processing
Seconds for petabyte scans
Infinite scaling

Common Mistake: Treating BigQuery Like MySQL

Don't create indexes or worry about normalization. BigQuery loves denormalized wide tables. The mistake is optimizing queries the traditional way; BigQuery's columnar storage plays by different rules.
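A minimal sketch of what a denormalized wide table looks like in practice, assuming a hypothetical orders table with a nested, repeated `items` field (the project and table names are made up):

```python
# Instead of joining an orders table to an order_items table (the normalized
# approach), BigQuery schemas often nest the line items inside each order row.
# The table below is hypothetical.
denormalized_query = """
SELECT o.order_id,
       item.product_name,
       item.quantity * item.unit_price AS line_total
FROM `my-project.ecommerce.orders` AS o,
     UNNEST(o.items) AS item
WHERE o.order_date >= '2023-01-01'
"""

# UNNEST flattens the repeated `items` field into one row per line item,
# replacing what would be a JOIN against a second table in a normalized schema
print(denormalized_query)
```

One wide table means no join cost at query time, which is exactly the trade-off BigQuery's storage is designed for.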

Setting Up BigQuery Access

The scenario: Flipkart's data team needs to analyze 5 years of transaction data stored in BigQuery. The analyst needs Python access to create automated reports.

# Install the BigQuery client library and its pandas integration
!pip install google-cloud-bigquery pandas-gbq

# Import required libraries for BigQuery operations
from google.cloud import bigquery
import pandas as pd

What just happened?

The google-cloud-bigquery package handles authentication and query execution. The pandas-gbq package lets you return query results directly as DataFrames. Try this: check your installed version with bigquery.__version__.

The scenario: Setting up authentication to access your company's BigQuery project. Honestly, this authentication step trips up 90% of newcomers.

# Create BigQuery client - automatically finds your credentials
client = bigquery.Client()

# Set your project ID (replace with your actual project)
project_id = "flipkart-analytics"

What just happened?

The bigquery.Client() automatically detects your Google Cloud credentials from environment variables or service account files. The project_id tells BigQuery which Google Cloud project contains your datasets. Try this: run client.project to confirm your connected project.

Running Your First Query

BigQuery uses Standard SQL — if you know MySQL or PostgreSQL, you're 90% there. The key difference? Columnar storage means every column you reference is read in full, so SELECT * is expensive, while filtering on any column is fast with no indexes required.

The scenario: Swiggy's analyst needs to find the top revenue-generating cities from their transaction data stored in BigQuery.

# Write SQL query to analyze revenue by city
query = """
SELECT city, 
       SUM(revenue) as total_revenue,
       COUNT(*) as total_orders
FROM `flipkart-analytics.ecommerce.transactions` 
WHERE date >= '2023-01-01'
GROUP BY city
ORDER BY total_revenue DESC
"""

What just happened?

The backticks in `project.dataset.table` are BigQuery's way of specifying the full table path. BigQuery shows an estimated scan size (about 2.1 GB here) before the query runs, which helps control costs. Try this: check the estimate with a dry run before executing; note that adding a LIMIT does not reduce the bytes scanned, because BigQuery still reads the full columns.

The scenario: Now execute the query and get results back as a pandas DataFrame for analysis and visualization.

# Execute query and convert results to pandas DataFrame
df = client.query(query).to_dataframe()

# Display first few rows to verify data
print(df.head())
print(f"\nTotal cities analyzed: {len(df)}")

What just happened?

The .to_dataframe() method converts BigQuery results into a pandas DataFrame automatically. Mumbai leads with ₹45.6 crores revenue from 18,234 orders. Try this: use df.dtypes to see how BigQuery data types map to pandas.
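Once the results are in pandas, ordinary DataFrame operations take over. A quick sketch with made-up figures shaped like the query output:

```python
import pandas as pd

# Illustrative sample mimicking the query result; the figures are made up,
# not real transaction data
sample_df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Bangalore"],
    "total_revenue": [456_000_000, 389_000_000, 313_000_000],
    "total_orders": [18_234, 15_902, 13_441],
})

# Derive average order value and each city's share of total revenue
sample_df["avg_order_value"] = sample_df["total_revenue"] / sample_df["total_orders"]
sample_df["revenue_share"] = sample_df["total_revenue"] / sample_df["total_revenue"].sum()

print(sample_df.round(2))
```

Push heavy aggregation into SQL and keep pandas for the last-mile derived columns; that way BigQuery does the work of scanning millions of rows and your notebook only handles a few.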

Mumbai generates 18% more revenue than Delhi, making it the clear market leader for expansion focus.

This chart shows BigQuery's power — scanning millions of transactions across 5 cities took under 3 seconds. Mumbai's dominance suggests concentrating premium product launches there first.

The business insight? Delhi and Bangalore show strong potential with ₹38.9 and ₹31.3 crores respectively. Focus marketing spend on these tier-1 cities for maximum ROI.

Advanced Query Patterns

BigQuery shines with complex aggregations and window functions. Why does this matter? Because traditional databases choke on queries that BigQuery handles effortlessly.

The scenario: HDFC Bank needs to analyze customer transaction patterns with running totals and moving averages — impossible to do efficiently in Excel or small databases.

# Complex window function query for time series analysis
advanced_query = """
SELECT date,
       SUM(revenue) as daily_revenue,
       AVG(rating) as avg_rating,
       SUM(SUM(revenue)) OVER (ORDER BY date) as running_total
FROM `flipkart-analytics.ecommerce.transactions` 
WHERE date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY date
ORDER BY date
"""
# Execute advanced query and examine time series patterns
time_series_df = client.query(advanced_query).to_dataframe()

# Convert date string to datetime for better plotting
time_series_df['date'] = pd.to_datetime(time_series_df['date'])

# Show sample of running totals
print(time_series_df[['date', 'daily_revenue', 'running_total']].head(7))

What just happened?

The OVER (ORDER BY date) window function calculates running totals without self-joins. Day 7 shows ₹1.13 crore cumulative revenue. Traditional databases would struggle with this calculation across millions of rows. Try this: add LAG(daily_revenue) OVER (ORDER BY date) to compare day-over-day changes.
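You can sanity-check the window-function logic locally in pandas before paying for a scan: cumsum() mirrors the running total, and diff() mirrors the LAG day-over-day comparison. The daily figures here are illustrative:

```python
import pandas as pd

# Made-up daily revenue figures shaped like the query output
ts = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=7),
    "daily_revenue": [1_450_000, 1_520_000, 1_390_000, 1_610_000,
                      1_480_000, 1_890_000, 1_960_000],
})

# cumsum() mirrors SUM(SUM(revenue)) OVER (ORDER BY date)
ts["running_total"] = ts["daily_revenue"].cumsum()

# diff() mirrors daily_revenue - LAG(daily_revenue) OVER (ORDER BY date)
ts["day_over_day"] = ts["daily_revenue"].diff()

print(ts)
```

Testing the logic on a small local frame first is cheap insurance against rerunning an expensive query because of a window-clause mistake.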

Weekend spike on January 6th suggests customers prefer weekend shopping, indicating optimal promotion timing.

📊 Data Insight

January 6th weekend generated ₹18.9 lakhs — 30% above the weekly average of ₹14.5 lakhs. Weekend promotion campaigns could boost revenue by targeting this natural shopping behavior pattern.

Cost Optimization Strategies

BigQuery charges by data scanned — not by query complexity or time. A 10-second query scanning 1TB costs the same as a 1-second query scanning 1TB. That's why SELECT * is your wallet's enemy.
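That pricing model is easy to turn into a back-of-the-envelope estimator. A sketch assuming an on-demand rate of roughly $6.25 per TiB; check Google's current pricing for your region:

```python
# Rough on-demand cost estimate from bytes scanned. The $6.25-per-TiB rate
# is an assumption -- verify against current published pricing.
PRICE_PER_TIB_USD = 6.25
TIB = 2**40  # bytes in a tebibyte

def estimate_query_cost(bytes_scanned: int) -> float:
    """Return the approximate on-demand cost in USD for a query."""
    return bytes_scanned / TIB * PRICE_PER_TIB_USD

# A full scan of 2 TiB costs the same whether the query takes 1s or 10s
print(f"${estimate_query_cost(2 * TIB):.2f}")  # -> $12.50
```

Feed this the total_bytes_processed value from a dry run and you know a query's price before running it.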

The scenario: Paytm's data team got a $5,000 BigQuery bill last month. The CFO wants cost controls and query optimization immediately.

# Bad query - scans entire table for SELECT *
expensive_query = """
SELECT * 
FROM `paytm-analytics.payments.transactions`
WHERE amount > 1000
"""

# Check how much data this would scan
job_config = bigquery.QueryJobConfig(dry_run=True)
job = client.query(expensive_query, job_config=job_config)
print(f"This query will scan: {job.total_bytes_processed / 1e9:.2f} GB")
# Optimized query - only select needed columns
optimized_query = """
SELECT transaction_id, date, amount, customer_id
FROM `paytm-analytics.payments.transactions`
WHERE amount > 1000
"""

# Check optimized scan size
optimized_job = client.query(optimized_query, job_config=job_config)
print(f"Optimized query scans: {optimized_job.total_bytes_processed / 1e9:.2f} GB")
savings = (job.total_bytes_processed - optimized_job.total_bytes_processed) / job.total_bytes_processed * 100
print(f"Cost savings: {savings:.1f}%")

What just happened?

The dry_run=True option estimates scan size without running the query. Selecting 4 specific columns instead of SELECT * reduced costs by 82% — from $23 to $4 monthly. Try this: set maximum_bytes_billed on a QueryJobConfig to cap what a single query may scan; unlike LIMIT, which does not reduce the bytes scanned, this makes a runaway query fail before it overspends.

82% cost reduction achieved by selecting specific columns instead of SELECT * — saving $228 annually per query.

This optimization pattern scales across hundreds of queries. Paytm's team could save thousands monthly by auditing their most expensive queries first.

The business impact? Those saved costs fund more BigQuery storage or additional analytics tools. Smart querying pays for better data infrastructure.

Pro tip: Use BigQuery's query validator before running expensive queries. Set up billing alerts at $100 monthly to catch runaway costs early. Partition your tables by date to make time-range queries incredibly cheap.
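The partitioning tip can be sketched as DDL, assuming the transactions table's date column is of type DATE (the table names follow the earlier examples and are hypothetical):

```python
# Hypothetical DDL: partition the transactions table by date so that
# time-range queries only scan the partitions they touch
ddl = """
CREATE TABLE `flipkart-analytics.ecommerce.transactions_partitioned`
PARTITION BY date
AS SELECT * FROM `flipkart-analytics.ecommerce.transactions`
"""

# Against the partitioned table, this query reads only January's
# partitions instead of the entire table
partitioned_query = """
SELECT city, SUM(revenue) AS total_revenue
FROM `flipkart-analytics.ecommerce.transactions_partitioned`
WHERE date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY city
"""
print(ddl)
```

Run a dry run against both tables with the same date filter and compare total_bytes_processed to see the partition pruning in action.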

Quiz

1. Your team's BigQuery bill jumped from $50 to $500 this month. The main cause is likely queries using SELECT * instead of specific columns. Why does this happen?


2. You need to calculate running totals of daily revenue in BigQuery. What does the window function SUM(revenue) OVER (ORDER BY date) accomplish?


3. Before running a BigQuery query on a 10TB table, what's the best approach to avoid unexpected costs?


Up Next

Redshift

Master Amazon's data warehouse architecture and learn how Redshift's columnar storage compares to BigQuery for enterprise analytics workloads.