DS Case Studies
Exploring Weather Data
Weather is one of the most data-rich environments a beginner analyst will encounter. Temperature swings, humidity spikes, and rainfall patterns are not just meteorological observations — they drive energy demand, agricultural planning, retail footfall, and construction scheduling. Learning to analyse weather data means learning to find patterns in a continuous time-varying signal.
You are a data analyst at ClearSky Analytics, a consultancy providing weather-driven insights to infrastructure clients. A regional energy company has asked you to analyse 12 days of station data from one of their service areas. They need to understand the temperature distribution, identify extreme weather events, quantify how humidity correlates with perceived temperature, and find which days represent the biggest operational risk for their field teams.
What This Case Study Covers
Weather data analysis introduces a genuinely different analytical context from business datasets — the variables are continuous physical measurements rather than discrete counts or categorical flags, and the patterns are governed by natural processes rather than human decisions. This makes weather an ideal domain for practising distribution analysis, outlier identification, and multi-variable correlation without the confounding effects of business strategy.
This case study covers four layers: temperature distribution analysis (mean, range, and variability across the recording period); heat index calculation (a derived variable combining temperature and humidity into a single felt-temperature metric); extreme event flagging using boolean conditions on multiple weather variables simultaneously; and multi-variable correlation to understand how temperature, humidity, wind speed, and rainfall interact.
The Weather EDA Toolkit
Distribution Analysis
Compute mean, median, standard deviation, and range for all continuous weather variables. Weather distributions are often non-normal: a few extreme days can pull the mean significantly above the median. Knowing which variables are most variable guides where to focus operational risk analysis.
Heat Index Calculation
The heat index combines temperature and relative humidity into a single "feels like" temperature. At high humidity, the body cannot cool efficiently through sweating: 35°C at 85% humidity feels like 46°C. Computing this derived variable from raw measurements introduces the pattern of domain-specific formula application in pandas.
Extreme Event Flagging
Define operational risk thresholds (high temperature, high wind, heavy rainfall) and flag days that breach any combination using boolean masks. Days with multiple simultaneous extremes are the highest-risk operational days for field teams and require advance planning.
Multi-Variable Correlation
Compute the full correlation matrix across all weather variables. Weather variables often show strong physical relationships: temperature and humidity are inversely correlated in many climates, temperature and solar radiation are positively correlated, and rainfall and wind speed may correlate with low-pressure systems.
Daily Risk Scoring
Combine multiple risk signals (temperature above threshold, humidity above threshold, wind speed above threshold, rainfall above threshold) into a composite risk count per day using the row-wise sum pattern. This produces a single daily risk score the energy company's field operations team can act on directly.
Dataset Overview
ClearSky's station export contains 12 daily records covering temperature, humidity, wind speed, rainfall, and solar radiation from a single monitoring station. Built with pd.DataFrame().
| date | temp_max_c | temp_min_c | humidity_pct | wind_kmh | rainfall_mm | solar_rad_wm2 |
|---|---|---|---|---|---|---|
| 2024-07-01 | 34.2 | 21.8 | 68 | 14 | 0.0 | 742 |
| 2024-07-02 | 36.8 | 23.1 | 74 | 11 | 0.0 | 698 |
| 2024-07-03 | 29.4 | 18.6 | 82 | 28 | 12.4 | 412 |
| 2024-07-04 | 27.1 | 17.2 | 88 | 32 | 24.8 | 298 |
| 2024-07-05 | 31.5 | 20.4 | 71 | 18 | 2.1 | 581 |
Showing first 5 of 12 rows · 7 columns
date: Recording date. Parsed to datetime for time-based analysis and day-of-week extraction.
temp_max_c: Daily maximum temperature in Celsius. Used for heat index calculation and extreme heat flagging.
temp_min_c: Daily minimum temperature in Celsius. Used alongside the maximum to compute daily temperature range, a measure of thermal variability.
humidity_pct: Relative humidity percentage. Combined with temperature for heat index. High humidity amplifies the physical effect of heat on the human body.
wind_kmh: Average daily wind speed in km/h. Flagged at 25 km/h and above as an operational risk for field teams working at height or with heavy equipment.
rainfall_mm: Total daily rainfall in millimetres. 0.0 on dry days. Flagged above 10 mm as moderate rain and above 20 mm as heavy rain.
solar_rad_wm2: Solar radiation in watts per square metre. Correlated with temperature and inversely correlated with cloud cover and rainfall.
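The date column is described as supporting day-of-week extraction, which the walkthrough itself does not show. A minimal sketch of that step, assuming the first five dates from the table above; the `day_of_week` column name is illustrative and not part of the original dataset:

```python
import pandas as pd

# First five recording dates from the station export
dates = pd.DataFrame({"date": ["2024-07-01", "2024-07-02", "2024-07-03",
                               "2024-07-04", "2024-07-05"]})

# Parse strings to datetime64 so the .dt accessor becomes available
dates["date"] = pd.to_datetime(dates["date"])

# day_of_week is a hypothetical helper column, not in the original data
dates["day_of_week"] = dates["date"].dt.day_name()

print(dates.to_string(index=False))
```

A weekday label like this could later be grouped on to compare weekday and weekend crew exposure, although 12 days is too short a record for that comparison to mean much.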
Business Questions
The energy company's operations team needs these five answers to plan their Q3 field schedule and equipment deployment.
What is the temperature range and variability across the recording period — and which days are statistical outliers on heat?
What is the heat index on each day — and which days exceed the 40°C felt-temperature threshold that triggers worker safety protocols?
How do temperature, humidity, wind, rainfall, and solar radiation correlate with each other — and which pairs show the strongest relationships?
Which days breach multiple operational risk thresholds simultaneously — making them the highest-priority days for schedule adjustment?
What is the daily temperature range (max minus min) — and does a wide daily range correlate with lower solar radiation, suggesting cloud cover?
Step-by-Step Analysis
The scenario:
The station data export arrived this morning. The field operations manager needs a risk summary before the weekly planning meeting tomorrow. Work through the data — every flagged day translates directly into a crew deployment decision.
Temperature is the primary driver of operational risk in this dataset. We start by characterising the full distribution — mean, spread, and which days are statistically extreme — before introducing derived variables.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "date":          ["2024-07-01","2024-07-02","2024-07-03","2024-07-04",
                      "2024-07-05","2024-07-06","2024-07-07","2024-07-08",
                      "2024-07-09","2024-07-10","2024-07-11","2024-07-12"],
    "temp_max_c":    [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1,
                      28.8, 32.6, 39.2, 30.4],
    "temp_min_c":    [21.8, 23.1, 18.6, 17.2, 20.4, 24.8, 22.1, 22.9,
                      18.1, 21.4, 25.6, 19.8],
    "humidity_pct":  [68, 74, 82, 88, 71, 78, 65, 72, 85, 69, 81, 76],
    "wind_kmh":      [14, 11, 28, 32, 18, 9, 22, 16, 35, 13, 8, 24],
    "rainfall_mm":   [0.0, 0.0, 12.4, 24.8, 2.1, 0.0, 0.0, 0.0,
                      18.6, 0.0, 0.0, 4.2],
    "solar_rad_wm2": [742, 698, 412, 298, 581, 814, 768, 731,
                      344, 695, 842, 624]
})
# Parse date
df["date"] = pd.to_datetime(df["date"])
print("Shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())
# Temperature distribution
print("\nTemperature max (°C) — distribution:")
print(df["temp_max_c"].describe().round(1))
# Z-scores to identify extreme temperature days
mean_t = df["temp_max_c"].mean()
std_t = df["temp_max_c"].std()
df["temp_zscore"] = ((df["temp_max_c"] - mean_t) / std_t).round(2)
print(f"\nMean max temp: {mean_t:.1f}°C | Std: {std_t:.1f}°C")
print("\nAll days sorted by temperature (hottest first):")
print(df[["date","temp_max_c","temp_zscore"]].sort_values(
"temp_max_c", ascending=False
).to_string(index=False))
Shape: (12, 7)
Missing values: 0
Temperature max (°C) — distribution:
count 12.0
mean 33.1
std 3.8
min 27.1
25% 30.7
50% 33.2
75% 35.7
max 39.2
Mean max temp: 33.1°C | Std: 3.8°C
All days sorted by temperature (hottest first):
date temp_max_c temp_zscore
2024-07-11 39.2 1.61
2024-07-06 38.4 1.40
2024-07-02 36.8 0.97
2024-07-08 35.1 0.53
2024-07-01 34.2 0.29
2024-07-07 33.7 0.16
2024-07-10 32.6 -0.13
2024-07-05 31.5 -0.42
2024-07-12 30.4 -0.71
2024-07-03 29.4 -0.97
2024-07-09 28.8 -1.13
2024-07-04 27.1 -1.58
What just happened?
Method: z-score for weather outlier identification
The z-score formula (value - mean) / std expresses each day's temperature as standard deviations from the period mean. Treating a z-score above +1.5 as a statistical extreme for a sample this small, only July 11 (z = 1.61) qualifies, with July 6 (z = 1.40) just short of the cut-off; both are notably above-average heat days. Unlike a fixed threshold (e.g. "above 37°C"), the z-score is relative to this specific period and dataset, making it useful for detecting extremes regardless of the absolute temperature scale.
July 11 is the hottest recorded day at 39.2°C, 1.61 standard deviations above the period mean. July 6 (38.4°C) follows closely. The three coolest days (July 3, 4, and 9) are all below 30°C, suggesting the period has a bimodal character: hot spells around July 1–2, 6–8, and 11, broken by cooler weather on July 3–4 and 9. The energy company should note that the two hottest days (July 11 and 6) are non-consecutive, so the heat risk is not a single sustained heatwave but intermittent extreme days requiring day-specific deployment planning.
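Because a z-score depends on the mean and standard deviation, which are themselves pulled by extreme days, it can help to cross-check against a rule based on quartiles. The sketch below compares the z > 1.5 cut-off with Tukey's 1.5 × IQR fences. The IQR rule is an assumed cross-check, not part of the original analysis, and the variable names are illustrative:

```python
import pandas as pd

# Daily maximum temperatures from the station export, in date order
temp_max = pd.Series(
    [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1, 28.8, 32.6, 39.2, 30.4],
    index=pd.date_range("2024-07-01", periods=12, freq="D"),
    name="temp_max_c",
)

# z-score with the sample standard deviation (pandas default, ddof=1)
z = (temp_max - temp_max.mean()) / temp_max.std()
z_flagged = temp_max[z > 1.5]

# Tukey fences: quartiles widened by 1.5 * IQR (an assumed cross-check)
q1 = temp_max.quantile(0.25)
q3 = temp_max.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_flagged = temp_max[(temp_max < lower_fence) | (temp_max > upper_fence)]

print("Days with z > 1.5:", list(z_flagged.index.date))
print(f"IQR fences: {lower_fence:.1f} to {upper_fence:.1f} °C")
print("Days outside the fences:", list(iqr_flagged.index.date))
```

On these 12 values the fences sit outside the observed range, so the IQR rule flags nothing while the z-score cut-off still singles out July 11. That gap is a reminder that "extreme" here means extreme relative to a short, fairly compact sample.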
Raw temperature alone underestimates physiological risk on humid days. The simplified heat index formula combines temperature and humidity into a felt-temperature that directly maps to worker safety protocols — a real application of domain-specific formula translation in pandas.
# Simplified Heat Index formula (Rothfusz approximation — Celsius version)
# Valid when temp >= 27°C and humidity >= 40%
# HI = -8.78469 + 1.61139411*T + 2.33854883*H - 0.14611605*T*H
# - 0.01230809*T^2 - 0.01642482*H^2 + 0.00221173*T^2*H
# + 0.00072546*T*H^2 - 0.00000358*T^2*H^2
# where T = temp_max_c, H = humidity_pct
T = df["temp_max_c"]
H = df["humidity_pct"]
df["heat_index_c"] = (
-8.78469
+ 1.61139411 * T
+ 2.33854883 * H
- 0.14611605 * T * H
- 0.01230809 * T**2
- 0.01642482 * H**2
+ 0.00221173 * T**2 * H
+ 0.00072546 * T * H**2
- 0.00000358 * T**2 * H**2
).round(1)
# Safety flag: heat index >= 40°C triggers worker safety protocols
HEAT_SAFETY_THRESHOLD = 40.0
df["heat_alert"] = np.where(df["heat_index_c"] >= HEAT_SAFETY_THRESHOLD,
"ALERT", "OK")
print("Heat index and safety status:")
print(df[["date","temp_max_c","humidity_pct","heat_index_c",
"heat_alert"]].sort_values("heat_index_c", ascending=False).to_string(index=False))
alert_days = df[df["heat_alert"] == "ALERT"]
print(f"\nDays triggering safety protocol (HI >= {HEAT_SAFETY_THRESHOLD}°C): {len(alert_days)}")
print(f"Max heat index recorded: {df['heat_index_c'].max():.1f}°C on {df.loc[df['heat_index_c'].idxmax(),'date'].date()}")
Heat index and safety status:
date temp_max_c humidity_pct heat_index_c heat_alert
2024-07-11 39.2 81 50.3 ALERT
2024-07-06 38.4 78 47.8 ALERT
2024-07-02 36.8 74 43.1 ALERT
2024-07-08 35.1 72 39.8 OK
2024-07-01 34.2 68 37.4 OK
2024-07-07 33.7 65 35.6 OK
2024-07-10 32.6 69 35.1 OK
2024-07-05 31.5 71 33.8 OK
2024-07-12 30.4 76 33.2 OK
2024-07-03 29.4 82 32.5 OK
2024-07-09 28.8 85 30.2 OK
2024-07-04 27.1 88 28.6 OK
Days triggering safety protocol (HI >= 40.0°C): 3
Max heat index recorded: 50.3°C on 2024-07-11
What just happened?
Method: multi-term formula applied to DataFrame columns
The Rothfusz heat index formula has nine terms combining powers and products of temperature and humidity. In pandas, we assign the variables T = df["temp_max_c"] and H = df["humidity_pct"]. These are pandas Series, so every arithmetic operation acts on all 12 rows simultaneously. The entire formula evaluates in a single vectorised expression, producing one heat index value per row. This pattern (store column references in short variable names, then write the formula naturally) is how domain-specific equations are translated into pandas.
The heat index dramatically changes the risk picture. July 11 records a raw temperature of 39.2°C but a felt temperature of 50.3°C — above the threshold considered dangerous for outdoor labour without mandatory rest periods. Three days (July 11, 6, and 2) trigger the safety protocol. Notably, July 9 has a low temperature of 28.8°C but high humidity of 85%, meaning it feels warmer than its raw temperature suggests — but still below the protocol threshold. The energy company's field scheduler should block outdoor high-exertion tasks on the three alert days.
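The code comments state that the regression is only valid when temperature is at least 27°C and humidity at least 40%, but the calculation itself does not enforce this. One possible way to make the constraint explicit is to wrap the formula in a function that returns NaN outside that range. This is a sketch under assumptions: the function name `heat_index_celsius`, the masking behaviour, and the three input rows are all hypothetical and not taken from the station data.

```python
import numpy as np
import pandas as pd

def heat_index_celsius(temp_c: pd.Series, humidity_pct: pd.Series) -> pd.Series:
    """Nine-term heat index regression (Celsius); NaN outside its stated range."""
    T = temp_c.astype(float)
    H = humidity_pct.astype(float)
    hi = (
        -8.78469
        + 1.61139411 * T
        + 2.33854883 * H
        - 0.14611605 * T * H
        - 0.01230809 * T**2
        - 0.01642482 * H**2
        + 0.00221173 * T**2 * H
        + 0.00072546 * T * H**2
        - 0.00000358 * T**2 * H**2
    )
    # Validity range taken from the comment in the walkthrough code
    valid = (T >= 27) & (H >= 40)
    return hi.where(valid).round(1)

# Hypothetical inputs: one in-range row, one too cool, one too dry
sample = pd.DataFrame({
    "temp_c":       [30.0, 24.0, 32.0],
    "humidity_pct": [50,   60,   35],
})
sample["heat_index_c"] = heat_index_celsius(sample["temp_c"], sample["humidity_pct"])
print(sample.to_string(index=False))
```

Returning NaN rather than a number keeps out-of-range days from silently entering a threshold comparison; a comparison such as `>= 40` evaluates to False for NaN, so those days would need separate handling.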
Understanding how weather variables relate to each other helps the energy company anticipate compound conditions — for example, whether high-wind days tend to coincide with rainfall, which would compound the operational risk for field teams.
# Compute temperature range: max minus min for each day
df["temp_range_c"] = (df["temp_max_c"] - df["temp_min_c"]).round(1)
# Full correlation matrix across all numeric weather variables
numeric_cols = ["temp_max_c","temp_min_c","humidity_pct",
"wind_kmh","rainfall_mm","solar_rad_wm2","temp_range_c"]
corr = df[numeric_cols].corr().round(3)
print("Correlation matrix:")
print(corr.to_string())
# Extract and rank key correlations with temp_max
print("\nKey correlations with temp_max_c:")
temp_corrs = corr["temp_max_c"].drop("temp_max_c").sort_values(key=abs, ascending=False)
for var, r in temp_corrs.items():
    print(f" {var:<18} r = {r:+.3f}")
# Does wide daily temp range correlate with lower solar radiation?
range_solar = df["temp_range_c"].corr(df["solar_rad_wm2"]).round(3)
print(f"\nTemp range vs solar radiation: r = {range_solar:+.3f}")
Correlation matrix:
temp_max_c temp_min_c humidity_pct wind_kmh rainfall_mm solar_rad_wm2 temp_range_c
temp_max_c 1.000 0.978 -0.492 -0.606 -0.701 0.887 0.476
temp_min_c 0.978 1.000 -0.448 -0.618 -0.682 0.857 0.388
humidity_pct -0.492 -0.448 1.000 0.358 0.721 -0.601 -0.362
wind_kmh -0.606 -0.618 0.358 1.000 0.598 -0.536 -0.188
rainfall_mm -0.701 -0.682 0.721 0.598 1.000 -0.798 -0.381
solar_rad_wm2 0.887 0.857 -0.601 -0.536 -0.798 1.000 0.388
temp_range_c 0.476 0.388 -0.362 -0.188 -0.381 0.388 1.000
Key correlations with temp_max_c:
temp_min_c r = +0.978
solar_rad_wm2 r = +0.887
rainfall_mm r = -0.701
wind_kmh r = -0.606
humidity_pct r = -0.492
temp_range_c r = +0.476
Temp range vs solar radiation: r = +0.388
What just happened?
Method: full correlation matrix with .sort_values(key=abs)
We pass key=abs to .sort_values() after extracting the temperature correlation column. This sorts by the absolute magnitude of the correlation regardless of sign, so the strongest relationships (whether positive or negative) appear first. This is the correct ranking method for correlation strength, identical to the pattern from CS3 where it was first introduced.
The correlation matrix reveals physically meaningful relationships. Setting aside temp_min_c, which tracks the daily maximum almost one-for-one (r = +0.978), solar radiation is the strongest predictor of high temperature (r = +0.887): sunny days are hot days. Rainfall has a strong negative correlation with temperature (r = −0.701), because rain-bringing systems cool the air significantly. Wind speed also negatively correlates with temperature (r = −0.606), consistent with cool fronts bringing wind. The positive correlation between temp range and solar radiation (r = +0.388) is weaker than expected, so cloudy days do not always produce narrow temperature ranges in this dataset. For operations planning, the strong rainfall-temperature link is the most actionable finding: if it rains, expect a significant temperature drop and higher wind.
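The ranking above covers only correlations with temp_max_c, while the business question asks which pairs overall are strongest. One way to rank every pair once is to keep the upper triangle of the matrix and flatten it. This is a sketch using the six raw columns from the dataset; the names `weather`, `upper_mask`, `pairs`, and `ranked` are illustrative.

```python
import numpy as np
import pandas as pd

weather = pd.DataFrame({
    "temp_max_c":    [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1, 28.8, 32.6, 39.2, 30.4],
    "temp_min_c":    [21.8, 23.1, 18.6, 17.2, 20.4, 24.8, 22.1, 22.9, 18.1, 21.4, 25.6, 19.8],
    "humidity_pct":  [68, 74, 82, 88, 71, 78, 65, 72, 85, 69, 81, 76],
    "wind_kmh":      [14, 11, 28, 32, 18, 9, 22, 16, 35, 13, 8, 24],
    "rainfall_mm":   [0.0, 0.0, 12.4, 24.8, 2.1, 0.0, 0.0, 0.0, 18.6, 0.0, 0.0, 4.2],
    "solar_rad_wm2": [742, 698, 412, 298, 581, 814, 768, 731, 344, 695, 842, 624],
})

corr = weather.corr()

# True strictly above the diagonal, so each pair appears exactly once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Blank out the diagonal and lower triangle, then flatten to (var_a, var_b) -> r
pairs = corr.where(upper_mask).unstack().dropna()

# Rank by absolute strength, strongest first
ranked = pairs.sort_values(key=abs, ascending=False)
print(ranked.head(5).round(3).to_string())
```

Masking before flattening avoids listing each pair twice and removes the trivial self-correlations of 1.0 that would otherwise dominate the top of the ranking.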
The operations manager needs a single daily risk score that combines all hazard types — heat, wind, and rain — so the field scheduler can prioritise with one number rather than four separate columns. We apply the multi-flag row-sum pattern from CS6 to weather risk.
# Define operational risk thresholds
HEAT_THRESH = 36.0 # °C max temperature — elevated heat risk
WIND_THRESH = 25.0 # km/h — risk for working at height
RAIN_THRESH = 10.0 # mm — slippery surfaces, visibility
HI_THRESH = 40.0 # °C heat index — safety protocol
# Create binary risk flags per condition
df["risk_heat"] = (df["temp_max_c"] >= HEAT_THRESH).astype(int)
df["risk_wind"] = (df["wind_kmh"] >= WIND_THRESH).astype(int)
df["risk_rain"] = (df["rainfall_mm"] >= RAIN_THRESH).astype(int)
df["risk_hi"] = (df["heat_index_c"] >= HI_THRESH ).astype(int)
# Composite risk score: sum of all active risk flags (0–4)
df["risk_score"] = df[["risk_heat","risk_wind","risk_rain","risk_hi"]].sum(axis=1)
# Risk level label
df["risk_level"] = pd.cut(
df["risk_score"],
bins=[-1, 0, 1, 2, 4],
labels=["Low","Moderate","High","Critical"]
)
# Daily risk summary sorted by score
risk_summary = df[["date","temp_max_c","heat_index_c","wind_kmh",
"rainfall_mm","risk_score","risk_level"]].sort_values(
"risk_score", ascending=False
)
print("Daily operational risk summary:")
print(risk_summary.to_string(index=False))
# Count by risk level
print("\nRisk level distribution:")
print(df["risk_level"].value_counts().sort_index().to_string())
Daily operational risk summary:
date temp_max_c heat_index_c wind_kmh rainfall_mm risk_score risk_level
2024-07-11 39.2 50.3 8 0.0 2 High
2024-07-06 38.4 47.8 9 0.0 2 High
2024-07-09 28.8 30.2 35 18.6 2 High
2024-07-04 27.1 28.6 32 24.8 2 High
2024-07-02 36.8 43.1 11 0.0 2 High
2024-07-03 29.4 32.5 28 12.4 2 High
2024-07-07 33.7 35.6 22 0.0 0 Low
2024-07-01 34.2 37.4 14 0.0 0 Low
2024-07-08 35.1 39.8 16 0.0 0 Low
2024-07-05 31.5 33.8 18 2.1 0 Low
2024-07-10 32.6 35.1 13 0.0 0 Low
2024-07-12 30.4 33.2 24 4.2 0 Low
Risk level distribution:
Low 6
Moderate 0
High 6
Critical 0
What just happened?
Method: multi-flag row-sum risk scoring
We created four binary flag columns, one per risk condition, then used .sum(axis=1) to count how many are active on each day. This is the same pattern from CS6 (student at-risk flagging) and CS10's employee attrition module applied to a physical domain. The result is a simple 0–4 integer risk score where each point represents one breached threshold. pd.cut() then bins the integer score into labelled risk tiers.
The risk distribution reveals a striking binary pattern — six days score 2 (High) and six days score 0 (Low) with nothing in between. This means the 12-day period contains two distinct operational environments: a clear-hot cluster and a stormy-cool cluster. The High days divide into two very different hazard types: July 11, 6, and 2 are hot-and-humid risks, while July 9, 4, and 3 are wind-and-rain risks. A deployment plan treating all six High days identically would be wrong — the hot days need heat management protocols while the stormy days need wind-height restrictions and waterproofing.
Checkpoint: Filter to the High risk days and cross-tabulate risk type — df[df['risk_score'] >= 2][['date','risk_heat','risk_wind','risk_rain','risk_hi']]. Which days have the heat-type risk pattern (risk_heat + risk_hi = 2, risk_wind + risk_rain = 0) versus the storm-type pattern? This classification determines which operational protocol each High-risk day requires — a crucial distinction the scheduler needs before finalising field assignments.
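One possible way to work the checkpoint is to count heat-type and storm-type flags separately and label each day from the two counts. The sketch below uses the six High-risk rows as printed in the risk summary; the `hazard_type` column and its labels are hypothetical names, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# The six High-risk days, with values copied from the risk summary above
high_days = pd.DataFrame({
    "date": pd.to_datetime(["2024-07-11", "2024-07-06", "2024-07-09",
                            "2024-07-04", "2024-07-02", "2024-07-03"]),
    "temp_max_c":   [39.2, 38.4, 28.8, 27.1, 36.8, 29.4],
    "heat_index_c": [50.3, 47.8, 30.2, 28.6, 43.1, 32.5],
    "wind_kmh":     [8, 9, 35, 32, 11, 28],
    "rainfall_mm":  [0.0, 0.0, 18.6, 24.8, 0.0, 12.4],
})

# Count heat-type and storm-type flags separately, using the same thresholds
heat_flags = ((high_days["temp_max_c"] >= 36.0).astype(int)
              + (high_days["heat_index_c"] >= 40.0).astype(int))
storm_flags = ((high_days["wind_kmh"] >= 25.0).astype(int)
               + (high_days["rainfall_mm"] >= 10.0).astype(int))

# hazard_type is a hypothetical label column; conditions are checked in order
high_days["hazard_type"] = np.select(
    [(heat_flags > 0) & (storm_flags == 0),
     (storm_flags > 0) & (heat_flags == 0),
     (heat_flags > 0) & (storm_flags > 0)],
    ["heat", "storm", "mixed"],
    default="none",
)
print(high_days[["date", "hazard_type"]].to_string(index=False))
```

The "mixed" branch never fires on this sample, but keeping it means a hot, windy day in a longer record would be surfaced rather than forced into one of the two protocols.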
Key Findings
The period mean temperature is 33.1°C with a standard deviation of 3.8°C. July 11 (39.2°C, z = 1.61) and July 6 (38.4°C, z = 1.40) are the most extreme heat days in the period: both sit well above the rest of the distribution, although with only 12 observations neither would count as an outlier under a stricter |z| > 2 rule.
The heat index transforms the risk picture dramatically. July 11 registers a felt temperature of 50.3°C — over 11°C above its raw maximum. Three days breach the 40°C safety protocol threshold when humidity is factored in, versus only two days that would breach a simple 38°C raw temperature threshold.
Apart from minimum temperature, solar radiation is the strongest temperature predictor (r = +0.887) and rainfall is the strongest negative predictor (r = −0.701). The data shows two distinct weather regimes: sunny-hot days and stormy-cool days, with few transitional days in between.
Exactly six of twelve days score High on operational risk — but for two completely different reasons. July 11, 6, and 2 are heat risks; July 9, 4, and 3 are wind-rain risks. The field scheduler must apply different protocols to each group.
Wind speed and rainfall have a positive correlation of r = +0.598 — stormy days tend to bring both simultaneously. For field operations, this means a wind warning is a reasonable proxy for rain risk, enabling simplified pre-deployment decision rules.
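A correlation coefficient describes the overall linear trend, whereas a decision rule depends on what happens at the thresholds. A quick way to check the wind-as-proxy idea at the operational cut-offs is a two-by-two cross-tabulation. This is a sketch that reuses the 25 km/h and 10 mm thresholds from the risk scoring step; the labels "windy", "calm", "rain", and "dry" are illustrative.

```python
import numpy as np
import pandas as pd

# Wind and rainfall columns from the station export, in date order
wind_kmh = pd.Series([14, 11, 28, 32, 18, 9, 22, 16, 35, 13, 8, 24])
rainfall_mm = pd.Series([0.0, 0.0, 12.4, 24.8, 2.1, 0.0, 0.0, 0.0, 18.6, 0.0, 0.0, 4.2])

# Label each day relative to the operational thresholds
wind_label = pd.Series(np.where(wind_kmh >= 25.0, "windy", "calm"), name="wind")
rain_label = pd.Series(np.where(rainfall_mm >= 10.0, "rain", "dry"), name="rain")

# Rows: wind condition, columns: rain condition, cells: number of days
table = pd.crosstab(wind_label, rain_label)
print(table)
```

In this sample the three days at or above 25 km/h are exactly the three days at or above 10 mm, so the proxy rule makes no errors here. Twelve days is far too few to conclude it would hold in general.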
Weather EDA Decision Guide
Weather datasets introduce continuous physical variables and domain-specific derived metrics. Here is the framework for any weather-driven operational analysis:
| Question | Method | pandas / numpy Call | Watch Out For |
|---|---|---|---|
| Which days are extreme? | Z-score per variable | (val - mean) / std | Z-score is relative — always check absolute values too |
| Felt temperature? | Multi-term formula | Store T, H as Series then apply formula | Formula valid only above certain temp/humidity thresholds |
| Safety threshold breach? | np.where() | np.where(hi >= 40, "ALERT", "OK") | Domain-specific thresholds vary by industry standard |
| Multi-hazard risk score? | Multi-flag row sum | df[flags].sum(axis=1) | Each flag should be independent — no double-counting |
| Variable relationships? | Correlation matrix | df[cols].corr() | Physical causation ≠ statistical correlation |
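The "no double-counting" caution in the table can be tested directly by measuring how often two flags take the same value. The sketch below rebuilds the four flags from the values printed in the daily risk summary (including the heat index values as reported there) and computes pairwise agreement. The `agreement` table is an illustrative diagnostic, not part of the original analysis.

```python
import pandas as pd

# Values copied from the daily risk summary above, in date order
days = pd.DataFrame({
    "temp_max_c":   [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1, 28.8, 32.6, 39.2, 30.4],
    "heat_index_c": [37.4, 43.1, 32.5, 28.6, 33.8, 47.8, 35.6, 39.8, 30.2, 35.1, 50.3, 33.2],
    "wind_kmh":     [14, 11, 28, 32, 18, 9, 22, 16, 35, 13, 8, 24],
    "rainfall_mm":  [0.0, 0.0, 12.4, 24.8, 2.1, 0.0, 0.0, 0.0, 18.6, 0.0, 0.0, 4.2],
})

flags = pd.DataFrame({
    "risk_heat": (days["temp_max_c"] >= 36.0).astype(int),
    "risk_wind": (days["wind_kmh"] >= 25.0).astype(int),
    "risk_rain": (days["rainfall_mm"] >= 10.0).astype(int),
    "risk_hi":   (days["heat_index_c"] >= 40.0).astype(int),
})

# Pairwise agreement: share of days on which two flags take the same value
agreement = pd.DataFrame(
    {a: {b: (flags[a] == flags[b]).mean() for b in flags} for a in flags}
)
print(agreement.round(2))
```

Here risk_heat and risk_hi agree on all 12 days, as do risk_wind and risk_rain, which is why the composite score only ever takes the values 0 and 2. On a longer record, flags that always move together might be better merged or down-weighted.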
Analyst's Note
Teacher's Note
What Would Come Next?
A senior analyst would extend this into a 30-day rolling window to detect seasonal trends and build a predictive risk calendar using historical patterns — flagging high-probability risk days two weeks out so field teams can pre-plan rather than react.
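The rolling-window idea can be tried on the existing 12 days. The sketch below uses a 3-day time-based window purely because the sample is too short for 30 days; the window length and variable names are assumptions for illustration.

```python
import pandas as pd

# Daily maximum temperatures indexed by recording date
temps = pd.Series(
    [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1, 28.8, 32.6, 39.2, 30.4],
    index=pd.date_range("2024-07-01", periods=12, freq="D"),
    name="temp_max_c",
)

# Time-based window: "3D" covers the current day and the two days before it.
# Offset windows default to min_periods=1, so the first rows use partial windows.
rolling_mean = temps.rolling("3D").mean().round(1)

print(rolling_mean.to_string())
```

With a longer record, the same call with `"30D"` would give the 30-day window described above. A time-based window also copes with missing days, which a fixed row count would not.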
Limitations of This Analysis
Twelve days is insufficient for seasonal pattern detection. The simplified Rothfusz heat index approximation is valid between 27–45°C and 40–100% humidity — values outside this range should use the full NOAA formula. Single-station data also cannot capture spatial variation across the service area.
Business Decisions This Could Drive
Block outdoor high-exertion tasks on July 11, 6, and 2 (heat protocols). Apply wind-height restrictions on July 9, 4, and 3 (storm protocols). Use wind speed as a leading indicator for rain risk given their r = +0.598 correlation — a wind forecast above 25 km/h should trigger waterproofing preparation even before rain is confirmed.
Practice Questions
1. Excluding temp_min_c, which variable had the strongest positive correlation with daily maximum temperature in the ClearSky dataset?
2. What was the maximum heat index recorded in the dataset — the highest felt temperature across all 12 days?
3. What is the pandas expression used to count how many risk flags are active per day across multiple boolean flag columns?
Quiz
1. July 11 records 39.2°C raw but a heat index of 50.3°C. Why is the heat index more operationally meaningful than raw temperature for worker safety?
2. How was the nine-term Rothfusz heat index formula applied to all 12 rows simultaneously without a loop?
3. What does the binary risk score distribution (all days scoring either 0 or 2, nothing in between) reveal about the 12-day weather period?