DS Case Study 10 – Weather dataset Exploration | Dataplexa

Beginner Case Study · CS 10

Exploring Weather Data

Weather is one of the most data-rich environments a beginner analyst will encounter. Temperature swings, humidity spikes, and rainfall patterns are not just meteorological observations — they drive energy demand, agricultural planning, retail footfall, and construction scheduling. Learning to analyse weather data means learning to find patterns in a continuous time-varying signal.

You are a data analyst at ClearSky Analytics, a consultancy providing weather-driven insights to infrastructure clients. A regional energy company has asked you to analyse 12 days of station data from one of their service areas. They need to understand the temperature distribution, identify extreme weather events, quantify how humidity correlates with perceived temperature, and find which days represent the biggest operational risk for their field teams.

Industry: Climate / Energy

Technique: EDA · Distribution · Correlation

Libraries: pandas · numpy

Difficulty: Beginner

Est. Time: 40–50 min

Overview

What This Case Study Covers

Weather data analysis introduces a genuinely different analytical context from business datasets — the variables are continuous physical measurements rather than discrete counts or categorical flags, and the patterns are governed by natural processes rather than human decisions. This makes weather an ideal domain for practising distribution analysis, outlier identification, and multi-variable correlation without the confounding effects of business strategy.

This case study covers four layers: temperature distribution analysis (mean, range, and variability across the recording period); heat index calculation (a derived variable combining temperature and humidity into a single felt-temperature metric); extreme event flagging using boolean conditions on multiple weather variables simultaneously; and multi-variable correlation to understand how temperature, humidity, wind speed, rainfall, and solar radiation interact.

The Weather EDA Toolkit

Distribution Analysis

Compute mean, median, standard deviation, and range for all continuous weather variables. Weather distributions are often non-normal — a few extreme days can pull the mean significantly above the median. Knowing which variables are most variable guides where to focus operational risk analysis.
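As a rough illustration of the mean-versus-median point, the sketch below summarises two of the station's columns in one table. The column values come from the dataset used later in this case study; the `summary` layout and the `mean_minus_median` column are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Wind and rainfall columns from the station export used in this case study
df = pd.DataFrame({
    "wind_kmh":    [14, 11, 28, 32, 18, 9, 22, 16, 35, 13, 8, 24],
    "rainfall_mm": [0.0, 0.0, 12.4, 24.8, 2.1, 0.0, 0.0, 0.0,
                    18.6, 0.0, 0.0, 4.2],
})

# One row per variable: centre, spread, and extremes
summary = df.agg(["mean", "median", "std", "min", "max"]).T
summary["range"] = summary["max"] - summary["min"]

# A mean well above the median hints at a right-skewed variable
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary.round(2))
```

Rainfall is the clearer case here: most days are dry, so its median is 0.0 while a handful of wet days pull the mean upward.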

Heat Index Calculation

The heat index combines temperature and relative humidity into a single "feels like" temperature. At high humidity, the body cannot cool efficiently through sweating: with the approximation used in Step 2, 35°C at 85% humidity yields a heat index of roughly 60°C. Computing this derived variable from raw measurements introduces the pattern of domain-specific formula application in pandas.

Extreme Event Flagging

Define operational risk thresholds — high temperature, high wind, heavy rainfall — and flag days that breach any combination using boolean masks. Days with multiple simultaneous extremes are the highest-risk operational days for field teams and require advance planning.
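The combination logic can be sketched with plain boolean Series. This is a minimal example on four of the twelve days, using the threshold values that Step 4 defines; the variable names `hot`, `windy`, `wet`, `any_extreme`, and `storm_combo` are illustrative assumptions.

```python
import pandas as pd

# Four days taken from the station export
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-07-02", "2024-07-03",
                            "2024-07-04", "2024-07-05"]),
    "temp_max_c":  [36.8, 29.4, 27.1, 31.5],
    "wind_kmh":    [11, 28, 32, 18],
    "rainfall_mm": [0.0, 12.4, 24.8, 2.1],
})

# Threshold values as defined in Step 4
hot   = df["temp_max_c"]  >= 36.0
windy = df["wind_kmh"]    >= 25.0
wet   = df["rainfall_mm"] >= 10.0

any_extreme = hot | windy | wet   # at least one threshold breached
storm_combo = windy & wet         # wind and rain on the same day

print("Any extreme:", df.loc[any_extreme, "date"].dt.date.tolist())
print("Wind + rain:", df.loc[storm_combo, "date"].dt.date.tolist())
```

Each comparison needs its own parentheses if written inline, because `&` and `|` bind more tightly than `>=` in Python.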

Multi-Variable Correlation

Compute the full correlation matrix across all weather variables. Weather variables often show strong physical relationships — temperature and humidity are inversely correlated in many climates, temperature and solar radiation are positively correlated, rainfall and wind speed may correlate with low-pressure systems.

Daily Risk Scoring

Combine multiple risk signals — temperature above threshold, humidity above threshold, wind speed above threshold, rainfall above threshold — into a composite risk count per day using the row-wise sum pattern. This produces a single daily risk score the energy company's field operations team can act on directly.

Dataset Overview

ClearSky's station export contains 12 daily records covering temperature, humidity, wind speed, rainfall, and solar radiation from a single monitoring station. For this case study the data is built inline with pd.DataFrame() rather than loaded from a file.

date	temp_max_c	temp_min_c	humidity_pct	wind_kmh	rainfall_mm	solar_rad_wm2
2024-07-01	34.2	21.8	68	14	0.0	742
2024-07-02	36.8	23.1	74	11	0.0	698
2024-07-03	29.4	18.6	82	28	12.4	412
2024-07-04	27.1	17.2	88	32	24.8	298
2024-07-05	31.5	20.4	71	18	2.1	581

Showing first 5 of 12 rows · 7 columns

date · object → datetime64

Recording date. Parsed to datetime for time-based analysis and day-of-week extraction.

temp_max_c · float64 · °C

Daily maximum temperature in Celsius. Used for heat index calculation and extreme heat flagging.

temp_min_c · float64 · °C

Daily minimum temperature. Used alongside the maximum to compute the daily temperature range, a measure of thermal variability.

humidity_pct · int64 · %

Relative humidity percentage. Combined with temperature for heat index. High humidity amplifies the physical effect of heat on the human body.

wind_kmh · int64 · km/h

Average daily wind speed. Flagged at 25 km/h or above (the threshold used in Step 4) as an operational risk for field teams working at height or with heavy equipment.

rainfall_mm · float64 · mm

Total daily rainfall in millimetres. 0.0 on dry days. Flagged above 10 mm as moderate rain and above 20 mm as heavy rain; Step 4 uses the 10 mm threshold.

solar_rad_wm2 · int64 · W/m²

Solar radiation in watts per square metre. Correlated with temperature and inversely correlated with cloud cover and rainfall.
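The column notes above mention date parsing, day-of-week extraction, and the daily temperature range. A minimal sketch of those three operations on the first three rows follows; the `day_name` column is a hypothetical name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date":       ["2024-07-01", "2024-07-02", "2024-07-03"],
    "temp_max_c": [34.2, 36.8, 29.4],
    "temp_min_c": [21.8, 23.1, 18.6],
})

# Strings (object dtype) become proper datetimes
df["date"] = pd.to_datetime(df["date"])

# Day-of-week label derived from the parsed date
df["day_name"] = df["date"].dt.day_name()

# Daily temperature range: max minus min
df["temp_range_c"] = (df["temp_max_c"] - df["temp_min_c"]).round(1)

print(df.dtypes)
print(df)
```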

Business Questions

The energy company's operations team needs these five answers to plan their Q3 field schedule and equipment deployment.

What is the temperature range and variability across the recording period — and which days are statistical outliers on heat?

What is the heat index on each day — and which days exceed the 40°C felt-temperature threshold that triggers worker safety protocols?

How do temperature, humidity, wind, rainfall, and solar radiation correlate with each other — and which pairs show the strongest relationships?

Which days breach multiple operational risk thresholds simultaneously — making them the highest-priority days for schedule adjustment?

What is the daily temperature range (max minus min) — and does a wide daily range correlate with lower solar radiation, suggesting cloud cover?

Step-by-Step Analysis

The scenario:

The station data export arrived this morning. The field operations manager needs a risk summary before the weekly planning meeting tomorrow. Work through the data — every flagged day translates directly into a crew deployment decision.

Step 1: Load, Inspect, and Analyse the Temperature Distribution

Temperature is the primary driver of operational risk in this dataset. We start by characterising the full distribution — mean, spread, and which days are statistically extreme — before introducing derived variables.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "date":          ["2024-07-01","2024-07-02","2024-07-03","2024-07-04",
                      "2024-07-05","2024-07-06","2024-07-07","2024-07-08",
                      "2024-07-09","2024-07-10","2024-07-11","2024-07-12"],
    "temp_max_c":    [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1,
                      28.8, 32.6, 39.2, 30.4],
    "temp_min_c":    [21.8, 23.1, 18.6, 17.2, 20.4, 24.8, 22.1, 22.9,
                      18.1, 21.4, 25.6, 19.8],
    "humidity_pct":  [68, 74, 82, 88, 71, 78, 65, 72, 85, 69, 81, 76],
    "wind_kmh":      [14, 11, 28, 32, 18, 9, 22, 16, 35, 13, 8, 24],
    "rainfall_mm":   [0.0, 0.0, 12.4, 24.8, 2.1, 0.0, 0.0, 0.0,
                      18.6, 0.0, 0.0, 4.2],
    "solar_rad_wm2": [742, 698, 412, 298, 581, 814, 768, 731,
                      344, 695, 842, 624]
})

# Parse date
df["date"] = pd.to_datetime(df["date"])

print("Shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())

# Temperature distribution
print("\nTemperature max (°C) — distribution:")
print(df["temp_max_c"].describe().round(1))

# Z-scores to identify extreme temperature days
mean_t = df["temp_max_c"].mean()
std_t  = df["temp_max_c"].std()
df["temp_zscore"] = ((df["temp_max_c"] - mean_t) / std_t).round(2)

print(f"\nMean max temp: {mean_t:.1f}°C | Std: {std_t:.1f}°C")
print("\nAll days sorted by temperature (hottest first):")
print(df[["date","temp_max_c","temp_zscore"]].sort_values(
    "temp_max_c", ascending=False
).to_string(index=False))

Shape: (12, 7)
Missing values: 0

Temperature max (°C) — distribution:
count    12.0
mean     33.1
std       3.8
min      27.1
25%      30.2
50%      33.2
75%      35.5
max      39.2
Name: temp_max_c, dtype: float64

Mean max temp: 33.1°C | Std: 3.8°C

All days sorted by temperature (hottest first):
       date  temp_max_c  temp_zscore
 2024-07-11        39.2         1.59
 2024-07-06        38.4         1.38
 2024-07-02        36.8         0.96
 2024-07-08        35.1         0.52
 2024-07-01        34.2         0.29
 2024-07-07        33.7         0.16
 2024-07-10        32.6        -0.13
 2024-07-05        31.5        -0.42
 2024-07-12        30.4        -0.70
 2024-07-03        29.4        -0.96
 2024-07-09        28.8        -1.12
 2024-07-04        27.1        -1.56

What just happened?

Method — z-score for weather outlier identification

The z-score formula (value - mean) / std expresses each day's temperature as standard deviations from the period mean; pandas' .std() uses the sample standard deviation (ddof=1) by default. Here a z-score beyond ±1.5 is treated as a working cutoff for "extreme": with only 12 observations, stricter cutoffs such as 2 or 3 would flag nothing. July 11 (z = 1.59) clears that cutoff on the hot side and July 4 (z = −1.56) on the cool side, while July 6 (z = 1.38) falls just short. Unlike a fixed threshold (e.g. "above 37°C"), the z-score is relative to this specific period and dataset, making it useful for detecting extremes regardless of the absolute temperature scale.
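The same calculation extends to several columns at once. The sketch below is one way to do that, assuming a hypothetical helper named `zscores` and a working cutoff of 1.5 chosen for illustration; neither is part of the original analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "temp_max_c": [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1,
                   28.8, 32.6, 39.2, 30.4],
    "wind_kmh":   [14, 11, 28, 32, 18, 9, 22, 16, 35, 13, 8, 24],
})

def zscores(frame: pd.DataFrame) -> pd.DataFrame:
    """Column-wise z-scores using the sample standard deviation (ddof=1)."""
    return (frame - frame.mean()) / frame.std()

z = zscores(df)

CUTOFF = 1.5                      # hypothetical working cutoff
extreme = z.abs() >= CUTOFF       # True where a day is unusually high or low

print(z.round(2))
print("\nExtreme days per variable:")
print(extreme.sum())
```

Taking the absolute value before comparing means unusually cool or calm days are flagged as well as unusually hot or windy ones, which is worth knowing when reading the counts.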

Business Insight

July 11 is the hottest recorded day at 39.2°C, 1.59 standard deviations above the period mean. July 6 (38.4°C) follows closely. The three coolest days (July 3, 4, and 9) are all below 30°C and are also the three wettest days, which suggests two regimes: a hot, dry majority interrupted by cooler, wet spells around July 3–4 and July 9. The energy company should note that the two hottest days (July 11 and 6) are non-consecutive, so the heat risk is not a single sustained heatwave but intermittent extreme days requiring day-specific deployment planning.

Step 2: Heat Index Calculation and Safety Threshold Flagging

Raw temperature alone underestimates physiological risk on humid days. The simplified heat index formula combines temperature and humidity into a felt-temperature that directly maps to worker safety protocols — a real application of domain-specific formula translation in pandas.

# Simplified Heat Index formula (Rothfusz approximation — Celsius version)
# Valid when temp >= 27°C and humidity >= 40%
# HI = -8.78469 + 1.61139411*T + 2.33854883*H - 0.14611605*T*H
#      - 0.01230809*T^2 - 0.01642482*H^2 + 0.00221173*T^2*H
#      + 0.00072546*T*H^2 - 0.00000358*T^2*H^2
# where T = temp_max_c, H = humidity_pct

T = df["temp_max_c"]
H = df["humidity_pct"]

df["heat_index_c"] = (
    -8.78469
    + 1.61139411  * T
    + 2.33854883  * H
    - 0.14611605  * T * H
    - 0.01230809  * T**2
    - 0.01642482  * H**2
    + 0.00221173  * T**2 * H
    + 0.00072546  * T * H**2
    - 0.00000358  * T**2 * H**2
).round(1)

# Safety flag: heat index >= 40°C triggers worker safety protocols
HEAT_SAFETY_THRESHOLD = 40.0
df["heat_alert"] = np.where(df["heat_index_c"] >= HEAT_SAFETY_THRESHOLD,
                             "ALERT", "OK")

print("Heat index and safety status:")
print(df[["date","temp_max_c","humidity_pct","heat_index_c",
          "heat_alert"]].sort_values("heat_index_c", ascending=False).to_string(index=False))

alert_days = df[df["heat_alert"] == "ALERT"]
print(f"\nDays triggering safety protocol (HI >= {HEAT_SAFETY_THRESHOLD}°C): {len(alert_days)}")
print(f"Max heat index recorded: {df['heat_index_c'].max():.1f}°C on {df.loc[df['heat_index_c'].idxmax(),'date'].date()}")

Heat index and safety status:
       date  temp_max_c  humidity_pct  heat_index_c heat_alert
 2024-07-11        39.2            81          79.0      ALERT
 2024-07-06        38.4            78          71.5      ALERT
 2024-07-02        36.8            74          60.3      ALERT
 2024-07-08        35.1            72          51.9      ALERT
 2024-07-01        34.2            68          46.5      ALERT
 2024-07-07        33.7            65          43.5      ALERT
 2024-07-10        32.6            69          41.8      ALERT
 2024-07-05        31.5            71          39.3         OK
 2024-07-12        30.4            76          37.7         OK
 2024-07-03        29.4            82          36.4         OK
 2024-07-09        28.8            85          35.3         OK
 2024-07-04        27.1            88          30.8         OK

Days triggering safety protocol (HI >= 40.0°C): 7
Max heat index recorded: 79.0°C on 2024-07-11

What just happened?

Method — multi-term formula applied to DataFrame columns

The Rothfusz heat index formula has nine terms combining powers and products of temperature and humidity. In pandas, we assign the variables T = df["temp_max_c"] and H = df["humidity_pct"] — these are pandas Series, so every arithmetic operation acts on all 12 rows simultaneously. The entire formula evaluates in a single vectorised expression, producing one heat index value per row. This pattern — store column references in short variable names, then write the formula naturally — is how domain-specific equations are translated into pandas.
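The code comment in this step states a validity range (temperature of at least 27°C and humidity of at least 40%), but the expression itself does not enforce it. The sketch below wraps the same nine-term approximation in a function that returns NaN outside that range. The function name `heat_index_c` and the choice to return NaN are assumptions made for illustration, not the case study's own method.

```python
import pandas as pd

def heat_index_c(temp_c: pd.Series, humidity_pct: pd.Series) -> pd.Series:
    """Heat index in °C using the nine-term approximation from this step.

    Rows outside the stated validity range (temp < 27°C or humidity < 40%)
    come back as NaN rather than as a misleading number.
    """
    T = temp_c.astype(float)
    H = humidity_pct.astype(float)
    hi = (-8.78469
          + 1.61139411 * T
          + 2.33854883 * H
          - 0.14611605 * T * H
          - 0.01230809 * T**2
          - 0.01642482 * H**2
          + 0.00221173 * T**2 * H
          + 0.00072546 * T * H**2
          - 0.00000358 * T**2 * H**2)
    valid = (T >= 27) & (H >= 40)
    return hi.where(valid).round(1)

# Illustrative inputs: one valid row, one too cool, one too dry
demo = pd.DataFrame({"temp_max_c": [31.5, 24.0, 30.0],
                     "humidity_pct": [71, 80, 30]})
demo["heat_index_c"] = heat_index_c(demo["temp_max_c"], demo["humidity_pct"])
print(demo)
```

All twelve days in the station export satisfy both conditions, so the guard changes nothing for this dataset; it matters once cooler or drier days appear in a longer export.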

Business Insight

The heat index dramatically changes the risk picture. July 11 records a raw temperature of 39.2°C but a computed felt temperature of 79.0°C, far above the 40°C protocol threshold. Seven days (July 11, 6, 2, 8, 1, 7, and 10) trigger the safety protocol, including four days whose raw maximum is below 36°C. Values above roughly 55–60°C fall outside the range shown on standard heat index charts, so they are best read as "extreme danger" rather than as literal temperatures. Notably, July 9 has a low temperature of 28.8°C but high humidity of 85%, so its heat index of 35.3°C is well above its raw temperature yet still below the protocol threshold. The energy company's field scheduler should block outdoor high-exertion tasks on the seven alert days.

Step 3: Multi-Variable Correlation Analysis

Understanding how weather variables relate to each other helps the energy company anticipate compound conditions — for example, whether high-wind days tend to coincide with rainfall, which would compound the operational risk for field teams.

# Compute temperature range: max minus min for each day
df["temp_range_c"] = (df["temp_max_c"] - df["temp_min_c"]).round(1)

# Full correlation matrix across all numeric weather variables
numeric_cols = ["temp_max_c","temp_min_c","humidity_pct",
                "wind_kmh","rainfall_mm","solar_rad_wm2","temp_range_c"]

corr = df[numeric_cols].corr().round(3)
print("Correlation matrix:")
print(corr.to_string())

# Extract and rank key correlations with temp_max
print("\nKey correlations with temp_max_c:")
temp_corrs = corr["temp_max_c"].drop("temp_max_c").sort_values(key=abs, ascending=False)
for var, r in temp_corrs.items():
    print(f"  {var:<18} r = {r:+.3f}")

# Does wide daily temp range correlate with lower solar radiation?
range_solar = df["temp_range_c"].corr(df["solar_rad_wm2"]).round(3)
print(f"\nTemp range vs solar radiation: r = {range_solar:+.3f}")

Correlation matrix:
               temp_max_c  temp_min_c  humidity_pct  wind_kmh  rainfall_mm  solar_rad_wm2  temp_range_c
temp_max_c          1.000       0.992        -0.365    -0.912       -0.794          0.902         0.966
temp_min_c          0.992       1.000        -0.410    -0.918       -0.825          0.935         0.924
humidity_pct       -0.365      -0.410         1.000     0.507        0.800         -0.661        -0.256
wind_kmh           -0.912      -0.918         0.507     1.000        0.866         -0.899        -0.855
rainfall_mm        -0.794      -0.825         0.800     0.866        1.000         -0.938        -0.691
solar_rad_wm2       0.902       0.935        -0.661    -0.899       -0.938          1.000         0.791
temp_range_c        0.966       0.924        -0.256    -0.855       -0.691          0.791         1.000

Key correlations with temp_max_c:
  temp_min_c         r = +0.992
  temp_range_c       r = +0.966
  wind_kmh           r = -0.912
  solar_rad_wm2      r = +0.902
  rainfall_mm        r = -0.794
  humidity_pct       r = -0.365

Temp range vs solar radiation: r = +0.791

What just happened?

Method — full correlation matrix with .sort_values(key=abs)

We pass key=abs to .sort_values() after extracting the temperature correlation column — this sorts by the absolute magnitude of the correlation regardless of sign, so the strongest relationships (whether positive or negative) appear first. This is the correct ranking method for correlation strength, identical to the pattern from CS3 where it was first introduced.
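The same key=abs idea can rank every pair in the matrix, not just the pairs involving temperature. The sketch below is one way to do it on four of the columns, assuming an upper-triangle mask so each pair appears once; the names `upper`, `pairs`, and `ranked` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp_max_c":    [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1,
                      28.8, 32.6, 39.2, 30.4],
    "wind_kmh":      [14, 11, 28, 32, 18, 9, 22, 16, 35, 13, 8, 24],
    "rainfall_mm":   [0.0, 0.0, 12.4, 24.8, 2.1, 0.0, 0.0, 0.0,
                      18.6, 0.0, 0.0, 4.2],
    "solar_rad_wm2": [742, 698, 412, 298, 581, 814, 768, 731,
                      344, 695, 842, 624],
})

corr = df.corr()

# Keep each pair once: True only above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: (variable_a, variable_b) -> r, with masked cells dropped
pairs = corr.where(upper).stack().dropna()

# Strongest relationships first, regardless of sign
ranked = pairs.sort_values(key=abs, ascending=False)
print(ranked.round(3))
```

Masking the lower triangle and the diagonal avoids listing each pair twice and removes the trivial self-correlations of 1.0.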

Business Insight

The correlation matrix reveals physically meaningful relationships. Setting aside temp_min_c and temp_range_c, which are measured on or derived from the same temperature signal, wind speed (r = −0.912) and solar radiation (r = +0.902) are the variables most strongly related to maximum temperature: calm, sunny days are hot days. Rainfall also has a strong negative correlation with temperature (r = −0.794), consistent with rain-bearing systems cooling the air, while humidity shows only a weak link (r = −0.365). Daily temperature range correlates positively with solar radiation (r = +0.791), so wide ranges occur on sunny days and narrow ranges on overcast days. For operations planning, the storm cluster is the most actionable finding: rainfall, wind, and low solar radiation move together (rainfall vs solar r = −0.938, wind vs rainfall r = +0.866), so a wet day can be expected to be cooler and windier. With only 12 observations, these coefficients are indicative rather than precise.

Step 4: Multi-Condition Risk Scoring and Daily Risk Summary

The operations manager needs a single daily risk score that combines all hazard types — heat, wind, and rain — so the field scheduler can prioritise with one number rather than four separate columns. We apply the multi-flag row-sum pattern from CS6 to weather risk.

# Define operational risk thresholds
HEAT_THRESH  = 36.0   # °C max temperature — elevated heat risk
WIND_THRESH  = 25.0   # km/h — risk for working at height
RAIN_THRESH  = 10.0   # mm — slippery surfaces, visibility
HI_THRESH    = 40.0   # °C heat index — safety protocol

# Create binary risk flags per condition
df["risk_heat"]  = (df["temp_max_c"]   >= HEAT_THRESH).astype(int)
df["risk_wind"]  = (df["wind_kmh"]     >= WIND_THRESH).astype(int)
df["risk_rain"]  = (df["rainfall_mm"]  >= RAIN_THRESH).astype(int)
df["risk_hi"]    = (df["heat_index_c"] >= HI_THRESH  ).astype(int)

# Composite risk score: sum of all active risk flags (0–4)
df["risk_score"] = df[["risk_heat","risk_wind","risk_rain","risk_hi"]].sum(axis=1)

# Risk level label
df["risk_level"] = pd.cut(
    df["risk_score"],
    bins=[-1, 0, 1, 2, 4],
    labels=["Low","Moderate","High","Critical"]
)

# Daily risk summary sorted by score (date breaks ties so the order is reproducible)
risk_summary = df[["date","temp_max_c","heat_index_c","wind_kmh",
                    "rainfall_mm","risk_score","risk_level"]].sort_values(
    ["risk_score", "date"], ascending=[False, True]
)

print("Daily operational risk summary:")
print(risk_summary.to_string(index=False))

# Count by risk level
print("\nRisk level distribution:")
print(df["risk_level"].value_counts().sort_index().to_string())

Daily operational risk summary:
       date  temp_max_c  heat_index_c  wind_kmh  rainfall_mm  risk_score risk_level
 2024-07-02        36.8          60.3        11          0.0           2       High
 2024-07-03        29.4          36.4        28         12.4           2       High
 2024-07-04        27.1          30.8        32         24.8           2       High
 2024-07-06        38.4          71.5         9          0.0           2       High
 2024-07-09        28.8          35.3        35         18.6           2       High
 2024-07-11        39.2          79.0         8          0.0           2       High
 2024-07-01        34.2          46.5        14          0.0           1   Moderate
 2024-07-07        33.7          43.5        22          0.0           1   Moderate
 2024-07-08        35.1          51.9        16          0.0           1   Moderate
 2024-07-10        32.6          41.8        13          0.0           1   Moderate
 2024-07-05        31.5          39.3        18          2.1           0        Low
 2024-07-12        30.4          37.7        24          4.2           0        Low

Risk level distribution:
risk_level
Low         2
Moderate    4
High        6
Critical    0

What just happened?

Method — multi-flag row-sum risk scoring

We created four binary flag columns, one per risk condition, then used .sum(axis=1) to count how many are active on each day. This is the same pattern used in CS6 (student at-risk flagging) and in the employee attrition case study, applied here to a physical domain. The result is a simple 0–4 integer risk score. Note that risk_heat and risk_hi both respond to temperature, so a hot, humid day can earn two points from closely related causes, whereas wind and rain are separate hazards. pd.cut() then bins the integer score into labelled risk tiers.
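To make the binning step concrete, the sketch below applies the same bins and labels to every possible score from 0 to 4. The `scores` and `levels` names are illustrative.

```python
import pandas as pd

# Every value the composite score can take
scores = pd.Series([0, 1, 2, 3, 4], name="risk_score")

# Same bin edges and labels as the risk_level step above.
# Intervals are right-closed: (-1, 0], (0, 1], (1, 2], (2, 4]
levels = pd.cut(scores,
                bins=[-1, 0, 1, 2, 4],
                labels=["Low", "Moderate", "High", "Critical"])

print(pd.DataFrame({"risk_score": scores, "risk_level": levels}))
```

The first edge is -1 rather than 0 because pd.cut excludes the left edge of each interval by default, so a score of 0 would otherwise fall outside every bin.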

Business Insight

Half of the period, six days, scores 2 (High); four days score 1 (Moderate) and two days score 0 (Low). No day reaches Critical, because the heat hazards and the storm hazards never occur on the same day. The High days divide into two very different hazard types: July 11, 6, and 2 are hot-and-humid risks, while July 9, 4, and 3 are wind-and-rain risks. The four Moderate days (July 1, 7, 8, and 10) breach only the heat index threshold. A deployment plan treating all six High days identically would be wrong: the hot days need heat management protocols, while the stormy days need wind-height restrictions and waterproofing.

Checkpoint: Filter to the High risk days and cross-tabulate risk type — df[df['risk_score'] >= 2][['date','risk_heat','risk_wind','risk_rain','risk_hi']]. Which days have the heat-type risk pattern (risk_heat + risk_hi = 2, risk_wind + risk_rain = 0) versus the storm-type pattern? This classification determines which operational protocol each High-risk day requires — a crucial distinction the scheduler needs before finalising field assignments.
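One possible way to approach the checkpoint is sketched below on four hand-picked days, with flag values that follow from the Step 4 thresholds. The `hazard_type` column and its labels ("heat", "storm", "mixed", "none") are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Flag columns as produced in Step 4, for a subset of days
flags = pd.DataFrame({
    "date": pd.to_datetime(["2024-07-02", "2024-07-04",
                            "2024-07-05", "2024-07-09"]),
    "risk_heat": [1, 0, 0, 0],
    "risk_wind": [0, 1, 0, 1],
    "risk_rain": [0, 1, 0, 1],
    "risk_hi":   [1, 0, 0, 0],
})

heat_points  = flags["risk_heat"] + flags["risk_hi"]
storm_points = flags["risk_wind"] + flags["risk_rain"]

# First matching condition wins; days with no flags fall through to "none"
flags["hazard_type"] = np.select(
    [(heat_points > 0) & (storm_points == 0),
     (storm_points > 0) & (heat_points == 0),
     (heat_points > 0) & (storm_points > 0)],
    ["heat", "storm", "mixed"],
    default="none",
)

print(flags[["date", "hazard_type"]].to_string(index=False))
```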

Key Findings

The period mean temperature is 33.1°C with a standard deviation of 3.8°C. July 11 (39.2°C, z = 1.59) and July 6 (38.4°C, z = 1.38) are the hottest days. Only July 11 clears a 1.5 cutoff and neither exceeds 2 standard deviations, so they form the warm tail of the distribution rather than extreme statistical outliers.

The heat index transforms the risk picture dramatically. July 11 registers a computed felt temperature of 79.0°C, roughly 40°C above its raw maximum. Seven days breach the 40°C safety protocol threshold when humidity is factored in, versus only two days that would breach a simple 38°C raw temperature threshold.

Wind speed (r = −0.912) and solar radiation (r = +0.902) are the variables most strongly related to maximum temperature, followed by rainfall (r = −0.794). The data shows two distinct weather regimes: sunny-hot days and stormy-cool days, with few transitional days in between.

Exactly six of twelve days score High on operational risk — but for two completely different reasons. July 11, 6, and 2 are heat risks; July 9, 4, and 3 are wind-rain risks. The field scheduler must apply different protocols to each group.

Wind speed and rainfall have a positive correlation of r = +0.866: stormy days tend to bring both simultaneously. For field operations, this means a wind warning is a reasonable proxy for rain risk, enabling simplified pre-deployment decision rules.

Visualisations

[Figure: Daily Max Temperature vs Heat Index. Raw maximum temperature compared with heat index (felt temperature) per day, with an alert line at 40°C.]

[Figure: Operational Risk Score by Day. Risk score per day, labelled by hazard type (heat or storm); 0 = Low, 1 = Moderate, 2 = High, 3–4 = Critical.]

[Figure: Correlation with Max Temperature. Pearson r for solar_rad, rainfall, wind, and humidity; positive = hotter days have more of this, negative = less.]

Weather EDA Decision Guide

Weather datasets introduce continuous physical variables and domain-specific derived metrics. Here is the framework for any weather-driven operational analysis:

Question	Method	pandas / numpy Call	Watch Out For
Which days are extreme?	Z-score per variable	`(val - mean) / std`	Z-score is relative — always check absolute values too
Felt temperature?	Multi-term formula	Store T, H as Series then apply formula	Formula valid only above certain temp/humidity thresholds
Safety threshold breach?	np.where()	`np.where(hi >= 40, "ALERT", "OK")`	Domain-specific thresholds vary by industry standard
Multi-hazard risk score?	Multi-flag row sum	`df[flags].sum(axis=1)`	Each flag should be independent — no double-counting
Variable relationships?	Correlation matrix	`df[cols].corr()`	Physical causation ≠ statistical correlation

Analyst's Note

What Would Come Next?

A senior analyst would extend this into a 30-day rolling window to detect seasonal trends and build a predictive risk calendar using historical patterns — flagging high-probability risk days two weeks out so field teams can pre-plan rather than react.
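The rolling-window idea can be sketched on the existing 12 days. A 3-day window is used here purely as a stand-in, since the export is too short for a 30-day window; the `temps` and `rolling_mean` names are illustrative.

```python
import pandas as pd

# Daily maximum temperatures indexed by date
temps = pd.Series(
    [34.2, 36.8, 29.4, 27.1, 31.5, 38.4, 33.7, 35.1,
     28.8, 32.6, 39.2, 30.4],
    index=pd.date_range("2024-07-01", periods=12, freq="D"),
    name="temp_max_c",
)

# Trailing 3-day mean; a longer export would use window=30 instead
rolling_mean = temps.rolling(window=3).mean().round(1)

print(rolling_mean)
```

The first two values are NaN because a full 3-day window is not yet available; passing min_periods=1 would fill them with partial-window means instead.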

Limitations of This Analysis

Twelve days is insufficient for seasonal pattern detection. The simplified Rothfusz heat index approximation is valid between 27–45°C and 40–100% humidity — values outside this range should use the full NOAA formula. Single-station data also cannot capture spatial variation across the service area.

Business Decisions This Could Drive

Apply heat protocols on the seven days with a heat index of 40°C or more (July 1, 2, 6, 7, 8, 10, and 11), blocking outdoor high-exertion tasks first on July 11, 6, and 2, where the raw maximum also exceeds 36°C. Apply wind-height restrictions on July 9, 4, and 3 (storm protocols). Use wind speed as an indicator for rain risk given their r = +0.866 correlation: a wind forecast above 25 km/h should trigger waterproofing preparation even before rain is confirmed.

Practice Questions

1. Which variable had the strongest positive correlation with daily maximum temperature in the ClearSky dataset?

2. What was the maximum heat index recorded in the dataset — the highest felt temperature across all 12 days?

3. What is the pandas expression used to count how many risk flags are active per day across multiple boolean flag columns?

Quiz

Up Next · Intermediate Case Studies

Case Study 11 — Analysing Supply Chain Delays

You step up to the Intermediate tier. You are handed a logistics dataset with scheduled vs actual delivery dates, supplier records, and penalty clauses. Which routes fail most often? Which supplier-route combination is the worst? And what is the total financial exposure from delays?
