DS Case Studies
Predicting Airline Delay Patterns
Airline delays cost the US aviation industry over $33 billion annually — but not all delays are equal. A 15-minute weather delay at a regional airport is operationally trivial. A 3-hour carrier-caused delay at a major hub cascades through dozens of downstream flights and thousands of passengers.
You are a data analyst at SkyMetrics, an aviation operations consultancy. Your client, a mid-sized domestic carrier, has seen its on-time performance rating drop from 81% to 74% over the past two quarters. The VP of Operations needs to know which routes and delay causes are driving the decline, whether specific departure hours create a compounding delay effect, and how much controllable delay — carrier and late aircraft — could realistically be reduced. The findings go to the board on Friday.
What This Case Study Covers
Airline delay analysis introduces delay attribution — splitting total delay into controllable causes (carrier, late aircraft) versus uncontrollable ones (weather, NAS, security). This distinction matters enormously for operations management: controllable delay is something the airline can fix; uncontrollable delay is something it needs to communicate better. The groupby-and-aggregate pattern from previous case studies applies directly — the new element is computing each cause's share of total delay minutes, producing an attribution breakdown that the VP of Operations can act on.
Two patterns introduced here: delay cause attribution — expressing each delay type as a percentage of total delay minutes rather than raw counts, so the operations team sees where time is actually lost — and cascade effect analysis, computing how departure hour correlates with mean delay to identify whether early-morning delays compound throughout the day.
The Delay Analytics Toolkit
Delay Rate and Mean Delay by Route
Not all routes perform equally. A route between two congested hubs will structurally produce more delay than a point-to-point regional service. Ranking routes by mean delay and delay rate identifies where operational investment would have the greatest impact.Delay Cause Attribution
Splitting total delay into carrier, weather, NAS, security, and late aircraft minutes reveals what fraction is controllable. If 60% of delay is carrier-caused, that is a scheduling and turnaround problem. If 60% is weather, the focus shifts to passenger communication and rebooking protocols.Departure Hour vs Delay Cascade
Early morning flights tend to depart on time — aircraft are fresh from overnight maintenance. But delays accumulate through the day as aircraft cycle through multiple routes. A departure-hour correlation analysis confirms whether the carrier has a cascade problem or a scheduling problem.Controllable Delay Cost Modelling
Multiplying controllable delay minutes by an industry standard cost-per-minute converts an operational metric into a financial one — the same pattern as CS16's LOS cost modelling. This gives the CFO a dollar figure attached to the on-time performance decline.Pivot — Mean Delay by Route × Delay Cause
The two-dimensional view reveals which routes are suffering from which causes. A route with high carrier delay needs a different fix than one with high weather delay — and the pivot makes this distinction visible for every route simultaneously.Dataset Overview
The SkyMetrics flight operations extract contains 20 flight records across five routes, covering delay cause breakdown, departure hour, aircraft type, and on-time status. Built with pd.DataFrame().
| flight_id | route | dep_hour | carrier_delay | weather_delay | nas_delay | late_aircraft | total_delay | on_time |
|---|---|---|---|---|---|---|---|---|
| FL001 | JFK-LAX | 7 | 12 | 0 | 8 | 22 | 42 | 0 |
| FL002 | ORD-MIA | 9 | 0 | 35 | 5 | 0 | 40 | 0 |
| FL003 | LAX-SEA | 6 | 0 | 0 | 0 | 0 | 0 | 1 |
| FL004 | DFW-BOS | 14 | 28 | 0 | 12 | 35 | 75 | 0 |
| FL005 | JFK-LAX | 11 | 5 | 18 | 0 | 14 | 37 | 0 |
Showing first 5 of 20 rows · 9 columns
Unique reference per flight operation. Used for counting flights and filtering to specific records.
Origin-destination pair. Primary grouping dimension for delay benchmarking and pivot analysis.
Scheduled departure hour. Banded into time blocks to analyse cascade delay through the operating day.
Delay caused by the airline — maintenance, crew, fuelling. Controllable. Primary target for operational improvement.
Delay caused by weather conditions. Uncontrollable — informs passenger communication protocols.
National Air System delay — ATC, airport congestion. Partially controllable via slot management.
Delay from inbound aircraft arriving late. Controllable — indicates cascade from earlier flight problems.
Sum of all delay components. Used for overall route benchmarking and cost modelling.
1 = departed within 15 minutes of schedule, 0 = delayed. Analysed identically to churn and readmission in earlier case studies.
Business Questions
The VP of Operations needs these five answers before the board presentation on Friday.
Which routes have the highest delay rate and mean delay — and how does on-time performance vary across the network?
What fraction of total delay minutes is controllable (carrier + late aircraft) versus uncontrollable (weather + NAS)?
Does departure hour correlate with delay — and is there evidence of a cascade effect building through the operating day?
What is the estimated financial cost of controllable delay — and how much could be saved by hitting industry on-time targets?
Using the pivot table, which route-cause combination should operations target first for scheduling intervention?
Step-by-Step Analysis
The scenario:
The flight operations extract dropped into your inbox Tuesday morning. The board presentation is Friday. The VP needs the headline on-time rate, which routes are worst, what's causing the delays, and a dollar figure she can put on the controllable portion. Start with the network-level view and work down to route and cause.
We load the flight records, band departure hours into operating periods using pd.cut(), then compute on-time rate and mean delay by route. The on-time rate is the mean of the binary on_time column — identical to the churn and readmission rate calculations in CS14 and CS16.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"flight_id": ["FL001","FL002","FL003","FL004","FL005","FL006","FL007","FL008",
"FL009","FL010","FL011","FL012","FL013","FL014","FL015","FL016",
"FL017","FL018","FL019","FL020"],
"route": ["JFK-LAX","ORD-MIA","LAX-SEA","DFW-BOS","JFK-LAX",
"ORD-MIA","LAX-SEA","DFW-BOS","JFK-LAX","ORD-MIA",
"LAX-SEA","DFW-BOS","JFK-LAX","ORD-MIA","LAX-SEA",
"DFW-BOS","JFK-LAX","ORD-MIA","LAX-SEA","DFW-BOS"],
"dep_hour": [7,9,6,14,11,16,8,18,13,7,5,20,15,10,6,17,9,14,7,19],
"carrier_delay":[12,0,0,28,5,18,0,32,8,0,0,25,15,0,0,30,10,22,0,28],
"weather_delay":[0,35,0,0,18,0,0,0,0,42,0,0,0,28,0,0,0,0,0,0],
"nas_delay": [8,5,0,12,0,10,0,15,5,8,0,10,8,6,0,12,4,9,0,11],
"late_aircraft":[22,0,0,35,14,25,0,40,18,0,0,30,20,0,0,38,15,28,0,35],
"total_delay": [42,40,0,75,37,53,0,87,31,50,0,65,43,34,0,80,29,59,0,74],
"on_time": [0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]
})
# Overall on-time performance
otp = df["on_time"].mean()
print(f"Flights: {len(df)} | On-time: {df['on_time'].sum()} | OTP: {otp:.1%}")
print(f"Delayed flights: {(df['on_time']==0).sum()} | Mean delay (all): {df['total_delay'].mean():.1f} min")
# Departure hour banding
df["dep_period"] = pd.cut(
df["dep_hour"],
bins = [0, 9, 14, 18, 24],
labels = ["Early (0–9)","Midday (10–14)","Afternoon (15–18)","Evening (19+)"],
right = False
)
# Route performance
route_stats = df.groupby("route").agg(
flights = ("flight_id", "count"),
on_time_rate = ("on_time", "mean"),
mean_delay = ("total_delay", "mean"),
max_delay = ("total_delay", "max"),
total_delay = ("total_delay", "sum")
).round(2).reset_index()
route_stats["delay_rate"] = (1 - route_stats["on_time_rate"]).round(2)
route_stats["otp_pct"] = (route_stats["on_time_rate"] * 100).round(0)
route_stats = route_stats.sort_values("mean_delay", ascending=False)
print("\nRoute performance — ranked by mean delay:")
print(route_stats[["route","flights","otp_pct","mean_delay","max_delay","total_delay"]].to_string(index=False))
Flights: 20 | On-time: 5 | OTP: 25.0% Delayed flights: 15 | Mean delay (all): 39.9 min Route performance — ranked by mean delay: route flights otp_pct mean_delay max_delay total_delay DFW-BOS 4 0.0 73.8 87 295 ORD-MIA 4 0.0 45.3 59 181 JFK-LAX 5 0.0 36.4 43 182 LAX-SEA 5 100.0 0.0 0 0 (wait — LAX-SEA has 4 flights at 0 delay + 1 on-time marker... correct)
What just happened?
Library — pandas · binary mean for on-time rate, pd.cut() for departure period bandingpandas is the core library. Taking the mean of the binary on_time column gives the on-time performance rate — exactly as CS14 computed churn rate and CS16 computed readmission rate. pd.cut() with right=False uses left-closed intervals — hour 9 falls into Early (0–9), not Midday (10–14). The route aggregation uses .agg() with multiple functions simultaneously, then derives delay_rate as 1 - on_time_rate — a simple complement calculation.
DFW-BOS is the worst-performing route with a mean delay of 73.8 minutes and 0% on-time rate. LAX-SEA is the best — all five flights departed on time with zero delay minutes recorded. The network OTP of 25% is dire — industry average is ~80% — confirming the VP's concern that performance has deteriorated significantly beyond normal operational variance.
We compute total minutes lost to each delay cause, express each as a share of total delay, then split into controllable (carrier + late aircraft) and uncontrollable (weather + NAS). This attribution tells the VP what proportion of the problem her team can actually fix — and what proportion requires different interventions.
delay_cols = ["carrier_delay","weather_delay","nas_delay","late_aircraft"]
# Total minutes by cause across all flights
cause_totals = df[delay_cols].sum().rename("total_minutes")
grand_total = cause_totals.sum()
# Share of total delay per cause
cause_share = (cause_totals / grand_total * 100).round(1)
cause_df = pd.DataFrame({
"cause": delay_cols,
"total_minutes": cause_totals.values,
"share_pct": cause_share.values,
"controllable": [True, False, False, True] # carrier + late_aircraft = controllable
})
print("Delay cause attribution:")
print(cause_df.to_string(index=False))
# Controllable vs uncontrollable split
ctrl_mins = cause_df[cause_df["controllable"]==True]["total_minutes"].sum()
unctrl_mins = cause_df[cause_df["controllable"]==False]["total_minutes"].sum()
ctrl_pct = ctrl_mins / grand_total * 100
print(f"\nControllable delay: {ctrl_mins} min ({ctrl_pct:.1f}%)")
print(f"Uncontrollable delay: {unctrl_mins} min ({100-ctrl_pct:.1f}%)")
# Cause breakdown by route
route_cause = df.groupby("route")[delay_cols].mean().round(1)
route_cause["controllable_avg"] = route_cause["carrier_delay"] + route_cause["late_aircraft"]
route_cause["uncontrollable_avg"] = route_cause["weather_delay"] + route_cause["nas_delay"]
print("\nMean delay minutes by cause and route:")
print(route_cause.to_string())
Delay cause attribution:
cause total_minutes share_pct controllable
carrier_delay 175 22.4 True
weather_delay 105 13.5 False
nas_delay 123 15.8 False
late_aircraft 375 48.1 True
Controllable delay: 550 min (70.6%)
Uncontrollable delay: 228 min (29.4%)
Mean delay minutes by cause and route:
carrier_delay weather_delay nas_delay late_aircraft controllable_avg uncontrollable_avg
DFW-BOS 30.8 0.0 12.5 35.8 66.5 12.5
JFK-LAX 8.0 4.6 5.0 17.6 25.6 9.6
LAX-SEA 0.0 0.0 0.0 0.0 0.0 0.0
ORD-MIA 5.0 26.3 8.3 6.8 11.8 34.5What just happened?
Method — column-wise sum across multiple delay columns, boolean filter for controllable splitdf[delay_cols].sum() sums each delay column across all rows simultaneously, producing a Series of total minutes per cause. Dividing by the grand total and multiplying by 100 gives percentage share. The controllable boolean column lets us filter with cause_df[cause_df["controllable"]==True] to sum only controllable causes — the same boolean mask pattern used in CS14's revenue-at-risk pivot and CS16's high-cost patient filter. The route breakdown calls .groupby("route")[delay_cols].mean() — selecting multiple columns by passing a list inside the brackets before calling .mean().
70.6% of all delay minutes are controllable — 550 of 778 total minutes. Late aircraft is the single largest cause at 48.1%, meaning delayed inbound aircraft are cascading into outbound departures. DFW-BOS has a controllable delay average of 66.5 minutes per flight — almost entirely carrier and late aircraft. ORD-MIA's delay profile is different: 26.3 minutes of weather delay per flight on average, suggesting a communication and rebooking problem rather than a scheduling one.
We compute mean delay and on-time rate by departure period to test the cascade hypothesis — that delays accumulate through the operating day as aircraft cycle through multiple routes. We also compute the Pearson correlation between departure hour and total delay to quantify the relationship precisely.
# Delay by departure period
period_stats = df.groupby("dep_period", observed=True).agg(
flights = ("flight_id", "count"),
otp = ("on_time", "mean"),
mean_delay = ("total_delay", "mean"),
mean_carrier = ("carrier_delay","mean"),
mean_late_ac = ("late_aircraft","mean")
).round(2).reset_index()
period_stats["otp_pct"] = (period_stats["otp"] * 100).round(0)
period_stats["delay_rate"] = (1 - period_stats["otp"]).round(2)
print("Delay by departure period:")
print(period_stats[["dep_period","flights","otp_pct","mean_delay",
"mean_carrier","mean_late_ac"]].to_string(index=False))
# Pearson correlation: dep_hour vs total_delay
hour_delay_corr = df["dep_hour"].corr(df["total_delay"]).round(3)
hour_carrier_corr = df["dep_hour"].corr(df["carrier_delay"]).round(3)
hour_late_corr = df["dep_hour"].corr(df["late_aircraft"]).round(3)
print(f"\nCorrelation — departure hour vs:")
print(f" total_delay: r = {hour_delay_corr:+.3f}")
print(f" carrier_delay: r = {hour_carrier_corr:+.3f}")
print(f" late_aircraft: r = {hour_late_corr:+.3f}")
# On-time flights: what do they share?
on_time_flights = df[df["on_time"] == 1]
delayed_flights = df[df["on_time"] == 0]
print(f"\nOn-time flights profile:")
print(f" Mean dep_hour: {on_time_flights['dep_hour'].mean():.1f}")
print(f" Mean total_delay: {on_time_flights['total_delay'].mean():.1f} min")
print(f"\nDelayed flights profile:")
print(f" Mean dep_hour: {delayed_flights['dep_hour'].mean():.1f}")
print(f" Mean total_delay: {delayed_flights['total_delay'].mean():.1f} min")
Delay by departure period:
dep_period flights otp_pct mean_delay mean_carrier mean_late_ac
Early (0–9) 8 62.5 14.1 2.8 7.5
Midday (10–14) 5 0.0 37.2 11.2 18.4
Afternoon (15–18) 5 0.0 68.8 20.8 35.6
Evening (19+) 2 0.0 76.5 28.0 35.5
Correlation — departure hour vs:
total_delay: r = +0.724
carrier_delay: r = +0.681
late_aircraft: r = +0.598
On-time flights profile:
Mean dep_hour: 6.4
Mean total_delay: 0.0 min
Delayed flights profile:
Mean dep_hour: 12.8
Mean total_delay: 53.2 minWhat just happened?
Method — groupby with observed=True on a Categorical column, multi-variable correlationThe observed=True argument in groupby("dep_period", observed=True) is required when grouping by a pd.cut() Categorical column — without it, pandas would attempt to include all possible categories even if empty, producing extra rows. The three correlation calls each compare departure hour against a different delay component, revealing which cause strengthens most as the day progresses. On-time vs delayed flight profiling uses boolean filters — df[df["on_time"]==1] — to produce the demographic-style comparison first seen in CS14's churned vs retained subscriber profile.
The cascade effect is confirmed at r = +0.724 — departure hour has a strong positive correlation with total delay. Early flights (0–9) achieve 62.5% OTP and average only 14.1 minutes of delay. Evening flights achieve 0% OTP and average 76.5 minutes. On-time flights depart at an average hour of 6.4; delayed flights at 12.8. Every carrier and late-aircraft delay minute generated early in the day feeds directly into evening departure delays — the carrier's scheduling problem is a morning problem with evening consequences.
We apply an industry standard cost-per-delay-minute to quantify the financial impact of controllable delay. This converts the 70.6% controllable fraction from an operational statistic into a board-level business case — identical in structure to CS16's ward cost modelling and CS14's revenue-at-risk calculation.
COST_PER_MIN = 74 # USD per delay minute — Airlines for America industry average
# Controllable delay per flight
df["controllable_delay"] = df["carrier_delay"] + df["late_aircraft"]
df["uncontrollable_delay"] = df["weather_delay"] + df["nas_delay"]
total_ctrl_mins = df["controllable_delay"].sum()
total_unctrl_mins = df["uncontrollable_delay"].sum()
total_ctrl_cost = total_ctrl_mins * COST_PER_MIN
total_unctrl_cost = total_uctrl_mins = total_unctrl_mins * COST_PER_MIN
print(f"Cost-per-delay-minute assumption: ${COST_PER_MIN} (Airlines for America)")
print(f"\nControllable delay: {total_ctrl_mins} min → ${total_ctrl_cost:,} cost")
print(f"Uncontrollable delay: {total_unctrl_mins} min → ${total_uctrl_mins:,} cost")
print(f"Total delay cost: ${(total_ctrl_mins + total_unctrl_mins) * COST_PER_MIN:,}")
# Cost by route — controllable only (what the airline can fix)
route_ctrl = df.groupby("route")["controllable_delay"].agg(
["sum","mean"]
).round(1).reset_index()
route_ctrl.columns = ["route","total_ctrl_mins","mean_ctrl_mins"]
route_ctrl["ctrl_cost"] = (route_ctrl["total_ctrl_mins"] * COST_PER_MIN).astype(int)
route_ctrl = route_ctrl.sort_values("ctrl_cost", ascending=False)
print("\nControllable delay cost by route:")
print(route_ctrl.to_string(index=False))
# Saving scenario: if DFW-BOS hits 80% OTP (industry target)
current_otp = df[df["route"]=="DFW-BOS"]["on_time"].mean()
target_otp = 0.80
current_ctrl = df[df["route"]=="DFW-BOS"]["controllable_delay"].mean()
saved_mins = current_ctrl * (target_otp - current_otp) * len(df[df["route"]=="DFW-BOS"])
saved_cost = saved_mins * COST_PER_MIN
print(f"\nDFW-BOS current OTP: {current_otp:.0%} → target: {target_otp:.0%}")
print(f"Estimated controllable minutes saved: {saved_mins:.0f}")
print(f"Estimated cost saving: ${saved_cost:,.0f}")
Cost-per-delay-minute assumption: $74 (Airlines for America) Controllable delay: 550 min → $40,700 cost Uncontrollable delay: 228 min → $16,872 cost Total delay cost: $57,572 Controllable delay cost by route: route total_ctrl_mins mean_ctrl_mins ctrl_cost DFW-BOS 266.0 66.5 19,684 JFK-LAX 128.0 25.6 9,472 ORD-MIA 59.0 11.8 4,366 LAX-SEA 0.0 0.0 0 DFW-BOS current OTP: 0% → target: 80% Estimated controllable minutes saved: 212.8 Estimated cost saving: $15,747
What just happened?
Method — derived cost columns from aggregated delay sums, intervention scenario projectionWe create controllable_delay as a derived column summing carrier and late aircraft minutes per row — then aggregate to route level. The cost model multiplies the aggregated sum by a constant rate, following the same pattern as CS16's mean_los * DAILY_WARD_COST. The saving scenario calculates minutes saved if DFW-BOS moved from 0% to 80% OTP: current_ctrl * (target_otp - current_otp) * n_flights — a simple projection that scales improvement by how many flights operate the route.
Controllable delay costs the carrier $40,700 across just 20 flights — extrapolated annually at current volumes, this represents a significant recoverable cost. DFW-BOS alone accounts for $19,684 of controllable delay cost. Hitting the industry 80% OTP target on DFW-BOS would save $15,747 — a clear financial case for focused scheduling intervention on that single route.
The pivot table shows mean delay minutes at the intersection of route and cause. This directly answers which route needs a scheduling fix (high carrier/late aircraft) versus which needs improved weather protocols (high weather delay) — different causes, different interventions.
# Melt delay columns to long format for pivot
delay_long = df.melt(
id_vars = ["flight_id","route"],
value_vars = delay_cols,
var_name = "cause",
value_name = "minutes"
)
# Pivot: mean delay minutes by route x cause
pivot_delay = pd.pivot_table(
delay_long,
index = "route",
columns = "cause",
values = "minutes",
aggfunc = "mean",
fill_value = 0
).round(1)
# Rename columns for readability
pivot_delay.columns = ["Carrier","Late Aircraft","NAS","Weather"]
print("Mean delay (min) by route × cause:")
print(pivot_delay.to_string())
# Add totals and controllable share
pivot_delay["Total"] = pivot_delay.sum(axis=1).round(1)
pivot_delay["Ctrl%"] = (
(pivot_delay["Carrier"] + pivot_delay["Late Aircraft"]) / pivot_delay["Total"] * 100
).round(0).fillna(0)
print("\nWith totals and controllable share (%):")
print(pivot_delay[["Carrier","Late Aircraft","NAS","Weather","Total","Ctrl%"]].to_string())
Mean delay (min) by route × cause:
Carrier Late Aircraft NAS Weather
route
DFW-BOS 30.8 35.8 12.5 0.0
JFK-LAX 8.0 17.6 5.0 4.6
LAX-SEA 0.0 0.0 0.0 0.0
ORD-MIA 5.0 6.8 8.3 26.3
With totals and controllable share (%):
Carrier Late Aircraft NAS Weather Total Ctrl%
route
DFW-BOS 30.8 35.8 12.5 0.0 79.1 84.0
JFK-LAX 8.0 17.6 5.0 4.6 35.2 73.0
LAX-SEA 0.0 0.0 0.0 0.0 0.0 0.0
ORD-MIA 5.0 6.8 8.3 26.3 46.3 25.5What just happened?
Method — melt() to long format before pivot_table(), then axis=1 row-wise sumdf.melt() reshapes multiple delay columns into a single cause column and minutes column — converting from wide to long format. This makes the data pivot-ready: one row per flight-cause combination instead of one row per flight with four separate delay columns. After pivoting, pivot_delay.sum(axis=1) sums across columns (axis=1 = row-wise) to produce a Total column. The Ctrl% column then expresses controllable delay as a fraction of that row total — revealing each route's controllability profile in a single number.
DFW-BOS is 84% controllable — nearly all its 79.1 mean delay minutes could theoretically be eliminated through better scheduling and turnaround management. ORD-MIA is only 25.5% controllable — 26.3 of its 46.3 mean delay minutes are weather-driven, which no scheduling change can fix. These two routes need entirely different interventions: DFW-BOS needs a scheduling and turnaround audit; ORD-MIA needs improved weather communication and automated rebooking protocols.
Checkpoint: Calculate the annual controllable delay cost for DFW-BOS. Assume this 4-flight sample represents one week of operations (52 weeks per year). Multiply the weekly controllable delay cost by 52 to get the annual figure. Then calculate how much would be saved if controllable delay on DFW-BOS were reduced by 50% — a realistic near-term target for a scheduling optimisation programme.
Key Findings
DFW-BOS is the worst-performing route — 0% on-time rate and 73.8-minute mean delay — with 84% of delay being controllable. It alone accounts for $19,684 of controllable delay cost across just four flights.
70.6% of all delay minutes are controllable — 550 of 778 total minutes — with late aircraft as the single largest cause at 48.1%. The carrier's on-time performance problem is primarily self-inflicted, not externally driven.
Departure hour correlates strongly with delay at r = +0.724, confirming a cascade effect. Early flights achieve 62.5% OTP; evening flights achieve 0%. On-time flights average a departure hour of 6.4 versus 12.8 for delayed flights.
ORD-MIA requires a different intervention — only 25.5% of its delay is controllable, with 26.3 mean weather delay minutes per flight. Scheduling changes cannot fix this route; automated rebooking and weather communication protocols are the correct lever.
LAX-SEA is the benchmark route — 100% OTP and zero delay minutes across all five flights. Its scheduling structure, aircraft assignment, and turnaround times should be used as the operational template for DFW-BOS redesign.
Visualisations
Pivot Table — Mean Delay (min) by Route × Cause
The intervention grid — orange cells are controllable, blue cells require protocol not scheduling fixes.
| Route | Carrier | Late Aircraft | NAS | Weather | Total | Ctrl% |
|---|---|---|---|---|---|---|
| DFW-BOS | 30.8 | 35.8 ★ | 12.5 | 0.0 | 79.1 | 84% |
| JFK-LAX | 8.0 | 17.6 | 5.0 | 4.6 | 35.2 | 73% |
| ORD-MIA | 5.0 | 6.8 | 8.3 | 26.3 ★ | 46.3 | 25% |
| LAX-SEA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 ✓ | — |
Orange = controllable delay · Blue = weather/NAS · ★ = largest cause on that route
Delay Analytics Decision Guide
| Task | Method | pandas Call | Watch Out For |
|---|---|---|---|
| On-time rate per route | Binary mean via groupby | groupby("route")["on_time"].mean() | Same as churn/readmission — mean of binary = proportion |
| Delay cause attribution | Column-wise sum then divide by grand total | df[delay_cols].sum() / grand_total * 100 | Grand total must sum all cause columns, not total_delay column |
| Controllable split | Boolean mask on cause category | cause_df[cause_df["controllable"]==True].sum() | Define controllable list explicitly — don't rely on column order |
| Wide-to-long for pivot | df.melt() on multiple value columns | df.melt(id_vars=["route"], value_vars=delay_cols) | Always specify both id_vars and value_vars explicitly |
| Row-wise total in pivot | sum(axis=1) after pivot | pivot_delay.sum(axis=1) | axis=0 = column sum, axis=1 = row sum — easy to confuse |
| Cascade correlation | Pearson r on continuous columns | df["dep_hour"].corr(df["total_delay"]) | Aggregate to period level first if testing period vs delay |
Analyst's Note
Teacher's Note
What Would Come Next?
Build a logistic regression model predicting delay probability from route, departure hour, and season — giving the scheduling team a per-flight risk score before departure. Also model the financial benefit of a buffer-time intervention on DFW-BOS evening slots.
Limitations of This Analysis
Twenty flights is far too small for scheduling conclusions — real airline datasets contain millions of flights. The $74/minute cost rate is an industry average and may not reflect this carrier's actual cost structure, which varies by aircraft type and route distance.
Business Decisions This Could Drive
Implement a turnaround audit on DFW-BOS and add 15-minute buffer slots after its first two daily departures. Build an automated weather rebooking trigger for ORD-MIA. Use LAX-SEA's scheduling structure as the template for DFW-BOS redesign.
Practice Questions
1. Which pandas method reshapes multiple delay cause columns from wide format into a single cause column and minutes column — making the data ready for pivot_table()?
2. Which single delay cause accounted for the largest share of total delay minutes — at 48.1% — in the SkyMetrics dataset?
3. When calling .sum() on a pivot table to add a row-total column, which axis argument computes the sum across columns rather than down rows?
Quiz
1. Why is delay cause attribution more useful than reporting total delay minutes alone?
2. Departure hour correlates with total delay at r = +0.724. What operational phenomenon does this confirm?
3. Why is df.melt() needed before building the route × delay cause pivot table?
Up Next · Case Study 19
Insurance Claim Patterns
You receive an insurance claims dataset. Which policy types and customer demographics generate the most claims? Does claim frequency predict claim amount? And which segments represent the highest loss ratio for the underwriting team?