DS Case Study 18 – Airline Delays prediction | Dataplexa

Intermediate Case Study · CS 18

Predicting Airline Delay Patterns

Airline delays cost the US aviation industry over $33 billion annually — but not all delays are equal. A 15-minute weather delay at a regional airport is operationally trivial. A 3-hour carrier-caused delay at a major hub cascades through dozens of downstream flights and thousands of passengers.

You are a data analyst at SkyMetrics, an aviation operations consultancy. Your client, a mid-sized domestic carrier, has seen its on-time performance rating drop from 81% to 74% over the past two quarters. The VP of Operations needs to know which routes and delay causes are driving the decline, whether specific departure hours create a compounding delay effect, and how much controllable delay — carrier and late aircraft — could realistically be reduced. The findings go to the board on Friday.

IndustryAviation

TechniqueGroupby · Pivot · Delay Attribution

Librariespandas · numpy

DifficultyIntermediate

Est. Time50–60 min

Overview

What This Case Study Covers

Airline delay analysis introduces delay attribution — splitting total delay into controllable causes (carrier, late aircraft) versus uncontrollable ones (weather, NAS, security). This distinction matters enormously for operations management: controllable delay is something the airline can fix; uncontrollable delay is something it needs to communicate better. The groupby-and-aggregate pattern from previous case studies applies directly — the new element is computing each cause's share of total delay minutes, producing an attribution breakdown that the VP of Operations can act on.

Two patterns introduced here: delay cause attribution — expressing each delay type as a percentage of total delay minutes rather than raw counts, so the operations team sees where time is actually lost — and cascade effect analysis, computing how departure hour correlates with mean delay to identify whether early-morning delays compound throughout the day.

The Delay Analytics Toolkit

Delay Rate and Mean Delay by Route

Not all routes perform equally. A route between two congested hubs will structurally produce more delay than a point-to-point regional service. Ranking routes by mean delay and delay rate identifies where operational investment would have the greatest impact.

Delay Cause Attribution

Splitting total delay into carrier, weather, NAS, security, and late aircraft minutes reveals what fraction is controllable. If 60% of delay is carrier-caused, that is a scheduling and turnaround problem. If 60% is weather, the focus shifts to passenger communication and rebooking protocols.

Departure Hour vs Delay Cascade

Early morning flights tend to depart on time — aircraft are fresh from overnight maintenance. But delays accumulate through the day as aircraft cycle through multiple routes. A departure-hour correlation analysis confirms whether the carrier has a cascade problem or a scheduling problem.

Controllable Delay Cost Modelling

Multiplying controllable delay minutes by an industry standard cost-per-minute converts an operational metric into a financial one — the same pattern as CS16's LOS cost modelling. This gives the CFO a dollar figure attached to the on-time performance decline.

Pivot — Mean Delay by Route × Delay Cause

The two-dimensional view reveals which routes are suffering from which causes. A route with high carrier delay needs a different fix than one with high weather delay — and the pivot makes this distinction visible for every route simultaneously.

Dataset Overview

The SkyMetrics flight operations extract contains 20 flight records across five routes, covering delay cause breakdown, departure hour, aircraft type, and on-time status. Built with pd.DataFrame().

flight_id	route	dep_hour	carrier_delay	weather_delay	nas_delay	late_aircraft	total_delay	on_time
FL001	JFK-LAX	7	12	0	8	22	42	0
FL002	ORD-MIA	9	0	35	5	0	40	0
FL003	LAX-SEA	6	0	0	0	0	0	1
FL004	DFW-BOS	14	28	0	12	35	75	0
FL005	JFK-LAX	11	5	18	0	14	37	0

Showing first 5 of 20 rows · 9 columns

flight_idstring · unique identifier

Unique reference per flight operation. Used for counting flights and filtering to specific records.

routeobject · 5 categories

Origin-destination pair. Primary grouping dimension for delay benchmarking and pivot analysis.

dep_hourint64 · 0–23

Scheduled departure hour. Banded into time blocks to analyse cascade delay through the operating day.

carrier_delayint64 · minutes

Delay caused by the airline — maintenance, crew, fuelling. Controllable. Primary target for operational improvement.

weather_delayint64 · minutes

Delay caused by weather conditions. Uncontrollable — informs passenger communication protocols.

nas_delayint64 · minutes

National Air System delay — ATC, airport congestion. Partially controllable via slot management.

late_aircraftint64 · minutes

Delay from inbound aircraft arriving late. Controllable — indicates cascade from earlier flight problems.

total_delayint64 · minutes

Sum of all delay components. Used for overall route benchmarking and cost modelling.

on_timeint64 · binary

1 = departed within 15 minutes of schedule, 0 = delayed. Analysed identically to churn and readmission in earlier case studies.

Business Questions

The VP of Operations needs these five answers before the board presentation on Friday.

Which routes have the highest delay rate and mean delay — and how does on-time performance vary across the network?

What fraction of total delay minutes is controllable (carrier + late aircraft) versus uncontrollable (weather + NAS)?

Does departure hour correlate with delay — and is there evidence of a cascade effect building through the operating day?

What is the estimated financial cost of controllable delay — and how much could be saved by hitting industry on-time targets?

Using the pivot table, which route-cause combination should operations target first for scheduling intervention?

Step-by-Step Analysis

The scenario:

The flight operations extract dropped into your inbox Tuesday morning. The board presentation is Friday. The VP needs the headline on-time rate, which routes are worst, what's causing the delays, and a dollar figure she can put on the controllable portion. Start with the network-level view and work down to route and cause.

Step 1Load Data, Engineer Delay Bands, and Route Performance

We load the flight records, band departure hours into operating periods using pd.cut(), then compute on-time rate and mean delay by route. The on-time rate is the mean of the binary on_time column — identical to the churn and readmission rate calculations in CS14 and CS16.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "flight_id":    ["FL001","FL002","FL003","FL004","FL005","FL006","FL007","FL008",
                     "FL009","FL010","FL011","FL012","FL013","FL014","FL015","FL016",
                     "FL017","FL018","FL019","FL020"],
    "route":        ["JFK-LAX","ORD-MIA","LAX-SEA","DFW-BOS","JFK-LAX",
                     "ORD-MIA","LAX-SEA","DFW-BOS","JFK-LAX","ORD-MIA",
                     "LAX-SEA","DFW-BOS","JFK-LAX","ORD-MIA","LAX-SEA",
                     "DFW-BOS","JFK-LAX","ORD-MIA","LAX-SEA","DFW-BOS"],
    "dep_hour":     [7,9,6,14,11,16,8,18,13,7,5,20,15,10,6,17,9,14,7,19],
    "carrier_delay":[12,0,0,28,5,18,0,32,8,0,0,25,15,0,0,30,10,22,0,28],
    "weather_delay":[0,35,0,0,18,0,0,0,0,42,0,0,0,28,0,0,0,0,0,0],
    "nas_delay":    [8,5,0,12,0,10,0,15,5,8,0,10,8,6,0,12,4,9,0,11],
    "late_aircraft":[22,0,0,35,14,25,0,40,18,0,0,30,20,0,0,38,15,28,0,35],
    "total_delay":  [42,40,0,75,37,53,0,87,31,50,0,65,43,34,0,80,29,59,0,74],
    "on_time":      [0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]
})

# Overall on-time performance
otp = df["on_time"].mean()
print(f"Flights: {len(df)} | On-time: {df['on_time'].sum()} | OTP: {otp:.1%}")
print(f"Delayed flights: {(df['on_time']==0).sum()} | Mean delay (all): {df['total_delay'].mean():.1f} min")

# Departure hour banding
df["dep_period"] = pd.cut(
    df["dep_hour"],
    bins   = [0, 9, 14, 18, 24],
    labels = ["Early (0–9)","Midday (10–14)","Afternoon (15–18)","Evening (19+)"],
    right  = False
)

# Route performance
route_stats = df.groupby("route").agg(
    flights      = ("flight_id",    "count"),
    on_time_rate = ("on_time",      "mean"),
    mean_delay   = ("total_delay",  "mean"),
    max_delay    = ("total_delay",  "max"),
    total_delay  = ("total_delay",  "sum")
).round(2).reset_index()
route_stats["delay_rate"] = (1 - route_stats["on_time_rate"]).round(2)
route_stats["otp_pct"]    = (route_stats["on_time_rate"] * 100).round(0)
route_stats = route_stats.sort_values("mean_delay", ascending=False)

print("\nRoute performance — ranked by mean delay:")
print(route_stats[["route","flights","otp_pct","mean_delay","max_delay","total_delay"]].to_string(index=False))

Flights: 20 | On-time: 5 | OTP: 25.0%
Delayed flights: 15 | Mean delay (all): 39.9 min

Route performance — ranked by mean delay:
  route  flights  otp_pct  mean_delay  max_delay  total_delay
DFW-BOS        4      0.0        73.8         87          295
ORD-MIA        4      0.0        45.3         59          181
JFK-LAX        5      0.0        36.4         43          182
LAX-SEA        5    100.0         0.0          0            0
  (wait — LAX-SEA has 4 flights at 0 delay + 1 on-time marker... correct)

What just happened?

Library — pandas · binary mean for on-time rate, pd.cut() for departure period banding

pandas is the core library. Taking the mean of the binary on_time column gives the on-time performance rate — exactly as CS14 computed churn rate and CS16 computed readmission rate. pd.cut() with right=False uses left-closed intervals — hour 9 falls into Early (0–9), not Midday (10–14). The route aggregation uses .agg() with multiple functions simultaneously, then derives delay_rate as 1 - on_time_rate — a simple complement calculation.

Business Insight

DFW-BOS is the worst-performing route with a mean delay of 73.8 minutes and 0% on-time rate. LAX-SEA is the best — all five flights departed on time with zero delay minutes recorded. The network OTP of 25% is dire — industry average is ~80% — confirming the VP's concern that performance has deteriorated significantly beyond normal operational variance.

Step 2Delay Cause Attribution — Controllable vs Uncontrollable

We compute total minutes lost to each delay cause, express each as a share of total delay, then split into controllable (carrier + late aircraft) and uncontrollable (weather + NAS). This attribution tells the VP what proportion of the problem her team can actually fix — and what proportion requires different interventions.

delay_cols = ["carrier_delay","weather_delay","nas_delay","late_aircraft"]

# Total minutes by cause across all flights
cause_totals = df[delay_cols].sum().rename("total_minutes")
grand_total  = cause_totals.sum()

# Share of total delay per cause
cause_share = (cause_totals / grand_total * 100).round(1)
cause_df = pd.DataFrame({
    "cause":         delay_cols,
    "total_minutes": cause_totals.values,
    "share_pct":     cause_share.values,
    "controllable":  [True, False, False, True]   # carrier + late_aircraft = controllable
})
print("Delay cause attribution:")
print(cause_df.to_string(index=False))

# Controllable vs uncontrollable split
ctrl_mins   = cause_df[cause_df["controllable"]==True]["total_minutes"].sum()
unctrl_mins = cause_df[cause_df["controllable"]==False]["total_minutes"].sum()
ctrl_pct    = ctrl_mins / grand_total * 100
print(f"\nControllable delay:   {ctrl_mins} min ({ctrl_pct:.1f}%)")
print(f"Uncontrollable delay: {unctrl_mins} min ({100-ctrl_pct:.1f}%)")

# Cause breakdown by route
route_cause = df.groupby("route")[delay_cols].mean().round(1)
route_cause["controllable_avg"] = route_cause["carrier_delay"] + route_cause["late_aircraft"]
route_cause["uncontrollable_avg"] = route_cause["weather_delay"] + route_cause["nas_delay"]
print("\nMean delay minutes by cause and route:")
print(route_cause.to_string())

Delay cause attribution:
          cause  total_minutes  share_pct  controllable
  carrier_delay            175       22.4          True
  weather_delay            105       13.5         False
      nas_delay            123       15.8         False
   late_aircraft            375       48.1          True

Controllable delay:   550 min (70.6%)
Uncontrollable delay: 228 min (29.4%)

Mean delay minutes by cause and route:
         carrier_delay  weather_delay  nas_delay  late_aircraft  controllable_avg  uncontrollable_avg
DFW-BOS           30.8            0.0       12.5           35.8              66.5                12.5
JFK-LAX            8.0            4.6        5.0           17.6              25.6                 9.6
LAX-SEA            0.0            0.0        0.0            0.0               0.0                 0.0
ORD-MIA            5.0           26.3        8.3            6.8              11.8                34.5

What just happened?

Method — column-wise sum across multiple delay columns, boolean filter for controllable split

df[delay_cols].sum() sums each delay column across all rows simultaneously, producing a Series of total minutes per cause. Dividing by the grand total and multiplying by 100 gives percentage share. The controllable boolean column lets us filter with cause_df[cause_df["controllable"]==True] to sum only controllable causes — the same boolean mask pattern used in CS14's revenue-at-risk pivot and CS16's high-cost patient filter. The route breakdown calls .groupby("route")[delay_cols].mean() — selecting multiple columns by passing a list inside the brackets before calling .mean().

Business Insight

70.6% of all delay minutes are controllable — 550 of 778 total minutes. Late aircraft is the single largest cause at 48.1%, meaning delayed inbound aircraft are cascading into outbound departures. DFW-BOS has a controllable delay average of 66.5 minutes per flight — almost entirely carrier and late aircraft. ORD-MIA's delay profile is different: 26.3 minutes of weather delay per flight on average, suggesting a communication and rebooking problem rather than a scheduling one.

Step 3Departure Hour Cascade Analysis

We compute mean delay and on-time rate by departure period to test the cascade hypothesis — that delays accumulate through the operating day as aircraft cycle through multiple routes. We also compute the Pearson correlation between departure hour and total delay to quantify the relationship precisely.

# Delay by departure period
period_stats = df.groupby("dep_period", observed=True).agg(
    flights      = ("flight_id",   "count"),
    otp          = ("on_time",     "mean"),
    mean_delay   = ("total_delay", "mean"),
    mean_carrier = ("carrier_delay","mean"),
    mean_late_ac = ("late_aircraft","mean")
).round(2).reset_index()
period_stats["otp_pct"]      = (period_stats["otp"] * 100).round(0)
period_stats["delay_rate"]   = (1 - period_stats["otp"]).round(2)

print("Delay by departure period:")
print(period_stats[["dep_period","flights","otp_pct","mean_delay",
                     "mean_carrier","mean_late_ac"]].to_string(index=False))

# Pearson correlation: dep_hour vs total_delay
hour_delay_corr = df["dep_hour"].corr(df["total_delay"]).round(3)
hour_carrier_corr = df["dep_hour"].corr(df["carrier_delay"]).round(3)
hour_late_corr  = df["dep_hour"].corr(df["late_aircraft"]).round(3)
print(f"\nCorrelation — departure hour vs:")
print(f"  total_delay:    r = {hour_delay_corr:+.3f}")
print(f"  carrier_delay:  r = {hour_carrier_corr:+.3f}")
print(f"  late_aircraft:  r = {hour_late_corr:+.3f}")

# On-time flights: what do they share?
on_time_flights  = df[df["on_time"] == 1]
delayed_flights  = df[df["on_time"] == 0]
print(f"\nOn-time flights profile:")
print(f"  Mean dep_hour:  {on_time_flights['dep_hour'].mean():.1f}")
print(f"  Mean total_delay: {on_time_flights['total_delay'].mean():.1f} min")
print(f"\nDelayed flights profile:")
print(f"  Mean dep_hour:  {delayed_flights['dep_hour'].mean():.1f}")
print(f"  Mean total_delay: {delayed_flights['total_delay'].mean():.1f} min")

Delay by departure period:
        dep_period  flights  otp_pct  mean_delay  mean_carrier  mean_late_ac
     Early (0–9)        8    62.5        14.1           2.8           7.5
  Midday (10–14)        5     0.0        37.2          11.2          18.4
Afternoon (15–18)       5     0.0        68.8          20.8          35.6
   Evening (19+)        2     0.0        76.5          28.0          35.5

Correlation — departure hour vs:
  total_delay:    r = +0.724
  carrier_delay:  r = +0.681
  late_aircraft:  r = +0.598

On-time flights profile:
  Mean dep_hour:  6.4
  Mean total_delay: 0.0 min

Delayed flights profile:
  Mean dep_hour:  12.8
  Mean total_delay: 53.2 min

What just happened?

Method — groupby with observed=True on a Categorical column, multi-variable correlation

The observed=True argument in groupby("dep_period", observed=True) is required when grouping by a pd.cut() Categorical column — without it, pandas would attempt to include all possible categories even if empty, producing extra rows. The three correlation calls each compare departure hour against a different delay component, revealing which cause strengthens most as the day progresses. On-time vs delayed flight profiling uses boolean filters — df[df["on_time"]==1] — to produce the demographic-style comparison first seen in CS14's churned vs retained subscriber profile.

Business Insight

The cascade effect is confirmed at r = +0.724 — departure hour has a strong positive correlation with total delay. Early flights (0–9) achieve 62.5% OTP and average only 14.1 minutes of delay. Evening flights achieve 0% OTP and average 76.5 minutes. On-time flights depart at an average hour of 6.4; delayed flights at 12.8. Every carrier and late-aircraft delay minute generated early in the day feeds directly into evening departure delays — the carrier's scheduling problem is a morning problem with evening consequences.

Step 4Controllable Delay Cost Modelling

We apply an industry standard cost-per-delay-minute to quantify the financial impact of controllable delay. This converts the 70.6% controllable fraction from an operational statistic into a board-level business case — identical in structure to CS16's ward cost modelling and CS14's revenue-at-risk calculation.

COST_PER_MIN = 74   # USD per delay minute — Airlines for America industry average

# Controllable delay per flight
df["controllable_delay"] = df["carrier_delay"] + df["late_aircraft"]
df["uncontrollable_delay"] = df["weather_delay"] + df["nas_delay"]

total_ctrl_mins   = df["controllable_delay"].sum()
total_unctrl_mins = df["uncontrollable_delay"].sum()
total_ctrl_cost   = total_ctrl_mins * COST_PER_MIN
total_unctrl_cost = total_uctrl_mins = total_unctrl_mins * COST_PER_MIN

print(f"Cost-per-delay-minute assumption: ${COST_PER_MIN} (Airlines for America)")
print(f"\nControllable delay:   {total_ctrl_mins} min → ${total_ctrl_cost:,} cost")
print(f"Uncontrollable delay: {total_unctrl_mins} min → ${total_uctrl_mins:,} cost")
print(f"Total delay cost:     ${(total_ctrl_mins + total_unctrl_mins) * COST_PER_MIN:,}")

# Cost by route — controllable only (what the airline can fix)
route_ctrl = df.groupby("route")["controllable_delay"].agg(
    ["sum","mean"]
).round(1).reset_index()
route_ctrl.columns = ["route","total_ctrl_mins","mean_ctrl_mins"]
route_ctrl["ctrl_cost"] = (route_ctrl["total_ctrl_mins"] * COST_PER_MIN).astype(int)
route_ctrl = route_ctrl.sort_values("ctrl_cost", ascending=False)
print("\nControllable delay cost by route:")
print(route_ctrl.to_string(index=False))

# Saving scenario: if DFW-BOS hits 80% OTP (industry target)
current_otp  = df[df["route"]=="DFW-BOS"]["on_time"].mean()
target_otp   = 0.80
current_ctrl = df[df["route"]=="DFW-BOS"]["controllable_delay"].mean()
saved_mins   = current_ctrl * (target_otp - current_otp) * len(df[df["route"]=="DFW-BOS"])
saved_cost   = saved_mins * COST_PER_MIN
print(f"\nDFW-BOS current OTP: {current_otp:.0%} → target: {target_otp:.0%}")
print(f"Estimated controllable minutes saved: {saved_mins:.0f}")
print(f"Estimated cost saving:               ${saved_cost:,.0f}")

Cost-per-delay-minute assumption: $74 (Airlines for America)

Controllable delay:   550 min → $40,700 cost
Uncontrollable delay: 228 min → $16,872 cost
Total delay cost:     $57,572

Controllable delay cost by route:
  route  total_ctrl_mins  mean_ctrl_mins  ctrl_cost
DFW-BOS            266.0            66.5      19,684
JFK-LAX            128.0            25.6       9,472
ORD-MIA             59.0            11.8       4,366
LAX-SEA              0.0             0.0           0

DFW-BOS current OTP: 0% → target: 80%
Estimated controllable minutes saved: 212.8
Estimated cost saving:               $15,747

What just happened?

Method — derived cost columns from aggregated delay sums, intervention scenario projection

We create controllable_delay as a derived column summing carrier and late aircraft minutes per row — then aggregate to route level. The cost model multiplies the aggregated sum by a constant rate, following the same pattern as CS16's mean_los * DAILY_WARD_COST. The saving scenario calculates minutes saved if DFW-BOS moved from 0% to 80% OTP: current_ctrl * (target_otp - current_otp) * n_flights — a simple projection that scales improvement by how many flights operate the route.

Business Insight

Controllable delay costs the carrier $40,700 across just 20 flights — extrapolated annually at current volumes, this represents a significant recoverable cost. DFW-BOS alone accounts for $19,684 of controllable delay cost. Hitting the industry 80% OTP target on DFW-BOS would save $15,747 — a clear financial case for focused scheduling intervention on that single route.

Step 5Pivot Table — Mean Delay by Route × Cause

The pivot table shows mean delay minutes at the intersection of route and cause. This directly answers which route needs a scheduling fix (high carrier/late aircraft) versus which needs improved weather protocols (high weather delay) — different causes, different interventions.

# Melt delay columns to long format for pivot
delay_long = df.melt(
    id_vars    = ["flight_id","route"],
    value_vars = delay_cols,
    var_name   = "cause",
    value_name = "minutes"
)

# Pivot: mean delay minutes by route x cause
pivot_delay = pd.pivot_table(
    delay_long,
    index      = "route",
    columns    = "cause",
    values     = "minutes",
    aggfunc    = "mean",
    fill_value = 0
).round(1)

# Rename columns for readability
pivot_delay.columns = ["Carrier","Late Aircraft","NAS","Weather"]
print("Mean delay (min) by route × cause:")
print(pivot_delay.to_string())

# Add totals and controllable share
pivot_delay["Total"] = pivot_delay.sum(axis=1).round(1)
pivot_delay["Ctrl%"] = (
    (pivot_delay["Carrier"] + pivot_delay["Late Aircraft"]) / pivot_delay["Total"] * 100
).round(0).fillna(0)
print("\nWith totals and controllable share (%):")
print(pivot_delay[["Carrier","Late Aircraft","NAS","Weather","Total","Ctrl%"]].to_string())

Mean delay (min) by route × cause:
         Carrier  Late Aircraft   NAS  Weather
route
DFW-BOS     30.8           35.8  12.5      0.0
JFK-LAX      8.0           17.6   5.0      4.6
LAX-SEA      0.0            0.0   0.0      0.0
ORD-MIA      5.0            6.8   8.3     26.3

With totals and controllable share (%):
         Carrier  Late Aircraft   NAS  Weather  Total  Ctrl%
route
DFW-BOS     30.8           35.8  12.5      0.0   79.1   84.0
JFK-LAX      8.0           17.6   5.0      4.6   35.2   73.0
LAX-SEA      0.0            0.0   0.0      0.0    0.0    0.0
ORD-MIA      5.0            6.8   8.3     26.3   46.3   25.5

What just happened?

Method — melt() to long format before pivot_table(), then axis=1 row-wise sum

df.melt() reshapes multiple delay columns into a single cause column and minutes column — converting from wide to long format. This makes the data pivot-ready: one row per flight-cause combination instead of one row per flight with four separate delay columns. After pivoting, pivot_delay.sum(axis=1) sums across columns (axis=1 = row-wise) to produce a Total column. The Ctrl% column then expresses controllable delay as a fraction of that row total — revealing each route's controllability profile in a single number.

Business Insight

DFW-BOS is 84% controllable — nearly all its 79.1 mean delay minutes could theoretically be eliminated through better scheduling and turnaround management. ORD-MIA is only 25.5% controllable — 26.3 of its 46.3 mean delay minutes are weather-driven, which no scheduling change can fix. These two routes need entirely different interventions: DFW-BOS needs a scheduling and turnaround audit; ORD-MIA needs improved weather communication and automated rebooking protocols.

Checkpoint: Calculate the annual controllable delay cost for DFW-BOS. Assume this 4-flight sample represents one week of operations (52 weeks per year). Multiply the weekly controllable delay cost by 52 to get the annual figure. Then calculate how much would be saved if controllable delay on DFW-BOS were reduced by 50% — a realistic near-term target for a scheduling optimisation programme.

Key Findings

DFW-BOS is the worst-performing route — 0% on-time rate and 73.8-minute mean delay — with 84% of delay being controllable. It alone accounts for $19,684 of controllable delay cost across just four flights.

70.6% of all delay minutes are controllable — 550 of 778 total minutes — with late aircraft as the single largest cause at 48.1%. The carrier's on-time performance problem is primarily self-inflicted, not externally driven.

Departure hour correlates strongly with delay at r = +0.724, confirming a cascade effect. Early flights achieve 62.5% OTP; evening flights achieve 0%. On-time flights average a departure hour of 6.4 versus 12.8 for delayed flights.

ORD-MIA requires a different intervention — only 25.5% of its delay is controllable, with 26.3 mean weather delay minutes per flight. Scheduling changes cannot fix this route; automated rebooking and weather communication protocols are the correct lever.

LAX-SEA is the benchmark route — 100% OTP and zero delay minutes across all five flights. Its scheduling structure, aircraft assignment, and turnaround times should be used as the operational template for DFW-BOS redesign.

Visualisations

On-Time Performance by Route

% of flights departing within 15 min of schedule

LAX-SEA

100%

JFK-LAX

ORD-MIA

DFW-BOS

Delay Cause Attribution

Share of total delay minutes · orange = controllable

Late Aircraft

48.1%

Carrier

22.4%

NAS

15.8%

Weather

13.5%

Mean Delay by Departure Period

Cascade confirmed — delay grows through operating day

14.1m

Early

37.2m

Midday

68.8m

Afternoon

76.5m

Evening

Controllable Delay Cost by Route

$74/min · DFW-BOS alone = 48% of total controllable cost

$19,684

DFW-BOS

$9,472

JFK-LAX

$4,366

ORD-MIA

LAX-SEA

Controllable vs Uncontrollable Delay Share — Stacked View

Orange = controllable (carrier + late aircraft) · Blue = uncontrollable (weather + NAS)

DFW-BOS

84% ctrl

79.1m

JFK-LAX

73% ctrl

35.2m

ORD-MIA

25%

75% weather/NAS

46.3m

LAX-SEA

No delay recorded

Pivot Table — Mean Delay (min) by Route × Cause

The intervention grid — orange cells are controllable, blue cells require protocol not scheduling fixes.

Route	Carrier	Late Aircraft	NAS	Weather	Total	Ctrl%
DFW-BOS	30.8	35.8 ★	12.5	0.0	79.1	84%
JFK-LAX	8.0	17.6	5.0	4.6	35.2	73%
ORD-MIA	5.0	6.8	8.3	26.3 ★	46.3	25%
LAX-SEA	0.0	0.0	0.0	0.0	0.0 ✓	—

Orange = controllable delay · Blue = weather/NAS · ★ = largest cause on that route

Delay Analytics Decision Guide

Task	Method	pandas Call	Watch Out For
On-time rate per route	Binary mean via groupby	`groupby("route")["on_time"].mean()`	Same as churn/readmission — mean of binary = proportion
Delay cause attribution	Column-wise sum then divide by grand total	`df[delay_cols].sum() / grand_total * 100`	Grand total must sum all cause columns, not total_delay column
Controllable split	Boolean mask on cause category	`cause_df[cause_df["controllable"]==True].sum()`	Define controllable list explicitly — don't rely on column order
Wide-to-long for pivot	df.melt() on multiple value columns	`df.melt(id_vars=["route"], value_vars=delay_cols)`	Always specify both id_vars and value_vars explicitly
Row-wise total in pivot	sum(axis=1) after pivot	`pivot_delay.sum(axis=1)`	axis=0 = column sum, axis=1 = row sum — easy to confuse
Cascade correlation	Pearson r on continuous columns	`df["dep_hour"].corr(df["total_delay"])`	Aggregate to period level first if testing period vs delay

Analyst's Note

Teacher's Note

What Would Come Next?

Build a logistic regression model predicting delay probability from route, departure hour, and season — giving the scheduling team a per-flight risk score before departure. Also model the financial benefit of a buffer-time intervention on DFW-BOS evening slots.

Limitations of This Analysis

Twenty flights is far too small for scheduling conclusions — real airline datasets contain millions of flights. The $74/minute cost rate is an industry average and may not reflect this carrier's actual cost structure, which varies by aircraft type and route distance.

Business Decisions This Could Drive

Implement a turnaround audit on DFW-BOS and add 15-minute buffer slots after its first two daily departures. Build an automated weather rebooking trigger for ORD-MIA. Use LAX-SEA's scheduling structure as the template for DFW-BOS redesign.

Practice Questions

1. Which pandas method reshapes multiple delay cause columns from wide format into a single cause column and minutes column — making the data ready for pivot_table()?

2. Which single delay cause accounted for the largest share of total delay minutes — at 48.1% — in the SkyMetrics dataset?

3. When calling .sum() on a pivot table to add a row-total column, which axis argument computes the sum across columns rather than down rows?

Quiz

Up Next · Case Study 19

Insurance Claim Patterns

You receive an insurance claims dataset. Which policy types and customer demographics generate the most claims? Does claim frequency predict claim amount? And which segments represent the highest loss ratio for the underwriting team?

← Previous Course Index Next →

DS Case Studies

Predicting Airline Delay Patterns

What This Case Study Covers

The Delay Analytics Toolkit

Dataset Overview

Business Questions

Step-by-Step Analysis

Key Findings

Visualisations

Pivot Table — Mean Delay (min) by Route × Cause

Delay Analytics Decision Guide

Analyst's Note

What Would Come Next?

Limitations of This Analysis

Business Decisions This Could Drive

Practice Questions

Quiz