EDA Course
Time-Based EDA
The moment your dataset has a date column, the entire analysis changes. Questions shift from "what's the average?" to "is it growing?", "does it spike in December?", "did something break in March?" Time is a dimension — and ignoring it means missing the most important patterns in your data.
What Time-Based EDA Actually Looks For
When you have time-series data, you're looking for four things:
📈 Trend
Is the overall direction going up, down, or flat over time?
🔄 Seasonality
Does it repeat a pattern every week, month, or year?
⚡ Anomalies
Are there sudden spikes or drops that don't fit the pattern?
💥 Structural Breaks
Did something change permanently at a specific date — a new product, a marketing campaign, a crisis?
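These four components can be made concrete with a toy sketch (all numbers synthetic): a rolling mean smooths out the seasonal cycle to isolate the trend, and whatever remains in the residual exposes the seasonality and any anomalies.

```python
import numpy as np
import pandas as pd

# Synthetic series: upward trend + yearly seasonality + one injected anomaly
idx = pd.date_range('2020-01-01', periods=36, freq='MS')
trend = np.arange(36) * 10.0                      # steady upward drift
season = 50 * np.sin(2 * np.pi * idx.month / 12)  # repeats every 12 months
y = pd.Series(100 + trend + season, index=idx)
y.iloc[20] += 300                                 # anomaly injected in Sep 2021

# A 12-month centred rolling mean averages over a full seasonal cycle,
# leaving (roughly) the trend; the residual exposes seasonality + anomalies
rolling = y.rolling(12, center=True).mean()
residual = y - rolling
print(residual.abs().idxmax())   # → 2021-09-01, the injected anomaly
```

Real decompositions usually use a dedicated tool (e.g. statsmodels' seasonal_decompose), but the rolling-mean intuition behind them is the same.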
The Dataset We'll Use
The scenario: You're a data analyst at a subscription software company. The CEO wants a time-based review of the past 18 months of revenue before the board meeting next week. He's heard rumours of a churn spike earlier in the year and wants to know if it's in the data — and whether the business is genuinely growing or just recovering. You have monthly revenue, new subscribers, and churned subscribers from January 2023 to June 2024.
import pandas as pd
import numpy as np
# 18 months of subscription business data — Jan 2023 to Jun 2024
df = pd.DataFrame({
'date': pd.date_range(start='2023-01-01', periods=18, freq='MS'),
# freq='MS' = Month Start — creates the 1st of each month
'revenue': [42000, 43500, 45200, 44800, 47100, 46500,
48200, 50100, 51800, 38200, 39500, 45600,
52100, 54300, 56800, 57200, 59100, 61400],
# Note the dip around rows 9-11 — October to December 2023
'new_subs': [120, 135, 142, 128, 155, 148,
162, 171, 168, 95, 102, 138,
175, 182, 191, 188, 196, 204],
'churned_subs': [28, 31, 29, 35, 32, 38,
41, 39, 44, 89, 92, 45,
38, 35, 31, 29, 27, 24]
# Big churn spike: October and November 2023
})
# Always make date the index for time-series analysis
df = df.set_index('date')
print(df)
            revenue  new_subs  churned_subs
date
2023-01-01    42000       120            28
2023-02-01    43500       135            31
2023-03-01    45200       142            29
2023-04-01    44800       128            35
2023-05-01    47100       155            32
2023-06-01    46500       148            38
2023-07-01    48200       162            41
2023-08-01    50100       171            39
2023-09-01    51800       168            44
2023-10-01    38200        95            89
2023-11-01    39500       102            92
2023-12-01    45600       138            45
2024-01-01    52100       175            38
2024-02-01    54300       182            35
2024-03-01    56800       191            31
2024-04-01    57200       188            29
2024-05-01    59100       196            27
2024-06-01    61400       204            24
What just happened?
pandas' pd.date_range() generates a sequence of dates automatically — we just give it a start date, number of periods, and frequency. Setting the date column as the index with .set_index('date') is the standard first step for any time-series analysis in pandas — it unlocks time-based slicing, resampling, and rolling calculations. Even from the raw table, the October–November churn spike (89, 92) is visible immediately.
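One thing the DatetimeIndex unlocks immediately is partial-string slicing — selecting rows by year-month strings instead of constructing Timestamp objects. A minimal sketch, using a cut-down version of the lesson's revenue figures:

```python
import pandas as pd

# Four months of the lesson's revenue figures, date-indexed
s = pd.DataFrame(
    {'revenue': [51800, 38200, 39500, 45600]},
    index=pd.date_range('2023-09-01', periods=4, freq='MS')
)

# .loc with partial date strings slices inclusively by period —
# '2023-12' as the endpoint includes all of December 2023
q4 = s.loc['2023-10':'2023-12']
print(q4['revenue'].tolist())   # → [38200, 39500, 45600]
```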
Step 1 — Overall Trend: Is the Business Growing?
The scenario: The CEO's first question is simple: "Are we growing?" But a naive answer — "yes, revenue is higher in June 2024 than January 2023" — misses the story. You need to quantify the growth rate, identify the dip period, and compare the first half vs second half of the observation window. Start with summary statistics split by year.
# .resample('YE') groups data by year-end — gives annual totals/means
# 'YE' = Year End. Use 'ME' for month-end, 'QE' for quarter-end
annual = df.resample('YE').agg({
'revenue': ['sum', 'mean', 'min', 'max'],
'new_subs': 'sum',
'churned_subs': 'sum'
}).round(0)
print("=== ANNUAL SUMMARY ===\n")
print(annual)
print()
# Overall growth rate from first to last month
first_rev = df['revenue'].iloc[0]
last_rev = df['revenue'].iloc[-1]
growth_pct = (last_rev - first_rev) / first_rev * 100
print(f"Revenue Jan 2023: £{first_rev:,}")
print(f"Revenue Jun 2024: £{last_rev:,}")
print(f"Total growth: {growth_pct:.1f}% over 18 months")
=== ANNUAL SUMMARY ===
           revenue                        new_subs churned_subs
               sum     mean    min    max      sum          sum
date
2023-12-31  542500  45208.0  38200  51800     1664          543
2024-12-31  340900  56817.0  52100  61400     1136          184
Revenue Jan 2023: £42,000
Revenue Jun 2024: £61,400
Total growth: 46.2% over 18 months
What just happened?
pandas' .resample() is the time-series equivalent of .groupby() — it groups rows by a time period instead of a category. 'YE' means year-end grouping. We chain .agg() after it exactly as with groupby, getting multiple statistics per year in one step.
The board-ready answer: yes, growing — 46.2% revenue growth in 18 months. (Note that the 2024 row covers only six months, so compare the means, not the sums.) But 2023's minimum monthly revenue (£38,200) vs maximum (£51,800) shows a large within-year range — that dip is real. 2024 is notably better with a floor of £52,100 and still climbing. The business recovered and accelerated.
Step 2 — Finding the Churn Anomaly
The scenario: The CEO heard about a "churn spike" sometime last year. You need to find it precisely — not just confirm it happened, but quantify how bad it was relative to normal, how long it lasted, and what the revenue cost was. He needs those numbers for the board, not vague descriptions.
# Calculate what "normal" looks like using a rolling median
# Rolling median: for each month, take the median of the surrounding months
# window=5, center=True means: this month plus two neighbours on either side
# A median (unlike a mean) barely moves when the spike itself is in the window
df['churn_rolling_median'] = df['churned_subs'].rolling(window=5, center=True).median()
# Flag months where churn was more than 1.5× the rolling median — anomalies
df['churn_anomaly'] = df['churned_subs'] > df['churn_rolling_median'] * 1.5
print("=== CHURN ANALYSIS ===\n")
print(df[['churned_subs', 'churn_rolling_median', 'churn_anomaly']].to_string())
print()
# Quantify the cost of the anomaly
# The rolling median through the spike is 45 — use that as "normal" churn
normal_churn_estimate = 45
anomaly_months = df[df['churn_anomaly']]
extra_churns = (anomaly_months['churned_subs'] - normal_churn_estimate).sum()
# Assume ~£50 average monthly revenue per subscriber — a rough placeholder;
# in practice this figure would come from billing data
avg_rev_per_sub = 50
revenue_lost = extra_churns * avg_rev_per_sub * 12  # annualised cost of those extra churns
print(f"Anomaly months: {list(anomaly_months.index.strftime('%b %Y'))}")
print(f"Extra churns above normal: {int(extra_churns)}")
print(f"Estimated annual revenue at risk: ~£{revenue_lost:,.0f}")
=== CHURN ANALYSIS ===
            churned_subs  churn_rolling_median  churn_anomaly
date
2023-01-01            28                   NaN          False
2023-02-01            31                   NaN          False
2023-03-01            29                  31.0          False
2023-04-01            35                  32.0          False
2023-05-01            32                  35.0          False
2023-06-01            38                  38.0          False
2023-07-01            41                  39.0          False
2023-08-01            39                  41.0          False
2023-09-01            44                  44.0          False
2023-10-01            89                  45.0           True
2023-11-01            92                  45.0           True
2023-12-01            45                  45.0          False
2024-01-01            38                  38.0          False
2024-02-01            35                  35.0          False
2024-03-01            31                  31.0          False
2024-04-01            29                  29.0          False
2024-05-01            27                   NaN          False
2024-06-01            24                   NaN          False
Anomaly months: ['Oct 2023', 'Nov 2023']
Extra churns above normal: 91
Estimated annual revenue at risk: ~£54,600
What just happened?
pandas' .rolling(window=5, center=True).median() calculates a moving median — for each row it takes the median of the five surrounding values (two months before, the month itself, and two after). center=True centres the window, and using the median rather than the mean matters here: a mean would be dragged upward by the spike itself, so the spike would never look extreme against its own inflated baseline. The median barely moves, keeping "normal" honest even during the anomaly. The NaN values at the ends are expected — there aren't enough neighbours to compute a centred window there, and a comparison against NaN simply evaluates to False.
The analysis is now board-ready: the churn spike was October and November 2023, 91 more cancellations than normal over those two months, representing roughly £55,000 of annual recurring revenue at risk (assuming ~£50 of monthly revenue per subscriber). That's a specific, quantified answer — exactly what the CEO needed.
Step 3 — Month-over-Month Growth Rate
The scenario: The board will also ask: "Is momentum accelerating or slowing?" Raw revenue numbers don't tell you this. You need the month-over-month percentage change — and a way to see whether the growth rate itself is trending up or down. A business growing at 5% each month is very different from one that grew 20% once and has since stalled at 1%.
# .pct_change() calculates the % change from one row to the next
# For monthly data this gives month-over-month growth rate
df['rev_mom_pct'] = df['revenue'].pct_change() * 100 # multiply by 100 for percentage
# Rolling mean of the growth rate — smooths out noise to reveal the trend in momentum
df['momentum_3m'] = df['rev_mom_pct'].rolling(window=3).mean().round(2)
print("=== MONTH-OVER-MONTH REVENUE GROWTH ===\n")
print(f"{'Month':<15} {'Revenue':>10} {'MoM %':>8} {'3M Momentum':>13} Signal")
print("─" * 60)
for date, row in df.iterrows():
    if pd.isna(row['rev_mom_pct']):
        continue
    mom = row['rev_mom_pct']
    mom_str = f"{mom:+.1f}%"
    mom3 = row['momentum_3m']
    mom3_str = f"{mom3:+.2f}%" if pd.notna(mom3) else "—"
    sig = "▲ Growing" if mom > 1 else "▼ Declining" if mom < -1 else "→ Flat"
    # Highlight the anomaly months
    flag = " ← CHURN IMPACT" if date.strftime('%Y-%m') in ['2023-10', '2023-11'] else ""
    print(f" {date.strftime('%b %Y'):<13} £{row['revenue']:>8,.0f} {mom_str:>8} {mom3_str:>13} {sig}{flag}")
=== MONTH-OVER-MONTH REVENUE GROWTH ===
Month              Revenue    MoM %   3M Momentum Signal
────────────────────────────────────────────────────────────
 Feb 2023      £  43,500    +3.6%             — ▲ Growing
 Mar 2023      £  45,200    +3.9%             — ▲ Growing
 Apr 2023      £  44,800    -0.9%        +2.20% → Flat
 May 2023      £  47,100    +5.1%        +2.72% ▲ Growing
 Jun 2023      £  46,500    -1.3%        +0.99% ▼ Declining
 Jul 2023      £  48,200    +3.7%        +2.51% ▲ Growing
 Aug 2023      £  50,100    +3.9%        +2.11% ▲ Growing
 Sep 2023      £  51,800    +3.4%        +3.66% ▲ Growing
 Oct 2023      £  38,200   -26.3%        -6.31% ▼ Declining ← CHURN IMPACT
 Nov 2023      £  39,500    +3.4%        -6.49% ▲ Growing ← CHURN IMPACT
 Dec 2023      £  45,600   +15.4%        -2.47% ▲ Growing
 Jan 2024      £  52,100   +14.3%       +11.03% ▲ Growing
 Feb 2024      £  54,300    +4.2%       +11.31% ▲ Growing
 Mar 2024      £  56,800    +4.6%        +7.69% ▲ Growing
 Apr 2024      £  57,200    +0.7%        +3.18% → Flat
 May 2024      £  59,100    +3.3%        +2.88% ▲ Growing
 Jun 2024      £  61,400    +3.9%        +2.64% ▲ Growing
What just happened?
pandas' .pct_change() divides each value by the previous one and subtracts 1 — giving the fractional change from month to month. Multiplying by 100 converts it to a percentage. The first row is always NaN (no previous row to compare against).
The October 2023 drop of −26.3% is the sharpest decline in the dataset — the churn event was clearly catastrophic that month. But December and January both show double-digit recovery growth (+15.4%, +14.3%), and by 2024 the business is back to consistent 3–5% monthly growth. This is a recovery story, not a collapse — and that distinction matters enormously for the board presentation.
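One compact way to summarise momentum for the board is the compound monthly growth rate (CMGR): the single steady rate that would turn the first month's revenue into the last month's. A sketch using the endpoint figures from the dataset:

```python
# Compound monthly growth rate: the steady rate r with first * (1+r)^n = last
first, last = 42000, 61400   # Jan 2023 and Jun 2024 revenue
n_steps = 17                 # 17 month-over-month steps across 18 months

cmgr = (last / first) ** (1 / n_steps) - 1
print(f"CMGR: {cmgr * 100:.2f}% per month")   # ≈ 2.26% per month
```

Unlike a plain average of the MoM percentages, the CMGR is unaffected by how the growth was distributed across months — a useful cross-check against a single dramatic swing like October 2023.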
Step 4 — Net Subscriber Growth Over Time
The scenario: Revenue is a lagging indicator — it reflects what happened, not what's coming. The CEO wants a leading indicator: net subscriber growth (new subscriptions minus cancellations). If net subs are growing, revenue will follow. If net subs are declining, trouble is coming regardless of what revenue looks like today.
# Net subscriber growth = new subscribers minus churned subscribers
df['net_subs'] = df['new_subs'] - df['churned_subs']
# Cumulative sum: how many net subscribers have we accumulated since month 1?
# .cumsum() adds each month's net to the running total
df['cumulative_subs'] = df['net_subs'].cumsum()
print("=== NET SUBSCRIBER GROWTH ===\n")
print(f"{'Month':<15} {'New':>6} {'Churned':>8} {'Net':>6} {'Cumulative':>12} Health")
print("─" * 58)
for date, row in df.iterrows():
net = int(row['net_subs'])
health = "✓ Growing" if net > 60 else "⚠ Weak" if net < 30 else "→ OK"
flag = " ← CRISIS" if date.strftime('%Y-%m') in ['2023-10', '2023-11'] else ""
print(f" {date.strftime('%b %Y'):<13} {int(row['new_subs']):>6} {int(row['churned_subs']):>8} "
f"{net:>+6} {int(row['cumulative_subs']):>12} {health}{flag}")
=== NET SUBSCRIBER GROWTH ===
Month             New  Churned    Net   Cumulative Health
──────────────────────────────────────────────────────────
 Jan 2023         120       28    +92           92 ✓ Growing
 Feb 2023         135       31   +104          196 ✓ Growing
 Mar 2023         142       29   +113          309 ✓ Growing
 Apr 2023         128       35    +93          402 ✓ Growing
 May 2023         155       32   +123          525 ✓ Growing
 Jun 2023         148       38   +110          635 ✓ Growing
 Jul 2023         162       41   +121          756 ✓ Growing
 Aug 2023         171       39   +132          888 ✓ Growing
 Sep 2023         168       44   +124         1012 ✓ Growing
 Oct 2023          95       89     +6         1018 ⚠ Weak ← CRISIS
 Nov 2023         102       92    +10         1028 ⚠ Weak ← CRISIS
 Dec 2023         138       45    +93         1121 ✓ Growing
 Jan 2024         175       38   +137         1258 ✓ Growing
 Feb 2024         182       35   +147         1405 ✓ Growing
 Mar 2024         191       31   +160         1565 ✓ Growing
 Apr 2024         188       29   +159         1724 ✓ Growing
 May 2024         196       27   +169         1893 ✓ Growing
 Jun 2024         204       24   +180         2073 ✓ Growing
What just happened?
pandas' .cumsum() computes a running total — each row adds its value to the sum of all previous rows. For subscriber growth, this shows the total accumulated subscriber count over time. It never goes down unless net subs go negative.
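That last caveat is easy to demonstrate — a cumulative sum dips exactly where the underlying net value goes negative. A tiny sketch with hypothetical net figures:

```python
import pandas as pd

# Hypothetical net-subscriber months: one of them negative
net = pd.Series([50, 30, -20, 10])
print(net.cumsum().tolist())   # → [50, 80, 60, 70] — the dip mirrors the -20
```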
The picture is now complete. October and November were +6 and +10 net subs — barely growing. But crucially, the business never went negative. And the 2024 acceleration is clear: net subs in June 2024 (+180) are nearly double January 2023's (+92) — a 96% increase. The business didn't just recover — it's growing faster than it ever was before the crisis. That's the narrative for the board.
The Time-Based Summary Chart
Here's a visual of the revenue trend with the anomaly period highlighted — the chart the CEO will see in the board deck.
Monthly Revenue — Jan 2023 to Jun 2024
Teacher's Note
Always set the date column as the index before any time-series analysis. It unlocks .resample(), .rolling(), date-based slicing like df.loc['2023'], and many other time-specific methods that don't work on a regular numeric index.
And when you find an anomaly — a spike, a crash, a structural break — your job isn't just to flag it. It's to quantify it: how many months did it last, how big was it relative to normal, and what did it cost? Numbers without context are noise. Context without numbers is opinion. You need both.
Practice Questions
1. Which pandas method groups time-series data by a time period (year, month, quarter) — the time-based equivalent of groupby?
2. Which pandas method calculates the percentage change from one row to the next — giving you month-over-month growth rates?
3. Which pandas method computes a running total — adding each row's value to the sum of all previous rows?
Quiz
1. Why should you set the date column as the DataFrame index before doing time-series analysis?
2. You want to flag months where churn was unusually high compared to the surrounding months. Which technique is most appropriate?
3. Revenue is a lagging indicator — it reflects the past. Which metric gives an earlier warning of future revenue health?
Up Next · Lesson 32
Handling Duplicates
Duplicate rows are sneaky — they inflate counts, distort averages, and cause models to overtrain on repeated data. Learn to find every type of duplicate and decide what to do about each one.