EDA Course
Time-Based EDA
The moment your dataset has a date column, the entire analysis changes. Questions shift from "what's the average?" to "is it growing?", "does it spike in December?", "did something break in March?" Time is a dimension — and ignoring it means missing the most important patterns in your data.
What Time-Based EDA Actually Looks For
When you have time-series data, you're looking for four things:
📈 Trend
Is the overall direction going up, down, or flat over time?
🔄 Seasonality
Does it repeat a pattern every week, month, or year?
⚡ Anomalies
Are there sudden spikes or drops that don't fit the pattern?
💥 Structural Breaks
Did something change permanently at a specific date — a new product, a marketing campaign, a crisis?
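These four components can be made concrete with a toy sketch (all numbers synthetic): a rolling mean smooths out the seasonal cycle to isolate the trend, and whatever remains in the residual exposes the seasonality and any anomalies.

```python
import numpy as np
import pandas as pd

# Synthetic series: upward trend + yearly seasonality + one injected anomaly
idx = pd.date_range('2020-01-01', periods=36, freq='MS')
trend = np.arange(36) * 10.0                      # steady upward drift
season = 50 * np.sin(2 * np.pi * idx.month / 12)  # repeats every 12 months
y = pd.Series(100 + trend + season, index=idx)
y.iloc[20] += 300                                 # anomaly injected in Sep 2021

# A 12-month centred rolling mean averages over a full seasonal cycle,
# leaving (roughly) the trend; the residual exposes seasonality + anomalies
rolling = y.rolling(12, center=True).mean()
residual = y - rolling
print(residual.abs().idxmax())   # → 2021-09-01, the injected anomaly
```

Real decompositions usually use a dedicated tool (e.g. statsmodels' seasonal_decompose), but the rolling-mean intuition behind them is the same.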
The Dataset We'll Use
The scenario: You're a data analyst at a subscription software company. The CEO wants a time-based review of the past 18 months of revenue before the board meeting next week. He's heard rumours of a churn spike earlier in the year and wants to know if it's in the data — and whether the business is genuinely growing or just recovering. You have monthly revenue, new subscribers, and churned subscribers from January 2023 to June 2024.
import pandas as pd
import numpy as np
# 18 months of subscription business data — Jan 2023 to Jun 2024
df = pd.DataFrame({
'date': pd.date_range(start='2023-01-01', periods=18, freq='MS'),
# freq='MS' = Month Start — creates the 1st of each month
'revenue': [42000, 43500, 45200, 44800, 47100, 46500,
48200, 50100, 51800, 38200, 39500, 45600,
52100, 54300, 56800, 57200, 59100, 61400],
# Note the dip around rows 9-11 — October to December 2023
'new_subs': [120, 135, 142, 128, 155, 148,
162, 171, 168, 95, 102, 138,
175, 182, 191, 188, 196, 204],
'churned_subs': [28, 31, 29, 35, 32, 38,
41, 39, 44, 89, 92, 45,
38, 35, 31, 29, 27, 24]
# Big churn spike: October and November 2023
})
# Always make date the index for time-series analysis
df = df.set_index('date')
print(df)
            revenue  new_subs  churned_subs
date
2023-01-01    42000       120            28
2023-02-01    43500       135            31
2023-03-01    45200       142            29
2023-04-01    44800       128            35
2023-05-01    47100       155            32
2023-06-01    46500       148            38
2023-07-01    48200       162            41
2023-08-01    50100       171            39
2023-09-01    51800       168            44
2023-10-01    38200        95            89
2023-11-01    39500       102            92
2023-12-01    45600       138            45
2024-01-01    52100       175            38
2024-02-01    54300       182            35
2024-03-01    56800       191            31
2024-04-01    57200       188            29
2024-05-01    59100       196            27
2024-06-01    61400       204            24
What just happened?
pandas' pd.date_range() generates a sequence of dates automatically — we just give it a start date, number of periods, and frequency. Setting the date column as the index with .set_index('date') is the standard first step for any time-series analysis in pandas — it unlocks time-based slicing, resampling, and rolling calculations. Even from the raw table, the October–November churn spike (89, 92) is visible immediately.
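One thing the DatetimeIndex unlocks immediately is partial-string slicing — selecting rows by year-month strings instead of constructing Timestamp objects. A minimal sketch, using a cut-down version of the lesson's revenue figures:

```python
import pandas as pd

# Four months of the lesson's revenue figures, date-indexed
s = pd.DataFrame(
    {'revenue': [51800, 38200, 39500, 45600]},
    index=pd.date_range('2023-09-01', periods=4, freq='MS')
)

# .loc with partial date strings slices inclusively by period —
# '2023-12' as the endpoint includes all of December 2023
q4 = s.loc['2023-10':'2023-12']
print(q4['revenue'].tolist())   # → [38200, 39500, 45600]
```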
Step 1 — Overall Trend: Is the Business Growing?
The scenario: The CEO's first question is simple: "Are we growing?" But a naive answer — "yes, revenue is higher in June 2024 than January 2023" — misses the story. You need to quantify the growth rate, identify the dip period, and compare the first half vs second half of the observation window. Start with summary statistics split by year.
# .resample('YE') groups data by year-end — gives annual totals/means
# 'YE' = Year End. Use 'ME' for month-end, 'QE' for quarter-end
annual = df.resample('YE').agg({
'revenue': ['sum', 'mean', 'min', 'max'],
'new_subs': 'sum',
'churned_subs': 'sum'
}).round(0)
print("=== ANNUAL SUMMARY ===\n")
print(annual)
print()
# Overall growth rate from first to last month
first_rev = df['revenue'].iloc[0]
last_rev = df['revenue'].iloc[-1]
growth_pct = (last_rev - first_rev) / first_rev * 100
print(f"Revenue Jan 2023: £{first_rev:,}")
print(f"Revenue Jun 2024: £{last_rev:,}")
print(f"Total growth: {growth_pct:.1f}% over 18 months")
=== ANNUAL SUMMARY ===
           revenue                        new_subs churned_subs
               sum     mean    min    max      sum          sum
date
2023-12-31  542500  45208.0  38200  51800     1664          543
2024-12-31  340900  56817.0  52100  61400     1136          184
Revenue Jan 2023: £42,000
Revenue Jun 2024: £61,400
Total growth: 46.2% over 18 months
What just happened?
pandas' .resample() is the time-series equivalent of .groupby() — it groups rows by a time period instead of a category. 'YE' means year-end grouping. We chain .agg() after it exactly as with groupby, getting multiple statistics per year in one step.
The board-ready answer: yes, growing — 46.2% revenue growth in 18 months. (Note that the 2024 row covers only six months, so compare the means, not the sums.) But 2023's minimum monthly revenue (£38,200) vs maximum (£51,800) shows a large within-year range — that dip is real. 2024 is notably better with a floor of £52,100 and still climbing. The business recovered and accelerated.
Step 2 — Finding the Churn Anomaly
The scenario: The CEO heard about a "churn spike" sometime last year. You need to find it precisely — not just confirm it happened, but quantify how bad it was relative to normal, how long it lasted, and what the revenue cost was. He needs those numbers for the board, not vague descriptions.
# Calculate what "normal" looks like using a rolling median
# Rolling median: for each month, take the median of the surrounding months
# window=5, center=True means: this month plus two neighbours on either side
# A median (unlike a mean) barely moves when the spike itself is in the window
df['churn_rolling_median'] = df['churned_subs'].rolling(window=5, center=True).median()
# Flag months where churn was more than 1.5× the rolling median — anomalies
df['churn_anomaly'] = df['churned_subs'] > df['churn_rolling_median'] * 1.5
print("=== CHURN ANALYSIS ===\n")
print(df[['churned_subs', 'churn_rolling_median', 'churn_anomaly']].to_string())
print()
# Quantify the cost of the anomaly
# The rolling median through the spike is 45 — use that as "normal" churn
normal_churn_estimate = 45
anomaly_months = df[df['churn_anomaly']]
extra_churns = (anomaly_months['churned_subs'] - normal_churn_estimate).sum()
# Assume ~£50 average monthly revenue per subscriber — a rough placeholder;
# in practice this figure would come from billing data
avg_rev_per_sub = 50
revenue_lost = extra_churns * avg_rev_per_sub * 12  # annualised cost of those extra churns
print(f"Anomaly months: {list(anomaly_months.index.strftime('%b %Y'))}")
print(f"Extra churns above normal: {int(extra_churns)}")
print(f"Estimated annual revenue at risk: ~£{revenue_lost:,.0f}")
=== CHURN ANALYSIS ===
            churned_subs  churn_rolling_median  churn_anomaly
date
2023-01-01            28                   NaN          False
2023-02-01            31                   NaN          False
2023-03-01            29                  31.0          False
2023-04-01            35                  32.0          False
2023-05-01            32                  35.0          False
2023-06-01            38                  38.0          False
2023-07-01            41                  39.0          False
2023-08-01            39                  41.0          False
2023-09-01            44                  44.0          False
2023-10-01            89                  45.0           True
2023-11-01            92                  45.0           True
2023-12-01            45                  45.0          False
2024-01-01            38                  38.0          False
2024-02-01            35                  35.0          False
2024-03-01            31                  31.0          False
2024-04-01            29                  29.0          False
2024-05-01            27                   NaN          False
2024-06-01            24                   NaN          False
Anomaly months: ['Oct 2023', 'Nov 2023']
Extra churns above normal: 91
Estimated annual revenue at risk: ~£54,600
What just happened?
pandas' .rolling(window=5, center=True).median() calculates a moving median — for each row it takes the median of the five surrounding values (two months before, the month itself, and two after). center=True centres the window, and using the median rather than the mean matters here: a mean would be dragged upward by the spike itself, so the spike would never look extreme against its own inflated baseline. The median barely moves, keeping "normal" honest even during the anomaly. The NaN values at the ends are expected — there aren't enough neighbours to compute a centred window there, and a comparison against NaN simply evaluates to False.
The analysis is now board-ready: the churn spike was October and November 2023, 91 more cancellations than normal over those two months, representing roughly £55,000 of annual recurring revenue at risk (assuming ~£50 of monthly revenue per subscriber). That's a specific, quantified answer — exactly what the CEO needed.
Step 3 — Month-over-Month Growth Rate
The scenario: The board will also ask: "Is momentum accelerating or slowing?" Raw revenue numbers don't tell you this. You need the month-over-month percentage change — and a way to see whether the growth rate itself is trending up or down. A business growing at 5% each month is very different from one that grew 20% once and has since stalled at 1%.
# .pct_change() calculates the % change from one row to the next
# For monthly data this gives month-over-month growth rate
df['rev_mom_pct'] = df['revenue'].pct_change() * 100 # multiply by 100 for percentage
# Rolling mean of the growth rate — smooths out noise to reveal the trend in momentum
df['momentum_3m'] = df['rev_mom_pct'].rolling(window=3).mean().round(2)
print("=== MONTH-OVER-MONTH REVENUE GROWTH ===\n")
print(f"{'Month':<15} {'Revenue':>10} {'MoM %':>8} {'3M Momentum':>13} Signal")
print("─" * 60)
for date, row in df.iterrows():
    if pd.isna(row['rev_mom_pct']):
        continue
    mom = row['rev_mom_pct']
    mom_str = f"{mom:+.1f}%"
    mom3 = row['momentum_3m']
    mom3_str = f"{mom3:+.2f}%" if pd.notna(mom3) else "—"
    sig = "▲ Growing" if mom > 1 else "▼ Declining" if mom < -1 else "→ Flat"
    # Highlight the anomaly months
    flag = " ← CHURN IMPACT" if date.strftime('%Y-%m') in ['2023-10', '2023-11'] else ""
    print(f" {date.strftime('%b %Y'):<13} £{row['revenue']:>8,.0f} {mom_str:>8} {mom3_str:>13} {sig}{flag}")
=== MONTH-OVER-MONTH REVENUE GROWTH ===
Month              Revenue    MoM %   3M Momentum Signal
────────────────────────────────────────────────────────────
 Feb 2023      £  43,500    +3.6%             — ▲ Growing
 Mar 2023      £  45,200    +3.9%             — ▲ Growing
 Apr 2023      £  44,800    -0.9%        +2.20% → Flat
 May 2023      £  47,100    +5.1%        +2.72% ▲ Growing
 Jun 2023      £  46,500    -1.3%        +0.99% ▼ Declining
 Jul 2023      £  48,200    +3.7%        +2.51% ▲ Growing
 Aug 2023      £  50,100    +3.9%        +2.11% ▲ Growing
 Sep 2023      £  51,800    +3.4%        +3.66% ▲ Growing
 Oct 2023      £  38,200   -26.3%        -6.31% ▼ Declining ← CHURN IMPACT
 Nov 2023      £  39,500    +3.4%        -6.49% ▲ Growing ← CHURN IMPACT
 Dec 2023      £  45,600   +15.4%        -2.47% ▲ Growing
 Jan 2024      £  52,100   +14.3%       +11.03% ▲ Growing
 Feb 2024      £  54,300    +4.2%       +11.31% ▲ Growing
 Mar 2024      £  56,800    +4.6%        +7.69% ▲ Growing
 Apr 2024      £  57,200    +0.7%        +3.18% → Flat
 May 2024      £  59,100    +3.3%        +2.88% ▲ Growing
 Jun 2024      £  61,400    +3.9%        +2.64% ▲ Growing
What just happened?
pandas' .pct_change() divides each value by the previous one and subtracts 1 — giving the fractional change from month to month. Multiplying by 100 converts it to a percentage. The first row is always NaN (no previous row to compare against).
The October 2023 drop of −26.3% is the sharpest decline in the dataset — the churn event was clearly catastrophic that month. But December and January both show double-digit recovery growth (+15.4%, +14.3%), and by 2024 the business is back to consistent 3–5% monthly growth. This is a recovery story, not a collapse — and that distinction matters enormously for the board presentation.
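One compact way to summarise momentum for the board is the compound monthly growth rate (CMGR): the single steady rate that would turn the first month's revenue into the last month's. A sketch using the endpoint figures from the dataset:

```python
# Compound monthly growth rate: the steady rate r with first * (1+r)^n = last
first, last = 42000, 61400   # Jan 2023 and Jun 2024 revenue
n_steps = 17                 # 17 month-over-month steps across 18 months

cmgr = (last / first) ** (1 / n_steps) - 1
print(f"CMGR: {cmgr * 100:.2f}% per month")   # ≈ 2.26% per month
```

Unlike a plain average of the MoM percentages, the CMGR is unaffected by how the growth was distributed across months — a useful cross-check against a single dramatic swing like October 2023.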
Step 4 — Net Subscriber Growth Over Time
The scenario: Revenue is a lagging indicator — it reflects what happened, not what's coming. The CEO wants a leading indicator: net subscriber growth (new subscriptions minus cancellations). If net subs are growing, revenue will follow. If net subs are declining, trouble is coming regardless of what revenue looks like today.
# Net subscriber growth = new subscribers minus churned subscribers
df['net_subs'] = df['new_subs'] - df['churned_subs']
# Cumulative sum: how many net subscribers have we accumulated since month 1?
# .cumsum() adds each month's net to the running total
df['cumulative_subs'] = df['net_subs'].cumsum()
print("=== NET SUBSCRIBER GROWTH ===\n")
print(f"{'Month':<15} {'New':>6} {'Churned':>8} {'Net':>6} {'Cumulative':>12} Health")
print("─" * 58)
for date, row in df.iterrows():
net = int(row['net_subs'])
health = "✓ Growing" if net > 60 else "⚠ Weak" if net < 30 else "→ OK"
flag = " ← CRISIS" if date.strftime('%Y-%m') in ['2023-10', '2023-11'] else ""
print(f" {date.strftime('%b %Y'):<13} {int(row['new_subs']):>6} {int(row['churned_subs']):>8} "
f"{net:>+6} {int(row['cumulative_subs']):>12} {health}{flag}")
=== NET SUBSCRIBER GROWTH ===
Month             New  Churned    Net   Cumulative Health
──────────────────────────────────────────────────────────
 Jan 2023         120       28    +92           92 ✓ Growing
 Feb 2023         135       31   +104          196 ✓ Growing
 Mar 2023         142       29   +113          309 ✓ Growing
 Apr 2023         128       35    +93          402 ✓ Growing
 May 2023         155       32   +123          525 ✓ Growing
 Jun 2023         148       38   +110          635 ✓ Growing
 Jul 2023         162       41   +121          756 ✓ Growing
 Aug 2023         171       39   +132          888 ✓ Growing
 Sep 2023         168       44   +124         1012 ✓ Growing
 Oct 2023          95       89     +6         1018 ⚠ Weak ← CRISIS
 Nov 2023         102       92    +10         1028 ⚠ Weak ← CRISIS
 Dec 2023         138       45    +93         1121 ✓ Growing
 Jan 2024         175       38   +137         1258 ✓ Growing
 Feb 2024         182       35   +147         1405 ✓ Growing
 Mar 2024         191       31   +160         1565 ✓ Growing
 Apr 2024         188       29   +159         1724 ✓ Growing
 May 2024         196       27   +169         1893 ✓ Growing
 Jun 2024         204       24   +180         2073 ✓ Growing
What just happened?
pandas' .cumsum() computes a running total — each row adds its value to the sum of all previous rows. For subscriber growth, this shows the total accumulated subscriber count over time. It never goes down unless net subs go negative.
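That last caveat is easy to demonstrate — a cumulative sum dips exactly where the underlying net value goes negative. A tiny sketch with hypothetical net figures:

```python
import pandas as pd

# Hypothetical net-subscriber months: one of them negative
net = pd.Series([50, 30, -20, 10])
print(net.cumsum().tolist())   # → [50, 80, 60, 70] — the dip mirrors the -20
```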
The picture is now complete. October and November were +6 and +10 net subs — barely growing. But crucially, the business never went negative. And the 2024 acceleration is clear: net subs in June 2024 (+180) are nearly double January 2023's (+92) — a 96% increase. The business didn't just recover — it's growing faster than it ever was before the crisis. That's the narrative for the board.
The Time-Based Summary Chart
Here's a visual of the revenue trend with the anomaly period highlighted — the chart the CEO will see in the board deck.
Monthly Revenue — Jan 2023 to Jun 2024
Teacher's Note
Always set the date column as the index before any time-series analysis. It unlocks .resample(), .rolling(), date-based slicing like df.loc['2023'], and many other time-specific methods that don't work on a regular numeric index.
And when you find an anomaly — a spike, a crash, a structural break — your job isn't just to flag it. It's to quantify it: how many months did it last, how big was it relative to normal, and what did it cost? Numbers without context are noise. Context without numbers is opinion. You need both.
Practice Questions
1. Which pandas method groups time-series data by a time period (year, month, quarter) — the time-based equivalent of groupby?
2. Which pandas method calculates the percentage change from one row to the next — giving you month-over-month growth rates?
3. Which pandas method computes a running total — adding each row's value to the sum of all previous rows?
Quiz
1. Why should you set the date column as the DataFrame index before doing time-series analysis?
2. You want to flag months where churn was unusually high compared to the surrounding months. Which technique is most appropriate?
3. Revenue is a lagging indicator — it reflects the past. Which metric gives an earlier warning of future revenue health?
Up Next · Lesson 32
Handling Duplicates
Duplicate rows are sneaky — they inflate counts, distort averages, and cause models to overtrain on repeated data. Learn to find every type of duplicate and decide what to do about each one.