Time Series Lesson 28 – TS Split | Dataplexa

Train–Test Split for Time Series

Before any model is trained, one decision silently determines whether the results are meaningful or completely misleading.

That decision is how we split the data.

In time series, splitting data the wrong way does not just reduce accuracy — it breaks the logic of time itself.


The Real-World Situation

Imagine you are forecasting daily electricity demand for a city.

You have two years of historical data and want to predict future usage.

There is only one rule the real world follows:

The future is never allowed to influence the past.

Time series models must obey this rule exactly.


The Common (and Dangerous) Mistake

In regular machine learning, data is often split randomly.

That approach completely fails for time series.

Random splitting causes the model to see future information during training.

This is called data leakage.


Our Example Data

We continue with the same electricity usage example.

Python: Base Time Series
import numpy as np

np.random.seed(10)
days = np.arange(200)
usage = 130 + 0.15*days + 12*np.sin(2*np.pi*days/7) + np.random.normal(0, 5, 200)

This is a realistic series:

  • Slow upward trend
  • Weekly seasonality
  • Random noise

What a Random Split Looks Like (Wrong)

A random split mixes past and future together: training rows end up scattered across the entire timeline, with test rows interleaved between them.

Why this is wrong:

  • The model learns patterns from the future
  • Evaluation becomes overly optimistic
  • Real deployment will fail

This mistake is extremely common and very costly.
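To make the failure concrete, here is a minimal sketch of a shuffled split, reusing the series from above. The 160/40 split sizes simply mirror the 80/20 ratio used later in this lesson.

```python
import numpy as np

np.random.seed(10)
days = np.arange(200)
usage = 130 + 0.15*days + 12*np.sin(2*np.pi*days/7) + np.random.normal(0, 5, 200)

# WRONG: shuffle the day indices before splitting, as in ordinary ML
shuffled = np.random.permutation(len(usage))
train_idx = shuffled[:160]   # 80% "training" rows
test_idx = shuffled[160:]    # 20% "test" rows

# The training set now contains days that come AFTER days in the test set,
# so the model is effectively shown the future during training.
print(train_idx.max(), test_idx.min())
```

Because the indices are shuffled, the latest day in the training set almost certainly lies beyond the earliest day in the test set — exactly the leakage described above.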


The Correct Way: Time-Based Split

Time series must be split chronologically.

Training uses the past.

Testing uses the future.

Nothing crosses that boundary.


Chronological Split in Code

Python: Time-Based Split
split_point = int(len(usage) * 0.8)

train = usage[:split_point]
test = usage[split_point:]

With the 80/20 split above, the first 160 days are used for training and the last 40 days are held out for testing.

Now the logic is preserved:

  • Training sees only historical data
  • Testing represents unseen future
  • Evaluation reflects real performance
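A quick sanity check makes the boundary explicit. This sketch repeats the split on the day indices so we can verify that every training day strictly precedes every test day:

```python
import numpy as np

np.random.seed(10)
days = np.arange(200)
usage = 130 + 0.15*days + 12*np.sin(2*np.pi*days/7) + np.random.normal(0, 5, 200)

split_point = int(len(usage) * 0.8)
train, test = usage[:split_point], usage[split_point:]
train_days, test_days = days[:split_point], days[split_point:]

# Nothing crosses the boundary: all training days come before all test days
assert train_days.max() < test_days.min()
print(len(train), len(test))  # 160 40
```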

Why This Matters So Much

Forecasting models are judged on how well they predict the future.

If future data leaks into training:

  • Accuracy numbers become meaningless
  • Models appear better than they are
  • Business decisions become risky

Correct splitting protects you from false confidence.


How Much Data Should Be Used for Testing?

There is no single rule, but common choices are:

  • Last 20% of data
  • Last 30 days
  • Last full season (weekly, monthly, yearly)

The test set should represent the future period you actually care about.
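Each of these choices translates into a simple slice from the end of the series. A sketch, assuming the 200-day series from above with its 7-day weekly season:

```python
import numpy as np

np.random.seed(10)
days = np.arange(200)
usage = 130 + 0.15*days + 12*np.sin(2*np.pi*days/7) + np.random.normal(0, 5, 200)

# Last 20% of the data as test
test_pct = usage[int(len(usage) * 0.8):]

# Last 30 days as test
test_30d = usage[-30:]

# Last full season as test (weekly seasonality here, so 7 days)
test_season = usage[-7:]

print(len(test_pct), len(test_30d), len(test_season))  # 40 30 7
```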


What This Enables Later

Once splitting is done correctly, you can safely:

  • Evaluate forecasting accuracy
  • Compare models honestly
  • Trust real-world deployment results

Practice Questions

Q1. Why is random splitting dangerous in time series?

It allows future data to influence training, causing data leakage.

Q2. What should the test set represent?

Unseen future data that the model would face in real use.

Key Takeaways

  • Time series must be split chronologically
  • Random splitting breaks forecasting logic
  • Correct splits lead to trustworthy models

Next lesson: we’ll apply this split while training our first regression-based forecasting model.