Train–Test Split for Time Series
Before any model is trained, one decision silently determines whether the results are meaningful or completely misleading.
That decision is how we split the data.
In time series, splitting data the wrong way does more than reduce accuracy: it breaks the logic of time itself.
The Real-World Situation
Imagine you are forecasting daily electricity demand for a city.
You have two years of historical data and want to predict future usage.
There is only one rule the real world follows:
The future is never allowed to influence the past.
Time series models must obey this rule exactly.
The Common (and Dangerous) Mistake
In regular machine learning, data is often split randomly.
That approach completely fails for time series.
Random splitting causes the model to see future information during training.
This is called data leakage.
Our Example Data
We continue with the same electricity usage example.
import numpy as np

np.random.seed(10)
days = np.arange(200)
# slow trend + weekly seasonality (period 7) + Gaussian noise (std 5)
usage = 130 + 0.15*days + 12*np.sin(2*np.pi*days/7) + np.random.normal(0, 5, 200)
This is a realistic series:
- Slow upward trend
- Weekly seasonality
- Random noise
What a Random Split Looks Like (Wrong)
A random split mixes past and future together.
Test points end up scattered across the entire timeline, interleaved with training points.
Why this is wrong:
- The model learns patterns from the future
- Evaluation becomes overly optimistic
- Real deployment will fail
This mistake is extremely common and very costly.
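The leakage is easy to see directly. The sketch below shuffles the day indices the way a random 80/20 split would, then counts how many "test" days actually fall before the last training day:

```python
import numpy as np

np.random.seed(10)
days = np.arange(200)

# A random 80/20 split shuffles the day indices before cutting.
shuffled = np.random.permutation(days)
train_idx = np.sort(shuffled[:160])
test_idx = np.sort(shuffled[160:])

# Count test days that fall BEFORE the last training day: each one
# means the model trained on data from that test day's future.
leaks = int(np.sum(test_idx < train_idx.max()))
print(f"{leaks} of {len(test_idx)} test days precede the last training day")
```

Nearly every test day ends up earlier than some training day, so the model is evaluated on a "future" it has effectively already seen.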
The Correct Way: Time-Based Split
Time series must be split chronologically.
Training uses the past.
Testing uses the future.
Nothing crosses that boundary.
Chronological Split in Code
split_point = int(len(usage) * 0.8)  # 80% of 200 days -> index 160
train = usage[:split_point]          # days 0-159: the past
test = usage[split_point:]           # days 160-199: the "future"
The first 80% of the series (days 0–159) becomes the training set, and the final 20% (days 160–199) becomes the test set.
Now the logic is preserved:
- Training sees only historical data
- Testing represents unseen future
- Evaluation reflects real performance
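The same three lines can be wrapped in a small reusable helper. This is a sketch, and the function name `time_split` is our own, not a library API:

```python
import numpy as np

def time_split(series, train_frac=0.8):
    """Chronological split: everything before the cut is training data,
    everything after is test data. No index crosses the boundary."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

# Simplified stand-in for the 200-day demand series
usage = 130 + 0.15 * np.arange(200)
train, test = time_split(usage)
print(len(train), len(test))  # → 160 40
```

Keeping the cut as a single index guarantees the past/future boundary by construction, rather than relying on the caller to avoid shuffling.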
Why This Matters So Much
Forecasting models are judged on how well they predict the future.
If future data leaks into training:
- Accuracy numbers become meaningless
- Models appear better than they are
- Business decisions become risky
Correct splitting protects you from false confidence.
How Much Data Should Be Used for Testing?
There is no single rule, but common choices are:
- Last 20% of data
- Last 30 days
- Last full season (weekly, monthly, yearly)
The test set should represent the future period you actually care about.
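On a daily series, each of these choices is one line of index arithmetic. The array below is a placeholder standing in for 200 days of demand data:

```python
import numpy as np

usage = np.zeros(200)  # placeholder for 200 days of demand data
n = len(usage)

test_last_20pct = usage[int(n * 0.8):]  # last 20% of the series
test_last_30d = usage[-30:]             # last 30 days
test_last_week = usage[-7:]             # last full weekly season
print(len(test_last_20pct), len(test_last_30d), len(test_last_week))  # → 40 30 7
```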
What This Enables Later
Once splitting is done correctly, you can safely:
- Evaluate forecasting accuracy
- Compare models honestly
- Trust real-world deployment results
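Chronological splitting also extends naturally to cross-validation. As a sketch (assuming scikit-learn is installed), `TimeSeriesSplit` produces several forward-rolling train/test folds, each of which respects the time boundary:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

usage = np.arange(200)  # stand-in for the demand series
tscv = TimeSeriesSplit(n_splits=4)

for fold, (train_idx, test_idx) in enumerate(tscv.split(usage)):
    # In every fold, all test indices come strictly after all train indices.
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train ends day {train_idx.max()}, "
          f"test spans days {test_idx.min()}-{test_idx.max()}")
```

Each fold trains on a longer prefix of history and tests on the block that immediately follows it, which gives several honest estimates of forecasting accuracy instead of one.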
Practice Questions
Q1. Why is random splitting dangerous in time series?
Q2. What should the test set represent?
Key Takeaways
- Time series must be split chronologically
- Random splitting breaks forecasting logic
- Correct splits lead to trustworthy models
Next lesson: we’ll apply this split while training our first regression-based forecasting model.