Time Series Lesson 23 – Evaluation | Dataplexa

Model Evaluation in Time Series

So far, you learned how to build forecasting models. But building a model is only half the job.

The real question is:

How do we know if a forecast is actually good?


A Real-World Situation

Imagine you are forecasting daily demand for an online store.

You create two models:

  • Model A predicts slightly higher values
  • Model B predicts slightly lower values

Both look reasonable by eye. But only one will minimize stockouts and over-ordering.

This is where model evaluation becomes critical.


Why Time Series Evaluation Is Different

Time series data is ordered in time.

You cannot randomly shuffle it the way you would with ordinary machine-learning data.

If you evaluate incorrectly:

  • You leak future information
  • Results look unrealistically good
  • Models fail in production

So evaluation must respect time.
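A quick sketch makes the leakage concrete. If we split by randomly shuffled indices, as ordinary cross-validation does, the training set ends up containing points from *after* the test set begins (the variable names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# In a time series, index position = time.
shuffled = rng.permutation(n)     # what a random split would do
train_idx = shuffled[:80]
test_idx = shuffled[80:]

# The training set now contains indices later than some test indices:
# the model would be trained on the future of points it is tested on.
print(train_idx.max() > test_idx.min())
```

A chronological split, as in the next section, avoids this by construction.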


Train vs Test (Time-Aware)

In time series:

  • Train → past data
  • Test → future data

Let’s simulate a simple real-world demand series.

Python: Train–Test Split
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
time = np.arange(100)
demand = 50 + 0.3*time + np.random.normal(0,4,100)

train = demand[:80]
test = demand[80:]

plt.figure(figsize=(9,4))
plt.plot(train, label="Train")
plt.plot(range(80,100), test, label="Test")
plt.legend()
plt.show()

This mirrors reality:

  • Model only sees the past
  • Evaluation happens on unseen future values

Forecast vs Actual

Let’s assume the simplest possible forecasting model:

Every future day’s demand ≈ the last observed day’s demand

This is called a naive forecast.

Python: Naive Forecast
# Repeat the last training value across the whole test horizon
forecast = np.repeat(train[-1], len(test))

plt.figure(figsize=(9,4))
plt.plot(test, label="Actual")
plt.plot(forecast, label="Forecast")
plt.legend()
plt.show()

Visually, you can already see:

  • The forecast is flat
  • The actual data continues to rise

But we still need numbers.


Evaluation Metrics (Intuition First)

Evaluation metrics answer:

How far off were the predictions?

The most common metrics are:

  • MAE – Mean Absolute Error
  • MSE – Mean Squared Error
  • RMSE – Root Mean Squared Error
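Before applying them to the demand series, here is how all three look on a tiny made-up example (the numbers are purely illustrative):

```python
import numpy as np

actual   = np.array([10.0, 12.0, 11.0, 13.0])
forecast = np.array([11.0, 10.0, 11.0, 16.0])

errors = actual - forecast           # [-1.  2.  0. -3.]

mae  = np.mean(np.abs(errors))       # (1 + 2 + 0 + 3) / 4 = 1.5
mse  = np.mean(errors**2)            # (1 + 4 + 0 + 9) / 4 = 3.5
rmse = np.sqrt(mse)                  # sqrt(3.5) ≈ 1.87

print(mae, mse, rmse)
```

Note that MSE is in squared units, which is why RMSE, back in the original units, is usually the one reported.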

Mean Absolute Error (MAE)

MAE measures the average absolute difference between forecast and actual values.

Interpretation:

  • Easy to understand
  • Same units as the data

Python: MAE
mae = np.mean(np.abs(test - forecast))
print(f"MAE: {mae:.2f}")

If MAE = 6:

On average, the forecast is off by 6 units.


RMSE (Punishes Large Errors)

RMSE gives more penalty to large mistakes.

This matters when big errors are costly.

Python: RMSE
rmse = np.sqrt(np.mean((test - forecast)**2))
print(f"RMSE: {rmse:.2f}")

If RMSE is much larger than MAE:

  • The model makes some very large errors
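This gap is easy to demonstrate with two made-up error sequences that share the same MAE:

```python
import numpy as np

steady = np.array([2.0, 2.0, 2.0, 2.0])   # four moderate errors
spiky  = np.array([0.0, 0.0, 0.0, 8.0])   # one large error

# Both have MAE = 2 ...
print(np.mean(np.abs(steady)), np.mean(np.abs(spiky)))

# ... but RMSE exposes the spike: 2.0 vs 4.0
print(np.sqrt(np.mean(steady**2)), np.sqrt(np.mean(spiky**2)))
```

Comparing RMSE to MAE is therefore a cheap first check for occasional large misses.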

Residual Analysis

Residuals = Actual − Forecast

A good model leaves behind only random noise.

Python: Residuals
residuals = test - forecast

plt.figure(figsize=(9,4))
plt.plot(residuals)
plt.title("Residuals")
plt.show()

Here, the residuals drift steadily upward instead of hovering around zero.

That tells us:

  • The model is missing the trend
  • We need a better forecasting method
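One plausible improvement, sketched below, is to fit a straight line to the training data with np.polyfit and extrapolate it over the test horizon. This is only an illustration of "a better forecasting method", not the only fix:

```python
import numpy as np

# Recreate the demand series and split from earlier
np.random.seed(2)
time = np.arange(100)
demand = 50 + 0.3*time + np.random.normal(0, 4, 100)
train, test = demand[:80], demand[80:]

# Fit a line (degree-1 polynomial) to the training portion only
slope, intercept = np.polyfit(np.arange(80), train, 1)

# Extrapolate the fitted line over the test period
trend_forecast = intercept + slope * np.arange(80, 100)

# Compare against the flat naive forecast from earlier
naive_forecast = np.repeat(train[-1], len(test))
print("naive MAE:", np.mean(np.abs(test - naive_forecast)))
print("trend MAE:", np.mean(np.abs(test - trend_forecast)))
```

Because the trend-aware forecast rises with the data, its residuals should hover around zero rather than drifting upward.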

What a Good Evaluation Looks Like

  • Low MAE and RMSE
  • No visible pattern in residuals
  • Errors behave randomly

Evaluation is not about a single number — it’s about understanding model behavior.


Practice Questions

Q1. Why can’t we randomly shuffle time series data?

Because it breaks the temporal order and leaks future information.

Q2. When is RMSE preferred over MAE?

When large errors are especially costly and should be penalized more.

Key Takeaways

  • Evaluation must respect time order
  • Visual inspection matters
  • Residuals reveal hidden problems
  • No model is useful without proper evaluation

Next, we will build a complete forecasting pipeline from start to finish.