Time Series Lesson 23 – Evaluation | Dataplexa

Model Evaluation in Time Series

So far, you learned how to build forecasting models. But building a model is only half the job.

The real question is:

How do we know if a forecast is actually good?


A Real-World Situation

Imagine you are forecasting daily demand for an online store.

You create two models:

  • Model A predicts slightly higher values
  • Model B predicts slightly lower values

Both look reasonable by eye. But only one will minimize stockouts and over-ordering.

This is where model evaluation becomes critical.


Why Time Series Evaluation Is Different

Time series data is ordered in time.

You cannot randomly shuffle it the way you would with ordinary machine-learning data.

If you evaluate incorrectly:

  • You leak future information
  • Results look unrealistically good
  • Models fail in production

So evaluation must respect time.
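A quick sketch makes the leakage concrete. If we split by randomly shuffled indices, as ordinary cross-validation does, the training set ends up containing points from *after* the test set begins (the variable names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# In a time series, index position = time.
shuffled = rng.permutation(n)     # what a random split would do
train_idx = shuffled[:80]
test_idx = shuffled[80:]

# The training set now contains indices later than some test indices:
# the model would be trained on the future of points it is tested on.
print(train_idx.max() > test_idx.min())
```

A chronological split, as in the next section, avoids this by construction.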


Train vs Test (Time-Aware)

In time series:

  • Train → past data
  • Test → future data

Let’s simulate a simple real-world demand series.

Python: Train–Test Split
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
time = np.arange(100)
demand = 50 + 0.3*time + np.random.normal(0,4,100)

train = demand[:80]
test = demand[80:]

plt.figure(figsize=(9,4))
plt.plot(train, label="Train")
plt.plot(range(80,100), test, label="Test")
plt.legend()
plt.show()

This mirrors reality:

  • Model only sees the past
  • Evaluation happens on unseen future values

Forecast vs Actual

Let’s assume the simplest possible forecasting model:

Every future day’s demand ≈ the last observed day’s demand

This is called a naive forecast.

Python: Naive Forecast
# Repeat the last training value across the whole test horizon
forecast = np.repeat(train[-1], len(test))

plt.figure(figsize=(9,4))
plt.plot(test, label="Actual")
plt.plot(forecast, label="Forecast")
plt.legend()
plt.show()

Visually, you can already see:

  • The forecast is flat
  • The actual data continues to rise

But we still need numbers.


Evaluation Metrics (Intuition First)

Evaluation metrics answer:

How far off were the predictions?

The most common metrics are:

  • MAE – Mean Absolute Error
  • MSE – Mean Squared Error
  • RMSE – Root Mean Squared Error
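Before applying them to the demand series, here is how all three look on a tiny made-up example (the numbers are purely illustrative):

```python
import numpy as np

actual   = np.array([10.0, 12.0, 11.0, 13.0])
forecast = np.array([11.0, 10.0, 11.0, 16.0])

errors = actual - forecast           # [-1.  2.  0. -3.]

mae  = np.mean(np.abs(errors))       # (1 + 2 + 0 + 3) / 4 = 1.5
mse  = np.mean(errors**2)            # (1 + 4 + 0 + 9) / 4 = 3.5
rmse = np.sqrt(mse)                  # sqrt(3.5) ≈ 1.87

print(mae, mse, rmse)
```

Note that MSE is in squared units, which is why RMSE, back in the original units, is usually the one reported.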

Mean Absolute Error (MAE)

MAE measures the average absolute difference between forecast and actual values.

Interpretation:

  • Easy to understand
  • Same units as the data

Python: MAE
mae = np.mean(np.abs(test - forecast))
print(f"MAE: {mae:.2f}")

If MAE = 6:

On average, the forecast is off by 6 units.


RMSE (Punishes Large Errors)

RMSE gives more penalty to large mistakes.

This matters when big errors are costly.

Python: RMSE
rmse = np.sqrt(np.mean((test - forecast)**2))
print(f"RMSE: {rmse:.2f}")

If RMSE is much larger than MAE:

  • The model makes some very large errors
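This gap is easy to demonstrate with two made-up error sequences that share the same MAE:

```python
import numpy as np

steady = np.array([2.0, 2.0, 2.0, 2.0])   # four moderate errors
spiky  = np.array([0.0, 0.0, 0.0, 8.0])   # one large error

# Both have MAE = 2 ...
print(np.mean(np.abs(steady)), np.mean(np.abs(spiky)))

# ... but RMSE exposes the spike: 2.0 vs 4.0
print(np.sqrt(np.mean(steady**2)), np.sqrt(np.mean(spiky**2)))
```

Comparing RMSE to MAE is therefore a cheap first check for occasional large misses.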

Residual Analysis

Residuals = Actual − Forecast

A good model leaves behind only random noise.

Python: Residuals
residuals = test - forecast

plt.figure(figsize=(9,4))
plt.plot(residuals)
plt.title("Residuals")
plt.show()

Here, the residuals drift steadily upward instead of hovering around zero.

That tells us:

  • The model is missing the trend
  • We need a better forecasting method
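One plausible improvement, sketched below, is to fit a straight line to the training data with np.polyfit and extrapolate it over the test horizon. This is only an illustration of "a better forecasting method", not the only fix:

```python
import numpy as np

# Recreate the demand series and split from earlier
np.random.seed(2)
time = np.arange(100)
demand = 50 + 0.3*time + np.random.normal(0, 4, 100)
train, test = demand[:80], demand[80:]

# Fit a line (degree-1 polynomial) to the training portion only
slope, intercept = np.polyfit(np.arange(80), train, 1)

# Extrapolate the fitted line over the test period
trend_forecast = intercept + slope * np.arange(80, 100)

# Compare against the flat naive forecast from earlier
naive_forecast = np.repeat(train[-1], len(test))
print("naive MAE:", np.mean(np.abs(test - naive_forecast)))
print("trend MAE:", np.mean(np.abs(test - trend_forecast)))
```

Because the trend-aware forecast rises with the data, its residuals should hover around zero rather than drifting upward.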

What a Good Evaluation Looks Like

  • Low MAE and RMSE
  • No visible pattern in residuals
  • Errors behave randomly

Evaluation is not about a single number — it’s about understanding model behavior.


Practice Questions

Q1. Why can’t we randomly shuffle time series data?

Because it breaks the temporal order and leaks future information.

Q2. When is RMSE preferred over MAE?

When large errors are especially costly and should be penalized more.

Key Takeaways

  • Evaluation must respect time order
  • Visual inspection matters
  • Residuals reveal hidden problems
  • No model is useful without proper evaluation

Next, we will build a complete forecasting pipeline from start to finish.