Model Evaluation in Time Series
So far, you have learned how to build forecasting models. But building a model is only half the job.
The real question is:
How do we know if a forecast is actually good?
A Real-World Situation
Imagine you are forecasting daily demand for an online store.
You create two models:
- Model A predicts slightly higher values
- Model B predicts slightly lower values
Both look reasonable by eye. But only one will minimize stockouts and over-ordering.
This is where model evaluation becomes critical.
Why Time Series Evaluation Is Different
Time series data is ordered in time.
You cannot randomly shuffle it like regular machine learning data.
If you evaluate incorrectly:
- You leak future information
- Results look unrealistically good
- Models fail in production
So evaluation must respect time.
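To make the leakage concrete, here is a minimal sketch comparing a shuffled split with a time-ordered one. Index arrays stand in for real rows, and the variable names are purely illustrative:

```python
import numpy as np

np.random.seed(0)
idx = np.arange(100)  # time-ordered row indices 0..99

# Wrong: shuffle first, then take an 80/20 split
shuffled = np.random.permutation(idx)
train_bad, test_bad = shuffled[:80], shuffled[80:]

# Right: split in time order, past vs future
train_ok, test_ok = idx[:80], idx[80:]

# Shuffling mixes future rows into the training set
print(train_bad.max() > test_bad.min())  # leakage check
print(train_ok.max() < test_ok.min())
```

With the shuffled split, some "training" indices come from after the test period, which is exactly the future leakage described above.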
Train vs Test (Time-Aware)
In time series:
- Train → past data
- Test → future data
Let’s simulate a simple real-world demand series.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2)
time = np.arange(100)
demand = 50 + 0.3*time + np.random.normal(0, 4, 100)  # upward trend plus noise
train = demand[:80]  # the past: what the model is allowed to see
test = demand[80:]   # the future: held out for evaluation
plt.figure(figsize=(9,4))
plt.plot(train, label="Train")
plt.plot(range(80, 100), test, label="Test")
plt.legend()
plt.show()
This mirrors reality:
- Model only sees the past
- Evaluation happens on unseen future values
Forecast vs Actual
Let’s assume the simplest possible forecasting model:
Tomorrow’s demand ≈ today’s demand
This is called a naive forecast. Over a multi-step horizon, it simply repeats the last observed training value.
forecast = np.repeat(train[-1], len(test))  # last known value, repeated
plt.figure(figsize=(9,4))
plt.plot(test, label="Actual")
plt.plot(forecast, label="Forecast")
plt.legend()
plt.show()
Visually, you can already see:
- The forecast is flat
- The actual data continues to rise
But we still need numbers.
Evaluation Metrics (Intuition First)
Evaluation metrics answer:
How far off were the predictions?
The most common metrics are:
- MAE – Mean Absolute Error
- MSE – Mean Squared Error
- RMSE – Root Mean Squared Error
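All three metrics are one-liners in NumPy. A quick sketch on made-up numbers (the arrays below are purely illustrative):

```python
import numpy as np

actual    = np.array([100.0, 102.0, 105.0, 110.0])
predicted = np.array([ 98.0, 103.0, 100.0, 104.0])

errors = actual - predicted       # [2, -1, 5, 6]
mae  = np.mean(np.abs(errors))    # average magnitude of the errors -> 3.5
mse  = np.mean(errors**2)         # squares penalize large errors more -> 16.5
rmse = np.sqrt(mse)               # back in the original units -> ~4.06

print(mae, mse, rmse)
```

Note that MSE is in squared units (e.g. units²), which is why RMSE is usually reported instead.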
Mean Absolute Error (MAE)
MAE measures the average absolute difference between forecast and actual values.
Interpretation:
- Easy to understand
- Same units as the data
mae = np.mean(np.abs(test - forecast))  # average absolute error, in demand units
mae
If MAE = 6:
On average, the forecast is off by 6 units.
RMSE (Punishes Large Errors)
RMSE gives more penalty to large mistakes.
This matters when big errors are costly.
rmse = np.sqrt(np.mean((test - forecast)**2))  # square, average, then square-root
rmse
If RMSE is much larger than MAE:
- The model makes some very large errors
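A quick way to see why: two error sequences with the same MAE but different largest errors (toy numbers for illustration):

```python
import numpy as np

steady = np.array([3.0, 3.0, 3.0, 3.0])   # every forecast off by 3
spiky  = np.array([0.0, 0.0, 0.0, 12.0])  # three perfect hits, one big miss

for errors in (steady, spiky):
    mae  = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors**2))
    print(f"MAE={mae:.1f}  RMSE={rmse:.2f}")
```

Both sequences have MAE = 3.0, but the spiky one has RMSE = 6.0 versus 3.0 for the steady one: squaring makes the single 12-unit miss dominate.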
Residual Analysis
Residuals = Actual − Forecast
A good model leaves behind only random noise.
residuals = test - forecast
plt.figure(figsize=(9,4))
plt.plot(residuals)
plt.axhline(0, color="gray", linestyle="--")  # a good model's residuals center on zero
plt.title("Residuals")
plt.show()
Here, the residuals trend steadily upward instead of hovering around zero.
That tells us:
- The model is missing trend
- We need a better forecasting method
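As one sketch of a "better method", a drift forecast extends the average slope of the training window instead of staying flat. This reuses the simulated series from above; the slope estimate below is one common choice, not the only one:

```python
import numpy as np

np.random.seed(2)
time = np.arange(100)
demand = 50 + 0.3*time + np.random.normal(0, 4, 100)
train, test = demand[:80], demand[80:]

# Drift method: average per-step change over the training window
slope = (train[-1] - train[0]) / (len(train) - 1)
steps = np.arange(1, len(test) + 1)
drift_forecast = train[-1] + slope * steps

naive_forecast = np.repeat(train[-1], len(test))

for name, fc in [("naive", naive_forecast), ("drift", drift_forecast)]:
    print(name, "MAE:", round(np.mean(np.abs(test - fc)), 2))
```

Because the series keeps rising, the drift forecast typically tracks it more closely than the flat naive forecast, though the exact MAE values depend on the random seed.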
What a Good Evaluation Looks Like
- Low MAE and RMSE
- No visible pattern in residuals
- Errors behave randomly
Evaluation is not about a single number — it’s about understanding model behavior.
Practice Questions
Q1. Why can’t we randomly shuffle time series data?
Q2. When is RMSE preferred over MAE?
Key Takeaways
- Evaluation must respect time order
- Visual inspection matters
- Residuals reveal hidden problems
- No model is useful without proper evaluation
Next, we will build a complete forecasting pipeline from start to finish.