Attention Models for Time Series
Traditional sequence models treat all past observations almost equally, but in practice not every past value matters the same.
Attention models address this by learning where to focus.
Why Attention Is Needed
Think about forecasting daily sales.
- Yesterday’s sales matter a lot
- Last week’s same day matters
- Sales from 6 months ago may not matter
Attention models automatically learn which past points are important.
Real-World Example: Online Sales Forecasting
Consider an e-commerce store:
- Recent promotions affect demand
- Weekend patterns repeat
- Old data slowly loses relevance
Attention allows the model to assign higher weight to the most useful historical moments.
Sales Time Series
This chart shows simulated daily sales with:
- Trend
- Weekly seasonality
- Occasional spikes
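A series like this can be simulated in a few lines. This is a minimal NumPy sketch (the parameter values are illustrative assumptions, not taken from the lesson's chart): a slow trend, a 7-day sinusoid for weekly seasonality, noise, and a handful of promotion-style spikes.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(365)

trend = 100 + 0.1 * days                        # slow upward trend
weekly = 15 * np.sin(2 * np.pi * days / 7)      # weekly seasonality
noise = rng.normal(0, 5, size=days.size)        # day-to-day noise

sales = trend + weekly + noise
spike_days = rng.choice(days, size=8, replace=False)
sales[spike_days] += rng.uniform(30, 60, size=8)  # occasional promo spikes
```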
How Attention Works (Conceptually)
- Each past timestep produces a hidden state
- The model scores how relevant each state is
- Important states get higher weights
- Weighted sum is used for prediction
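The four steps above can be sketched directly in NumPy. This is a simplified illustration: the hidden states `h` and the scoring vector `w` are random stand-ins for what an LSTM and a trained scoring layer would produce.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
T, d = 30, 8                    # 30 past timesteps, hidden size 8
h = rng.normal(size=(T, d))     # stand-in for LSTM hidden states
w = rng.normal(size=(d,))       # stand-in for a learned scoring vector

scores = h @ w                  # one relevance score per timestep
weights = softmax(scores)       # normalize into attention weights
context = weights @ h           # weighted sum used for prediction
```

The weights form a probability distribution over timesteps, so the context vector is a convex combination of the hidden states.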
Attention Weights Visualization
Below you see how attention assigns importance to different timesteps.
Notice:
- Recent days get higher weights
- Some weekly points stand out
- Older values fade away
Attention Model Structure
from tensorflow.keras.layers import Dense, Softmax
import tensorflow as tf

# h = LSTM hidden states, shape (batch, timesteps, units)
scores = Dense(1)(h)                           # relevance score per timestep
weights = Softmax(axis=1)(scores)              # normalize over the time axis
context = tf.reduce_sum(weights * h, axis=1)   # weighted sum of the states
Key idea:
- The model decides what matters
- No manual feature engineering
Why Attention Improves Forecasting
- Handles long sequences better
- Reduces information overload
- Improves interpretability
You can visually inspect which timesteps influenced predictions.
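Inspection can be as simple as ranking timesteps by their attention weight. The weights below are hypothetical values chosen for illustration (most recent timestep last):

```python
import numpy as np

# hypothetical attention weights over the last 7 timesteps
weights = np.array([0.02, 0.03, 0.05, 0.10, 0.15, 0.25, 0.40])
top = np.argsort(weights)[::-1][:3]   # three most influential timesteps
print("most influential timestep indices:", top)  # → [6 5 4]
```

In this made-up example the most recent timesteps dominate, which matches the pattern described above.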
Common Use Cases
- Financial forecasting
- Demand prediction
- Energy load forecasting
- Traffic and mobility data
Limitations
- More parameters
- Needs more data
- Slower training
Attention is powerful but should be used wisely.
Practice Questions
Q1. Why does attention outperform plain LSTM for long sequences?
Q2. Can attention explain model decisions?
Next lesson: Transformer models for time series forecasting.