Attention Mechanism
The attention mechanism was introduced to solve a fundamental limitation of traditional encoder–decoder (Seq2Seq) models.
In earlier models, the entire input sequence was compressed into a single fixed-length context vector. This worked for short sequences but failed for longer and more complex inputs.
Attention allows the model to dynamically focus on different parts of the input sequence while generating each output token.
Why Attention Was Needed
Consider translating a long sentence.
Not every word in the input is equally important for every word in the output.
A fixed context vector forces the decoder to rely on the same compressed information at every decoding step.
Attention removes this restriction by allowing the decoder to selectively look back at relevant encoder states.
Core Idea of Attention
At each decoding step, the model asks:
"Which parts of the input sequence should I focus on right now?"
Instead of using a single context vector, attention computes a new context vector for every output token.
This context vector is a weighted sum of encoder hidden states.
Attention as a Weighted Sum
The attention mechanism assigns a weight to each encoder hidden state.
These weights represent how relevant each input token is to the current output token being generated.
Mathematically:
• Encoder hidden states: h₁, h₂, …, hₙ
• Attention weights: α₁, α₂, …, αₙ
The context vector is computed as:
context = α₁h₁ + α₂h₂ + … + αₙhₙ
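As a concrete sketch of this formula (the hidden states and weights below are made-up numbers, not learned values), the weighted sum can be computed in a few lines of NumPy:

```python
import numpy as np

# Three encoder hidden states h₁, h₂, h₃, each of dimension 4 (illustrative values)
h = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.5, 1.0, 0.0, 2.0],
              [2.0, 1.0, 1.0, 0.0]])

# Attention weights α₁, α₂, α₃ for one decoding step; they sum to 1
alpha = np.array([0.7, 0.2, 0.1])

# context = α₁h₁ + α₂h₂ + α₃h₃
context = alpha @ h
print(context)  # ≈ [1.0, 0.3, 1.5, 1.1], up to float rounding
```

Note that the context vector has the same dimension as a single hidden state, so the decoder can consume it exactly where a fixed context vector used to go.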
How Attention Weights Are Computed
The decoder hidden state is compared with each encoder hidden state to compute a relevance score.
Common scoring methods include:
• Dot-product attention
• Scaled dot-product attention
• Additive (Bahdanau) attention
These scores are then normalized using the softmax function to produce attention weights.
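A minimal sketch of the three scoring variants, assuming a decoder state and encoder states of matching dimension (all values and parameter matrices here are random stand-ins for learned parameters):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

d = 4                                # hidden dimension
s = np.array([1.0, 0.5, -0.5, 2.0])  # decoder hidden state
H = np.random.randn(3, d)            # 3 encoder hidden states

# Dot-product attention: score_i = s · h_i
scores_dot = H @ s

# Scaled dot-product attention: divide by sqrt(d) so scores stay well-ranged
scores_scaled = (H @ s) / np.sqrt(d)

# Additive (Bahdanau) attention: score_i = v · tanh(W1 h_i + W2 s)
W1, W2, v = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d)
scores_add = np.tanh(H @ W1.T + s @ W2.T) @ v

# Whichever scoring method is used, softmax turns scores into weights
weights = softmax(scores_scaled)
```

After the softmax, the weights are non-negative and sum to 1, which is what lets the context vector be read as a weighted average of the encoder states.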
Step-by-Step Attention Flow
1. Decoder generates a query vector from its current state
2. Query is compared with all encoder key vectors
3. Scores are normalized into attention weights
4. Weighted sum of encoder values forms the context vector
5. Context vector helps predict the next output token
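The five steps above can be traced end to end in code. The projection matrices producing queries, keys, and values are hypothetical stand-ins for learned parameters:

```python
import numpy as np

np.random.seed(0)
d = 4
decoder_state = np.random.randn(d)
encoder_states = np.random.randn(5, d)   # 5 input positions

# Hypothetical learned projections for queries, keys, and values
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

query = decoder_state @ Wq               # step 1: query from decoder state
keys = encoder_states @ Wk               # keys from encoder states
values = encoder_states @ Wv             # values from encoder states
scores = keys @ query / np.sqrt(d)       # step 2: compare query with keys
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # step 3: softmax normalization
context = weights @ values               # step 4: weighted sum of values
# step 5: the context vector is then fed into the decoder's output layer
```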
Conceptual Attention Code
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Compute attention scores: one score per encoder hidden state
scores = encoder_states @ decoder_state
# Normalize scores into attention weights that sum to 1
attention_weights = softmax(scores)
# Compute the context vector as a weighted sum of encoder states
context_vector = attention_weights @ encoder_states
This context vector changes at every decoding step, allowing the model to adapt dynamically.
Real-World Interpretation
Imagine reading a sentence and translating it word by word.
Your focus shifts depending on the word you are translating.
Attention mimics this human behavior by allowing the model to shift focus dynamically instead of remembering everything at once.
Benefits of Attention
Attention significantly improves:
• Performance on long sequences
• Interpretability of models
• Learning stability
It also enables visualization of which input tokens influence each output token.
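For instance, pairing the attention weights for one output token with the input tokens makes the model's focus directly inspectable (the tokens and weights below are made up for illustration):

```python
tokens = ["the", "cat", "sat"]
weights = [0.1, 0.7, 0.2]  # hypothetical attention weights for one output token

# Print a simple text bar chart, strongest attention first
for token, w in sorted(zip(tokens, weights), key=lambda p: -p[1]):
    print(f"{token:>5}  {'#' * int(round(w * 20))}  {w:.2f}")
```

Full attention matrices (one weight row per output token) are often rendered as heatmaps in the same spirit.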
Limitations of Basic Attention
While attention improves Seq2Seq models, the underlying recurrent encoder and decoder still process tokens one at a time.
This sequential processing limits parallelization and training speed.
These limitations directly led to the development of self-attention and transformer architectures.
Mini Thinking Exercise
Think about this:
Why does attention work better than a fixed context vector for long sentences?
Exercises
Exercise 1:
What problem does attention solve in Seq2Seq models?
Exercise 2:
What does an attention weight represent?
Quick Check
Q: Does attention remove the need for an encoder?
Next, we will build upon attention and explore Transformer architectures, which remove recurrence entirely and rely purely on attention mechanisms.