Bahdanau Attention (Additive Attention)
In the previous lesson, you learned the intuition behind the Attention Mechanism and why it solved the encoder–decoder bottleneck. Now, we move from intuition to a formal attention model.
Bahdanau Attention, introduced by Bahdanau et al. (2015) for neural machine translation, was the first widely successful attention mechanism. It is also known as Additive Attention.
This lesson explains:
- What Bahdanau Attention is
- How it computes attention scores
- Why it works better than basic Seq2Seq
Why Bahdanau Attention Was Introduced
Basic encoder–decoder models had two major issues:
- Single fixed context vector
- Poor performance on long sentences
Bahdanau Attention introduced a way for the decoder to:
- Access all encoder hidden states
- Decide relevance dynamically
- Create a context vector for each output step
This was a major breakthrough in NLP.
What Makes Bahdanau Attention “Additive”?
Bahdanau Attention computes attention scores using:
- Decoder hidden state
- Encoder hidden state
- A small neural network
The score is calculated by:
- Adding transformed encoder and decoder states
- Applying a non-linearity (tanh)
- Projecting to a scalar score
That is why it is called Additive Attention.
Core Components of Bahdanau Attention
For each decoding step, Bahdanau Attention uses:
- Encoder hidden states: h₁, h₂, …, hₙ
- Decoder hidden state: sₜ
- Trainable weight matrices
These components work together to compute relevance scores.
Step-by-Step Working (High-Level)
At decoding time step t:
- Compare decoder state with each encoder state
- Compute alignment (attention) scores
- Normalize scores using softmax
- Create weighted context vector
- Generate output word
This repeats for every output token.
Bahdanau Attention Score Function
The attention score is computed as:
score(sₜ, hᵢ) = vᵀ tanh(W₁sₜ + W₂hᵢ)
Where:
- sₜ = decoder hidden state at output step t
- hᵢ = encoder hidden state for input position i
- W₁, W₂, v = trainable parameters
This neural scoring function learns alignment automatically.
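As a minimal NumPy sketch of this scoring function (the hidden size `d`, attention size `d_a`, and random parameter values are illustrative assumptions, not values from any trained model):

```python
import numpy as np

d, d_a = 4, 8                     # hidden size and attention size (illustrative)
rng = np.random.default_rng(0)

W1 = rng.normal(size=(d_a, d))    # transforms the decoder state s_t
W2 = rng.normal(size=(d_a, d))    # transforms an encoder state h_i
v  = rng.normal(size=(d_a,))      # projects the tanh output to a scalar

def additive_score(s_t, h_i):
    """score(s_t, h_i) = v^T tanh(W1 s_t + W2 h_i)"""
    return v @ np.tanh(W1 @ s_t + W2 @ h_i)

s_t = rng.normal(size=(d,))       # decoder hidden state
h_i = rng.normal(size=(d,))       # one encoder hidden state
print(additive_score(s_t, h_i))   # a single scalar alignment score
```

Note the "additive" part: the two transformed states are summed inside the tanh, then reduced to one number per encoder position.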
From Scores to Attention Weights
Raw scores are converted into probabilities by applying softmax across all encoder positions i:
αₜ,ᵢ = exp(score(sₜ, hᵢ)) / Σⱼ exp(score(sₜ, hⱼ))
These weights represent how much attention the decoder gives to each input position at step t.
All attention weights sum to 1.
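The normalization step can be sketched as follows (the score values here are made up for illustration):

```python
import numpy as np

def softmax(x):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])   # hypothetical alignment scores
weights = softmax(scores)
print(weights)         # probabilities over the encoder positions
print(weights.sum())   # sums to 1
```

Higher raw scores yield proportionally larger weights, but every position always receives some nonzero attention.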
Context Vector Construction
The context vector is computed as:
cₜ = Σᵢ αₜ,ᵢ · hᵢ
This vector captures the most relevant information needed to generate the next word.
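As a sketch of this weighted sum, with made-up attention weights and encoder states:

```python
import numpy as np

# three encoder hidden states of size 4 (illustrative values)
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
alpha = np.array([0.7, 0.2, 0.1])   # attention weights for step t

c_t = alpha @ H    # weighted sum: sum_i alpha_i * h_i
print(c_t)         # → [0.7 0.2 0.1 0. ]
```

Because the weights sum to 1, the context vector is a convex combination of the encoder states, dominated by the most relevant position.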
Why Bahdanau Attention Works Well
Bahdanau Attention improves performance because:
- Alignment is learned, not fixed
- Context adapts per output token
- Gradient flow improves
- Long-range dependencies are handled better
This dramatically improved translation quality.
Conceptual Pseudocode
This pseudocode explains the logic, not framework syntax.
for each decoder step t:
    scores = []
    for each encoder state hᵢ:
        scores.append(vᵀ tanh(W₁ · sₜ + W₂ · hᵢ))
    attention_weights = softmax(scores)
    context_vector = Σᵢ attention_weights[i] · hᵢ
    output = decoder(context_vector, sₜ)
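The pseudocode above can be turned into a runnable NumPy sketch of a single attention step. All dimensions and random parameters below are illustrative; a real model would learn them, and a real decoder would consume the context vector together with its own state:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, d_a = 5, 4, 8                 # source length, hidden size, attention size

H   = rng.normal(size=(n, d))       # encoder hidden states h_1..h_n
s_t = rng.normal(size=(d,))         # current decoder hidden state

W1 = rng.normal(size=(d_a, d))      # illustrative random parameters
W2 = rng.normal(size=(d_a, d))
v  = rng.normal(size=(d_a,))

# 1. Alignment scores for every encoder position (vectorized over i)
scores = np.tanh(s_t @ W1.T + H @ W2.T) @ v   # shape (n,)

# 2. Normalize scores into attention weights
e = np.exp(scores - scores.max())
alpha = e / e.sum()                           # shape (n,), sums to 1

# 3. Weighted context vector
c_t = alpha @ H                               # shape (d,)

print(alpha)   # attention distribution over source positions
print(c_t)     # context vector fed to the decoder
```

Repeating these three steps at every output token is exactly what makes the context vector dynamic rather than fixed.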
Bahdanau Attention vs No Attention
| Aspect | No Attention | Bahdanau Attention |
|---|---|---|
| Context | Single fixed vector | Dynamic per output |
| Alignment | Implicit | Explicit & learned |
| Long sentences | Poor | Strong |
Real-World Usage
Bahdanau Attention is used in:
- Early neural machine translation systems
- Speech recognition
- Text summarization
It laid the foundation for all modern attention models.
Assignment / Homework
Theory:
- Explain additive attention in your own words
- Compare Bahdanau and basic Seq2Seq
Practical:
- Implement Bahdanau attention using NumPy
- Visualize attention weights as a heatmap
Environment:
- Google Colab
- Jupyter Notebook
Practice Questions
Q1. Why is Bahdanau attention called additive?
Q2. What does the context vector represent?
Quick Quiz
Q1. Which activation function is used in Bahdanau scoring?
Q2. Are attention weights learned or fixed?
Quick Recap
- Bahdanau Attention is additive attention
- Uses a neural network to compute alignment
- Creates dynamic context vectors
- Improves long-sequence modeling
- Foundation for modern NLP attention
Next lesson: Luong (Multiplicative) Attention