Bahdanau Attention (Additive Attention)
In the previous lesson, you learned the intuition behind the Attention Mechanism and why it solved the encoder–decoder bottleneck. Now, we move from intuition to a formal attention model.
Bahdanau Attention, introduced by Bahdanau et al. (2015) for neural machine translation, was the first widely successful attention mechanism. It is also known as Additive Attention.
This lesson explains:
- What Bahdanau Attention is
- How it computes attention scores
- Why it works better than basic Seq2Seq
Why Bahdanau Attention Was Introduced
Basic encoder–decoder models had two major issues:
- Single fixed context vector
- Poor performance on long sentences
Bahdanau Attention introduced a way for the decoder to:
- Access all encoder hidden states
- Decide relevance dynamically
- Create a context vector for each output step
This was a major breakthrough in NLP.
What Makes Bahdanau Attention “Additive”?
Bahdanau Attention computes attention scores using:
- Decoder hidden state
- Encoder hidden state
- A small neural network
The score is calculated by:
- Adding transformed encoder and decoder states
- Applying a non-linearity (tanh)
- Projecting to a scalar score
That is why it is called Additive Attention.
Core Components of Bahdanau Attention
For each decoding step, Bahdanau Attention uses:
- Encoder hidden states: h₁, h₂, …, hₙ
- Decoder hidden state: sₜ
- Trainable weight matrices
These components work together to compute relevance scores.
Step-by-Step Working (High-Level)
At decoding time step t:
- Compare decoder state with each encoder state
- Compute alignment (attention) scores
- Normalize scores using softmax
- Create weighted context vector
- Generate output word
This repeats for every output token.
Bahdanau Attention Score Function
The attention score is computed as:
score(sₜ, hᵢ) = vᵀ tanh(W₁sₜ + W₂hᵢ)
Where:
- sₜ = decoder hidden state at output step t
- hᵢ = encoder hidden state for input position i
- W₁, W₂, v = trainable parameters
This neural scoring function learns alignment automatically.
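As a minimal NumPy sketch of this scoring function (the hidden size `d`, attention size `d_a`, and random parameter values are illustrative assumptions, not values from any trained model):

```python
import numpy as np

d, d_a = 4, 8                     # hidden size and attention size (illustrative)
rng = np.random.default_rng(0)

W1 = rng.normal(size=(d_a, d))    # transforms the decoder state s_t
W2 = rng.normal(size=(d_a, d))    # transforms an encoder state h_i
v  = rng.normal(size=(d_a,))      # projects the tanh output to a scalar

def additive_score(s_t, h_i):
    """score(s_t, h_i) = v^T tanh(W1 s_t + W2 h_i)"""
    return v @ np.tanh(W1 @ s_t + W2 @ h_i)

s_t = rng.normal(size=(d,))       # decoder hidden state
h_i = rng.normal(size=(d,))       # one encoder hidden state
print(additive_score(s_t, h_i))   # a single scalar alignment score
```

Note the "additive" part: the two transformed states are summed inside the tanh, then reduced to one number per encoder position.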
From Scores to Attention Weights
Raw scores are converted into probabilities by applying softmax across all encoder positions i:
αₜ,ᵢ = exp(score(sₜ, hᵢ)) / Σⱼ exp(score(sₜ, hⱼ))
These weights represent how much attention the decoder gives to each input position at step t.
All attention weights sum to 1.
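The normalization step can be sketched as follows (the score values here are made up for illustration):

```python
import numpy as np

def softmax(x):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])   # hypothetical alignment scores
weights = softmax(scores)
print(weights)         # probabilities over the encoder positions
print(weights.sum())   # sums to 1
```

Higher raw scores yield proportionally larger weights, but every position always receives some nonzero attention.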
Context Vector Construction
The context vector is computed as:
cₜ = Σᵢ αₜ,ᵢ · hᵢ
This vector captures the most relevant information needed to generate the next word.
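As a sketch of this weighted sum, with made-up attention weights and encoder states:

```python
import numpy as np

# three encoder hidden states of size 4 (illustrative values)
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
alpha = np.array([0.7, 0.2, 0.1])   # attention weights for step t

c_t = alpha @ H    # weighted sum: sum_i alpha_i * h_i
print(c_t)         # → [0.7 0.2 0.1 0. ]
```

Because the weights sum to 1, the context vector is a convex combination of the encoder states, dominated by the most relevant position.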
Why Bahdanau Attention Works Well
Bahdanau Attention improves performance because:
- Alignment is learned, not fixed
- Context adapts per output token
- Gradient flow improves
- Long-range dependencies are handled better
This dramatically improved translation quality.
Conceptual Pseudocode
This pseudocode explains the logic, not framework syntax.
for each decoder step t:
    scores = []
    for each encoder state hᵢ:
        scores.append(vᵀ tanh(W₁ · sₜ + W₂ · hᵢ))
    attention_weights = softmax(scores)
    context_vector = Σᵢ attention_weights[i] · hᵢ
    output = decoder(context_vector, sₜ)
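The pseudocode above can be turned into a runnable NumPy sketch of a single attention step. All dimensions and random parameters below are illustrative; a real model would learn them, and a real decoder would consume the context vector together with its own state:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, d_a = 5, 4, 8                 # source length, hidden size, attention size

H   = rng.normal(size=(n, d))       # encoder hidden states h_1..h_n
s_t = rng.normal(size=(d,))         # current decoder hidden state

W1 = rng.normal(size=(d_a, d))      # illustrative random parameters
W2 = rng.normal(size=(d_a, d))
v  = rng.normal(size=(d_a,))

# 1. Alignment scores for every encoder position (vectorized over i)
scores = np.tanh(s_t @ W1.T + H @ W2.T) @ v   # shape (n,)

# 2. Normalize scores into attention weights
e = np.exp(scores - scores.max())
alpha = e / e.sum()                           # shape (n,), sums to 1

# 3. Weighted context vector
c_t = alpha @ H                               # shape (d,)

print(alpha)   # attention distribution over source positions
print(c_t)     # context vector fed to the decoder
```

Repeating these three steps at every output token is exactly what makes the context vector dynamic rather than fixed.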
Bahdanau Attention vs No Attention
| Aspect | No Attention | Bahdanau Attention |
|---|---|---|
| Context | Single fixed vector | Dynamic per output |
| Alignment | Implicit | Explicit & learned |
| Long sentences | Poor | Strong |
Real-World Usage
Bahdanau Attention is used in:
- Early neural machine translation systems
- Speech recognition
- Text summarization
It laid the foundation for all modern attention models.
Assignment / Homework
Theory:
- Explain additive attention in your own words
- Compare Bahdanau and basic Seq2Seq
Practical:
- Implement Bahdanau attention using NumPy
- Visualize attention weights as a heatmap
Environment:
- Google Colab
- Jupyter Notebook
Practice Questions
Q1. Why is Bahdanau attention called additive?
Q2. What does the context vector represent?
Quick Quiz
Q1. Which activation function is used in Bahdanau scoring?
Q2. Are attention weights learned or fixed?
Quick Recap
- Bahdanau Attention is additive attention
- Uses a neural network to compute alignment
- Creates dynamic context vectors
- Improves long-sequence modeling
- Foundation for modern NLP attention
Next lesson: Luong (Multiplicative) Attention