NLP Lesson 39 – Bahdanau Attention | Dataplexa

Bahdanau Attention (Additive Attention)

In the previous lesson, you learned the intuition behind the Attention Mechanism and why it solved the encoder–decoder bottleneck. Now, we move from intuition to a formal attention model.

Bahdanau Attention was the first widely successful attention mechanism, introduced by Bahdanau, Cho, and Bengio (2015) for neural machine translation. It is also known as Additive Attention.

This lesson explains:

  • What Bahdanau Attention is
  • How it computes attention scores
  • Why it works better than basic Seq2Seq

Why Bahdanau Attention Was Introduced

Basic encoder–decoder models had two major issues:

  • Single fixed context vector
  • Poor performance on long sentences

Bahdanau Attention introduced a way for the decoder to:

  • Access all encoder hidden states
  • Decide relevance dynamically
  • Create a context vector for each output step

This was a major breakthrough in NLP.


What Makes Bahdanau Attention “Additive”?

Bahdanau Attention computes attention scores using:

  • Decoder hidden state
  • Encoder hidden state
  • A small neural network

The score is calculated by:

  • Adding transformed encoder and decoder states
  • Applying a non-linearity (tanh)
  • Projecting to a scalar score

That is why it is called Additive Attention.


Core Components of Bahdanau Attention

For each decoding step, Bahdanau Attention uses:

  • Encoder hidden states: h₁, h₂, …, hₙ
  • Decoder hidden state: sₜ
  • Trainable weight matrices

These components work together to compute relevance scores.


Step-by-Step Working (High-Level)

At decoding time step t:

  1. Compare decoder state with each encoder state
  2. Compute alignment (attention) scores
  3. Normalize scores using softmax
  4. Create weighted context vector
  5. Generate output word

This repeats for every output token.


Bahdanau Attention Score Function

The attention score is computed as:

score(sₜ, hᵢ) = vᵀ tanh(W₁sₜ + W₂hᵢ)

Where:

  • sₜ = decoder hidden state at step t
  • hᵢ = encoder hidden state for input position i
  • W₁, W₂, v = trainable parameters

This neural scoring function learns alignment automatically.
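As a sketch, the score function can be written directly in NumPy. All sizes and the random parameter values below are illustrative assumptions; in a real model W₁, W₂, and v are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

dec_dim = 4    # decoder state size (illustrative)
enc_dim = 4    # encoder state size (illustrative)
attn_dim = 8   # attention layer size (illustrative)

# Trainable parameters W1, W2, v (randomly initialised here)
W1 = rng.normal(size=(attn_dim, dec_dim))
W2 = rng.normal(size=(attn_dim, enc_dim))
v = rng.normal(size=(attn_dim,))

def score(s_t, h_i):
    """Additive (Bahdanau) score: v^T tanh(W1 s_t + W2 h_i)."""
    return v @ np.tanh(W1 @ s_t + W2 @ h_i)

s_t = rng.normal(size=(dec_dim,))   # decoder hidden state
h_i = rng.normal(size=(enc_dim,))   # one encoder hidden state
print(score(s_t, h_i))              # a single scalar alignment score
```

Note that the addition happens inside the tanh: the projected decoder and encoder states are summed, which is exactly what makes this scoring function "additive".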


From Scores to Attention Weights

Raw scores are converted into probabilities by applying softmax across all encoder positions:

αₜ,ᵢ = softmax(score(sₜ, hᵢ))

These weights represent:

  • How much attention the decoder gives to each input word

All attention weights sum to 1.
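A minimal NumPy sketch of this normalization step, assuming a small vector of illustrative raw scores:

```python
import numpy as np

def softmax(scores):
    """Normalize raw alignment scores into attention weights."""
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp / exp.sum()

raw_scores = np.array([2.0, 0.5, -1.0])   # illustrative scores for 3 input words
weights = softmax(raw_scores)
print(weights)        # each weight lies in (0, 1)
print(weights.sum())  # weights sum to 1
```

Higher raw scores receive larger weights, so the decoder "attends" most strongly to the best-aligned input positions.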


Context Vector Construction

The context vector is computed as:

cₜ = Σᵢ αₜ,ᵢ · hᵢ

This vector captures the most relevant information needed to generate the next word.
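In NumPy the weighted sum collapses to a single matrix product. The encoder states and weights below are illustrative values chosen so the result is easy to verify by hand:

```python
import numpy as np

# Three encoder hidden states of dimension 4 (illustrative values)
encoder_states = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Attention weights for the current decoding step (sum to 1)
alpha = np.array([0.7, 0.2, 0.1])

# c_t = sum_i alpha_{t,i} * h_i, a weighted average of encoder states
context = alpha @ encoder_states
print(context)  # [0.7 0.2 0.1 0. ]
```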


Why Bahdanau Attention Works Well

Bahdanau Attention improves performance because:

  • Alignment is learned, not fixed
  • Context adapts per output token
  • Gradient flow improves
  • Long-range dependencies are handled better

This dramatically improved translation quality.


Conceptual Pseudocode

The following pseudocode captures the logic, not any specific framework's syntax.

Practice Environment:

  • Google Colab
  • Jupyter Notebook
Bahdanau Attention – Conceptual Flow

for each decoder_step:
    scores = []
    for each encoder_state:
        scores.append(vᵀ tanh(W₁ · decoder_state + W₂ · encoder_state))

    attention_weights = softmax(scores)
    context_vector = sum(attention_weights * encoder_states)

    output = decoder(context_vector, decoder_state)
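The conceptual flow above can be turned into a runnable NumPy sketch for a single decoding step. All dimensions and the random parameters are illustrative assumptions; a trained model would learn W1, W2, and v rather than sample them:

```python
import numpy as np

rng = np.random.default_rng(42)

n, enc_dim, dec_dim, attn_dim = 5, 4, 4, 8  # 5 input words (illustrative sizes)

H = rng.normal(size=(n, enc_dim))      # encoder hidden states h_1..h_n
s_t = rng.normal(size=(dec_dim,))      # current decoder hidden state

W1 = rng.normal(size=(attn_dim, dec_dim))
W2 = rng.normal(size=(attn_dim, enc_dim))
v = rng.normal(size=(attn_dim,))

# Steps 1-2: alignment scores, comparing s_t with every encoder state
scores = np.array([v @ np.tanh(W1 @ s_t + W2 @ h_i) for h_i in H])

# Step 3: normalize with softmax
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Step 4: weighted context vector
context = weights @ H

print(weights)        # one weight per input word, summing to 1
print(context.shape)  # (4,): same dimension as an encoder state
```

In a full decoder this would run once per output token, with a hypothetical decoder cell consuming `context` and `s_t` to produce the next word (step 5).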

Bahdanau Attention vs No Attention

  Aspect          | No Attention         | Bahdanau Attention
  Context         | Single fixed vector  | Dynamic per output step
  Alignment       | Implicit             | Explicit and learned
  Long sentences  | Poor                 | Strong

Real-World Usage

Bahdanau Attention is used in:

  • Early neural machine translation systems
  • Speech recognition
  • Text summarization

It laid the foundation for all modern attention models.


Assignment / Homework

Theory:

  • Explain additive attention in your own words
  • Compare Bahdanau and basic Seq2Seq

Practical:

  • Implement Bahdanau attention using NumPy
  • Visualize attention weights as a heatmap

Environment:

  • Google Colab
  • Jupyter Notebook

Practice Questions

Q1. Why is Bahdanau attention called additive?

Because it adds transformed encoder and decoder states before scoring.

Q2. What does the context vector represent?

A weighted summary of encoder states relevant to the current output step.

Quick Quiz

Q1. Which activation function is used in Bahdanau scoring?

tanh

Q2. Are attention weights learned or fixed?

Learned dynamically during training.

Quick Recap

  • Bahdanau Attention is additive attention
  • Uses a neural network to compute alignment
  • Creates dynamic context vectors
  • Improves long-sequence modeling
  • Foundation for modern NLP attention

Next lesson: Luong (Multiplicative) Attention