
Luong Attention (Multiplicative Attention)

In the previous lesson, you learned Bahdanau Attention, also called Additive Attention. That model introduced the idea of learning alignment using a small neural network.

In this lesson, we study another important attention mechanism: Luong Attention, also known as Multiplicative Attention.

Luong Attention is simpler, faster, and widely used in practice, especially when computational efficiency matters.


Why Luong Attention Was Introduced

While Bahdanau Attention works very well, it has one limitation:

  • It uses an additional feedforward network to compute each alignment score, which adds parameters and computation

Luong Attention was introduced to:

  • Reduce computation
  • Simplify attention scoring
  • Improve training speed

Instead of using an additive neural network, Luong Attention relies on vector multiplication.


Key Idea Behind Luong Attention

Luong Attention computes relevance by measuring similarity between:

  • Decoder hidden state
  • Encoder hidden states

The more similar they are, the higher the attention score.

This is similar to measuring how closely two vectors point in the same direction.
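As a tiny illustration (the vectors here are made up), the dot product is large when two vectors point the same way and small otherwise:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])   # points in the same direction as a
c = np.array([-1.0, 1.0])  # points in a very different direction

# The dot product is high for aligned vectors, low otherwise
print(a @ b)  # 10.0 (high similarity)
print(a @ c)  # 1.0  (low similarity)
```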


Main Components Used

Luong Attention uses:

  • Encoder hidden states (h₁, h₂, …, hₙ)
  • Decoder hidden state (sₜ)
  • Optional trainable weight matrix (W)

No extra neural network is required.


Luong Attention Score Functions

Luong proposed three scoring functions. The first two are purely multiplicative; the third is a concatenation-based hybrid.


1. Dot Product Attention

The simplest form of Luong Attention.

score(sₜ, hᵢ) = sₜᵀ · hᵢ

Here:

  • No extra parameters
  • Fast computation
  • Works well when dimensions match
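A minimal NumPy sketch of dot-product scoring, assuming a hidden size of 4 and randomly generated states (all values here are illustrative, not from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                            # hidden size (illustrative)
s_t = rng.standard_normal(d)     # decoder hidden state s_t
H = rng.standard_normal((5, d))  # five encoder hidden states, one per row

# Dot-product score: one scalar per encoder state, no parameters needed
scores = H @ s_t                 # scores[i] = s_t . h_i, shape (5,)
```

One matrix-vector product scores every encoder position at once, which is exactly why this variant is so cheap.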

2. General Attention

This introduces a trainable weight matrix.

score(sₜ, hᵢ) = sₜᵀ · W · hᵢ

This allows the model to learn a better similarity measure.
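The general score can be sketched the same way; the matrix W below is random for illustration, whereas in a real model it would be learned:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
s_t = rng.standard_normal(d)        # decoder hidden state
H = rng.standard_normal((5, d))     # encoder hidden states (rows)
W = rng.standard_normal((d, d))     # stand-in for the learned matrix W

# General score for every encoder state at once:
# scores[i] = (s_t^T W) h_i, where h_i is row i of H
scores = (s_t @ W) @ H.T            # shape (5,)
```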


3. Concatenation Attention (Luong Variant)

A hybrid approach that concatenates the decoder and encoder states and scores them with a small feedforward layer:

score(sₜ, hᵢ) = vᵀ · tanh(W · [sₜ ; hᵢ])

This closely resembles Bahdanau's additive scoring and is the least common of the three variants in practice.


From Scores to Attention Weights

Like all attention mechanisms, Luong Attention uses softmax to normalize scores:

αₜ,ᵢ = softmax(score(sₜ, hᵢ)), where the softmax is taken over all encoder positions i

These weights represent how much focus each input word receives.
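A small sketch of the normalization step, using a hand-rolled, numerically stable softmax (the scores here are made up):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # made-up alignment scores
weights = softmax(scores)
# weights are non-negative, sum to 1, and preserve the score ordering
```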


Context Vector Computation

The context vector is calculated as:

cₜ = Σᵢ αₜ,ᵢ · hᵢ

This is identical to Bahdanau Attention. The difference lies only in how the scores are computed.
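For example, with three two-dimensional encoder states and made-up attention weights, the context vector is just their weighted sum:

```python
import numpy as np

# Three two-dimensional encoder states (one per row), made up for illustration
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
alpha = np.array([0.5, 0.3, 0.2])   # attention weights for this decoder step

c_t = alpha @ H                     # weighted sum of the encoder states
print(c_t)                          # [0.7 0.5]
```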


Why Luong Attention Is Faster

Luong Attention is computationally efficient because:

  • No additional neural network
  • Matrix multiplication is optimized on GPUs
  • Fewer parameters to train

This makes it attractive for large datasets.


Conceptual Pseudocode

This pseudocode shows the logical flow.

Luong Attention – Conceptual Flow
for each decoder_step:
    scores = []
    for each encoder_state:
        scores.append(dot(decoder_state, encoder_state))

    attention_weights = softmax(scores)
    context_vector = sum(attention_weights * encoder_states)

    output = decoder(context_vector, decoder_state)
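The flow above (dot-product variant, without the final decoder call) can be written as a runnable NumPy sketch for a single decoder step; the shapes and values are illustrative:

```python
import numpy as np

def luong_dot_attention(decoder_state, encoder_states):
    """One decoder step of dot-product Luong attention (illustrative sketch)."""
    scores = encoder_states @ decoder_state       # one score per position
    e = np.exp(scores - scores.max())             # numerically stable softmax
    attention_weights = e / e.sum()
    context_vector = attention_weights @ encoder_states
    return context_vector, attention_weights

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((6, 8))      # 6 positions, hidden size 8
decoder_state = rng.standard_normal(8)
context_vector, attention_weights = luong_dot_attention(decoder_state,
                                                        encoder_states)
```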

Luong vs Bahdanau Attention

Aspect      | Bahdanau            | Luong
Scoring     | Additive (NN-based) | Multiplicative
Speed       | Slower              | Faster
Parameters  | More                | Fewer
Best use    | Smaller datasets    | Large-scale training

Real-World Usage

Luong Attention is commonly used in:

  • Neural Machine Translation
  • Speech recognition systems
  • Large-scale NLP pipelines

Many early production systems preferred Luong Attention for its speed advantage.


Assignment / Homework

Theory:

  • Explain why Luong Attention is faster than Bahdanau
  • List the three Luong scoring methods

Practical:

  • Implement dot-product attention using NumPy
  • Compare outputs with additive attention

Environment:

  • Google Colab
  • Jupyter Notebook

Practice Questions

Q1. Why is Luong Attention called multiplicative?

Because it computes attention scores using vector multiplication.

Q2. Which Luong variant has no trainable parameters?

Dot-product attention.

Quick Quiz

Q1. Which attention mechanism is faster in practice?

Luong (Multiplicative) Attention.

Q2. Does Luong Attention use a separate neural network for scoring?

No.

Quick Recap

  • Luong Attention uses vector multiplication
  • It is faster and simpler than Bahdanau
  • Dot, general, and concat variants exist
  • Context vector logic remains the same
  • Widely used in production systems

Next lesson: Machine Translation with Attention