
Luong Attention (Multiplicative Attention)

In the previous lesson, you learned Bahdanau Attention, also called Additive Attention. That model introduced the idea of learning alignment using a small neural network.

In this lesson, we study another important attention mechanism: Luong Attention, also known as Multiplicative Attention.

Luong Attention is simpler, faster, and widely used in practice, especially when computational efficiency matters.


Why Luong Attention Was Introduced

While Bahdanau Attention works very well, it has one limitation:

  • It uses an additional feedforward network to compute each alignment score, which adds parameters and computation

Luong Attention was introduced to:

  • Reduce computation
  • Simplify attention scoring
  • Improve training speed

Instead of using an additive neural network, Luong Attention relies on vector multiplication.


Key Idea Behind Luong Attention

Luong Attention computes relevance by measuring similarity between:

  • Decoder hidden state
  • Encoder hidden states

The more similar they are, the higher the attention score.

This is similar to measuring how closely two vectors point in the same direction.
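As a tiny illustration (the vectors here are made up), the dot product is large when two vectors point the same way and small otherwise:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])   # points in the same direction as a
c = np.array([-1.0, 1.0])  # points in a very different direction

# The dot product is high for aligned vectors, low otherwise
print(a @ b)  # 10.0 (high similarity)
print(a @ c)  # 1.0  (low similarity)
```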


Main Components Used

Luong Attention uses:

  • Encoder hidden states (h₁, h₂, …, hₙ)
  • Decoder hidden state (sₜ)
  • Optional trainable weight matrix (W)

No extra neural network is required.


Luong Attention Score Functions

Luong proposed three scoring functions. The first two are purely multiplicative; the third is a concatenation-based hybrid.


1. Dot Product Attention

The simplest form of Luong Attention.

score(sₜ, hᵢ) = sₜᵀ · hᵢ

Here:

  • No extra parameters
  • Fast computation
  • Works well when dimensions match
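A minimal NumPy sketch of dot-product scoring, assuming a hidden size of 4 and randomly generated states (all values here are illustrative, not from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                            # hidden size (illustrative)
s_t = rng.standard_normal(d)     # decoder hidden state s_t
H = rng.standard_normal((5, d))  # five encoder hidden states, one per row

# Dot-product score: one scalar per encoder state, no parameters needed
scores = H @ s_t                 # scores[i] = s_t . h_i, shape (5,)
```

One matrix-vector product scores every encoder position at once, which is exactly why this variant is so cheap.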

2. General Attention

This introduces a trainable weight matrix.

score(sₜ, hᵢ) = sₜᵀ · W · hᵢ

This allows the model to learn a better similarity measure.
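The general score can be sketched the same way; the matrix W below is random for illustration, whereas in a real model it would be learned:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
s_t = rng.standard_normal(d)        # decoder hidden state
H = rng.standard_normal((5, d))     # encoder hidden states (rows)
W = rng.standard_normal((d, d))     # stand-in for the learned matrix W

# General score for every encoder state at once:
# scores[i] = (s_t^T W) h_i, where h_i is row i of H
scores = (s_t @ W) @ H.T            # shape (5,)
```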


3. Concatenation Attention (Luong Variant)

A hybrid approach that concatenates the decoder and encoder states and scores them with a small feedforward layer:

score(sₜ, hᵢ) = vᵀ · tanh(W · [sₜ ; hᵢ])

This closely resembles Bahdanau's additive scoring and is the least common of the three variants in practice.


From Scores to Attention Weights

Like all attention mechanisms, Luong Attention uses softmax to normalize scores:

αₜ,ᵢ = softmax(score(sₜ, hᵢ)), where the softmax is taken over all encoder positions i

These weights represent how much focus each input word receives.
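A small sketch of the normalization step, using a hand-rolled, numerically stable softmax (the scores here are made up):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # made-up alignment scores
weights = softmax(scores)
# weights are non-negative, sum to 1, and preserve the score ordering
```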


Context Vector Computation

The context vector is calculated as:

cₜ = Σᵢ αₜ,ᵢ · hᵢ

This is identical to Bahdanau Attention. The difference lies only in how the scores are computed.
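For example, with three two-dimensional encoder states and made-up attention weights, the context vector is just their weighted sum:

```python
import numpy as np

# Three two-dimensional encoder states (one per row), made up for illustration
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
alpha = np.array([0.5, 0.3, 0.2])   # attention weights for this decoder step

c_t = alpha @ H                     # weighted sum of the encoder states
print(c_t)                          # [0.7 0.5]
```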


Why Luong Attention Is Faster

Luong Attention is computationally efficient because:

  • No additional neural network
  • Matrix multiplication is optimized on GPUs
  • Fewer parameters to train

This makes it attractive for large datasets.


Conceptual Pseudocode

This pseudocode shows the logical flow.

Luong Attention – Conceptual Flow
for each decoder_step:
    scores = []
    for each encoder_state:
        scores.append(dot(decoder_state, encoder_state))

    attention_weights = softmax(scores)
    context_vector = sum(attention_weights * encoder_states)

    output = decoder(context_vector, decoder_state)
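The flow above (dot-product variant, without the final decoder call) can be written as a runnable NumPy sketch for a single decoder step; the shapes and values are illustrative:

```python
import numpy as np

def luong_dot_attention(decoder_state, encoder_states):
    """One decoder step of dot-product Luong attention (illustrative sketch)."""
    scores = encoder_states @ decoder_state       # one score per position
    e = np.exp(scores - scores.max())             # numerically stable softmax
    attention_weights = e / e.sum()
    context_vector = attention_weights @ encoder_states
    return context_vector, attention_weights

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((6, 8))      # 6 positions, hidden size 8
decoder_state = rng.standard_normal(8)
context_vector, attention_weights = luong_dot_attention(decoder_state,
                                                        encoder_states)
```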

Luong vs Bahdanau Attention

Aspect      | Bahdanau            | Luong
Scoring     | Additive (NN-based) | Multiplicative
Speed       | Slower              | Faster
Parameters  | More                | Fewer
Best use    | Smaller datasets    | Large-scale training

Real-World Usage

Luong Attention is commonly used in:

  • Neural Machine Translation
  • Speech recognition systems
  • Large-scale NLP pipelines

Many early production systems preferred Luong Attention for its speed advantage.


Assignment / Homework

Theory:

  • Explain why Luong Attention is faster than Bahdanau
  • List the three Luong scoring methods

Practical:

  • Implement dot-product attention using NumPy
  • Compare outputs with additive attention

Environment:

  • Google Colab
  • Jupyter Notebook

Practice Questions

Q1. Why is Luong Attention called multiplicative?

Because it computes attention scores using vector multiplication.

Q2. Which Luong variant has no trainable parameters?

Dot-product attention.

Quick Quiz

Q1. Which attention mechanism is faster in practice?

Luong (Multiplicative) Attention.

Q2. Does Luong Attention use a separate neural network for scoring?

No.

Quick Recap

  • Luong Attention uses vector multiplication
  • It is faster and simpler than Bahdanau
  • Dot, general, and concat variants exist
  • Context vector logic remains the same
  • Widely used in production systems

Next lesson: Machine Translation with Attention