DL Lesson 53 – Attention Mechanism | Dataplexa

Attention Mechanism

The attention mechanism was introduced to solve a fundamental limitation of traditional encoder–decoder and Seq2Seq models.

In earlier models, the entire input sequence was compressed into a single fixed-length context vector. This worked for short sequences but failed for longer and more complex inputs.

Attention allows the model to dynamically focus on different parts of the input sequence while generating each output token.


Why Attention Was Needed

Consider translating a long sentence.

Not every word in the input is equally important for every word in the output.

A fixed context vector forces the decoder to rely on the same compressed information at every decoding step.

Attention removes this restriction by allowing the decoder to selectively look back at relevant encoder states.


Core Idea of Attention

At each decoding step, the model asks:

"Which parts of the input sequence should I focus on right now?"

Instead of using a single context vector, attention computes a new context vector for every output token.

This context vector is a weighted sum of encoder hidden states.


Attention as a Weighted Sum

The attention mechanism assigns a weight to each encoder hidden state.

These weights represent how relevant each input token is to the current output token being generated.

Mathematically:

• Encoder hidden states: h₁, h₂, …, hₙ
• Attention weights: α₁, α₂, …, αₙ

The context vector is computed as:

context = α₁h₁ + α₂h₂ + … + αₙhₙ
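A minimal numeric sketch of this weighted sum, with two encoder states and made-up weights (all values here are hypothetical, chosen only to illustrate the formula):

```python
import numpy as np

# Two encoder hidden states of size 3 (hypothetical values)
h1 = np.array([1.0, 0.0, 2.0])
h2 = np.array([0.0, 1.0, 1.0])

# Attention weights; a softmax guarantees they sum to 1
alpha1, alpha2 = 0.7, 0.3

# context = α₁h₁ + α₂h₂
context = alpha1 * h1 + alpha2 * h2
print(context)  # [0.7 0.3 1.7]
```

Note that the context vector lives in the same space as the encoder states; it is simply a blend of them, dominated by the states with the largest weights.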


How Attention Weights Are Computed

The decoder hidden state is compared with each encoder hidden state to compute a relevance score.

Common scoring methods include:

• Dot-product attention
• Scaled dot-product attention
• Additive (Bahdanau) attention

These scores are then normalized using the softmax function to produce attention weights.
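The three scoring functions can be sketched as follows. This is an illustrative sketch, not a library API; the function names, and the weight matrices `W_q`, `W_k`, and vector `v` in the additive scorer, are assumptions introduced here for clarity:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating, for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_score(q, k):
    # Dot-product attention: raw similarity between query and key
    return q @ k

def scaled_dot_score(q, k):
    # Scaled dot-product: divide by sqrt(dim) so scores stay moderate
    return (q @ k) / np.sqrt(len(k))

def additive_score(q, k, W_q, W_k, v):
    # Additive (Bahdanau) attention: a small feed-forward scorer
    return v @ np.tanh(W_q @ q + W_k @ k)

# Example: score one query against four one-hot "encoder" keys
query = np.ones(4)
keys = np.eye(4)
scores = np.array([scaled_dot_score(query, k) for k in keys])
weights = softmax(scores)  # uniform here, since every key scores equally
```

Whichever scorer is used, the final step is the same: softmax turns the raw scores into a probability distribution over input positions.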


Step-by-Step Attention Flow

1. Decoder generates a query vector from its current state

2. Query is compared with all encoder key vectors

3. Scores are normalized into attention weights

4. Weighted sum of encoder values forms the context vector

5. Context vector helps predict the next output token
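The five steps above can be collapsed into one small function. This is a toy sketch using dot-product attention with random data; the names `attention`, `keys`, and `values` are illustrative assumptions, and in a basic Seq2Seq model the keys and values are both just the encoder hidden states:

```python
import numpy as np

def attention(query, keys, values):
    # Steps 1-2: compare the query with every key to get relevance scores
    scores = keys @ query
    # Step 3: normalize scores into attention weights (softmax)
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    # Step 4: weighted sum of values forms the context vector
    context = weights @ values
    # Step 5: the context vector then feeds the decoder's output layer
    return context, weights

keys = values = np.random.rand(5, 8)  # 5 encoder states, hidden size 8
query = np.random.rand(8)             # current decoder query
context, weights = attention(query, keys, values)
```

The returned `weights` always form a valid probability distribution over the five input positions, and `context` has the same size as a single encoder state.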


Conceptual Attention Code

import numpy as np

# Toy setup: 4 encoder hidden states, hidden size 3
encoder_states = np.random.rand(4, 3)
decoder_state = np.random.rand(3)

# Compute attention scores (dot-product attention)
scores = encoder_states @ decoder_state

# Normalize scores into attention weights (softmax)
exp_scores = np.exp(scores - scores.max())
attention_weights = exp_scores / exp_scores.sum()

# Compute the context vector as a weighted sum of encoder states
context_vector = attention_weights @ encoder_states

This context vector changes at every decoding step, allowing the model to adapt dynamically.
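To see this adaptation concretely, the same encoder states attended by two different decoder states produce two different attention distributions. A toy sketch, with hand-picked values chosen so each decoder state clearly "points at" a different input token:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Two encoder hidden states of size 2 (hypothetical values)
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0]])

# Decoder states at two different decoding steps
step1 = np.array([4.0, 0.0])  # aligns with the first input token
step2 = np.array([0.0, 4.0])  # aligns with the second input token

w1 = softmax(encoder_states @ step1)
w2 = softmax(encoder_states @ step2)
# w1 concentrates on token 1, w2 on token 2, so the context shifts per step
```

The same inputs, read through different queries, yield different contexts; this is exactly the flexibility a single fixed context vector lacks.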


Real-World Interpretation

Imagine reading a sentence and translating it word by word.

Your focus shifts depending on the word you are translating.

Attention mimics this human behavior by allowing the model to shift focus dynamically instead of remembering everything at once.


Benefits of Attention

Attention significantly improves:

• Performance on long sequences
• Interpretability of models
• Learning stability

It also enables visualization of which input tokens influence each output token.
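One simple form of this visualization is to inspect, for each output token, which input token received the most weight. The sentence pair and the weight matrix below are entirely made up for illustration:

```python
import numpy as np

# Hypothetical attention weights: 3 output tokens over 4 input tokens
attn = np.array([[0.7, 0.2, 0.1, 0.0],
                 [0.1, 0.6, 0.2, 0.1],
                 [0.0, 0.1, 0.6, 0.3]])

inputs  = ["the", "cat", "sat", "down"]
outputs = ["le", "chat", "s'assit"]

# Print each output token alongside its most-attended input token
for out_tok, row in zip(outputs, attn):
    print(f"{out_tok:8s} -> {inputs[int(row.argmax())]}")
```

Each row of the matrix is one softmax distribution (it sums to 1), which is what makes such alignment plots directly interpretable as "where the model looked".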


Limitations of Basic Attention

While attention improves Seq2Seq models, the underlying recurrent encoder and decoder still process tokens one at a time.

This limits parallelization and training speed.

These limitations directly led to the development of self-attention and transformer architectures.


Mini Thinking Exercise

Think about this:

Why does attention work better than a fixed context vector for long sentences?


Exercises

Exercise 1:
What problem does attention solve in Seq2Seq models?

It allows the model to focus on relevant parts of the input sequence dynamically.

Exercise 2:
What does an attention weight represent?

The relevance of an input token to the current output token.

Quick Check

Q: Does attention remove the need for an encoder?

No. Attention enhances the encoder–decoder relationship; it does not replace the encoder.

Next, we will build upon attention and explore Transformer architectures, which remove recurrence entirely and rely purely on attention mechanisms.