Generative AI Course
Self-Attention: The Core Mechanism Inside Transformers
Transformers work because of one idea done extremely well: self-attention.
This lesson explains how self-attention actually operates, how engineers reason about it before writing code, and how it transforms raw tokens into meaningful representations.
The Question Self-Attention Answers
When processing a sequence, the model must decide:
Which parts of this sequence matter for understanding this token?
Self-attention allows every token to look at every other token and decide how much influence it should have.
Why Simple Context Is Not Enough
Consider the sentence:
“The bank approved the loan.”
The meaning of “bank” depends on other words.
Self-attention lets the model connect “bank” with “loan” directly, even if they are far apart in longer sentences.
How Engineers Think Before Coding Attention
Before implementation, engineers break attention into steps:
- Represent tokens numerically
- Measure similarity between tokens
- Weight information based on relevance
- Combine results into new representations
These steps guide the math and the code.
Query, Key, and Value Intuition
Self-attention uses three projections:
- Query: what this token is looking for
- Key: what each token offers
- Value: the information to extract
A token compares its query with all keys to decide which values matter.
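In a real Transformer layer, Q, K, and V are produced by learned linear projections of the token embeddings. Here is a minimal sketch of that step, assuming an illustrative embedding size of 8 and untrained (random) projection weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8                    # embedding size (illustrative choice)
x = torch.randn(4, d_model)    # 4 token embeddings

# learned projections (weights here are random and untrained)
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)  # each torch.Size([4, 8])
```

The key point is that the same token embedding yields three different views of itself, one per role.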
Attention at a Conceptual Level
For one token:
- Compare its query with all keys
- Compute similarity scores
- Normalize scores into probabilities
- Create a weighted sum of values
This happens for every token in parallel.
Minimal Attention Calculation (Step-by-Step)
Start with a small example to understand mechanics.
```python
import torch
import torch.nn.functional as F

# pretend token embeddings (4 tokens, 8 dimensions each)
Q = torch.randn(4, 8)  # queries
K = torch.randn(4, 8)  # keys
V = torch.randn(4, 8)  # values

scores = Q @ K.T                     # (4, 4): one score per query-key pair
weights = F.softmax(scores, dim=-1)  # normalize each row into probabilities
output = weights @ V                 # weighted sum of values per token
```
This code shows the mathematical heart of self-attention.
What matters is not the exact numbers, but how relationships are formed.
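One quick way to confirm these mechanics is to check the shapes and verify that every token's attention weights form a probability distribution, a self-contained sanity check using the same random setup as above:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = torch.randn(4, 8)
K = torch.randn(4, 8)
V = torch.randn(4, 8)

scores = Q @ K.T                     # (4, 4) similarity scores
weights = F.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                 # (4, 8) new token representations

# every token distributes exactly 100% of its attention across all tokens
print(torch.allclose(weights.sum(dim=-1), torch.ones(4)))  # True
print(output.shape)                                        # torch.Size([4, 8])
```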
What Actually Happens Inside the Code
Breaking it down:
- `Q @ K.T` measures similarity between tokens
- Softmax converts similarity into attention weights
- Weights control how much each value contributes
Tokens dynamically exchange information based on relevance.
Why Scaling Is Required
As the dimension of the query and key vectors grows, their dot products grow in magnitude.
Large scores push the softmax into near-saturated regions, where gradients become tiny and training is unstable.
To fix this, attention divides the scores by the square root of the key dimension:
```python
import math

scores = (Q @ K.T) / math.sqrt(Q.size(-1))  # scale by sqrt of the key dimension
weights = F.softmax(scores, dim=-1)
```
This simple division keeps training stable.
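To see why the division matters, compare the spread of raw and scaled scores at a larger, more realistic dimension. This is a rough empirical sketch (the dimension 512 and sequence length 1000 are arbitrary choices for illustration):

```python
import math
import torch

torch.manual_seed(0)
d = 512                      # a realistic model/head dimension
Q = torch.randn(1000, d)
K = torch.randn(1000, d)

raw = Q @ K.T
scaled = raw / math.sqrt(d)

# for unit-variance inputs, raw score variance grows with d,
# while scaling brings the spread back near 1
print(raw.std().item())      # roughly sqrt(512), about 22.6
print(scaled.std().item())   # roughly 1.0
```

Without the division, the softmax would see scores tens of units apart and collapse to near one-hot weights.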
Multi-Head Self-Attention
One attention head captures one type of relationship.
Multiple heads allow the model to:
- Track syntax
- Track semantics
- Track long-range dependencies
Each head attends differently to the same input.
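The idea can be sketched by splitting the model dimension across heads, attending in each head separately, and concatenating the results. This minimal version uses two heads of four dimensions each, with random stand-ins for the projected Q, K, and V:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, n_heads = 4, 8, 2
head_dim = d_model // n_heads

# reshape (seq_len, d_model) into per-head tensors (n_heads, seq_len, head_dim)
def split_heads(t):
    return t.view(seq_len, n_heads, head_dim).transpose(0, 1)

Q = split_heads(torch.randn(seq_len, d_model))  # stand-ins for projected Q/K/V
K = split_heads(torch.randn(seq_len, d_model))
V = split_heads(torch.randn(seq_len, d_model))

scores = Q @ K.transpose(-2, -1) / head_dim ** 0.5  # per-head scores
weights = F.softmax(scores, dim=-1)                 # per-head attention weights
heads = weights @ V                                 # (n_heads, seq_len, head_dim)

# concatenate heads back to the full model dimension
out = heads.transpose(0, 1).reshape(seq_len, d_model)
print(out.shape)  # torch.Size([4, 8])
```

Each head computes its own attention pattern over the same tokens; the concatenation lets the next layer combine all of them.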
Why Self-Attention Scales So Well
Self-attention processes all tokens in parallel.
This enables:
- Fast training on GPUs
- Long-context modeling
- Massive model scaling
This is why LLMs are possible today.
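Because attention is just matrix multiplications, an entire batch of sequences can be processed in one call, which is what makes GPU training efficient. A sketch with arbitrary illustrative sizes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, d = 32, 128, 64
Q = torch.randn(batch, seq_len, d)
K = torch.randn(batch, seq_len, d)
V = torch.randn(batch, seq_len, d)

# one batched matmul computes scores for every token of every sequence at once
scores = Q @ K.transpose(-2, -1) / d ** 0.5  # (32, 128, 128)
weights = F.softmax(scores, dim=-1)
out = weights @ V                            # (32, 128, 64)
print(out.shape)
```

No token waits for another: all positions are computed simultaneously, unlike a recurrent network that must step through the sequence one token at a time.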
Common Mistakes Learners Make
- Thinking attention is static
- Ignoring the scaling factor
- Confusing values with keys
Attention is dynamic and input-dependent.
Practice
Which component represents what a token is looking for?
Which operation normalizes attention scores?
Why does self-attention scale well?
Quick Quiz
What does the query represent?
Why divide the scores by the square root of the dimension?
Why use multiple attention heads?
Recap: Self-attention allows tokens to dynamically exchange information based on relevance.
Next up: Positional Encoding — how Transformers understand order without recurrence.