Generative AI Course
Self-Attention: The Core Mechanism Inside Transformers
Transformers work because of one idea done extremely well: self-attention.
This lesson explains how self-attention actually operates, how engineers reason about it before writing code, and how it transforms raw tokens into meaningful representations.
The Question Self-Attention Answers
When processing a sequence, the model must decide:
Which parts of this sequence matter for understanding this token?
Self-attention allows every token to look at every other token and decide how much influence it should have.
Why Simple Context Is Not Enough
Consider the sentence:
“The bank approved the loan.”
The meaning of “bank” depends on other words.
Self-attention lets the model connect “bank” with “loan” directly, even if they are far apart in longer sentences.
How Engineers Think Before Coding Attention
Before implementation, engineers break attention into steps:
- Represent tokens numerically
- Measure similarity between tokens
- Weight information based on relevance
- Combine results into new representations
These steps guide the math and the code.
Query, Key, and Value Intuition
Self-attention uses three projections:
- Query: what this token is looking for
- Key: what each token offers
- Value: the information to extract
A token compares its query with all keys to decide which values matter.
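In a real Transformer layer, Q, K, and V are produced by learned linear projections of the token embeddings. Here is a minimal sketch of that step, assuming an illustrative embedding size of 8 and untrained (random) projection weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8                    # embedding size (illustrative choice)
x = torch.randn(4, d_model)    # 4 token embeddings

# learned projections (weights here are random and untrained)
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)  # each torch.Size([4, 8])
```

The key point is that the same token embedding yields three different views of itself, one per role.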
Attention at a Conceptual Level
For one token:
- Compare its query with all keys
- Compute similarity scores
- Normalize scores into probabilities
- Create a weighted sum of values
This happens for every token in parallel.
Minimal Attention Calculation (Step-by-Step)
Start with a small example to understand mechanics.
```python
import torch
import torch.nn.functional as F

# pretend token embeddings (4 tokens, 8 dimensions each)
Q = torch.randn(4, 8)  # queries
K = torch.randn(4, 8)  # keys
V = torch.randn(4, 8)  # values

scores = Q @ K.T                     # (4, 4): one score per query-key pair
weights = F.softmax(scores, dim=-1)  # normalize each row into probabilities
output = weights @ V                 # weighted sum of values per token
```
This code shows the mathematical heart of self-attention.
What matters is not the exact numbers, but how relationships are formed.
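One quick way to confirm these mechanics is to check the shapes and verify that every token's attention weights form a probability distribution, a self-contained sanity check using the same random setup as above:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = torch.randn(4, 8)
K = torch.randn(4, 8)
V = torch.randn(4, 8)

scores = Q @ K.T                     # (4, 4) similarity scores
weights = F.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                 # (4, 8) new token representations

# every token distributes exactly 100% of its attention across all tokens
print(torch.allclose(weights.sum(dim=-1), torch.ones(4)))  # True
print(output.shape)                                        # torch.Size([4, 8])
```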
What Actually Happens Inside the Code
Breaking it down:
- `Q @ K.T` measures similarity between tokens
- Softmax converts similarity into attention weights
- Weights control how much each value contributes
Tokens dynamically exchange information based on relevance.
Why Scaling Is Required
As the dimension of the query and key vectors grows, their dot products grow in magnitude.
Large scores push the softmax into near-saturated regions, where gradients become tiny and training is unstable.
To fix this, attention divides the scores by the square root of the key dimension:
```python
import math

scores = (Q @ K.T) / math.sqrt(Q.size(-1))  # scale by sqrt of the key dimension
weights = F.softmax(scores, dim=-1)
```
This simple division keeps training stable.
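To see why the division matters, compare the spread of raw and scaled scores at a larger, more realistic dimension. This is a rough empirical sketch (the dimension 512 and sequence length 1000 are arbitrary choices for illustration):

```python
import math
import torch

torch.manual_seed(0)
d = 512                      # a realistic model/head dimension
Q = torch.randn(1000, d)
K = torch.randn(1000, d)

raw = Q @ K.T
scaled = raw / math.sqrt(d)

# for unit-variance inputs, raw score variance grows with d,
# while scaling brings the spread back near 1
print(raw.std().item())      # roughly sqrt(512), about 22.6
print(scaled.std().item())   # roughly 1.0
```

Without the division, the softmax would see scores tens of units apart and collapse to near one-hot weights.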
Multi-Head Self-Attention
One attention head captures one type of relationship.
Multiple heads allow the model to:
- Track syntax
- Track semantics
- Track long-range dependencies
Each head attends differently to the same input.
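The idea can be sketched by splitting the model dimension across heads, attending in each head separately, and concatenating the results. This minimal version uses two heads of four dimensions each, with random stand-ins for the projected Q, K, and V:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, n_heads = 4, 8, 2
head_dim = d_model // n_heads

# reshape (seq_len, d_model) into per-head tensors (n_heads, seq_len, head_dim)
def split_heads(t):
    return t.view(seq_len, n_heads, head_dim).transpose(0, 1)

Q = split_heads(torch.randn(seq_len, d_model))  # stand-ins for projected Q/K/V
K = split_heads(torch.randn(seq_len, d_model))
V = split_heads(torch.randn(seq_len, d_model))

scores = Q @ K.transpose(-2, -1) / head_dim ** 0.5  # per-head scores
weights = F.softmax(scores, dim=-1)                 # per-head attention weights
heads = weights @ V                                 # (n_heads, seq_len, head_dim)

# concatenate heads back to the full model dimension
out = heads.transpose(0, 1).reshape(seq_len, d_model)
print(out.shape)  # torch.Size([4, 8])
```

Each head computes its own attention pattern over the same tokens; the concatenation lets the next layer combine all of them.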
Why Self-Attention Scales So Well
Self-attention processes all tokens in parallel.
This enables:
- Fast training on GPUs
- Long-context modeling
- Massive model scaling
This is why LLMs are possible today.
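Because attention is just matrix multiplications, an entire batch of sequences can be processed in one call, which is what makes GPU training efficient. A sketch with arbitrary illustrative sizes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, d = 32, 128, 64
Q = torch.randn(batch, seq_len, d)
K = torch.randn(batch, seq_len, d)
V = torch.randn(batch, seq_len, d)

# one batched matmul computes scores for every token of every sequence at once
scores = Q @ K.transpose(-2, -1) / d ** 0.5  # (32, 128, 128)
weights = F.softmax(scores, dim=-1)
out = weights @ V                            # (32, 128, 64)
print(out.shape)
```

No token waits for another: all positions are computed simultaneously, unlike a recurrent network that must step through the sequence one token at a time.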
Common Mistakes Learners Make
- Thinking attention is static
- Ignoring the scaling factor
- Confusing values with keys
Attention is dynamic and input-dependent.
Practice
Which component represents what a token is looking for?
Which operation normalizes attention scores?
Why does self-attention scale well?
Quick Quiz
What does the query represent?
Why divide the scores by the square root of the dimension?
Why use multiple attention heads?
Recap: Self-attention allows tokens to dynamically exchange information based on relevance.
Next up: Positional Encoding — how Transformers understand order without recurrence.