GenAI Lesson 30 – Self-Attention | Dataplexa

Self-Attention: The Core Mechanism Inside Transformers

Transformers work because of one idea done extremely well: self-attention.

This lesson explains how self-attention actually operates, how engineers reason about it before writing code, and how it transforms raw tokens into meaningful representations.

The Question Self-Attention Answers

When processing a sequence, the model must decide:

Which parts of this sequence matter for understanding this token?

Self-attention allows every token to look at every other token and decide how much influence it should have.

Why Simple Context Is Not Enough

Consider the sentence:

“The bank approved the loan.”

The word “bank” could mean a financial institution or a riverside; the meaning depends on the surrounding words.

Self-attention lets the model connect “bank” with “loan” directly, even when the two words are far apart in longer sentences, and that connection resolves the ambiguity.

How Engineers Think Before Coding Attention

Before implementation, engineers break attention into steps:

  • Represent tokens numerically
  • Measure similarity between tokens
  • Weight information based on relevance
  • Combine results into new representations

These steps guide the math and the code.

Query, Key, and Value Intuition

Self-attention uses three projections:

  • Query: what this token is looking for
  • Key: what each token offers
  • Value: the information to extract

A token compares its query with all keys to decide which values matter.
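In practice, queries, keys, and values are not separate inputs — they are three learned linear projections of the same token embeddings. A minimal sketch of that idea (dimensions chosen purely for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8   # embedding size (illustrative)
seq_len = 4   # number of tokens

x = torch.randn(seq_len, d_model)   # pretend token embeddings

# learned projections give the same embeddings three different roles
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)   # each is (4, 8)
```

Because the projection weights are learned, the model itself decides what each token asks for (query), advertises (key), and carries (value).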

Attention at a Conceptual Level

For one token:

  • Compare its query with all keys
  • Compute similarity scores
  • Normalize scores into probabilities
  • Create a weighted sum of values

This happens for every token in parallel.

Minimal Attention Calculation (Step-by-Step)

Start with a small example to understand mechanics.


import torch
import torch.nn.functional as F

# pretend token embeddings
Q = torch.randn(4, 8)  # queries
K = torch.randn(4, 8)  # keys
V = torch.randn(4, 8)  # values

scores = Q @ K.T                      # similarity between every pair of tokens
weights = F.softmax(scores, dim=-1)   # each row becomes a probability distribution
output = weights @ V                  # weighted sum of values per token

This code shows the mathematical heart of self-attention.

What matters is not the exact numbers, but how relationships are formed.

What Actually Happens Inside the Code

Breaking it down:

  • Q @ K.T measures similarity between tokens
  • Softmax converts similarity into attention weights
  • Weights control how much each value contributes

Tokens dynamically exchange information based on relevance.

Why Scaling Is Required

As the key dimension grows, dot products grow in magnitude.

Large scores saturate the softmax, pushing the weights toward one-hot distributions and shrinking the gradients that flow back through it.

To fix this, attention uses scaling:


import math

# dividing by sqrt(d_k) keeps scores in a range where softmax behaves well
scores = (Q @ K.T) / math.sqrt(Q.size(-1))
weights = F.softmax(scores, dim=-1)

This simple division keeps training stable.
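One way to see the effect numerically: with a realistic key dimension, unscaled scores drive the softmax toward near-one-hot rows, while scaled scores keep the distribution spread out. A small sketch (the dimension 512 is just an illustrative choice):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_k = 512                        # a realistic key dimension
q = torch.randn(4, d_k)
k = torch.randn(4, d_k)

raw = q @ k.T                    # magnitudes grow with d_k
scaled = raw / math.sqrt(d_k)    # magnitudes stay roughly O(1)

# unscaled rows collapse toward one-hot; scaled rows stay spread out
print(F.softmax(raw, dim=-1).max(dim=-1).values)
print(F.softmax(scaled, dim=-1).max(dim=-1).values)
```

Compare the maximum probability in each row: the unscaled version is close to 1.0 almost everywhere, which is exactly the saturated regime where gradients vanish.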

Multi-Head Self-Attention

One attention head captures one type of relationship.

Multiple heads allow the model to:

  • Track syntax
  • Track semantics
  • Track long-range dependencies

Each head attends differently to the same input.
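A multi-head version can be sketched by splitting the model dimension into independent heads, running the same scaled attention in each, and concatenating the results (shapes are illustrative; real implementations also add output projections):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads          # each head works in a smaller subspace

Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

# reshape (seq_len, d_model) -> (n_heads, seq_len, d_head)
def split_heads(t):
    return t.view(seq_len, n_heads, d_head).transpose(0, 1)

Qh, Kh, Vh = map(split_heads, (Q, K, V))

# each head runs scaled dot-product attention independently
scores = Qh @ Kh.transpose(-2, -1) / d_head ** 0.5
weights = F.softmax(scores, dim=-1)
out = weights @ Vh                   # (n_heads, seq_len, d_head)

# concatenate heads back into one representation per token
out = out.transpose(0, 1).reshape(seq_len, d_model)
print(out.shape)                     # (4, 8)
```

Note that the heads share no parameters here, so each one is free to learn a different attention pattern over the same input.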

Why Self-Attention Scales So Well

Unlike recurrent networks, self-attention processes all tokens in parallel: the whole sequence is handled by a few matrix multiplications, with no step-by-step dependence.

This enables:

  • Fast training on GPUs
  • Long-context modeling
  • Massive model scaling

This is why LLMs are possible today.

Common Mistakes Learners Make

  • Thinking attention weights are static
  • Ignoring the scaling factor
  • Confusing values with keys

Attention is dynamic and input-dependent.
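A quick check makes the first mistake concrete: attention rows always sum to 1, but the weights themselves change whenever the input changes. A sketch (using raw embeddings as both queries and keys, purely for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def attention_weights(x):
    # queries and keys are the embeddings themselves here, just to illustrate
    scores = x @ x.T / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1)

a = attention_weights(torch.randn(4, 8))
b = attention_weights(torch.randn(4, 8))

print(torch.allclose(a.sum(dim=-1), torch.ones(4)))   # True: rows are probabilities
print(torch.equal(a, b))                              # False: weights depend on input
```

There is no fixed "attention pattern" stored in the model — only the projection weights are fixed, and the pattern is recomputed for every input.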

Practice

Which component represents what a token is looking for?



Which operation normalizes attention scores?



Why does self-attention scale well?



Quick Quiz

What does the Query represent?





Why divide by square root of dimension?





Why use multiple attention heads?





Recap: Self-attention allows tokens to dynamically exchange information based on relevance.

Next up: Positional Encoding — how Transformers understand order without recurrence.