Generative AI Course
Positional Encoding: How Transformers Understand Order
Transformers process all tokens in parallel.
That strength creates a serious problem: the model has no natural sense of order.
This lesson explains how positional encoding solves that problem, how engineers reason about it, and how it changes model behavior in practice.
The Core Problem: Order Is Lost
Consider these two sentences:
- Dog bites man
- Man bites dog
They contain the same words, but the meaning is completely different.
Self-attention alone treats tokens as a set, not a sequence.
Without extra information, Transformers cannot distinguish word order.
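A small sketch makes this concrete. The snippet below (illustrative random vectors, simplified attention with identity Q/K/V projections, not from the lesson) shows that self-attention alone produces the same output vectors for a permuted input, just in permuted order:

```python
import torch

torch.manual_seed(0)

# Three toy token vectors standing in for "dog", "bites", "man".
tokens = torch.randn(3, 4)

def simple_attention(x):
    # Simplified self-attention with identity projections:
    # weights come from raw dot products between token vectors.
    scores = x @ x.T / (x.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ x

out = simple_attention(tokens)
perm = torch.tensor([2, 1, 0])            # "man bites dog"
out_perm = simple_attention(tokens[perm])

# Same output vectors, merely reordered: attention alone cannot
# tell the two sentences apart.
print(torch.allclose(out[perm], out_perm))  # True
```

Reordering the input only reorders the outputs; nothing in the result reflects which word came first.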
Why RNNs Did Not Have This Issue
RNNs process tokens sequentially.
Order is implicitly encoded through time.
Transformers remove recurrence, so order must be added explicitly.
The Engineering Requirement
Any solution for order must:
- Scale to long sequences
- Work with parallel computation
- Not break attention mechanics
Positional encoding satisfies all three.
What Positional Encoding Actually Does
Each token embedding is modified using its position.
Instead of replacing embeddings, positional information is added to them.
This allows attention to consider both:
- What the token is
- Where the token is
High-Level Idea Before Code
Think of positional encoding as a signal:
- Unique per position
- Consistent across sequences
- Interpretable by attention layers
Once added, attention can reason about order.
Sinusoidal Positional Encoding (Original Transformer)
The original Transformer paper used sine and cosine functions.
Why?
- Continuous values
- Unbounded sequence length
- Relative positions can be inferred
How the Formula Is Designed
Each dimension encodes position at a different frequency.
Low-index dimensions oscillate rapidly with position; high-index dimensions oscillate slowly.
This gives the model access to both local and global order.
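As a back-of-the-envelope illustration (the specific pair indices here are arbitrary choices, not from the lesson): with dim = 512, sine/cosine pair i has wavelength 2π · 10000^(2i/512), so wavelengths span a geometric range from a few positions to tens of thousands:

```python
import math

dim = 512

# Wavelength (in positions) of each sine/cosine pair i:
#   wavelength_i = 2 * pi * 10000 ** (2 * i / dim)
pairs = [0, 64, 128, 255]
wavelengths = [2 * math.pi * 10000 ** (2 * i / dim) for i in pairs]

for i, w in zip(pairs, wavelengths):
    print(f"pair {i:3d}: wavelength ≈ {w:,.1f} positions")
```

The fastest pair repeats every ~6 positions (useful for local order), while the slowest spans tens of thousands (useful for global order).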
Minimal Sinusoidal Encoding Code
This example shows how positional encodings are created.
import torch
import math
def positional_encoding(seq_len, dim):
    # One row per position, one column per embedding dimension.
    pe = torch.zeros(seq_len, dim)
    position = torch.arange(0, seq_len).unsqueeze(1)
    # Geometric progression of frequencies, one per sin/cos pair.
    div_term = torch.exp(
        torch.arange(0, dim, 2) * (-math.log(10000.0) / dim)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return pe
What This Code Is Really Doing
Breaking it down conceptually:
- Each position gets a unique pattern
- Patterns repeat smoothly across dimensions
- Nearby positions are mathematically related
Attention layers learn to use these patterns naturally.
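One way to see the "nearby positions are related" point in code (a sketch, repeating the sinusoidal construction for completeness): dot products between encodings of nearby positions are larger than between distant ones.

```python
import math
import torch

def positional_encoding(seq_len, dim):
    # Same sinusoidal construction as in the lesson's example.
    pe = torch.zeros(seq_len, dim)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2) * (-math.log(10000.0) / dim)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(100, 64)

# Compare position 10 against a near neighbour and a distant position.
sims = pe @ pe[10]
print(f"sim(10, 11) = {sims[11]:.2f}")
print(f"sim(10, 50) = {sims[50]:.2f}")
```

The similarity decays smoothly with distance, which is exactly the kind of signal attention layers can pick up.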
How Positional Encoding Is Applied
Positional encodings are added to token embeddings:
token_embeddings = torch.randn(10, 512)               # (seq_len, dim)
pos_embeddings = positional_encoding(10, 512)          # same shape
input_embeddings = token_embeddings + pos_embeddings   # element-wise sum
Nothing else changes in the model.
This simplicity is intentional.
Learned vs Fixed Positional Encoding
There are two common approaches:
- Fixed (sinusoidal)
- Learned positional embeddings
Learned embeddings can adapt to the training data, but they are capped at a fixed maximum length; fixed sinusoidal encodings extrapolate to sequences longer than any seen in training.
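A minimal sketch of the learned variant (class and parameter names here are illustrative, not from any specific library), assuming a fixed maximum length:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    # Positions become trainable vectors, capped at max_len
    # (unlike sinusoidal encodings, which extend to any length).
    def __init__(self, max_len, dim):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)

    def forward(self, token_embeddings):
        seq_len = token_embeddings.shape[0]
        positions = torch.arange(seq_len)
        return token_embeddings + self.pos(positions)

emb = LearnedPositionalEmbedding(max_len=512, dim=64)
out = emb(torch.randn(10, 64))
print(out.shape)  # torch.Size([10, 64])
```

The position table is just another weight matrix, updated by backpropagation like any other parameter.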
Why Relative Position Matters
In many tasks, the distance between tokens matters more than absolute position.
Modern models often use relative or rotary encodings to capture this more effectively.
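A rough sketch of the rotary idea (RoPE), heavily simplified: instead of adding position vectors, each (even, odd) feature pair of a query or key is rotated by an angle proportional to its position, so dot products between rotated vectors depend on the relative offset between tokens.

```python
import torch

def rotary_embed(x):
    # x: (seq_len, dim), dim even. Rotate each (even, odd) pair
    # of features by an angle proportional to the row's position.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    freqs = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos * freqs                      # (seq_len, dim/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(8, 16)
k = torch.randn(8, 16)
# Attention scores between rotated queries and keys depend on
# the offset between positions, not their absolute values.
scores = rotary_embed(q) @ rotary_embed(k).T
```

Because rotations preserve vector length, this changes directions rather than magnitudes, and the score between positions m and n depends only on m - n.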
What Changes Without Positional Encoding
If positional encoding is removed:
- Word order is ignored
- Sequences become ambiguous
- Model performance collapses
This is not optional — it is foundational.
Common Learner Mistakes
- Thinking attention learns order automatically
- Ignoring position in short sequences
- Confusing positional encoding with token embeddings
Order must be injected deliberately.
Practice
What problem does positional encoding solve?
How is positional information applied to embeddings?
Which type of positional encoding uses sine and cosine?
Quick Quiz
Why do Transformers need positional encoding?
How are positional encodings combined with embeddings?
Main purpose of positional encoding?
Recap: Positional encoding injects order into Transformers so attention can reason about sequences.
Next up: Encoder–Decoder Architecture — how Transformers generate outputs step by step.