Positional Encoding
Transformers process all tokens in a sequence at the same time. While this parallelism makes them fast to train and run, it also creates a fundamental challenge.
Unlike RNNs, transformers do not inherently understand the order of tokens in a sequence.
Positional encoding is the mechanism that gives transformers a sense of position, order, and sequence structure.
Why Order Matters in Sequences
In language, meaning depends heavily on word order.
Consider these two sentences:
"Dog bites man"
"Man bites dog"
Both contain the same words, but their meanings are completely different.
Without positional information, a transformer would treat both sentences as identical bags of tokens.
The Core Problem
Self-attention looks at relationships between tokens, but it does not know where tokens appear in the sequence.
This means:
• No notion of first, middle, or last
• No understanding of relative distance
• No awareness of word order
Positional encoding solves this problem by injecting position information into token embeddings.
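To see why this injection is needed, here is a minimal NumPy sketch (a toy single-head attention layer, not taken from the original paper) showing that shuffling the input tokens simply shuffles the outputs: each token's representation is identical no matter where it appears.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8

X = rng.normal(size=(seq_len, d_model))      # token embeddings with no position info
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_model)                      # token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V

perm = rng.permutation(seq_len)              # shuffle the token order
out = self_attention(X)
out_shuffled = self_attention(X[perm])

# Each token gets exactly the same representation, regardless of where it sits.
print(np.allclose(out[perm], out_shuffled))  # True

Since the outputs are just a permutation of the originals, the two orderings are indistinguishable to the model without positional information.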
How Positional Encoding Works
Each token embedding is combined with a position-specific vector.
The resulting representation contains:
• Semantic meaning of the token
• Positional information within the sequence
This combined embedding is then passed into the transformer layers.
Fixed (Sinusoidal) Positional Encoding
The original transformer uses sinusoidal positional encodings.
Each position is encoded using sine and cosine functions of different frequencies.
Because the encoding is a fixed function of position, it can be computed for positions beyond any length seen during training, which in principle lets the model handle longer sequences.
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Here:
pos = position index
i = index over dimension pairs (each value of i produces one sine dimension and one cosine dimension)
d_model = embedding size
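As a concrete reference, here is a minimal NumPy sketch of these formulas. The function name sinusoidal_encoding is my own choice, not from any library, and it assumes d_model is even.

import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # positions: (seq_len, 1); dimension-pair indices i: (1, d_model // 2)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)   # pos / 10000^(2i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16)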
Why Sine and Cosine?
Sine and cosine functions vary smoothly with position, and each dimension pair oscillates at a different frequency, giving every position a distinctive pattern.
This allows the model to infer:
• Relative positions between tokens
• Distance relationships
• Sequential structure
A key advantage is that, for any fixed offset k, PE(pos + k) can be expressed as a linear function (a rotation) of PE(pos), which makes relative positions easy for the model to represent.
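A minimal check of this property, reusing the sinusoidal_encoding helper sketched above (the offset k and dimension-pair index i below are arbitrary illustrative choices): a rotation matrix that depends only on k maps the encoding at position pos onto the encoding at position pos + k.

import numpy as np

d_model, k, i = 16, 3, 2                  # fixed offset k, dimension-pair index i
w = 1.0 / 10000 ** (2 * i / d_model)      # frequency of dimension pair i

# Rotation matrix that depends only on the offset k, not on the position.
R = np.array([[np.cos(k * w),  np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])

pe = sinusoidal_encoding(seq_len=50, d_model=d_model)
pair = pe[:, [2 * i, 2 * i + 1]]          # (sin, cos) pair for dimension i

# Rotating the pair at position pos reproduces the pair at position pos + k.
print(np.allclose(pair[:-k] @ R.T, pair[k:]))   # True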
Learned Positional Embeddings
Instead of fixed mathematical functions, some models learn positional embeddings directly.
In this approach:
• Each position has a trainable vector
• Position information is learned from data
This method is flexible but may not generalize well to unseen sequence lengths.
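A minimal sketch of this approach, assuming PyTorch (the sizes max_len, d_model, and vocab_size are illustrative):

import torch
import torch.nn as nn

max_len, d_model, vocab_size = 512, 64, 10000

token_emb = nn.Embedding(vocab_size, d_model)   # one trainable vector per token id
pos_emb = nn.Embedding(max_len, d_model)        # one trainable vector per position

tokens = torch.randint(0, vocab_size, (1, 20))            # (batch, seq_len)
positions = torch.arange(tokens.size(1)).unsqueeze(0)     # [[0, 1, ..., 19]]

x = token_emb(tokens) + pos_emb(positions)      # both tables are learned end to end
print(x.shape)   # torch.Size([1, 20, 64])

Positions at or beyond max_len have no vector at all, which is exactly why learned embeddings do not extrapolate to longer sequences.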
Absolute vs Relative Positional Encoding
Absolute positional encoding assigns a unique vector to each absolute position in the sequence.
Relative positional encoding focuses on the distance between tokens instead of exact positions.
Relative encodings often improve performance on long sequences and on tasks where the distance between tokens matters more than their absolute positions.
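There are several relative schemes. The sketch below (NumPy, deliberately simplified, with a random stand-in for what would normally be a learned table) illustrates one common idea, similar in spirit to the bias used in T5: a bias indexed by the clipped token-to-token distance is added to the attention scores before the softmax.

import numpy as np

rng = np.random.default_rng(0)
seq_len, max_distance = 6, 8

# One bias value per clipped signed distance; in a real model this table is learned.
bias_table = rng.normal(size=(2 * max_distance + 1,))

pos = np.arange(seq_len)
rel = pos[None, :] - pos[:, None]                   # rel[i, j] = j - i
rel = np.clip(rel, -max_distance, max_distance)     # clip long-range distances

scores = rng.normal(size=(seq_len, seq_len))        # stand-in for Q @ K.T / sqrt(d_model)
scores = scores + bias_table[rel + max_distance]    # bias depends only on relative distance

# Row-wise softmax, exactly as in ordinary attention.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.shape)   # (6, 6)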
Positional Encoding in Practice
In most transformer implementations, positional encodings are added directly to token embeddings before attention.
token_embedding = embedding(tokens)
position_embedding = positional_encoding(sequence_length)
input_embedding = token_embedding + position_embedding
This simple addition allows the model to learn order-sensitive representations.
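A runnable version of that pseudocode, reusing the sinusoidal_encoding sketch from earlier (the embedding table here is random rather than trained, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 16

embedding_table = rng.normal(size=(vocab_size, d_model))   # stand-in for a trained table

tokens = np.array([12, 5, 873, 42])                 # token ids for one sequence
token_embedding = embedding_table[tokens]           # (seq_len, d_model)
position_embedding = sinusoidal_encoding(len(tokens), d_model)

input_embedding = token_embedding + position_embedding      # fed into the first layer
print(input_embedding.shape)   # (4, 16)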
Impact on Modern Models
Different transformer families use different approaches:
• BERT uses learned positional embeddings
• GPT models use learned positional embeddings
• Many newer models use rotary encodings (for example, LLaMA) or relative encodings (for example, T5)
The choice of positional encoding directly affects model performance and scalability.
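As one example of a rotary scheme, here is a minimal sketch of rotary position embeddings (RoPE) in NumPy, using the common "rotate half" formulation. Note that RoPE is applied to the queries and keys inside attention rather than added to the token embeddings, and the function name apply_rope is my own.

import numpy as np

def apply_rope(x, base=10000):
    # x: (seq_len, d_model) queries or keys; d_model must be even.
    seq_len, d_model = x.shape
    half = d_model // 2
    freqs = base ** (-np.arange(half) / half)        # same frequencies as sinusoidal PE
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # pair dimension j with dimension j + half
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(10, 16))
print(apply_rope(q).shape)   # (10, 16)

Because the rotation angle grows linearly with position, the dot product between a rotated query and a rotated key depends only on their relative offset.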
Exercises
Exercise 1:
Why can’t transformers rely on self-attention alone to understand order?
Exercise 2:
What advantage do sinusoidal encodings have over learned encodings?
Quick Check
Q: In the original transformer, are positional encodings learned or fixed?
Next, we will explore BERT and understand how transformers are pretrained for language understanding tasks.