Positional Encoding
In the previous lesson, you learned about Self-Attention. You saw how each word can attend to every other word in a sentence.
But this raises an important question:
If Transformers read all words at the same time, how do they know the order of words?
The answer is Positional Encoding.
The Core Problem: No Sense of Order
Unlike RNNs or LSTMs, Transformers do not read text step-by-step. They process all words in parallel.
This is fast, but it creates a problem:
The model does not automatically know which word comes first, second, or last.
For example:
- “Dog bites man”
- “Man bites dog”
The words are the same, but the meaning is very different.
Why Word Order Matters in Language
In human language, position changes meaning.
- Subject vs object depends on order
- Questions depend on word placement
- Context often relies on nearby words
So Transformers need an explicit way to understand order.
What Is Positional Encoding?
Positional Encoding is a technique that adds position information to word embeddings.
Each word gets:
- A word embedding (meaning)
- A positional encoding (position)
These two are combined so the model knows both what the word is and where it is.
Simple Intuition (Human Analogy)
Imagine a classroom roll call:
- Name = who the student is
- Roll number = where the student sits
If you remove roll numbers, names alone don’t tell order.
Positional encoding works like roll numbers for words.
How Positional Encoding Is Used
For each word in the sentence:
- The word is converted into an embedding
- A positional vector is generated
- Both vectors are added together
This combined vector is then passed into self-attention.
So attention now understands:
- Which words matter
- Where they appear in the sentence
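The steps above can be sketched in a few lines of NumPy. The embedding table and positional vectors here are random placeholders (not trained values), just to show the shapes and the elementwise addition:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8                      # embedding dimension (toy size)
vocab = {"man": 0, "bites": 1, "dog": 2}
embed_table = rng.normal(size=(len(vocab), d_model))  # stand-in word embeddings

sentence = ["dog", "bites", "man"]

# Step 1: convert each word into an embedding (table lookup)
word_vecs = np.stack([embed_table[vocab[w]] for w in sentence])

# Step 2: generate one positional vector per position (random stand-ins here)
pos_vecs = rng.normal(size=(len(sentence), d_model))

# Step 3: add the two vectors elementwise; this sum is what self-attention receives
inputs = word_vecs + pos_vecs

print(inputs.shape)  # (3, 8): one combined vector per word
```

Note that the two vectors are added, not concatenated, so the combined vector has the same dimension as the original embedding.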
Why Not Just Use Index Numbers?
You might think: “Why not just give position numbers like 1, 2, 3…?”
That approach fails because:
- Raw indices grow without bound, so their magnitude can swamp the word embedding
- Models struggle to generalize to sentences longer than any seen during training
So Transformers use a smarter approach.
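A quick way to see the magnitude problem: raw position numbers grow without bound as sequences get longer, while a sine-based signal stays in a fixed range. A minimal illustration (the 0.01 frequency is an arbitrary choice for the demo):

```python
import numpy as np

short = np.arange(10)     # positions in a 10-word sentence
long = np.arange(5000)    # positions in a much longer document

# Raw indices explode with sequence length...
print(short.max(), long.max())  # 9 vs 4999

# ...but a sinusoidal function of position stays bounded in [-1, 1]
bounded = np.sin(long * 0.01)
print(bounded.min() >= -1.0, bounded.max() <= 1.0)  # True True
```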
Sinusoidal Positional Encoding (Conceptual)
The original Transformer uses sine and cosine functions to generate positional encodings.
Why sine and cosine?
- They create smooth, bounded patterns
- Encodings at a fixed offset are related by a simple linear transformation, which makes relative positions easy to compare
- They generalize to sequence lengths never seen in training
You do not need to memorize the formulas, but you should understand the idea.
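For the curious, the original formula is short enough to write out: for position pos and dimension pair i, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). A direct NumPy translation (assuming an even d_model):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # even dimension indices
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)  # one frequency per pair
    angles = positions * angle_rates                       # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even indices get sine
    pe[:, 1::2] = np.cos(angles)  # odd indices get cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
print(pe.shape)     # (50, 16)
print(pe[0, :4])    # position 0: sin(0)=0 and cos(0)=1 alternate -> [0. 1. 0. 1.]
```

Each dimension pair oscillates at a different frequency, so together they act like the digits of a smooth, continuous counter.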
Key Properties of Positional Encoding
Positional encodings ensure:
- Each position has a unique representation
- Nearby positions have related encodings
- Word order is preserved
This allows attention to reason about sequence structure.
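These properties can be checked numerically. Using the sinusoidal scheme (re-implemented here in NumPy for a self-contained demo), every position gets a distinct vector, and neighbouring positions are measurably more similar than distant ones:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]
    angle_rates = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(100, 64)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Property 1: each position has a unique representation
assert len({tuple(np.round(row, 6)) for row in pe}) == 100

# Property 2: nearby positions have more similar encodings than distant ones
print(cosine(pe[0], pe[1]))   # high similarity: adjacent positions
print(cosine(pe[0], pe[50]))  # lower similarity: far-apart positions
```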
Learned vs Fixed Positional Encoding
There are two main approaches:
- Fixed: sinusoidal (original Transformer)
- Learned: position embeddings learned during training
Modern models like BERT often use learned positional embeddings, while the original Transformer used fixed ones.
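A learned positional embedding is conceptually just a second lookup table, trained the same way as word embeddings. A minimal NumPy sketch (the table is randomly initialized here; in a real model its rows would be updated by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(42)

max_len, d_model = 512, 8
# Trainable position table: one learned row per position, up to max_len
pos_embed = rng.normal(scale=0.02, size=(max_len, d_model))

def encode_positions(seq_len):
    # Look up one row per position, exactly like a word-embedding lookup
    return pos_embed[np.arange(seq_len)]

out = encode_positions(5)
print(out.shape)  # (5, 8)
```

One trade-off this makes visible: a learned table has a hard maximum length (512 here), whereas the sinusoidal formula can be evaluated at any position.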
What Happens Without Positional Encoding?
Without positional encoding:
- Sentences become unordered bags of words
- Grammar is lost
- Meaning collapses
Self-attention alone is not enough.
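This claim can be verified directly. The sketch below builds a bare-bones single-head self-attention layer with random weights (no positional encoding) and shows that shuffling the input words just shuffles the output rows the same way; the layer genuinely treats the sentence as an unordered bag of words:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
X = rng.normal(size=(5, d))  # five "word embeddings" with no position info
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

perm = rng.permutation(5)  # scramble the "sentence"
out = self_attention(X)
out_shuffled = self_attention(X[perm])

# Permuting the input permutes the output identically: order carries no signal
print(np.allclose(out[perm], out_shuffled))  # True
```

Adding positional encodings to X before the layer breaks this symmetry, because each row now carries information about where it sits.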
Positional Encoding in One Line
Positional encoding gives Transformers a sense of word order.
Practice Questions
Q1. Why do Transformers need positional encoding?
Q2. What two pieces of information are combined for each word?
Quick Quiz
Q1. Which model component gives word order information?
Q2. Are positional encodings always learned?
Homework / Assignment
Conceptual:
- Explain positional encoding in your own words
- Give a real-life analogy for word order importance
Preparation:
- Revise attention mechanism
- Prepare to study full Transformer architecture
Quick Recap
- Transformers do not understand order by default
- Positional encoding adds position information
- It is combined with word embeddings
- It preserves sentence meaning
Next lesson: Transformer Architecture