NLP Lesson 48 – Positional Encoding | Dataplexa

Positional Encoding

In the previous lesson, you learned about Self-Attention. You saw how each word can attend to every other word in a sentence.

But there is a very important question:

If Transformers read all words at the same time, how do they know the order of words?

The answer is Positional Encoding.


The Core Problem: No Sense of Order

Unlike RNNs or LSTMs, Transformers do not read text step-by-step. They process all words in parallel.

This is fast, but it creates a problem:

The model does not automatically know which word comes first, second, or last.

For example:

  • “Dog bites man”
  • “Man bites dog”

The words are the same, but the meaning is very different.


Why Word Order Matters in Language

In human language, position changes meaning.

  • Subject vs object depends on order
  • Questions depend on word placement
  • Context often relies on nearby words

So Transformers need an explicit way to understand order.


What Is Positional Encoding?

Positional Encoding is a technique that adds position information to word embeddings.

Each word gets:

  • A word embedding (meaning)
  • A positional encoding (position)

These two are combined so the model knows both what the word is and where it is.


Simple Intuition (Human Analogy)

Imagine a classroom roll call:

  • Name = who the student is
  • Roll number = where the student sits

If you remove the roll numbers, the names alone don’t tell you the order.

Positional encoding works like roll numbers for words.


How Positional Encoding Is Used

For each word in the sentence:

  • The word is converted into an embedding
  • A positional vector is generated
  • Both vectors are added together

This combined vector is then passed into self-attention.

So attention now understands:

  • Which words matter
  • Where they appear in the sentence
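The three steps above can be sketched in a few lines of NumPy. The token list and embedding values are illustrative stand-ins (real models use learned embedding tables), and the `positional_encoding` helper uses the sinusoidal scheme discussed later in this lesson:

```python
import numpy as np

def embed(tokens, d_model, rng):
    # Stand-in for a learned embedding lookup: one random vector per token
    return rng.normal(size=(len(tokens), d_model))

def positional_encoding(seq_len, d_model):
    # Sinusoidal scheme from the original Transformer (any scheme works here)
    pos = np.arange(seq_len)[:, None]       # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]    # index of each sin/cos pair
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions: cosine
    return pe

tokens = ["Dog", "bites", "man"]
d_model = 8
x = embed(tokens, d_model, np.random.default_rng(0))  # what each word is
x = x + positional_encoding(len(tokens), d_model)     # plus where it is
print(x.shape)  # (3, 8) -- this combined matrix is what self-attention sees
```

Note that the two vectors are combined by simple element-wise addition, so the result keeps the same shape as the original embeddings.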

Why Not Just Use Index Numbers?

You might think: “Why not just give position numbers like 1, 2, 3…?”

That approach fails because:

  • Raw position values grow without bound, so at large positions they swamp the embedding values
  • Models struggle to generalize to sentences longer than those seen during training

So Transformers use a smarter approach.
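A quick numeric sketch of why raw indices misbehave. The embedding values below are made up; the point is the scale mismatch:

```python
import numpy as np

d_model = 4
rng = np.random.default_rng(1)
word_vec = rng.normal(size=d_model)  # typical embedding values sit roughly in [-2, 2]

# Naive idea: add the raw position index to the embedding
for position in [0, 5, 500]:
    combined = word_vec + position
    print(position, np.round(combined, 2))

# At position 500 the position term dwarfs the word's meaning entirely,
# and a model trained on short sentences never saw values this large.
```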


Sinusoidal Positional Encoding (Conceptual)

The original Transformer uses sine and cosine functions to generate positional encodings.

Why sine and cosine?

  • They create smooth patterns
  • They allow relative position comparison
  • They generalize to unseen sequence lengths

You do not need to memorize the formulas, but you should understand the idea behind them.
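For readers who do want to see them, the formulas from the original paper are short enough to implement directly:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]     # position of each word
    i = np.arange(d_model // 2)[None, :]  # index of each sin/cos dimension pair
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)
print(pe.shape)            # (50, 16)
print(pe.min(), pe.max())  # all values stay within [-1, 1], regardless of length
```

Because every value is bounded and each dimension oscillates at a different frequency, the same function can be evaluated at any position — which is what lets the scheme generalize to unseen sequence lengths.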


Key Properties of Positional Encoding

Positional encodings ensure:

  • Each position has a unique representation
  • Nearby positions have related encodings
  • Word order is preserved

This allows attention to reason about sequence structure.
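The first two properties can be checked empirically. The sketch below re-implements the sinusoidal encoding and tests uniqueness and neighbourhood similarity (using the dot product as a similarity measure is a choice made here for illustration, not something the scheme prescribes):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

pe = sinusoidal_pe(seq_len=100, d_model=32)

# Property 1: every position gets a unique vector
unique_rows = np.unique(pe.round(decimals=6), axis=0)
print(len(unique_rows))      # 100 -- no two positions collide

# Property 2: nearby positions have more similar encodings than distant ones
near = pe[10] @ pe[11]       # similarity between neighbours
far = pe[10] @ pe[90]        # similarity between distant positions
print(near > far)            # True
```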


Learned vs Fixed Positional Encoding

There are two main approaches:

  • Fixed: sinusoidal (original Transformer)
  • Learned: position embeddings learned during training

Modern models like BERT often use learned positional embeddings, while the original Transformer used fixed ones.
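Conceptually, the learned variant is just a trainable lookup table with one row per position, initialized randomly and updated by backpropagation like any other weight. The NumPy sketch below only illustrates the lookup; no training happens here, and the sizes are illustrative:

```python
import numpy as np

max_len, d_model = 512, 16   # learned tables fix a maximum length (BERT uses 512)
rng = np.random.default_rng(0)

# Fixed scheme: computed once from sin/cos formulas, never updated.
# Learned scheme: starts random and is trained alongside the word embeddings.
position_table = rng.normal(scale=0.02, size=(max_len, d_model))

def position_vectors(seq_len):
    # Row t is the (trainable) vector for position t -- a plain table lookup
    return position_table[:seq_len]

vecs = position_vectors(3)   # position vectors for a 3-token input
print(vecs.shape)            # (3, 16)
```

One practical consequence: a learned table cannot represent positions beyond `max_len`, whereas the sinusoidal formula can be evaluated at any position.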


What Happens Without Positional Encoding?

Without positional encoding:

  • Sentences become unordered bags of words
  • Grammar is lost
  • Meaning collapses

Self-attention alone is not enough.
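This collapse can be demonstrated directly. The toy self-attention below (single head, no learned projections — a simplification of the real mechanism) produces exactly the same pooled output for “dog bites man” and “man bites dog” until positional encodings are added. The word vectors are random stand-ins:

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=d_model) for w in ["dog", "bites", "man"]}

def sinusoidal_pe(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

def attention_pool(words, use_pe):
    x = np.stack([vocab[w] for w in words])
    if use_pe:
        x = x + sinusoidal_pe(len(words), d_model)
    scores = x @ x.T / np.sqrt(d_model)                       # dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return (weights @ x).mean(axis=0)                         # pooled summary

a, b = ["dog", "bites", "man"], ["man", "bites", "dog"]
print(np.allclose(attention_pool(a, False), attention_pool(b, False)))  # True
print(np.allclose(attention_pool(a, True), attention_pool(b, True)))    # False
```

Without positional encoding, reordering the words merely reorders the rows of the attention output, so the pooled result is identical — the “unordered bag of words” in action. With positional encoding, the two sentences get genuinely different representations.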


Positional Encoding in One Line

Positional encoding gives Transformers a sense of word order.


Practice Questions

Q1. Why do Transformers need positional encoding?

Because Transformers process words in parallel and do not inherently know word order.

Q2. What two pieces of information are combined for each word?

Word embedding and positional encoding.

Quick Quiz

Q1. Which model component gives word order information?

Positional Encoding.

Q2. Are positional encodings always learned?

No. They can be fixed (sinusoidal) or learned.

Homework / Assignment

Conceptual:

  • Explain positional encoding in your own words
  • Give a real-life analogy for word order importance

Preparation:

  • Revise attention mechanism
  • Prepare to study full Transformer architecture

Quick Recap

  • Transformers do not understand order by default
  • Positional encoding adds position information
  • It is combined with word embeddings
  • It preserves sentence meaning

Next lesson: Transformer Architecture