DL Lesson 55 – Positional Encoding | Dataplexa

Positional Encoding

Transformers process all tokens in a sequence at the same time. This parallelism makes them fast to train, but it also creates a fundamental challenge.

Unlike RNNs, transformers do not inherently understand the order of tokens in a sequence.

Positional encoding is the mechanism that gives transformers a sense of position, order, and sequence structure.


Why Order Matters in Sequences

In language, meaning depends heavily on word order.

Consider these two sentences:

"Dog bites man"
"Man bites dog"

Both contain the same words, but their meanings are completely different.

Without positional information, a transformer would treat both sentences as identical bags of tokens.


The Core Problem

Self-attention looks at relationships between tokens, but it does not know where tokens appear in the sequence.

This means:

• No notion of first, middle, or last
• No understanding of relative distance
• No awareness of word order
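
To see this concretely, here is a small numerical check: plain self-attention is permutation-equivariant, meaning that shuffling the input tokens simply shuffles the outputs in the same way. The NumPy sketch below uses toy random weights (shapes and names are illustrative, not a real model):

import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # toy embedding size
X = rng.normal(size=(3, d))              # three "tokens", no positional information
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

perm = [2, 0, 1]                         # reorder the "sentence"
out, out_perm = self_attention(X), self_attention(X[perm])
print(np.allclose(out[perm], out_perm))  # True: attention cannot tell the two orders apart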

Positional encoding solves this problem by injecting position information into token embeddings.


How Positional Encoding Works

Each token embedding is combined with a position-specific vector.

The resulting representation contains:

• Semantic meaning of the token
• Positional information within the sequence

This combined embedding is then passed into the transformer layers.


Fixed (Sinusoidal) Positional Encoding

The original transformer uses sinusoidal positional encodings.

Each position is encoded using sine and cosine functions of different frequencies.

Because sine and cosine are defined for any position, these encodings can, in principle, extrapolate to sequence lengths longer than those seen during training.

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Here:

pos = position index
i = embedding dimension index
d_model = embedding size
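
For reference, here is a minimal NumPy sketch of these formulas (it assumes d_model is even; the function name is only for illustration):

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # positions 0..seq_len-1 as a column, dimension-pair indices 0..d_model/2-1 as a row
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)         # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)         # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)                          # (50, 16)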


Why Sine and Cosine?

Sine and cosine functions create smooth, continuous patterns.

This allows the model to infer:

• Relative positions between tokens
• Distance relationships
• Sequential structure

A key advantage is that the encoding at position pos + k can be written as a fixed linear transformation of the encoding at position pos, which makes relative offsets easy for the model to represent.
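
This follows from the angle-addition identities. Writing w for the frequency 1 / 10000^(2i / d_model) used by a given pair of dimensions:

sin(w * (pos + k)) = sin(w * pos) * cos(w * k) + cos(w * pos) * sin(w * k)
cos(w * (pos + k)) = cos(w * pos) * cos(w * k) - sin(w * pos) * sin(w * k)

Because cos(w * k) and sin(w * k) depend only on the offset k, shifting by k multiplies each (sine, cosine) pair by a fixed 2x2 rotation matrix.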


Learned Positional Embeddings

Instead of fixed mathematical functions, some models learn positional embeddings directly.

In this approach:

• Each position has a trainable vector
• Position information is learned from data

This method is flexible but may not generalize well to unseen sequence lengths.
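
A minimal PyTorch-style sketch of this idea (max_len, the class name, and the shapes are illustrative assumptions, not any particular library's API):

import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        # one trainable vector per position index
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_emb):
        # token_emb: (batch, seq_len, d_model)
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)   # broadcasts over the batch

Positions beyond max_len have no embedding at all, which is exactly the generalization limit mentioned above.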


Absolute vs Relative Positional Encoding

Absolute positional encoding assigns a unique vector to each absolute position in the sequence.

Relative positional encoding focuses on the distance between tokens instead of exact positions.

Relative encodings often improve performance on long sequences and contextual tasks.
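
One common pattern, sketched below, is to add a learned bias for each (clipped) token-to-token offset to the attention scores before the softmax. This is a simplified illustration loosely in the spirit of relative-bias schemes such as T5's, not an exact reproduction of any specific model:

import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_distance, num_heads):
        super().__init__()
        # one learned bias per clipped offset in [-max_distance, max_distance], per head
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)
        self.max_distance = max_distance

    def forward(self, seq_len):
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                              # offset j - i
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).permute(2, 0, 1)                         # (heads, seq_len, seq_len)

The returned tensor is added to the attention logits for each head; because it depends only on offsets, the same bias table covers any sequence length.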


Positional Encoding in Practice

In most transformer implementations, positional encodings are added directly to token embeddings before attention.

# look up token embeddings: shape (sequence_length, d_model)
token_embedding = embedding(tokens)
# one position vector per index 0..sequence_length-1: shape (sequence_length, d_model)
position_embedding = positional_encoding(sequence_length)
# element-wise sum is the input to the first transformer layer
input_embedding = token_embedding + position_embedding

This simple addition allows the model to learn order-sensitive representations.
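
Putting the pieces together, a toy end-to-end version (reusing the sinusoidal_positional_encoding sketch from earlier, with a made-up vocabulary size and a random stand-in for the learned token-embedding table) could look like this:

import numpy as np

vocab_size, d_model = 1000, 16
embedding_table = np.random.default_rng(0).normal(size=(vocab_size, d_model))  # stand-in for learned embeddings

tokens = np.array([12, 7, 904, 3])                         # toy token ids
token_embedding = embedding_table[tokens]                  # (4, d_model)
position_embedding = sinusoidal_positional_encoding(len(tokens), d_model)
input_embedding = token_embedding + position_embedding     # what the first transformer layer sees
print(input_embedding.shape)                               # (4, 16)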


Impact on Modern Models

Different transformer families use different approaches:

• BERT uses learned positional embeddings
• GPT uses learned positional embeddings
• Some modern models use rotary (RoPE) or relative encodings

The choice of positional encoding directly affects model performance and scalability.


Exercises

Exercise 1:
Why can’t transformers rely on self-attention alone to understand order?

Because self-attention has no inherent notion of token position.

Exercise 2:
What advantage do sinusoidal encodings have over learned encodings?

They generalize to sequence lengths not seen during training.

Quick Check

Q: Are positional encodings learned or fixed by default?

They can be either fixed (sinusoidal) or learned.

Next, we will explore BERT and understand how transformers are pretrained for language understanding tasks.