Generative AI Course
Positional Encoding: How Transformers Understand Order
Transformers process all tokens in parallel.
That strength creates a serious problem: the model has no natural sense of order.
This lesson explains how positional encoding solves that problem, how engineers reason about it, and how it changes model behavior in practice.
The Core Problem: Order Is Lost
Consider these two sentences:
- Dog bites man
- Man bites dog
They contain the same words, but the meaning is completely different.
Self-attention alone treats tokens as a set, not a sequence.
Without extra information, Transformers cannot distinguish word order.
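A small sketch makes this concrete. The snippet below (illustrative random vectors, simplified attention with identity Q/K/V projections, not from the lesson) shows that self-attention alone produces the same output vectors for a permuted input, just in permuted order:

```python
import torch

torch.manual_seed(0)

# Three toy token vectors standing in for "dog", "bites", "man".
tokens = torch.randn(3, 4)

def simple_attention(x):
    # Simplified self-attention with identity projections:
    # weights come from raw dot products between token vectors.
    scores = x @ x.T / (x.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ x

out = simple_attention(tokens)
perm = torch.tensor([2, 1, 0])            # "man bites dog"
out_perm = simple_attention(tokens[perm])

# Same output vectors, merely reordered: attention alone cannot
# tell the two sentences apart.
print(torch.allclose(out[perm], out_perm))  # True
```

Reordering the input only reorders the outputs; nothing in the result reflects which word came first.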
Why RNNs Did Not Have This Issue
RNNs process tokens sequentially.
Order is implicitly encoded through time.
Transformers remove recurrence, so order must be added explicitly.
The Engineering Requirement
Any solution for order must:
- Scale to long sequences
- Work with parallel computation
- Not break attention mechanics
Positional encoding satisfies all three.
What Positional Encoding Actually Does
Each token embedding is modified using its position.
Instead of replacing embeddings, positional information is added to them.
This allows attention to consider both:
- What the token is
- Where the token is
High-Level Idea Before Code
Think of positional encoding as a signal:
- Unique per position
- Consistent across sequences
- Interpretable by attention layers
Once added, attention can reason about order.
Sinusoidal Positional Encoding (Original Transformer)
The original Transformer paper used sine and cosine functions.
Why?
- Continuous values
- Unbounded sequence length
- Relative positions can be inferred
How the Formula Is Designed
Each dimension encodes position at a different frequency.
Low-index dimensions oscillate rapidly with position; high-index dimensions oscillate slowly.
This gives the model access to both local and global order.
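As a back-of-the-envelope illustration (the specific pair indices here are arbitrary choices, not from the lesson): with dim = 512, sine/cosine pair i has wavelength 2π · 10000^(2i/512), so wavelengths span a geometric range from a few positions to tens of thousands:

```python
import math

dim = 512

# Wavelength (in positions) of each sine/cosine pair i:
#   wavelength_i = 2 * pi * 10000 ** (2 * i / dim)
pairs = [0, 64, 128, 255]
wavelengths = [2 * math.pi * 10000 ** (2 * i / dim) for i in pairs]

for i, w in zip(pairs, wavelengths):
    print(f"pair {i:3d}: wavelength ≈ {w:,.1f} positions")
```

The fastest pair repeats every ~6 positions (useful for local order), while the slowest spans tens of thousands (useful for global order).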
Minimal Sinusoidal Encoding Code
This example shows how positional encodings are created.
import torch
import math
def positional_encoding(seq_len, dim):
    # One row per position, one column per embedding dimension.
    pe = torch.zeros(seq_len, dim)
    position = torch.arange(0, seq_len).unsqueeze(1)
    # Geometric progression of frequencies, one per sin/cos pair.
    div_term = torch.exp(
        torch.arange(0, dim, 2) * (-math.log(10000.0) / dim)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return pe
What This Code Is Really Doing
Breaking it down conceptually:
- Each position gets a unique pattern
- Patterns repeat smoothly across dimensions
- Nearby positions are mathematically related
Attention layers learn to use these patterns naturally.
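One way to see the "nearby positions are related" point in code (a sketch, repeating the sinusoidal construction for completeness): dot products between encodings of nearby positions are larger than between distant ones.

```python
import math
import torch

def positional_encoding(seq_len, dim):
    # Same sinusoidal construction as in the lesson's example.
    pe = torch.zeros(seq_len, dim)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2) * (-math.log(10000.0) / dim)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(100, 64)

# Compare position 10 against a near neighbour and a distant position.
sims = pe @ pe[10]
print(f"sim(10, 11) = {sims[11]:.2f}")
print(f"sim(10, 50) = {sims[50]:.2f}")
```

The similarity decays smoothly with distance, which is exactly the kind of signal attention layers can pick up.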
How Positional Encoding Is Applied
Positional encodings are added to token embeddings:
token_embeddings = torch.randn(10, 512)               # (seq_len, dim)
pos_embeddings = positional_encoding(10, 512)          # same shape
input_embeddings = token_embeddings + pos_embeddings   # element-wise sum
Nothing else changes in the model.
This simplicity is intentional.
Learned vs Fixed Positional Encoding
There are two common approaches:
- Fixed (sinusoidal)
- Learned positional embeddings
Learned embeddings can adapt to the training data, but they are capped at a fixed maximum length; fixed sinusoidal encodings extrapolate to sequences longer than any seen in training.
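A minimal sketch of the learned variant (class and parameter names here are illustrative, not from any specific library), assuming a fixed maximum length:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    # Positions become trainable vectors, capped at max_len
    # (unlike sinusoidal encodings, which extend to any length).
    def __init__(self, max_len, dim):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)

    def forward(self, token_embeddings):
        seq_len = token_embeddings.shape[0]
        positions = torch.arange(seq_len)
        return token_embeddings + self.pos(positions)

emb = LearnedPositionalEmbedding(max_len=512, dim=64)
out = emb(torch.randn(10, 64))
print(out.shape)  # torch.Size([10, 64])
```

The position table is just another weight matrix, updated by backpropagation like any other parameter.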
Why Relative Position Matters
In many tasks, the distance between tokens matters more than absolute position.
Modern models often use relative or rotary encodings to capture this more effectively.
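A rough sketch of the rotary idea (RoPE), heavily simplified: instead of adding position vectors, each (even, odd) feature pair of a query or key is rotated by an angle proportional to its position, so dot products between rotated vectors depend on the relative offset between tokens.

```python
import torch

def rotary_embed(x):
    # x: (seq_len, dim), dim even. Rotate each (even, odd) pair
    # of features by an angle proportional to the row's position.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    freqs = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos * freqs                      # (seq_len, dim/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(8, 16)
k = torch.randn(8, 16)
# Attention scores between rotated queries and keys depend on
# the offset between positions, not their absolute values.
scores = rotary_embed(q) @ rotary_embed(k).T
```

Because rotations preserve vector length, this changes directions rather than magnitudes, and the score between positions m and n depends only on m - n.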
What Changes Without Positional Encoding
If positional encoding is removed:
- Word order is ignored
- Sequences become ambiguous
- Model performance collapses
This is not optional — it is foundational.
Common Learner Mistakes
- Thinking attention learns order automatically
- Ignoring position in short sequences
- Confusing positional encoding with token embeddings
Order must be injected deliberately.
Practice
What problem does positional encoding solve?
How is positional information applied to embeddings?
Which type of positional encoding uses sine and cosine?
Quick Quiz
Why do Transformers need positional encoding?
How are positional encodings combined with embeddings?
Main purpose of positional encoding?
Recap: Positional encoding injects order into Transformers so attention can reason about sequences.
Next up: Encoder–Decoder Architecture — how Transformers generate outputs step by step.