GenAI Lesson 31 – Positional Encoding | Dataplexa

Positional Encoding: How Transformers Understand Order

Transformers process all tokens in parallel.

That strength creates a serious problem: the model has no natural sense of order.

This lesson explains how positional encoding solves that problem, how engineers reason about it, and how it changes model behavior in practice.

The Core Problem: Order Is Lost

Consider these two sentences:

  • Dog bites man
  • Man bites dog

They contain the same words, but the meaning is completely different.

Self-attention alone treats tokens as a set, not a sequence.

Without extra information, Transformers cannot distinguish word order.
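A quick sketch makes this concrete. The toy attention below (a bare scaled dot-product with no learned weights, purely illustrative) shows that permuting the input tokens simply permutes the output rows — the computation itself carries no notion of position.

```python
import torch

torch.manual_seed(0)

def self_attention(x):
    # Plain scaled dot-product self-attention, no positional information.
    scores = x @ x.transpose(-2, -1) / (x.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(3, 4)           # 3 tokens, 4-dim embeddings
perm = torch.tensor([2, 0, 1])  # reorder the tokens

out = self_attention(x)
out_perm = self_attention(x[perm])

# Permuting the input only permutes the output rows:
print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True
```

Since reordering the input never changes what each token "sees", the model cannot tell the two sentences above apart.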

Why RNNs Did Not Have This Issue

RNNs process tokens sequentially.

Order is implicitly encoded through time.

Transformers remove recurrence, so order must be added explicitly.
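The contrast can be sketched with a toy recurrence (illustrative only, not a real RNN cell): because each step folds the next token into a running hidden state, the same tokens in a different order give a different result.

```python
import torch

def rnn_like(tokens, dim=4):
    # Toy recurrence: position enters implicitly through the time loop.
    torch.manual_seed(0)          # same weights on every call
    W = torch.randn(dim, dim)
    h = torch.zeros(dim)
    for t in tokens:              # sequential, one token at a time
        h = torch.tanh(W @ h + t)
    return h

tokens = list(torch.randn(3, 4))
h_fwd = rnn_like(tokens)
h_rev = rnn_like(tokens[::-1])

# Reversing the tokens changes the final state:
print(torch.allclose(h_fwd, h_rev))  # False
```

Transformers drop this loop for parallelism, which is why order must be injected some other way.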

The Engineering Requirement

Any solution for order must:

  • Scale to long sequences
  • Work with parallel computation
  • Not break attention mechanics

Positional encoding satisfies all three.

What Positional Encoding Actually Does

Each token embedding is modified using its position.

Instead of replacing embeddings, positional information is added to them.

This allows attention to consider both:

  • What the token is
  • Where the token is

High-Level Idea Before Code

Think of positional encoding as a signal:

  • Unique per position
  • Consistent across sequences
  • Interpretable by attention layers

Once added, attention can reason about order.

Sinusoidal Positional Encoding (Original Transformer)

The original Transformer paper used sine and cosine functions.

Why?

  • Continuous values
  • Unbounded sequence length
  • Relative positions can be inferred

How the Formula Is Designed

Each dimension encodes position at a different frequency.

In the standard formulation, low-index dimensions oscillate rapidly, while high-index dimensions change slowly.

This gives the model access to both local and global order.

Minimal Sinusoidal Encoding Code

This example shows how positional encodings are created.


import torch
import math

def positional_encoding(seq_len, dim):
    """Build a (seq_len, dim) table of sinusoidal positional encodings."""
    pe = torch.zeros(seq_len, dim)

    # Column vector of positions: 0, 1, ..., seq_len - 1
    position = torch.arange(0, seq_len).unsqueeze(1)

    # One frequency per pair of dimensions: 1 / 10000^(2i / dim)
    div_term = torch.exp(
        torch.arange(0, dim, 2) * (-math.log(10000.0) / dim)
    )

    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

What This Code Is Really Doing

Breaking it down conceptually:

  • Each position gets a unique pattern
  • Patterns repeat smoothly across dimensions
  • Nearby positions are mathematically related

Attention layers learn to use these patterns naturally.
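As a quick check on that last point, the encoding table gives nearby positions more similar vectors than distant ones. The snippet repeats the positional_encoding function from above so it runs on its own.

```python
import torch
import math

def positional_encoding(seq_len, dim):
    pe = torch.zeros(seq_len, dim)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(100, 64)
sim = torch.nn.functional.cosine_similarity

# Position 10 looks more like its neighbor than like a distant position.
near = sim(pe[10], pe[11], dim=0)
far = sim(pe[10], pe[60], dim=0)
print(near.item() > far.item())  # True
```

This smooth decay in similarity is what lets attention layers pick up on distance between tokens.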

How Positional Encoding Is Applied

Positional encodings are added to token embeddings:


# 10 tokens, model dimension 512
token_embeddings = torch.randn(10, 512)
pos_embeddings = positional_encoding(10, 512)

# Same shape, so position information is simply summed in
input_embeddings = token_embeddings + pos_embeddings

Nothing else changes in the model.

This simplicity is intentional.

Learned vs Fixed Positional Encoding

There are two common approaches:

  • Fixed (sinusoidal)
  • Learned positional embeddings

Learned embeddings can adapt better to data, but fixed encodings generalize to longer sequences.
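The learned variant can be sketched with a trainable embedding table. The class name here is illustrative; note the hard max_len limit, which is exactly why fixed encodings extrapolate to longer sequences more gracefully.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Illustrative sketch: one trainable vector per position."""

    def __init__(self, max_len, dim):
        super().__init__()
        # Positions beyond max_len have no embedding and will fail.
        self.pos_emb = nn.Embedding(max_len, dim)

    def forward(self, token_embeddings):
        seq_len = token_embeddings.shape[0]
        positions = torch.arange(seq_len)
        return token_embeddings + self.pos_emb(positions)

layer = LearnedPositionalEmbedding(max_len=512, dim=64)
out = layer(torch.randn(10, 64))
print(out.shape)  # torch.Size([10, 64])
```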

Why Relative Position Matters

In many tasks, the distance between tokens matters more than absolute position.

Modern models often use relative or rotary encodings to capture this more effectively.
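The rotary idea can be sketched as follows. This uses the split-half pairing convention and is a simplification; real implementations differ in details. Each pair of dimensions is rotated by a position-dependent angle, so dot products between encoded vectors depend only on the distance between positions.

```python
import torch

def rotary_encode(x, base=10000.0):
    # Rotate each dimension pair by an angle proportional to position.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half) / half)         # one rate per pair
    angles = torch.arange(seq_len).unsqueeze(1) * freqs  # (seq_len, half)
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * torch.cos(angles) - x2 * torch.sin(angles),
                      x1 * torch.sin(angles) + x2 * torch.cos(angles)], dim=-1)

torch.manual_seed(0)
a, b = torch.randn(16), torch.randn(16)
x = torch.zeros(8, 16)
x[1], x[3] = a, b   # same content at distance 2...
x[4], x[6] = a, b   # ...at a different absolute position
r = rotary_encode(x)

# Dot products match when the relative distance matches:
print(torch.allclose(r[1] @ r[3], r[4] @ r[6], atol=1e-4))  # True
```

This relative-only dependence is what makes rotary encodings attractive for tasks where distance matters more than absolute position.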

What Changes Without Positional Encoding

If positional encoding is removed:

  • Word order is ignored
  • Sequences become ambiguous
  • Model performance collapses

This is not optional — it is foundational.

Common Learner Mistakes

  • Thinking attention learns order automatically
  • Ignoring position in short sequences
  • Confusing positional encoding with token embeddings

Order must be injected deliberately.

Practice

What problem does positional encoding solve?



How is positional information applied to embeddings?



Which type of positional encoding uses sine and cosine?



Quick Quiz

Why do Transformers need positional encoding?





How are positional encodings combined with embeddings?





Main purpose of positional encoding?





Recap: Positional encoding injects order into Transformers so attention can reason about sequences.

Next up: Encoder–Decoder Architecture — how Transformers generate outputs step by step.