GenAI Lesson 36 – Tokenization | Dataplexa

Tokenization: How Text Becomes Numbers Inside LLMs

Large Language Models do not read words, sentences, or characters.

They operate purely on numbers.

Tokenization is the process that converts raw text into numerical units the model can understand.

Why Tokenization Exists

Neural networks cannot process strings directly.

Every input must be transformed into fixed numeric representations.

Tokenization defines the smallest meaningful units used during training and inference.

What a Token Actually Is

A token is not always a word.

Depending on the tokenizer, a token may represent:

  • A full word
  • A subword
  • A character sequence
  • Punctuation or whitespace

Modern LLMs almost always use subword tokenization.

Why Not Word-Level Tokenization

Word-level tokenization fails for:

  • Rare words
  • Misspellings
  • New vocabulary

Subword tokenization solves this by breaking words into reusable pieces.
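To make this concrete, here is a minimal sketch of greedy longest-match subword splitting. The vocabulary is a toy, hand-written set (not taken from any real tokenizer); single characters serve as a fallback so no word is ever out-of-vocabulary:

```python
# Toy illustration (not a real tokenizer): split a word into known
# subword pieces using greedy longest-match against a small vocabulary.
VOCAB = {"token", "ization", "un", "happi", "ness", "play", "ing"}

def split_into_subwords(word):
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until a match
        for end in range(len(word), start, -1):
            if word[start:end] in VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Fall back to a single character, so nothing is ever "unknown"
            pieces.append(word[start])
            start += 1
    return pieces

print(split_into_subwords("tokenization"))  # ['token', 'ization']
print(split_into_subwords("unhappiness"))   # ['un', 'happi', 'ness']
```

A rare or novel word simply decomposes into smaller known pieces instead of producing an unknown-word error.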

How Tokenization Fits Into the GPT Pipeline

Before any attention or reasoning happens:

  • Text is tokenized
  • Tokens are mapped to IDs
  • IDs are converted into embeddings

Everything downstream depends on this step.
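The first two pipeline steps can be sketched with a toy vocabulary. Real models ship learned vocabularies with tens of thousands of entries; the tokens and IDs below are made up for illustration:

```python
# Minimal sketch of the pipeline's first two steps:
# step 1: text is already split into subword tokens
# step 2: each token is mapped to its integer ID via a vocabulary lookup
vocab = {"Token": 0, "ization": 1, "Ġmatters": 2, "Ġa": 3, "Ġlot": 4}

tokens = ["Token", "ization", "Ġmatters", "Ġa", "Ġlot"]
token_ids = [vocab[t] for t in tokens]

print(token_ids)  # [0, 1, 2, 3, 4]
```

Step 3, converting IDs into embedding vectors, is just a row lookup into a learned matrix, covered later in this lesson.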

Thinking Like an Engineer Before Tokenizing

Before choosing a tokenizer, engineers ask:

  • What languages must be supported?
  • How large should the vocabulary be?
  • How much context efficiency is needed?

These decisions affect model size, speed, and cost.

Byte Pair Encoding (BPE) Concept

BPE starts from individual characters.

It then repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary entry, so later merges can combine multi-character pieces.

This produces subwords that balance vocabulary size against flexibility.
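The merge loop can be sketched in a few lines of plain Python. This is a toy trainer on a three-word corpus with a fixed number of merge rounds, not a production BPE implementation:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the commonest."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Tiny corpus: each word starts as a list of characters
words = [list("lower"), list("lowest"), list("low")]
for _ in range(3):  # three merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", words)
```

After a few rounds, frequent fragments like "low" become single symbols, which is exactly how a BPE vocabulary grows from characters toward common subwords.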

Seeing Tokenization in Action

The following example shows how a tokenizer splits text.


from transformers import GPT2Tokenizer

# Load the vocabulary and merge rules shipped with GPT-2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization matters a lot"
tokens = tokenizer.tokenize(text)                    # split into subword strings
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary IDs

print(tokens)
print(token_ids)

This code loads a real tokenizer used by GPT-style models.

It splits the sentence into subword tokens and maps them to numeric IDs.

['Token', 'ization', 'Ġmatters', 'Ġa', 'Ġlot']
[12893, 12017, 5672, 257, 1767]

Understanding the Output

The special symbol Ġ represents a leading space.

This allows the model to learn spacing patterns naturally.

Each number corresponds to a learned embedding vector.
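Because Ġ is just a marker character, the original text can be recovered by joining the tokens and mapping Ġ back to a space:

```python
# Tokens from the GPT-2 example above; Ġ marks a leading space,
# so joining the pieces and replacing it restores the original text.
tokens = ['Token', 'ization', 'Ġmatters', 'Ġa', 'Ġlot']
text = "".join(tokens).replace("Ġ", " ")
print(text)  # Tokenization matters a lot
```

This round-trip property is what makes subword tokenization lossless: no information about spacing is thrown away.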

Why Token IDs Matter

Token IDs are used to:

  • Index embedding matrices
  • Compute attention
  • Predict next tokens

Changing tokenization changes model behavior.
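The first of these uses, indexing an embedding matrix, is just a row lookup. Here is a toy sketch with a hand-written three-row "matrix"; real models use learned float matrices of shape (vocab_size, hidden_dim):

```python
# Toy sketch: each token ID selects one row of an embedding matrix.
# The values below are made up; real embeddings are learned during training.
embedding_matrix = [
    [0.1, 0.2, 0.3],   # row for ID 0
    [0.4, 0.5, 0.6],   # row for ID 1
    [0.7, 0.8, 0.9],   # row for ID 2
]

token_ids = [2, 0, 1]
embeddings = [embedding_matrix[i] for i in token_ids]  # ID -> vector lookup
print(embeddings)
```

Because the lookup is by position, renumbering the vocabulary (i.e., changing the tokenizer) would pair every token with a different learned vector, which is why tokenization and model weights must always match.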

Tokenization and Context Length

Context windows are measured in tokens, not characters.

A long sentence, or text the tokenizer handles poorly, can consume far more tokens than its character count suggests.

This directly impacts:

  • Cost
  • Latency
  • Prompt design

How Developers Should Practice Tokenization

Effective practice includes:

  • Tokenizing the same sentence using different tokenizers
  • Counting token usage for prompts
  • Testing multilingual text

This builds intuition for real-world GenAI systems.

Common Mistakes

  • Assuming one word equals one token
  • Ignoring token limits
  • Not testing edge cases

These mistakes cause unexpected failures in production systems.

Practice

What unit does a GPT model process internally?



Modern LLMs mainly use what type of tokenization?



What form must text take before entering a neural network?



Quick Quiz

Which algorithm is commonly used for GPT tokenization?





Context length limits are measured in?





Token IDs are primarily used to index?





Recap: Tokenization converts text into subword-based numeric units that power all LLM behavior.

Next up: Training Large Language Models — how trillions of tokens shape intelligence.