Generative AI Course
Tokenization: How Text Becomes Numbers Inside LLMs
Large Language Models do not read words, sentences, or characters.
They operate purely on numbers.
Tokenization is the process that converts raw text into numerical units the model can understand.
Why Tokenization Exists
Neural networks cannot process strings directly.
Every input must first be converted into a numeric representation.
Tokenization defines the smallest meaningful units used during training and inference.
What a Token Actually Is
A token is not always a word.
Depending on the tokenizer, a token may represent:
- A full word
- A subword
- A character sequence
- Punctuation or whitespace
Modern LLMs almost always use subword tokenization.
Why Not Word-Level Tokenization
Word-level tokenization fails for:
- Rare words
- Misspellings
- New vocabulary
Subword tokenization solves this by breaking words into reusable pieces.
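The contrast can be sketched in a few lines of plain Python. The vocabularies and the greedy longest-match splitter below are toy illustrations, not how production tokenizers are implemented:

```python
# Hypothetical vocabularies for illustration only.
word_vocab = {"token", "matters", "a", "lot"}
subword_vocab = {"token", "iz", "ation", "s", "un"}

def word_tokenize(word):
    """Word-level: any word outside the vocabulary becomes '<unk>'."""
    return [word] if word in word_vocab else ["<unk>"]

def subword_tokenize(word, vocab):
    """Greedy longest-match split into subword pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return ["<unk>"]  # no piece matches; give up
    return pieces

print(word_tokenize("tokenization"))                    # ['<unk>']
print(subword_tokenize("tokenization", subword_vocab))  # ['token', 'iz', 'ation']
```

The word-level vocabulary has never seen "tokenization" and loses all information about it, while the subword vocabulary reuses pieces it already knows.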
How Tokenization Fits Into the GPT Pipeline
Before any attention or reasoning happens:
- Text is tokenized
- Tokens are mapped to IDs
- IDs are converted into embeddings
Everything downstream depends on this step.
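The three steps above can be sketched end to end. The tokens, IDs, and 4-dimensional embedding vectors below are all made up for illustration; real models use vocabularies of tens of thousands of tokens and embeddings with hundreds or thousands of dimensions:

```python
# Step 0: a toy vocabulary and its embedding matrix (one row per token).
vocab = {"Hello": 0, "world": 1, "!": 2}
embedding_matrix = [
    [0.1, 0.2, 0.3, 0.4],  # row 0 -> "Hello"
    [0.5, 0.6, 0.7, 0.8],  # row 1 -> "world"
    [0.9, 1.0, 1.1, 1.2],  # row 2 -> "!"
]

tokens = ["Hello", "world", "!"]                       # 1: text is tokenized
token_ids = [vocab[t] for t in tokens]                 # 2: tokens -> IDs
embeddings = [embedding_matrix[i] for i in token_ids]  # 3: IDs index embeddings

print(token_ids)  # [0, 1, 2]
```

Note that step 3 is just a table lookup: a token ID is literally a row index into the embedding matrix, which is why everything downstream depends on how the text was split.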
Thinking Like an Engineer Before Tokenizing
Before choosing a tokenizer, engineers ask:
- What languages must be supported?
- How large should the vocabulary be?
- How much context efficiency is needed?
These decisions affect model size, speed, and cost.
Byte Pair Encoding (BPE) Concept
BPE starts with individual characters (or raw bytes, in the byte-level variant GPT models use).
It repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary entry.
This creates subwords that balance vocabulary size and flexibility.
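The merge loop can be demonstrated on a tiny corpus. This is a minimal sketch of the core BPE idea; the three words and their frequencies are invented, and real implementations add byte-level handling, pre-tokenization, and a learned merge table:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with a made-up frequency.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
for _ in range(2):  # perform two merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged:", pair)
```

After two merges the corpus contains the subword "low", which can now be reused in "lower", "lowest", and any future word that starts the same way.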
Seeing Tokenization in Action
The following example shows how a tokenizer splits text.
```python
from transformers import GPT2Tokenizer

# Load the byte-level BPE tokenizer that ships with GPT-2.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization matters a lot"
tokens = tokenizer.tokenize(text)                    # subword strings
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # numeric IDs

print(tokens)     # e.g. ['Token', 'ization', 'Ġmatters', 'Ġa', 'Ġlot']
print(token_ids)
```
This code loads a real tokenizer used by GPT-style models.
It splits the sentence into subword tokens and maps them to numeric IDs.
Understanding the Output
The special symbol Ġ represents a leading space.
This allows the model to learn spacing patterns naturally.
Each number corresponds to a learned embedding vector.
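Because Ġ encodes the space inside the token itself, detokenization is mostly string concatenation. The sketch below undoes the marker by hand; the token list is illustrative, and real detokenizers (e.g. `tokenizer.convert_tokens_to_string`) also handle byte-level characters and special tokens:

```python
# Illustrative GPT-2-style tokens: "Ġ" marks a leading space.
tokens = ["Token", "ization", "Ġmatters", "Ġa", "Ġlot"]

# Concatenate the pieces, then turn each marker back into a space.
text = "".join(tokens).replace("Ġ", " ")
print(text)  # Tokenization matters a lot
```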
Why Token IDs Matter
Token IDs are used to:
- Index embedding matrices
- Compute attention
- Predict next tokens
Changing tokenization changes model behavior.
Tokenization and Context Length
Context windows are measured in tokens, not characters.
A long sentence can consume more context than expected.
This directly impacts:
- Cost
- Latency
- Prompt design
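A rough budget check makes this concrete. Everything numeric below is an assumption for illustration: the 4-characters-per-token heuristic is a crude English-text rule of thumb (exact counts require the actual tokenizer), and the context limit and price are hypothetical, not real API rates:

```python
CONTEXT_LIMIT = 4096   # hypothetical context window, in tokens
PRICE_PER_1K = 0.002   # hypothetical $ per 1,000 tokens

def estimate_tokens(text):
    """Crude heuristic: roughly 4 characters per English token."""
    return max(1, len(text) // 4)

prompt = "Summarize the following report..." * 50
n = estimate_tokens(prompt)
cost = n / 1000 * PRICE_PER_1K
print(f"~{n} tokens, fits: {n <= CONTEXT_LIMIT}, est. cost: ${cost:.4f}")
```

For anything beyond a rough estimate, count tokens with the same tokenizer the target model uses, since heuristics drift badly on code, non-English text, and unusual formatting.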
How Developers Should Practice Tokenization
Effective practice includes:
- Tokenizing the same sentence using different tokenizers
- Counting token usage for prompts
- Testing multilingual text
This builds intuition for real-world GenAI systems.
Common Mistakes
- Assuming one word equals one token
- Ignoring token limits
- Not testing edge cases
These mistakes cause unexpected failures in production systems.
Practice
What unit does a GPT model process internally?
Modern LLMs mainly use what type of tokenization?
What form must text take before entering a neural network?
Quick Quiz
Which algorithm is commonly used for GPT tokenization?
Context length limits are measured in?
Token IDs are primarily used to index?
Recap: Tokenization converts text into subword-based numeric units that power all LLM behavior.
Next up: Training Large Language Models — how trillions of tokens shape intelligence.