GenAI Lesson 36 – Tokenization | Dataplexa

Tokenization: How Text Becomes Numbers Inside LLMs

Large Language Models do not read words, sentences, or characters.

They operate purely on numbers.

Tokenization is the process that converts raw text into numerical units the model can understand.

Why Tokenization Exists

Neural networks cannot process strings directly.

Every input must be transformed into fixed numeric representations.

Tokenization defines the smallest meaningful units used during training and inference.

What a Token Actually Is

A token is not always a word.

Depending on the tokenizer, a token may represent:

  • A full word
  • A subword
  • A character sequence
  • Punctuation or whitespace

Modern LLMs almost always use subword tokenization.

Why Not Word-Level Tokenization

Word-level tokenization fails for:

  • Rare words
  • Misspellings
  • New vocabulary

Subword tokenization solves this by breaking words into reusable pieces.
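To make this concrete, here is a minimal sketch of greedy longest-match subword splitting. The vocabulary is a toy, hand-written set (not taken from any real tokenizer); single characters serve as a fallback so no word is ever out-of-vocabulary:

```python
# Toy illustration (not a real tokenizer): split a word into known
# subword pieces using greedy longest-match against a small vocabulary.
VOCAB = {"token", "ization", "un", "happi", "ness", "play", "ing"}

def split_into_subwords(word):
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until a match
        for end in range(len(word), start, -1):
            if word[start:end] in VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Fall back to a single character, so nothing is ever "unknown"
            pieces.append(word[start])
            start += 1
    return pieces

print(split_into_subwords("tokenization"))  # ['token', 'ization']
print(split_into_subwords("unhappiness"))   # ['un', 'happi', 'ness']
```

A rare or novel word simply decomposes into smaller known pieces instead of producing an unknown-word error.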

How Tokenization Fits Into the GPT Pipeline

Before any attention or reasoning happens:

  • Text is tokenized
  • Tokens are mapped to IDs
  • IDs are converted into embeddings

Everything downstream depends on this step.
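The first two pipeline steps can be sketched with a toy vocabulary. Real models ship learned vocabularies with tens of thousands of entries; the tokens and IDs below are made up for illustration:

```python
# Minimal sketch of the pipeline's first two steps:
# step 1: text is already split into subword tokens
# step 2: each token is mapped to its integer ID via a vocabulary lookup
vocab = {"Token": 0, "ization": 1, "Ġmatters": 2, "Ġa": 3, "Ġlot": 4}

tokens = ["Token", "ization", "Ġmatters", "Ġa", "Ġlot"]
token_ids = [vocab[t] for t in tokens]

print(token_ids)  # [0, 1, 2, 3, 4]
```

Step 3, converting IDs into embedding vectors, is just a row lookup into a learned matrix, covered later in this lesson.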

Thinking Like an Engineer Before Tokenizing

Before choosing a tokenizer, engineers ask:

  • What languages must be supported?
  • How large should the vocabulary be?
  • How much context efficiency is needed?

These decisions affect model size, speed, and cost.

Byte Pair Encoding (BPE) Concept

BPE starts from individual characters.

It then repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary entry, so later merges can combine multi-character pieces.

This produces subwords that balance vocabulary size against flexibility.
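The merge loop can be sketched in a few lines of plain Python. This is a toy trainer on a three-word corpus with a fixed number of merge rounds, not a production BPE implementation:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the commonest."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Tiny corpus: each word starts as a list of characters
words = [list("lower"), list("lowest"), list("low")]
for _ in range(3):  # three merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", words)
```

After a few rounds, frequent fragments like "low" become single symbols, which is exactly how a BPE vocabulary grows from characters toward common subwords.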

Seeing Tokenization in Action

The following example shows how a tokenizer splits text.


from transformers import GPT2Tokenizer

# Load the vocabulary and merge rules shipped with GPT-2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Tokenization matters a lot"
tokens = tokenizer.tokenize(text)                    # split into subword strings
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary IDs

print(tokens)
print(token_ids)

This code loads a real tokenizer used by GPT-style models.

It splits the sentence into subword tokens and maps them to numeric IDs.

['Token', 'ization', 'Ġmatters', 'Ġa', 'Ġlot']
[12893, 12017, 5672, 257, 1767]

Understanding the Output

The special symbol Ġ represents a leading space.

This allows the model to learn spacing patterns naturally.

Each number corresponds to a learned embedding vector.
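Because Ġ is just a marker character, the original text can be recovered by joining the tokens and mapping Ġ back to a space:

```python
# Tokens from the GPT-2 example above; Ġ marks a leading space,
# so joining the pieces and replacing it restores the original text.
tokens = ['Token', 'ization', 'Ġmatters', 'Ġa', 'Ġlot']
text = "".join(tokens).replace("Ġ", " ")
print(text)  # Tokenization matters a lot
```

This round-trip property is what makes subword tokenization lossless: no information about spacing is thrown away.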

Why Token IDs Matter

Token IDs are used to:

  • Index embedding matrices
  • Compute attention
  • Predict next tokens

Changing tokenization changes model behavior.
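The first of these uses, indexing an embedding matrix, is just a row lookup. Here is a toy sketch with a hand-written three-row "matrix"; real models use learned float matrices of shape (vocab_size, hidden_dim):

```python
# Toy sketch: each token ID selects one row of an embedding matrix.
# The values below are made up; real embeddings are learned during training.
embedding_matrix = [
    [0.1, 0.2, 0.3],   # row for ID 0
    [0.4, 0.5, 0.6],   # row for ID 1
    [0.7, 0.8, 0.9],   # row for ID 2
]

token_ids = [2, 0, 1]
embeddings = [embedding_matrix[i] for i in token_ids]  # ID -> vector lookup
print(embeddings)
```

Because the lookup is by position, renumbering the vocabulary (i.e., changing the tokenizer) would pair every token with a different learned vector, which is why tokenization and model weights must always match.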

Tokenization and Context Length

Context windows are measured in tokens, not characters.

A long sentence, or text the tokenizer handles poorly, can consume far more tokens than its character count suggests.

This directly impacts:

  • Cost
  • Latency
  • Prompt design

How Developers Should Practice Tokenization

Effective practice includes:

  • Tokenizing the same sentence using different tokenizers
  • Counting token usage for prompts
  • Testing multilingual text

This builds intuition for real-world GenAI systems.

Common Mistakes

  • Assuming one word equals one token
  • Ignoring token limits
  • Not testing edge cases

These mistakes cause unexpected failures in production systems.

Practice

What unit does a GPT model process internally?



Modern LLMs mainly use what type of tokenization?



What form must text take before entering a neural network?



Quick Quiz

Which algorithm is commonly used for GPT tokenization?





Context length limits are measured in?





Token IDs are primarily used to index?





Recap: Tokenization converts text into subword-based numeric units that power all LLM behavior.

Next up: Training Large Language Models — how trillions of tokens shape intelligence.