BERT Tokenization
In the previous lesson, you learned what BERT is, why it was a breakthrough, and how it uses the Transformer encoder to understand language bidirectionally.
Now we move to a very important and often misunderstood topic: BERT Tokenization.
Tokenization is the very first step before any text enters BERT. If tokenization is wrong, even the best model will fail.
Why Tokenization Is Critical in BERT
Computers cannot understand raw text. They need text to be broken into smaller units called tokens.
BERT does not work on:
- Raw sentences
- Whole words directly
Instead, it uses a subword tokenization scheme called WordPiece Tokenization.
What Is Tokenization?
Tokenization is the process of splitting text into smaller units.
Depending on the method, tokens can be:
- Words
- Subwords
- Characters
BERT uses subword-level tokenization.
Why BERT Does NOT Use Simple Word Tokenization
Simple word tokenization has major problems:
- Large vocabulary size
- Unknown words (OOV problem)
- Poor handling of rare words
BERT solves this using WordPiece.
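To make the OOV problem concrete, here is a minimal sketch in plain Python. The tiny vocabularies and the subword split are purely illustrative, not BERT's real vocabulary.

```python
# Illustrative only: a toy word-level vocabulary vs. a toy subword vocabulary.
word_vocab = {"the": 0, "movie": 1, "was": 2, "good": 3, "[UNK]": 4}
subword_vocab = {"un": 5, "##believ": 6, "##able": 7}

sentence = ["the", "movie", "was", "unbelievable"]

# Word-level lookup: any word outside the vocabulary collapses to [UNK].
word_ids = [word_vocab.get(w, word_vocab["[UNK]"]) for w in sentence]
print(word_ids)  # [0, 1, 2, 4] -> "unbelievable" lost all of its meaning

# Subword lookup: the same word is covered by known pieces instead.
subword_split = ["un", "##believ", "##able"]
print([subword_vocab[p] for p in subword_split])  # [5, 6, 7]
```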
What Is WordPiece Tokenization?
WordPiece breaks words into frequently occurring subwords.
Instead of treating every word as new, it reuses known pieces.
This helps BERT handle:
- Rare words
- New words
- Misspellings
Example: WordPiece in Action
Consider the word:
“unbelievable”
BERT may split it as:
un + ##believable
The prefix ## means:
“This token is a continuation of the previous token.”
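You can see these pieces directly with a minimal sketch using the Hugging Face `transformers` library, assuming it is installed and the `bert-base-uncased` checkpoint is used. The exact split depends on the checkpoint's vocabulary.

```python
from transformers import BertTokenizer

# Load the WordPiece tokenizer shipped with the bert-base-uncased checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# tokenize() returns the subword pieces without adding any special tokens.
pieces = tokenizer.tokenize("unbelievable")
print(pieces)
# Expect 'un' followed by one or more '##'-prefixed continuation pieces;
# the exact split depends on the checkpoint's vocabulary.
```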
Handling Unknown Words
If BERT encounters a completely unknown word, it breaks it down into smaller known pieces.
If it still cannot tokenize it, it uses a special token:
[UNK]
This guarantees that every input can still be mapped to a token ID, even though the meaning of that particular token is lost.
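The sketch below (same assumed tokenizer and checkpoint as above) illustrates both fallbacks: a rare made-up word decomposed into known pieces, and a character the vocabulary cannot cover at all.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A made-up word is still covered by known subword pieces.
print(tokenizer.tokenize("flibbertigibbet"))
# several '##'-prefixed pieces; the exact split depends on the vocabulary

# A symbol with no covering pieces falls back to the unknown token.
print(tokenizer.tokenize("☃"))
# typically ['[UNK]']
```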
Special Tokens Used by BERT
BERT uses several special tokens to structure input.
- [CLS] – Classification token
- [SEP] – Separator token
- [PAD] – Padding token
- [MASK] – Masked token
- [UNK] – Unknown token
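The tokenizer object exposes these strings and their vocabulary IDs directly, as in this sketch (same assumed checkpoint; the IDs shown are those of the bert-base-uncased vocabulary).

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The special-token strings are attributes of the tokenizer.
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
      tokenizer.mask_token, tokenizer.unk_token)
# [CLS] [SEP] [PAD] [MASK] [UNK]

# Their integer IDs in the vocabulary:
print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)
# 101 102 0 (for bert-base-uncased)
```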
The [CLS] Token
The [CLS] token is added at the beginning of every input.
After encoding, its output embedding serves as an aggregate representation of the entire input.
For classification tasks, a classification head is attached to this [CLS] output embedding.
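A minimal PyTorch sketch (assuming `transformers` and `torch` are installed, and the same checkpoint) shows where that embedding lives in the model output; index 0 along the sequence dimension is the [CLS] position.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was great!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden state of the first token ([CLS]) -- the usual input to a classification head.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```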
The [SEP] Token
The [SEP] token separates sentences.
It is used:
- Between two sentences
- At the end of every input, even a single sentence
This helps BERT understand sentence boundaries.
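Passing a sentence pair to the tokenizer (same assumed checkpoint) shows where [CLS] and [SEP] end up in the encoded sequence.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two sentences are passed as a pair; special tokens are added automatically.
encoded = tokenizer("How old are you?", "I am 24.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# roughly: ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]', 'i', 'am', '24', '.', '[SEP]']
```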
Segment Embeddings (Sentence A / B)
When BERT processes two sentences, it uses segment embeddings to distinguish them.
- Sentence A → Segment 0
- Sentence B → Segment 1
This is crucial for tasks like question answering.
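In the Hugging Face encoding these segment labels appear as token_type_ids, as in this sketch (same assumed checkpoint).

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("Where is the Eiffel Tower?", "It is in Paris.")
print(encoded["token_type_ids"])
# 0s cover [CLS] + sentence A + the first [SEP]; 1s cover sentence B + the final [SEP]
```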
Padding and Attention Masks
All sequences in a batch must be the same length (and no longer than BERT's 512-token limit).
Shorter sentences are padded using:
[PAD]
An attention mask tells BERT:
- Which tokens are real
- Which tokens are padding
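Here is a sketch of padding a small batch to a common length (same assumed checkpoint); the max_length value is arbitrary and chosen only for the example.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["Short sentence.", "A noticeably longer second sentence in the same batch."],
    padding="max_length",   # pad every sequence up to max_length
    max_length=16,          # arbitrary length chosen for this example
    truncation=True,
)
print(batch["input_ids"][0])       # trailing 0s are [PAD] token IDs
print(batch["attention_mask"][0])  # 1 = real token, 0 = padding
```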
Complete BERT Input Representation
Each token entering BERT has:
- Token embedding
- Position embedding
- Segment embedding
These three are added together before entering the encoder layers.
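Inside the Hugging Face BertModel these three lookup tables live in its embeddings module; the sketch below (same assumed checkpoint) simply prints them rather than reimplementing the sum.

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

emb = model.embeddings
print(emb.word_embeddings)        # token (WordPiece) embedding table: vocab_size x hidden_size
print(emb.position_embeddings)    # position embedding table: max_positions x hidden_size
print(emb.token_type_embeddings)  # segment (A/B) embedding table: 2 x hidden_size
# Inside the model, the three vectors for each token are summed (then layer-normalized)
# before entering the encoder layers.
```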
Why Tokenization Affects Model Performance
Good tokenization improves:
- Generalization
- Handling of rare words
- Model efficiency
Poor tokenization leads to:
- Loss of meaning
- Incorrect predictions
Practice Questions
Q1. What tokenization method does BERT use?
Q2. What does the prefix “##” indicate?
Quick Quiz
Q1. Which token represents the whole sentence?
Q2. Why is padding required?
Homework / Assignment
Conceptual:
- Explain WordPiece tokenization with your own example
- List all special tokens used by BERT
Preparation:
- Next lesson: Fine-Tuning BERT
- Revise BERT architecture and token flow
Quick Recap
- BERT uses WordPiece tokenization
- Subwords help handle rare and unknown words
- Special tokens structure the input
- Tokenization directly affects model performance
Next lesson: Fine-Tuning BERT