NLP Lesson 51 – BERT Tokenization | Dataplexa

BERT Tokenization

In the previous lesson, you learned what BERT is, why it was a breakthrough, and how it uses the Transformer encoder to understand language bidirectionally.

Now we move to a very important and often misunderstood topic: BERT Tokenization.

Tokenization is the very first step before any text enters BERT. If tokenization is wrong, even the best model will fail.


Why Tokenization Is Critical in BERT

Computers cannot understand raw text. They need text to be broken into smaller units called tokens.

BERT does not work on:

  • Raw sentences
  • Whole words directly

Instead, it works on a special type of tokenization called WordPiece Tokenization.


What Is Tokenization?

Tokenization is the process of splitting text into smaller units.

Depending on the method, tokens can be:

  • Words
  • Subwords
  • Characters

BERT uses subword-level tokenization.
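The three granularities are easy to see with plain Python string operations (a toy illustration, not BERT's actual tokenizer):

```python
text = "unbelievable results"

# Word-level: split on whitespace.
word_tokens = text.split()      # ['unbelievable', 'results']

# Character-level: every character becomes a token.
char_tokens = list(text)        # ['u', 'n', 'b', 'e', ...]

print(word_tokens)
print(char_tokens[:4])
```

Subword tokenization sits between these two extremes: pieces larger than characters, but small enough to be reused across many words.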


Why BERT Does NOT Use Simple Word Tokenization

Simple word tokenization has major problems:

  • Large vocabulary size
  • Unknown words (OOV problem)
  • Poor handling of rare words

BERT solves this using WordPiece.


What Is WordPiece Tokenization?

WordPiece breaks words into frequently occurring subwords.

Instead of treating every word as new, it reuses known pieces.

This helps BERT handle:

  • Rare words
  • New words
  • Misspellings

Example: WordPiece in Action

Consider the word:

“unbelievable”

BERT may split it as:

un + ##believable

The prefix ## means:

“This token is a continuation of the previous token.”
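The splitting itself is a greedy longest-match-first search over a fixed vocabulary. The sketch below implements that idea with a tiny made-up vocabulary (`wordpiece_tokenize` and the vocabulary contents are illustrative, not BERT's real 30,000-entry vocab):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword splitting, WordPiece style."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return None  # no known piece fits here (handled in the next section)
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary containing the pieces from the lesson's example.
vocab = {"un", "##believable", "##believ", "##able"}

print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believable']
```

Because the search is longest-match-first, the tokenizer prefers `##believable` over the shorter pieces `##believ` + `##able`, even though both splits are possible with this vocabulary.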


Handling Unknown Words

If BERT encounters a completely unknown word, it breaks it down into smaller known pieces.

If no known subword matches any part of the word, the tokenizer falls back to a special token:

[UNK]

This guarantees that every word maps to a valid token ID, so unseen text never breaks inference.
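Continuing the greedy-matching idea from the previous section, the fallback can be sketched like this (toy code and vocabulary, for illustration only):

```python
def tokenize_word(word, vocab, unk_token="[UNK]"):
    """Split a word into known subwords; fall back to [UNK] if impossible."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Real WordPiece maps the *whole* word to [UNK] in this case.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##believable"}
print(tokenize_word("unbelievable", vocab))  # ['un', '##believable']
print(tokenize_word("xqzt", vocab))          # ['[UNK]'] – nothing matches
```

Note that when any part of the word cannot be covered, the whole word becomes `[UNK]`, not just the unmatched fragment.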


Special Tokens Used by BERT

BERT uses several special tokens to structure input.

  • [CLS] – Classification token
  • [SEP] – Separator token
  • [PAD] – Padding token
  • [MASK] – Masked token
  • [UNK] – Unknown token

The [CLS] Token

The [CLS] token is added at the beginning of every input.

Its final hidden state serves as an aggregate representation of the entire input sequence.

For classification tasks, BERT feeds the output embedding of [CLS] into the classification head.


The [SEP] Token

The [SEP] token separates sentences.

It is used:

  • Between the two sentences in a sentence-pair input
  • At the end of every input, even single-sentence ones

This helps BERT understand sentence boundaries.
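Putting [CLS] and [SEP] together, the input assembly can be sketched as follows (`build_bert_input` is an illustrative helper name; the tokens are assumed to be already WordPiece-split):

```python
def build_bert_input(sentence_a, sentence_b=None):
    """Wrap one or two token lists in the [CLS]/[SEP] layout BERT expects."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
    return tokens

single = build_bert_input(["how", "are", "you"])
pair = build_bert_input(["how", "are", "you"], ["i", "am", "fine"])
print(single)  # ['[CLS]', 'how', 'are', 'you', '[SEP]']
print(pair)    # ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'fine', '[SEP]']
```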


Segment Embeddings (Sentence A / B)

When BERT processes two sentences, it uses segment embeddings to distinguish them.

  • Sentence A → Segment 0
  • Sentence B → Segment 1

This is crucial for tasks like question answering.
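The segment assignment can be derived directly from the [SEP] positions. A minimal sketch (`segment_ids` is an illustrative name):

```python
def segment_ids(tokens):
    """Assign 0 to sentence A (up to and including the first [SEP]), 1 after."""
    ids, segment = [], 0
    for tok in tokens:
        ids.append(segment)
        if tok == "[SEP]":
            segment = 1  # everything after the first [SEP] is sentence B
    return ids

tokens = ["[CLS]", "how", "are", "you", "[SEP]", "fine", "[SEP]"]
print(segment_ids(tokens))  # [0, 0, 0, 0, 0, 1, 1]
```

For a question-answering input, the question would get segment 0 and the passage segment 1, which is how the model knows which part to extract the answer from.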


Padding and Attention Masks

All sequences in a batch must be padded to the same length before entering BERT.

Shorter sentences are padded using:

[PAD]

An attention mask tells BERT:

  • Which tokens are real
  • Which tokens are padding
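Padding and the attention mask are produced together, since the mask simply records which positions are real. A minimal sketch (`pad_and_mask` is an illustrative name):

```python
def pad_and_mask(tokens, max_len, pad_token="[PAD]"):
    """Pad a token list to max_len and build the matching attention mask."""
    num_pad = max_len - len(tokens)
    padded = tokens + [pad_token] * num_pad
    mask = [1] * len(tokens) + [0] * num_pad  # 1 = real token, 0 = padding
    return padded, mask

tokens = ["[CLS]", "hello", "world", "[SEP]"]
padded, mask = pad_and_mask(tokens, max_len=6)
print(padded)  # ['[CLS]', 'hello', 'world', '[SEP]', '[PAD]', '[PAD]']
print(mask)    # [1, 1, 1, 1, 0, 0]
```

Inside the model, positions with mask 0 are excluded from attention, so [PAD] tokens never influence the representations of real tokens.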

Complete BERT Input Representation

Each token entering BERT has:

  • Token embedding
  • Position embedding
  • Segment embedding

These three are added together before entering the encoder layers.
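The "added together" step is a simple elementwise sum. The sketch below uses a toy 4-dimensional example with made-up numbers (real BERT-base embeddings are 768-dimensional and learned during training):

```python
# Toy embeddings for one token position (illustrative values only).
token_emb    = [0.2, -0.1, 0.5, 0.0]   # from the token's vocabulary entry
position_emb = [0.1,  0.1, 0.0, 0.3]   # from the token's position in the sequence
segment_emb  = [0.0,  0.2, 0.1, 0.1]   # from the token's segment (A or B)

# Elementwise sum: this vector is what the first encoder layer actually sees.
input_emb = [t + p + s for t, p, s in zip(token_emb, position_emb, segment_emb)]
print(input_emb)
```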


Why Tokenization Affects Model Performance

Good tokenization improves:

  • Generalization
  • Handling of rare words
  • Model efficiency

Poor tokenization leads to:

  • Loss of meaning
  • Incorrect predictions

Practice Questions

Q1. What tokenization method does BERT use?

WordPiece tokenization.

Q2. What does the prefix “##” indicate?

It indicates a subword that continues the previous token.

Quick Quiz

Q1. Which token represents the whole sentence?

[CLS]

Q2. Why is padding required?

Because BERT requires fixed-length input sequences.

Homework / Assignment

Conceptual:

  • Explain WordPiece tokenization with your own example
  • List all special tokens used by BERT

Preparation:

  • Next lesson: Fine-Tuning BERT
  • Revise BERT architecture and token flow

Quick Recap

  • BERT uses WordPiece tokenization
  • Subwords help handle rare and unknown words
  • Special tokens structure the input
  • Tokenization directly affects model performance

Next lesson: Fine-Tuning BERT