Generative AI Course
BERT Architecture: Encoder-Only Models Explained
Not every AI problem is about generating text.
Many real-world tasks require deep understanding rather than free-form generation.
This is where encoder-only architectures like BERT excel.
Why BERT Was Created
Before BERT, language models mostly learned from left-to-right context.
That left-to-right approach limited understanding because:
- Context to the right of a token was ignored
- Representations were asymmetric: early tokens saw almost no context
- A word's meaning often depends on what comes after it, not just before
BERT changed this by learning from both directions at once.
The Core Idea Behind Encoder-Only Models
Encoder-only models focus entirely on understanding.
They read the whole sequence at once and build rich representations.
No generation happens inside the model itself.
Instead, outputs are used for downstream tasks.
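A typical downstream use is to take the encoder's output for the [CLS] token and feed it to a small classification head. The sketch below illustrates this with random toy vectors (hidden size 4 instead of BERT's 768); the weights and shapes are illustrative assumptions, not real model values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder output: 6 tokens, hidden size 4 (BERT-base uses 768).
# Position 0 is the [CLS] token, whose vector summarizes the sequence.
hidden_states = rng.normal(size=(6, 4))

# Downstream task: a linear classification head on top of [CLS].
num_labels = 2
W = rng.normal(size=(num_labels, 4))
b = np.zeros(num_labels)

cls_vector = hidden_states[0]      # take the [CLS] representation
logits = W @ cls_vector + b        # one score per label
predicted_label = int(np.argmax(logits))
```

The encoder itself never generates tokens; the head on top turns its representation into a label.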
Bidirectional Attention Explained
In BERT, every token can attend to:
- Tokens before it
- Tokens after it
This allows the model to understand context holistically.
Meaning is shaped by the entire sentence.
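The difference in attention scope can be made concrete with boolean masks. This small sketch contrasts a bidirectional (BERT-style) mask, where every token sees every position, with a causal (GPT-style) mask, where a token sees only itself and earlier positions:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
n = len(tokens)

# Bidirectional attention (BERT): every token attends to every position.
bidirectional = [[True] * n for _ in range(n)]

# Causal attention (GPT-style): token i attends only to positions <= i.
causal = [[j <= i for j in range(n)] for i in range(n)]

# How many positions "sat" (index 2) can attend to under each scheme:
print(sum(bidirectional[2]))  # 6 -- the whole sentence
print(sum(causal[2]))         # 3 -- only "The", "cat", "sat"
```

Under the bidirectional mask, the representation of "sat" is informed by "mat" as well, which is exactly the holistic context BERT exploits.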
Why This Matters in Practice
Bidirectional attention improves:
- Sentence classification
- Named entity recognition
- Semantic similarity
These tasks depend on full context, not prediction.
Thinking Like an Engineer: When to Use BERT
Engineers choose encoder-only models when:
- Input is known upfront
- Output is a label or embedding
- Generation is not required
This is common in search, ranking, and analytics systems.
BERT Training Objectives
BERT is pre-trained using two core objectives.
The first is Masked Language Modeling (MLM).
Roughly 15% of the input tokens are hidden, and the model learns to predict the originals from the surrounding context.
Why Masking Works
Masking forces the model to:
- Use both left and right context
- Build deeper representations
- Avoid trivial sequence learning
This creates strong semantic embeddings.
Minimal Masking Example
This example shows how masking looks conceptually.
sentence = ["The", "cat", "sat", "on", "the", "mat"]
masked = ["The", "[MASK]", "sat", "on", "the", "mat"]
The model predicts the masked word using full context.
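In BERT's actual recipe, of the tokens selected for masking, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The sketch below implements that procedure; the toy vocabulary and helper name are illustrative assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking: of the selected tokens, 80% become [MASK],
    10% become a random token, and 10% stay unchanged."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)       # the model must predict the original
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")
            elif roll < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)    # kept as-is, but still predicted
        else:
            targets.append(None)      # no prediction loss at this position
            masked.append(tok)
    return masked, targets

sentence = ["The", "cat", "sat", "on", "the", "mat"]
masked, targets = mask_tokens(sentence, vocab=["dog", "ran", "hat", "tree"])
```

Keeping some selected tokens unchanged or randomized prevents the model from treating [MASK] itself as the only signal, since [MASK] never appears at inference time.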
The Second Objective: Sentence Relationships
BERT also learns relationships between sentences through Next Sentence Prediction (NSP): given a pair of sentences, it predicts whether the second actually follows the first in the original text.
This helps with:
- Question answering
- Document understanding
- Contextual reasoning
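NSP training pairs are built so that half the time the second sentence really follows the first, and half the time it is drawn at random. A minimal sketch of that pair construction (the function name and toy document are illustrative assumptions):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) training pairs:
    50% of the time b really follows a, 50% b is a random sentence."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            pairs.append((sentences[i], rng.choice(sentences), False))
    return pairs

doc = ["The cat sat.", "It was tired.", "The sun rose.", "Birds sang."]
pairs = make_nsp_pairs(doc)
```

The binary is_next label is what the model learns to predict from the pair's joint representation.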
Architecture Structure
BERT consists of stacked Transformer encoder layers.
Each layer includes:
- Self-attention
- Feed-forward networks
- Layer normalization
Minimal Encoder Stack Example
This skeleton shows how an encoder stack is constructed.
import torch.nn as nn

# One encoder block: self-attention + feed-forward + layer normalization.
# Sizes match BERT-base: hidden size 768, 12 attention heads.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,
    nhead=12,
    dim_feedforward=3072,  # BERT-base uses a 3072-wide feed-forward layer
)

# BERT-base stacks 12 identical encoder layers.
bert_encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
This structure processes entire sequences simultaneously.
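Running a dummy batch through such a stack makes the "simultaneous" processing visible: the output has one vector per input token, all computed in a single forward pass. The sketch below uses a 2-layer stack to keep it fast (BERT-base uses 12 layers):

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12)
# A smaller 2-layer stack for illustration; BERT-base uses num_layers=12.
bert_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Default layout is (sequence_length, batch_size, hidden_size).
dummy = torch.randn(6, 1, 768)  # one 6-token sequence
with torch.no_grad():
    out = bert_encoder(dummy)

# The output keeps the input shape: one 768-dim vector per token.
```

Every token's output vector already reflects the entire sequence; there is no step-by-step decoding loop.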
Why BERT Does Not Generate Text
BERT has no causal mask: every token attends to the full sequence during both training and inference.
Because it was never trained to predict the next token from left context alone, it cannot generate text autoregressively.
Its strength lies in understanding, not generation.
How BERT Is Used in Real Systems
- Search relevance ranking
- Semantic similarity
- Content classification
- Embedding generation
Many GenAI pipelines still rely on BERT-style encoders.
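For semantic similarity, the sentence embeddings a BERT-style encoder produces are typically compared with cosine similarity. A self-contained sketch using toy 3-dimensional vectors (real BERT embeddings have 768 dimensions; the values here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "sentence embeddings": the query should match doc_similar.
query = [0.9, 0.1, 0.0]
doc_similar = [0.8, 0.2, 0.1]
doc_unrelated = [0.0, 0.1, 0.9]

print(cosine_similarity(query, doc_similar) >
      cosine_similarity(query, doc_unrelated))  # True
```

This ranking step is the core of embedding-based search: encode once, then compare vectors cheaply.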
BERT vs GPT: Mental Model
A useful way to remember the difference:
- BERT reads everything, then thinks
- GPT thinks as it writes
Both are essential in modern AI systems.
Common Learner Mistakes
- Trying to use BERT for generation
- Ignoring masking during training
- Assuming bidirectionality is optional
Encoder-only models solve a different class of problems.
Practice
What type of attention does BERT use?
What is BERT primarily designed for?
Which training technique hides tokens?
Quick Quiz
BERT belongs to which architecture type?
Main training objective of BERT?
Why is bidirectional context important?
Recap: BERT uses encoder-only, bidirectional attention to build deep language understanding.
Next up: GPT Architecture — a deeper look at how decoder-only models are built internally.