GenAI Lesson 34 – BERT | Dataplexa

BERT Architecture: Encoder-Only Models Explained

Not every AI problem is about generating text.

Many real-world tasks require deep understanding rather than free-form generation.

This is where encoder-only architectures like BERT excel.

Why BERT Was Created

Before BERT, language models mostly learned from left-to-right context.

That approach limited understanding because:

  • Future context was ignored
  • Representations were asymmetric: a token only saw what came before it
  • Ambiguous words could not be resolved by later context

BERT changed this by learning from both directions at once.

The Core Idea Behind Encoder-Only Models

Encoder-only models focus entirely on understanding.

They read the whole sequence at once and build rich representations.

No generation happens inside the model itself.

Instead, outputs are used for downstream tasks.

Bidirectional Attention Explained

In BERT, every token can attend to:

  • Tokens before it
  • Tokens after it

This allows the model to understand context holistically.

Meaning is shaped by the entire sentence.

Why This Matters in Practice

Bidirectional attention improves:

  • Sentence classification
  • Named entity recognition
  • Semantic similarity

These tasks depend on understanding the full context, not on predicting the next token.

Thinking Like an Engineer: When to Use BERT

Engineers choose encoder-only models when:

  • Input is known upfront
  • Output is a label or embedding
  • Generation is not required

This is common in search, ranking, and analytics systems.
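When the output is a label, the usual pattern is a small classification head on top of the encoder's output. Here is a minimal sketch; the class name, hidden size of 768, and three-label setup are illustrative assumptions, and the dummy tensor stands in for real encoder outputs.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Hypothetical sketch: map encoder outputs to class logits."""

    def __init__(self, hidden_size=768, num_labels=3):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, encoder_output):
        # Use the first token's vector ([CLS] in BERT) as a
        # summary of the whole sequence.
        cls_vector = encoder_output[:, 0, :]
        return self.linear(self.dropout(cls_vector))

head = ClassifierHead()
dummy = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
logits = head(dummy)
print(logits.shape)               # torch.Size([2, 3])
```

The same encoder output can instead be pooled into an embedding, which is why one model serves both labeling and similarity tasks.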

BERT Training Objectives

BERT is trained using two core objectives.

The first is Masked Language Modeling.

A random subset of tokens (about 15% in the original paper) is hidden, and the model learns to predict them.

Why Masking Works

Masking forces the model to:

  • Use both left and right context
  • Build deeper representations
  • Avoid trivial sequence learning

This creates strong semantic embeddings.

Minimal Masking Example

This example shows how masking looks conceptually.


sentence = ["The", "cat", "sat", "on", "the", "mat"]
masked = ["The", "[MASK]", "sat", "on", "the", "mat"]

The model predicts the masked word using full context.
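In the original BERT recipe, a selected token is not always replaced by [MASK]: 80% of the time it becomes [MASK], 10% a random token, and 10% it is left unchanged. A runnable sketch of that scheme (the function name, tiny vocabulary, and fixed seed are illustrative):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Sketch of BERT-style masking: a selected token becomes [MASK]
    80% of the time, a random token 10%, and stays unchanged 10%."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)            # model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")
            elif roll < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            targets.append(None)           # no prediction needed here
            masked.append(tok)
    return masked, targets

sentence = ["The", "cat", "sat", "on", "the", "mat"]
masked, targets = mask_tokens(sentence, vocab=["dog", "ran", "hat"])
print(masked)
```

Leaving some selected tokens unchanged keeps the model from assuming every [MASK]-free position is already correct.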

The Second Objective: Next Sentence Prediction

BERT's second objective, Next Sentence Prediction (NSP), teaches it to judge whether one sentence naturally follows another.

This helps with:

  • Question answering
  • Document understanding
  • Contextual reasoning
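NSP training data pairs each sentence with either its true successor or a random sentence. A simplified sketch (the function name and tiny corpus are illustrative; the original setup draws negative examples from other documents):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Sketch of NSP data: half the pairs are real consecutive
    sentences (label 1), half use a random second sentence (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))      # IsNext
        else:
            pairs.append((sentences[i], rng.choice(sentences), 0)) # NotNext
    return pairs

docs = ["The cat sat.", "It was tired.", "Rain fell.", "The mat was warm."]
for a, b, label in make_nsp_pairs(docs):
    print(label, a, "||", b)
```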

Architecture Structure

BERT consists of stacked Transformer encoder layers.

Each layer includes:

  • Self-attention
  • Feed-forward networks
  • Layer normalization

Minimal Encoder Stack Example

This skeleton shows how an encoder stack is constructed.


import torch.nn as nn

# One encoder block: self-attention + feed-forward + normalization.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,   # hidden size used by BERT-base
    nhead=12       # attention heads used by BERT-base
)

# BERT-base stacks 12 identical encoder layers.
bert_encoder = nn.TransformerEncoder(
    encoder_layer,
    num_layers=12
)

This structure processes entire sequences simultaneously.
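To see that simultaneity concretely, here is a runnable sketch that feeds dummy token embeddings through a small encoder stack. The shapes and the two-layer depth are assumptions to keep it fast; `batch_first=True` makes the tensors read as (batch, seq_len, hidden).

```python
import torch
import torch.nn as nn

# Small stand-in for the skeleton above: same layer type, fewer layers.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, batch_first=True
)
bert_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

embeddings = torch.randn(2, 10, 768)   # pretend token embeddings
contextual = bert_encoder(embeddings)  # every position attends to every other
print(contextual.shape)                # torch.Size([2, 10, 768])
```

The output keeps the input shape: each position's vector is rewritten in context, not reduced or extended.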

Why BERT Does Not Generate Text

BERT has no causal mask: during training and inference, every position can see the full sequence, including future tokens.

Because of that, it was never trained to predict the next token, so it cannot generate text autoregressively.

Its strength lies in understanding, not generation.
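The difference is visible in the attention masks themselves. A causal (GPT-style) mask is lower-triangular, so position i only sees positions up to i; BERT's full attention is an all-ones matrix. A small sketch:

```python
import torch

seq_len = 5

# GPT-style causal mask: 1 where attention is allowed.
# Row i can attend only to columns 0..i.
causal = torch.tril(torch.ones(seq_len, seq_len))

# BERT-style full attention: every position sees every other.
full = torch.ones(seq_len, seq_len)

print(causal)
```

Zeros above the diagonal are exactly what BERT removes, which is why it understands in both directions but cannot write forward.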

How BERT Is Used in Real Systems

  • Search relevance ranking
  • Semantic similarity
  • Content classification
  • Embedding generation

Many GenAI pipelines still rely on BERT-style encoders.
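For embedding-based uses like semantic similarity, encoder outputs are typically pooled into one vector per sentence and compared. A hedged sketch with random tensors standing in for real encoder outputs (mean pooling is one common choice; production systems often weight by the attention mask):

```python
import torch
import torch.nn.functional as F

def mean_pool(token_embeddings):
    """Average token vectors into a single sentence embedding."""
    return token_embeddings.mean(dim=1)

# Pretend encoder outputs for two sentences: (batch=1, seq_len, hidden)
a = mean_pool(torch.randn(1, 8, 768))
b = mean_pool(torch.randn(1, 8, 768))

similarity = F.cosine_similarity(a, b)
print(similarity.shape)   # one score per sentence pair
```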

BERT vs GPT: Mental Model

A useful way to remember the difference:

  • BERT reads everything, then thinks
  • GPT thinks as it writes

Both are essential in modern AI systems.

Common Learner Mistakes

  • Trying to use BERT for generation
  • Ignoring masking during training
  • Assuming bidirectionality is optional

Encoder-only models solve a different class of problems.

Practice

What type of attention does BERT use?



What is BERT primarily designed for?



Which training technique hides tokens?



Quick Quiz

BERT belongs to which architecture type?





Main training objective of BERT?





Why is bidirectional context important?





Recap: BERT uses encoder-only, bidirectional attention to build deep language understanding.

Next up: GPT Architecture — a deeper look at how decoder-only models are built internally.