GenAI Lesson 34 – BERT | Dataplexa

BERT Architecture: Encoder-Only Models Explained

Not every AI problem is about generating text.

Many real-world tasks require deep understanding rather than free-form generation.

This is where encoder-only architectures like BERT excel.

Why BERT Was Created

Before BERT, language models mostly learned from left-to-right context.

That approach limited understanding because:

  • Future context was ignored
  • Representations were asymmetric: a token only saw what came before it
  • Ambiguous words could not be resolved by later context

BERT changed this by learning from both directions at once.

The Core Idea Behind Encoder-Only Models

Encoder-only models focus entirely on understanding.

They read the whole sequence at once and build rich representations.

No generation happens inside the model itself.

Instead, outputs are used for downstream tasks.

Bidirectional Attention Explained

In BERT, every token can attend to:

  • Tokens before it
  • Tokens after it

This allows the model to understand context holistically.

Meaning is shaped by the entire sentence.

Why This Matters in Practice

Bidirectional attention improves:

  • Sentence classification
  • Named entity recognition
  • Semantic similarity

These tasks depend on understanding the full context, not on predicting the next token.

Thinking Like an Engineer: When to Use BERT

Engineers choose encoder-only models when:

  • Input is known upfront
  • Output is a label or embedding
  • Generation is not required

This is common in search, ranking, and analytics systems.
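When the output is a label, the usual pattern is a small classification head on top of the encoder's output. Here is a minimal sketch; the class name, hidden size of 768, and three-label setup are illustrative assumptions, and the dummy tensor stands in for real encoder outputs.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Hypothetical sketch: map encoder outputs to class logits."""

    def __init__(self, hidden_size=768, num_labels=3):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, encoder_output):
        # Use the first token's vector ([CLS] in BERT) as a
        # summary of the whole sequence.
        cls_vector = encoder_output[:, 0, :]
        return self.linear(self.dropout(cls_vector))

head = ClassifierHead()
dummy = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
logits = head(dummy)
print(logits.shape)               # torch.Size([2, 3])
```

The same encoder output can instead be pooled into an embedding, which is why one model serves both labeling and similarity tasks.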

BERT Training Objectives

BERT is trained using two core objectives.

The first is Masked Language Modeling.

A random subset of tokens (about 15% in the original paper) is hidden, and the model learns to predict them.

Why Masking Works

Masking forces the model to:

  • Use both left and right context
  • Build deeper representations
  • Avoid trivial sequence learning

This creates strong semantic embeddings.

Minimal Masking Example

This example shows how masking looks conceptually.


sentence = ["The", "cat", "sat", "on", "the", "mat"]
masked = ["The", "[MASK]", "sat", "on", "the", "mat"]

The model predicts the masked word using full context.
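In the original BERT recipe, a selected token is not always replaced by [MASK]: 80% of the time it becomes [MASK], 10% a random token, and 10% it is left unchanged. A runnable sketch of that scheme (the function name, tiny vocabulary, and fixed seed are illustrative):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Sketch of BERT-style masking: a selected token becomes [MASK]
    80% of the time, a random token 10%, and stays unchanged 10%."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)            # model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")
            elif roll < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            targets.append(None)           # no prediction needed here
            masked.append(tok)
    return masked, targets

sentence = ["The", "cat", "sat", "on", "the", "mat"]
masked, targets = mask_tokens(sentence, vocab=["dog", "ran", "hat"])
print(masked)
```

Leaving some selected tokens unchanged keeps the model from assuming every [MASK]-free position is already correct.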

The Second Objective: Next Sentence Prediction

BERT's second objective, Next Sentence Prediction (NSP), teaches it to judge whether one sentence naturally follows another.

This helps with:

  • Question answering
  • Document understanding
  • Contextual reasoning
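NSP training data pairs each sentence with either its true successor or a random sentence. A simplified sketch (the function name and tiny corpus are illustrative; the original setup draws negative examples from other documents):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Sketch of NSP data: half the pairs are real consecutive
    sentences (label 1), half use a random second sentence (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))      # IsNext
        else:
            pairs.append((sentences[i], rng.choice(sentences), 0)) # NotNext
    return pairs

docs = ["The cat sat.", "It was tired.", "Rain fell.", "The mat was warm."]
for a, b, label in make_nsp_pairs(docs):
    print(label, a, "||", b)
```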

Architecture Structure

BERT consists of stacked Transformer encoder layers.

Each layer includes:

  • Self-attention
  • Feed-forward networks
  • Layer normalization

Minimal Encoder Stack Example

This skeleton shows how an encoder stack is constructed.


import torch.nn as nn

# One encoder block: self-attention + feed-forward + normalization.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,   # hidden size used by BERT-base
    nhead=12       # attention heads used by BERT-base
)

# BERT-base stacks 12 identical encoder layers.
bert_encoder = nn.TransformerEncoder(
    encoder_layer,
    num_layers=12
)

This structure processes entire sequences simultaneously.
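To see that simultaneity concretely, here is a runnable sketch that feeds dummy token embeddings through a small encoder stack. The shapes and the two-layer depth are assumptions to keep it fast; `batch_first=True` makes the tensors read as (batch, seq_len, hidden).

```python
import torch
import torch.nn as nn

# Small stand-in for the skeleton above: same layer type, fewer layers.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, batch_first=True
)
bert_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

embeddings = torch.randn(2, 10, 768)   # pretend token embeddings
contextual = bert_encoder(embeddings)  # every position attends to every other
print(contextual.shape)                # torch.Size([2, 10, 768])
```

The output keeps the input shape: each position's vector is rewritten in context, not reduced or extended.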

Why BERT Does Not Generate Text

BERT has no causal mask: during training and inference, every position can see the full sequence, including future tokens.

Because of that, it was never trained to predict the next token, so it cannot generate text autoregressively.

Its strength lies in understanding, not generation.
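The difference is visible in the attention masks themselves. A causal (GPT-style) mask is lower-triangular, so position i only sees positions up to i; BERT's full attention is an all-ones matrix. A small sketch:

```python
import torch

seq_len = 5

# GPT-style causal mask: 1 where attention is allowed.
# Row i can attend only to columns 0..i.
causal = torch.tril(torch.ones(seq_len, seq_len))

# BERT-style full attention: every position sees every other.
full = torch.ones(seq_len, seq_len)

print(causal)
```

Zeros above the diagonal are exactly what BERT removes, which is why it understands in both directions but cannot write forward.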

How BERT Is Used in Real Systems

  • Search relevance ranking
  • Semantic similarity
  • Content classification
  • Embedding generation

Many GenAI pipelines still rely on BERT-style encoders.
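For embedding-based uses like semantic similarity, encoder outputs are typically pooled into one vector per sentence and compared. A hedged sketch with random tensors standing in for real encoder outputs (mean pooling is one common choice; production systems often weight by the attention mask):

```python
import torch
import torch.nn.functional as F

def mean_pool(token_embeddings):
    """Average token vectors into a single sentence embedding."""
    return token_embeddings.mean(dim=1)

# Pretend encoder outputs for two sentences: (batch=1, seq_len, hidden)
a = mean_pool(torch.randn(1, 8, 768))
b = mean_pool(torch.randn(1, 8, 768))

similarity = F.cosine_similarity(a, b)
print(similarity.shape)   # one score per sentence pair
```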

BERT vs GPT: Mental Model

A useful way to remember the difference:

  • BERT reads everything, then thinks
  • GPT thinks as it writes

Both are essential in modern AI systems.

Common Learner Mistakes

  • Trying to use BERT for generation
  • Ignoring masking during training
  • Assuming bidirectionality is optional

Encoder-only models solve a different class of problems.

Practice

What type of attention does BERT use?



What is BERT primarily designed for?



Which training technique hides tokens?



Quick Quiz

BERT belongs to which architecture type?





Main training objective of BERT?





Why is bidirectional context important?





Recap: BERT uses encoder-only, bidirectional attention to build deep language understanding.

Next up: GPT Architecture — a deeper look at how decoder-only models are built internally.