
Lesson 72: BERT Basics

In the previous lesson, we learned how transformers changed Natural Language Processing by using attention and parallel processing. In this lesson, we focus on one of the most important transformer-based models ever created — BERT.

BERT is designed to understand language deeply. Instead of predicting the next word the way autoregressive models such as GPT do, BERT focuses on understanding the meaning of a piece of text as a whole.

Real-World Connection

Whenever Google understands your search query better, suggests the right result, or highlights the exact answer inside a long article, a BERT-style model is often working behind the scenes. Google began using BERT in Search in 2019 to better interpret the intent behind queries.

BERT is widely used in search engines, question-answering systems, document classification, and enterprise NLP tools.

What Is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It is an encoder-only transformer model that reads text in both directions — left to right and right to left — at the same time.

This bidirectional reading allows BERT to understand context more accurately than earlier models.

  • Uses transformer encoder architecture
  • Understands full sentence context
  • Pretrained on massive text data
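
To make this concrete, the short sketch below loads the pretrained encoder with the Hugging Face transformers library and shows that BERT turns every token into a contextual vector (768 dimensions for bert-base-uncased). The example sentence is just an illustration.

from transformers import BertTokenizer, BertModel
import torch

# Load the pretrained tokenizer and encoder
# (bert-base-uncased: 12 encoder layers, hidden size 768, 12 attention heads)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the encoder
inputs = tokenizer("The bank was crowded.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token (including [CLS] and [SEP]), each of size 768
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])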

Why Bidirectional Understanding Matters

Consider the sentence: “The bank was crowded.” Without context, the word “bank” is ambiguous. BERT reads surrounding words on both sides to understand whether it refers to a financial institution or a river bank.

This bidirectional context is the key reason BERT outperforms traditional NLP models.
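
One way to see this in code is to compare the contextual vectors BERT produces for the word "bank" in different sentences. The sketch below is illustrative: the example sentences are assumptions and the exact similarity values vary, but the two financial uses of "bank" typically end up closer to each other than to the river use.

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual embedding of the token "bank" in the given sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

money1 = bank_vector("He deposited money at the bank.")
money2 = bank_vector("She opened an account at the bank.")
river = bank_vector("They sat on the grassy bank of the river.")

cos = torch.nn.functional.cosine_similarity
print(cos(money1, money2, dim=0).item())  # usually higher: same financial sense
print(cos(money1, river, dim=0).item())   # usually lower: different sense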

How BERT Is Trained

BERT is pretrained using two main tasks:

  • Masked Language Modeling (MLM): Some input tokens are hidden behind a [MASK] token, and the model predicts the original words
  • Next Sentence Prediction (NSP): Given two sentences, the model predicts whether the second one actually follows the first

These tasks teach BERT grammar, semantics, and sentence-level understanding.
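
You can try masked language modeling yourself through the fill-mask pipeline in the transformers library, which exposes BERT's pretrained MLM head. A minimal sketch, with an illustrative sentence (the exact predictions depend on the model):

from transformers import pipeline

# Load BERT together with its pretrained masked language modeling head
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words on both sides of [MASK] to predict the hidden token
predictions = fill_mask("He deposited the money at the [MASK].")

# Show the top three guesses with their confidence scores
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))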

Simple BERT Example

Below is a basic example of loading a pretrained BERT model into a Hugging Face text-classification pipeline. One caveat: bert-base-uncased is the raw pretrained encoder, so the pipeline attaches a new, randomly initialized classification head on top of it. The code runs, but its labels are placeholders until the model is fine-tuned, which is the topic of the next lesson.


from transformers import pipeline

# Build a text-classification pipeline on top of the pretrained BERT encoder.
# Because bert-base-uncased has no fine-tuned classification head, transformers
# warns that a new head has been initialized with random weights.
classifier = pipeline("text-classification", model="bert-base-uncased")

text = "Dataplexa makes learning AI simple and practical"
result = classifier(text)

print(result)

Example output (the label and score will vary, because the classification head is still untrained):

[{'label': 'LABEL_0', 'score': 0.53}]

Understanding the Code

The pipeline downloads the pretrained BERT encoder, tokenizes the input sentence, and passes it through the stacked transformer encoder layers (12 of them in bert-base).

Inside each layer, every token attends to every other token, which is how BERT builds up meaning in context. The output is a predicted label with a confidence score. To get meaningful labels such as POSITIVE or NEGATIVE rather than the generic LABEL_0 and LABEL_1, the classification head must first be fine-tuned on labeled examples, which is covered in the next lesson.
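
If you want to inspect that attention directly, the encoder can return its attention weights. Here is a minimal sketch, assuming bert-base-uncased (12 layers, 12 heads):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Dataplexa makes learning AI simple and practical", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# One attention tensor per encoder layer, shaped (batch, heads, tokens, tokens)
print(len(outputs.attentions))      # 12 layers
print(outputs.attentions[0].shape)  # torch.Size([1, 12, sequence_length, sequence_length])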

What BERT Is Best At

  • Text classification
  • Question answering (see the sketch after this list)
  • Named Entity Recognition
  • Semantic search
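
As a taste of the second task in this list, the sketch below runs extractive question answering with a publicly available BERT checkpoint fine-tuned on SQuAD. The checkpoint name and the question and context are assumptions chosen for illustration.

from transformers import pipeline

# A BERT model fine-tuned on the SQuAD question-answering dataset
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="What does BERT stand for?",
    context="BERT stands for Bidirectional Encoder Representations from Transformers, "
            "an encoder-only model introduced by Google in 2018.",
)

# The answer is a span copied out of the context, with a confidence score
print(result["answer"], result["score"])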

BERT vs GPT (High Level)

BERT focuses on understanding text, while GPT focuses on generating text. BERT is encoder-only, whereas GPT is decoder-only.

Because of this, BERT is usually chosen for tasks where understanding is more important than generation.

Limitations of BERT

  • Not designed to generate text
  • Computationally expensive to train and run compared with smaller models
  • Limited input length (512 tokens by default; see the sketch below)
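
The input-length limit is easy to see with the tokenizer. In the sketch below, the repeated sentence is just a convenient way to exceed the 512-token limit:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Build a text that is far longer than BERT's 512-token limit
long_text = "BERT reads text with bidirectional attention. " * 200

# Anything past max_length is simply cut off when truncation is enabled
ids = tokenizer(long_text, truncation=True, max_length=512)["input_ids"]
print(len(ids))  # 512, including the [CLS] and [SEP] special tokens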

Practice Questions

Practice 1: How does BERT read text?



Practice 2: Which training task hides words during training?



Practice 3: BERT is based on which transformer component?



Quick Quiz

Quiz 1: Which model is mainly used for text understanding?





Quiz 2: What does BERT predict during masked language modeling?





Quiz 3: What type of transformer architecture does BERT use?





Coming up next: BERT Fine-Tuning — adapting pretrained models to real-world tasks.