DL Lesson 57 – GPT Overview

GPT Overview

GPT stands for Generative Pre-trained Transformer. It is a family of transformer-based models designed primarily for language generation.

Unlike BERT, which focuses on understanding text, GPT focuses on producing fluent, coherent, and context-aware text.

This difference in objective leads to major architectural and behavioral differences.


Core Idea Behind GPT

GPT models are trained to predict the next word given a sequence of previous words.

This simple objective enables powerful behaviors such as:

• Text generation
• Dialogue systems
• Code generation
• Summarization
• Story writing

At its heart, GPT learns how language flows.
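A minimal sketch of this objective, assuming PyTorch and the Hugging Face transformers library are installed: passing the input sequence as its own label makes the model compute the standard next-token cross-entropy loss (the shift by one position happens internally).

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The labels are the input itself; the model shifts them by one position
# and scores how well it predicts each "next word"
input_ids = tokenizer.encode("Deep learning models are powerful tools", return_tensors="pt")
outputs = model(input_ids, labels=input_ids)

print(outputs.loss)  # average next-token cross-entropy for this sentence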


Autoregressive Language Modeling

GPT uses an autoregressive approach.

This means each token is predicted based only on the tokens that come before it.

Input:  "Deep learning models are"
Output: "powerful tools"

The model never sees future words during prediction. This forces it to learn strong sequential patterns.
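The generation loop can be written out explicitly. This is a rough illustration using greedy decoding (always picking the most likely token), assuming PyTorch and the transformers library; it is a sketch of the idea, not how generation is implemented in practice.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer.encode("Deep learning models are", return_tensors="pt")

# Generate 5 tokens: at every step the model only sees the prefix so far
with torch.no_grad():
    for _ in range(5):
        logits = model(ids).logits            # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()      # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))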


Decoder-Only Transformer Architecture

GPT uses only the decoder side of the Transformer.

There is no encoder stack, so the cross-attention layers that would normally read from an encoder are dropped as well.

Masked self-attention ensures that each token can only attend to earlier tokens in the sequence.

This design makes GPT ideal for generation but less suited for bidirectional understanding.
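The mask itself is simply a lower-triangular matrix applied to the attention scores. A small PyTorch illustration:

import torch

# Causal mask for a sequence of 5 tokens:
# position i may attend to positions 0..i and nothing after it
seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len))
print(mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])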


How GPT Differs from BERT

The difference between GPT and BERT is not just architectural, but philosophical.

BERT asks: "What does this text mean?"

GPT asks: "What comes next?"

Because of this:

• BERT excels at comprehension tasks
• GPT excels at generation tasks


Pretraining at Massive Scale

GPT models are pretrained on extremely large corpora.

These include:

• Books
• Articles
• Code
• Web text

This allows the model to internalize grammar, facts, reasoning patterns, and even programming logic.


Tokenization in GPT

GPT does not operate on words directly.

Instead, text is split into tokens, which may represent words, subwords, or characters.

"unbelievable" → ["un", "believ", "able"]

This token-based approach improves flexibility and handles rare words efficiently.
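You can inspect an actual split with the GPT-2 tokenizer. This is a quick sketch; the real byte-pair-encoding pieces may differ from the simplified example above, and a word preceded by a space is tokenized differently from one at the start of the text.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Subword pieces and their token IDs; the exact split depends on the
# learned BPE vocabulary
print(tokenizer.tokenize("unbelievable"))
print(tokenizer.encode("unbelievable"))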


Using a Pretrained GPT Model

In practice, GPT models are loaded with pretrained weights and optionally fine-tuned.

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the pretrained GPT-2 tokenizer and language-modeling head
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode the prompt into token IDs as a PyTorch tensor
input_ids = tokenizer.encode(
    "Deep learning will change",
    return_tensors="pt"
)

# Generate up to 30 tokens, one token at a time
output = model.generate(
    input_ids,
    max_length=30,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS
)

# Decode the generated token IDs back into text
print(tokenizer.decode(output[0]))

This code demonstrates how GPT generates text autoregressively.
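By default, generate picks the most likely token at every step, which can make outputs repetitive. Sampling parameters produce more varied text; the values below are illustrative only, not tuned recommendations.

output = model.generate(
    input_ids,
    max_length=30,
    do_sample=True,    # sample from the distribution instead of taking the argmax
    top_k=50,          # restrict sampling to the 50 most likely tokens
    temperature=0.8,   # soften or sharpen the distribution
    pad_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(output[0]))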


Why GPT Can Write Code and Reason

GPT does not truly "understand" code or logic.

Instead, it learns patterns from massive examples of structured text.

This allows it to:

• Imitate programming syntax
• Follow logical steps
• Maintain context across long outputs

Scale plays a critical role here.


Strengths of GPT

GPT is extremely strong at:

• Natural language generation
• Conversational systems
• Creative writing
• Code completion

Its autoregressive nature makes outputs feel natural and continuous.


Limitations of GPT

GPT also has limitations:

• No true bidirectional understanding
• Can hallucinate facts
• Sensitive to prompt phrasing

These limitations are partially addressed through instruction tuning and reinforcement learning from human feedback (RLHF).


Exercises

Exercise 1:
Why does GPT use masked self-attention?

To prevent the model from seeing future tokens during generation.

Exercise 2:
Why is GPT better at generation than BERT?

Because GPT is trained autoregressively to predict the next token.

Quick Check

Q: Can GPT be used for classification tasks?

Yes, but it is not as naturally suited as encoder-based models.
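For completeness, the transformers library also provides GPT2ForSequenceClassification, which places a classification head on top of the decoder. The sketch below shows the structure only; the head is randomly initialized and must be fine-tuned on labeled data before its predictions mean anything.

from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.eos_token_id  # GPT-2 has no pad token

inputs = tokenizer("This lesson was very clear.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, 2) — one score per class
print(logits)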

Next, we will explore training large sequence models, including memory, compute constraints, and scaling challenges.