GPT Overview
GPT stands for Generative Pre-trained Transformer. It is a family of transformer-based models designed primarily for language generation.
Unlike BERT, which focuses on understanding text, GPT focuses on producing fluent, coherent, and context-aware text.
This difference in objective leads to major architectural and behavioral differences.
Core Idea Behind GPT
GPT models are trained to predict the next word given a sequence of previous words.
This simple objective enables powerful behaviors such as:
• Text generation
• Dialogue systems
• Code generation
• Summarization
• Story writing
At its heart, GPT learns how language flows.
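Concretely, this next-word objective is just a cross-entropy loss between the model's prediction at each position and the token that actually comes next. Below is a minimal, self-contained PyTorch sketch; the token IDs and the random logits are toy stand-ins for a real batch and a real model.

import torch
import torch.nn.functional as F

# Toy batch of token IDs (illustrative values, not a real vocabulary).
token_ids = torch.tensor([[12, 48, 7, 91, 3]])
vocab_size = 100

# A real GPT produces one logit vector per position; random logits stand in here.
logits = torch.randn(1, token_ids.size(1), vocab_size)

# Next-token prediction: position t is trained to predict the token at t + 1,
# so predictions and targets are shifted by one position.
pred_logits = logits[:, :-1, :]   # predictions for positions 0 .. n-2
targets = token_ids[:, 1:]        # the "next words" at positions 1 .. n-1

loss = F.cross_entropy(
    pred_logits.reshape(-1, vocab_size),
    targets.reshape(-1)
)
print(loss)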
Autoregressive Language Modeling
GPT uses an autoregressive approach.
This means each token is predicted based only on the tokens that come before it.
Input: "Deep learning models are"
Output: "powerful tools"
The model never sees future words during prediction. This forces it to learn strong sequential patterns.
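The prediction loop that follows from this is short to write down. The sketch below performs greedy decoding by hand with the public Hugging Face gpt2 checkpoint: at each step the model scores the whole vocabulary, the most likely token is appended, and the extended sequence is fed back in. Real decoders add sampling and key/value caching on top of this.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Deep learning models are", return_tensors="pt")

# Each new token is chosen using only the tokens generated so far.
for _ in range(5):
    with torch.no_grad():
        logits = model(input_ids).logits            # (1, seq_len, vocab_size)
    next_id = logits[:, -1, :].argmax(dim=-1)       # most likely next token
    input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))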
Decoder-Only Transformer Architecture
GPT uses only the decoder side of the Transformer.
There is no encoder stack, and the cross-attention sublayer that would normally read from an encoder is dropped.
Masked self-attention ensures that each token can only attend to earlier tokens in the sequence.
This design makes GPT ideal for generation but less suited for bidirectional understanding.
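A common way to implement this is a lower-triangular (causal) mask: attention scores for future positions are set to negative infinity before the softmax, so their weights become exactly zero. The sketch below shows only the mask itself, with toy scores in place of real query-key products.

import torch

seq_len = 5

# True on and below the diagonal: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Toy attention scores; a real model computes these from queries and keys.
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~causal_mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)
print(weights)   # each row sums to 1, with zero weight on future positions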
How GPT Differs from BERT
The difference between GPT and BERT is not just architectural, but philosophical.
BERT asks: "What does this text mean?"
GPT asks: "What comes next?"
Because of this:
• BERT excels at comprehension tasks
• GPT excels at generation tasks
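The contrast is easy to see with the two corresponding transformers pipelines. The checkpoints below (bert-base-uncased and gpt2) are simply the standard public ones; any masked-language model and any causal-language model would show the same split.

from transformers import pipeline

# BERT-style objective: fill in a blank anywhere in the sentence.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The cat sat on the [MASK].")[0]["token_str"])

# GPT-style objective: continue the text strictly left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The cat sat on the", max_new_tokens=5)[0]["generated_text"])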
Pretraining at Massive Scale
GPT models are pretrained on extremely large corpora.
These include:
• Books
• Articles
• Code
• Web text
This allows the model to internalize grammar, facts, reasoning patterns, and even programming logic.
Tokenization in GPT
GPT does not operate on words directly.
Instead, text is split into tokens, which may represent words, subwords, or characters.
"unbelievable" → ["un", "believ", "able"]
This token-based approach improves flexibility and handles rare words efficiently.
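You can inspect this directly with the GPT-2 tokenizer. Note that the actual byte-pair pieces depend on the learned vocabulary, so they may not match the illustrative split above exactly.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# The exact pieces come from GPT-2's learned BPE merges.
print(tokenizer.tokenize("unbelievable"))
print(tokenizer.encode("unbelievable"))   # the corresponding integer token IDs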
Using a Pretrained GPT Model
In practice, GPT models are loaded with pretrained weights and optionally fine-tuned.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the pretrained tokenizer and weights for the small GPT-2 checkpoint
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode the prompt into token IDs as a PyTorch tensor
input_ids = tokenizer.encode(
    "Deep learning will change",
    return_tensors="pt"
)

# Generate a continuation, one token at a time, up to 30 tokens in total
output = model.generate(
    input_ids,
    max_length=30,
    pad_token_id=tokenizer.eos_token_id  # avoids the missing-pad-token warning
)

# Turn the generated token IDs back into text
print(tokenizer.decode(output[0]))
This code demonstrates how GPT generates text autoregressively.
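Greedy continuations tend to repeat themselves. generate() also accepts standard sampling options; the snippet below continues the example above and uses only documented parameters, though the particular values are arbitrary.

# Sampling usually produces more varied text than greedy decoding.
output = model.generate(
    input_ids,
    max_length=30,
    do_sample=True,                        # sample from the distribution
    top_k=50,                              # restrict to the 50 most likely tokens
    temperature=0.9,                       # <1 sharpens, >1 flattens the distribution
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(output[0], skip_special_tokens=True))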
Why GPT Can Write Code and Reason
GPT does not truly "understand" code or logic.
Instead, it learns statistical patterns from enormous amounts of structured text.
This allows it to:
• Imitate programming syntax
• Follow logical steps
• Maintain context across long outputs
Scale plays a critical role here.
Strengths of GPT
GPT is extremely strong at:
• Natural language generation
• Conversational systems
• Creative writing
• Code completion
Its autoregressive nature makes outputs feel natural and continuous.
Limitations of GPT
GPT also has limitations:
• No true bidirectional understanding
• Can hallucinate facts
• Sensitive to prompt phrasing
These limitations are partially addressed through instruction tuning and reinforcement learning from human feedback.
Exercises
Exercise 1:
Why does GPT use masked self-attention?
Exercise 2:
Why is GPT better at generation than BERT?
Quick Check
Q: Can GPT be used for classification tasks?
Next, we will explore training large sequence models, including memory and compute constraints as well as scaling challenges.