GenAI Lesson 35 – GPT | Dataplexa

GPT Architecture: Inside Modern Large Language Models

GPT models are built to generate language by predicting what comes next.

Unlike encoder-based models, GPT does not read the whole input bidirectionally before producing a representation.

Understanding emerges naturally as a by-product of learning to predict the next token.

The Core Design Philosophy

GPT follows a simple but powerful rule:

Given everything so far, predict the next token.

This single objective drives all capabilities: reasoning, coding, summarization, and dialogue.

How GPT Processes Input

A GPT model receives:

  • A sequence of tokens
  • Position information
  • A causal attention mask

There is no separation between input and output.

Everything is treated as one growing sequence.
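
These three inputs can be sketched as tensors. The ids and shapes below are illustrative, not from any real tokenizer:

```python
import torch

token_ids = torch.tensor([[101, 2009, 3105]])   # a 3-token sequence (made-up ids)
positions = torch.arange(token_ids.size(1))     # position information: 0, 1, 2
# Causal mask: True marks positions a token is NOT allowed to attend to
causal = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)
```

The prompt and everything generated afterwards live in `token_ids` together, as one growing sequence.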

Causal Self-Attention Explained

Causal attention ensures the model never sees the future.

Each token attends only to:

  • Itself
  • All previous tokens

This is what makes GPT autoregressive.
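
This attention pattern can be visualized as a lower-triangular matrix (the size here is illustrative). Row i shows which positions token i may attend to:

```python
import torch

# Allowed-attention pattern for 4 tokens: 1 = may attend, 0 = blocked
allowed = torch.tril(torch.ones(4, 4, dtype=torch.bool))
print(allowed.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```

Token 0 sees only itself; the last token sees everything before it.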

Why This Matters

Because the model cannot cheat by looking ahead, it must learn genuine language structure to predict well.

This makes generation coherent and controllable.

Thinking Before Writing Code

Before implementing GPT-style models, engineers decide:

  • Maximum context window
  • Embedding dimension
  • Number of layers
  • Number of attention heads

These choices directly affect cost, latency, and model capability.
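
One common way to capture these decisions is a small config object. The class and values below are a hypothetical sketch, roughly at GPT-2-small scale:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    context_window: int = 1024   # maximum sequence length
    embed_dim: int = 768         # embedding dimension
    n_layers: int = 12           # number of transformer blocks
    n_heads: int = 12            # attention heads per block

cfg = GPTConfig()
# Parameter count grows roughly with n_layers * embed_dim**2,
# while attention cost grows with context_window**2.
```

Note that `embed_dim` must divide evenly by `n_heads`, since each head works on a slice of the embedding.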

Minimal GPT Block Structure

This code shows the core building block used repeatedly in GPT.


import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        # Multi-head self-attention; the causal mask is supplied at call time
        self.attn = nn.MultiheadAttention(dim, heads)
        # Position-wise feed-forward network with the usual 4x expansion
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim)
        )
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x, mask):
        # Self-attention over past positions only (mask blocks the future)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)   # residual connection + layer norm
        ff_out = self.ff(x)
        x = self.ln2(x + ff_out)     # residual connection + layer norm
        return x
  

What Happens Inside This Block

Step-by-step behavior:

  • Self-attention mixes past context
  • Causal mask blocks future tokens
  • Residual connections preserve information
  • Feed-forward layers expand representation

This block is stacked dozens of times in real GPT models.
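
A sketch of how such blocks are stacked into a full model follows. It uses a trivial stand-in block so the snippet runs on its own, and toy sizes; real GPT models stack dozens of much wider blocks:

```python
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=100, dim=32, n_layers=4, max_len=16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        # Stand-in for the GPT block: norm + linear + activation
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        pos = torch.arange(tokens.size(1))
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        for block in self.blocks:
            x = x + block(x)      # residual connection around each block
        return self.head(x)       # logits over the vocabulary

model = TinyGPT()
logits = model(torch.randint(0, 100, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 100])
```

The pattern to notice is the `for block in self.blocks` loop: depth comes purely from repeating the same structure.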

Token Prediction Head

After passing through all layers, GPT maps hidden states to vocabulary logits.


# Project hidden states onto the (often weight-tied) token embedding matrix
logits = hidden_states @ vocab_embedding.T   # shape: (seq_len, vocab_size)
  

Each logit is an unnormalized score for one vocabulary token; a softmax turns these scores into next-token probabilities.
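
As a quick illustration of that conversion (the scores are made up):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])   # scores for a 3-token vocabulary
probs = torch.softmax(logits, dim=-1)    # normalize into probabilities
# probs sums to 1, and the largest logit gets the largest probability
```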

How Generation Happens

During inference:

  • One token is predicted
  • That token is appended
  • The process repeats

This loop continues until stopping conditions are met.
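
The loop above can be sketched as follows. This uses greedy decoding and a stand-in "model" (it always predicts token 7) so the snippet runs on its own:

```python
import torch

def generate(model, tokens, max_new_tokens):
    # Autoregressive loop: predict one token, append it, repeat
    for _ in range(max_new_tokens):
        logits = model(tokens)                               # (batch, seq, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

# Stand-in model: one-hot logits that always favor token 7
fake_model = lambda t: torch.nn.functional.one_hot(
    torch.full((t.size(0), t.size(1)), 7), num_classes=10).float()

out = generate(fake_model, torch.tensor([[1, 2, 3]]), max_new_tokens=2)
print(out)  # tensor([[1, 2, 3, 7, 7]])
```

In a real system, stopping conditions (an end-of-sequence token or a length limit) break the loop.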

Sampling Strategies Matter

GPT does not always pick the highest-probability token.

Sampling strategies control creativity.


next_token = sample(logits, temperature=0.8, top_p=0.9)
  

Temperature and top-p trade off diversity against coherence: higher values produce more varied text, lower values more predictable text.
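
The `sample` call above is pseudocode; a minimal sketch of temperature plus top-p (nucleus) sampling might look like this:

```python
import torch

def sample(logits, temperature=0.8, top_p=0.9):
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    probs = torch.softmax(logits / temperature, dim=-1)
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < top_p   # always keeps the top token
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()         # renormalize the nucleus
    return sorted_idx[torch.multinomial(sorted_probs, 1)].item()

logits = torch.tensor([4.0, 3.0, 0.1, -2.0])
token = sample(logits)   # samples 0 or 1; tokens 2 and 3 fall outside the nucleus
```

With a very low temperature and a small top-p, the nucleus shrinks to the single most likely token and sampling becomes effectively greedy.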

Why GPT Is Good at Coding

Code is also a sequence.

GPT learns:

  • Syntax patterns
  • Logical structure
  • Long-range dependencies

This enables autocomplete, refactoring, and debugging.

Real-World Systems Built on GPT

  • Chat assistants
  • Code copilots
  • Document generators
  • AI agents

Most modern GenAI products rely on this architecture.

Limitations to Be Aware Of

  • Hallucinations
  • Context length limits
  • High compute cost

These limitations shape system design choices.

How Learners Should Practice

To internalize GPT architecture:

  • Experiment with small models locally
  • Visualize attention patterns
  • Modify sampling parameters

Practice focuses on understanding behavior, not memorization.

Practice

What does GPT predict at each step?



What type of attention mask does GPT use?



GPT generation is described as what process?



Quick Quiz

GPT belongs to which architecture?





GPT primarily focuses on?





What controls creativity in GPT outputs?





Recap: GPT uses decoder-only, causal attention to generate language token by token.

Next up: Tokenization — how text becomes numbers inside LLMs.