Generative AI Course
Encoder–Decoder Architecture: How Transformers Generate Outputs
Not all generative problems are solved by predicting the next token.
Some tasks require the model to read an entire input, understand it deeply, and then produce a structured output.
This lesson explains how the encoder–decoder architecture enables that behavior, and how engineers decide when to use it.
The Problem This Architecture Solves
Consider tasks like:
- Machine translation
- Summarization
- Question answering
In all these cases:
- Input and output lengths differ
- Output depends on the full input
- Generation must be guided, not free-form
A single-stream, next-token-only model is a poor fit here: these tasks call for reading the whole input before writing any of the output.
Why Encoder and Decoder Are Separated
Engineers separate responsibilities:
- Encoder: Understand the input
- Decoder: Generate the output
This separation makes the system easier to train, debug, and reason about.
What the Encoder Actually Does
The encoder processes the entire input sequence at once.
It produces contextual representations for every input token.
After encoding, the input is no longer raw text: it is a sequence of context-aware vectors, one per input token.
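This behavior can be seen directly in PyTorch. The sizes below are toy values chosen purely for illustration (real models use d_model of 512 or more):

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only.
d_model, nhead, src_len, batch = 16, 4, 10, 2

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

src = torch.randn(batch, src_len, d_model)  # already-embedded input tokens
memory = encoder(src)

# One contextual vector per input token: the shape is unchanged,
# but each position now mixes information from the whole sequence.
print(memory.shape)  # torch.Size([2, 10, 16])
```

Note that the encoder runs once over the full input; nothing about it is step-by-step.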
What the Decoder Actually Does
The decoder generates output tokens step by step.
At each step, it:
- Looks at previously generated tokens
- Attends to the encoder’s output
- Predicts the next token
This is controlled, conditional generation.
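The three steps above can be sketched as a greedy decoding loop. Everything here (the untrained `decoder`, `embed`, and `to_vocab` modules, and treating token id 0 as a start token) is a hypothetical stand-in, not a real trained model, so the generated ids are meaningless; only the data flow matters:

```python
import torch
import torch.nn as nn

# Hypothetical toy modules -- untrained, for illustrating the loop only.
d_model, vocab = 16, 100
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
embed = nn.Embedding(vocab, d_model)
to_vocab = nn.Linear(d_model, vocab)

memory = torch.randn(1, 10, d_model)  # encoder output, fixed during decoding
tokens = torch.tensor([[0]])          # assume id 0 is a start-of-sequence token

for _ in range(5):
    tgt = embed(tokens)                               # previously generated tokens
    mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
    out = decoder(tgt, memory, tgt_mask=mask)         # attends to past + memory
    next_token = to_vocab(out[:, -1]).argmax(-1)      # predicts the next token
    tokens = torch.cat([tokens, next_token.unsqueeze(1)], dim=1)

print(tokens.shape)  # torch.Size([1, 6]) -- start token plus 5 generated tokens
```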
How Information Flows Between Them
The key connection is cross-attention.
Cross-attention allows the decoder to:
- Focus on relevant parts of the input
- Ignore irrelevant details
- Dynamically adjust during generation
This is why translations stay aligned with input meaning.
High-Level Architecture Flow
An encoder–decoder Transformer follows this flow:
- Input tokens → encoder
- Encoder outputs → memory
- Decoder attends to memory + past outputs
- Next token prediction
Each component has a clear role.
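The four-step flow above maps onto a single forward pass of PyTorch's built-in nn.Transformer. The tensors here are random embeddings at toy sizes, standing in for real embedded tokens:

```python
import torch
import torch.nn as nn

# The whole pipeline in one call (toy sizes, untrained weights).
model = nn.Transformer(d_model=16, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

src = torch.randn(2, 10, 16)  # input tokens -> encoder
tgt = torch.randn(2, 7, 16)   # past outputs -> decoder
tgt_mask = nn.Transformer.generate_square_subsequent_mask(7)

# Internally: encoder(src) becomes memory; decoder attends to memory + tgt.
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 7, 16]) -- one prediction slot per target position
```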
Thinking Like an Engineer Before Coding
Before implementation, engineers decide:
- Input representation size
- Number of encoder layers
- Number of decoder layers
- Attention head configuration
These decisions affect accuracy and latency.
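One cheap way to weigh these decisions before training anything is to compare parameter counts across candidate configurations. The helper below is a hypothetical sketch, not a profiling tool:

```python
import torch.nn as nn

def param_count(num_encoder_layers, num_decoder_layers, d_model=512, nhead=8):
    # Build a throwaway model just to count its trainable parameters.
    m = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=num_encoder_layers,
                       num_decoder_layers=num_decoder_layers)
    return sum(p.numel() for p in m.parameters())

# Deeper stacks cost more parameters, and usually more latency.
print(param_count(6, 6) > param_count(2, 2))  # True
```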
Minimal Encoder–Decoder Skeleton
This example shows structure, not a full production model.
import torch
import torch.nn as nn

# Six identical encoder layers; each token is represented by a 512-dim
# vector, with attention split across 8 heads.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# A matching decoder stack. At run time it also needs the encoder's
# output (the "memory") in every forward call.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
This code defines two independent stacks.
What matters is how they interact.
How Encoder Output Is Used
The encoder produces a memory tensor.
The decoder queries this memory at every generation step.
This is how the output remains grounded in the input.
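Continuing the skeleton defined above, the asymmetry is visible in the call signatures: the encoder takes only the source, while the decoder takes both its own inputs and the memory. PyTorch's default layout without batch_first is (seq_len, batch, d_model):

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

src = torch.randn(10, 2, 512)  # (seq_len, batch, d_model)
tgt = torch.randn(7, 2, 512)

memory = encoder(src)      # computed once per input
out = decoder(tgt, memory) # memory is queried at every generation step
print(out.shape)  # torch.Size([7, 2, 512])
```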
Cross-Attention Conceptually
In cross-attention:
- Queries come from the decoder
- Keys and values come from the encoder
This allows precise alignment between input and output.
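A minimal sketch of this using nn.MultiheadAttention directly (toy sizes): the decoder states supply the queries, the encoder states supply the keys and values, and the returned attention weights form an alignment over input positions:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

decoder_states = torch.randn(1, 7, 16)   # queries: 7 output positions
encoder_states = torch.randn(1, 10, 16)  # keys and values: 10 input positions

out, weights = attn(query=decoder_states, key=encoder_states, value=encoder_states)
print(out.shape)      # torch.Size([1, 7, 16]) -- one result per decoder position
print(weights.shape)  # torch.Size([1, 7, 10]) -- alignment over the 10 input positions
```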
Why Decoder Is Autoregressive
The decoder predicts one token at a time.
Masking ensures it cannot see future tokens.
This preserves causal generation.
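The mask itself makes this concrete: -inf entries mark future positions, which become zero attention weight after softmax, so position i can only attend to positions up to i:

```python
import torch
import torch.nn as nn

mask = nn.Transformer.generate_square_subsequent_mask(4)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```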
Where This Architecture Is Used Today
- Translation systems
- Instruction-following models
- Speech-to-text pipelines
Encoder–decoder models are still widely used in production.
Common Learner Mistakes
- Confusing self-attention with cross-attention
- Thinking encoder output is static text
- Assuming the decoder sees the full output at once (it only sees tokens generated so far)
Understanding data flow is critical here.
Practice
Which component processes the full input?
Which component generates output tokens?
What mechanism connects encoder and decoder?
Quick Quiz
Encoder primarily focuses on?
Why is the decoder masked?
Cross-attention enables?
Recap: Encoder–decoder Transformers separate understanding from generation using cross-attention.
Next up: Decoder-Only Models — why GPT-style architectures dominate modern LLMs.