GenAI Lesson 32 – Enc-Dec | Dataplexa

Encoder–Decoder Architecture: How Transformers Generate Outputs

Not all generative problems are solved by predicting the next token.

Some tasks require the model to read an entire input, understand it deeply, and then produce a structured output.

This lesson explains how the encoder–decoder architecture enables that behavior, and how engineers decide when to use it.

The Problem This Architecture Solves

Consider tasks like:

  • Machine translation
  • Summarization
  • Question answering

In all these cases:

  • Input and output lengths differ
  • Output depends on the full input
  • Generation must be guided, not free-form

A single-stream, next-token-only model is not ideal here, because generation should be conditioned on a complete, separately processed input.

Why Encoder and Decoder Are Separated

Engineers separate responsibilities:

  • Encoder: Understand the input
  • Decoder: Generate the output

This separation makes the system easier to train, debug, and reason about.

What the Encoder Actually Does

The encoder processes the entire input sequence at once.

It produces contextual representations for every input token.

After encoding, the input is no longer raw text; it is a set of vectors that capture each token's meaning in context.
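A minimal sketch of this step (dimensions and layer counts are illustrative, and the "tokens" are random embeddings rather than real text):

```python
import torch
import torch.nn as nn

# Hypothetical small configuration for illustration.
d_model, nhead, seq_len = 64, 4, 10

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# One batch of 10 already-embedded input tokens.
src = torch.randn(1, seq_len, d_model)
memory = encoder(src)

# Every input position now has a context-aware vector of size d_model.
print(memory.shape)  # torch.Size([1, 10, 64])
```

The output has the same shape as the input, but each position's vector now reflects the whole sequence, not just one token.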

What the Decoder Actually Does

The decoder generates output tokens step by step.

At each step, it:

  • Looks at previously generated tokens
  • Attends to the encoder’s output
  • Predicts the next token

This is controlled, conditional generation.
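The step-by-step loop above can be sketched as follows. This is a toy greedy-decoding loop with random embeddings and a hypothetical 100-token vocabulary, not a trained model; in a real system `next_id` would be embedded and fed back in:

```python
import torch
import torch.nn as nn

d_model, nhead = 64, 4
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
to_vocab = nn.Linear(d_model, 100)  # project to a toy 100-token vocabulary

memory = torch.randn(1, 10, d_model)    # stand-in for the encoder's output
generated = torch.randn(1, 1, d_model)  # stand-in for a start-of-sequence embedding

for _ in range(3):  # three greedy generation steps
    out = decoder(generated, memory)  # attends to past outputs and to memory
    logits = to_vocab(out[:, -1])     # predict the next token from the last position
    next_id = logits.argmax(-1)       # greedy choice of next token id
    # A real model would embed next_id; here we append a placeholder vector.
    generated = torch.cat([generated, torch.randn(1, 1, d_model)], dim=1)

print(generated.shape)  # torch.Size([1, 4, 64])
```

Each pass through the loop produces exactly one new token, conditioned on everything generated so far plus the encoder's memory.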

How Information Flows Between Them

The key connection is cross-attention.

Cross-attention allows the decoder to:

  • Focus on relevant parts of the input
  • Ignore irrelevant details
  • Dynamically adjust during generation

This is why translations stay aligned with input meaning.

High-Level Architecture Flow

An encoder–decoder Transformer follows this flow:

  • Input tokens → encoder
  • Encoder outputs → memory
  • Decoder attends to memory + past outputs
  • Next token prediction

Each component has a clear role.
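The whole flow is available as a single PyTorch module. A sketch with small, illustrative dimensions (inputs here are random embeddings standing in for token embeddings):

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

src = torch.randn(1, 10, 64)  # embedded input sequence
tgt = torch.randn(1, 7, 64)   # embedded output-so-far

out = model(src, tgt)  # input → encoder → memory → decoder
print(out.shape)       # torch.Size([1, 7, 64])
```

Note that the output length follows the target, not the source: the decoder produces one representation per output position.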

Thinking Like an Engineer Before Coding

Before implementation, engineers decide:

  • Input representation size
  • Number of encoder layers
  • Number of decoder layers
  • Attention head configuration

These decisions affect accuracy and latency.
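One common way to make these decisions explicit is a configuration object. This is a hypothetical sketch, with defaults matching the example later in this lesson:

```python
from dataclasses import dataclass

@dataclass
class Seq2SeqConfig:
    d_model: int = 512           # input representation size
    num_encoder_layers: int = 6  # depth of the understanding stack
    num_decoder_layers: int = 6  # depth of the generation stack
    nhead: int = 8               # attention head configuration

cfg = Seq2SeqConfig()
# d_model must split evenly across attention heads.
assert cfg.d_model % cfg.nhead == 0
```

Keeping these values in one place makes it easy to trade accuracy against latency by, for example, shrinking the decoder while keeping a deep encoder.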

Minimal Encoder–Decoder Skeleton

This example shows structure, not a full production model.


import torch
import torch.nn as nn

# Encoder stack: 6 layers of 512-dim representations with 8 attention heads.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Decoder stack with matching dimensions; it will attend to the encoder's output.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

This code defines two independent stacks.

What matters is how they interact.
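The interaction can be sketched by wiring the two stacks together. Sequence lengths here are tiny and illustrative, and PyTorch's default layout for these modules is sequence-first:

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

src = torch.randn(12, 1, 512)  # (source length, batch, d_model)
tgt = torch.randn(5, 1, 512)   # (target length, batch, d_model)

memory = encoder(src)        # encoder output becomes the "memory"
out = decoder(tgt, memory)   # decoder queries memory via cross-attention
print(out.shape)             # torch.Size([5, 1, 512])
```

The decoder's output keeps the target length, while the memory keeps the source length; cross-attention is what bridges the two.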

How Encoder Output Is Used

The encoder produces a memory tensor.

The decoder queries this memory at every generation step.

This is how the output remains grounded in the input.

Cross-Attention Conceptually

In cross-attention:

  • Queries come from the decoder
  • Keys and values come from the encoder

This allows precise alignment between input and output.
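Cross-attention can be written out by hand to make the query/key/value split concrete. A minimal sketch with random states and unlearned projection matrices (real models use trained `nn.Linear` layers and multiple heads):

```python
import torch
import torch.nn.functional as F

d = 64
dec_states = torch.randn(1, 5, d)   # decoder hidden states → queries
enc_states = torch.randn(1, 10, d)  # encoder outputs → keys and values

# Illustrative projections; in practice these are learned parameters.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = dec_states @ Wq, enc_states @ Wk, enc_states @ Wv

scores = Q @ K.transpose(-2, -1) / d ** 0.5  # (1, 5, 10): one row per output step
weights = F.softmax(scores, dim=-1)          # each row sums to 1 over input tokens
attended = weights @ V                       # (1, 5, 64): input info per output step

print(weights.shape, attended.shape)
```

Each of the 5 output positions gets its own distribution over the 10 input positions, which is exactly the alignment behavior described above.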

Why the Decoder Is Autoregressive

The decoder predicts one token at a time.

Masking ensures it cannot see future tokens.

This preserves causal generation.
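The mask is simply an upper-triangular matrix of negative infinity, built here by hand for four target positions:

```python
import torch

sz = 4
# -inf above the diagonal blocks attention to future positions;
# row i may attend only to columns 0..i.
mask = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
print(mask)
```

Adding `-inf` to an attention score sends its softmax weight to zero, so masked (future) positions contribute nothing to the prediction.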

Where This Architecture Is Used Today

  • Translation systems
  • Instruction-following models
  • Speech-to-text pipelines

Encoder–decoder models are still widely used in production.

Common Learner Mistakes

  • Confusing self-attention with cross-attention
  • Thinking encoder output is static text
  • Assuming the decoder sees the full output in advance

Understanding data flow is critical here.

Practice

Which component processes the full input?



Which component generates output tokens?



What mechanism connects encoder and decoder?



Quick Quiz

What does the encoder primarily focus on?





Why is the decoder masked?





What does cross-attention enable?





Recap: Encoder–decoder Transformers separate understanding from generation using cross-attention.

Next up: Decoder-Only Models — why GPT-style architectures dominate modern LLMs.