
Sequence-to-Sequence (Seq2Seq) Models

Sequence-to-Sequence models, commonly called Seq2Seq, are a natural extension of encoder–decoder architectures.

While encoder–decoder describes the structure, Seq2Seq describes how the model is trained and used to map one sequence directly into another.

Seq2Seq models form the foundation of modern systems such as machine translation, chatbots, text summarization, question answering, and speech recognition.


What Makes Seq2Seq Different?

A basic encoder–decoder setup only describes an architecture: two networks, one that reads and one that writes.

A Seq2Seq model adds a learning objective on top of that structure:

Given an input sequence, the model must learn to produce the correct output sequence step by step.

This turns sequence tasks into supervised learning problems in which entire sequence pairs become the training examples.
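For example, a translation dataset is nothing more than a list of (input sequence, target sequence) pairs. A minimal illustration in Python (the sentence pairs below are only examples):

# Each training example is a complete input sequence paired with a
# complete target sequence; the model learns the mapping end to end.
training_pairs = [
    ("How are you today?", "Comment ça va aujourd'hui ?"),
    ("Good morning.", "Bonjour."),
    ("See you tomorrow.", "À demain."),
]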


How Seq2Seq Training Works

During training, the model sees many pairs of sequences:

• Input sequence
• Target output sequence

The encoder processes the input sequence and produces internal states.

The decoder then learns to generate the target sequence using both the encoder states and the previous correct outputs.

This technique is known as Teacher Forcing.


Teacher Forcing Explained

Instead of feeding the decoder its own previous predictions, we feed it the actual correct output during training.

This stabilizes learning and speeds up convergence.

However, during inference (real-world usage), the decoder must rely on its own predictions.

This difference between training and inference is one of the key challenges in Seq2Seq modeling.
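In practice, teacher forcing usually means the decoder input is the target sequence shifted right by one position, so the decoder always conditions on the correct previous token. A minimal sketch, assuming "<start>" and "<end>" are special markers added during preprocessing:

# Target sentence with assumed start/end markers
target_tokens = ["<start>", "Comment", "ça", "va", "aujourd'hui", "?", "<end>"]

# Teacher forcing: feed the correct previous token at every step ...
decoder_input = target_tokens[:-1]
# ... and train the decoder to predict the token that follows it.
decoder_target = target_tokens[1:]

for previous, expected in zip(decoder_input, decoder_target):
    print(f"given '{previous}' -> predict '{expected}'")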


Step-by-Step Seq2Seq Flow

1. Encoder reads the full input sequence

2. Encoder produces final hidden states

3. Decoder initializes using encoder states

4. Decoder predicts one token at a time

5. Prediction stops when end-of-sequence token is reached

Each output token depends on:

• Encoder context
• Decoder hidden state
• Previously generated tokens

A minimal decoding loop that follows steps 3–5 is sketched below.
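In the sketch, decode_step is a hypothetical one-step decoder standing in for whatever your framework provides; its canned outputs exist only so the example runs on its own.

# Hypothetical one-step decoder: takes the previous token and the current
# decoder state, returns the next token and the updated state.
def decode_step(prev_token, state):
    # Canned lookup so the sketch is runnable; a real model would score
    # the whole output vocabulary and pick the most likely token.
    canned = {"<start>": "Comment", "Comment": "ça", "ça": "va",
              "va": "aujourd'hui", "aujourd'hui": "?", "?": "<end>"}
    return canned.get(prev_token, "<end>"), state

state = "encoder_final_states"   # placeholder for step 3 (encoder states)
token = "<start>"
output = []
while True:
    token, state = decode_step(token, state)   # step 4: one token at a time
    if token == "<end>":                       # step 5: stop at end-of-sequence
        break
    output.append(token)

print(" ".join(output))   # Comment ça va aujourd'hui ?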


Practical Example: Text Translation

Input sequence:

"How are you today?"

Target output sequence:

"Comment ça va aujourd'hui ?"

During training, the model learns the alignment between input and output words implicitly.

This is where Seq2Seq truly shines — it learns relationships between entire sequences, not just individual words.
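For instance, simply splitting both sentences on whitespace shows that the model is never given a one-to-one word mapping to copy (illustrative tokenization only):

source = "How are you today?".split()
target = "Comment ça va aujourd'hui ?".split()

print(len(source), source)   # 4 tokens: ['How', 'are', 'you', 'today?']
print(len(target), target)   # 5 tokens: ['Comment', 'ça', 'va', "aujourd'hui", '?']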


Basic Seq2Seq Model (Conceptual Code)

# Conceptual Keras-style sketch; num_encoder_tokens and num_decoder_tokens
# are assumed vocabulary sizes for one-hot encoded sequences.
from tensorflow.keras.layers import Input, LSTM, Dense

# Encoder: read the full input sequence and keep only its final states
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_output, state_h, state_c = LSTM(256, return_state=True)(encoder_inputs)

# Decoder: generate the target sequence, starting from the encoder states
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_output = LSTM(256, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])

# Project every decoder step onto the output vocabulary
decoder_output = Dense(num_decoder_tokens, activation="softmax")(decoder_output)

The decoder is trained to predict the next token given the previous correct token and encoder states.
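To train this end to end, the two parts are typically wrapped into a single model and fitted on batches where the decoder input is the shifted target (teacher forcing) and the decoder target is the unshifted one. A hedged continuation of the sketch above; the data arrays are assumed to be preprocessed one-hot tensors:

from tensorflow.keras.models import Model

# One trainable graph covering both encoder and decoder
model = Model([encoder_inputs, decoder_inputs], decoder_output)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# encoder_input_data, decoder_input_data (shifted right) and
# decoder_target_data are assumed preprocessed arrays of one-hot vectors.
model.fit([encoder_input_data, decoder_input_data],
          decoder_target_data,
          batch_size=64, epochs=10)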


Limitations of Plain Seq2Seq Models

Early Seq2Seq models struggled with long sequences.

Compressing an entire sentence into a single vector creates information loss.

This problem becomes severe in long texts or complex conversations.

This limitation directly motivated the development of Attention Mechanisms, which allow the decoder to look back at encoder states dynamically.


Why Seq2Seq Models Matter

Seq2Seq models changed how machines handle language and time-series data.

They introduced:

• Flexible-length input and output
• End-to-end sequence learning
• Foundations for attention and transformers

Without Seq2Seq, modern NLP systems would not exist.


Mini Thinking Exercise

Consider this carefully:

• Why is teacher forcing useful during training?
• Why does inference behave differently from training?


Exercises

Exercise 1:
What is teacher forcing?

Answer: Feeding the correct previous output to the decoder during training.

Exercise 2:
Why does Seq2Seq struggle with long sequences?

Answer: Because all information is compressed into a single context vector.

Quick Check

Q: Is Seq2Seq only used for text?

A: No. It is also used for speech, time-series, and signal translation tasks.

Next, we will address Seq2Seq’s biggest limitation by introducing the Attention Mechanism, which allows the decoder to focus dynamically on the most relevant parts of the input.