DL Lesson 59 – Beam Search | Dataplexa

Beam Search

Beam Search is a decoding strategy used during inference for sequence models such as RNNs, LSTMs, and Transformers.

Its purpose is to generate higher-quality sequences by exploring multiple possible outputs instead of committing to a single greedy choice.


Why Decoding Matters

Sequence models do not output an entire sentence at once.

They generate tokens step by step, predicting the next token based on previously generated tokens.

The way we choose these tokens directly affects output quality.


Greedy Decoding and Its Limitation

The simplest approach is greedy decoding.

At each step, the model selects the token with the highest probability.

While fast, this approach can easily miss better overall sequences because it never revisits earlier decisions.
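A minimal sketch of greedy decoding is shown below. Here `next_token_logprobs(prefix)` is a hypothetical stand-in for the model: it is assumed to return a dictionary mapping each possible next token to its log probability given the prefix generated so far.

```python
# Greedy decoding sketch. `next_token_logprobs` is an assumed model interface:
# given the tokens generated so far, it returns {token: log_probability}.

def greedy_decode(next_token_logprobs, max_len=20, eos="<eos>"):
    seq = []
    for _ in range(max_len):
        logprobs = next_token_logprobs(seq)
        token = max(logprobs, key=logprobs.get)  # take the single best token
        if token == eos:
            break
        seq.append(token)  # the choice is final; it is never revisited
    return seq
```

Because each choice is final, one early mistake propagates through the rest of the sequence.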


Core Idea Behind Beam Search

Beam Search keeps multiple candidate sequences instead of just one.

Each candidate is called a beam.

At every step:

• Each beam is expanded with possible next tokens
• All expanded sequences are scored
• Only the top k sequences are kept


The value k is called the beam width.
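The sketch below shows this expand–score–prune loop. As in the greedy sketch, `next_token_logprobs(prefix)` is an assumed model interface returning `{token: log_probability}`; it is not part of any specific library.

```python
# Beam search sketch. Each beam is a (sequence, total_log_prob) pair.

def beam_search(next_token_logprobs, beam_width=3, max_len=10, eos="<eos>"):
    beams = [([], 0.0)]  # start from the empty prefix with score 0
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished beams pass through
                continue
            for token, logp in next_token_logprobs(seq).items():
                candidates.append((seq + [token], score + logp))  # expand
        # Score all expansions and keep only the top-k by total log probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams  # list of (sequence, score), best first
```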


Scoring Sequences

Beam Search scores a sequence by summing the log probabilities of its tokens.

Log probabilities are used because multiplying many small probabilities together quickly underflows to zero; summing their logarithms is numerically stable and preserves the ranking of candidates.

Higher total log probability means a better candidate sequence.
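A quick numerical illustration of why logs are used (the probabilities here are made up for the example):

```python
import math

token_probs = [0.2] * 50  # fifty tokens, each with probability 0.2

raw_product = math.prod(token_probs)                # ~1.1e-35, heading toward underflow
log_score = sum(math.log(p) for p in token_probs)   # ~-80.5, numerically safe

print(raw_product, log_score)
```

With sequences of hundreds of tokens, the raw product would underflow to zero while the log score stays well within floating-point range.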


Example Walkthrough

Suppose a model is generating a sentence word by word.

Instead of choosing only the best next word, Beam Search keeps several partial sentences and evaluates how promising each one is.

This allows decoding to avoid committing to a choice that looks best locally but leads to a worse overall sentence.
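The toy example below makes this concrete. All token names and probabilities are invented for illustration: greedy decoding picks "the" first because it is locally best, while a beam of width 2 also keeps "a" and discovers that "a cat" has higher overall probability. The beam loop mirrors the sketch shown earlier.

```python
import math

# Hand-made toy next-token distributions (hypothetical numbers).
TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.4, "dog": 0.3, "mat": 0.3},
    ("a",): {"cat": 0.9, "dog": 0.1},
}

def expand(prefix):
    return TABLE.get(tuple(prefix), {"<eos>": 1.0})

# Greedy: always take the single best next token.
greedy = []
while True:
    probs = expand(greedy)
    token = max(probs, key=probs.get)
    if token == "<eos>":
        break
    greedy.append(token)

# Beam search with width 2: keep the two best partial sentences at each step.
beams = [([], 0.0)]
for _ in range(3):
    candidates = []
    for seq, score in beams:
        for tok, p in expand(seq).items():
            candidates.append((seq + [tok], score + math.log(p)))
    beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:2]

print("greedy:", greedy)          # ['the', 'cat']            (prob 0.6 * 0.4 = 0.24)
print("beam best:", beams[0][0])  # ['a', 'cat', '<eos>']     (prob 0.4 * 0.9 = 0.36)
```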


Beam Width Tradeoff

Choosing the beam width is critical.

Small beam width:

• Faster inference
• Less memory usage
• Lower output quality

Large beam width:

• Better exploration
• Higher quality sequences
• Slower inference

In practice, beam widths between 3 and 10 are commonly used.
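In most libraries the beam width is a single generation parameter. The sketch below assumes the Hugging Face transformers library and the public "t5-small" checkpoint; parameter names may differ in other frameworks.

```python
# Sketch assuming the Hugging Face `transformers` library and "t5-small".
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The cat sat on the mat.",
                   return_tensors="pt")

# num_beams is the beam width k: larger values explore more candidates
# but cost more time and memory.
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```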


Length Bias Problem

Beam Search tends to prefer shorter sequences because every additional token contributes a negative log probability, so longer candidates accumulate lower total scores.

This can cause the model to stop generation too early.

To address this, length normalization is often applied.


Length Normalization Concept

Instead of comparing raw summed log probabilities, each candidate's score is divided by its length (often raised to a tunable power), giving an average per-token score.

This helps balance short and long sequences during comparison.
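A minimal sketch of the idea, with toy numbers: divide the summed log probability by the length raised to an exponent alpha. The exact normalization formula varies by framework (Google's NMT system, for example, uses a slightly different penalty), so treat this as an illustration rather than the standard definition.

```python
def normalized_score(log_probs, alpha=0.7):
    # log_probs: per-token log probabilities of one candidate sequence.
    # alpha = 0 means no normalization; alpha = 1 is a pure per-token average.
    total = sum(log_probs)
    return total / (len(log_probs) ** alpha)

short = [-0.5, -0.6]                # 2 tokens, total -1.1
long  = [-0.4, -0.4, -0.4, -0.4]    # 4 tokens, total -1.6

print(normalized_score(short), normalized_score(long))
# Without normalization the short sequence wins (-1.1 > -1.6);
# after normalization the longer, steadier sequence ranks higher.
```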

Modern frameworks handle this internally.


Beam Search in Practice

Beam Search is widely used in:

• Machine translation
• Speech recognition
• Text summarization
• Code generation

It is a decoding strategy, so it is applied during inference rather than during training.


Comparison with Sampling

Beam Search is deterministic.

Sampling-based methods introduce randomness, which can improve creativity but reduce consistency.

Beam Search prioritizes correctness and high-probability outputs.
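A toy single-step contrast, using a made-up next-token distribution: a deterministic rule always returns the same token, while sampling draws a different token on different runs.

```python
import random

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.2}  # hypothetical distribution

# Deterministic choice (how greedy and beam scoring rank tokens at one step).
print(max(probs, key=probs.get))  # always "cat"

# Sampling draws in proportion to the probabilities, so repeated runs vary.
tokens, weights = zip(*probs.items())
print(random.choices(tokens, weights=weights, k=5))  # e.g. ['cat', 'dog', 'cat', ...]
```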


Exercises

Exercise 1:
Why is greedy decoding often suboptimal?

Because it commits to locally optimal choices without considering future outcomes.

Exercise 2:
What does beam width control?

It controls how many candidate sequences are explored at each step.

Quick Check

Q: Is Beam Search used during training?

No. It is used during inference to generate sequences.

Next, we conclude the Deep Learning module by exploring real-world Sequence Model Applications and how these models are deployed in practice.