Beam Search
Beam Search is a decoding strategy used during inference for sequence models such as RNNs, LSTMs, and Transformers.
Its purpose is to generate higher-quality sequences by exploring multiple possible outputs instead of committing to a single greedy choice.
Why Decoding Matters
Sequence models do not output an entire sentence at once.
They generate tokens step by step, predicting the next token based on previously generated tokens.
The way we choose these tokens directly affects output quality.
Greedy Decoding and Its Limitation
The simplest approach is greedy decoding.
At each step, the model selects the token with the highest probability.
While fast, this approach can easily miss better overall sequences because it never revisits earlier decisions.
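A tiny numeric example (with made-up probabilities) makes the problem concrete: greedy commits to the locally best first token even when a different first token leads to a far more probable sentence overall.

```python
# Hypothetical two-step probabilities, invented for illustration.
# Greedy picks "the" first (0.5 > 0.4), then its best continuation "dog" (0.2),
# giving 0.5 * 0.2 = 0.10. The sequence "a cat" scores 0.4 * 0.9 = 0.36,
# but greedy never considers it because it already discarded "a".
p_first = {"the": 0.5, "a": 0.4}
p_next = {"the": {"dog": 0.2, "cat": 0.1}, "a": {"cat": 0.9, "dog": 0.05}}

greedy_first = max(p_first, key=p_first.get)
greedy_next = max(p_next[greedy_first], key=p_next[greedy_first].get)
greedy_score = p_first[greedy_first] * p_next[greedy_first][greedy_next]

best_score = max(p_first[w1] * p_next[w1][w2]
                 for w1 in p_first for w2 in p_next[w1])

print(greedy_first, greedy_next, greedy_score)  # the dog 0.1
print(best_score)                               # 0.36
```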
Core Idea Behind Beam Search
Beam Search keeps multiple candidate sequences instead of just one.
Each candidate is called a beam.
At every step:
• Each beam is expanded with possible next tokens
• All expanded sequences are scored
• Only the top k sequences are kept
The value k is called the beam width.
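A minimal sketch of this expand-score-prune loop is shown below. It assumes a function `next_token_log_probs(sequence)` that returns a dictionary of log probabilities for the next token; this is a stand-in for a real model's softmax output, not a library API.

```python
def beam_search(next_token_log_probs, start_token, end_token, k=3, max_len=20):
    """Keep the k highest-scoring candidate sequences at every step.

    next_token_log_probs(seq) is assumed to return {token: log_prob}
    for the next position; it stands in for a real model.
    """
    beams = [([start_token], 0.0)]  # (sequence, total log probability)

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                # Finished beams keep competing with their current score.
                candidates.append((seq, score))
                continue
            for token, logp in next_token_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the top-k candidates: k is the beam width.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == end_token for seq, _ in beams):
            break

    return max(beams, key=lambda c: c[1])
```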
Scoring Sequences
Beam Search scores a sequence by summing the log probabilities of its tokens.
Working in log space turns the product of token probabilities into a sum, which avoids numerical underflow for long sequences.
A higher (less negative) total log probability means a better candidate sequence.
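A quick illustration with toy probabilities shows why log space matters:

```python
import math

# Hypothetical per-token probabilities for a 400-token sequence.
token_probs = [0.1] * 400

# Multiplying raw probabilities underflows to exactly 0.0 in float64.
raw_score = math.prod(token_probs)

# Summing log probabilities keeps the score in a workable range.
log_score = sum(math.log(p) for p in token_probs)   # 400 * log(0.1) ≈ -921

print(raw_score, log_score)   # 0.0  -921.03...
```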
Example Walkthrough
Suppose a model is generating a sentence word by word.
Instead of choosing only the best next word, Beam Search keeps several partial sentences and evaluates how promising each one is.
This allows the model to recover from locally optimal but globally poor choices.
Beam Width Tradeoff
Choosing the beam width is critical.
Small beam width:
• Faster inference
• Less memory usage
• Lower output quality
Large beam width:
• Better exploration
• Higher quality sequences
• Slower inference
In practice, beam widths between 3 and 10 are commonly used.
Length Bias Problem
Beam Search tends to prefer shorter sequences: every additional token contributes a negative log probability, so longer candidates accumulate lower scores.
This can cause the model to stop generation too early.
To address this, length normalization is often applied.
Length Normalization Concept
Instead of comparing raw sums of log probabilities, each candidate's score is divided by a function of its length.
This puts short and long sequences on a more even footing during comparison.
Modern frameworks handle this internally.
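One common form, sketched below with illustrative numbers, divides the summed log probability by the sequence length raised to a power alpha; the exact penalty formula varies between implementations.

```python
def normalized_score(log_probs, alpha=0.7):
    """Length-normalized score: sum of log probs divided by a length penalty.

    alpha = 0 recovers the raw sum; alpha = 1 is a plain per-token average.
    The 0.7 default and this exact penalty form are illustrative choices.
    """
    length_penalty = len(log_probs) ** alpha
    return sum(log_probs) / length_penalty

short = [-1.0, -1.0]                    # 2 tokens, raw score -2.0
long = [-0.6, -0.6, -0.6, -0.6, -0.6]   # 5 tokens, raw score -3.0

# Raw sums prefer the short sequence; normalized scores prefer the long one.
print(sum(short), sum(long))                             # -2.0  -3.0
print(normalized_score(short), normalized_score(long))   # ~-1.23  ~-0.97
```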
Beam Search in Practice
Beam Search is widely used in:
• Machine translation
• Speech recognition
• Text summarization
• Code generation
It is rarely used during training and primarily applied during inference.
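As one example of how this looks in practice, the Hugging Face transformers library enables beam search at generation time through the num_beams argument. This is a minimal sketch; the checkpoint name and prompt are illustrative.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Model name is illustrative; any seq2seq checkpoint works the same way.
name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer(
    "translate English to German: Beam search keeps several candidates.",
    return_tensors="pt",
)

# num_beams sets the beam width; early_stopping ends the search once
# enough finished candidates have been found.
outputs = model.generate(**inputs, num_beams=5, early_stopping=True,
                         max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```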
Comparison with Sampling
Beam Search is deterministic.
Sampling-based methods introduce randomness, which can improve creativity but reduce consistency.
Beam Search prioritizes consistency and high-probability outputs.
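A toy distribution (invented numbers) illustrates the contrast: a deterministic decoder always takes the same pick, while sampling draws at random in proportion to the probabilities.

```python
import random

# Toy next-token distribution, for illustration only.
next_token = {"dog": 0.5, "cat": 0.3, "fish": 0.2}

# Beam/greedy style: deterministic, the same pick on every run.
deterministic = max(next_token, key=next_token.get)   # always "dog"

# Sampling: random draws weighted by probability; runs can differ.
sampled = random.choices(list(next_token), weights=next_token.values(), k=5)

print(deterministic, sampled)
```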
Exercises
Exercise 1:
Why is greedy decoding often suboptimal?
Exercise 2:
What does beam width control?
Quick Check
Q: Is Beam Search used during training?
Next, we conclude the Deep Learning module by exploring real-world Sequence Model Applications and how these models are deployed in practice.