Speech AI Course
Attention Models in Automatic Speech Recognition
In the previous lesson, you learned how CTC models solve the alignment problem without explicit frame-level alignments.
However, CTC makes a strong assumption: given the input, each output is predicted independently of the other outputs.
This lesson introduces attention-based ASR models, which remove that limitation and allow models to focus on the most relevant parts of speech.
Why Attention Was Needed
Human listeners do not process speech frame by frame independently.
We naturally:
- Remember previous words
- Use context to resolve ambiguity
- Focus on relevant sounds
CTC models cannot fully capture this behavior.
Attention models were designed to bring context awareness into ASR.
What Is Attention?
Attention is a mechanism that allows a model to dynamically focus on specific parts of the input when generating each output token.
Instead of processing the entire input equally, the model asks:
“Which part of the audio matters most right now?”
Sequence-to-Sequence ASR
Attention models are typically built using sequence-to-sequence architectures.
These models consist of:
- An encoder
- An attention mechanism
- A decoder
Together, they map:
Audio → Text
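The three components above can be wired together as a decoding loop. Here is a skeleton sketch; `encode`, `attend`, and `decode_step` are illustrative placeholders for the components described in the rest of this lesson, not a real API:

```python
# Skeleton of an attention-based seq2seq decoder loop.
# encode / attend / decode_step are placeholders, not a real library API.
def transcribe(audio_features, encode, attend, decode_step, sos, eos, max_len=100):
    encoder_states = encode(audio_features)           # audio frames -> hidden states
    token, dec_state, output = sos, None, []
    for _ in range(max_len):
        context = attend(dec_state, encoder_states)   # focus on relevant frames
        token, dec_state = decode_step(token, context, dec_state)
        if token == eos:
            break
        output.append(token)
    return output
```

Note that decoding is inherently sequential: each step consumes the previous token, which is one source of the latency discussed later in this lesson.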
The Encoder
The encoder processes the input audio features and converts them into a sequence of hidden representations.
It captures:
- Phonetic patterns
- Temporal information
- Acoustic context
Encoders are commonly built using:
- RNNs (LSTM / GRU)
- CNN + RNN stacks
- Transformers
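To make the encoder's job concrete, here is a toy NumPy sketch that maps a (frames × mel-bins) feature matrix to one hidden vector per frame. A single feed-forward layer stands in for a real LSTM or Transformer encoder, and the weights are random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_mels, d = 50, 80, 256                      # frames, mel bins, hidden size
features = rng.standard_normal((T, n_mels))     # stand-in for log-mel features
W = rng.standard_normal((n_mels, d)) * 0.1
b = np.zeros(d)

# One hidden representation per frame, shape (T, d).
# Real encoders (LSTM / Transformer) also mix information across frames.
encoder_states = np.tanh(features @ W + b)
```

The key takeaway is the shape: the encoder turns T frames of raw features into T hidden states, which the attention mechanism will score at every decoding step.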
The Attention Mechanism
The attention mechanism connects the encoder and decoder.
For each output step, it:
- Scores encoder states
- Computes attention weights
- Creates a context vector
This context vector represents the most relevant audio frames for the current output token.
The Decoder
The decoder generates text one token at a time.
Each output depends on:
- The previous output token
- The attention context vector
This allows the model to:
- Use language context naturally
- Resolve ambiguities
- Produce fluent sequences
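A toy NumPy sketch of a single decoder step that conditions on both inputs listed above. The weights are random and the names (`embeddings`, `W_out`) are illustrative, not a real library API:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d = 30, 256
embeddings = rng.standard_normal((vocab_size, d)) * 0.1   # token embeddings
W_out = rng.standard_normal((2 * d, vocab_size)) * 0.1

def decode_step(prev_token, context):
    # Condition on BOTH the previous token and the attention context vector
    x = np.concatenate([embeddings[prev_token], context])
    logits = x @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()          # distribution over the vocabulary

next_token_probs = decode_step(prev_token=5, context=rng.standard_normal(d))
```

Because the previous token feeds into every step, the decoder behaves like a built-in language model, which is exactly where CTC's independence assumption falls short.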
Attention Alignment Intuition
Unlike CTC, attention models learn soft alignments between audio frames and text.
For example:
When generating the word "support", the model focuses on the portion of audio where that word is spoken.
This alignment is learned automatically.
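Stacking the attention weights for every output token yields a soft alignment matrix, and taking the argmax of each row recovers a hard alignment. A toy example with hand-written weights (not learned):

```python
import numpy as np

# Rows: output tokens, columns: audio frames (hand-made toy weights)
alignment = np.array([
    [0.7, 0.2, 0.1, 0.0],   # token 0 attends mostly to frame 0
    [0.1, 0.6, 0.3, 0.0],   # token 1 attends mostly to frame 1
    [0.0, 0.1, 0.3, 0.6],   # token 2 attends mostly to frame 3
])
hard_alignment = alignment.argmax(axis=1)   # most-attended frame per token
print(hard_alignment)                        # [0 1 3]
```

Each row is a probability distribution over frames, which is why the alignment is called "soft": every frame gets some weight, rather than a single frame being chosen outright.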
Simple Attention Example (NumPy)
# Dot-product attention, assuming encoder_states has shape (frames, d)
# and decoder_state has shape (d,)
import numpy as np
attention_scores = encoder_states @ decoder_state            # one score per frame
attention_weights = np.exp(attention_scores - attention_scores.max())
attention_weights /= attention_weights.sum()                 # softmax
context = attention_weights @ encoder_states                 # weighted sum of states
Advantages of Attention-Based ASR
Attention models offer several benefits:
- Implicit language modeling through output history
- Flexible alignment
- Improved accuracy for long utterances
They naturally integrate acoustic and linguistic information.
Limitations of Attention Models
Despite their strengths, attention models have challenges:
- High latency
- Difficulty with streaming ASR
- Heavy computational cost
Because standard (global) attention must see the entire utterance before decoding, it is less suitable for real-time scenarios.
CTC vs Attention (High-Level)
Key differences:
- CTC assumes output independence
- Attention models use output history
- CTC works well for streaming
- Attention excels in accuracy and fluency
Many modern systems combine both approaches.
Real-World Usage
Attention-based ASR is widely used in:
- Offline transcription
- Meeting summarization
- High-accuracy captioning systems
They are especially strong when latency is not critical.
Practice
What mechanism allows models to focus on relevant input frames?
Which component processes audio into hidden representations?
Which component generates text tokens sequentially?
Quick Quiz
What does the attention mechanism compute for the decoder?
Attention models naturally improve which aspect of ASR?
Why are attention models less suitable for streaming ASR?
Recap: Attention models allow ASR systems to dynamically focus on relevant audio and use context.
Next up: You’ll learn how Transformers unify attention and dominate modern ASR systems.