Speech AI Lesson 17 – Attention Models | Dataplexa

Attention Models in Automatic Speech Recognition

In the previous lesson, you learned how CTC models solve the alignment problem without explicit labels.

However, CTC makes a strong conditional-independence assumption: given the input audio, each output token is predicted independently of the others.

This lesson introduces attention-based ASR models, which remove that limitation and allow models to focus on the most relevant parts of speech.

Why Attention Was Needed

Human listeners do not process speech frame by frame independently.

We naturally:

  • Remember previous words
  • Use context to resolve ambiguity
  • Focus on relevant sounds

CTC models cannot fully capture this behavior.

Attention models were designed to bring context awareness into ASR.

What Is Attention?

Attention is a mechanism that allows a model to dynamically focus on specific parts of the input when generating each output token.

Instead of processing the entire input equally, the model asks:

“Which part of the audio matters most right now?”

Sequence-to-Sequence ASR

Attention models are typically built using sequence-to-sequence architectures.

These models consist of:

  • An encoder
  • An attention mechanism
  • A decoder

Together, they map:

Audio → Text

The Encoder

The encoder processes the input audio features and converts them into a sequence of hidden representations.

It captures:

  • Phonetic patterns
  • Temporal information
  • Acoustic context

Encoders are commonly built using:

  • RNNs (LSTM / GRU)
  • CNN + RNN stacks
  • Transformers
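As a toy illustration of the encoder's job, a single recurrent (tanh RNN) layer can be sketched in NumPy. This is a minimal sketch, not a trained model: the weight matrices, feature dimension, and hidden size below are random, illustrative placeholders.

```python
import numpy as np

# Toy single-layer tanh RNN encoder; weights are random placeholders.
rng = np.random.default_rng(0)
T, n_feat, n_hid = 50, 80, 64                # frames, feature dim, hidden size

W_in = rng.normal(0, 0.1, (n_feat, n_hid))   # input-to-hidden weights (untrained)
W_rec = rng.normal(0, 0.1, (n_hid, n_hid))   # hidden-to-hidden weights (untrained)

features = rng.normal(size=(T, n_feat))      # stand-in for audio features
h = np.zeros(n_hid)
states = []
for x_t in features:
    h = np.tanh(x_t @ W_in + h @ W_rec)      # each state carries past context
    states.append(h)
encoder_states = np.stack(states)            # (T, n_hid): one vector per frame
```

The recurrence is what lets each hidden vector encode acoustic context, not just the current frame; real encoders replace this toy step with LSTM/GRU cells or Transformer layers.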

The Attention Mechanism

The attention mechanism connects the encoder and decoder.

For each output step, it:

  • Scores encoder states
  • Computes attention weights
  • Creates a context vector

This context vector represents the most relevant audio frames for the current output token.

The Decoder

The decoder generates text one token at a time.

Each output depends on:

  • The previous output token
  • The attention context vector

This allows the model to:

  • Use language context naturally
  • Resolve ambiguities
  • Produce fluent sequences
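The decoding loop described above can be sketched as follows. This is a minimal greedy-decoding sketch: `attend`, `decoder_step`, and `output_projection` are hypothetical helper functions standing in for the attention mechanism, the decoder cell, and the vocabulary projection, and are not calls from any specific library.

```python
import numpy as np

def greedy_decode(encoder_states, decoder_step, attend, output_projection,
                  start_token=0, end_token=1, max_len=100):
    """Greedy decoding: each step conditions on the previous token and a
    fresh attention context. The three callables are hypothetical helpers."""
    # Simplification: decoder state has the same size as the encoder states.
    state = np.zeros(encoder_states.shape[1])
    tokens = [start_token]
    for _ in range(max_len):
        context = attend(state, encoder_states)           # focus on relevant frames
        state = decoder_step(tokens[-1], context, state)  # update decoder state
        token = int(np.argmax(output_projection(state)))  # pick the best token
        if token == end_token:
            break
        tokens.append(token)
    return tokens[1:]                                     # drop the start token
```

Note how each step sees both the previous token (language context) and a new attention context (acoustic focus); this is exactly the dependency CTC lacks.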

Attention Alignment Intuition

Unlike CTC, attention models learn soft alignments between audio frames and text.

For example:

When generating the word "support", the model focuses on the portion of audio where that word is spoken.

This alignment is learned automatically.

Simple Attention Example (Conceptual)


# Pseudocode for one attention step
attention_scores = dot(encoder_states, decoder_state)   # one score per audio frame
attention_weights = softmax(attention_scores)           # non-negative, sum to 1
context = sum(attention_weights[t] * encoder_states[t] for t in frames)
  
Context vector computed from relevant audio frames
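The pseudocode becomes runnable with NumPy. The shapes and random values below are illustrative stand-ins, not real encoder or decoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_hid = 50, 64                              # illustrative shapes
encoder_states = rng.normal(size=(T, n_hid))   # stand-in: one vector per frame
decoder_state = rng.normal(size=n_hid)         # stand-in: current decoder state

scores = encoder_states @ decoder_state        # one score per frame, shape (T,)
weights = np.exp(scores - scores.max())        # numerically stable softmax
weights /= weights.sum()                       # weights sum to 1
context = weights @ encoder_states             # weighted sum of frames, (n_hid,)
```

Because the weights form a probability distribution over frames, the context vector is an average of encoder states dominated by the highest-scoring (most relevant) frames.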

Advantages of Attention-Based ASR

Attention models offer several benefits:

  • Better language modeling
  • Flexible alignment
  • Improved accuracy for long utterances

They naturally integrate acoustic and linguistic information.

Limitations of Attention Models

Despite their strengths, attention models have challenges:

  • High latency
  • Difficulty with streaming ASR
  • Heavy computational cost

Because standard (global) attention must see the entire utterance before decoding can begin, it is less suitable for real-time scenarios.

CTC vs Attention (High-Level)

Key differences:

  • CTC assumes output independence
  • Attention models use output history
  • CTC works well for streaming
  • Attention excels in accuracy and fluency

Many modern systems combine both approaches.

Real-World Usage

Attention-based ASR is widely used in:

  • Offline transcription
  • Meeting summarization
  • High-accuracy captioning systems

These models are especially strong when latency is not critical.

Practice

What mechanism allows models to focus on relevant input frames?



Which component processes audio into hidden representations?



Which component generates text tokens sequentially?



Quick Quiz

What does the attention mechanism compute for the decoder?





Attention models naturally improve which aspect of ASR?





Why are attention models less suitable for streaming ASR?





Recap: Attention models allow ASR systems to dynamically focus on relevant audio and use context.

Next up: You’ll learn how Transformers unify attention and dominate modern ASR systems.