Speech AI Course
Connectionist Temporal Classification (CTC Models)
In the previous lesson, you learned how deep learning enabled end-to-end ASR systems.
However, one major problem remained: how do we align audio frames with text when the alignment is unknown?
This lesson introduces Connectionist Temporal Classification (CTC), a breakthrough that made modern ASR practical.
The Alignment Problem in Speech Recognition
Speech audio is continuous, but text is discrete.
Example:
Audio duration: 3 seconds
Text: "speech ai"
We do not know:
- Which audio frames correspond to which letters
- How long each sound lasts
- Where word boundaries occur
Manually labeling these alignments is impractical at scale.
Why Traditional Alignment Fails
Earlier ASR systems relied on:
- Phoneme-level labels
- Forced alignment
- Handcrafted rules
These methods were:
- Expensive
- Error-prone
- Hard to scale
CTC was designed to remove this dependency entirely.
What Is CTC?
Connectionist Temporal Classification (CTC) is a loss function and decoding framework used to train sequence-to-sequence models without explicit alignment.
CTC allows a model to learn:
- Which symbols appear
- In what order
- Without knowing exact timing
Key Idea Behind CTC
CTC introduces a special symbol called the blank.
The blank represents:
- No output at a given time step
- Silence or transition
This allows the model to stretch characters over multiple frames.
CTC Output Example
Target text:
AI
Possible CTC output sequence:
_ A A _ I I _
After collapsing repeats and removing blanks:
AI
CTC Decoding Rules
CTC decoding follows two simple rules, applied in order:
- Collapse consecutive repeated characters
- Remove blank symbols
Because repeats are collapsed before blanks are removed, two identical characters separated by a blank both survive. This is how CTC can still output double letters, like the "ll" in "hello".
This transforms long frame-level outputs into final text.
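The two rules above fit in a few lines of Python. This is a minimal sketch; the function name and the `_` blank symbol are illustrative choices, not part of any standard API:

```python
def ctc_collapse(frames, blank="_"):
    """Apply CTC decoding rules: collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in frames:
        # Keep a symbol only if it differs from the previous frame
        # and is not the blank.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse(list("_AA_II_")))   # AI
print(ctc_collapse(list("_A_AI__")))   # AAI (the blank keeps the two A's separate)
```

Note the second call: repeats separated by a blank do not collapse, which is exactly what lets CTC produce words with double letters.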
Why CTC Works Well for Speech
CTC fits speech perfectly because:
- Speech length > text length
- Speech contains silence
- Speech timing varies
CTC naturally handles these properties.
CTC Model Architecture
A typical CTC-based ASR model contains:
- An encoder (CNN, RNN, or Transformer)
- A softmax output layer producing frame-level character probabilities
- CTC loss applied during training
No explicit decoder is required during training.
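The architecture above can be sketched as a small PyTorch module. The dimensions here (80 mel features, 29 output symbols including the blank, a BiLSTM encoder) are illustrative assumptions, not values from the lesson:

```python
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    """Sketch of a CTC acoustic model: encoder + frame-level log-probs."""
    def __init__(self, n_feats=80, hidden=256, n_symbols=29):
        super().__init__()
        # Encoder: a 2-layer bidirectional LSTM over acoustic features.
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        # Project each frame to a distribution over output symbols.
        self.proj = nn.Linear(2 * hidden, n_symbols)

    def forward(self, x):                # x: (batch, time, n_feats)
        h, _ = self.encoder(x)
        logits = self.proj(h)            # (batch, time, n_symbols)
        # nn.CTCLoss expects (time, batch, vocab) log-probabilities.
        return logits.log_softmax(-1).transpose(0, 1)

model = CTCModel()
x = torch.randn(2, 120, 80)              # batch of 2 utterances, 120 frames each
log_probs = model(x)
print(log_probs.shape)                   # torch.Size([120, 2, 29])
```

There is no autoregressive decoder here: the model emits one distribution per frame, and CTC handles the alignment.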
CTC Loss Intuition
CTC does not compare predictions to a single alignment.
Instead, it:
- Considers every valid alignment of the target
- Sums their probabilities (efficiently, via dynamic programming)
- Maximizes the total probability of the target text
This makes training alignment-free.
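On a tiny example this can be verified by brute force: enumerate every frame-level path, keep those that collapse to the target, sum their probabilities, and compare with `nn.CTCLoss`. The 3-frame, 3-symbol setup is a toy assumption for illustration:

```python
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)

T, C = 3, 3                          # 3 frames; vocab: {0: blank, 1: "A", 2: "I"}
log_probs = torch.randn(T, 1, C).log_softmax(2)
probs = log_probs.exp().squeeze(1)   # (T, C) frame-level probabilities

def collapse(path, blank=0):
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

target = [1, 2]                      # "AI"

# Sum the probability of every frame-level path that collapses
# to the target -- exactly the quantity the CTC loss computes.
total = sum(
    probs[torch.arange(T), torch.tensor(path)].prod().item()
    for path in itertools.product(range(C), repeat=T)
    if collapse(path) == target
)

# reduction="sum" gives the raw negative log-probability of the target.
ctc = nn.CTCLoss(blank=0, reduction="sum")
loss = ctc(log_probs, torch.tensor([target]),
           torch.tensor([T]), torch.tensor([len(target)]))

print(total, torch.exp(-loss).item())    # the two values agree
```

Real implementations never enumerate paths; the forward-backward dynamic program computes the same sum in polynomial time.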
CTC Training Example (PyTorch)
```python
import torch
import torch.nn as nn

# Blank symbol is index 0; indices 1..29 are the output characters.
ctc_loss = nn.CTCLoss(blank=0)

# (time, batch, vocab): 100 frames, batch size 1, 30 symbols.
log_probs = torch.randn(100, 1, 30).log_softmax(2)

targets = torch.tensor([1, 2, 3, 4], dtype=torch.long)  # target character ids
input_lengths = torch.tensor([100])    # number of frames per utterance
target_lengths = torch.tensor([4])     # number of characters per target

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```
CTC Decoding Strategies
During inference, CTC models use:
- Greedy decoding (fast, less accurate)
- Beam search decoding (slower, more accurate)
Beam search often integrates a language model for better results.
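Greedy decoding is simple enough to sketch directly: take the most likely symbol at each frame, then apply the collapse rules. The toy vocabulary and probabilities below are illustrative assumptions:

```python
import torch

def greedy_ctc_decode(log_probs, idx2char, blank=0):
    # log_probs: (time, vocab) frame-level log-probabilities
    best = log_probs.argmax(dim=-1).tolist()   # most likely symbol per frame
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:
            out.append(idx2char[s])
        prev = s
    return "".join(out)

# Toy vocabulary: index 0 is the blank.
idx2char = {1: "A", 2: "I"}

# Frame-level probabilities whose per-frame argmax spells _ A A _ I I _
frames = torch.log(torch.tensor([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]))
print(greedy_ctc_decode(frames, idx2char))   # AI
```

Beam search instead keeps the top-k partial transcripts at each frame, which is where an external language model can be scored in.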
Limitations of CTC
Despite its success, CTC has limitations:
- Weak internal language modeling
- A conditional independence assumption: each frame's output is predicted independently given the audio
- Difficulty modeling long-range context
These limitations motivated attention-based and transformer models.
Where CTC Is Used
CTC is widely used in:
- Streaming ASR
- Real-time transcription
- On-device speech recognition
Many production systems still rely on CTC.
Practice
What major problem does CTC solve in ASR?
What special symbol does CTC introduce?
Which loss function enables alignment-free training?
Quick Quiz
What does the blank symbol represent?
What is the first step in CTC decoding?
CTC is especially suitable for which ASR scenario?
Recap: CTC enables alignment-free ASR training using blank symbols and probabilistic decoding.
Next up: You’ll learn attention-based ASR models and how they differ fundamentally from CTC.