Speech AI Course
Traditional Automatic Speech Recognition
In the previous lesson, you learned what ASR is and how modern systems convert speech into text.
Before deep learning became dominant, speech recognition relied on traditional statistical models.
Understanding traditional ASR is important because:
- Many concepts still exist in modern systems
- It explains why ASR pipelines are structured the way they are
- Interviewers often ask about these foundations
What Is Traditional ASR?
Traditional ASR refers to speech recognition systems built using statistical modeling rather than end-to-end neural networks.
These systems were dominant for decades and powered early voice systems, call centers, and dictation software.
Unlike modern ASR, traditional systems were modular, with each component designed and optimized separately.
Classic ASR Pipeline
A traditional ASR system follows a strict pipeline:
Audio → Feature Extraction → Acoustic Model → Pronunciation Lexicon → Language Model → Decoder → Text
Each block solves a specific problem.
Feature Extraction in Traditional ASR
Feature extraction is one of the most important steps in traditional ASR.
Raw audio is converted into compact numerical representations that are easier to model statistically.
The most commonly used features were:
- MFCC (Mel-Frequency Cepstral Coefficients)
- Delta and Delta-Delta MFCCs
- PLP (Perceptual Linear Prediction)
These features were designed using human auditory perception principles.
```python
import librosa
import numpy as np

# Load audio at the 16 kHz sample rate typical for ASR
audio, sr = librosa.load("speech.wav", sr=16000)

# 13 MFCCs per frame -- the classic choice in traditional systems
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

# First- and second-order derivatives capture temporal dynamics
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Stack into the standard 39-dimensional feature vector per frame
features = np.vstack([mfcc, delta, delta2])
print(features.shape)  # (39, num_frames)
```
This feature vector becomes the input to the acoustic model.
Acoustic Model (HMM + GMM)
Traditional ASR acoustic models were built using:
- Hidden Markov Models (HMMs)
- Gaussian Mixture Models (GMMs)
Each phoneme was modeled as a short sequence of hidden states (typically three).
HMMs handled the time sequence, while GMMs modeled the probability distribution of acoustic features.
This separation made the system interpretable, but also complex.
Hidden Markov Models (HMM)
An HMM assumes:
- Speech is a sequence of hidden states
- Each state emits observable features
- State transitions follow probabilities
HMMs are well-suited for speech because speech is sequential and time-dependent.
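The core HMM computation, scoring how likely an observation sequence is under the model, can be sketched with the forward algorithm. All probabilities below are toy numbers chosen for illustration, not values from a real acoustic model:

```python
import numpy as np

# Toy HMM: 2 hidden states, 3 possible observed feature "symbols"
A = np.array([[0.7, 0.3],        # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # emission probabilities per state
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

obs = [0, 1, 2]                  # an observed symbol sequence

alpha = pi * B[:, obs[0]]        # initialize with the first observation
for t in obs[1:]:
    alpha = (alpha @ A) * B[:, t]    # propagate through transitions, then emit

print(alpha.sum())               # P(observation sequence | model)
```

Real systems work with continuous feature vectors (scored by GMMs, below) rather than discrete symbols, but the recursion is the same.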
Gaussian Mixture Models (GMM)
GMMs were used to model the probability of features given an HMM state.
Each state used multiple Gaussian distributions to capture feature variability.
However, GMMs struggled with:
- Complex feature distributions
- Making full use of very large datasets
- Highly non-linear patterns
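Conceptually, a GMM scores a feature value as a weighted sum of Gaussian densities. A minimal one-dimensional sketch with made-up weights, means, and variances (real acoustic models used many multivariate components per HMM state):

```python
import numpy as np

# Toy 2-component GMM (illustrative parameters, not trained on speech)
weights = np.array([0.6, 0.4])
means   = np.array([-1.0, 2.0])
stds    = np.array([0.5, 1.0])

def gmm_pdf(x):
    # Weighted sum of Gaussian densities evaluated at x
    comp = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(np.sum(weights * comp))

# Density is highest near the dominant component's mean
print(gmm_pdf(-1.0))
print(gmm_pdf(0.0))
```

In a full system, each HMM state owns one such mixture, and `gmm_pdf` plays the role of the emission probability in the forward recursion.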
Pronunciation Lexicon
Traditional ASR systems required a pronunciation dictionary.
The lexicon maps words to phoneme sequences.
Example:
HELLO HH AH L OW
SPEECH S P IY CH
AI EY AY
Maintaining lexicons was time-consuming, especially for large vocabularies.
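In code, a lexicon is essentially a lookup table from words to phoneme sequences. A minimal sketch using the example entries above (the function name is illustrative):

```python
# Word -> phoneme-sequence mapping, ARPAbet-style symbols as in the example
lexicon = {
    "HELLO":  ["HH", "AH", "L", "OW"],
    "SPEECH": ["S", "P", "IY", "CH"],
    "AI":     ["EY", "AY"],
}

def phonemes_for(word):
    # Out-of-vocabulary words return None -- a lexicon-based system
    # simply cannot recognize a word it has no entry for
    return lexicon.get(word.upper())

print(phonemes_for("speech"))  # ['S', 'P', 'IY', 'CH']
```

This lookup-table design is exactly why out-of-vocabulary words were such a problem: every recognizable word had to be entered by hand.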
Language Model in Traditional ASR
Language models provided probabilities for word sequences.
Traditional systems used N-gram language models, which estimate the probability of a word given the previous N − 1 words.
Example:
P("support" | "technical") > P("report" | "technical")
This helped reduce recognition errors.
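A maximum-likelihood bigram model can be estimated directly from counts. Here is a minimal sketch on a toy corpus (real systems applied smoothing, such as Kneser-Ney, to handle unseen word pairs):

```python
from collections import Counter

# Tiny toy corpus for illustration
corpus = "technical support is here please call technical support".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p("support", "technical"))  # 1.0 in this tiny corpus
```

With no smoothing, any bigram absent from the training data gets probability zero, which is why smoothing was essential in practice.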
Decoding Process
The decoder combines:
- Acoustic model scores
- Pronunciation probabilities
- Language model probabilities
Its goal is to find the most probable word sequence for the given audio.
Decoding was computationally expensive and required careful tuning.
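Traditional decoders searched for the best state path with variants of the Viterbi algorithm. A toy sketch over a 2-state HMM (all probabilities are made up; real decoders searched enormous graphs combining acoustic, lexicon, and language-model scores):

```python
import numpy as np

# Toy HMM: 2 states, 2 observed symbols (illustrative probabilities)
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # transitions
B  = np.array([[0.5, 0.1], [0.1, 0.6]])   # emissions
pi = np.array([0.6, 0.4])

obs = [0, 0, 1]

delta = pi * B[:, obs[0]]
backptr = []
for t in obs[1:]:
    scores = delta[:, None] * A           # score of moving state i -> j
    backptr.append(scores.argmax(axis=0)) # best predecessor for each state
    delta = scores.max(axis=0) * B[:, t]  # keep only the best path per state

# Trace back the most probable state sequence
state = int(delta.argmax())
path = [state]
for bp in reversed(backptr):
    state = int(bp[state])
    path.append(state)
path.reverse()
print(path)
```

Note how the first two observations favor state 0 and the last favors state 1; the decoder recovers exactly that path.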
Limitations of Traditional ASR
Despite their success, traditional ASR systems had major limitations:
- Complex pipelines
- Heavy manual tuning
- Separate optimization of components
- Poor scalability with large data
These limitations motivated the shift toward deep learning.
Why Traditional ASR Still Matters
Even today:
- Many decoding concepts come from HMMs
- Feature extraction ideas still apply
- Hybrid systems are still used
Understanding traditional ASR makes modern systems easier to grasp.
Practice
Which model handles temporal sequences in traditional ASR?
Which model estimates feature probability distributions?
What maps words to phoneme sequences?
Quick Quiz
How are traditional ASR systems best described?
Which language model was commonly used in traditional ASR?
Which step converts raw audio into numerical representations?
Recap: Traditional ASR systems used HMMs, GMMs, lexicons, and language models in a modular pipeline.
Next up: You’ll see how deep learning transformed ASR and replaced many traditional components.