Speech AI Course
Introduction to Automatic Speech Recognition (ASR)
So far, you have learned how speech is captured, processed, cleaned, represented, and evaluated.
Now we enter the heart of Speech AI: Automatic Speech Recognition (ASR).
ASR systems convert spoken language into written text. They power voice assistants, call centers, live captions, and transcription systems.
What Is Automatic Speech Recognition?
Automatic Speech Recognition (ASR) is the task of converting an audio signal containing speech into a sequence of words or characters.
In simple terms:
Speech → Text
Unlike humans, ASR systems must solve this problem using mathematics, probability, and machine learning.
Why ASR Is Difficult
At first glance, ASR seems simple — people speak, systems listen, text appears.
In reality, ASR is one of the hardest problems in AI.
Challenges include:
- Different accents and pronunciations
- Speaking speed variations
- Background noise and overlap
- Homophones (words that sound the same but differ in meaning)
- Unclear word boundaries
Humans solve these problems subconsciously. Machines must learn them from data.
High-Level ASR Pipeline
A typical ASR system follows a structured pipeline.
At a high level:
Audio → Features → Acoustic Model → Language Model → Text
Each stage plays a distinct role and introduces its own challenges.
Audio Input
The ASR pipeline starts with raw audio input.
Audio is usually:
- Mono channel
- 16 kHz sampling rate
- Normalized amplitude
Poor input quality at this stage limits everything that follows.
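The preparation steps above (mono channel, normalized amplitude) can be sketched with a small helper. The synthetic stereo tone below stands in for a real recording, and `prepare_audio` is a hypothetical name for illustration:

```python
import numpy as np

def prepare_audio(samples: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Downmix to mono and peak-normalize a waveform."""
    if samples.ndim == 2:                  # (channels, time) -> average channels
        samples = samples.mean(axis=0)
    max_abs = np.max(np.abs(samples))
    if max_abs > 0:                        # avoid dividing by zero on silence
        samples = samples * (peak / max_abs)
    return samples

# Synthetic one-second stereo signal standing in for a real recording
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
stereo = np.stack([0.3 * np.sin(2 * np.pi * 440 * t),
                   0.1 * np.sin(2 * np.pi * 220 * t)])

mono = prepare_audio(stereo)
print(mono.shape)  # (16000,) — one mono channel at 16 kHz
```

Resampling to 16 kHz is usually handled by the loading library (as `librosa.load(..., sr=16000)` does later in this lesson).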
Feature Extraction
Raw audio is not fed directly into models.
Instead, the signal is converted into compact representations called features.
Common ASR features include:
- MFCC (Mel-Frequency Cepstral Coefficients)
- Log-Mel Spectrograms
- Filter bank energies
These features capture phonetic information while reducing noise and redundancy.
import librosa

# Load speech at 16 kHz and extract 13 MFCCs per frame
audio, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)
Each column represents a short time frame, and each row represents one MFCC coefficient.
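Log-mel features, also listed above, can be computed from first principles: frame the signal, take the power spectrum, pool it into mel-spaced bands, and take the log. The sketch below uses a synthetic tone, and the frame sizes and band count are illustrative choices, not fixed standards:

```python
import numpy as np

def log_mel_energies(audio, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Frame the signal, take |FFT|^2, pool into mel bands, take the log."""
    # Frame into overlapping windows and apply a Hann window
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop: i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filter bank (mel scale compresses high frequencies)
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log(power @ fbank.T + 1e-10)   # (n_frames, n_mels)

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 300 * t)           # synthetic stand-in for speech
feats = log_mel_energies(tone, sr=sr)
print(feats.shape)  # (97, 40): 97 frames, 40 mel bands
```

In practice, libraries such as librosa (`librosa.feature.melspectrogram`) provide tuned implementations of this computation.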
Acoustic Model
The acoustic model maps audio features to speech sound units.
Historically, acoustic models predicted:
- Phonemes
- Context-dependent states
Modern deep learning models learn this mapping end-to-end.
Common acoustic model architectures include:
- Deep Neural Networks (DNNs)
- Recurrent Neural Networks (RNNs)
- Convolutional Neural Networks (CNNs)
- Transformers
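At its simplest, an acoustic model maps each feature frame to a probability distribution over sound units. The sketch below uses untrained random weights in place of a learned network, purely to show the shapes involved (13 MFCCs in, 40 phoneme posteriors out — both illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Convert raw scores to probabilities along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_features, n_phonemes, n_frames = 13, 40, 100

# Random weights stand in for a trained acoustic model
W = rng.normal(size=(n_features, n_phonemes))
b = np.zeros(n_phonemes)

features = rng.normal(size=(n_frames, n_features))   # e.g. MFCC frames
posteriors = softmax(features @ W + b)               # (frames, phonemes)

print(posteriors.shape)  # (100, 40): one distribution per frame
```

A real DNN, RNN, CNN, or Transformer replaces the single matrix multiply with many learned layers, but the input/output contract is the same.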
Language Model
The language model provides linguistic context.
It helps answer questions like:
- Which word sequence makes sense?
- Is “technical support” more likely than “technical report”?
Language models learn word patterns from large text corpora.
They significantly reduce ASR errors.
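A minimal language model counts word pairs (bigrams) in text. The toy corpus below is invented for illustration; real systems train on far larger corpora:

```python
from collections import Counter

# Tiny hypothetical corpus; real LMs use billions of words
corpus = ("call technical support "
          "contact technical support "
          "read the technical report").split()

bigrams = Counter(zip(corpus, corpus[1:]))   # count adjacent word pairs
unigrams = Counter(corpus[:-1])              # count first words of pairs

def bigram_prob(prev, word):
    """P(word | prev) estimated from counts."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("technical", "support"))   # 2/3 in this corpus
print(bigram_prob("technical", "report"))    # 1/3 in this corpus
```

Here "technical support" is twice as likely as "technical report" simply because it occurred twice as often, which is exactly the kind of evidence a decoder can use.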
Decoding
Decoding combines:
- Acoustic model probabilities
- Language model probabilities
The decoder searches for the most likely word sequence given the audio.
This step is computationally expensive and critical for accuracy.
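Conceptually, the decoder scores each candidate transcript by adding its acoustic log-probability to a weighted language-model log-probability and keeps the best. The scores below are made up for illustration:

```python
# Hypothetical candidates: (acoustic log-prob, language-model log-prob)
candidates = {
    "please connect me to technical support": (-12.1, -4.0),
    "please connect me to technical report":  (-11.8, -7.5),
}

def combined_score(acoustic_lp, lm_lp, lm_weight=1.0):
    """Interpolate acoustic and LM evidence in log space."""
    return acoustic_lp + lm_weight * lm_lp

best = max(candidates, key=lambda h: combined_score(*candidates[h]))
print(best)  # the LM tips the balance toward "technical support"
```

Real decoders cannot enumerate every transcript, so they use search algorithms such as beam search over a pruned space of hypotheses.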
End-to-End ASR Systems
Modern ASR systems often use end-to-end architectures.
These systems directly map:
Audio → Text
Examples include:
- CTC-based models
- Attention-based encoder–decoder models
- Transformer ASR models
End-to-end systems reduce pipeline complexity but require large datasets.
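CTC, the first approach listed above, lets the model emit one label per frame (including a special "blank") and then collapses the result: merge repeated labels, then drop blanks. A greedy-decoding sketch:

```python
def ctc_collapse(frame_labels, blank="_"):
    """Merge repeated labels, then drop blanks (greedy CTC decoding)."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# Per-frame best labels from a hypothetical CTC acoustic model
frames = list("hh_eee_ll_lloo__")
print(ctc_collapse(frames))  # "hello"
```

Note how the blank between the two "l" runs preserves the double letter: without it, "ll" would collapse to a single "l".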
ASR Output Example
audio_text = "please connect me to technical support"  # example decoder output
print(audio_text)
Even small recognition errors can change meaning significantly.
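That sensitivity to small errors is exactly what word error rate (WER) measures: the word-level edit distance between reference and hypothesis, normalized by reference length. A from-scratch sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

ref = "please connect me to technical support"
hyp = "please connect me to technical report"
print(round(word_error_rate(ref, hyp), 3))  # 0.167: one substitution in six words
```

A WER of 0.167 looks small, yet the single substitution ("support" → "report") routes the caller to the wrong place, which is why ASR evaluation considers meaning as well as raw error counts.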
Where ASR Is Used
ASR is used across many industries:
- Voice assistants
- Customer support automation
- Meeting transcription
- Accessibility tools
- Voice-controlled devices
Reliable ASR is foundational to most voice-driven applications.
Practice
What does ASR stand for?
What do we extract from audio before feeding it to ASR models?
Which ASR component helps choose meaningful word sequences?
Quick Quiz
What is the role of feature extraction in ASR?
Which component adds linguistic context to ASR?
What step combines probabilities to find the best text output?
Recap: ASR systems convert speech into text using features, acoustic models, language models, and decoding.
Next up: You’ll explore traditional ASR systems and how speech recognition worked before deep learning.