Speech AI Course
Building an End-to-End ASR Pipeline
Up to this point, you have learned ASR in pieces: models, decoding, multilingual support, APIs, and accuracy tuning.
In real-world systems, none of these live in isolation.
This lesson puts everything together and shows how engineers build a complete, production-ready ASR pipeline from audio input to final usable text.
What Is an ASR Pipeline?
An ASR pipeline is the full sequence of steps that converts raw audio into clean, usable text.
A typical pipeline contains:
- Audio capture
- Preprocessing
- Feature extraction
- Speech recognition
- Decoding and post-processing
- Delivery to downstream systems
Each step affects accuracy, latency, and reliability.
High-Level Pipeline Flow
At a high level, an ASR pipeline looks like this:
Microphone → Audio Processing → ASR Model → Text Output
Let’s now break this down step by step.
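Before walking through the steps, the whole flow can be sketched as a chain of functions. Everything below is a toy stand-in — the function names and bodies are placeholders, not a real library — and each later step of this lesson replaces one stub with real code.

```python
import numpy as np

# Toy stand-ins for each stage so the skeleton runs end to end.
def capture(path=None):
    sr = 16000
    t = np.linspace(0, 1, sr, endpoint=False)
    return 0.5 * np.sin(2 * np.pi * 220 * t), sr   # 1 s of synthetic audio

def preprocess(audio):
    return audio / np.max(np.abs(audio))           # peak-normalize

def extract_features(audio, sr, n_mels=80, hop=160):
    return np.zeros((n_mels, len(audio) // hop))   # placeholder features

def asr_model(features):
    return np.random.randn(features.shape[1], 30)  # fake frame-level logits

def decode(logits):
    return "hello world"                           # placeholder decoder

def postprocess(text):
    return text.capitalize() + "."

def run_pipeline(path=None):
    audio, sr = capture(path)
    features = extract_features(preprocess(audio), sr)
    return postprocess(decode(asr_model(features)))

print(run_pipeline())
```

Keeping each stage behind its own function boundary like this makes it easy to swap implementations later without touching the rest of the pipeline.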
Step 1: Audio Capture
Everything starts with audio input.
Audio may come from:
- Microphones
- Uploaded audio files
- Phone calls
- Live streams
If the audio quality is poor, no downstream ASR model can fully recover what was said.
Why This Code Exists
This code represents how audio is captured and normalized before further processing.
import soundfile as sf
audio, sr = sf.read("input.wav")
print("Sample rate:", sr)
What happens here:
- Raw waveform is loaded
- Sampling rate is verified
Why this matters:
Most ASR models expect a fixed sample rate, commonly 16 kHz. A mismatch between the input rate and the rate the model was trained on causes accuracy loss.
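The usual fix is resampling to the model's expected rate. The helper below is a hypothetical sketch that uses only NumPy linear interpolation; production pipelines normally use `librosa.resample` or a polyphase filter such as `scipy.signal.resample_poly` instead.

```python
import numpy as np

TARGET_SR = 16000

def resample_linear(audio, sr, target_sr):
    # Simple linear-interpolation resampler (a sketch; prefer a
    # polyphase resampler in production for anti-aliasing).
    if sr == target_sr:
        return audio
    duration = len(audio) / sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / sr
    new_t = np.arange(n_out) / target_sr
    return np.interp(new_t, old_t, audio)

audio = np.sin(2 * np.pi * 220 * np.arange(44100) / 44100)  # 1 s at 44.1 kHz
audio16k = resample_linear(audio, 44100, TARGET_SR)
print(audio16k.shape)  # → (16000,)
```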
Step 2: Audio Preprocessing
Raw audio often contains:
- Silence
- Background noise
- Volume variations
Preprocessing improves signal consistency.
Why This Code Exists
This code normalizes audio amplitude to make loudness consistent across samples.
import numpy as np
peak = np.max(np.abs(audio))
if peak > 0:  # avoid dividing by zero on an all-silent input
    audio = audio / peak
What this does:
- Scales the waveform so its peak amplitude is 1.0
- Keeps samples within [-1, 1], avoiding clipping
Why this improves ASR:
Models perform better when loudness is consistent.
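Leading and trailing silence also wastes compute and can hurt accuracy. The `trim_silence` helper below is a hypothetical energy-threshold sketch; `librosa.effects.trim` provides a more robust, dB-based version of the same idea.

```python
import numpy as np

def trim_silence(audio, threshold=0.01):
    # Drop leading/trailing samples whose amplitude is below threshold.
    voiced = np.where(np.abs(audio) > threshold)[0]
    if len(voiced) == 0:
        return audio                      # all silence: leave unchanged
    return audio[voiced[0]:voiced[-1] + 1]

# 1000 silent samples, 500 voiced samples, 1000 silent samples
audio = np.concatenate([np.zeros(1000), 0.5 * np.ones(500), np.zeros(1000)])
print(len(trim_silence(audio)))  # → 500
```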
Step 3: Feature Extraction
ASR models do not work directly on raw audio.
They rely on compact representations that preserve speech characteristics.
Why This Code Exists
This code converts audio into log-Mel spectrograms, the standard ASR feature representation.
import librosa
import numpy as np
mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_mels=80
)
features = np.log(mel + 1e-9)
What is happening internally:
- Audio is converted to frequency space
- The Mel scale, which approximates human hearing, is applied
- Log scaling stabilizes training and inference
Why this is critical:
Good features make patterns easier for the model to learn.
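As a sanity check on the feature shape, you can predict the number of frames from the hop length (assuming librosa's defaults of `hop_length=512` with centered framing):

```python
# With centered framing (librosa's default), the frame count is:
#   n_frames = 1 + len(audio) // hop_length
sr = 16000
hop_length = 512
audio_len = 5 * sr                      # 5 seconds of 16 kHz audio

n_frames = 1 + audio_len // hop_length
print(n_frames)  # → 157
```

So for 5 seconds of audio with `n_mels=80`, expect a feature matrix of shape (80, 157). Checking this early catches sample-rate and framing mistakes before they reach the model.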
Step 4: ASR Model Inference
This is where speech becomes text.
The ASR model:
- Processes acoustic features
- Predicts token probabilities
Why This Code Exists
This example shows how a trained ASR model produces predictions; asr_model stands in for your loaded model object (for example, a PyTorch module).
logits = asr_model(features)
print(logits.shape)
What this produces:
- Frame-level probabilities
- Multiple candidate tokens
Why this matters:
Raw logits are not text yet. They must be decoded.
Step 5: Decoding
Decoding converts probabilities into readable text.
This step applies:
- CTC decoding or attention decoding
- Language model scoring
Why This Code Exists
This code applies beam search decoding to select the best transcription; beam_search_decode stands in for your decoder implementation.
text = beam_search_decode(
    logits,
    beam_width=5
)
print(text)
What happens here:
- Multiple hypotheses are evaluated
- Language context is considered
- The most likely sentence is selected
Why decoding is crucial:
Many practical ASR errors are introduced — or fixed — at this stage: better decoding and language model scoring often improve accuracy without retraining the acoustic model.
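To make decoding concrete, here is the simplest decoder: greedy (best-path) CTC decoding, which takes the most likely token per frame, collapses consecutive repeats, and drops the blank token. Beam search generalizes this by keeping several hypotheses per frame. The vocabulary and logits below are toy values:

```python
import numpy as np

def greedy_ctc_decode(logits, vocab, blank_id=0):
    ids = logits.argmax(axis=-1)                 # best token per frame
    # Collapse consecutive repeats, then remove the blank token.
    collapsed = [i for i, prev in zip(ids, [None] + list(ids[:-1])) if i != prev]
    return "".join(vocab[i] for i in collapsed if i != blank_id)

vocab = ["_", "c", "a", "t"]                     # "_" is the CTC blank
# Frame-level scores favouring the sequence: c c a _ t t
logits = np.array([
    [0.1, 0.9, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.9],
    [0.1, 0.0, 0.0, 0.9],
])
print(greedy_ctc_decode(logits, vocab))          # → cat
```

Note how the blank between "a" and the two "t" frames is what lets CTC distinguish repeated output characters from repeated frames.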
Step 6: Post-Processing
Raw transcription often needs cleanup.
Post-processing may include:
- Punctuation restoration
- Capitalization
- Number formatting
Why This Code Exists
This code formats the transcription for downstream consumption; format_text stands in for your formatting logic.
final_text = format_text(text)
print(final_text)
Why this matters:
Users care about readability, not raw output.
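A minimal rule-based sketch of what such a formatter might do — hypothetical throughout: capitalization, terminal punctuation, and naive spelled-out-number conversion. Production systems use dedicated punctuation-restoration and inverse-text-normalization models instead of rules like these.

```python
def format_text(text):
    # Naive spelled-out-number conversion (illustrative only).
    number_words = {"one": "1", "two": "2", "three": "3", "four": "4",
                    "five": "5", "six": "6", "seven": "7", "eight": "8",
                    "nine": "9", "ten": "10"}
    text = " ".join(number_words.get(w, w) for w in text.split())
    text = text[:1].upper() + text[1:]          # capitalize first letter
    if text and text[-1] not in ".?!":
        text += "."                              # terminal punctuation
    return text

print(format_text("meet me at ten tomorrow"))    # → Meet me at 10 tomorrow.
```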
Putting It All Together
An end-to-end ASR pipeline is only as strong as its weakest step.
Production engineers focus on:
- Monitoring failures
- Logging errors
- Continuously improving data and decoding
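One lightweight way to get that monitoring is to wrap each stage with timing and error logging. The run_stage helper below is an illustrative pattern, not a specific library:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("asr_pipeline")

def run_stage(name, fn, *args):
    # Time the stage and log failures with a full traceback before
    # re-raising, so every pipeline error is attributable to a stage.
    start = time.perf_counter()
    try:
        result = fn(*args)
    except Exception:
        log.exception("stage %s failed", name)
        raise
    log.info("stage %s took %.1f ms", name, (time.perf_counter() - start) * 1e3)
    return result

# Example: wrap a toy preprocessing stage
normalized = run_stage("preprocess", lambda a: [x / 2 for x in a], [0.2, 0.4])
print(normalized)  # → [0.1, 0.2]
```

Per-stage logs like these make it straightforward to spot which step is the latency or failure hotspot in production.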
Practice
What do we call the full process from audio to text?
What representation is used before ASR model inference?
Which step converts probabilities into text?
Quick Quiz
Which step prepares audio for the ASR model?
Which decoding method improves accuracy?
Which step improves text readability?
Recap: An end-to-end ASR pipeline connects audio capture, preprocessing, modeling, decoding, and post-processing into a reliable production system.
Next up: You’ll move from recognition to generation — Speech Synthesis and Text-to-Speech.