Speech AI Course
Building an End-to-End ASR Pipeline
Up to this point, you have learned ASR in pieces: models, decoding, multilingual support, APIs, and accuracy tuning.
In real-world systems, none of these live in isolation.
This lesson puts everything together and shows how engineers build a complete, production-ready ASR pipeline from audio input to final usable text.
What Is an ASR Pipeline?
An ASR pipeline is the full sequence of steps that converts raw audio into clean, usable text.
A typical pipeline contains:
- Audio capture
- Preprocessing
- Feature extraction
- Speech recognition
- Decoding and post-processing
- Delivery to downstream systems
Each step affects accuracy, latency, and reliability.
High-Level Pipeline Flow
At a high level, an ASR pipeline looks like this:
Microphone → Audio Processing → ASR Model → Text Output
Let’s now break this down step by step.
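Before walking through the steps, the whole flow can be sketched as a chain of functions. Everything below is a toy stand-in — the function names and bodies are placeholders, not a real library — and each later step of this lesson replaces one stub with real code.

```python
import numpy as np

# Toy stand-ins for each stage so the skeleton runs end to end.
def capture(path=None):
    sr = 16000
    t = np.linspace(0, 1, sr, endpoint=False)
    return 0.5 * np.sin(2 * np.pi * 220 * t), sr   # 1 s of synthetic audio

def preprocess(audio):
    return audio / np.max(np.abs(audio))           # peak-normalize

def extract_features(audio, sr, n_mels=80, hop=160):
    return np.zeros((n_mels, len(audio) // hop))   # placeholder features

def asr_model(features):
    return np.random.randn(features.shape[1], 30)  # fake frame-level logits

def decode(logits):
    return "hello world"                           # placeholder decoder

def postprocess(text):
    return text.capitalize() + "."

def run_pipeline(path=None):
    audio, sr = capture(path)
    features = extract_features(preprocess(audio), sr)
    return postprocess(decode(asr_model(features)))

print(run_pipeline())
```

Keeping each stage behind its own function boundary like this makes it easy to swap implementations later without touching the rest of the pipeline.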
Step 1: Audio Capture
Everything starts with audio input.
Audio may come from:
- Microphones
- Uploaded audio files
- Phone calls
- Live streams
If the audio quality is poor, no downstream ASR model can fully recover what was said.
Why This Code Exists
This code represents how audio is captured and normalized before further processing.
import soundfile as sf
audio, sr = sf.read("input.wav")
print("Sample rate:", sr)
What happens here:
- Raw waveform is loaded
- Sampling rate is verified
Why this matters:
Most ASR models expect a fixed sample rate, commonly 16 kHz. A mismatch between the input rate and the rate the model was trained on causes accuracy loss.
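The usual fix is resampling to the model's expected rate. The helper below is a hypothetical sketch that uses only NumPy linear interpolation; production pipelines normally use `librosa.resample` or a polyphase filter such as `scipy.signal.resample_poly` instead.

```python
import numpy as np

TARGET_SR = 16000

def resample_linear(audio, sr, target_sr):
    # Simple linear-interpolation resampler (a sketch; prefer a
    # polyphase resampler in production for anti-aliasing).
    if sr == target_sr:
        return audio
    duration = len(audio) / sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / sr
    new_t = np.arange(n_out) / target_sr
    return np.interp(new_t, old_t, audio)

audio = np.sin(2 * np.pi * 220 * np.arange(44100) / 44100)  # 1 s at 44.1 kHz
audio16k = resample_linear(audio, 44100, TARGET_SR)
print(audio16k.shape)  # → (16000,)
```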
Step 2: Audio Preprocessing
Raw audio often contains:
- Silence
- Background noise
- Volume variations
Preprocessing improves signal consistency.
Why This Code Exists
This code normalizes audio amplitude to make loudness consistent across samples.
import numpy as np
peak = np.max(np.abs(audio))
if peak > 0:  # avoid dividing by zero on an all-silent input
    audio = audio / peak
What this does:
- Scales the waveform so its peak amplitude is 1.0
- Keeps samples within [-1, 1], avoiding clipping
Why this improves ASR:
Models perform better when loudness is consistent.
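Leading and trailing silence also wastes compute and can hurt accuracy. The `trim_silence` helper below is a hypothetical energy-threshold sketch; `librosa.effects.trim` provides a more robust, dB-based version of the same idea.

```python
import numpy as np

def trim_silence(audio, threshold=0.01):
    # Drop leading/trailing samples whose amplitude is below threshold.
    voiced = np.where(np.abs(audio) > threshold)[0]
    if len(voiced) == 0:
        return audio                      # all silence: leave unchanged
    return audio[voiced[0]:voiced[-1] + 1]

# 1000 silent samples, 500 voiced samples, 1000 silent samples
audio = np.concatenate([np.zeros(1000), 0.5 * np.ones(500), np.zeros(1000)])
print(len(trim_silence(audio)))  # → 500
```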
Step 3: Feature Extraction
ASR models do not work directly on raw audio.
They rely on compact representations that preserve speech characteristics.
Why This Code Exists
This code converts audio into log-Mel spectrograms, the standard ASR feature representation.
import librosa
import numpy as np
mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_mels=80
)
features = np.log(mel + 1e-9)
What is happening internally:
- Audio is converted to frequency space
- The Mel scale, which approximates human hearing, is applied
- Log scaling stabilizes training and inference
Why this is critical:
Good features make patterns easier for the model to learn.
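As a sanity check on the feature shape, you can predict the number of frames from the hop length (assuming librosa's defaults of `hop_length=512` with centered framing):

```python
# With centered framing (librosa's default), the frame count is:
#   n_frames = 1 + len(audio) // hop_length
sr = 16000
hop_length = 512
audio_len = 5 * sr                      # 5 seconds of 16 kHz audio

n_frames = 1 + audio_len // hop_length
print(n_frames)  # → 157
```

So for 5 seconds of audio with `n_mels=80`, expect a feature matrix of shape (80, 157). Checking this early catches sample-rate and framing mistakes before they reach the model.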
Step 4: ASR Model Inference
This is where speech becomes text.
The ASR model:
- Processes acoustic features
- Predicts token probabilities
Why This Code Exists
This example shows how a trained ASR model produces predictions; asr_model stands in for your loaded model object (for example, a PyTorch module).
logits = asr_model(features)
print(logits.shape)
What this produces:
- Frame-level probabilities
- Multiple candidate tokens
Why this matters:
Raw logits are not text yet. They must be decoded.
Step 5: Decoding
Decoding converts probabilities into readable text.
This step applies:
- CTC decoding or attention decoding
- Language model scoring
Why This Code Exists
This code applies beam search decoding to select the best transcription; beam_search_decode stands in for your decoder implementation.
text = beam_search_decode(
    logits,
    beam_width=5
)
print(text)
What happens here:
- Multiple hypotheses are evaluated
- Language context is considered
- The most likely sentence is selected
Why decoding is crucial:
Many practical ASR errors are introduced — or fixed — at this stage: better decoding and language model scoring often improve accuracy without retraining the acoustic model.
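To make decoding concrete, here is the simplest decoder: greedy (best-path) CTC decoding, which takes the most likely token per frame, collapses consecutive repeats, and drops the blank token. Beam search generalizes this by keeping several hypotheses per frame. The vocabulary and logits below are toy values:

```python
import numpy as np

def greedy_ctc_decode(logits, vocab, blank_id=0):
    ids = logits.argmax(axis=-1)                 # best token per frame
    # Collapse consecutive repeats, then remove the blank token.
    collapsed = [i for i, prev in zip(ids, [None] + list(ids[:-1])) if i != prev]
    return "".join(vocab[i] for i in collapsed if i != blank_id)

vocab = ["_", "c", "a", "t"]                     # "_" is the CTC blank
# Frame-level scores favouring the sequence: c c a _ t t
logits = np.array([
    [0.1, 0.9, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.9],
    [0.1, 0.0, 0.0, 0.9],
])
print(greedy_ctc_decode(logits, vocab))          # → cat
```

Note how the blank between "a" and the two "t" frames is what lets CTC distinguish repeated output characters from repeated frames.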
Step 6: Post-Processing
Raw transcription often needs cleanup.
Post-processing may include:
- Punctuation restoration
- Capitalization
- Number formatting
Why This Code Exists
This code formats the transcription for downstream consumption; format_text stands in for your formatting logic.
final_text = format_text(text)
print(final_text)
Why this matters:
Users care about readability, not raw output.
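A minimal rule-based sketch of what such a formatter might do — hypothetical throughout: capitalization, terminal punctuation, and naive spelled-out-number conversion. Production systems use dedicated punctuation-restoration and inverse-text-normalization models instead of rules like these.

```python
def format_text(text):
    # Naive spelled-out-number conversion (illustrative only).
    number_words = {"one": "1", "two": "2", "three": "3", "four": "4",
                    "five": "5", "six": "6", "seven": "7", "eight": "8",
                    "nine": "9", "ten": "10"}
    text = " ".join(number_words.get(w, w) for w in text.split())
    text = text[:1].upper() + text[1:]          # capitalize first letter
    if text and text[-1] not in ".?!":
        text += "."                              # terminal punctuation
    return text

print(format_text("meet me at ten tomorrow"))    # → Meet me at 10 tomorrow.
```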
Putting It All Together
An end-to-end ASR pipeline is only as strong as its weakest step.
Production engineers focus on:
- Monitoring failures
- Logging errors
- Continuously improving data and decoding
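One lightweight way to get that monitoring is to wrap each stage with timing and error logging. The run_stage helper below is an illustrative pattern, not a specific library:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("asr_pipeline")

def run_stage(name, fn, *args):
    # Time the stage and log failures with a full traceback before
    # re-raising, so every pipeline error is attributable to a stage.
    start = time.perf_counter()
    try:
        result = fn(*args)
    except Exception:
        log.exception("stage %s failed", name)
        raise
    log.info("stage %s took %.1f ms", name, (time.perf_counter() - start) * 1e3)
    return result

# Example: wrap a toy preprocessing stage
normalized = run_stage("preprocess", lambda a: [x / 2 for x in a], [0.2, 0.4])
print(normalized)  # → [0.1, 0.2]
```

Per-stage logs like these make it straightforward to spot which step is the latency or failure hotspot in production.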
Practice
What do we call the full process from audio to text?
What representation is used before ASR model inference?
Which step converts probabilities into text?
Quick Quiz
Which step prepares audio for the ASR model?
Which decoding method improves accuracy?
Which step improves text readability?
Recap: An end-to-end ASR pipeline connects audio capture, preprocessing, modeling, decoding, and post-processing into a reliable production system.
Next up: You’ll move from recognition to generation — Speech Synthesis and Text-to-Speech.