Speech AI Lesson 25 – Building an ASR Pipeline | Dataplexa

Building an End-to-End ASR Pipeline

Up to this point, you have learned ASR in pieces: models, decoding, multilingual support, APIs, and accuracy tuning.

In real-world systems, none of these live in isolation.

This lesson puts everything together and shows how engineers build a complete, production-ready ASR pipeline from audio input to final usable text.

What Is an ASR Pipeline?

An ASR pipeline is the full sequence of steps that converts raw audio into clean, usable text.

A typical pipeline contains:

  • Audio capture
  • Preprocessing
  • Feature extraction
  • Speech recognition
  • Decoding and post-processing
  • Delivery to downstream systems

Each step affects accuracy, latency, and reliability.

High-Level Pipeline Flow

At a high level, an ASR pipeline looks like this:

Microphone → Audio Processing → ASR Model → Text Output

Let’s now break this down step by step.

Step 1: Audio Capture

Everything starts with audio input.

Audio may come from:

  • Microphones
  • Uploaded audio files
  • Phone calls
  • Live streams

If the input audio is poor, no ASR model can fully recover the lost information.

Why This Code Exists

This code represents how audio is captured and normalized before further processing.


import soundfile as sf

audio, sr = sf.read("input.wav")
print("Sample rate:", sr)
  

What happens here:

  • Raw waveform is loaded
  • Sampling rate is verified

Example output:

Sample rate: 16000

Why this matters:

Most ASR models expect a fixed sample rate, commonly 16 kHz. A mismatch between the audio's rate and the rate the model was trained on degrades accuracy.
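When the input rate does not match, the audio must be resampled before anything else. The sketch below uses simple linear interpolation so it stays self-contained; a real pipeline would use a proper polyphase resampler such as librosa.resample. The function name resample_linear is ours, not a library API.

import numpy as np

TARGET_SR = 16000  # rate assumed by the models in this lesson

def resample_linear(audio, sr, target_sr=TARGET_SR):
    """Resample by linear interpolation (sketch only; real pipelines
    use a proper resampler such as librosa.resample)."""
    if sr == target_sr:
        return audio, sr
    n_target = int(round(len(audio) * target_sr / sr))
    old_t = np.arange(len(audio)) / sr
    new_t = np.arange(n_target) / target_sr
    return np.interp(new_t, old_t, audio), target_sr

# Example: a one-second 440 Hz tone recorded at 44.1 kHz
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

audio, sr = resample_linear(tone, sr)
print(sr, len(audio))  # 16000 16000

One second of audio comes out as exactly 16,000 samples, which is what the rest of the pipeline assumes.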

Step 2: Audio Preprocessing

Raw audio often contains:

  • Silence
  • Background noise
  • Volume variations

Preprocessing improves signal consistency.

Why This Code Exists

This code normalizes audio amplitude to make loudness consistent across samples.


import numpy as np

# Avoid dividing by zero on silent input
peak = np.max(np.abs(audio))
audio = audio / peak if peak > 0 else audio
print("Audio normalized")
  

What this does:

  • Scales waveform values
  • Prevents clipping

Example output:

Audio normalized

Why this improves ASR:

Models perform better when loudness is consistent.
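Silence removal is another common preprocessing step. The sketch below trims leading and trailing low-energy frames using a fixed RMS threshold; the function trim_silence and its threshold are illustrative choices, not a standard API (production systems typically use a trained voice-activity detector or librosa.effects.trim).

import numpy as np

def trim_silence(audio, frame_len=512, threshold=0.01):
    """Drop leading/trailing frames whose RMS energy is below threshold."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    voiced = np.nonzero(rms >= threshold)[0]
    if voiced.size == 0:
        return audio[:0]  # everything is silence
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return audio[start:end]

# Example: half a second of silence followed by a one-second tone
sr = 16000
silence = np.zeros(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
trimmed = trim_silence(np.concatenate([silence, tone]))
print(len(trimmed))  # roughly the length of the tone alone

Trimming silence reduces compute and keeps the model from attending to empty frames.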

Step 3: Feature Extraction

ASR models do not work directly on raw audio.

They rely on compact representations that preserve speech characteristics.

Why This Code Exists

This code converts audio into log-Mel spectrograms, the standard ASR feature representation.


import librosa
import numpy as np

mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_mels=80
)

features = np.log(mel + 1e-9)
print("Feature matrix shape:", features.shape)
  

What is happening internally:

  • Audio is converted to frequency space
  • Human hearing scale is applied
  • Log scaling stabilizes training and inference

Example output:

Feature matrix shape: (80, time_steps)

Why this is critical:

Good features make patterns easier for the model to learn.

Step 4: ASR Model Inference

This is where speech becomes text.

The ASR model:

  • Processes acoustic features
  • Predicts token probabilities

Why This Code Exists

This example shows how a trained ASR model is used to generate predictions.


# asr_model stands in for any trained acoustic model
logits = asr_model(features)
print(logits.shape)
  

What this produces:

  • Frame-level probabilities
  • Multiple candidate tokens

Example output (time_frames, vocab_size):

(120, 32)

Why this matters:

Raw logits are not text yet. They must be decoded.
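Before decoding, logits are usually turned into per-frame probability distributions with a softmax. A minimal numpy sketch, using a toy 3-frame, 4-token matrix in place of real model output:

import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Toy logits: 3 frames, vocabulary of 4 tokens
logits = np.array([[2.0, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 2.5]])
probs = softmax(logits)
print(probs.shape)        # (3, 4)
print(probs.sum(axis=1))  # each frame's probabilities sum to 1

Each row is now a proper probability distribution over tokens, which is what the decoder consumes.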

Step 5: Decoding

Decoding converts probabilities into readable text.

This step applies:

  • CTC decoding or attention decoding
  • Language model scoring

Why This Code Exists

This code applies beam search decoding to select the best transcription.


# beam_search_decode stands in for a decoder such as a CTC beam search
text = beam_search_decode(
    logits,
    beam_width=5
)

print(text)
  

What happens here:

  • Multiple hypotheses are evaluated
  • Language context is considered
  • The most likely sentence is selected

Example output:

Please schedule the meeting tomorrow.

Why decoding is crucial:

Many recognition errors are introduced at this stage rather than by the acoustic model, so decoding quality matters as much as the model itself.
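To see what decoding does in its simplest form, here is a greedy CTC decode: pick the most likely token per frame, collapse repeats, and drop blanks. The blank id and the toy vocabulary below are assumptions for this sketch; beam search extends the same idea by tracking multiple hypotheses.

import numpy as np

BLANK = 0  # conventional CTC blank token id (assumption for this sketch)

def ctc_greedy_decode(logits, id_to_char):
    """Pick the best token per frame, collapse repeats, drop blanks."""
    best = np.argmax(logits, axis=-1)
    tokens = []
    prev = None
    for t in best:
        if t != prev and t != BLANK:
            tokens.append(t)
        prev = t
    return "".join(id_to_char[t] for t in tokens)

# Toy example: blank=0, 1='c', 2='a', 3='t'
id_to_char = {1: "c", 2: "a", 3: "t"}
frames = [1, 1, 0, 2, 2, 2, 0, 3]
logits = np.eye(4)[frames]  # one-hot "probabilities" per frame
print(ctc_greedy_decode(logits, id_to_char))  # cat

Eight frames collapse to three characters: repeated tokens merge and blanks vanish, which is exactly why raw frame counts do not equal text length.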

Step 6: Post-Processing

Raw transcription often needs cleanup.

Post-processing may include:

  • Punctuation restoration
  • Capitalization
  • Number formatting

Why This Code Exists

This code formats transcription for downstream consumption.


# format_text stands in for the pipeline's text-formatting routine
final_text = format_text(text)
print(final_text)
  
Example output:

Please schedule the meeting tomorrow.

Why this matters:

Users care about readability, not raw output.
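A minimal sketch of what such a format_text routine might do, assuming only rule-based capitalization and final punctuation; real systems use trained punctuation-restoration and inverse-text-normalization models.

def format_text(text: str) -> str:
    """Capitalize the first letter and ensure terminal punctuation.
    Rule-based sketch only; production systems use trained models."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".?!":
        text += "."
    return text

print(format_text("please schedule the meeting tomorrow"))
# Please schedule the meeting tomorrow.

Even these two rules noticeably improve how transcripts read to end users.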

Putting It All Together

An end-to-end ASR pipeline is only as strong as its weakest step.

Production engineers focus on:

  • Monitoring failures
  • Logging errors
  • Continuously improving data and decoding
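The steps above can be wired together as a single function whose stages are swappable. Everything below is a hypothetical sketch: transcribe, extract_features, and the fake_* stand-ins are ours, used only to show the orchestration shape, not a real recognizer.

import numpy as np

def normalize(audio):
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

def extract_features(audio, frame_len=400):
    # Stand-in for log-Mel extraction: just frame the signal
    n = len(audio) // frame_len
    return audio[: n * frame_len].reshape(n, frame_len)

def transcribe(audio, recognize, decode, postprocess):
    """Run the full pipeline; each stage is injected so it can be swapped."""
    audio = normalize(audio)
    features = extract_features(audio)
    logits = recognize(features)
    text = decode(logits)
    return postprocess(text)

# Toy stand-ins so the wiring can run end to end
fake_recognize = lambda feats: np.zeros((len(feats), 32))
fake_decode = lambda logits: "please schedule the meeting tomorrow"
fake_post = lambda t: t[0].upper() + t[1:] + "."

audio = np.random.randn(16000)
print(transcribe(audio, fake_recognize, fake_decode, fake_post))
# Please schedule the meeting tomorrow.

Passing the stages in as arguments makes each one independently testable and replaceable, which is what lets production teams monitor, log, and improve steps without rewriting the whole pipeline.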

Practice

What do we call the full process from audio to text?



What representation is used before ASR model inference?



Which step converts probabilities into text?



Quick Quiz

Which step prepares audio for the ASR model?





Which decoding method improves accuracy?





Which step improves text readability?





Recap: An end-to-end ASR pipeline connects audio capture, preprocessing, modeling, decoding, and post-processing into a reliable production system.

Next up: You’ll move from recognition to generation — Speech Synthesis and Text-to-Speech.