Speech AI Lesson 13 – Introduction to ASR | Dataplexa

Introduction to Automatic Speech Recognition (ASR)

So far, you have learned how speech is captured, processed, cleaned, represented, and evaluated.

Now we enter the heart of Speech AI: Automatic Speech Recognition (ASR).

ASR systems convert spoken language into written text. They power voice assistants, call centers, live captions, and transcription systems.

What Is Automatic Speech Recognition?

Automatic Speech Recognition (ASR) is the task of converting an audio signal containing speech into a sequence of words or characters.

In simple terms:

Speech → Text

Unlike humans, ASR systems must solve this problem using mathematics, probability, and machine learning.

Why ASR Is Difficult

At first glance, ASR seems simple — people speak, systems listen, text appears.

In reality, ASR is one of the hardest problems in AI.

Challenges include:

  • Different accents and pronunciations
  • Speaking speed variations
  • Background noise and overlap
  • Homophones (words that sound the same but mean different things)
  • Unclear word boundaries

Humans solve these problems subconsciously. Machines must learn them from data.

High-Level ASR Pipeline

A typical ASR system follows a structured pipeline.

At a high level:

Audio → Features → Acoustic Model → Language Model → Text

Each stage plays a distinct role and introduces its own challenges.
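The pipeline above can be sketched as a chain of functions. Everything here is an illustrative placeholder — the function names, scores, and words are invented to show the data flow, not a real implementation:

```python
# Illustrative skeleton of the classic ASR pipeline (placeholder logic only).

def extract_features(audio):
    # Would compute MFCCs or log-Mel frames from raw audio
    return [audio]  # placeholder: a single "frame"

def acoustic_model(features):
    # Would score candidate words from the audio features
    return {"hello": 0.7, "hollow": 0.3}  # placeholder scores

def language_model_rescore(scores):
    # Re-weights hypotheses using linguistic context
    return {w: s * (1.2 if w == "hello" else 0.8) for w, s in scores.items()}

def decode(scores):
    # Picks the highest-scoring hypothesis
    return max(scores, key=scores.get)

def recognize(audio):
    return decode(language_model_rescore(acoustic_model(extract_features(audio))))

print(recognize("raw-audio"))  # hello
```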

Audio Input

The ASR pipeline starts with raw audio input.

Audio is usually:

  • Mono channel
  • 16 kHz sampling rate
  • Normalized amplitude

Poor input quality at this stage limits everything that follows.
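The bullet points above can be demonstrated with a short sketch. A synthetic stereo signal stands in for a recorded clip (a real pipeline would load a file instead); the channel averaging and peak normalization are the standard operations:

```python
import numpy as np

# Synthetic one-second stereo signal at 16 kHz, standing in for a recording
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
stereo = np.stack([0.5 * np.sin(2 * np.pi * 220 * t),
                   0.3 * np.sin(2 * np.pi * 440 * t)])

# Mono channel: average the two channels
mono = stereo.mean(axis=0)

# Normalized amplitude: scale so the peak is 1.0
mono /= np.abs(mono).max()

print(mono.shape)  # (16000,)
```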

Feature Extraction

Raw audio is not fed directly into models.

Instead, the signal is converted into compact representations called features.

Common ASR features include:

  • MFCC (Mel-Frequency Cepstral Coefficients)
  • Log-Mel Spectrograms
  • Filter bank energies

These features capture phonetic information while reducing noise and redundancy.


import librosa

# Load speech at a 16 kHz sampling rate (librosa returns mono by default)
audio, sr = librosa.load("speech.wav", sr=16000)

# Extract 13 MFCCs per frame
mfcc = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13
)

print(mfcc.shape)
  
(13, 300)

Each column represents a short time frame, and each row represents one MFCC coefficient. The number of frames (300 here) depends on the clip's duration and the hop length.
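Log-Mel features work similarly. To make the idea concrete, here is a simplified NumPy sketch of log-Mel energies for a single windowed frame — the filter count, FFT size, and test tone are arbitrary choices, and a real system would use an optimized library routine instead:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_frame(frame, sr=16000, n_mels=40, n_fft=512):
    """Log-Mel energies for one windowed frame (simplified sketch)."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # Triangular filters spaced evenly on the mel scale
    points = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    energies = np.empty(n_mels)
    for i in range(n_mels):
        lo, mid, hi = points[i], points[i + 1], points[i + 2]
        rising = np.clip((freqs - lo) / (mid - lo), 0, 1)
        falling = np.clip((hi - freqs) / (hi - mid), 0, 1)
        energies[i] = np.sum(spectrum * np.minimum(rising, falling))
    return np.log(energies + 1e-10)  # log compression

# 400-sample Hann-windowed 300 Hz test tone
frame = np.hanning(400) * np.sin(2 * np.pi * 300 * np.arange(400) / 16000)
print(log_mel_frame(frame).shape)  # (40,)
```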

Acoustic Model

The acoustic model maps audio features to speech sound units.

Historically, acoustic models predicted:

  • Phonemes
  • Context-dependent states

Modern deep learning models learn this mapping end-to-end.

Common acoustic model architectures include:

  • Deep Neural Networks (DNNs)
  • Recurrent Neural Networks (RNNs)
  • Convolutional Neural Networks (CNNs)
  • Transformers
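A minimal sketch of the DNN idea: one hidden layer mapping a 13-dimensional MFCC frame to posterior probabilities over a tiny phoneme set. The weights here are random and the phoneme list is made up — a real acoustic model learns its weights from transcribed speech:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy phoneme inventory and random (untrained) weights
phonemes = ["sil", "ah", "s", "t", "iy"]
W1 = rng.normal(size=(13, 32))
W2 = rng.normal(size=(32, len(phonemes)))

def acoustic_posteriors(mfcc_frame):
    hidden = np.maximum(0, mfcc_frame @ W1)   # ReLU hidden layer
    logits = hidden @ W2
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

probs = acoustic_posteriors(rng.normal(size=13))
print(probs.shape)  # (5,) — one probability per phoneme, summing to 1
```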

Language Model

The language model provides linguistic context.

It helps answer questions like:

  • Which word sequence makes sense?
  • Is “technical support” more likely than “technical report”?

Language models learn word patterns from large text corpora.

They significantly reduce ASR errors.
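The simplest way to see this is a bigram model: count adjacent word pairs in a corpus and turn the counts into conditional probabilities. The toy corpus below is invented for illustration, echoing the “technical support” vs. “technical report” question above:

```python
from collections import Counter

# Tiny illustrative corpus (not real training data)
corpus = ("call technical support . "
          "contact technical support . "
          "read the technical report .").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(w1, w2):
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("technical", "support"))  # 2/3
print(bigram_prob("technical", "report"))   # 1/3
```

In this toy corpus, “support” follows “technical” twice as often as “report”, so the language model would favor it when the audio is ambiguous.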

Decoding

Decoding combines:

  • Acoustic model probabilities
  • Language model probabilities

The decoder searches for the most likely word sequence given the audio.

This step is computationally expensive and critical for accuracy.
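In miniature, decoding looks like this: each candidate transcript gets an acoustic score and a language model score (log-probabilities), and the decoder picks the best combined total. The scores and the language model weight below are made-up numbers for illustration:

```python
# Invented log-probabilities for two competing hypotheses
acoustic_logp = {"technical support": -4.1, "technical sport": -3.9}
lm_logp       = {"technical support": -1.2, "technical sport": -6.5}

lm_weight = 0.8  # language model weight, a typical decoder hyperparameter

def total_score(hyp):
    return acoustic_logp[hyp] + lm_weight * lm_logp[hyp]

best = max(acoustic_logp, key=total_score)
print(best)  # technical support
```

Note that the acoustic model alone slightly preferred “technical sport”; the language model score flipped the decision to the sensible phrase.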

End-to-End ASR Systems

Modern ASR systems often use end-to-end architectures.

These systems directly map:

Audio → Text

Examples include:

  • CTC-based models
  • Attention-based encoder–decoder models
  • Transformer ASR models

End-to-end systems reduce pipeline complexity but require large datasets.
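The CTC idea from the list above can be shown with its collapse rule: merge repeated frame labels, then drop the blank symbol. The frame-level label sequence below is invented for illustration:

```python
from itertools import groupby

BLANK = "_"  # CTC blank symbol

def ctc_collapse(frame_labels):
    # 1) merge consecutive repeats, 2) remove blanks
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(l for l in merged if l != BLANK)

# Per-frame labels a CTC model might emit over time
frames = list("hh_eee_ll_lloo")
print(ctc_collapse(frames))  # hello
```

The blank lets the model output the same letter twice in a row (the two l's survive because a blank separates them), which is what makes the frame-by-frame formulation work.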

ASR Output Example


reference  = "please connect me to technical support"
hypothesis = "please connect me to technical sport"

print(hypothesis)
  
please connect me to technical sport

Even small recognition errors can change meaning significantly.

Where ASR Is Used

ASR is used across many industries:

  • Voice assistants
  • Customer support automation
  • Meeting transcription
  • Accessibility tools
  • Voice-controlled devices

Reliable ASR is foundational to most voice-driven applications.

Practice

What does ASR stand for?



What do we extract from audio before feeding it to ASR models?



Which ASR component helps choose meaningful word sequences?



Quick Quiz

What is the role of feature extraction in ASR?





Which component adds linguistic context to ASR?





What step combines probabilities to find the best text output?





Recap: ASR systems convert speech into text using features, acoustic models, language models, and decoding.

Next up: You’ll explore traditional ASR systems and how speech recognition worked before deep learning.