Speech AI Course
Introduction to Automatic Speech Recognition (ASR)
So far, you have learned how speech is captured, processed, cleaned, represented, and evaluated.
Now we enter the heart of Speech AI: Automatic Speech Recognition (ASR).
ASR systems convert spoken language into written text. They power voice assistants, call centers, live captions, and transcription systems.
What Is Automatic Speech Recognition?
Automatic Speech Recognition (ASR) is the task of converting an audio signal containing speech into a sequence of words or characters.
In simple terms:
Speech → Text
Unlike humans, ASR systems must solve this problem using mathematics, probability, and machine learning.
Why ASR Is Difficult
At first glance, ASR seems simple — people speak, systems listen, text appears.
In reality, ASR is one of the hardest problems in AI.
Challenges include:
- Different accents and pronunciations
- Speaking speed variations
- Background noise and overlap
- Homophones (words that sound the same but differ in meaning)
- Unclear word boundaries
Humans solve these problems subconsciously. Machines must learn them from data.
High-Level ASR Pipeline
A typical ASR system follows a structured pipeline.
At a high level:
Audio → Features → Acoustic Model → Language Model → Text
Each stage plays a distinct role and introduces its own challenges.
Audio Input
The ASR pipeline starts with raw audio input.
Audio is usually:
- Mono channel
- 16 kHz sampling rate
- Normalized amplitude
Poor input quality at this stage limits everything that follows.
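The preparation steps above (mono channel, normalized amplitude) can be sketched with a small helper. The synthetic stereo tone below stands in for a real recording, and `prepare_audio` is a hypothetical name for illustration:

```python
import numpy as np

def prepare_audio(samples: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Downmix to mono and peak-normalize a waveform."""
    if samples.ndim == 2:                  # (channels, time) -> average channels
        samples = samples.mean(axis=0)
    max_abs = np.max(np.abs(samples))
    if max_abs > 0:                        # avoid dividing by zero on silence
        samples = samples * (peak / max_abs)
    return samples

# Synthetic one-second stereo signal standing in for a real recording
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
stereo = np.stack([0.3 * np.sin(2 * np.pi * 440 * t),
                   0.1 * np.sin(2 * np.pi * 220 * t)])

mono = prepare_audio(stereo)
print(mono.shape)  # (16000,) — one mono channel at 16 kHz
```

Resampling to 16 kHz is usually handled by the loading library (as `librosa.load(..., sr=16000)` does later in this lesson).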
Feature Extraction
Raw audio is not fed directly into models.
Instead, the signal is converted into compact representations called features.
Common ASR features include:
- MFCC (Mel-Frequency Cepstral Coefficients)
- Log-Mel Spectrograms
- Filter bank energies
These features capture phonetic information while reducing noise and redundancy.
import librosa

# Load speech at 16 kHz and extract 13 MFCCs per frame
audio, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)
Each column represents a short time frame, and each row represents one MFCC coefficient.
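Log-mel features, also listed above, can be computed from first principles: frame the signal, take the power spectrum, pool it into mel-spaced bands, and take the log. The sketch below uses a synthetic tone, and the frame sizes and band count are illustrative choices, not fixed standards:

```python
import numpy as np

def log_mel_energies(audio, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Frame the signal, take |FFT|^2, pool into mel bands, take the log."""
    # Frame into overlapping windows and apply a Hann window
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop: i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filter bank (mel scale compresses high frequencies)
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log(power @ fbank.T + 1e-10)   # (n_frames, n_mels)

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 300 * t)           # synthetic stand-in for speech
feats = log_mel_energies(tone, sr=sr)
print(feats.shape)  # (97, 40): 97 frames, 40 mel bands
```

In practice, libraries such as librosa (`librosa.feature.melspectrogram`) provide tuned implementations of this computation.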
Acoustic Model
The acoustic model maps audio features to speech sound units.
Historically, acoustic models predicted:
- Phonemes
- Context-dependent states
Modern deep learning models learn this mapping end-to-end.
Common acoustic model architectures include:
- Deep Neural Networks (DNNs)
- Recurrent Neural Networks (RNNs)
- Convolutional Neural Networks (CNNs)
- Transformers
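At its simplest, an acoustic model maps each feature frame to a probability distribution over sound units. The sketch below uses untrained random weights in place of a learned network, purely to show the shapes involved (13 MFCCs in, 40 phoneme posteriors out — both illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Convert raw scores to probabilities along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_features, n_phonemes, n_frames = 13, 40, 100

# Random weights stand in for a trained acoustic model
W = rng.normal(size=(n_features, n_phonemes))
b = np.zeros(n_phonemes)

features = rng.normal(size=(n_frames, n_features))   # e.g. MFCC frames
posteriors = softmax(features @ W + b)               # (frames, phonemes)

print(posteriors.shape)  # (100, 40): one distribution per frame
```

A real DNN, RNN, CNN, or Transformer replaces the single matrix multiply with many learned layers, but the input/output contract is the same.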
Language Model
The language model provides linguistic context.
It helps answer questions like:
- Which word sequence makes sense?
- Is “technical support” more likely than “technical report”?
Language models learn word patterns from large text corpora.
They significantly reduce ASR errors.
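A minimal language model counts word pairs (bigrams) in text. The toy corpus below is invented for illustration; real systems train on far larger corpora:

```python
from collections import Counter

# Tiny hypothetical corpus; real LMs use billions of words
corpus = ("call technical support "
          "contact technical support "
          "read the technical report").split()

bigrams = Counter(zip(corpus, corpus[1:]))   # count adjacent word pairs
unigrams = Counter(corpus[:-1])              # count first words of pairs

def bigram_prob(prev, word):
    """P(word | prev) estimated from counts."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("technical", "support"))   # 2/3 in this corpus
print(bigram_prob("technical", "report"))    # 1/3 in this corpus
```

Here "technical support" is twice as likely as "technical report" simply because it occurred twice as often, which is exactly the kind of evidence a decoder can use.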
Decoding
Decoding combines:
- Acoustic model probabilities
- Language model probabilities
The decoder searches for the most likely word sequence given the audio.
This step is computationally expensive and critical for accuracy.
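Conceptually, the decoder scores each candidate transcript by adding its acoustic log-probability to a weighted language-model log-probability and keeps the best. The scores below are made up for illustration:

```python
# Hypothetical candidates: (acoustic log-prob, language-model log-prob)
candidates = {
    "please connect me to technical support": (-12.1, -4.0),
    "please connect me to technical report":  (-11.8, -7.5),
}

def combined_score(acoustic_lp, lm_lp, lm_weight=1.0):
    """Interpolate acoustic and LM evidence in log space."""
    return acoustic_lp + lm_weight * lm_lp

best = max(candidates, key=lambda h: combined_score(*candidates[h]))
print(best)  # the LM tips the balance toward "technical support"
```

Real decoders cannot enumerate every transcript, so they use search algorithms such as beam search over a pruned space of hypotheses.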
End-to-End ASR Systems
Modern ASR systems often use end-to-end architectures.
These systems directly map:
Audio → Text
Examples include:
- CTC-based models
- Attention-based encoder–decoder models
- Transformer ASR models
End-to-end systems reduce pipeline complexity but require large datasets.
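CTC, the first approach listed above, lets the model emit one label per frame (including a special "blank") and then collapses the result: merge repeated labels, then drop blanks. A greedy-decoding sketch:

```python
def ctc_collapse(frame_labels, blank="_"):
    """Merge repeated labels, then drop blanks (greedy CTC decoding)."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# Per-frame best labels from a hypothetical CTC acoustic model
frames = list("hh_eee_ll_lloo__")
print(ctc_collapse(frames))  # "hello"
```

Note how the blank between the two "l" runs preserves the double letter: without it, "ll" would collapse to a single "l".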
ASR Output Example
audio_text = "please connect me to technical support"  # example decoder output
print(audio_text)
Even small recognition errors can change meaning significantly.
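That sensitivity to small errors is exactly what word error rate (WER) measures: the word-level edit distance between reference and hypothesis, normalized by reference length. A from-scratch sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

ref = "please connect me to technical support"
hyp = "please connect me to technical report"
print(round(word_error_rate(ref, hyp), 3))  # 0.167: one substitution in six words
```

A WER of 0.167 looks small, yet the single substitution ("support" → "report") routes the caller to the wrong place, which is why ASR evaluation considers meaning as well as raw error counts.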
Where ASR Is Used
ASR is used across many industries:
- Voice assistants
- Customer support automation
- Meeting transcription
- Accessibility tools
- Voice-controlled devices
Reliable ASR is foundational to most voice-driven applications.
Practice
What does ASR stand for?
What do we extract from audio before feeding it to ASR models?
Which ASR component helps choose meaningful word sequences?
Quick Quiz
What is the role of feature extraction in ASR?
Which component adds linguistic context to ASR?
What step combines probabilities to find the best text output?
Recap: ASR systems convert speech into text using features, acoustic models, language models, and decoding.
Next up: You’ll explore traditional ASR systems and how speech recognition worked before deep learning.