Speech AI Course
Feature Extraction
In the previous lesson, you learned how audio is represented digitally using sampling rate and bit depth.
In this lesson, we move to one of the most important stages in the Speech AI pipeline: feature extraction.
This is where raw audio becomes useful information that machine learning models can actually learn from.
Why Feature Extraction Is Necessary
Raw audio waveforms are high-dimensional and highly redundant: one second of 16 kHz audio is 16,000 samples, most of which carry little information a model needs.
If we feed raw waveforms directly into models, training becomes slower, harder, and often less accurate.
Feature extraction solves this by:
- Reducing data size
- Highlighting speech-relevant patterns
- Removing redundant information
- Improving model performance
In real-world Speech AI systems, the choice of features can matter as much as the choice of model.
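To make the size reduction concrete, here is a back-of-the-envelope sketch. The hop length and coefficient count below are common defaults (e.g. in librosa), used here purely for illustration:

```python
# Rough comparison: raw samples vs. MFCC values for 1 second of audio
sr = 16000                 # sampling rate from the previous lesson
n_samples = sr * 1         # 16,000 raw values per second

hop_length = 512           # assumed frame hop (a common default)
n_mfcc = 13                # assumed number of coefficients per frame
n_frames = 1 + n_samples // hop_length   # centered framing, librosa-style
n_features = n_mfcc * n_frames

print(n_samples, n_features)  # 16000 vs 416
```

That is roughly a 40x reduction in data size before the model ever sees the audio.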
Time-Domain vs Frequency-Domain Features
Speech features can be extracted in two main ways:
- Time-domain features
- Frequency-domain features
Time-domain features work directly on the waveform, while frequency-domain features analyze how energy is distributed across different frequencies.
Time-Domain Features
Time-domain features are simpler and faster to compute.
Common time-domain features include:
- Zero Crossing Rate (ZCR)
- Energy
- Amplitude statistics
Zero Crossing Rate (ZCR)
Zero Crossing Rate measures how often the signal crosses zero.
It helps differentiate between voiced and unvoiced sounds.
```python
import librosa
import numpy as np

# Load the audio, resampled to 16 kHz
audio, sr = librosa.load("speech.wav", sr=16000)

# ZCR is computed per frame; average it for one summary value
zcr = librosa.feature.zero_crossing_rate(audio)
print(np.mean(zcr))
```
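The idea behind ZCR can also be sketched directly in NumPy. The 100 Hz sine below is a made-up test signal, chosen so the expected crossing rate is easy to reason about:

```python
import numpy as np

# Synthetic test signal: a 100 Hz sine sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 100 * t)

# A zero crossing occurs wherever consecutive samples change sign
crossings = np.sum(np.abs(np.diff(np.sign(audio))) > 0)
zcr = crossings / len(audio)
print(zcr)
```

A 100 Hz sine crosses zero about 200 times per second (twice per cycle), so the rate comes out near 200 / 16000 = 0.0125. Noisy, unvoiced sounds like "s" produce far higher rates, which is why ZCR helps separate voiced from unvoiced speech.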
Frequency-Domain Features
Frequency-domain features are more powerful and are widely used in Speech AI.
They capture how speech energy is distributed across different frequencies over time.
To extract these features, audio is first converted into the frequency domain using the Fourier Transform.
Spectrogram
A spectrogram shows how frequency content changes over time.
It is one of the most fundamental representations used in speech and audio processing.
```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio, sr = librosa.load("speech.wav", sr=16000)

# Short-Time Fourier Transform: complex matrix of shape (freq_bins, frames)
spectrogram = librosa.stft(audio)

# Magnitude in decibels is easier to interpret and visualize
spectrogram_db = librosa.amplitude_to_db(np.abs(spectrogram))
print(spectrogram_db.shape)

# Plot time on the x-axis and frequency on the y-axis
librosa.display.specshow(spectrogram_db, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar()
plt.show()
```
Mel Scale and Human Hearing
Humans do not perceive frequencies linearly.
We are more sensitive to lower frequencies and less sensitive to very high frequencies.
The Mel scale models this behavior and is widely used in Speech AI.
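One common form of the Mel scale (the HTK-style formula; other variants exist) can be written as a one-line function, which makes the perceptual warping easy to see:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel formula: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Equal steps in Hz are not equal steps in mel:
print(hz_to_mel(1000) - hz_to_mel(0))     # ~1000 mel for the first 1000 Hz
print(hz_to_mel(8000) - hz_to_mel(7000))  # far fewer mel for the same 1000 Hz step
```

Low frequencies, where human hearing is most sensitive, get most of the Mel range; high frequencies are compressed.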
MFCC (Mel-Frequency Cepstral Coefficients)
MFCCs are the most widely used features in Speech AI.
They represent speech in a compact, human-perception-aligned form.
MFCC extraction involves:
- Short-time Fourier Transform
- Mel filter banks
- Logarithmic compression
- Discrete Cosine Transform
```python
import librosa

audio, sr = librosa.load("speech.wav", sr=16000)

# 13 coefficients per frame is a common default for speech tasks
mfccs = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13
)
print(mfccs.shape)  # (13, number_of_frames)
```
Each column represents a short time frame, and each row represents a cepstral coefficient.
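Assuming librosa's default hop length of 512 samples and centered framing (both defaults, not values fixed by this course), the number of frames can be estimated without loading any audio:

```python
import numpy as np

sr = 16000
audio = np.random.randn(2 * sr)  # 2 seconds of synthetic audio

# With centered framing, n_frames = 1 + len(audio) // hop_length
hop_length = 512
n_frames = 1 + len(audio) // hop_length

print((13, n_frames))  # expected MFCC matrix shape with n_mfcc=13
```

So a 2-second clip at 16 kHz yields a 13 x 63 MFCC matrix: one 13-value column roughly every 32 milliseconds.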
Why MFCCs Work So Well
MFCCs:
- Reduce noise sensitivity
- Capture speech-relevant information
- Are compact and efficient
- Work well with many models
This is why MFCCs are still widely used even with modern deep learning models.
Practice
Which stage converts raw audio into meaningful representations?
Which feature type is most commonly used in Speech AI?
Which scale models human perception of frequency?
Quick Quiz
Which feature representation is most widely used in Speech AI?
MFCCs are extracted in which domain?
Which concept aligns audio features with human hearing?
Recap: Feature extraction transforms raw audio into compact, meaningful representations such as MFCCs that models can learn from effectively.
Next up: You’ll learn audio preprocessing techniques used before feature extraction to further improve accuracy.