Speech AI Course
Feature Extraction
In the previous lesson, you learned how audio is represented digitally using sampling rate and bit depth.
In this lesson, we move to one of the most important stages in the Speech AI pipeline: feature extraction.
This is where raw audio becomes useful information that machine learning models can actually learn from.
Why Feature Extraction Is Necessary
Raw audio waveforms are high-dimensional and highly redundant: one second of 16 kHz audio is 16,000 samples, most of which carry little information a model needs.
If we feed raw waveforms directly into models, training becomes slower, harder, and often less accurate.
Feature extraction solves this by:
- Reducing data size
- Highlighting speech-relevant patterns
- Removing redundant information
- Improving model performance
In real-world Speech AI systems, the choice of features can matter as much as the choice of model.
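To make the size reduction concrete, here is a back-of-the-envelope sketch. The hop length and coefficient count below are common defaults (e.g. in librosa), used here purely for illustration:

```python
# Rough comparison: raw samples vs. MFCC values for 1 second of audio
sr = 16000                 # sampling rate from the previous lesson
n_samples = sr * 1         # 16,000 raw values per second

hop_length = 512           # assumed frame hop (a common default)
n_mfcc = 13                # assumed number of coefficients per frame
n_frames = 1 + n_samples // hop_length   # centered framing, librosa-style
n_features = n_mfcc * n_frames

print(n_samples, n_features)  # 16000 vs 416
```

That is roughly a 40x reduction in data size before the model ever sees the audio.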
Time-Domain vs Frequency-Domain Features
Speech features can be extracted in two main ways:
- Time-domain features
- Frequency-domain features
Time-domain features work directly on the waveform, while frequency-domain features analyze how energy is distributed across different frequencies.
Time-Domain Features
Time-domain features are simpler and faster to compute.
Common time-domain features include:
- Zero Crossing Rate (ZCR)
- Energy
- Amplitude statistics
Zero Crossing Rate (ZCR)
Zero Crossing Rate measures how often the signal crosses zero.
It helps differentiate between voiced and unvoiced sounds.
```python
import librosa
import numpy as np

# Load the audio, resampled to 16 kHz
audio, sr = librosa.load("speech.wav", sr=16000)

# ZCR is computed per frame; average it for one summary value
zcr = librosa.feature.zero_crossing_rate(audio)
print(np.mean(zcr))
```
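The idea behind ZCR can also be sketched directly in NumPy. The 100 Hz sine below is a made-up test signal, chosen so the expected crossing rate is easy to reason about:

```python
import numpy as np

# Synthetic test signal: a 100 Hz sine sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 100 * t)

# A zero crossing occurs wherever consecutive samples change sign
crossings = np.sum(np.abs(np.diff(np.sign(audio))) > 0)
zcr = crossings / len(audio)
print(zcr)
```

A 100 Hz sine crosses zero about 200 times per second (twice per cycle), so the rate comes out near 200 / 16000 = 0.0125. Noisy, unvoiced sounds like "s" produce far higher rates, which is why ZCR helps separate voiced from unvoiced speech.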
Frequency-Domain Features
Frequency-domain features are more powerful and are widely used in Speech AI.
They capture how speech energy is distributed across different frequencies over time.
To extract these features, audio is first converted into the frequency domain using the Fourier Transform.
Spectrogram
A spectrogram shows how frequency content changes over time.
It is one of the most fundamental representations used in speech and audio processing.
```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio, sr = librosa.load("speech.wav", sr=16000)

# Short-Time Fourier Transform: complex matrix of shape (freq_bins, frames)
spectrogram = librosa.stft(audio)

# Magnitude in decibels is easier to interpret and visualize
spectrogram_db = librosa.amplitude_to_db(np.abs(spectrogram))
print(spectrogram_db.shape)

# Plot time on the x-axis and frequency on the y-axis
librosa.display.specshow(spectrogram_db, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar()
plt.show()
```
Mel Scale and Human Hearing
Humans do not perceive frequencies linearly.
We are more sensitive to lower frequencies and less sensitive to very high frequencies.
The Mel scale models this behavior and is widely used in Speech AI.
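One common form of the Mel scale (the HTK-style formula; other variants exist) can be written as a one-line function, which makes the perceptual warping easy to see:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel formula: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Equal steps in Hz are not equal steps in mel:
print(hz_to_mel(1000) - hz_to_mel(0))     # ~1000 mel for the first 1000 Hz
print(hz_to_mel(8000) - hz_to_mel(7000))  # far fewer mel for the same 1000 Hz step
```

Low frequencies, where human hearing is most sensitive, get most of the Mel range; high frequencies are compressed.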
MFCC (Mel-Frequency Cepstral Coefficients)
MFCCs are the most widely used features in Speech AI.
They represent speech in a compact, human-perception-aligned form.
MFCC extraction involves:
- Short-time Fourier Transform
- Mel filter banks
- Logarithmic compression
- Discrete Cosine Transform
```python
import librosa

audio, sr = librosa.load("speech.wav", sr=16000)

# 13 coefficients per frame is a common default for speech tasks
mfccs = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13
)
print(mfccs.shape)  # (13, number_of_frames)
```
Each column represents a short time frame, and each row represents a cepstral coefficient.
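Assuming librosa's default hop length of 512 samples and centered framing (both defaults, not values fixed by this course), the number of frames can be estimated without loading any audio:

```python
import numpy as np

sr = 16000
audio = np.random.randn(2 * sr)  # 2 seconds of synthetic audio

# With centered framing, n_frames = 1 + len(audio) // hop_length
hop_length = 512
n_frames = 1 + len(audio) // hop_length

print((13, n_frames))  # expected MFCC matrix shape with n_mfcc=13
```

So a 2-second clip at 16 kHz yields a 13 x 63 MFCC matrix: one 13-value column roughly every 32 milliseconds.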
Why MFCCs Work So Well
MFCCs:
- Reduce noise sensitivity
- Capture speech-relevant information
- Are compact and efficient
- Work well with many models
This is why MFCCs are still widely used even with modern deep learning models.
Practice
Which stage converts raw audio into meaningful representations?
Which feature type is most commonly used in Speech AI?
Which scale models human perception of frequency?
Quick Quiz
Which feature representation is most widely used in Speech AI?
MFCCs are extracted in which domain?
Which concept aligns audio features with human hearing?
Recap: Feature extraction transforms raw audio into compact, meaningful representations such as MFCCs that models can learn from effectively.
Next up: You’ll learn audio preprocessing techniques used before feature extraction to further improve accuracy.