Speech AI Course
Audio Preprocessing
In the previous lesson, you learned how feature extraction works and why representations like MFCCs are critical for Speech AI.
In this lesson, we focus on audio preprocessing, a stage that directly determines whether your Speech AI system succeeds or fails.
In real-world projects, poor preprocessing is the most common reason for low accuracy, unstable models, and unreliable predictions.
What Is Audio Preprocessing?
Audio preprocessing refers to the steps applied to raw audio before feature extraction.
The goal is to:
- Remove unwanted noise
- Normalize audio levels
- Standardize input format
- Improve signal clarity
Clean input leads to better features, and better features lead to better models.
Common Audio Preprocessing Steps
Most Speech AI pipelines use a combination of the following steps:
- Resampling
- Silence removal
- Normalization
- Noise reduction
- Pre-emphasis
Resampling
Resampling ensures all audio files use the same sampling rate.
Speech AI datasets often contain recordings with different sampling rates.
Standardizing them avoids inconsistencies during training.
```python
import librosa

# Load at the file's native sampling rate (sr=None disables resampling on load).
audio, sr = librosa.load("speech.wav", sr=None)
print("Original SR:", sr)

# Resample to 16 kHz, a common standard for Speech AI models.
audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
print("Resampled length:", len(audio_16k))
```
Silence Removal
Real speech contains pauses and silence that carry no useful information.
Removing silence reduces computation and helps the model focus on the speech segments.
```python
import librosa

audio, sr = librosa.load("speech.wav", sr=16000)

# Trim leading and trailing silence; top_db sets how many decibels below
# the peak a segment must fall to count as silence.
trimmed_audio, _ = librosa.effects.trim(audio, top_db=20)
print("Original length:", len(audio))
print("Trimmed length:", len(trimmed_audio))
```
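Note that `librosa.effects.trim` only removes silence at the beginning and end of a clip; pauses in the middle survive. As a rough illustration of internal silence removal, here is a NumPy-only sketch that drops quiet frames. In practice `librosa.effects.split` does this more robustly, and the frame size here is an arbitrary choice:

```python
import numpy as np

def remove_silence(audio, sr, top_db=20.0, frame_ms=25):
    """Drop fixed-size frames whose peak is more than top_db below
    the clip's overall peak."""
    frame_len = int(sr * frame_ms / 1000)
    threshold = np.max(np.abs(audio)) * 10 ** (-top_db / 20.0)
    frames = [audio[i : i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, frame_len)]
    kept = [f for f in frames if np.max(np.abs(f)) > threshold]
    return np.concatenate(kept) if kept else audio[:0]

# Synthetic clip: one second of tone, one of silence, one more of tone.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
clip = np.concatenate([tone, np.zeros(sr), tone])
voiced = remove_silence(clip, sr)
print(len(clip), len(voiced))  # the middle second of silence is gone
```

Frame-based gating like this can clip word boundaries; real splitters pad each voiced region slightly to avoid that.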
Normalization
Different recordings may have different loudness levels.
Normalization scales audio so that amplitude values fall within a consistent range.
```python
import numpy as np

# Peak normalization: scale so the loudest sample has magnitude 1.0.
# Guard against division by zero for completely silent audio.
peak = np.max(np.abs(audio))
normalized_audio = audio / peak if peak > 0 else audio
print(np.max(np.abs(normalized_audio)))
```
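Peak normalization scales by the loudest sample, so a single click or spike can dominate the scaling. An alternative is RMS normalization, which targets average energy instead. A minimal sketch with synthetic audio standing in for a loaded clip (the target level of 0.1 is an arbitrary choice):

```python
import numpy as np

# Synthetic stand-in for a loaded clip (real code would use the array
# returned by librosa.load).
rng = np.random.default_rng(0)
audio = 0.3 * rng.standard_normal(16000)

# Peak normalization: the loudest sample becomes exactly 1.0.
peak = np.max(np.abs(audio))
peak_normalized = audio / peak if peak > 0 else audio

# RMS normalization: scale to a target average energy instead, which
# tracks perceived loudness more closely across recordings.
target_rms = 0.1
rms = np.sqrt(np.mean(audio ** 2))
rms_normalized = audio * (target_rms / rms) if rms > 0 else audio

print(np.max(np.abs(peak_normalized)))        # exactly 1.0
print(np.sqrt(np.mean(rms_normalized ** 2)))  # 0.1
```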
Noise Reduction
Background noise can significantly degrade Speech AI performance.
Noise reduction techniques attempt to suppress non-speech components.
A simple approach is spectral gating, which estimates the noise floor in each frequency band and suppresses energy that falls below it; the noisereduce library implements this.
```python
import noisereduce as nr

# Apply spectral-gating noise reduction to the 16 kHz signal.
reduced_noise = nr.reduce_noise(y=audio, sr=16000)
```
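To make the idea behind spectral gating concrete, here is a simplified NumPy-only sketch: it estimates a per-frequency noise floor from a noise-only clip and zeroes out time-frequency bins below a threshold. This mirrors the concept, not noisereduce's exact algorithm, and the frame size and threshold multiplier are arbitrary choices:

```python
import numpy as np

def spectral_gate(audio, noise, frame_len=512, hop=256, threshold_mult=1.5):
    """Zero out time-frequency bins whose magnitude stays below a
    threshold derived from a noise-only recording."""
    window = np.hanning(frame_len)

    def frame(x):
        n = 1 + (len(x) - frame_len) // hop
        return np.stack([x[i * hop : i * hop + frame_len] * window
                         for i in range(n)])

    # Average noise magnitude per frequency bin is the noise floor.
    noise_floor = np.abs(np.fft.rfft(frame(noise), axis=1)).mean(axis=0)

    spec = np.fft.rfft(frame(audio), axis=1)
    mask = np.abs(spec) > threshold_mult * noise_floor  # keep loud bins
    frames_out = np.fft.irfft(spec * mask, n=frame_len, axis=1)

    # Overlap-add the gated frames back into a waveform.
    out = np.zeros(len(audio))
    for i in range(len(frames_out)):
        out[i * hop : i * hop + frame_len] += frames_out[i]
    return out

# Demo: a 440 Hz tone buried in white noise, plus a noise-only clip
# used to estimate the noise floor.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
noisy = tone + 0.2 * rng.standard_normal(sr)
noise_only = 0.2 * rng.standard_normal(sr // 2)
cleaned = spectral_gate(noisy, noise_only)
```

The tone occupies a few strong frequency bins that stay above the floor, while most noise-only bins are zeroed, so the output is noticeably closer to the clean tone.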
Pre-Emphasis Filter
Pre-emphasis boosts high-frequency components that are important for speech clarity.
It is commonly applied before feature extraction.
```python
import numpy as np

# First-order high-pass filter: y[n] = x[n] - 0.97 * x[n-1].
pre_emphasis = 0.97
emphasized_audio = np.append(
    audio[0],  # the first sample has no predecessor, so keep it as-is
    audio[1:] - pre_emphasis * audio[:-1]
)
```
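Because pre-emphasis is a first-order high-pass filter, its effect can be checked directly: a low-frequency tone is attenuated while a high-frequency tone is boosted. A quick sketch with synthetic sines (the two test frequencies are arbitrary choices):

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
pre_emphasis = 0.97

def apply_pre_emphasis(x):
    return np.append(x[0], x[1:] - pre_emphasis * x[:-1])

def rms(x):
    return np.sqrt(np.mean(x ** 2))

# Measure the RMS gain the filter applies at a low and a high frequency.
gains = {}
for freq in (100, 4000):
    x = np.sin(2 * np.pi * freq * t)
    gains[freq] = rms(apply_pre_emphasis(x)) / rms(x)
    print(freq, "Hz gain:", round(gains[freq], 3))
```

The 100 Hz tone comes out heavily attenuated while the 4000 Hz tone is amplified, matching the filter's magnitude response sqrt(1 + a² − 2a·cos ω) with a = 0.97.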
Putting It All Together
A real Speech AI preprocessing pipeline usually chains several of these steps.
```python
import librosa
import numpy as np
import noisereduce as nr

# Load at the native rate, then standardize to 16 kHz.
audio, sr = librosa.load("speech.wav", sr=None)
audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

# Trim edge silence, peak-normalize, then suppress background noise.
audio, _ = librosa.effects.trim(audio, top_db=20)
audio = audio / np.max(np.abs(audio))
audio = nr.reduce_noise(y=audio, sr=16000)
```
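As a compact, dependency-light illustration, the trim / normalize / pre-emphasis steps can also be wrapped into a single reusable function. This is a NumPy-only sketch; a production pipeline would keep the librosa and noisereduce calls shown above:

```python
import numpy as np

def preprocess(audio, top_db=20.0, pre_emphasis=0.97):
    """Trim edge silence, peak-normalize, and apply pre-emphasis."""
    # Keep the span between the first and last samples louder than
    # top_db below the clip's peak (a simplified stand-in for trim).
    peak = np.max(np.abs(audio))
    threshold = peak * 10 ** (-top_db / 20.0)
    loud = np.flatnonzero(np.abs(audio) > threshold)
    if loud.size:
        audio = audio[loud[0] : loud[-1] + 1]

    # Peak normalization to [-1, 1].
    audio = audio / np.max(np.abs(audio))

    # Pre-emphasis: y[n] = x[n] - a * x[n-1].
    return np.append(audio[0], audio[1:] - pre_emphasis * audio[:-1])

# Usage with a synthetic clip: silence, a 440 Hz tone, more silence.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
clip = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
processed = preprocess(clip)
print(len(clip), len(processed))  # the processed clip is shorter
```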
Practice
Which stage cleans and standardizes raw audio before feature extraction?
Which preprocessing step ensures a consistent sampling rate?
Which step removes background noise from audio?
Quick Quiz
Which preprocessing step scales audio to a fixed amplitude range?
Which technique boosts high-frequency speech components?
Which step removes non-speech segments from audio?
Recap: Audio preprocessing cleans, standardizes, and enhances raw audio to improve feature extraction and model accuracy.
Next up: You’ll learn about speech datasets, their structure, and how to prepare them for training.