Speech AI Lesson 7 – Audio Preprocessing | Dataplexa

Audio Preprocessing

In the previous lesson, you learned how feature extraction works and why representations like MFCCs are critical for Speech AI.

In this lesson, we focus on audio preprocessing, a stage that directly determines whether your Speech AI system succeeds or fails.

In real-world projects, poor preprocessing is the most common reason for low accuracy, unstable models, and unreliable predictions.

What Is Audio Preprocessing?

Audio preprocessing refers to the steps applied to raw audio before feature extraction.

The goal is to:

  • Remove unwanted noise
  • Normalize audio levels
  • Standardize input format
  • Improve signal clarity

Clean input leads to better features, and better features lead to better models.

Common Audio Preprocessing Steps

Most Speech AI pipelines use a combination of the following steps:

  • Resampling
  • Silence removal
  • Normalization
  • Noise reduction
  • Pre-emphasis

Resampling

Resampling ensures all audio files use the same sampling rate.

Speech AI datasets often contain recordings with different sampling rates.

Standardizing them avoids inconsistencies during training.


import librosa

# sr=None keeps the file's native sampling rate
audio, sr = librosa.load("speech.wav", sr=None)
print("Original SR:", sr)

audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
print("Resampled length:", len(audio_16k))
  
Original SR: 44100
Resampled length: 48000

Silence Removal

Real speech recordings contain pauses and silence that add no useful information.

Removing silence reduces computation and helps the model focus on the actual speech. Note that librosa.effects.trim removes only leading and trailing silence, not internal pauses.


import librosa

audio, sr = librosa.load("speech.wav", sr=16000)

# Trim leading and trailing silence quieter than 20 dB below the peak
trimmed_audio, _ = librosa.effects.trim(audio, top_db=20)
print("Original length:", len(audio))
print("Trimmed length:", len(trimmed_audio))
  
Original length: 48000
Trimmed length: 43000

Normalization

Different recordings may have different loudness levels.

Normalization scales audio so that amplitude values fall within a consistent range.


import numpy as np

# Peak normalization: scale so the largest absolute amplitude becomes 1.0
normalized_audio = audio / np.max(np.abs(audio))
print(np.max(np.abs(normalized_audio)))
  
1.0
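Peak normalization only fixes the maximum sample value; two recordings can still differ greatly in average loudness. A common alternative is RMS normalization, which scales audio to a target root-mean-square level. This is a minimal sketch; the function name and the target level of 0.1 are illustrative choices, not a standard:

```python
import numpy as np

def rms_normalize(audio: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Scale audio so its root-mean-square level matches target_rms."""
    rms = np.sqrt(np.mean(audio ** 2))
    if rms == 0:
        return audio  # all-silent input: nothing to scale
    return audio * (target_rms / rms)

# Two synthetic "recordings" at very different loudness levels
rng = np.random.default_rng(0)
quiet = 0.01 * rng.standard_normal(16000)
loud = 0.8 * rng.standard_normal(16000)

for sig in (rms_normalize(quiet), rms_normalize(loud)):
    print("RMS after:", round(float(np.sqrt(np.mean(sig ** 2))), 3))
```

Both signals end up with the same RMS level, even though their peak amplitudes still differ.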

Noise Reduction

Background noise can significantly degrade Speech AI performance.

Noise reduction techniques attempt to suppress non-speech components.

Simple noise reduction can be achieved using spectral gating methods.


import noisereduce as nr

# noisereduce suppresses stationary background noise via spectral gating
reduced_noise = nr.reduce_noise(y=audio, sr=16000)
  
Noise reduced audio generated
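The noisereduce call hides what spectral gating actually does: estimate a per-frequency noise floor, then silence time-frequency bins that fall below it. The NumPy sketch below illustrates the idea; the window size, the "quietest 10% of frames" heuristic, and the 1.5x threshold are illustrative assumptions, not noisereduce's actual parameters:

```python
import numpy as np

def spectral_gate(audio, n_fft=512, threshold_mult=1.5):
    """Zero out time-frequency bins that fall below an estimated noise floor."""
    hop = n_fft // 2
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop

    # Short-time FFT: one windowed frame per row
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag = np.abs(spec)

    # Noise floor: mean magnitude of the quietest 10% of frames, per bin
    quiet = mag[np.argsort(mag.sum(axis=1))[: max(1, n_frames // 10)]]
    floor = threshold_mult * quiet.mean(axis=0)

    # Gate bins below the floor, then overlap-add back to a waveform
    gated = np.fft.irfft(spec * (mag > floor), n=n_fft, axis=1)
    out = np.zeros(len(audio))
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += gated[i] * window
    return out

# Demo: one second of noise followed by one second of noisy tone
rng = np.random.default_rng(0)
sr = 8000
noisy = 0.05 * rng.standard_normal(2 * sr)
noisy[sr:] += np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
cleaned = spectral_gate(noisy)
print("Energy before:", round(float(np.sum(noisy ** 2)), 1))
print("Energy after:", round(float(np.sum(cleaned ** 2)), 1))
```

Note the limitation this exposes: the gate needs some noise-only frames to estimate the floor from, which is why this technique works best on stationary background noise.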

Pre-Emphasis Filter

Pre-emphasis boosts high-frequency components that are important for speech clarity.

It is commonly applied before feature extraction.


import numpy as np

# First-order high-pass filter: y[n] = x[n] - 0.97 * x[n-1]
pre_emphasis = 0.97
emphasized_audio = np.append(
    audio[0],
    audio[1:] - pre_emphasis * audio[:-1]
)
  
Pre-emphasis applied
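You can verify the high-frequency boost directly by filtering pure tones and comparing their RMS before and after. This is an illustrative sketch; the test frequencies are arbitrary:

```python
import numpy as np

def pre_emphasize(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    # Same filter as above: y[n] = x[n] - alpha * x[n-1]
    return np.append(x[0], x[1:] - alpha * x[:-1])

sr = 16000
t = np.arange(sr) / sr

gains = {}
for freq in (100, 4000):
    tone = np.sin(2 * np.pi * freq * t)
    out = pre_emphasize(tone)
    # RMS ratio = how much the filter amplifies this frequency
    gains[freq] = float(np.sqrt(np.mean(out ** 2) / np.mean(tone ** 2)))
    print(f"{freq} Hz gain: {gains[freq]:.2f}")
```

The low tone is strongly attenuated while the high tone is amplified, which is exactly the spectral tilt pre-emphasis is meant to correct.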

Putting It All Together

A real Speech AI preprocessing pipeline usually chains several of these steps.


import librosa
import numpy as np
import noisereduce as nr

audio, sr = librosa.load("speech.wav", sr=None)

audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # 1. resample
audio, _ = librosa.effects.trim(audio, top_db=20)             # 2. silence removal
audio = audio / np.max(np.abs(audio))                         # 3. peak normalization
audio = nr.reduce_noise(y=audio, sr=16000)                    # 4. noise reduction
  
Preprocessing pipeline completed
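The combined pipeline above leaves out pre-emphasis. For reuse, the steps that need only NumPy can be wrapped in a single function; this is a sketch (the function name and defaults are illustrative, and resampling and noise reduction are omitted because they depend on librosa and noisereduce):

```python
import numpy as np

def preprocess(audio: np.ndarray, top_db: float = 20.0,
               alpha: float = 0.97) -> np.ndarray:
    """NumPy-only sketch of the lesson's pipeline: trim, normalize, pre-emphasize."""
    # 1. Trim leading/trailing samples quieter than top_db below the peak
    peak = np.max(np.abs(audio))
    keep = np.where(np.abs(audio) > peak * 10 ** (-top_db / 20))[0]
    audio = audio[keep[0] : keep[-1] + 1]
    # 2. Peak normalization
    audio = audio / np.max(np.abs(audio))
    # 3. Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    return np.append(audio[0], audio[1:] - alpha * audio[:-1])
```

Keeping the pipeline in one function also guarantees that training and inference apply exactly the same preprocessing, which is essential for stable accuracy.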

Practice

Which stage cleans and standardizes raw audio before feature extraction?



Which preprocessing step ensures a consistent sampling rate?



Which step removes background noise from audio?



Quick Quiz

Which preprocessing step scales audio to a fixed amplitude range?





Which technique boosts high-frequency speech components?





Which step removes non-speech segments from audio?





Recap: Audio preprocessing cleans, standardizes, and enhances raw audio to improve feature extraction and model accuracy.

Next up: You’ll learn about speech datasets, their structure, and how to prepare them for training.