Speech AI Course
Digital Audio Fundamentals
In the previous lesson, you learned the basics of audio such as waveforms, sampling rate, amplitude, and channels.
In this lesson, we go deeper into digital audio fundamentals, which are critical for building reliable Speech AI systems.
These concepts directly impact audio quality, model accuracy, storage requirements, and real-world performance.
What Makes Audio “Digital”?
Digital audio is created by converting continuous sound waves into discrete numerical values.
This conversion happens through three key processes:
- Sampling
- Quantization
- Encoding
Speech AI systems operate entirely on this digital representation.
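These three steps can be sketched end to end on a synthetic signal; here a 440 Hz sine wave stands in for the continuous sound, and every parameter choice below is illustrative:

```python
import numpy as np

# 1. Sampling: measure the "continuous" wave 16,000 times per second.
sr = 16000
t = np.arange(sr) / sr                # one second of sample times
signal = np.sin(2 * np.pi * 440 * t)  # amplitudes in [-1.0, 1.0]

# 2. Quantization: map each amplitude onto 8-bit integer levels.
quantized = np.round(signal * 127)

# 3. Encoding: store the levels as signed 8-bit integers.
encoded = quantized.astype(np.int8)

print(encoded.dtype, len(encoded), encoded.min(), encoded.max())
```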
Sampling Revisited (Time Resolution)
Sampling determines when, and how often, we measure the signal's amplitude.
A higher sampling rate captures more detail over time, but increases data size and computation.
For Speech AI, 16 kHz is commonly used: by the Nyquist theorem it can represent frequencies up to 8 kHz, which covers most of the energy in human speech, without unnecessary overhead.
import librosa

# Load the same file at two sampling rates; librosa resamples during loading.
audio_16k, sr_16k = librosa.load("speech.wav", sr=16000)
audio_8k, sr_8k = librosa.load("speech.wav", sr=8000)

print(sr_16k, sr_8k)
print(len(audio_16k), len(audio_8k))  # the 8 kHz version has about half the samples
Notice how lowering the sampling rate reduces the number of samples.
Quantization (Amplitude Resolution)
Quantization decides how precisely amplitude values are stored.
It converts continuous amplitude values into fixed numeric levels.
The number of available levels depends on the bit depth.
- 8-bit → 256 levels
- 16-bit → 65,536 levels
- 24-bit → 16,777,216 levels (about 16.8 million)
Higher bit depth means better audio quality, but also larger file size.
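A quick check of these numbers, plus a toy illustration of how quantization error shrinks with bit depth (the sample value is chosen arbitrarily):

```python
import numpy as np

# Levels available at each bit depth: 2 ** bits.
for bits in (8, 16, 24):
    print(bits, 2 ** bits)

# Round one amplitude onto an 8-bit grid vs a 16-bit grid.
x = 0.123456
q8 = np.round(x * 127) / 127        # 8-bit grid
q16 = np.round(x * 32767) / 32767   # 16-bit grid
print(abs(x - q8), abs(x - q16))    # the 16-bit error is far smaller
```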
Bit Depth in Practice
Most Speech AI datasets use 16-bit audio, which provides a good balance between quality and efficiency.
import soundfile as sf

# soundfile returns floating-point samples by default, even for 16-bit files.
audio, sr = sf.read("speech.wav")
print(audio.dtype)
Even though files are stored as integers, libraries often convert them to floating-point values for processing.
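One common convention (assumed here; libraries differ in the exact scale factor) divides int16 samples by 32768 to map them into the range [-1.0, 1.0). The sample values below are made up:

```python
import numpy as np

# Hypothetical 16-bit PCM samples as they would be stored on disk.
pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)

# Divide by 32768 so the most negative sample maps exactly to -1.0.
floats = pcm.astype(np.float32) / 32768.0
print(floats)  # values now lie in [-1.0, 1.0)
```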
Dynamic Range
Dynamic range refers to the difference between the quietest and loudest sound that can be represented.
Low bit depth causes quantization noise, which negatively impacts Speech AI models.
This is why clean, high-quality recordings are critical for training and inference.
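A useful rule of thumb: each bit of depth adds roughly 6.02 dB of dynamic range, since 20·log10(2) ≈ 6.02. The helper function below is illustrative:

```python
import math

# Approximate dynamic range in dB for a given bit depth: 20 * log10(2 ** bits).
def dynamic_range_db(bits: int) -> float:
    return 20 * math.log10(2 ** bits)

for bits in (8, 16, 24):
    print(f"{bits}-bit: ~{dynamic_range_db(bits):.1f} dB")
```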
Clipping and Distortion
When the amplitude exceeds the maximum range the format can represent, clipping occurs: the waveform's peaks are flattened.
Clipping permanently distorts the audio, and the lost information cannot be recovered afterward.
import numpy as np

# Simulate clipping by forcing every sample into a narrower range.
# (Real clipping happens at capture time, when the signal exceeds the converter's range.)
audio_clipped = np.clip(audio, -0.5, 0.5)
print(audio_clipped.max(), audio_clipped.min())
Why Digital Audio Fundamentals Matter
Understanding digital audio helps you:
- Select proper sampling rates and bit depth
- Avoid clipping and distortion
- Optimize storage and performance
- Improve Speech AI model accuracy
In practice, a large share of real-world Speech AI bugs trace back to poor audio handling rather than to the model itself.
Practice
Which process converts continuous amplitude values into discrete levels?
Which property determines how many amplitude levels are available?
What happens when audio exceeds the maximum representable range?
Quick Quiz
How many amplitude levels does 16-bit audio provide?
What type of noise is caused by low bit depth?
Which bit depth is most commonly used in Speech AI datasets?
Recap: Digital audio is defined by sampling rate and bit depth, which directly affect quality, storage, and Speech AI performance.
Next up: You’ll learn feature extraction techniques, starting with MFCCs and spectral features.