Speech AI Course
Audio Basics
In the previous lesson, you learned about different types of speech tasks. Before going deeper into models and algorithms, it is essential to understand the basics of audio.
Speech AI systems do not work directly with “sound” as humans hear it. They work with mathematical representations of audio signals.
What Is Audio?
Audio is a representation of sound waves created by vibrations in the air.
When a person speaks, air pressure changes over time. These pressure changes form a continuous signal called a sound wave.
Speech AI systems capture and analyze these sound waves digitally.
Analog vs Digital Audio
Sound in the real world is analog, meaning it is continuous.
Computers, however, work with digital data, which is discrete.
Converting analog sound into digital audio is the first step in any Speech AI system.
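The idea of sampling a continuous signal can be sketched in a few lines of NumPy: we evaluate a sine wave (a stand-in for a real, continuous sound wave) only at discrete, evenly spaced instants. The 440 Hz tone and 16000 Hz rate here are illustrative choices, not values fixed by the lesson.

```python
import numpy as np

SAMPLE_RATE = 16000  # measurements per second (illustrative choice)
DURATION = 1.0       # seconds of audio to generate
FREQ = 440.0         # tone frequency in Hz (illustrative)

# Discrete sample times: 0, 1/16000, 2/16000, ...
t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE

# Evaluating the "continuous" wave only at these instants is
# exactly what analog-to-digital sampling does.
signal = np.sin(2 * np.pi * FREQ * t)

print(len(signal))  # 16000 samples for 1 second of audio
```

One second of audio becomes an array of 16000 numbers, which is the discrete form computers actually work with.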
Waveform Representation
A waveform shows how the audio signal changes over time.
The horizontal axis represents time, and the vertical axis represents amplitude (signal strength).
Every spoken word, pause, and shift in intonation produces a distinct waveform pattern.
Sampling Rate
The sampling rate defines how many times per second the audio signal is measured.
It is measured in Hertz (Hz).
- 8000 Hz – telephone quality
- 16000 Hz – common for Speech AI
- 44100 Hz – CD quality audio
Higher sampling rates capture more detail but require more storage and computation.
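Moving between these rates is called resampling. As a minimal sketch, this example downsamples one second of CD-quality audio (44100 samples) to 16000 Hz using SciPy's polyphase resampler; the choice of SciPy here is an assumption about your toolchain, and librosa offers similar functionality.

```python
import numpy as np
from scipy.signal import resample_poly

# One second of synthetic "CD quality" audio (44100 samples)
rng = np.random.default_rng(0)
cd_audio = rng.standard_normal(44100)

# 44100 Hz -> 16000 Hz reduces to the integer ratio 160/441
speech_audio = resample_poly(cd_audio, up=160, down=441)

print(len(cd_audio), "->", len(speech_audio))  # 44100 -> 16000
```

The downsampled signal carries fewer samples per second, so it needs less storage and computation, at the cost of discarding high-frequency detail.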
Amplitude and Loudness
Amplitude represents the strength of the audio signal.
Louder sounds have higher amplitudes, while softer sounds have lower amplitudes.
In Speech AI, controlling amplitude is important to avoid distorted or inconsistent inputs.
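One simple way to control amplitude is peak normalization: scale the signal so its largest absolute value hits a chosen target level. This is a minimal sketch of the idea, not a production loudness algorithm, and the function name and target level are illustrative.

```python
import numpy as np

def peak_normalize(audio: np.ndarray, peak: float = 0.9) -> np.ndarray:
    """Scale audio so its maximum absolute amplitude equals `peak`."""
    max_amp = np.max(np.abs(audio))
    if max_amp == 0:
        return audio  # pure silence: nothing to scale
    return audio * (peak / max_amp)

quiet = np.array([0.01, -0.02, 0.015])
loud = peak_normalize(quiet)
print(np.max(np.abs(loud)))  # close to 0.9
```

Normalizing every recording to the same peak level keeps inputs consistent, so a model does not treat a quiet microphone and a loud one as different kinds of speech.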
Audio Channels
Audio can be recorded using one or more channels.
- Mono – single channel (commonly used in Speech AI)
- Stereo – two channels (left and right)
Most Speech AI models expect mono audio to keep processing simple and efficient.
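Converting stereo to mono is typically done by averaging the left and right channels. A minimal NumPy sketch, assuming the common layout where samples are rows and channels are columns (shape `(num_samples, 2)`; some libraries use the transpose):

```python
import numpy as np

# Fake stereo signal: 4 samples, 2 channels (left, right)
stereo = np.array([
    [0.2, 0.4],
    [0.0, 0.2],
    [-0.3, -0.1],
    [0.5, 0.5],
])

# Average left and right into a single channel
mono = stereo.mean(axis=1)

print(mono)  # approximately [0.3, 0.1, -0.2, 0.5]
```

After averaging, the signal is a single one-dimensional array, which is the shape most Speech AI models expect.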
Working with Audio in Python
Let’s look at a simple example of loading an audio file and checking its basic properties.
import librosa

# sr=None keeps the file's native sampling rate instead of
# librosa's default of resampling to 22050 Hz
audio, sample_rate = librosa.load("speech.wav", sr=None)

print("Sample Rate:", sample_rate)
print("Audio Length (samples):", len(audio))
print("Duration (seconds):", len(audio) / sample_rate)
Why Audio Basics Matter
Understanding audio fundamentals helps you:
- Choose the correct sampling rate
- Avoid noisy or distorted inputs
- Design better preprocessing pipelines
- Debug Speech AI issues effectively
Strong audio knowledge separates reliable systems from unstable ones.
Practice
What physical phenomenon creates audio signals?
What determines how many times per second audio is measured?
Which audio channel type is most commonly used in Speech AI?
Quick Quiz
Which sampling rate is most commonly used in Speech AI?
Computers process audio in which form?
Which property represents loudness in an audio signal?
Recap: Audio is a digital representation of sound waves, and understanding sampling, amplitude, and channels is essential for Speech AI.
Next up: You’ll learn about digital audio fundamentals, including quantization and bit depth.