Speech AI Course
Audio Basics
In the previous lesson, you learned about different types of speech tasks. Before going deeper into models and algorithms, it is essential to understand the basics of audio.
Speech AI systems do not work directly with “sound” as humans hear it. They work with mathematical representations of audio signals.
What Is Audio?
Audio is a representation of sound waves created by vibrations in the air.
When a person speaks, air pressure changes over time. These pressure changes form a continuous signal called a sound wave.
Speech AI systems capture and analyze these sound waves digitally.
Analog vs Digital Audio
Sound in the real world is analog, meaning it is continuous.
Computers, however, work with digital data, which is discrete.
Converting analog sound into digital audio is the first step in any Speech AI system.
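The idea of sampling a continuous signal can be sketched in a few lines of NumPy: we evaluate a sine wave (a stand-in for a real, continuous sound wave) only at discrete, evenly spaced instants. The 440 Hz tone and 16000 Hz rate here are illustrative choices, not values fixed by the lesson.

```python
import numpy as np

SAMPLE_RATE = 16000  # measurements per second (illustrative choice)
DURATION = 1.0       # seconds of audio to generate
FREQ = 440.0         # tone frequency in Hz (illustrative)

# Discrete sample times: 0, 1/16000, 2/16000, ...
t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE

# Evaluating the "continuous" wave only at these instants is
# exactly what analog-to-digital sampling does.
signal = np.sin(2 * np.pi * FREQ * t)

print(len(signal))  # 16000 samples for 1 second of audio
```

One second of audio becomes an array of 16000 numbers, which is the discrete form computers actually work with.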
Waveform Representation
A waveform shows how the audio signal changes over time.
The horizontal axis represents time, and the vertical axis represents amplitude (signal strength).
Every spoken word, pause, and shift in intonation produces a distinct waveform pattern.
Sampling Rate
The sampling rate defines how many times per second the audio signal is measured.
It is measured in Hertz (Hz).
- 8000 Hz – telephone quality
- 16000 Hz – common for Speech AI
- 44100 Hz – CD quality audio
Higher sampling rates capture more detail but require more storage and computation.
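Moving between these rates is called resampling. As a minimal sketch, this example downsamples one second of CD-quality audio (44100 samples) to 16000 Hz using SciPy's polyphase resampler; the choice of SciPy here is an assumption about your toolchain, and librosa offers similar functionality.

```python
import numpy as np
from scipy.signal import resample_poly

# One second of synthetic "CD quality" audio (44100 samples)
rng = np.random.default_rng(0)
cd_audio = rng.standard_normal(44100)

# 44100 Hz -> 16000 Hz reduces to the integer ratio 160/441
speech_audio = resample_poly(cd_audio, up=160, down=441)

print(len(cd_audio), "->", len(speech_audio))  # 44100 -> 16000
```

The downsampled signal carries fewer samples per second, so it needs less storage and computation, at the cost of discarding high-frequency detail.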
Amplitude and Loudness
Amplitude represents the strength of the audio signal.
Louder sounds have higher amplitudes, while softer sounds have lower amplitudes.
In Speech AI, controlling amplitude is important to avoid distorted or inconsistent inputs.
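One simple way to control amplitude is peak normalization: scale the signal so its largest absolute value hits a chosen target level. This is a minimal sketch of the idea, not a production loudness algorithm, and the function name and target level are illustrative.

```python
import numpy as np

def peak_normalize(audio: np.ndarray, peak: float = 0.9) -> np.ndarray:
    """Scale audio so its maximum absolute amplitude equals `peak`."""
    max_amp = np.max(np.abs(audio))
    if max_amp == 0:
        return audio  # pure silence: nothing to scale
    return audio * (peak / max_amp)

quiet = np.array([0.01, -0.02, 0.015])
loud = peak_normalize(quiet)
print(np.max(np.abs(loud)))  # close to 0.9
```

Normalizing every recording to the same peak level keeps inputs consistent, so a model does not treat a quiet microphone and a loud one as different kinds of speech.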
Audio Channels
Audio can be recorded using one or more channels.
- Mono – single channel (commonly used in Speech AI)
- Stereo – two channels (left and right)
Most Speech AI models expect mono audio to keep processing simple and efficient.
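Converting stereo to mono is typically done by averaging the left and right channels. A minimal NumPy sketch, assuming the common layout where samples are rows and channels are columns (shape `(num_samples, 2)`; some libraries use the transpose):

```python
import numpy as np

# Fake stereo signal: 4 samples, 2 channels (left, right)
stereo = np.array([
    [0.2, 0.4],
    [0.0, 0.2],
    [-0.3, -0.1],
    [0.5, 0.5],
])

# Average left and right into a single channel
mono = stereo.mean(axis=1)

print(mono)  # approximately [0.3, 0.1, -0.2, 0.5]
```

After averaging, the signal is a single one-dimensional array, which is the shape most Speech AI models expect.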
Working with Audio in Python
Let’s look at a simple example of loading an audio file and checking its basic properties.
import librosa

# sr=None keeps the file's native sampling rate instead of
# librosa's default of resampling to 22050 Hz
audio, sample_rate = librosa.load("speech.wav", sr=None)

print("Sample Rate:", sample_rate)
print("Audio Length (samples):", len(audio))
print("Duration (seconds):", len(audio) / sample_rate)
Why Audio Basics Matter
Understanding audio fundamentals helps you:
- Choose the correct sampling rate
- Avoid noisy or distorted inputs
- Design better preprocessing pipelines
- Debug Speech AI issues effectively
Strong audio knowledge separates reliable systems from unstable ones.
Practice
What physical phenomenon creates audio signals?
What determines how many times per second audio is measured?
Which audio channel type is most commonly used in Speech AI?
Quick Quiz
Which sampling rate is most commonly used in Speech AI?
Computers process audio in which form?
Which property represents loudness in an audio signal?
Recap: Audio is a digital representation of sound waves, and understanding sampling, amplitude, and channels is essential for Speech AI.
Next up: You’ll learn about digital audio fundamentals, including quantization and bit depth.