Speech AI Course
Text-to-Speech (TTS) Fundamentals
In the previous lesson, you learned what speech synthesis is and how a complete Text-to-Speech (TTS) pipeline works at a high level.
In this lesson, we go deeper into the **fundamental building blocks** that every TTS system, classical or modern, must handle correctly.
Understanding these fundamentals is critical if you want to:
- Build your own TTS systems
- Debug unnatural speech output
- Improve voice quality and realism
What Makes Speech Sound Natural?
Humans instantly recognize unnatural speech, because natural speech contains subtle variations in:
- Pitch
- Timing
- Stress
- Intonation
A TTS system must learn and reproduce all of these correctly.
Core Components of TTS Fundamentals
At a fundamental level, every TTS system must solve four problems:
- What to say (text content)
- How to pronounce it
- How long each sound lasts
- How it should sound emotionally
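The four problems above can be pictured as stages of a pipeline. Here is a minimal sketch; every function name is invented for this illustration, and each stage is a trivial placeholder for what a real system does:

```python
# Illustrative four-stage TTS pipeline skeleton; all names and values are
# placeholders, not a real implementation.
def get_text(source):
    # Stage 1: what to say
    return source

def to_phonemes(text):
    # Stage 2: how to pronounce it (placeholder: one symbol per letter)
    return list(text.replace(" ", "").upper())

def assign_durations(phonemes):
    # Stage 3: how long each sound lasts (placeholder: 0.1 s each)
    return [(p, 0.1) for p in phonemes]

def add_prosody(timed_phonemes):
    # Stage 4: how it should sound (placeholder: flat 110 Hz pitch)
    return [(p, d, 110) for p, d in timed_phonemes]

plan = add_prosody(assign_durations(to_phonemes(get_text("hi"))))
print(plan)  # [('H', 0.1, 110), ('I', 0.1, 110)]
```

The rest of this lesson fills in what each placeholder stage must actually do.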
Text Normalization (Revisited)
Raw text often contains elements that are hard to speak directly:
- Numbers
- Dates
- Abbreviations
Why This Code Exists
This code converts written text into a spoken-friendly form.
```python
def normalize_text(text):
    # Expand a year into its spoken form
    text = text.replace("2024", "twenty twenty four")
    # Expand a common abbreviation into a full word
    text = text.replace("Dr.", "doctor")
    return text

print(normalize_text("Dr. Smith joined in 2024."))
# doctor Smith joined in twenty twenty four.
```
What happens inside:
- Numbers are expanded
- Abbreviations become words
Why this matters:
If normalization fails, even the best model produces wrong speech.
Phonemes and Pronunciation
Letters are not speech sounds.
TTS systems rely on **phonemes**, the smallest units of sound that distinguish one word from another.
Why This Code Exists
This example shows how words map to phonemes.
```python
# ARPAbet-style phoneme symbols for two words
pronunciation = {
    "data": ["D", "EY", "T", "AH"],
    "science": ["S", "AY", "AH", "N", "S"]
}

print(pronunciation["data"])  # ['D', 'EY', 'T', 'AH']
```
What happens here:
- Text is converted into sound units
- Ambiguous spelling is resolved
Why phonemes are essential:
Without phonemes, pronunciation becomes inconsistent and wrong.
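One place the inconsistency shows up: words missing from the pronunciation lexicon. Here is a small sketch (the lexicon and the letter-by-letter fallback rule are invented for illustration) of why out-of-vocabulary words force a system to guess:

```python
# Invented mini-lexicon; real systems use dictionaries with many thousands of entries
lexicon = {"data": ["D", "EY", "T", "AH"]}

def to_phonemes(word):
    # Fall back to spelling the word letter by letter when it is not in the lexicon
    return lexicon.get(word.lower(), list(word.upper()))

print(to_phonemes("data"))  # lexicon hit: ['D', 'EY', 'T', 'AH']
print(to_phonemes("xyz"))   # fallback guess: ['X', 'Y', 'Z']
```

Real systems replace the crude fallback with a learned grapheme-to-phoneme model, but the lexicon-plus-fallback structure is the same.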
Duration Modeling
Speech is not only about what is spoken, but **how long** each sound lasts.
Incorrect timing makes speech robotic.
Why This Code Exists
This code demonstrates assigning duration to phonemes.
```python
# Duration of each phoneme in seconds
durations = {
    "D": 0.08,
    "EY": 0.12,
    "T": 0.06,
    "AH": 0.10
}

print(durations)
```
What happens internally:
- Each phoneme receives a time length
- Speech rhythm is controlled
Why duration matters:
Natural speech relies heavily on timing variation.
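As a quick illustration (the per-phoneme durations above are invented values), summing them gives the total spoken length of the word:

```python
durations = {"D": 0.08, "EY": 0.12, "T": 0.06, "AH": 0.10}  # seconds per phoneme

total = sum(durations.values())
print(f"'data' lasts about {total:.2f} s")
```

Stretching or shrinking individual entries in such a table is exactly how a system speeds up, slows down, or emphasizes parts of an utterance.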
Pitch and Intonation
Pitch controls how speech rises and falls.
Intonation conveys:
- Questions
- Emphasis
- Emotion
Why This Code Exists
This example shows a simple pitch contour.
```python
# Fundamental frequency (F0) samples over time, in Hz
pitch_contour = [110, 115, 120, 118, 112]
print(pitch_contour)
```
What this represents:
- Pitch variation across time
- Natural prosody shaping
Why pitch is critical:
Flat pitch instantly sounds synthetic.
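One concrete use of a contour, sketched with invented values and a deliberately crude heuristic: in many languages, a pitch rise at the end of an utterance signals a yes/no question.

```python
def sounds_like_question(contour):
    # Crude heuristic: a final pitch rise often marks a yes/no question
    return contour[-1] > contour[-2]

print(sounds_like_question([110, 112, 118, 130]))       # rising ending
print(sounds_like_question([110, 115, 120, 118, 112]))  # falling ending
```

A real prosody model predicts the whole contour jointly rather than checking two samples, but the signal it exploits is the same.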
Putting Fundamentals Together
A real TTS system simultaneously models:
- Phonemes
- Durations
- Pitch
- Spectral features
Modern neural models learn all of this automatically, but the fundamentals still apply.
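To see how the pieces fit, here is a toy sketch (all phoneme symbols, durations, and pitch values are invented for illustration) that assembles a per-phoneme plan for the word "data":

```python
# Toy per-phoneme synthesis plan; all values are illustrative
phonemes = ["D", "EY", "T", "AH"]
durations = {"D": 0.08, "EY": 0.12, "T": 0.06, "AH": 0.10}  # seconds
pitches = {"D": 110, "EY": 120, "T": 115, "AH": 108}        # Hz

plan = []
start = 0.0
for p in phonemes:
    plan.append({"phoneme": p, "start": round(start, 2),
                 "duration": durations[p], "pitch_hz": pitches[p]})
    start += durations[p]

for step in plan:
    print(step)
```

A real system predicts spectral features on top of such a plan; the point here is only that phonemes, durations, and pitch are modeled together, not in isolation.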
Practice
What unit represents pronunciation in TTS systems?
Which factor controls how long sounds are spoken?
What feature controls rise and fall of voice?
Quick Quiz
Which representation is closest to actual speech sounds?
What controls speech rhythm?
Which feature affects intonation?
Recap: TTS fundamentals include phonemes, duration, pitch, and prosody; together they make speech sound human.
Next up: You’ll dive into modern neural TTS models starting with Tacotron-based architectures.