Speech AI Course
Prosody Control
In the previous lesson, you learned how emotion influences speech through pitch, energy, and expressive cues.
Prosody goes one step deeper.
It defines how speech flows over time: its rhythm, stress, emphasis, and pauses.
Even with correct pronunciation and emotion, poor prosody makes speech sound robotic.
What Is Prosody?
Prosody refers to the musical and temporal aspects of speech.
It includes:
- Intonation (pitch movement)
- Stress (emphasis on words)
- Timing and rhythm
- Pauses and phrasing
Humans rely heavily on prosody to understand meaning.
Why Prosody Control Is Critical in TTS
Consider the sentence:
"You finished the project."
Depending on prosody, it can sound:
- Neutral
- Surprised
- Disappointed
- Questioning
Words stay the same. Prosody changes everything.
Key Prosodic Features
Prosody is controlled through a combination of measurable features.
- Pitch contour (F0 curve)
- Duration of phonemes
- Energy levels
- Pause placement
Pitch Contours and Intonation
Pitch contours define how the voice rises and falls.
In English, yes/no questions often end with rising pitch, while statements usually end with falling pitch.
Why This Code Exists
This code illustrates a simple pitch contour used to shape intonation.
import numpy as np

# F0 values at successive points in time: rise to a peak, then fall
pitch_contour = np.array([120, 125, 130, 128, 122])
print("Pitch contour:", pitch_contour)
What happens inside:
- Pitch values rise from 120 to a peak of 130, then fall back to 122
- This rise and fall gives the phrase its intonation
Why this matters:
Flat pitch produces lifeless speech.
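To make the contrast between statement and question intonation concrete, here is a minimal sketch that labels a contour by how it ends. The helper `final_direction`, the two-frame window, and the values in `question` are illustrative assumptions, not part of any TTS library.

```python
import numpy as np

def final_direction(contour, tail=2):
    """Label a contour by the pitch change over its last `tail` steps (hypothetical helper)."""
    contour = np.asarray(contour, dtype=float)
    change = contour[-1] - contour[-1 - tail]
    if change > 0:
        return "rising"
    if change < 0:
        return "falling"
    return "flat"

# Same contour as in the lesson: ends lower than it peaked
statement = np.array([120, 125, 130, 128, 122])
# Assumed question-like contour: pitch climbs at the end
question = np.array([120, 122, 121, 128, 140])

print("statement:", final_direction(statement))
print("question:", final_direction(question))
```

Looking only at the last few values is a deliberate simplification; real intonation analysis considers the whole phrase.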
Duration and Speech Rhythm
Prosody also depends on how long sounds last.
Important words are often stretched, while function words are shortened.
Why This Code Exists
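This example assigns different durations to the words of a sentence.
# Illustrative durations, assumed to be in seconds
durations = {
    "important": 0.18,  # content word: held longer
    "word": 0.15,
    "the": 0.05,        # function word: shortened
}
print(durations)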
What happens here:
- Key words receive longer durations
- Speech rhythm becomes expressive
Why duration control is important:
Uniform timing sounds mechanical.
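As a rough illustration of moving from uniform to expressive timing, the sketch below stretches only the words marked for emphasis. The function `apply_rhythm`, the `stretch` factor, and the starting durations are assumptions chosen for the example.

```python
def apply_rhythm(durations, emphasized, stretch=1.5):
    """Return a copy of `durations` with emphasized words lengthened (hypothetical helper)."""
    return {
        word: duration * stretch if word in emphasized else duration
        for word, duration in durations.items()
    }

# Uniform timing: every word gets the same assumed duration in seconds
uniform = {"the": 0.1, "important": 0.1, "word": 0.1}

# Stretch only the word that carries the emphasis
expressive = apply_rhythm(uniform, emphasized={"important"}, stretch=1.8)

for word, duration in expressive.items():
    print(f"{word}: {duration:.2f}")
```

A multiplicative stretch keeps the relative timing of the other words untouched, which makes the effect of emphasis easy to hear in isolation.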
Energy and Emphasis
Energy corresponds to how loud speech sounds.
Speakers naturally increase energy to emphasize certain words.
Why This Code Exists
This code demonstrates energy variation across a phrase.
import numpy as np
# Relative energy across the phrase; the peak (1.2) marks the emphasized part
energy = np.array([0.6, 0.9, 1.2, 0.8])
print("Energy levels:", energy)
What happens internally:
- Higher energy highlights emphasis
- Lower energy softens speech
Why this matters:
Energy variation improves clarity and engagement.
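Energy values like these are commonly measured from audio as the root-mean-square (RMS) amplitude of short frames. The sketch below assumes a synthetic 200 Hz tone at 16 kHz with a quiet first half and a louder second half; `frame_rms` and all the numbers are illustrative choices, not a specific library's API.

```python
import numpy as np

def frame_rms(signal, frame_length):
    """Split `signal` into non-overlapping frames and return the RMS of each frame."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_length
    frames = signal[: n_frames * frame_length].reshape(n_frames, frame_length)
    return np.sqrt(np.mean(frames ** 2, axis=1))

sample_rate = 16000                       # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of sample times
tone = np.sin(2 * np.pi * 200 * t)        # stand-in for a voiced sound

# Quiet first half, louder second half, mimicking emphasis on later words
gain = np.where(t < 0.5, 0.3, 0.9)
rms = frame_rms(tone * gain, frame_length=400)  # 400 samples = 25 ms frames

print("Frames:", len(rms))
print("Quiet frame RMS:", round(float(rms[0]), 3))
print("Loud frame RMS:", round(float(rms[-1]), 3))
```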
Pause Modeling
Silence is part of speech.
Strategic pauses help listeners process meaning.
Why This Code Exists
This example represents pause placement between phrases.
# Pause lengths between consecutive phrases, assumed to be in seconds
pauses = [0.2, 0.05, 0.3]
print("Pause durations:", pauses)
What happens here:
- Longer pauses separate ideas
- Short pauses maintain flow
Why pauses matter:
Speech without pauses is exhausting to listen to.
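To see how pauses shape the timeline of an utterance, the following sketch interleaves the pause values from the example with assumed phrase lengths. The `phrase_durations` values and the helper `phrase_start_times` are hypothetical, and all times are taken to be in seconds.

```python
def phrase_start_times(phrase_durations, pauses):
    """Return each phrase's start time and the total length, with one pause between phrases."""
    if len(pauses) != len(phrase_durations) - 1:
        raise ValueError("expected exactly one pause between each pair of phrases")
    starts = []
    t = 0.0
    for i, duration in enumerate(phrase_durations):
        starts.append(round(t, 3))
        t += duration
        if i < len(pauses):
            t += pauses[i]  # silence before the next phrase begins
    return starts, round(t, 3)

phrase_durations = [0.8, 0.5, 1.1, 0.6]  # assumed speech time per phrase
pauses = [0.2, 0.05, 0.3]                # pause values from the example

starts, total = phrase_start_times(phrase_durations, pauses)
print("Phrase start times:", starts)
print("Total duration:", total)
```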
Prosody Control in Neural TTS
Modern neural TTS systems control prosody using:
- Prosody embeddings
- Reference encoders
- Style tokens
These mechanisms let a model learn prosody from recorded speech and expose it as a controllable input, so individual features do not have to be tuned by hand.
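One way to picture style tokens is as a small bank of embedding vectors that are blended into a single style embedding using softmax weights. The sketch below uses random vectors in place of learned ones; the sizes, the scores, and the names `token_bank` and `style_embedding` are assumptions for illustration. In a trained system both the tokens and the attention that produces the weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 4, 8

# Stand-in for a learned bank of style tokens (random here, learned in practice)
token_bank = rng.normal(size=(num_tokens, dim))

def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def style_embedding(scores, token_bank):
    """Blend the tokens into one style vector using softmax weights."""
    weights = softmax(scores)
    return weights @ token_bank, weights

# Assumed scores: the first token dominates the resulting style
embedding, weights = style_embedding([2.0, 0.1, -1.0, 0.5], token_bank)

print("Token weights:", weights.round(3))
print("Style embedding shape:", embedding.shape)
```

Changing the scores shifts the blend between tokens, which is the sense in which a style can be adjusted without editing pitch, duration, or energy directly.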
Challenges in Prosody Modeling
Prosody is difficult because:
- It is subjective
- It varies across languages
- It depends on context
Because there is rarely a single correct way to say a sentence, evaluating prosody is challenging.
Practice
What term describes rhythm, stress, and intonation?
Which feature controls rising and falling tone?
What helps separate phrases in speech?
Quick Quiz
What makes speech sound expressive?
Which factor controls speech rhythm?
Which feature controls emphasis?
Recap: Prosody controls rhythm, stress, pitch, and pauses, making speech sound natural and expressive.
Next up: You’ll learn about Multilingual Text-to-Speech and how systems handle multiple languages.