Speech AI Lesson 32 – Prosody Control | Dataplexa

Prosody Control

In the previous lesson, you learned how emotion influences speech through pitch, energy, and expressive cues.

Prosody goes one step deeper.

It defines how speech flows over time — its rhythm, stress, emphasis, and pauses.

Even with correct pronunciation and emotion, poor prosody makes speech sound robotic.

What Is Prosody?

Prosody refers to the musical and temporal aspects of speech.

It includes:

Intonation (pitch movement)
Stress (emphasis on words)
Timing and rhythm
Pauses and phrasing

Listeners rely on prosody to tell a statement from a question, to find the emphasized word, and to hear where one phrase ends and the next begins.

Why Prosody Control Is Critical in TTS

Consider the sentence:

"You finished the project."

Depending on prosody, it can sound:

Neutral
Surprised
Disappointed
Questioning

Words stay the same. Prosody changes everything.

Key Prosodic Features

Prosody is controlled through a combination of measurable features:

Pitch contour (F0 curve)
Duration of phonemes
Energy levels
Pause placement
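To make the four features concrete, here is a minimal sketch that stores them side by side for each word of the example sentence. The `ProsodyUnit` class, its field names, and all the numbers are illustrative assumptions, not part of any TTS library.

```python
from dataclasses import dataclass


@dataclass
class ProsodyUnit:
    """Prosodic features for one word (hypothetical container)."""
    label: str                  # the word being spoken
    f0_hz: float                # average pitch, assumed to be in Hz
    duration_s: float           # how long the word lasts, in seconds
    energy: float               # relative loudness (unitless)
    pause_after_s: float = 0.0  # silence inserted after the word


# Illustrative values only; a real system would predict these.
phrase = [
    ProsodyUnit("you", 120.0, 0.10, 0.8),
    ProsodyUnit("finished", 130.0, 0.30, 1.2),
    ProsodyUnit("the", 118.0, 0.05, 0.6),
    ProsodyUnit("project", 112.0, 0.35, 1.0, pause_after_s=0.3),
]

# Total time = spoken durations plus pauses.
total_s = sum(u.duration_s + u.pause_after_s for u in phrase)
print(f"Total phrase length: {total_s:.2f} s")
```

Keeping the features in one record per word makes it easy to change one of them, such as the pause after "project", without touching the others.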

Pitch Contours and Intonation

Pitch contours define how the voice rises and falls.

In English, yes/no questions often end with rising pitch, while statements usually end with a fall.

Why This Code Exists

This code illustrates a simple pitch contour used to shape intonation.


import numpy as np

# One F0 value per time step; the values are read here as Hz.
pitch_contour = np.array([120, 125, 130, 128, 122])
print("Pitch contour:", pitch_contour)

What happens inside:

Pitch rises from 120 to a peak of 130, then falls back to 122
The rise and fall gives the phrase a simple intonation shape

Pitch contour: [120 125 130 128 122]

Why this matters:

Flat pitch produces lifeless speech.
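The rising and falling patterns described above can be sketched with simple linear interpolation. This is an illustration only: `make_contour` and `final_slope` are hypothetical helpers, the Hz values are assumptions, and real contours are far less regular than a straight line.

```python
import numpy as np


def make_contour(start_hz, end_hz, n_frames=5):
    """Linearly interpolate pitch from start_hz to end_hz over n_frames."""
    return np.linspace(start_hz, end_hz, n_frames)


def final_slope(contour):
    """Pitch change over the last step: negative = falling, positive = rising."""
    return contour[-1] - contour[-2]


statement = make_contour(130.0, 110.0)  # falling contour, as in a statement
question = make_contour(115.0, 150.0)   # rising contour, as in a yes/no question

print("Statement:", statement, "final slope:", final_slope(statement))
print("Question: ", question, "final slope:", final_slope(question))
```

The sign of the final slope is a rough cue for telling the two patterns apart; the same words paired with a different contour would be heard differently.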

Duration and Speech Rhythm

Prosody also depends on how long sounds last.

Important words are often stretched, while function words are shortened.

Why This Code Exists

This example assigns a different duration to each word in a phrase.


# Duration per word, read here as seconds.
durations = {
  "important": 0.18,  # content word: longest
  "word": 0.15,
  "the": 0.05         # function word: shortest
}

print(durations)

What happens here:

Key words receive longer durations
Speech rhythm becomes expressive

{'important': 0.18, 'word': 0.15, 'the': 0.05}

Why duration control is important:

Uniform timing sounds mechanical.
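As a rough sketch of duration control, the snippet below lengthens chosen words by a fixed factor and leaves the rest unchanged. The `stretch` helper and the factor of 1.5 are assumptions made for illustration; production systems usually predict durations per phoneme with a trained model.

```python
# Base durations from the example above, read as seconds.
base_durations = {"important": 0.18, "word": 0.15, "the": 0.05}


def stretch(durations, emphasized, factor=1.5):
    """Return a copy in which emphasized words are lengthened by factor."""
    return {
        word: dur * factor if word in emphasized else dur
        for word, dur in durations.items()
    }


stretched = stretch(base_durations, emphasized={"important"})

print("Before:", base_durations, "total:", round(sum(base_durations.values()), 3))
print("After: ", stretched, "total:", round(sum(stretched.values()), 3))
```

Only the emphasized word changes, so the rhythm shifts toward it while the function word stays short.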

Energy and Emphasis

Energy is the main acoustic correlate of perceived loudness.

Speakers naturally increase energy to emphasize certain words.

Why This Code Exists

This code demonstrates energy variation across a phrase.


import numpy as np

# Relative energy values across the phrase (unitless).
energy = np.array([0.6, 0.9, 1.2, 0.8])
print("Energy levels:", energy)

What happens internally:

Higher energy highlights emphasis
Lower energy softens speech

Energy levels: [0.6 0.9 1.2 0.8]

Why this matters:

Energy variation improves clarity and engagement.
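Energy values like the ones above are typically measured from audio frame by frame. The following sketch computes root-mean-square (RMS) energy on a synthetic tone that gets louder halfway through; the sample rate, frame length, and the signal itself are assumptions chosen to keep the example self-contained.

```python
import numpy as np

sample_rate = 16000                       # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # time stamps for one second

# Synthetic stand-in for speech: a 150 Hz tone, louder in the second half.
gain = np.where(t < 0.5, 0.3, 0.9)
signal = gain * np.sin(2 * np.pi * 150 * t)

frame_len = 400                           # 25 ms per frame at 16 kHz
n_frames = len(signal) // frame_len
frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

# RMS energy: square, average within each frame, take the square root.
rms = np.sqrt(np.mean(frames ** 2, axis=1))

print("Frames:", n_frames)
print("Mean RMS, first half: ", rms[: n_frames // 2].mean())
print("Mean RMS, second half:", rms[n_frames // 2 :].mean())
```

The louder half shows clearly higher RMS values, which is the kind of contrast a listener hears as emphasis.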

Pause Modeling

Silence is part of speech.

Strategic pauses help listeners process meaning.

Why This Code Exists

This example represents pause placement between phrases.


# Silence between consecutive phrases, read here as seconds.
pauses = [0.2, 0.05, 0.3]
print("Pause durations:", pauses)

What happens here:

Longer pauses separate ideas
Short pauses maintain flow

Pause durations: [0.2, 0.05, 0.3]

Why pauses matter:

Speech with no pauses is exhausting to listen to.
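One simple way to place pauses is to read them off the punctuation. The sketch below is a rule-based illustration: the `PAUSE_AFTER` table, its values in seconds, and the `plan_pauses` helper are all hypothetical, and real systems often learn pause placement from data instead.

```python
import re

# Hypothetical pause lengths in seconds for each punctuation mark.
PAUSE_AFTER = {",": 0.15, ";": 0.25, ".": 0.4, "?": 0.4, "!": 0.4}


def plan_pauses(text):
    """Return (word, pause_after_s) pairs for each word in text."""
    plan = []
    for token in re.findall(r"[\w']+|[,.;?!]", text):
        if token in PAUSE_AFTER:
            if plan:  # attach the pause to the preceding word
                word, _ = plan[-1]
                plan[-1] = (word, PAUSE_AFTER[token])
        else:
            plan.append((token, 0.0))
    return plan


plan = plan_pauses("You finished the project, and it works.")
for word, pause in plan:
    print(f"{word:10s} pause after: {pause:.2f} s")
```

With these assumed values, the comma yields a short pause that keeps the sentence flowing, while the full stop yields a longer one that closes the idea.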

Prosody Control in Neural TTS

Modern neural TTS systems control prosody using:

Prosody embeddings
Reference encoders
Style tokens

These mechanisms let a model learn prosody from data and apply it at synthesis time, which reduces the need to hand-tune pitch, duration, and energy values.
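The idea behind style tokens can be sketched in a few lines: a reference embedding attends over a small bank of token vectors, and the weighted sum becomes the style embedding that conditions the synthesizer. Everything here is a simplified assumption: the random arrays stand in for learned parameters, and single-head dot-product attention stands in for whatever attention a given model actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 4, 8

# Random stand-ins for parameters a real model would learn.
style_tokens = rng.normal(size=(n_tokens, dim))  # bank of style tokens
reference = rng.normal(size=dim)                 # reference encoder output

# Dot-product attention: score each token against the reference.
scores = style_tokens @ reference / np.sqrt(dim)
weights = np.exp(scores - scores.max())          # numerically stable softmax
weights /= weights.sum()

# Style embedding = attention-weighted sum of the tokens.
style_embedding = weights @ style_tokens

# Manual control: choose the token weights directly instead of using audio.
manual_weights = np.array([0.0, 1.0, 0.0, 0.0])
manual_style = manual_weights @ style_tokens

print("Attention weights:", np.round(weights, 3))
print("Style embedding shape:", style_embedding.shape)
```

Setting the weights by hand, as in `manual_weights`, is what allows a style to be selected without supplying a reference recording.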

Challenges in Prosody Modeling

Prosody is difficult because:

It is subjective
It varies across languages
It depends on context

Because a sentence rarely has a single correct rendition, evaluation is challenging and often relies on human listening tests.

Practice

What term describes rhythm, stress, and intonation?

Which feature controls rising and falling tone?

What helps separate phrases in speech?

Quick Quiz

What makes speech sound expressive?

Prosody
Sampling rate
Compression

Which factor controls speech rhythm?

Duration
Noise
Codec

Which feature controls emphasis?

Energy
Text
Sampling

Recap: Prosody controls rhythm, stress, pitch, and pauses, making speech sound natural and expressive.

Next up: You’ll learn about Multilingual Text-to-Speech and how systems handle multiple languages.
