Speech AI Lesson 32 – Prosody Control | Dataplexa

Prosody Control

In the previous lesson, you learned how emotion influences speech through pitch, energy, and expressive cues.

Prosody goes one step deeper.

It defines how speech flows over time — its rhythm, stress, emphasis, and pauses.

Even with correct pronunciation and emotion, poor prosody makes speech sound robotic.

What Is Prosody?

Prosody refers to the musical and temporal aspects of speech.

It includes:

  • Intonation (pitch movement)
  • Stress (emphasis on words)
  • Timing and rhythm
  • Pauses and phrasing

Humans rely heavily on prosody to understand meaning.

Why Prosody Control Is Critical in TTS

Consider the sentence:

"You finished the project."

Depending on prosody, it can sound:

  • Neutral
  • Surprised
  • Disappointed
  • Questioning

Words stay the same. Prosody changes everything.

Key Prosodic Features

Prosody is controlled through a combination of measurable features.

  • Pitch contour (F0 curve)
  • Duration of phonemes
  • Energy levels
  • Pause placement

Pitch Contours and Intonation

Pitch contours define how the voice rises and falls.

In English, yes/no questions often end with rising pitch, while statements usually end with a fall.

Why This Code Exists

This code illustrates a simple pitch contour used to shape intonation.


import numpy as np

# F0 values (assumed to be in Hz), one per time step: rise to a peak, then fall
pitch_contour = np.array([120, 125, 130, 128, 122])
print("Pitch contour:", pitch_contour)
  

What happens inside:

  • Pitch values change over time
  • The rise and fall gives the phrase a natural shape

Output:

Pitch contour: [120 125 130 128 122]

Why this matters:

Flat pitch produces lifeless speech.
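To make the rising-versus-falling distinction concrete, here is a minimal sketch that builds two contours by linear interpolation. The helper name make_contour and the Hz values are illustrative assumptions, not part of any TTS library; real contours are curved and are predicted by a model rather than drawn as straight lines.

```python
import numpy as np

def make_contour(start_hz, end_hz, n_frames):
    """Linearly interpolate F0 from start_hz to end_hz over n_frames."""
    return np.linspace(start_hz, end_hz, n_frames)

# Hypothetical values: a statement falls, a yes/no question rises.
statement = make_contour(130.0, 110.0, 5)
question = make_contour(120.0, 160.0, 5)

print("Statement:", statement)
print("Question: ", question)
```

The two arrays differ only in direction, which is enough to change how the same words would be heard.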

Duration and Speech Rhythm

Prosody also depends on how long sounds last.

Important words are often stretched, while function words are shortened.

Why This Code Exists

This example assigns a different duration to each word in a phrase.


# Duration per word (assumed to be seconds); content words get more time
durations = {
  "important": 0.18,
  "word": 0.15,
  "the": 0.05
}

print(durations)
  

What happens here:

  • Key words receive longer durations
  • Speech rhythm becomes expressive

Output:

{'important': 0.18, 'word': 0.15, 'the': 0.05}

Why duration control is important:

Uniform timing sounds mechanical.
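One way to move away from uniform timing is to rescale a set of baseline durations. The sketch below is a minimal illustration under stated assumptions: the function scale_durations, its rate and stretch parameters, and the values used are all hypothetical, and a real system would work on phoneme-level durations predicted by a model.

```python
# Baseline durations in seconds per word (illustrative values).
durations = {"important": 0.18, "word": 0.15, "the": 0.05}

def scale_durations(durations, rate=1.0, emphasis=None, stretch=1.3):
    """Return new durations: divide by rate, and stretch emphasized words."""
    emphasis = emphasis or set()
    return {
        word: round(d / rate * (stretch if word in emphasis else 1.0), 3)
        for word, d in durations.items()
    }

# Speak at 80% speed and additionally stretch the word "important".
slow = scale_durations(durations, rate=0.8, emphasis={"important"})
print(slow)
print("Total seconds:", round(sum(slow.values()), 3))
```

Lowering rate lengthens every word, while the emphasis set lengthens only the chosen words, so overall tempo and local stress can be adjusted separately.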

Energy and Emphasis

Energy determines how loud speech sounds.

Speakers naturally increase energy to emphasize certain words.

Why This Code Exists

This code demonstrates energy variation across a phrase.


import numpy as np

energy = np.array([0.6, 0.9, 1.2, 0.8])
print("Energy levels:", energy)
  

What happens internally:

  • Higher energy highlights emphasis
  • Lower energy softens speech

Output:

Energy levels: [0.6 0.9 1.2 0.8]

Why this matters:

Energy variation improves clarity and engagement.
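Energy values like these are usually measured from audio frame by frame. The following sketch computes root-mean-square (RMS) energy over short frames of a synthetic signal. The function frame_rms, the frame and hop sizes, and the test tone are assumptions chosen for illustration; audio libraries offer their own equivalents.

```python
import numpy as np

def frame_rms(signal, frame_len, hop_len):
    """Root-mean-square energy of each frame of a 1-D signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.array([
        np.sqrt(np.mean(signal[i * hop_len : i * hop_len + frame_len] ** 2))
        for i in range(n_frames)
    ])

# Synthetic signal: one second of a quiet tone, then one second of a louder tone.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)
signal = np.concatenate([0.2 * tone, 0.8 * tone])

rms = frame_rms(signal, frame_len=400, hop_len=160)
print("First frame RMS:", round(float(rms[0]), 3))
print("Last frame RMS: ", round(float(rms[-1]), 3))
```

The louder half of the signal produces higher RMS values, which is the kind of difference a listener hears as emphasis.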

Pause Modeling

Silence is part of speech.

Strategic pauses help listeners process meaning.

Why This Code Exists

This example represents pause placement between phrases.


pauses = [0.2, 0.05, 0.3]
print("Pause durations:", pauses)
  

What happens here:

  • Longer pauses separate ideas
  • Short pauses maintain flow

Output:

Pause durations: [0.2, 0.05, 0.3]

Why pauses matter:

Speech without pauses is exhausting to listen to.
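Many TTS engines accept pause instructions as SSML, the W3C speech markup language, whose break element takes a time attribute. The sketch below assembles such markup from a list of phrases and pause lengths. The helper to_ssml and the phrases are hypothetical, and the bare speak wrapper is a simplification: engines differ in which attributes and elements they require or support.

```python
from xml.sax.saxutils import escape

def to_ssml(phrases, pauses_s):
    """Join phrases with SSML <break> tags; pauses_s[i] follows phrases[i]."""
    if len(pauses_s) != len(phrases) - 1:
        raise ValueError("need exactly one pause between each pair of phrases")
    parts = [escape(phrases[0])]
    for pause, phrase in zip(pauses_s, phrases[1:]):
        # SSML expresses break length as a duration such as "300ms".
        parts.append(f'<break time="{int(round(pause * 1000))}ms"/>')
        parts.append(escape(phrase))
    return "<speak>" + " ".join(parts) + "</speak>"

ssml = to_ssml(
    ["You finished the project.", "Already?", "That was fast."],
    [0.3, 0.05],
)
print(ssml)
```

The longer first break separates two ideas, while the short second break keeps the follow-up remark attached to the question.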

Prosody Control in Neural TTS

Modern neural TTS systems control prosody using:

  • Prosody embeddings
  • Reference encoders
  • Style tokens

These mechanisms let the model learn prosody from data, so it can be steered without hand-tuning pitch, duration, and energy values.
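The style-token idea can be sketched in a few lines: a small bank of embeddings is combined into one style vector using attention weights. Everything below is a simplified assumption for illustration. The sizes are arbitrary, the tokens and the query are random placeholders rather than learned values, and single-head dot-product attention stands in for whatever attention a given model actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 style tokens, each an 8-dimensional embedding.
# In a trained model these would be learned; here they are random placeholders.
style_tokens = rng.normal(size=(4, 8))

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def style_embedding(query, tokens):
    """Attention-weighted sum of style tokens for one query vector."""
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    weights = softmax(scores)
    return weights, weights @ tokens

# The query stands in for a reference encoder's output (also a placeholder).
query = rng.normal(size=8)
weights, embedding = style_embedding(query, style_tokens)

print("Token weights:", np.round(weights, 3))
print("Style embedding shape:", embedding.shape)
```

At synthesis time, the weights could instead be set by hand to favor one token, which is one way such systems expose style control.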

Challenges in Prosody Modeling

Prosody is difficult because:

  • It is subjective
  • It varies across languages
  • It depends on context

This makes evaluation challenging.
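A small example shows why objective scores alone are hard to interpret. The sketch below compares a reference pitch contour with one that has exactly the same shape but sits 5 Hz higher, using two common objective measures: root-mean-square error and Pearson correlation. The function names and the contour values are illustrative assumptions.

```python
import numpy as np

def f0_rmse(ref_hz, syn_hz):
    """Root-mean-square error between two equal-length F0 contours."""
    ref = np.asarray(ref_hz, dtype=float)
    syn = np.asarray(syn_hz, dtype=float)
    return float(np.sqrt(np.mean((ref - syn) ** 2)))

def f0_corr(ref_hz, syn_hz):
    """Pearson correlation between two equal-length F0 contours."""
    return float(np.corrcoef(ref_hz, syn_hz)[0, 1])

reference = [120, 125, 130, 128, 122]
shifted = [125, 130, 135, 133, 127]  # same shape, 5 Hz higher

print("RMSE (Hz):", round(f0_rmse(reference, shifted), 3))
print("Correlation:", round(f0_corr(reference, shifted), 3))
```

The error measure penalizes the shifted contour while the correlation treats it as a perfect match, and neither says how natural it would sound to a listener.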

Practice

What term describes rhythm, stress, and intonation?

Which feature controls rising and falling tone?

What helps separate phrases in speech?



Quick Quiz

What makes speech sound expressive?

Which factor controls speech rhythm?

Which feature controls emphasis?





Recap: Prosody controls rhythm, stress, pitch, and pauses, making speech sound natural and expressive.

Next up: You’ll learn about Multilingual Text-to-Speech and how systems handle multiple languages.