Speech AI Lesson 27 – TTS Fundamentals | Dataplexa

Text-to-Speech (TTS) Fundamentals

In the previous lesson, you learned what speech synthesis is and how a complete Text-to-Speech (TTS) pipeline works at a high level.

In this lesson, we go deeper into the **fundamental building blocks** that every TTS system—classical or modern—must handle correctly.

Understanding these fundamentals is critical if you want to:

  • Build your own TTS systems
  • Debug unnatural speech output
  • Improve voice quality and realism

What Makes Speech Sound Natural?

Humans instantly recognize unnatural speech.

This happens because natural speech contains subtle variations in:

  • Pitch
  • Timing
  • Stress
  • Intonation

A TTS system must learn and reproduce all of these correctly.

Core Components of TTS Fundamentals

At a fundamental level, every TTS system must solve four problems:

  • What to say (text content)
  • How to pronounce it
  • How long each sound lasts
  • How it should sound emotionally
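One way to picture these four problems is as fields of a single plan object that a TTS front end fills in before synthesis. This is only an illustrative sketch; `UtterancePlan` is a hypothetical name, not a standard API:

```python
from dataclasses import dataclass, field

@dataclass
class UtterancePlan:
    """Hypothetical container for the four problems a TTS system must solve."""
    text: str                                       # what to say
    phonemes: list = field(default_factory=list)    # how to pronounce it
    durations: list = field(default_factory=list)   # how long each sound lasts
    style: str = "neutral"                          # how it should sound emotionally

plan = UtterancePlan(text="Hello")
print(plan.style)  # "neutral" until a prosody model fills it in
```

The rest of this lesson walks through how each of these fields gets populated.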

Text Normalization (Revisited)

Raw text often contains elements that are hard to speak directly:

  • Numbers
  • Dates
  • Abbreviations

Why This Code Exists

This code converts written text into a spoken-friendly form.


def normalize_text(text):
    # Toy normalizer: hard-coded replacements, purely for illustration.
    # Real systems use rule sets or learned models instead.
    text = text.replace("2024", "twenty twenty four")  # expand the number
    text = text.replace("Dr.", "doctor")               # expand the abbreviation
    return text

print(normalize_text("Dr. Smith joined in 2024."))
  

What happens inside:

  • Numbers are expanded
  • Abbreviations become words
Output:

doctor Smith joined in twenty twenty four.

Why this matters:

If normalization fails, even the best model produces wrong speech.
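Hard-coded replacements like the ones above don't scale, so real normalizers use rules or models. As a slightly more general sketch, here is a regex-based rule that expands any four-digit year, assuming the common convention of reading years as two pairs of digits (this is an illustration, not a production rule set):

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    # Spell out a number from 0 to 99
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def expand_year(match):
    # Read a four-digit year as two pairs: 2024 -> "twenty twenty four"
    year = match.group(0)
    first, second = int(year[:2]), int(year[2:])
    if second == 0:
        tail = "hundred"                # 1900 -> "nineteen hundred"
    elif second < 10:
        tail = "oh " + ONES[second]     # 2005 -> "twenty oh five"
    else:
        tail = two_digits(second)
    return two_digits(first) + " " + tail

print(re.sub(r"\b\d{4}\b", expand_year, "She joined in 2024."))
```

Even this rule only covers one reading convention; a full normalizer also needs rules for dates, currency, ordinals, and context-dependent cases like "2024 items" versus "the year 2024".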

Phonemes and Pronunciation

Letters are not speech sounds.

TTS systems rely on **phonemes**—the smallest units of sound that distinguish one word from another.

Why This Code Exists

This example shows how words map to phonemes.


pronunciation = {
    # ARPAbet-style phoneme sequences for each word
    "data": ["D", "EY", "T", "AH"],
    "science": ["S", "AY", "AH", "N", "S"]
}

print(pronunciation["data"])
  

What happens here:

  • Text is converted into sound units
  • Ambiguous spelling is resolved
Output:

['D', 'EY', 'T', 'AH']

Why phonemes are essential:

Without phonemes, pronunciation becomes inconsistent and wrong.
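A dictionary alone cannot cover every word, so real systems pair a large pronunciation lexicon (such as CMUdict) with a trained grapheme-to-phoneme model for unseen words. The sketch below shows the lookup-plus-fallback shape of that design; the letter-by-letter fallback is a deliberately crude stand-in for a real G2P model:

```python
# Tiny lexicon lookup with a naive fallback for out-of-vocabulary words.
LEXICON = {
    "data": ["D", "EY", "T", "AH"],
    "science": ["S", "AY", "AH", "N", "S"],
}

def to_phonemes(word):
    word_key = word.lower()
    if word_key in LEXICON:
        return LEXICON[word_key]
    # Fallback: one unit per letter (crude; a real system uses a G2P model here)
    return list(word.upper())

print(to_phonemes("data"))   # found in the lexicon
print(to_phonemes("tts"))    # falls back to letters: ['T', 'T', 'S']
```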

Duration Modeling

Speech is not only about what is spoken, but **how long** each sound lasts.

Incorrect timing makes speech robotic.

Why This Code Exists

This code demonstrates assigning duration to phonemes.


durations = {
    # Duration of each phoneme, in seconds
    "D": 0.08,
    "EY": 0.12,
    "T": 0.06,
    "AH": 0.10
}

print(durations)
  

What happens internally:

  • Each phoneme receives a time length
  • Speech rhythm is controlled
Output:

{'D': 0.08, 'EY': 0.12, 'T': 0.06, 'AH': 0.10}

Why duration matters:

Natural speech relies heavily on timing variation.
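With per-phoneme durations in hand, the total length of an utterance is simply their sum, and a global rate factor can stretch or compress every phoneme uniformly—one simple way speaking speed is controlled:

```python
durations = {"D": 0.08, "EY": 0.12, "T": 0.06, "AH": 0.10}

# Total utterance length is the sum of phoneme durations
total = sum(durations.values())
print(round(total, 2))  # 0.36 seconds for "data"

# Speaking 25% faster divides every duration by the rate factor
rate = 1.25
faster = {p: d / rate for p, d in durations.items()}
print(round(sum(faster.values()), 3))  # 0.288 seconds
```

Real duration models go further and vary each phoneme individually with context—vowels stretch under emphasis, consonants shorten in fast speech—rather than scaling everything uniformly.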

Pitch and Intonation

Pitch controls how speech rises and falls.

Intonation conveys:

  • Questions
  • Emphasis
  • Emotion

Why This Code Exists

This example shows a simple pitch contour.


pitch_contour = [110, 115, 120, 118, 112]  # fundamental frequency (F0) in Hz, one value per time step
print(pitch_contour)
  

What this represents:

  • Pitch variation across time
  • Natural prosody shaping
Output:

[110, 115, 120, 118, 112]

Why pitch is critical:

Flat pitch instantly sounds synthetic.
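Contour shape also carries meaning: statements typically fall in pitch toward the end, while yes/no questions typically rise. A minimal sketch of that contrast, using made-up Hz values and a simple linear slope:

```python
def final_contour(base=110, steps=5, question=False):
    # Rise toward the end for a question, fall for a statement (linear, for illustration)
    step = 4 if question else -4
    return [base + i * step for i in range(steps)]

print(final_contour(question=False))  # falling: [110, 106, 102, 98, 94]
print(final_contour(question=True))   # rising:  [110, 114, 118, 122, 126]
```

Real prosody models predict smooth, context-dependent contours rather than straight lines, but the rise/fall distinction they must capture is exactly this one.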

Putting Fundamentals Together

A real TTS system simultaneously models:

  • Phonemes
  • Durations
  • Pitch
  • Spectral features

Modern neural models learn all of this automatically, but the fundamentals still apply.
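Tying the pieces together: the per-word values from the earlier examples can be zipped into per-phoneme "frames", which is roughly the shape of the plan an acoustic model turns into audio (the structure here is illustrative, not any particular model's format):

```python
# Values carried over from the earlier examples for the word "data"
phonemes  = ["D", "EY", "T", "AH"]
durations = [0.08, 0.12, 0.06, 0.10]   # seconds
pitch     = [110, 120, 115, 105]       # Hz, one target per phoneme (made-up values)

# One frame per phoneme, combining pronunciation, timing, and pitch
frames = [
    {"phoneme": p, "duration": d, "pitch": f0}
    for p, d, f0 in zip(phonemes, durations, pitch)
]
print(frames[0])  # {'phoneme': 'D', 'duration': 0.08, 'pitch': 110}
```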

Practice

What unit represents pronunciation in TTS systems?



Which factor controls how long sounds are spoken?



What feature controls rise and fall of voice?



Quick Quiz

Which representation is closest to actual speech sounds?





What controls speech rhythm?





Which feature affects intonation?





Recap: TTS fundamentals include phonemes, duration, pitch, and prosody—together they make speech sound human.

Next up: You’ll dive into modern neural TTS models starting with Tacotron-based architectures.