Speech AI Lesson 27 – TTS Fundamentals | Dataplexa

Text-to-Speech (TTS) Fundamentals

In the previous lesson, you learned what speech synthesis is and how a complete Text-to-Speech (TTS) pipeline works at a high level.

In this lesson, we go deeper into the **fundamental building blocks** that every TTS system—classical or modern—must handle correctly.

Understanding these fundamentals is critical if you want to:

  • Build your own TTS systems
  • Debug unnatural speech output
  • Improve voice quality and realism

What Makes Speech Sound Natural?

Humans instantly recognize unnatural speech.

This happens because natural speech contains subtle variations in:

  • Pitch
  • Timing
  • Stress
  • Intonation

A TTS system must learn and reproduce all of these correctly.

Core Components of TTS Fundamentals

At a fundamental level, every TTS system must solve four problems:

  • What to say (text content)
  • How to pronounce it
  • How long each sound lasts
  • How it should sound emotionally
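One way to picture these four problems is as fields of a single plan object that a TTS front end fills in before synthesis. This is only an illustrative sketch; `UtterancePlan` is a hypothetical name, not a standard API:

```python
from dataclasses import dataclass, field

@dataclass
class UtterancePlan:
    """Hypothetical container for the four problems a TTS system must solve."""
    text: str                                       # what to say
    phonemes: list = field(default_factory=list)    # how to pronounce it
    durations: list = field(default_factory=list)   # how long each sound lasts
    style: str = "neutral"                          # how it should sound emotionally

plan = UtterancePlan(text="Hello")
print(plan.style)  # "neutral" until a prosody model fills it in
```

The rest of this lesson walks through how each of these fields gets populated.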

Text Normalization (Revisited)

Raw text often contains elements that are hard to speak directly:

  • Numbers
  • Dates
  • Abbreviations

Why This Code Exists

This code converts written text into a spoken-friendly form.


def normalize_text(text):
    # Toy normalizer: hard-coded replacements, purely for illustration.
    # Real systems use rule sets or learned models instead.
    text = text.replace("2024", "twenty twenty four")  # expand the number
    text = text.replace("Dr.", "doctor")               # expand the abbreviation
    return text

print(normalize_text("Dr. Smith joined in 2024."))
  

What happens inside:

  • Numbers are expanded
  • Abbreviations become words
Output:

doctor Smith joined in twenty twenty four.

Why this matters:

If normalization fails, even the best model produces wrong speech.
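Hard-coded replacements like the ones above don't scale, so real normalizers use rules or models. As a slightly more general sketch, here is a regex-based rule that expands any four-digit year, assuming the common convention of reading years as two pairs of digits (this is an illustration, not a production rule set):

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    # Spell out a number from 0 to 99
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def expand_year(match):
    # Read a four-digit year as two pairs: 2024 -> "twenty twenty four"
    year = match.group(0)
    first, second = int(year[:2]), int(year[2:])
    if second == 0:
        tail = "hundred"                # 1900 -> "nineteen hundred"
    elif second < 10:
        tail = "oh " + ONES[second]     # 2005 -> "twenty oh five"
    else:
        tail = two_digits(second)
    return two_digits(first) + " " + tail

print(re.sub(r"\b\d{4}\b", expand_year, "She joined in 2024."))
```

Even this rule only covers one reading convention; a full normalizer also needs rules for dates, currency, ordinals, and context-dependent cases like "2024 items" versus "the year 2024".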

Phonemes and Pronunciation

Letters are not speech sounds.

TTS systems rely on **phonemes**—the smallest units of sound that distinguish one word from another.

Why This Code Exists

This example shows how words map to phonemes.


pronunciation = {
    # ARPAbet-style phoneme sequences for each word
    "data": ["D", "EY", "T", "AH"],
    "science": ["S", "AY", "AH", "N", "S"]
}

print(pronunciation["data"])
  

What happens here:

  • Text is converted into sound units
  • Ambiguous spelling is resolved
Output:

['D', 'EY', 'T', 'AH']

Why phonemes are essential:

Without phonemes, pronunciation becomes inconsistent and wrong.
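A dictionary alone cannot cover every word, so real systems pair a large pronunciation lexicon (such as CMUdict) with a trained grapheme-to-phoneme model for unseen words. The sketch below shows the lookup-plus-fallback shape of that design; the letter-by-letter fallback is a deliberately crude stand-in for a real G2P model:

```python
# Tiny lexicon lookup with a naive fallback for out-of-vocabulary words.
LEXICON = {
    "data": ["D", "EY", "T", "AH"],
    "science": ["S", "AY", "AH", "N", "S"],
}

def to_phonemes(word):
    word_key = word.lower()
    if word_key in LEXICON:
        return LEXICON[word_key]
    # Fallback: one unit per letter (crude; a real system uses a G2P model here)
    return list(word.upper())

print(to_phonemes("data"))   # found in the lexicon
print(to_phonemes("tts"))    # falls back to letters: ['T', 'T', 'S']
```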

Duration Modeling

Speech is not only about what is spoken, but **how long** each sound lasts.

Incorrect timing makes speech robotic.

Why This Code Exists

This code demonstrates assigning duration to phonemes.


durations = {
    # Duration of each phoneme, in seconds
    "D": 0.08,
    "EY": 0.12,
    "T": 0.06,
    "AH": 0.10
}

print(durations)
  

What happens internally:

  • Each phoneme receives a time length
  • Speech rhythm is controlled
Output:

{'D': 0.08, 'EY': 0.12, 'T': 0.06, 'AH': 0.10}

Why duration matters:

Natural speech relies heavily on timing variation.
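With per-phoneme durations in hand, the total length of an utterance is simply their sum, and a global rate factor can stretch or compress every phoneme uniformly—one simple way speaking speed is controlled:

```python
durations = {"D": 0.08, "EY": 0.12, "T": 0.06, "AH": 0.10}

# Total utterance length is the sum of phoneme durations
total = sum(durations.values())
print(round(total, 2))  # 0.36 seconds for "data"

# Speaking 25% faster divides every duration by the rate factor
rate = 1.25
faster = {p: d / rate for p, d in durations.items()}
print(round(sum(faster.values()), 3))  # 0.288 seconds
```

Real duration models go further and vary each phoneme individually with context—vowels stretch under emphasis, consonants shorten in fast speech—rather than scaling everything uniformly.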

Pitch and Intonation

Pitch controls how speech rises and falls.

Intonation conveys:

  • Questions
  • Emphasis
  • Emotion

Why This Code Exists

This example shows a simple pitch contour.


pitch_contour = [110, 115, 120, 118, 112]  # fundamental frequency (F0) in Hz, one value per time step
print(pitch_contour)
  

What this represents:

  • Pitch variation across time
  • Natural prosody shaping
Output:

[110, 115, 120, 118, 112]

Why pitch is critical:

Flat pitch instantly sounds synthetic.
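Contour shape also carries meaning: statements typically fall in pitch toward the end, while yes/no questions typically rise. A minimal sketch of that contrast, using made-up Hz values and a simple linear slope:

```python
def final_contour(base=110, steps=5, question=False):
    # Rise toward the end for a question, fall for a statement (linear, for illustration)
    step = 4 if question else -4
    return [base + i * step for i in range(steps)]

print(final_contour(question=False))  # falling: [110, 106, 102, 98, 94]
print(final_contour(question=True))   # rising:  [110, 114, 118, 122, 126]
```

Real prosody models predict smooth, context-dependent contours rather than straight lines, but the rise/fall distinction they must capture is exactly this one.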

Putting Fundamentals Together

A real TTS system simultaneously models:

  • Phonemes
  • Durations
  • Pitch
  • Spectral features

Modern neural models learn all of this automatically, but the fundamentals still apply.
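Tying the pieces together: the per-word values from the earlier examples can be zipped into per-phoneme "frames", which is roughly the shape of the plan an acoustic model turns into audio (the structure here is illustrative, not any particular model's format):

```python
# Values carried over from the earlier examples for the word "data"
phonemes  = ["D", "EY", "T", "AH"]
durations = [0.08, 0.12, 0.06, 0.10]   # seconds
pitch     = [110, 120, 115, 105]       # Hz, one target per phoneme (made-up values)

# One frame per phoneme, combining pronunciation, timing, and pitch
frames = [
    {"phoneme": p, "duration": d, "pitch": f0}
    for p, d, f0 in zip(phonemes, durations, pitch)
]
print(frames[0])  # {'phoneme': 'D', 'duration': 0.08, 'pitch': 110}
```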

Practice

What unit represents pronunciation in TTS systems?



Which factor controls how long sounds are spoken?



What feature controls rise and fall of voice?



Quick Quiz

Which representation is closest to actual speech sounds?





What controls speech rhythm?





Which feature affects intonation?





Recap: TTS fundamentals include phonemes, duration, pitch, and prosody—together they make speech sound human.

Next up: You’ll dive into modern neural TTS models starting with Tacotron-based architectures.