Speech AI Course
Tacotron Models
In the previous lesson, you learned the fundamental building blocks of Text-to-Speech (TTS): phonemes, duration, pitch, and prosody.
In this lesson, we move into the **neural era of speech synthesis** by studying one of the most important architectures ever created for TTS: Tacotron.
Tacotron marked the shift from rule-based and statistical TTS systems to fully neural, end-to-end speech generation.
What Is Tacotron?
Tacotron is a sequence-to-sequence neural network that converts text (or phonemes) directly into spectrograms. (The original Tacotron predicted linear spectrograms; its successor, Tacotron 2, predicts Mel spectrograms, and that is the convention we follow in this lesson.)
Instead of manually modeling pitch, duration, and prosody, Tacotron learns them automatically from data.
This was a major breakthrough in speech synthesis.
Why Tacotron Was a Breakthrough
Before Tacotron, TTS systems required:
- Separate duration models
- Hand-designed linguistic rules
- Complex pipelines
Tacotron simplified all of this by learning the entire text → speech-feature mapping in a single model.
High-Level Tacotron Pipeline
Tacotron follows this flow:
Text / Phonemes → Encoder → Attention → Decoder → Mel Spectrogram
A vocoder then converts the Mel spectrogram into audio.
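Before looking at each stage in detail, the flow above can be sketched as a chain of tensor shapes. The sizes here (a 50-symbol vocabulary, 256-dim model, 80 Mel bins) are illustrative stand-ins, not Tacotron's actual hyperparameters, and the modules are deliberately oversimplified:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not Tacotron's real hyperparameters
VOCAB, DIM, N_MELS = 50, 256, 80

text_ids = torch.tensor([[3, 14, 7, 22]])        # (batch=1, text_len=4)

embed = nn.Embedding(VOCAB, DIM)                 # text tokens -> dense vectors
encoder = nn.LSTM(DIM, DIM, batch_first=True)    # stand-in for the encoder
decoder_proj = nn.Linear(DIM, N_MELS)            # stand-in for the decoder output

enc_out, _ = encoder(embed(text_ids))            # (1, 4, 256)
mel = decoder_proj(enc_out)                      # (1, 4, 80)
print(mel.shape)                                 # torch.Size([1, 4, 80])
```

Note that the real decoder runs autoregressively and emits many more Mel frames than there are text steps; this sketch only shows how the shapes connect from one stage to the next.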
Input Representation
Neural networks cannot consume raw text strings directly.
Instead, input text is converted into:
- Characters or phonemes (a sequence of discrete token IDs)
- Embedding vectors (one dense vector per token)
Why This Code Exists
This code shows how text tokens are converted into embeddings before entering the Tacotron encoder.
```python
import torch
import torch.nn as nn

# 50 possible symbols, each mapped to a 256-dim dense vector
embedding = nn.Embedding(num_embeddings=50, embedding_dim=256)

text_ids = torch.tensor([3, 14, 7, 22])  # four token IDs
embedded = embedding(text_ids)

print(embedded.shape)  # torch.Size([4, 256])
```
What happens inside:
- Each text token becomes a dense vector
- Semantic and pronunciation patterns are encoded
Why this matters:
Neural networks cannot operate on raw text; embeddings turn discrete tokens into continuous vectors the network can learn from.
Encoder: Understanding the Text
The encoder processes the embedded text and extracts meaningful representations.
It learns:
- Word relationships
- Pronunciation context
- Sequential dependencies
Why This Code Exists
This simplified example stands in for the encoder using a single recurrent layer.

```python
# A single LSTM stands in for Tacotron's full encoder
# (Tacotron 2 uses convolutional layers followed by a bidirectional LSTM)
encoder = nn.LSTM(
    input_size=256,
    hidden_size=256,
    batch_first=True,
)

# Add a batch dimension: (4, 256) -> (1, 4, 256)
encoder_outputs, _ = encoder(embedded.unsqueeze(0))

print(encoder_outputs.shape)  # torch.Size([1, 4, 256])
```
What happens here:
- Sequence context is learned
- Temporal dependencies are captured
Why the encoder is important:
It allows Tacotron to understand text as a sequence, not isolated symbols.
Attention Mechanism
The attention mechanism is the heart of Tacotron.
It decides:
- Which text part to focus on
- When to move forward in the sentence
Without attention, alignment between text and speech fails.
Why This Code Exists
This simplified example shows content-based attention weight computation: one score per text position, normalized with softmax.

```python
# A random vector stands in for the decoder's current state
query = torch.rand(256)

# One similarity score per text position: (4, 256) @ (256,) -> (4,)
scores = encoder_outputs[0] @ query

attention_weights = torch.softmax(scores, dim=0)
print(attention_weights)  # four weights that sum to 1
```
What happens internally:
- Weights sum to 1
- Model learns alignment between text and audio frames
Why attention matters:
It prevents skipping words or repeating phrases.
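Once the weights are computed, attention produces a single context vector: a weighted average of the encoder outputs. A minimal, self-contained sketch (random tensors stand in for real encoder and decoder states; the sizes are illustrative):

```python
import torch

text_len, dim = 4, 256
encoder_outputs = torch.rand(text_len, dim)   # stand-in encoder states
query = torch.rand(dim)                       # stand-in decoder state

scores = encoder_outputs @ query              # one score per text position
weights = torch.softmax(scores, dim=0)        # weights sum to 1
context = weights @ encoder_outputs           # weighted average: (256,)
print(context.shape)                          # torch.Size([256])
```

The decoder consumes this context vector at every step, so the model "reads" a soft mixture of text positions rather than jumping between hard indices.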
Decoder: Generating Speech Features
The decoder generates Mel spectrogram frames one step at a time.
It uses:
- Encoder context
- Attention output
- Previous audio frames
Why This Code Exists
This simplified code demonstrates a single decoder step projecting a hidden state into one frame of Mel features.

```python
# A linear projection stands in for the full decoder:
# 256-dim hidden state -> 80 Mel bins for one spectrogram frame
decoder = nn.Linear(256, 80)

mel_frame = decoder(encoder_outputs[:, 0, :])  # frame for the first text step
print(mel_frame.shape)  # torch.Size([1, 80])
```
What happens here:
- Latent representation becomes acoustic features
- Speech timing and tone are encoded
Why this step is critical:
This is where text turns into sound representation.
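The frame-by-frame behavior described above can be sketched as a loop in which each step's input includes the previous output frame. The module and sizes here are illustrative stand-ins, not Tacotron's real layers:

```python
import torch
import torch.nn as nn

N_MELS, DIM, N_FRAMES = 80, 256, 5

step = nn.Linear(N_MELS + DIM, N_MELS)       # stand-in for one decoder step
context = torch.rand(DIM)                    # stand-in attention context

frames = []
prev = torch.zeros(N_MELS)                   # all-zero "go" frame starts the loop
for _ in range(N_FRAMES):
    prev = step(torch.cat([prev, context]))  # next frame depends on the last
    frames.append(prev)

mel = torch.stack(frames)                    # (5, 80) Mel spectrogram chunk
print(mel.shape)
```

Because each frame depends on the one before it, the loop cannot be parallelized; this is exactly the slow autoregressive decoding noted under Tacotron's limitations.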
From Tacotron to Audio
Tacotron does not generate raw audio.
Its output is a Mel spectrogram, which must be converted into waveform using a vocoder.
Later lessons will focus entirely on vocoders.
Limitations of Tacotron
Despite its success, Tacotron has limitations:
- Slow autoregressive decoding
- Attention failures on long text
- Training instability
These limitations led to improved architectures.
Practice
What does Tacotron generate as output?
Which component aligns text with speech frames?
Which part processes the input text sequence?
Quick Quiz
Which model first enabled end-to-end neural TTS?
Which mechanism handles text-to-audio alignment?
What representation is passed to the vocoder?
Recap: Tacotron is an encoder-attention-decoder model that converts text into Mel spectrograms for neural TTS.
Next up: You’ll learn about WaveNet and neural vocoders that turn spectrograms into high-quality audio.