Speech AI Course
WaveNet and Neural Vocoders
In the previous lesson, you learned how Tacotron converts text into Mel spectrograms.
However, Mel spectrograms are not sound.
To hear speech, we need a component that converts acoustic features into raw audio waveforms.
This component is called a vocoder.
What Is a Vocoder?
A vocoder is a model that generates audio samples from acoustic representations such as:
- Mel spectrograms
- Linear spectrograms
- Other learned features
Modern vocoders are fully neural and produce highly realistic speech.
Why Traditional Vocoders Failed
Older vocoders relied on signal-processing assumptions:
- Fixed source-filter models
- Simplified excitation signals
These assumptions limited speech quality and produced robotic voices.
WaveNet: A Major Breakthrough
WaveNet was the first neural vocoder to generate raw audio sample-by-sample.
Instead of predicting frames, it predicts one audio sample at a time.
How WaveNet Works (Conceptually)
WaveNet models the probability of the next audio sample given all previous samples.
Mathematically:
P(x) = ∏_{t=1}^{T} P(x_t | x_1, …, x_{t−1})
This allows extremely realistic waveform generation.
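The factorization above can be made concrete with a toy sampling loop. This is a sketch, not a real WaveNet: `toy_model` is a hypothetical stand-in for a trained network that returns logits over 256 mu-law quantization levels.

```python
import torch

# Toy illustration of autoregressive sampling: each new sample is drawn
# conditioned on all previously generated samples.
def toy_model(history: torch.Tensor) -> torch.Tensor:
    # Pretend "network": logits over 256 mu-law levels, purely illustrative.
    return torch.randn(256) + history.mean()

samples = torch.zeros(1)              # seed with one sample of silence
for t in range(100):                  # generate 100 samples, one at a time
    logits = toy_model(samples)
    next_sample = torch.distributions.Categorical(logits=logits).sample()
    # Map the 8-bit class index back to the [-1, 1] waveform range
    samples = torch.cat([samples, (next_sample.float() / 127.5 - 1).unsqueeze(0)])

print(samples.shape)  # torch.Size([101]): the seed plus 100 generated samples
```

Note the sequential dependency: sample t cannot be computed until sample t−1 exists, which is exactly why this procedure is slow.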
Dilated Causal Convolutions
WaveNet uses dilated causal convolutions to efficiently model long audio contexts.
Causality ensures the model never sees the future.
Why This Code Exists
This code demonstrates a simple dilated convolution layer similar to what WaveNet uses internally.
import torch
import torch.nn as nn

# A single dilated 1-D convolution. With kernel_size=2 and dilation=4,
# each output sees inputs 4 steps apart. Conv1d has no built-in causal
# mode, so WaveNet-style models left-pad the input by
# (kernel_size - 1) * dilation so no future samples leak in.
conv = nn.Conv1d(
    in_channels=1,
    out_channels=16,
    kernel_size=2,
    dilation=4,
)
x = torch.randn(1, 1, 100)        # (batch, channels, time)
x = nn.functional.pad(x, (4, 0))  # causal left-padding
y = conv(x)
print(y.shape)  # torch.Size([1, 16, 100]): same length as the input
What happens inside:
- Dilation expands the receptive field without adding parameters
- Stacking layers with growing dilations captures long-term dependencies
Why this matters:
Speech depends on long-range patterns such as pitch and prosody, not just neighboring samples.
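A quick calculation shows how far stacked dilations can see. This sketch assumes the dilation schedule described in the WaveNet paper: kernel size 2, dilations doubling from 1 to 512, repeated three times.

```python
# Receptive field of a stack of dilated causal convolutions.
kernel_size = 2
dilations = [2 ** i for i in range(10)] * 3   # 1, 2, 4, ..., 512, three repeats
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
print(receptive_field)           # 3070 samples
print(receptive_field / 16000)   # ~0.19 seconds of 16 kHz audio
```

With only 30 layers, the model sees nearly a fifth of a second of context, which an undilated stack of the same depth could not approach.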
Conditioning WaveNet on Spectrograms
On its own, WaveNet produces plausible-sounding but meaningless audio, like babbling.
To generate speech, it must be conditioned on acoustic features.
Why This Code Exists
This example shows how Mel spectrograms condition waveform generation.
import torch
import torch.nn.functional as F

mel_features = torch.randn(1, 80, 25)   # (batch, mel bins, frames)
audio_context = torch.randn(1, 1, 100)  # (batch, channels, samples)
# Mel frames are coarser than audio samples, so the conditioning signal
# is first upsampled to the audio rate (here 25 -> 100). Real models use
# learned upsampling and per-layer projections; addition is a toy stand-in.
mel_upsampled = F.interpolate(mel_features, size=100, mode="linear")
conditioned = audio_context + mel_upsampled.mean(dim=1, keepdim=True)
print(conditioned.shape)  # torch.Size([1, 1, 100])
What happens here:
- Mel features guide the shape of the waveform
- The generated speech matches the linguistic content
Why conditioning is essential:
Without it, the output would be fluent-sounding babble, not intelligible speech.
Why WaveNet Is Slow
WaveNet generates audio one sample at a time.
For 16 kHz audio, that means:
- 16,000 predictions per second
This makes naive WaveNet too slow for real-time applications.
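The cost is easy to quantify. Assuming, hypothetically, 1 ms of network latency per sequential forward pass (actual latency depends on hardware and implementation), a back-of-the-envelope calculation:

```python
sample_rate = 16_000          # samples per second of audio
latency_per_sample = 0.001    # hypothetical 1 ms per sequential prediction
seconds_to_generate_1s = sample_rate * latency_per_sample
print(seconds_to_generate_1s)  # 16.0 -> 16 s of compute per 1 s of audio
```

Even at a tenth of that latency, generation would still be slower than real time, which motivates the parallel vocoders below.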
Faster Neural Vocoders
To overcome WaveNet’s speed limitations, new vocoders were developed:
- WaveRNN
- Parallel WaveNet
- HiFi-GAN
These models trade additional architectural complexity for large speed gains.
WaveRNN (Concept Overview)
WaveRNN uses recurrent networks to generate audio more efficiently.
It balances quality and speed for deployment scenarios.
GAN-Based Vocoders
Modern systems like HiFi-GAN use adversarial training.
They generate audio in parallel, making them suitable for real-time use.
Why This Code Exists
This pseudocode illustrates the generator–discriminator loop at the heart of GAN vocoders.
fake_audio = generator(mel_features)    # generator maps mels to waveforms
real_score = discriminator(real_audio)  # discriminator rates real audio
fake_score = discriminator(fake_audio)  # ... and generated audio
What happens here:
- Generator learns realistic audio
- Discriminator enforces quality
Why GAN vocoders dominate today:
They provide the best balance of quality and speed.
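The generator–discriminator idea can be sketched with runnable code. This is not HiFi-GAN's actual architecture: the layer sizes, the single transposed-convolution upsampler, and the least-squares loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy generator: upsamples an 80-bin mel sequence to a waveform.
# One transposed conv with stride 4 turns 25 frames into 100 samples;
# real GAN vocoders stack many such upsampling blocks.
generator = nn.Sequential(
    nn.ConvTranspose1d(80, 32, kernel_size=8, stride=4, padding=2),
    nn.Tanh(),
    nn.Conv1d(32, 1, kernel_size=3, padding=1),
)

# Toy discriminator: strided convs that score a waveform's realism.
discriminator = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=15, stride=4, padding=7),
    nn.LeakyReLU(0.2),
    nn.Conv1d(32, 1, kernel_size=3, padding=1),
)

mel_features = torch.randn(1, 80, 25)
real_audio = torch.randn(1, 1, 100)

fake_audio = generator(mel_features)     # all 100 samples in one parallel pass
real_score = discriminator(real_audio)   # per-frame realism scores
fake_score = discriminator(fake_audio)

# Least-squares GAN losses (one common choice among several):
d_loss = ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()
g_loss = ((fake_score - 1) ** 2).mean()
print(fake_audio.shape)
```

Unlike the autoregressive loop earlier, the generator emits the whole waveform in a single forward pass, which is the source of the speed advantage.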
Comparing Vocoders
- WaveNet: Highest quality, very slow
- WaveRNN: Faster, slightly lower quality
- HiFi-GAN: Fast and high quality
Practice
What component converts spectrograms into audio?
Which vocoder generates audio sample-by-sample?
What guides a vocoder to produce meaningful speech?
Quick Quiz
Which convolution type expands the receptive field?
Why are Mel spectrograms used in vocoders?
Which vocoder is widely used for real-time TTS?
Recap: Vocoders convert acoustic features into audio, with WaveNet pioneering neural waveform generation.
Next up: You’ll learn about Voice Cloning and how models learn speaker identity.