Speech AI Course
Evaluation Metrics
Until now, we focused on building Speech AI systems: audio processing, features, datasets, and noise handling.
But one critical question remains: How do we know if a Speech AI system is actually good?
This lesson answers that question by introducing evaluation metrics — the backbone of professional Speech AI development.
Why Evaluation Metrics Matter
Speech AI systems are never perfect.
Metrics help us:
- Measure model performance objectively
- Compare different models
- Track improvements over time
- Detect regressions in production
Without proper evaluation, model improvements become guesswork.
Task-Specific Evaluation
Speech AI is not a single task.
Different tasks require different metrics:
- Speech Recognition → transcription accuracy
- Speech Synthesis → naturalness and intelligibility
- Speaker Recognition → identification accuracy
Choosing the wrong metric can lead to misleading results.
Word Error Rate (WER)
Word Error Rate (WER) is the most common metric for speech recognition.
It measures how different the predicted transcript is from the reference transcript.
WER is calculated using:
WER = (Substitutions + Insertions + Deletions) / Number of Words in the Reference
Lower WER means better performance.
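The formula above can be computed directly with word-level Levenshtein distance. Here is a minimal from-scratch sketch (in practice a library such as jiwer is preferred):

```python
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + sub,  # substitution (or match)
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("speech ai is powerful", "speech ai powerful"))  # 0.25
```

Note that because insertions count as errors, WER can exceed 1.0 for very noisy predictions.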
WER Example
Reference: "speech ai is powerful"
Prediction: "speech ai powerful"
One word ("is") was deleted → 1 error over 4 reference words, so WER = 0.25.
import jiwer

reference = "speech ai is powerful"
prediction = "speech ai powerful"

# One deletion over four reference words → WER = 0.25
wer = jiwer.wer(reference, prediction)
print("WER:", wer)
Character Error Rate (CER)
Character Error Rate (CER) is similar to WER, but it operates at the character level.
CER is useful for:
- Languages without clear word boundaries
- Short utterances
- Fine-grained error analysis
# Same reference and prediction as above, scored at the character level
cer = jiwer.cer(reference, prediction)
print("CER:", cer)
Accuracy, Precision, Recall
For tasks like keyword spotting or speaker recognition, classification metrics are used.
- Accuracy – overall correctness
- Precision – correctness of positive predictions
- Recall – ability to find all positives
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Example binary labels, e.g. keyword detected (1) or not (0)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))    # 0.8
print("Precision:", precision_score(y_true, y_pred))  # 1.0
print("Recall:", recall_score(y_true, y_pred))        # ~0.667
Mean Opinion Score (MOS)
Speech synthesis quality is hard to measure automatically.
Mean Opinion Score (MOS) is a human-based evaluation metric.
Listeners rate audio quality on a scale (usually 1–5).
Higher MOS indicates more natural-sounding speech.
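Once listener ratings are collected, MOS is simply their average; reporting a confidence interval alongside it shows how much listeners agreed. A minimal sketch, using hypothetical example ratings:

```python
import statistics

# Hypothetical listener scores on a 1-5 scale for one audio sample
ratings = [4, 5, 4, 3, 5, 4, 4, 5]

mos = statistics.mean(ratings)
stdev = statistics.stdev(ratings)
# Normal-approximation 95% confidence interval half-width
# (reasonable when the listener pool is large)
ci = 1.96 * stdev / (len(ratings) ** 0.5)

print(f"MOS: {mos:.2f} +/- {ci:.2f}")
```

Published MOS results typically come from many listeners rating many samples under controlled conditions.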
Objective vs Subjective Metrics
Speech AI uses both:
- Objective metrics – numerical, automated
- Subjective metrics – human judgments
Production systems often combine both for reliable evaluation.
Evaluation in Real-World Pipelines
In real systems:
- Offline evaluation validates model training
- Online metrics monitor live performance
- User feedback improves future models
Evaluation never stops after deployment.
Practice
Which metric measures word-level transcription errors?
Which metric evaluates errors at the character level?
Which metric relies on human listeners for evaluation?
Quick Quiz
For WER, which value indicates better performance?
Mean Opinion Score (MOS) is which type of metric?
Accuracy, precision, and recall are mainly used for which tasks?
Recap: Evaluation metrics quantify Speech AI performance and guide model improvement.
Next up: You’ll learn the limitations and challenges of Speech AI systems in real-world environments.