Speech AI Lesson 11 – Evaluation Metrics | Dataplexa

Evaluation Metrics

Until now, we focused on building Speech AI systems: audio processing, features, datasets, and noise handling.

But one critical question remains: How do we know if a Speech AI system is actually good?

This lesson answers that question by introducing evaluation metrics — the backbone of professional Speech AI development.

Why Evaluation Metrics Matter

Speech AI systems are never perfect.

Metrics help us:

  • Measure model performance objectively
  • Compare different models
  • Track improvements over time
  • Detect regressions in production

Without proper evaluation, model improvements become guesswork.

Task-Specific Evaluation

Speech AI is not a single task.

Different tasks require different metrics:

  • Speech Recognition → transcription accuracy
  • Speech Synthesis → naturalness and intelligibility
  • Speaker Recognition → identification accuracy

Choosing the wrong metric can lead to misleading results.

Word Error Rate (WER)

Word Error Rate (WER) is the most common metric for speech recognition.

It measures how different the predicted transcript is from the reference transcript.

WER is calculated as:

WER = (Substitutions + Insertions + Deletions) / Total Words in the Reference

Note that the denominator is the number of words in the reference transcript, not the prediction.

Lower WER means better performance.

WER Example

Reference: "speech ai is powerful"
Prediction: "speech ai powerful"

One deletion out of four reference words occurred → WER = 1/4 = 0.25.


import jiwer

reference = "speech ai is powerful"
prediction = "speech ai powerful"

wer = jiwer.wer(reference, prediction)
print("WER:", wer)
  
WER: 0.25
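The jiwer library handles the word alignment internally. To make the formula concrete, here is a minimal from-scratch sketch using word-level edit distance (the function name `simple_wer` is ours, not part of any library):

```python
def simple_wer(reference: str, prediction: str) -> float:
    """Word Error Rate via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), prediction.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(simple_wer("speech ai is powerful", "speech ai powerful"))  # 0.25
```

One deletion ("is") against a four-word reference gives 1/4 = 0.25, matching jiwer's result.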

Character Error Rate (CER)

Character Error Rate (CER) is similar to WER, but it operates at the character level.

CER is useful for:

  • Languages without clear word boundaries
  • Short utterances
  • Fine-grained error analysis

cer = jiwer.cer(reference, prediction)
print("CER:", round(cer, 2))

CER: 0.14

Deleting "is " costs three character edits against a 21-character reference: 3 / 21 ≈ 0.14.
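The same edit-distance idea applies at the character level. Here is a from-scratch sketch (the function name `simple_cer` is ours; spaces are counted as characters, which is one common convention):

```python
def simple_cer(reference: str, prediction: str) -> float:
    """Character Error Rate: edit distance over characters / reference length."""
    ref, hyp = list(reference), list(prediction)
    # Rolling one-row dynamic-programming table for edit distance.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + cost)      # substitution
    return dp[len(hyp)] / len(ref)

print(round(simple_cer("speech ai is powerful", "speech ai powerful"), 3))  # 0.143
```

Three character deletions ("is ") divided by 21 reference characters gives ≈ 0.143.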

Accuracy, Precision, Recall

For tasks like keyword spotting or speaker recognition, classification metrics are used.

  • Accuracy – overall correctness
  • Precision – correctness of positive predictions
  • Recall – ability to find all positives

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", round(recall_score(y_true, y_pred), 2))

Accuracy: 0.8
Precision: 1.0
Recall: 0.67
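To see where those numbers come from, the same metrics can be computed by hand from the confusion counts for the positive class (a sketch using the same labels; variable names are ours):

```python
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

# Count confusion-matrix cells for the positive class (1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)  # TP=2, FP=0 -> 1.0
recall = tp / (tp + fn)     # TP=2, FN=1 -> 2/3

print(accuracy, precision, round(recall, 2))  # 0.8 1.0 0.67
```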

Mean Opinion Score (MOS)

Speech synthesis quality is hard to measure automatically.

Mean Opinion Score (MOS) is a human-based evaluation metric.

Listeners rate audio quality on a scale (usually 1–5).

Higher MOS indicates more natural-sounding speech.
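MOS itself is just the arithmetic mean of the listener ratings. A minimal sketch (the ratings below are made up for illustration):

```python
# Hypothetical 1-5 ratings from five listeners for one synthesized clip.
ratings = [4, 5, 3, 4, 4]

mos = sum(ratings) / len(ratings)
print("MOS:", mos)  # MOS: 4.0
```

In practice, MOS studies average many listeners over many utterances, and reported scores usually include confidence intervals.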

Objective vs Subjective Metrics

Speech AI uses both:

  • Objective metrics – numerical, automated
  • Subjective metrics – human judgments

Production systems often combine both for reliable evaluation.

Evaluation in Real-World Pipelines

In real systems:

  • Offline evaluation validates model training
  • Online metrics monitor live performance
  • User feedback improves future models

Evaluation never stops after deployment.
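As one concrete example of production monitoring, a deployment gate might compare a candidate model's WER on a fixed test set against the current model's and flag a regression beyond a tolerance. This is a sketch; the function name and threshold are illustrative assumptions, not a standard API:

```python
def wer_regressed(baseline_wer: float, candidate_wer: float,
                  tolerance: float = 0.01) -> bool:
    """Return True if the candidate model is worse than the baseline
    by more than the allowed tolerance (in absolute WER points)."""
    return candidate_wer > baseline_wer + tolerance

print(wer_regressed(0.12, 0.125))  # False: within tolerance
print(wer_regressed(0.12, 0.15))   # True: WER regressed by 3 points
```

Real pipelines typically also run significance tests across utterances, since small WER differences on a single test set can be noise.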

Practice

Which metric measures word-level transcription errors?



Which metric evaluates errors at the character level?



Which metric relies on human listeners for evaluation?



Quick Quiz

For WER, which value indicates better performance?





Mean Opinion Score (MOS) is which type of metric?





Accuracy, precision, and recall are mainly used for which tasks?





Recap: Evaluation metrics quantify Speech AI performance and guide model improvement.

Next up: You’ll learn the limitations and challenges of Speech AI systems in real-world environments.