Speech AI Lesson 11 – Evaluation Metrics | Dataplexa

Evaluation Metrics

Until now, we focused on building Speech AI systems: audio processing, features, datasets, and noise handling.

But one critical question remains: How do we know if a Speech AI system is actually good?

This lesson answers that question by introducing evaluation metrics — the backbone of professional Speech AI development.

Why Evaluation Metrics Matter

Speech AI systems are never perfect.

Metrics help us:

  • Measure model performance objectively
  • Compare different models
  • Track improvements over time
  • Detect regressions in production

Without proper evaluation, model improvements become guesswork.

Task-Specific Evaluation

Speech AI is not a single task.

Different tasks require different metrics:

  • Speech Recognition → transcription accuracy
  • Speech Synthesis → naturalness and intelligibility
  • Speaker Recognition → identification accuracy

Choosing the wrong metric can lead to misleading results.

Word Error Rate (WER)

Word Error Rate (WER) is the most common metric for speech recognition.

It measures how different the predicted transcript is from the reference transcript.

WER is calculated as:

WER = (Substitutions + Insertions + Deletions) / Total Words in the Reference

Note that the denominator is the number of words in the reference transcript, not the prediction.

Lower WER means better performance.

WER Example

Reference: "speech ai is powerful"
Prediction: "speech ai powerful"

One deletion out of four reference words occurred → WER = 1/4 = 0.25.


import jiwer

reference = "speech ai is powerful"
prediction = "speech ai powerful"

wer = jiwer.wer(reference, prediction)
print("WER:", wer)
  
WER: 0.25
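The jiwer library handles the word alignment internally. To make the formula concrete, here is a minimal from-scratch sketch using word-level edit distance (the function name `simple_wer` is ours, not part of any library):

```python
def simple_wer(reference: str, prediction: str) -> float:
    """Word Error Rate via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), prediction.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(simple_wer("speech ai is powerful", "speech ai powerful"))  # 0.25
```

One deletion ("is") against a four-word reference gives 1/4 = 0.25, matching jiwer's result.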

Character Error Rate (CER)

Character Error Rate (CER) is similar to WER, but it operates at the character level.

CER is useful for:

  • Languages without clear word boundaries
  • Short utterances
  • Fine-grained error analysis

cer = jiwer.cer(reference, prediction)
print("CER:", round(cer, 2))

CER: 0.14

Deleting "is " costs three character edits against a 21-character reference: 3 / 21 ≈ 0.14.
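The same edit-distance idea applies at the character level. Here is a from-scratch sketch (the function name `simple_cer` is ours; spaces are counted as characters, which is one common convention):

```python
def simple_cer(reference: str, prediction: str) -> float:
    """Character Error Rate: edit distance over characters / reference length."""
    ref, hyp = list(reference), list(prediction)
    # Rolling one-row dynamic-programming table for edit distance.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + cost)      # substitution
    return dp[len(hyp)] / len(ref)

print(round(simple_cer("speech ai is powerful", "speech ai powerful"), 3))  # 0.143
```

Three character deletions ("is ") divided by 21 reference characters gives ≈ 0.143.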

Accuracy, Precision, Recall

For tasks like keyword spotting or speaker recognition, classification metrics are used.

  • Accuracy – overall correctness
  • Precision – correctness of positive predictions
  • Recall – ability to find all positives

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", round(recall_score(y_true, y_pred), 2))

Accuracy: 0.8
Precision: 1.0
Recall: 0.67
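To see where those numbers come from, the same metrics can be computed by hand from the confusion counts for the positive class (a sketch using the same labels; variable names are ours):

```python
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

# Count confusion-matrix cells for the positive class (1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)  # TP=2, FP=0 -> 1.0
recall = tp / (tp + fn)     # TP=2, FN=1 -> 2/3

print(accuracy, precision, round(recall, 2))  # 0.8 1.0 0.67
```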

Mean Opinion Score (MOS)

Speech synthesis quality is hard to measure automatically.

Mean Opinion Score (MOS) is a human-based evaluation metric.

Listeners rate audio quality on a scale (usually 1–5).

Higher MOS indicates more natural-sounding speech.
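MOS itself is just the arithmetic mean of the listener ratings. A minimal sketch (the ratings below are made up for illustration):

```python
# Hypothetical 1-5 ratings from five listeners for one synthesized clip.
ratings = [4, 5, 3, 4, 4]

mos = sum(ratings) / len(ratings)
print("MOS:", mos)  # MOS: 4.0
```

In practice, MOS studies average many listeners over many utterances, and reported scores usually include confidence intervals.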

Objective vs Subjective Metrics

Speech AI uses both:

  • Objective metrics – numerical, automated
  • Subjective metrics – human judgments

Production systems often combine both for reliable evaluation.

Evaluation in Real-World Pipelines

In real systems:

  • Offline evaluation validates model training
  • Online metrics monitor live performance
  • User feedback improves future models

Evaluation never stops after deployment.
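As one concrete example of production monitoring, a deployment gate might compare a candidate model's WER on a fixed test set against the current model's and flag a regression beyond a tolerance. This is a sketch; the function name and threshold are illustrative assumptions, not a standard API:

```python
def wer_regressed(baseline_wer: float, candidate_wer: float,
                  tolerance: float = 0.01) -> bool:
    """Return True if the candidate model is worse than the baseline
    by more than the allowed tolerance (in absolute WER points)."""
    return candidate_wer > baseline_wer + tolerance

print(wer_regressed(0.12, 0.125))  # False: within tolerance
print(wer_regressed(0.12, 0.15))   # True: WER regressed by 3 points
```

Real pipelines typically also run significance tests across utterances, since small WER differences on a single test set can be noise.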

Practice

Which metric measures word-level transcription errors?



Which metric evaluates errors at the character level?



Which metric relies on human listeners for evaluation?



Quick Quiz

For WER, which value indicates better performance?





Mean Opinion Score (MOS) is which type of metric?





Accuracy, precision, and recall are mainly used for which tasks?





Recap: Evaluation metrics quantify Speech AI performance and guide model improvement.

Next up: You’ll learn the limitations and challenges of Speech AI systems in real-world environments.