Evaluation Metrics for Deep Learning Models
In the previous lesson, we saw why deep neural networks suffer from vanishing and exploding gradients, and how modern techniques mitigate these problems.
Now that we know how models learn, the next critical question is:
How do we know whether a deep learning model is actually good?
This is where evaluation metrics come in.
Why Accuracy Alone Is Not Enough
Many beginners judge a model only by accuracy.
However, accuracy can be misleading, especially when data is imbalanced.
For example, if 95% of emails are not spam, a model that always predicts "not spam" will have 95% accuracy — but it is useless.
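The spam example above can be checked with a few lines of plain Python (the 950/50 split and the always-"not spam" predictor are illustrative numbers, not real data):

```python
# A "model" that always predicts "not spam" on an imbalanced dataset.
# Labels: 1 = spam, 0 = not spam (950 non-spam, 50 spam emails).
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000  # the model always says "not spam"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 — high accuracy, yet every spam email is missed
```

Accuracy is 95%, yet the model catches zero spam — exactly the failure mode accuracy hides.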
Deep learning models require deeper evaluation.
Classification vs Regression Metrics
Evaluation metrics depend on the type of problem:
Classification models predict categories, while regression models predict continuous values.
Using the wrong metric can lead to incorrect conclusions about model performance.
Common Classification Metrics
For classification tasks, we evaluate how well predictions match true labels.
Important metrics include:
Accuracy measures overall correctness, but ignores the type of errors.
Precision focuses on how many predicted positives are actually correct.
Recall focuses on how many actual positives were successfully detected.
Precision and Recall (Intuition)
Precision answers the question:
“When the model says YES, how often is it right?”
Recall answers:
“Out of all actual YES cases, how many did we catch?”
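These two questions translate directly into counts of true positives, false positives, and false negatives. A minimal sketch for a binary task (the helper function and the toy labels are illustrative):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correct YES
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # wrong YES
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed YES
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # "when we say YES, how often right?"
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # "of all YES cases, how many caught?"
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r)  # precision = 2/3, recall = 2/4 = 0.5
```

Here the model makes 3 positive predictions, of which 2 are correct (precision 2/3), but it finds only 2 of the 4 actual positives (recall 0.5).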
Different applications prioritize different metrics.
Real-World Example
In medical diagnosis:
Low recall means missing sick patients, which can be dangerous.
In spam detection:
Low precision means blocking important emails.
This is why accuracy alone is insufficient.
F1 Score – Balancing Precision and Recall
The F1 score combines precision and recall into a single metric: it is their harmonic mean, so it is high only when both are high.
It is especially useful when class distribution is uneven.
Deep learning practitioners often track the F1 score instead of (or alongside) accuracy on imbalanced tasks.
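The harmonic mean behind F1 is a one-line formula; the small function below is an illustrative sketch showing why it punishes imbalance between precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# High precision but very low recall yields a low F1 — unlike the
# arithmetic mean (0.5 here), which would hide the imbalance.
print(f1_score(0.9, 0.1))  # 0.18
print(f1_score(0.8, 0.8))  # 0.8 — balanced metrics score well
```

Because the harmonic mean is dragged toward the smaller value, a model cannot earn a good F1 by excelling at only one of the two.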
Regression Metrics
For regression tasks, we measure how far predictions are from true values.
Common metrics include:
Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
Each metric penalizes errors differently.
Why Squared Errors Matter
MSE and RMSE penalize large errors more heavily.
This is important when large mistakes are unacceptable.
However, they are sensitive to outliers.
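The different penalties are easy to see on a toy example (the numbers below are illustrative, with one deliberately bad prediction acting as an outlier):

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |error|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of error^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 12.0, 11.0, 13.0]
y_pred = [11.0, 11.0, 12.0, 23.0]  # last prediction is off by 10 — an outlier

print(mae(y_true, y_pred))             # 3.25  — the outlier contributes linearly
print(mse(y_true, y_pred))             # 25.75 — the squared outlier dominates
print(math.sqrt(mse(y_true, y_pred)))  # RMSE ≈ 5.07, back in the target's units
```

Three of the four errors are just 1.0, yet the single error of 10 pushes MSE far above MAE — the squaring that makes MSE/RMSE strict about large mistakes is the same property that makes them fragile to outliers.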
Evaluation During Training
Deep learning models are typically evaluated on a validation set after every epoch during training.
This helps us:
• Detect overfitting
• Compare training vs validation performance
• Decide when to stop training
Metrics guide training decisions.
Example: Compiling a Model with Metrics
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)
Here, accuracy is tracked during training, but additional metrics may be needed during evaluation.
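If you want Keras to report more than accuracy, you can pass extra metric objects to `compile`. The sketch below assumes TensorFlow's Keras API (`tf.keras.metrics.Precision` and `tf.keras.metrics.Recall`) and a `model` defined elsewhere; note these two metrics are designed for binary-style outputs, so treat this as an illustrative fragment rather than a drop-in recipe:

```python
import tensorflow as tf

# Illustrative extension of the compile call above: track precision and
# recall during training alongside accuracy. `model` is assumed to exist.
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=[
        "accuracy",
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
    ],
)
```

These named metrics then appear in the per-epoch training logs and in the history returned by `model.fit`.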
Choosing the Right Metric
There is no single best metric.
The right choice depends on:
• Problem type
• Cost of errors
• Data distribution
Professional models are evaluated using multiple metrics.
Mini Practice
Which metric would you prioritize for a fraud detection system — precision or recall? Why?
Exercises
Exercise 1:
Why can high accuracy be misleading?
Exercise 2:
When is F1 score preferred over accuracy?
Quick Quiz
Q1. Which metric penalizes large errors more?
Q2. Why are multiple metrics often used?
In the next lesson, we will explore the confusion matrix and visually understand how predictions are distributed across classes.