DL Lesson 23 – Evaluation Metrics | Dataplexa

Evaluation Metrics for Deep Learning Models

In the previous lesson, we saw why deep neural networks suffer from vanishing and exploding gradients and how modern techniques mitigate these problems.

Now that we know how models learn, the next critical question is:

How do we know whether a deep learning model is actually good?

This is where evaluation metrics come in.


Why Accuracy Alone Is Not Enough

Many beginners judge a model only by accuracy.

However, accuracy can be misleading, especially when data is imbalanced.

For example, if 95% of emails are not spam, a model that always predicts "not spam" will have 95% accuracy — but it is useless.
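The spam example above can be checked in a few lines. This is a minimal sketch with made-up counts (95 legitimate emails, 5 spam), not a real dataset:

```python
# Illustrative labels only: 0 = not spam, 1 = spam (assumed 95/5 split)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts "not spam"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 -- yet the model catches zero spam
```

Despite 95% accuracy, the model detects none of the positive class, which is exactly the failure accuracy hides.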

Deep learning models require deeper evaluation.


Classification vs Regression Metrics

Evaluation metrics depend on the type of problem:

Classification models predict categories, while regression models predict continuous values.

Using the wrong metric can lead to incorrect conclusions about model performance.


Common Classification Metrics

For classification tasks, we evaluate how well predictions match true labels.

Important metrics include:

• Accuracy measures overall correctness but ignores the type of errors.
• Precision focuses on how many predicted positives are actually correct.
• Recall focuses on how many actual positives were successfully detected.
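Both precision and recall follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). Here is a minimal sketch with made-up labels for illustration:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 3 actual positives, model predicts 2 positives
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0]
print(precision_recall(y_true, y_pred))  # precision 0.5, recall ~0.33
```

Note that precision divides by predicted positives (TP + FP), while recall divides by actual positives (TP + FN).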


Precision and Recall (Intuition)

Precision answers the question:

“When the model says YES, how often is it right?”

Recall answers:

“Out of all actual YES cases, how many did we catch?”

Different applications prioritize different metrics.


Real-World Example

In medical diagnosis:

Low recall means missing sick patients, which can be dangerous.

In spam detection:

Low precision means blocking important emails.

This is why accuracy alone is insufficient.


F1 Score – Balancing Precision and Recall

The F1 score combines precision and recall into a single metric by taking their harmonic mean.

It is especially useful when class distribution is uneven.

Deep learning practitioners often track F1 score instead of accuracy.
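Because F1 is a harmonic mean, it stays high only when precision and recall are both high; one weak score drags it down sharply. A short sketch:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))  # ~0.9  -- balanced scores keep F1 high
print(f1(0.9, 0.1))  # ~0.18 -- one weak score pulls F1 far below the average
```

Compare the second case with the arithmetic mean (0.5): the harmonic mean punishes the imbalance, which is why F1 is informative on skewed data.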


Regression Metrics

For regression tasks, we measure how far predictions are from true values.

Common metrics include:

• Mean Absolute Error (MAE): the average absolute difference between predictions and true values.
• Mean Squared Error (MSE): the average squared difference.
• Root Mean Squared Error (RMSE): the square root of MSE, expressed in the same units as the target.

Each metric penalizes errors differently.
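All three metrics are a few lines of code. This sketch uses made-up predictions purely for illustration:

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average |error|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average error^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt(MSE), same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up values: errors of 1, 0, and 2
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(mae(y_true, y_pred))   # 1.0
print(mse(y_true, y_pred))   # ~1.67
print(rmse(y_true, y_pred))  # ~1.29
```

Notice that MSE is already larger than MAE here: the error of 2 contributes 4 after squaring, while the error of 1 contributes only 1.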


Why Squared Errors Matter

MSE and RMSE penalize large errors more heavily.

This is important when large mistakes are unacceptable.

However, they are sensitive to outliers.
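The outlier sensitivity is easy to see numerically. With made-up residuals (assumed errors, not real model output), replacing one error of 1 with an error of 10 roughly triples MAE but inflates MSE by more than 25×:

```python
errors_uniform = [1.0, 1.0, 1.0, 1.0]   # four equal mistakes
errors_outlier = [1.0, 1.0, 1.0, 10.0]  # one large mistake

mae_u = sum(abs(e) for e in errors_uniform) / 4   # 1.0
mae_o = sum(abs(e) for e in errors_outlier) / 4   # 3.25
mse_u = sum(e ** 2 for e in errors_uniform) / 4   # 1.0
mse_o = sum(e ** 2 for e in errors_outlier) / 4   # 25.75
```

If your data contains genuine outliers you do not want to dominate training or evaluation, MAE (or a robust loss) is often the safer choice.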


Evaluation During Training

During training, deep learning models are typically evaluated on a held-out validation set after every epoch.

This helps us:

• Detect overfitting
• Compare training vs validation performance
• Decide when to stop training

Metrics guide training decisions.
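"Decide when to stop training" is usually implemented as early stopping: stop once the validation metric has not improved for a fixed number of epochs (the patience). A minimal sketch with hypothetical validation losses:

```python
# Hypothetical per-epoch validation losses (illustration only)
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]

patience = 2          # epochs to wait without improvement
best = float("inf")   # best validation loss seen so far
wait = 0
stop_epoch = None

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, wait = loss, 0          # improvement: reset the counter
    else:
        wait += 1                     # no improvement this epoch
        if wait >= patience:
            stop_epoch = epoch        # patience exhausted: stop
            break

print(stop_epoch, best)  # stops at epoch 5; best loss was 0.50
```

Frameworks provide this out of the box (e.g. a Keras EarlyStopping callback), but the logic is exactly this loop.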


Example: Compiling a Model with Metrics

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)

Here, accuracy is tracked during training, but additional metrics may be needed during evaluation.
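For a binary classifier, Keras can also track precision and recall per epoch. This is a configuration sketch, assuming TensorFlow/Keras and an already-built model named `model` (the `Precision` and `Recall` metric classes are built into `tf.keras.metrics`):

```python
import tensorflow as tf

# Sketch: `model` is assumed to be an already-built binary classifier.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[
        "accuracy",
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
    ],
)
```

With these in place, `model.fit(...)` reports precision and recall alongside accuracy for both the training and validation sets.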


Choosing the Right Metric

There is no single best metric.

The right choice depends on:

• Problem type
• Cost of errors
• Data distribution

Professional models are evaluated using multiple metrics.


Mini Practice

Which metric would you prioritize for a fraud detection system — precision or recall? Why?


Exercises

Exercise 1:
Why can high accuracy be misleading?

Because accuracy ignores class imbalance and error types.

Exercise 2:
When is F1 score preferred over accuracy?

When classes are imbalanced and both precision and recall matter.

Quick Quiz

Q1. Which metric penalizes large errors more?

Mean Squared Error (MSE).

Q2. Why are multiple metrics often used?

Each metric highlights different aspects of performance.

In the next lesson, we will explore the confusion matrix and visually understand how predictions are distributed across classes.