ML Lesson 15 – ML Evaluation Metrics | Dataplexa

Machine Learning Evaluation Metrics

In the previous lesson, we learned about the bias–variance tradeoff and why balancing model complexity is critical.

Now we answer a very practical question: how do we measure whether a machine learning model is actually good?

This is done using evaluation metrics.


Why Accuracy Alone Is Not Enough

Many beginners assume that accuracy is the only metric that matters.

Accuracy simply tells us the percentage of correct predictions.

But in real-world problems, accuracy alone can be misleading.

For example, if only 5% of customers purchase a product, a model that always predicts “no purchase” will be 95% accurate — but completely useless.
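This effect is easy to demonstrate with a small sketch. The labels below are synthetic (a hypothetical 5%-positive class, not the course dataset), but they show how a do-nothing classifier still scores about 95% accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: roughly 5% of 1000 customers purchase (1 = purchase)
rng = np.random.default_rng(42)
y_true = (rng.random(1000) < 0.05).astype(int)

# A "model" that always predicts "no purchase"
y_pred = np.zeros_like(y_true)

# Accuracy is high, yet the model identifies zero actual buyers
print(accuracy_score(y_true, y_pred))
```
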


Our Dataset and Evaluation

We continue using the same dataset across the entire ML module:

Dataplexa ML Housing & Customer Dataset

The target variable purchase_decision represents whether a customer made a purchase or not.

This makes our problem a classification task.


Confusion Matrix (Foundation)

Most evaluation metrics are derived from the confusion matrix.

The confusion matrix compares:

Actual values
Predicted values

It breaks predictions into four categories:

True Positives (TP) – purchases correctly predicted as purchases
True Negatives (TN) – non-purchases correctly predicted as non-purchases
False Positives (FP) – non-purchases incorrectly predicted as purchases
False Negatives (FN) – purchases the model missed

Understanding this matrix helps us understand all other metrics.
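As a quick sketch (using small made-up label lists, not the course dataset), scikit-learn's `confusion_matrix` returns these four counts directly:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted purchase decisions (1 = purchase)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, rows are actual and columns are predicted:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```
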


Accuracy

Accuracy measures how many predictions were correct overall.

It works well when classes are balanced.

from sklearn.metrics import accuracy_score

# assumes y_test (true labels) and y_pred (model predictions) already exist
accuracy_score(y_test, y_pred)

However, accuracy does not tell us what kinds of mistakes the model is making.


Precision (How Reliable Are Positive Predictions?)

Precision answers the question:

“When the model predicts a purchase, how often is it correct?”

High precision means fewer false positives.

This is important when false alarms are costly.
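In terms of the confusion matrix, precision = TP / (TP + FP). A small sketch with the same hypothetical labels as above:

```python
from sklearn.metrics import precision_score

# Hypothetical labels (1 = purchase); precision = TP / (TP + FP)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# The model made 4 positive predictions, 3 of which were correct
print(precision_score(y_true, y_pred))  # 3 / 4 = 0.75
```
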


Recall (How Many Positives Did We Catch?)

Recall answers the question:

“Out of all actual purchasers, how many did the model identify?”

High recall means fewer missed opportunities.

This is important when missing a positive case is costly.
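In confusion-matrix terms, recall = TP / (TP + FN). Using the same hypothetical labels:

```python
from sklearn.metrics import recall_score

# Hypothetical labels (1 = purchase); recall = TP / (TP + FN)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# There are 4 actual purchasers and the model found 3 of them
print(recall_score(y_true, y_pred))  # 3 / 4 = 0.75
```
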


F1 Score (Balance Between Precision and Recall)

The F1 score balances precision and recall.

It is especially useful when classes are imbalanced.

A good F1 score means the model is both accurate and reliable.
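Concretely, F1 is the harmonic mean of precision and recall: F1 = 2·P·R / (P + R). A sketch with hypothetical labels where the model over-predicts purchases slightly:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels (1 = purchase); the model over-predicts the positive class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)  # 3 / 5 = 0.6
r = recall_score(y_true, y_pred)     # 3 / 4 = 0.75
f1 = f1_score(y_true, y_pred)        # 2*p*r / (p + r) ≈ 0.667
print(p, r, f1)
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot achieve a high F1 by excelling at only one of precision or recall.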


Computing Metrics in Python

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

This report shows precision, recall, and F1 score for each class, along with overall accuracy and averaged scores.


Real-World Interpretation

Think about a marketing campaign:

High precision → fewer wrong customers targeted
High recall → more potential buyers reached

The best model depends on business goals.


Choosing the Right Metric

There is no single “best” metric.

The choice depends on:

Business impact
Cost of errors
Class imbalance

Good ML practitioners always look beyond accuracy.


Mini Practice

Think about our dataset.

Ask yourself:

Is it worse to miss a potential buyer, or to incorrectly target a non-buyer?


Exercises

Exercise 1:
Why can accuracy be misleading?

Because it does not reveal the types of errors a model makes, especially with imbalanced data.

Exercise 2:
What does precision measure?

It measures how often positive predictions are actually correct.

Exercise 3:
When is recall more important than precision?

When missing a positive case is more costly than a false alarm.

Quick Quiz

Q1. Is F1 score useful for imbalanced datasets?

Yes. It balances precision and recall.

Q2. Should one metric be used for all problems?

No. Metric choice depends on the problem and business goals.

This completes the Beginner Level of Machine Learning.

In the next lesson, we move into the Intermediate Level and start with Linear Regression, our first full machine learning algorithm.