Machine Learning Evaluation Metrics
In the previous lesson, we learned about the bias–variance tradeoff and why balancing model complexity is critical.
Now we answer a very practical question: how do we measure whether a machine learning model is actually good?
This is done using evaluation metrics.
Why Accuracy Alone Is Not Enough
Many beginners assume that accuracy is the only metric that matters.
Accuracy simply tells us the percentage of correct predictions.
But in real-world problems, accuracy alone can be misleading.
For example, if only 5% of customers purchase a product, a model that always predicts “no purchase” will be 95% accurate — but completely useless.
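This failure mode is easy to demonstrate. The labels below are a made-up toy example (not the Dataplexa dataset) that mirrors the 5% purchase rate described above:

```python
# Accuracy paradox: with a 5% purchase rate, a model that always
# predicts "no purchase" is 95% accurate yet identifies zero buyers.
from sklearn.metrics import accuracy_score

y_true = [1] * 5 + [0] * 95   # 5 purchasers out of 100 customers
y_pred = [0] * 100            # the "always no purchase" model

print(accuracy_score(y_true, y_pred))  # → 0.95
```

The score looks excellent, yet the model never finds a single buyer.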
Our Dataset and Evaluation
We continue using the same dataset across the entire ML module:
Dataplexa ML Housing & Customer Dataset
The target variable purchase_decision represents whether a customer made a purchase or not.
This makes our problem a classification task.
Confusion Matrix (Foundation)
Most evaluation metrics are derived from the confusion matrix.
The confusion matrix compares:
- Actual values
- Predicted values
It breaks predictions into four categories:
- True Positives (TP): predicted a purchase, and the customer purchased
- True Negatives (TN): predicted no purchase, and the customer did not purchase
- False Positives (FP): predicted a purchase, but the customer did not purchase
- False Negatives (FN): predicted no purchase, but the customer purchased
Understanding this matrix helps us understand all other metrics.
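With scikit-learn, all four counts can be read straight from the matrix. Here is a minimal sketch using made-up labels (not the Dataplexa dataset):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual purchase decisions
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# For binary labels, rows are actual values and columns are predictions:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # → 3 1 1 3
```

Every metric in this lesson is a ratio built from these four counts.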
Accuracy
Accuracy measures how many predictions were correct overall.
It works well when classes are balanced.
# Assumes y_test (true labels) and y_pred (model predictions) already exist
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
However, accuracy does not tell us what kinds of mistakes the model is making.
Precision (How Reliable Are Positive Predictions?)
Precision answers the question:
“When the model predicts a purchase, how often is it correct?”
High precision means fewer false positives.
This is important when false alarms are costly.
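Concretely, precision is TP / (TP + FP). A quick sketch with made-up labels (not the Dataplexa dataset):

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual purchases
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predictions: 3 TP, 1 FP

# precision = TP / (TP + FP) = 3 / 4
print(precision_score(y_true, y_pred))  # → 0.75
```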
Recall (How Many Positives Did We Catch?)
Recall answers the question:
“Out of all actual purchasers, how many did the model identify?”
High recall means fewer missed opportunities.
This is important when missing a positive case is costly.
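Concretely, recall is TP / (TP + FN). A quick sketch with made-up labels (not the Dataplexa dataset):

```python
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 4 actual purchasers
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # the model catches 3 of them (1 FN)

# recall = TP / (TP + FN) = 3 / 4
print(recall_score(y_true, y_pred))  # → 0.75
```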
F1 Score (Balance Between Precision and Recall)
The F1 score balances precision and recall.
It is especially useful when classes are imbalanced.
A high F1 score means the model performs well on both fronts: it raises few false alarms and misses few actual positives.
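Specifically, F1 is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). A quick check with made-up labels (not the Dataplexa dataset):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # 2 TP, 1 FP, 2 FN

p = precision_score(y_true, y_pred)   # 2 / 3
r = recall_score(y_true, y_pred)      # 2 / 4 = 0.5
f1 = f1_score(y_true, y_pred)

# harmonic mean: 2 * p * r / (p + r) = 4 / 7 ≈ 0.571
print(round(f1, 3))  # → 0.571
```

Because the harmonic mean is dragged down by the smaller of the two values, F1 stays low unless precision and recall are both reasonable.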
Computing Metrics in Python
# Assumes y_test (true labels) and y_pred (model predictions) already exist
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
This report gives accuracy, precision, recall, and F1 score together.
Real-World Interpretation
Think about a marketing campaign:
- High precision → fewer wrong customers targeted
- High recall → more potential buyers reached
The best model depends on business goals.
Choosing the Right Metric
There is no single “best” metric.
The choice depends on:
- Business impact
- Cost of errors
- Class imbalance
Good ML practitioners always look beyond accuracy.
Mini Practice
Think about our dataset.
Ask yourself:
Is it worse to miss a potential buyer, or to incorrectly target a non-buyer?
Exercises
Exercise 1:
Why can accuracy be misleading?
Exercise 2:
What does precision measure?
Exercise 3:
When is recall more important than precision?
Quick Quiz
Q1. Is F1 score useful for imbalanced datasets?
Q2. Should one metric be used for all problems?
This completes the Beginner Level of Machine Learning.
In the next lesson, we move into the Intermediate Level and start with Linear Regression, our first full machine learning algorithm.