AI Course
Model Evaluation Metrics
Building a machine learning model is only half the job. The real question is: how good is the model? Model Evaluation Metrics help us measure how well a model performs, how reliable its predictions are, and whether it can be trusted in real-world scenarios.
Different problems require different evaluation metrics. A model that performs well on one metric may perform poorly on another. Understanding these metrics is essential to building successful AI systems.
Real-World Connection
Imagine a medical test that detects a disease. If it predicts “healthy” for everyone, accuracy may look high, but the test is useless. Evaluation metrics help us understand whether predictions are truly meaningful, not just mathematically correct.
Why Model Evaluation Is Important
- Measures model performance objectively
- Helps compare different models
- Identifies overfitting and underfitting
- Guides model improvement
Evaluation Metrics for Classification Models
Classification models predict categories such as spam/not spam or fraud/not fraud. The most common metrics are Accuracy, Precision, Recall, and F1-score.
Accuracy
Accuracy measures how many predictions were correct out of all predictions.
Accuracy is useful when classes are balanced, but it can be misleading when one class dominates the dataset.
Accuracy Example (Python)
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0]   # model predictions
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)   # 0.8 -> 4 of 5 predictions are correct
Understanding the Output
The model correctly predicted 80% of the outcomes (4 out of 5). However, accuracy alone does not tell us what kinds of mistakes were made.
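Accuracy on Imbalanced Data (Illustration)
To see why accuracy can mislead when one class dominates, here is a minimal sketch echoing the medical-test example above. The 95/5 class split and the variable names are invented purely for illustration: a model that predicts "healthy" for everyone still scores 95% accuracy while detecting no sick patients at all.
from sklearn.metrics import accuracy_score, recall_score
# Hypothetical imbalanced dataset: 95 healthy (0) and 5 sick (1) patients
y_true_imbalanced = [0] * 95 + [1] * 5
# A useless model that predicts "healthy" for every patient
y_pred_imbalanced = [0] * 100
print("Accuracy:", accuracy_score(y_true_imbalanced, y_pred_imbalanced))  # 0.95 -> looks impressive
print("Recall:", recall_score(y_true_imbalanced, y_pred_imbalanced))      # 0.0 -> misses every sick patient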
Precision
Precision measures how many predicted positive cases were actually positive. It is important when false positives are costly.
For example, in spam detection, marking a legitimate email as spam is undesirable.
Recall
Recall measures how many actual positive cases were correctly identified. It is important when missing positive cases is dangerous.
For example, in disease detection, failing to detect a sick patient can be critical.
Precision and Recall Example
from sklearn.metrics import precision_score, recall_score
# y_true and y_pred are the same lists used in the accuracy example
precision = precision_score(y_true, y_pred)   # 1.0 -> no false positives
recall = recall_score(y_true, y_pred)         # ~0.67 -> one positive case was missed
print("Precision:", precision)
print("Recall:", recall)
Understanding Precision and Recall
With the example data, precision is 1.0 (every case predicted as positive really was positive), but recall is only about 0.67 because the model missed one of the three actual positive cases.
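Precision and Recall by Hand
As a sanity check, precision and recall can be computed directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below reuses the same five labels as the earlier examples and derives the counts by simple comparison.
# Precision = TP / (TP + FP), Recall = TP / (TP + FN)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correctly predicted positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # negatives predicted as positive
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # positives predicted as negative
print("Precision:", tp / (tp + fp))  # 2 / 2 = 1.0
print("Recall:", tp / (tp + fn))     # 2 / 3 ≈ 0.67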
F1 Score
The F1 score combines precision and recall into a single value by taking their harmonic mean. It is useful when both false positives and false negatives matter.
F1 Score Example
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)   # harmonic mean of precision (1.0) and recall (~0.67)
print(f1)                       # 0.8
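F1 Score by Hand
To show how f1_score combines the two metrics, the short sketch below recomputes the same value as the harmonic mean of the precision (1.0) and recall (2/3) found in the previous example.
precision = 1.0   # from the precision and recall example above
recall = 2 / 3    # two of the three actual positive cases were found
f1_manual = 2 * (precision * recall) / (precision + recall)
print(f1_manual)  # 0.8 -> matches f1_score(y_true, y_pred)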
Confusion Matrix
A confusion matrix shows a detailed breakdown of correct and incorrect predictions. It helps visualize where the model is making mistakes.
Confusion Matrix Example
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(y_true, y_pred)   # rows = actual class, columns = predicted class
print(matrix)
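Reading the Confusion Matrix
For a binary problem, scikit-learn orders the matrix with actual classes as rows and predicted classes as columns, so the four cells are TN, FP, FN, and TP. The sketch below unpacks them with ravel() for the same labels used above.
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("True negatives:", tn)    # 2 -> actual 0, predicted 0
print("False positives:", fp)   # 0 -> actual 0, predicted 1
print("False negatives:", fn)   # 1 -> actual 1, predicted 0
print("True positives:", tp)    # 2 -> actual 1, predicted 1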
Evaluation Metrics for Regression Models
Regression models predict continuous values such as prices or temperatures. Common metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared.
Mean Absolute Error (MAE)
MAE measures the average magnitude of the errors without considering their direction. It is easy to interpret because it is expressed in the same units as the target value.
Mean Squared Error (MSE)
MSE averages the squared errors, so it penalizes large mistakes more heavily than small ones.
R-squared
R-squared measures how much of the variation in the target values the model explains, where 1.0 indicates a perfect fit.
Regression Metrics Example
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_true = [100, 200, 300]   # actual values
y_pred = [110, 190, 310]   # predicted values
print("MAE:", mean_absolute_error(y_true, y_pred))   # 10.0
print("MSE:", mean_squared_error(y_true, y_pred))    # 100.0
print("R2:", r2_score(y_true, y_pred))               # 0.985
Practice Questions
Practice 1: Which metric measures overall correctness?
Practice 2: Which metric focuses on false positives?
Practice 3: Which regression metric measures average absolute error?
Quick Quiz
Quiz 1: Which metric balances precision and recall?
Quiz 2: Which tool shows detailed prediction results?
Quiz 3: MAE and MSE are used for which models?
Coming up next: Overfitting and Underfitting — understanding why models fail and how to fix them.