Model Evaluation Metrics
Model evaluation metrics help us understand how well a machine learning model is performing. Training a model is only half the job; the real question is whether the model makes correct and reliable predictions.
Different problems need different metrics. A model that works well for predicting house prices may need a completely different evaluation approach than a model detecting fraud.
Why Model Evaluation Is Important
Accuracy alone does not always tell the full story; a model can appear accurate but still fail in real-world scenarios. Careful evaluation:
- Helps compare multiple models
- Detects overfitting and underfitting
- Ensures reliability before deployment
- Guides model improvement decisions
Real-World Connection
Imagine a medical diagnosis system that predicts whether a patient has a disease. Even a small number of wrong predictions can be dangerous. In such cases, evaluation metrics beyond accuracy become critical.
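To make this concrete, here is a small sketch with made-up numbers. A model that simply predicts "healthy" for every patient scores 90% accuracy on an imbalanced dataset, yet its recall reveals that it never catches a single sick patient.
from sklearn.metrics import accuracy_score, recall_score
# Hypothetical labels: 1 = disease, 0 = healthy; only 2 of 20 patients are sick
y_true = [1, 1] + [0] * 18
y_pred = [0] * 20  # naive model that always predicts "healthy"
print(accuracy_score(y_true, y_pred))  # 0.9 -> looks impressive
print(recall_score(y_true, y_pred))    # 0.0 -> misses every sick patient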
Classification Metrics
Classification models predict categories. Common examples include spam detection, fraud detection, and disease diagnosis.
Accuracy
Accuracy measures how many predictions the model got right overall.
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0]  # model predictions
print(accuracy_score(y_true, y_pred))  # 0.8
The model correctly predicted 4 out of 5 cases, for an accuracy of 80%.
Precision
Precision measures how many predicted positives were actually correct.
from sklearn.metrics import precision_score
print(precision_score(y_true, y_pred))  # 1.0 -- uses y_true and y_pred from above
A precision of 1.0 means every positive prediction was correct.
Recall
Recall measures how many actual positives were correctly identified.
from sklearn.metrics import recall_score
print(recall_score(y_true, y_pred))  # ~0.67 (2 of 3 actual positives found)
The model detected 2 of the 3 actual positive cases, a recall of about 67%.
F1-Score
The F1-Score balances precision and recall into a single value.
from sklearn.metrics import f1_score
print(f1_score(y_true, y_pred))  # 0.8 -- harmonic mean of precision and recall
F1-Score is useful when false positives and false negatives both matter.
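When you want all of these numbers at once, scikit-learn's classification_report prints precision, recall, and F1 for each class. Here is a minimal sketch using the same toy labels as above.
from sklearn.metrics import classification_report
y_true = [1, 0, 1, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0]  # model predictions
print(classification_report(y_true, y_pred))  # per-class precision, recall, F1, support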
Regression Metrics
Regression models predict continuous values such as prices, temperatures, or sales.
Mean Absolute Error (MAE)
MAE measures the average absolute difference between predictions and actual values.
from sklearn.metrics import mean_absolute_error
y_true = [100, 150, 200]  # actual values
y_pred = [110, 140, 190]  # predicted values
print(mean_absolute_error(y_true, y_pred))  # (10 + 10 + 10) / 3 = 10.0
On average, predictions are off by 10 units.
Mean Squared Error (MSE)
MSE penalizes larger errors more heavily by squaring them.
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_true, y_pred))  # (100 + 100 + 100) / 3 = 100.0
Larger mistakes have a stronger impact on this metric.
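To see this in action, compare MAE and MSE on a hypothetical set of predictions where a single prediction is badly wrong. The one outlier barely moves MAE but dominates MSE.
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = [100, 150, 200, 250]  # assumed values for illustration
y_pred = [110, 140, 190, 150]  # last prediction is off by 100
print(mean_absolute_error(y_true, y_pred))  # (10 + 10 + 10 + 100) / 4 = 32.5
print(mean_squared_error(y_true, y_pred))   # (100 + 100 + 100 + 10000) / 4 = 2575.0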
R-Squared (R²)
R² explains how much variance in the target variable is captured by the model.
from sklearn.metrics import r2_score
print(r2_score(y_true, y_pred))  # 0.94 -- uses y_true and y_pred from the MAE example
An R² value close to 1 indicates a strong model fit.
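A useful reference point: a model that always predicts the mean of the actual values gets an R² of exactly 0, so any positive R² beats that naive baseline. A quick sketch with the same numbers as above:
from sklearn.metrics import r2_score
y_true = [100, 150, 200]
print(r2_score(y_true, [150, 150, 150]))  # 0.0 -> no better than predicting the mean
print(r2_score(y_true, [110, 140, 190]))  # 0.94 -> captures 94% of the variance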
Choosing the Right Metric
- Use accuracy for balanced classification problems
- Use precision and recall for imbalanced datasets
- Use MAE or MSE for regression tasks
- Use R² to understand explained variance
Practice Questions
Practice 1: Which metric measures overall correct predictions?
Practice 2: Which metric measures detected actual positives?
Practice 3: Which regression metric measures average absolute error?
Quick Quiz
Quiz 1: Which metric balances precision and recall?
Quiz 2: Which metric penalizes large errors more?
Quiz 3: Precision and recall are critical for which datasets?
Coming up next: Overfitting and Underfitting — understanding when models learn too much or too little.