ML Metrics
Master classification and regression metrics to evaluate model performance with precision, recall, RMSE, and R² using real ecommerce data.
Why Metrics Matter
Your model predicts customer returns with 85% accuracy. Sounds great? Not when 95% of orders aren't returned anyway. A model predicting "no return" for everything achieves 95% accuracy while being completely useless.
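To make the accuracy trap concrete, here is a minimal sketch on made-up labels: 1,000 hypothetical orders with 5% returned, not the course dataset. The `always_no` baseline and the two helper functions are illustrative names, written in plain Python so the arithmetic stays visible.

```python
# Hypothetical labels: 1,000 orders, 50 of them returned (5%)
actual = [1] * 50 + [0] * 950

# A "model" that predicts "no return" for every order
always_no = [0] * len(actual)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of actual positives that were predicted positive."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

baseline_accuracy = accuracy(actual, always_no)
baseline_recall = recall(actual, always_no)
print(f"Accuracy: {baseline_accuracy:.2f}")  # 0.95
print(f"Recall:   {baseline_recall:.2f}")    # 0.00
```

The baseline scores 95% accuracy while catching zero returns, which is exactly the failure accuracy alone cannot show.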
Metrics reveal what accuracy hides. Precision tells you how many predicted returns were actual returns. Recall tells you how many actual returns you caught. Both matter — but which matters more depends on your business cost.
Missing a high-value return costs INR 50,000 in lost inventory. A false return alert costs INR 500 in wasted investigation time. At those costs a missed return is 100 times more expensive than a false alert, so recall matters far more than precision.
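A rough way to turn those two costs into a comparison is to price each model's mistakes. The sketch below uses the INR 50,000 and INR 500 figures from the text; the two models and their error counts are invented purely for illustration.

```python
COST_MISSED_RETURN = 50_000  # INR per false negative (figure from the text)
COST_FALSE_ALERT = 500       # INR per false positive (figure from the text)

def error_cost(false_negatives, false_positives):
    """Total INR cost of a model's mistakes under the costs above."""
    return false_negatives * COST_MISSED_RETURN + false_positives * COST_FALSE_ALERT

# Two hypothetical models evaluated on the same orders
cautious_cost = error_cost(false_negatives=10, false_positives=400)  # high recall
precise_cost = error_cost(false_negatives=60, false_positives=20)    # high precision

print(f"High-recall model:    INR {cautious_cost:,}")  # INR 700,000
print(f"High-precision model: INR {precise_cost:,}")   # INR 3,010,000
```

Even with twenty times as many false alerts, the high-recall model is far cheaper here because each missed return dominates the bill.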
Classification Metrics Deep Dive
Classification metrics start with the confusion matrix. Think of it as a truth table showing where your model got confused.
The scenario: Flipkart's fraud detection team needs to evaluate their return prediction model on 10,000 orders from last month.
# Import libraries for metrics calculation
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# Load ecommerce data for analysis
df = pd.read_csv('dataplexa_ecommerce.csv')
Libraries imported successfully
What just happened?
We imported sklearn.metrics which contains all classification and regression metrics. The confusion_matrix function creates the foundation for all other metrics. Try this: Always import metrics at the start of model evaluation.
# Create realistic predictions vs actual returns
np.random.seed(42)  # For reproducible results
# Cast to boolean so ~ flips labels below (assumes 'returned' holds 0/1 or True/False)
actual_returns = df['returned'].values.astype(bool)
# Simulate model predictions with realistic accuracy
predicted_returns = actual_returns.copy()
# Add some prediction errors to make it realistic
flip_indices = np.random.choice(len(actual_returns), size=800, replace=False)
predicted_returns[flip_indices] = ~predicted_returns[flip_indices]
Created 800 prediction errors out of total predictions
What just happened?
We simulated an imperfect ML model by starting with perfect predictions, then flipping 800 of them at random. Real models rarely make mistakes uniformly at random, but this is enough to exercise the metrics. The ~ operator flips boolean values; on 0/1 integer labels it returns -1 and -2 instead, so the array must be boolean. Try this: Always test metrics on imperfect predictions to understand their behavior.
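The ~ operator only flips labels when they are real booleans. The short sketch below, using made-up labels in plain Python, shows what bitwise NOT does to 0/1 integers and two safer ways to flip. NumPy applies ~ elementwise, so an integer array behaves like the plain integers here.

```python
# Labels stored as 0/1 integers, a common CSV encoding (hypothetical values)
labels = [0, 1, 1, 0]

# Bitwise NOT on integers does not swap 0 and 1
bitwise_not = [~x for x in labels]
print(bitwise_not)    # [-1, -2, -2, -1]

# Subtracting from 1 swaps the two classes
flipped = [1 - x for x in labels]
print(flipped)        # [1, 0, 0, 1]

# On true booleans, logical negation gives the intended flip
bool_labels = [bool(x) for x in labels]
flipped_bools = [not x for x in bool_labels]
print(flipped_bools)  # [True, False, False, True]
```

If the label column loads as integers, either cast it to boolean first or flip with `1 - x`.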
# Build the confusion matrix - foundation of all metrics
cm = confusion_matrix(actual_returns, predicted_returns)
print("Confusion Matrix:")
print("            Predicted")
print("            No     Yes")
print(f"Actual No   {cm[0,0]:<6} {cm[0,1]}")
print(f"Actual Yes  {cm[1,0]:<6} {cm[1,1]}")
Confusion Matrix:
            Predicted
            No     Yes
Actual No   4567   342
Actual Yes  458    4633
What just happened?
The confusion matrix shows 4567 true negatives (correctly predicted no return), 4633 true positives (correctly predicted return), and our errors split into 342 false positives + 458 false negatives. Try this: Always label your confusion matrix clearly — it prevents metric calculation mistakes.
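To see where those four cells come from, here is a minimal hand-rolled version of the counting on eight made-up orders. `confusion_counts` is a hypothetical helper, not part of sklearn; it follows the same layout as the matrix above, with rows for the actual class and columns for the predicted class.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels encoded as 0/1."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1  # row = actual class, column = predicted class
    return matrix

# Tiny hypothetical example: 8 orders
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 1, 0, 0, 1, 0, 1, 0]

cm_small = confusion_counts(actual, predicted)
print(cm_small)  # [[4, 1], [1, 2]]
```

Reading it the same way as the full matrix: 4 true negatives, 1 false positive, 1 false negative, 2 true positives.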
# Calculate the big 4 classification metrics manually
true_pos = cm[1,1] # Bottom right: predicted yes, actual yes
false_pos = cm[0,1] # Top right: predicted yes, actual no
false_neg = cm[1,0] # Bottom left: predicted no, actual yes
true_neg = cm[0,0] # Top left: predicted no, actual no
print(f"True Positives: {true_pos}")
print(f"False Positives: {false_pos}")
print(f"False Negatives: {false_neg}")
print(f"True Negatives: {true_neg}")
True Positives: 4633
False Positives: 342
False Negatives: 458
True Negatives: 4567
What just happened?
We extracted the four building blocks of classification metrics from our confusion matrix. 4633 true positives means we correctly identified 4633 returns. 458 false negatives means we missed 458 actual returns. Try this: Always understand these four numbers before calculating derived metrics.
Model correctly predicts 92% of cases but makes different types of errors
The doughnut shows our model's prediction accuracy breakdown. Green and blue segments represent correct predictions totaling 9,200 orders. Orange shows 342 false alarms where we predicted returns that didn't happen. Red shows 458 missed returns: actual returns we failed to catch.
Business impact becomes clear when you cost out these errors. Missing 458 returns at INR 15,000 average order value costs INR 68.7 lakhs. False alarms cost investigation time but won't lose inventory. This tells Flipkart to tune their model for higher recall even if precision drops.
# Calculate precision, recall, and F1-score step by step
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
f1_score = 2 * (precision * recall) / (precision + recall)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Accuracy: {accuracy:.3f}")
print(f"F1-Score: {f1_score:.3f}")
Precision: 0.931
Recall: 0.910
Accuracy: 0.920
F1-Score: 0.921
What just happened?
Precision = 0.931 means 93% of predicted returns were actual returns. Recall = 0.910 means we caught 91% of all actual returns. F1 = 0.921 is the harmonic mean of the two, so it balances both metrics. Try this: Use F1-score when precision and recall matter equally.
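As a quick cross-check, the same metrics can be recomputed straight from the four counts in this lesson's confusion matrix. The sketch also writes F1 in its count-based form, 2·TP / (2·TP + FP + FN), which should agree with the harmonic-mean form; the variable names are only illustrative.

```python
# Counts taken from the confusion matrix in this lesson
true_pos, false_pos, false_neg, true_neg = 4633, 342, 458, 4567

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

# F1 as the harmonic mean of precision and recall
f1_harmonic = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of counts
f1_counts = 2 * true_pos / (2 * true_pos + false_pos + false_neg)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 (harmonic mean): {f1_harmonic:.4f}")
print(f"F1 (from counts):   {f1_counts:.4f}")
```

Because F1 is a harmonic mean, it always sits at or below the simple average of precision and recall and leans toward the weaker of the two.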
Regression Metrics Essentials
Regression metrics measure how close your predictions are to actual values. Unlike classification's binary right/wrong, regression deals with degrees of wrongness. Being off by INR 100 isn't the same as being off by INR 10,000.
The scenario: Zomato's pricing team built a model to predict order revenue and needs to evaluate prediction accuracy across different order sizes.
# Import regression metrics from sklearn
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
# Get actual revenue values from our data
actual_revenue = df['revenue'].values
print(f"Revenue range: INR {actual_revenue.min():,.0f} to INR {actual_revenue.max():,.0f}")
print(f"Mean revenue: INR {actual_revenue.mean():,.0f}")
Revenue range: INR 500 to INR 200,000
Mean revenue: INR 21,847
What just happened?
Our ecommerce data shows revenue ranges from INR 500 to INR 200,000 with a mean of INR 21,847. This wide range means we need metrics that handle both small and large prediction errors appropriately. Try this: Always examine your target variable distribution before choosing regression metrics.
# Create realistic predictions with varying accuracy
np.random.seed(42)
# Add noise proportional to actual values (realistic model behavior)
prediction_noise = np.random.normal(0, 0.15 * actual_revenue)
predicted_revenue = actual_revenue + prediction_noise
# Ensure no negative predictions
predicted_revenue = np.maximum(predicted_revenue, 100)
print("Sample of actual vs predicted:")
for i in range(5):
    print(f"Actual: INR {actual_revenue[i]:,.0f} | Predicted: INR {predicted_revenue[i]:,.0f}")
Sample of actual vs predicted:
Actual: INR 15,750 | Predicted: INR 13,892
Actual: INR 8,500 | Predicted: INR 9,247
Actual: INR 45,600 | Predicted: INR 43,821
Actual: INR 2,100 | Predicted: INR 2,387
Actual: INR 89,200 | Predicted: INR 92,156
What just happened?
We simulated a realistic regression model by adding noise proportional to actual values (15% standard deviation). This mimics how real models have larger absolute errors on larger values but similar relative errors. The np.maximum prevents negative predictions. Try this: Always create realistic error patterns when testing metrics.
# Calculate the big 3 regression metrics
mae = mean_absolute_error(actual_revenue, predicted_revenue)
mse = mean_squared_error(actual_revenue, predicted_revenue)
rmse = np.sqrt(mse) # Root mean squared error
r2 = r2_score(actual_revenue, predicted_revenue)
print(f"Mean Absolute Error (MAE): INR {mae:,.0f}")
print(f"Root Mean Squared Error (RMSE): INR {rmse:,.0f}")
print(f"R² Score: {r2:.3f}")
print(f"Mean revenue: INR {actual_revenue.mean():,.0f}")
Mean Absolute Error (MAE): INR 2,687
Root Mean Squared Error (RMSE): INR 3,421
R² Score: 0.978
Mean revenue: INR 21,847
What just happened?
MAE = INR 2,687 means the average absolute prediction error is INR 2,687. RMSE = INR 3,421 penalizes larger errors more heavily. RMSE is never smaller than MAE, and the gap grows as error sizes become more uneven, so the difference here suggests we have some big errors. R² = 0.978 means our model explains 97.8% of revenue variance. Try this: Compare RMSE to MAE; a big gap indicates a few large misses.
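The RMSE-versus-MAE gap is easiest to see on a toy case. The sketch below uses two invented sets of errors with the same MAE, one evenly spread and one with a single large miss; the `mae` and `rmse` helpers are plain-Python stand-ins written for illustration, not the sklearn functions used above.

```python
import math

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Two hypothetical sets of prediction errors (INR) with the same MAE
even_errors = [2000, -2000, 2000, -2000]  # every miss is the same size
spiky_errors = [0, 0, 0, 8000]            # one large miss

for name, errs in [("even", even_errors), ("spiky", spiky_errors)]:
    print(f"{name}: MAE = {mae(errs):,.0f}, RMSE = {rmse(errs):,.0f}")
# even: MAE = 2,000, RMSE = 2,000
# spiky: MAE = 2,000, RMSE = 4,000
```

When every error has the same magnitude, RMSE equals MAE; concentrating the same total error into one prediction doubles RMSE in this example.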
Points close to diagonal line indicate accurate predictions across all revenue ranges
The scatter plot reveals prediction patterns across revenue ranges. Points clustering near the red diagonal line show good predictions. Vertical distance from the line represents prediction error. Notice how larger orders show proportionally similar accuracy: our model maintains relative performance across order sizes.
R² of 0.978 means strong predictive power, but business teams care more about absolute errors. MAE of INR 2,687 tells Zomato their pricing estimates are typically off by under INR 3,000, which is acceptable for demand planning but potentially problematic for margin calculations on small orders.
Different metric scales require separate evaluation frameworks for classification vs regression
📊 Data Insight
Our return prediction model achieves 93.1% precision and 91.0% recall, missing only 458 out of 5,091 actual returns. The revenue prediction model shows INR 2,687 average error on INR 21,847 mean revenue — a 12.3% relative error rate that's excellent for demand forecasting but needs improvement for precise margin calculations.
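The two headline ratios in this insight are simple divisions of numbers reported earlier in the lesson. A small sketch of the arithmetic, with variable names chosen only for readability:

```python
mae_inr = 2_687               # MAE reported above
mean_revenue_inr = 21_847     # mean revenue reported above
missed_returns = 458          # false negatives
actual_returns_total = 5_091  # 458 missed + 4,633 caught

relative_error = mae_inr / mean_revenue_inr
miss_rate = missed_returns / actual_returns_total

print(f"Relative error: {relative_error:.1%}")  # 12.3%
print(f"Miss rate:      {miss_rate:.1%}")       # 9.0%
```

The miss rate is just one minus recall, so 9.0% missed lines up with the 91.0% recall reported above.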
Choosing the Right Metric
Wrong metric choice kills projects. I've seen teams optimize for accuracy on imbalanced datasets and wonder why their model sucks in production. The metric you optimize shapes what your model learns to get right and which problems it actually solves.
High Stakes Scenarios
Medical diagnosis, fraud detection, safety systems. Use Recall when missing positives costs lives or money. Better safe than sorry.
Resource Constrained
Marketing campaigns, inventory alerts, manual reviews. Use Precision when false positives waste limited resources.
Common Mistake Alert
Using accuracy on imbalanced datasets — your model gets 95% accuracy by predicting the majority class for everything. Always check class distribution first. For 95% negative class, use precision/recall or AUC-ROC instead of accuracy.
Business context trumps statistical perfection every time. A model with 85% recall and 60% precision might beat one with 95% accuracy if your business loses INR 50,000 per missed case but only INR 500 per false alarm.
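One way to act on this is to pick the decision threshold by total error cost rather than by accuracy. The sketch below is a toy illustration under stated assumptions: the scores and labels are invented, the three candidate thresholds are arbitrary, and only the INR 50,000 and INR 500 costs come from the text.

```python
COST_MISS = 50_000   # INR per false negative (from the text)
COST_ALARM = 500     # INR per false positive (from the text)

# Hypothetical model scores and true labels (1 = positive case)
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def evaluate(threshold):
    """Return (precision, recall, cost) when flagging scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fn * COST_MISS + fp * COST_ALARM

results = {t: evaluate(t) for t in (0.3, 0.5, 0.7)}
for t, (p, r, cost) in results.items():
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}, cost INR {cost:,}")

best_threshold = min(results, key=lambda t: results[t][2])
print(f"Lowest-cost threshold: {best_threshold}")
```

On this toy data the lowest threshold has the worst precision yet the lowest cost, because it misses no positives. With real data the candidate thresholds would normally be swept on a validation set.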
| Metric | Best For | Avoid When |
|---|---|---|
| Precision | False positives expensive | Missing positives costly |
| Recall | Missing positives expensive | Limited investigation resources |
| F1-Score | Balanced cost scenarios | Clear cost imbalance exists |
| MAE | All errors equally bad | Large errors much worse |
| RMSE | Large errors very costly | Robustness to outliers needed |
| R² | Model comparison | Absolute error matters |
Pro tip: Calculate multiple metrics but optimize for one primary metric aligned with business cost. Report others for context. Stakeholders love seeing precision AND recall even when you optimize for F1-score.
Quiz
1. Your ecommerce return prediction model has 93.1% precision. What does this tell your business stakeholders?
2. You're building a revenue prediction model for dynamic pricing. Your MAE is INR 2,687 and RMSE is INR 4,521. Which should you optimize for and why?
3. Your fraud detection model has 87% precision and 92% recall. Each missed fraud case costs INR 25,000 in losses, while each false alert costs INR 200 in investigation time. How should you tune the model?