Data Science Lesson 50 – ML Metrics | Dataplexa

Machine Learning · Lesson 50

ML Metrics

Master classification and regression metrics to evaluate model performance with precision, recall, RMSE, and R² using real ecommerce data.

Why Metrics Matter

Your model predicts customer returns with 85% accuracy. Sounds great? Not when 95% of orders aren't returned anyway. A model predicting "no return" for everything achieves 95% accuracy while being completely useless.

Metrics reveal what accuracy hides. Precision tells you how many predicted returns were actual returns. Recall tells you how many actual returns you caught. Both matter — but which matters more depends on your business cost.
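To make the accuracy trap concrete, here is a minimal sketch on a made-up sample of 1,000 orders where 95% are not returned. The order counts are assumptions for illustration, not figures from the lesson's dataset.

```python
# Hypothetical sample: 950 orders not returned (0), 50 returned (1)
actual = [0] * 950 + [1] * 50

# A "model" that predicts "no return" for every order
always_no = [0] * len(actual)

# Accuracy: share of orders where the prediction matches the label
correct = sum(a == p for a, p in zip(actual, always_no))
accuracy = correct / len(actual)

# Recall: share of actual returns the model caught
caught = sum(1 for a, p in zip(actual, always_no) if a == 1 and p == 1)
recall = caught / sum(actual)

print(f"Accuracy: {accuracy:.2f}")  # 0.95
print(f"Recall:   {recall:.2f}")    # 0.00
```

The baseline looks strong on accuracy yet catches no returns at all, which is exactly what recall exposes.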

Suppose missing a high-value return costs INR 50,000 in lost inventory, while a false return alert costs INR 500 in wasted investigation time. Each missed return then costs 100 times as much as a false alarm, so recall matters far more than precision.

Choose the Right Metric

Calculate Multiple Metrics

Interpret Business Impact

Make Data-Driven Decisions

Classification Metrics Deep Dive

Classification metrics start with the confusion matrix. Think of it as a truth table showing where your model got confused.

The scenario: Flipkart's fraud detection team needs to evaluate their return prediction model on 10,000 orders from last month.

# Import libraries for metrics calculation
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Load ecommerce data for analysis
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Libraries imported successfully")

Libraries imported successfully

What just happened?

We imported confusion_matrix and classification_report from sklearn.metrics, the scikit-learn module that provides both classification and regression metrics. The confusion matrix is the foundation for the classification metrics that follow. Try this: Import the metrics you need at the start of model evaluation.

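# Create realistic predictions vs actual returns
np.random.seed(42)  # For reproducible results
# Cast to bool so that ~ flips labels (on 0/1 integers, ~ is bitwise NOT and yields -1/-2)
actual_returns = df['returned'].values.astype(bool)
# Simulate model predictions, starting from a perfect copy
predicted_returns = actual_returns.copy()
# Flip 800 randomly chosen predictions to introduce errors
flip_indices = np.random.choice(len(actual_returns), size=800, replace=False)
predicted_returns[flip_indices] = ~predicted_returns[flip_indices]
print(f"Created {len(flip_indices)} prediction errors out of {len(actual_returns)} total predictions")

Created 800 prediction errors out of 10000 total predictions

What just happened?

We simulated an imperfect model by starting with perfect predictions, then flipping 800 of them at random. Real models rarely err uniformly at random, so treat this as a stand-in for production mistakes rather than a faithful copy of them. The ~ operator flips boolean values, which is why the labels are cast to bool first; on integer 0/1 labels it would produce -1 and -2 instead. Try this: Always test metrics on imperfect predictions to understand their behavior.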
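One caveat about the ~ operator is worth seeing in isolation: NumPy treats it as logical NOT on boolean arrays but as bitwise NOT on integer arrays. The short sketch below uses made-up three-element arrays to show the difference, plus a flip that is safe for 0/1 integers.

```python
import numpy as np

bool_labels = np.array([True, False, True])
int_labels = np.array([1, 0, 1])

flipped_bool = ~bool_labels      # logical NOT on a boolean array
bitwise_int = ~int_labels        # bitwise NOT on integers: ~x == -x - 1
flipped_int = 1 - int_labels     # a safe way to flip 0/1 integer labels

print(flipped_bool)  # [False  True False]
print(bitwise_int)   # [-2 -1 -2]
print(flipped_int)   # [0 1 0]
```

If the label column is stored as integers, either cast it to bool first or use the 1 - x form.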

# Build the confusion matrix - foundation of all metrics
cm = confusion_matrix(actual_returns, predicted_returns)
print("Confusion Matrix:")
print("                Predicted")
print("              No    Yes")
print(f"Actual No    {cm[0,0]}   {cm[0,1]}")
print(f"Actual Yes   {cm[1,0]}   {cm[1,1]}")

Confusion Matrix:
                Predicted
              No    Yes
Actual No    4567   342
Actual Yes   458   4633

What just happened?

The confusion matrix shows 4567 true negatives (correctly predicted no return), 4633 true positives (correctly predicted return), and our errors split into 342 false positives + 458 false negatives. Try this: Always label your confusion matrix clearly — it prevents metric calculation mistakes.
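A tiny hand-countable example shows how the four cells get filled. This sketch uses eight made-up orders and plain Python, and arranges the counts in the same layout as the printout above: rows are actual labels, columns are predicted labels.

```python
# Eight hypothetical orders: 1 = returned, 0 = not returned
actual    = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 0, 1, 1, 1]

# matrix[actual][predicted]: row 0 = actual "No", column 1 = predicted "Yes"
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

true_neg, false_pos = matrix[0]
false_neg, true_pos = matrix[1]

print(matrix)  # [[3, 1], [1, 3]]
```

Reading the result: three true negatives, one false positive, one false negative, and three true positives.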

# Calculate the big 4 classification metrics manually
true_pos = cm[1,1]    # Bottom right: predicted yes, actual yes
false_pos = cm[0,1]   # Top right: predicted yes, actual no  
false_neg = cm[1,0]   # Bottom left: predicted no, actual yes
true_neg = cm[0,0]    # Top left: predicted no, actual no

print(f"True Positives: {true_pos}")
print(f"False Positives: {false_pos}")
print(f"False Negatives: {false_neg}")
print(f"True Negatives: {true_neg}")

True Positives: 4633
False Positives: 342
False Negatives: 458
True Negatives: 4567

What just happened?

We extracted the four building blocks of classification metrics from our confusion matrix. 4633 true positives means we correctly identified 4633 returns. 458 false negatives means we missed 458 actual returns. Try this: Always understand these four numbers before calculating derived metrics.

Model correctly predicts 92% of cases but makes different types of errors

The doughnut shows our model's prediction accuracy breakdown. Green and blue segments represent correct predictions totaling 9,200 orders. Orange shows 342 false alarms where we predicted returns that didn't happen. Red shows 458 missed returns, which are actual returns we failed to catch.

Business impact becomes clear when you cost out these errors. Missing 458 returns at INR 15,000 average order value costs INR 68.7 lakhs. False alarms cost investigation time but won't lose inventory. This tells Flipkart to tune their model for higher recall even if precision drops.
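The cost arithmetic is easy to check in a few lines. The error counts and the INR 15,000 average order value come from the text above; the INR 500 cost per false alarm is borrowed from the lesson's introduction and is an assumption in this context.

```python
missed_returns = 458        # false negatives
false_alarms = 342          # false positives

AVG_ORDER_VALUE = 15_000    # INR lost per missed return
ALERT_COST = 500            # INR per false alarm (assumed, from the introduction)
LAKH = 100_000

missed_cost = missed_returns * AVG_ORDER_VALUE
alarm_cost = false_alarms * ALERT_COST

print(f"Missed returns: INR {missed_cost:,} ({missed_cost / LAKH:.1f} lakhs)")
print(f"False alarms:   INR {alarm_cost:,} ({alarm_cost / LAKH:.2f} lakhs)")
```

Under these assumptions the missed returns cost roughly forty times as much as the false alarms, which is the case for favoring recall.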

# Calculate precision, recall, and F1-score step by step
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Accuracy: {accuracy:.3f}")
print(f"F1-Score: {f1_score:.3f}")

Precision: 0.931
Recall: 0.910
Accuracy: 0.920
F1-Score: 0.921

What just happened?

Precision = 0.931 means 93% of predicted returns were actual returns. Recall = 0.910 means we caught 91% of all actual returns. F1 = 0.921 is the harmonic mean of the two, so it balances both metrics. Try this: Use F1-score when precision and recall matter equally.
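As a cross-check on the manual formulas, scikit-learn's own scoring functions can be run on label arrays rebuilt from the confusion-matrix counts. This is a sketch rather than the lesson's pipeline: the arrays are reconstructed from the four counts, so the ordering of orders is arbitrary, but the scores depend only on the counts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Rebuild label arrays from the confusion-matrix counts shown earlier
tn, fp, fn, tp = 4567, 342, 458, 4633
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Each function takes (y_true, y_pred) and treats label 1 as the positive class
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Accuracy: {accuracy:.3f}")
print(f"F1-Score: {f1:.3f}")
```

If the library values and the hand-calculated values disagree beyond rounding, the usual culprit is a swapped row or column when reading the confusion matrix.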

Regression Metrics Essentials

Regression metrics measure how close your predictions are to actual values. Unlike classification's binary right/wrong, regression deals with degrees of wrongness. Being off by INR 100 isn't the same as being off by INR 10,000.

The scenario: Zomato's pricing team built a model to predict order revenue and needs to evaluate prediction accuracy across different order sizes.

# Import regression metrics from sklearn
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt

# Get actual revenue values from our data
actual_revenue = df['revenue'].values
print(f"Revenue range: INR {actual_revenue.min():,.0f} to INR {actual_revenue.max():,.0f}")
print(f"Mean revenue: INR {actual_revenue.mean():,.0f}")

Revenue range: INR 500 to INR 200,000
Mean revenue: INR 21,847

What just happened?

Our ecommerce data shows revenue ranges from INR 500 to INR 200,000 with a mean of INR 21,847. This wide range means we need metrics that handle both small and large prediction errors appropriately. Try this: Always examine your target variable distribution before choosing regression metrics.

# Create realistic predictions with varying accuracy
np.random.seed(42)
# Add noise proportional to actual values (realistic model behavior)
prediction_noise = np.random.normal(0, 0.15 * actual_revenue)
predicted_revenue = actual_revenue + prediction_noise
# Ensure no negative predictions
predicted_revenue = np.maximum(predicted_revenue, 100)

print(f"Sample of actual vs predicted:")
for i in range(5):
    print(f"Actual: INR {actual_revenue[i]:,.0f} | Predicted: INR {predicted_revenue[i]:,.0f}")

Sample of actual vs predicted:
Actual: INR 15,750 | Predicted: INR 13,892
Actual: INR 8,500 | Predicted: INR 9,247
Actual: INR 45,600 | Predicted: INR 43,821
Actual: INR 2,100 | Predicted: INR 2,387
Actual: INR 89,200 | Predicted: INR 92,156

What just happened?

We simulated a realistic regression model by adding noise proportional to actual values (15% standard deviation). This mimics how real models have larger absolute errors on larger values but similar relative errors. The np.maximum prevents negative predictions. Try this: Always create realistic error patterns when testing metrics.
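The effect of proportional noise is easier to see on two artificial groups of orders. In this sketch the group sizes, the two revenue levels, and the helper name noisy are all made up for illustration; it uses NumPy's Generator API instead of the legacy np.random.seed used in the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical groups of orders: small (INR 1,000) and large (INR 100,000)
small = np.full(5_000, 1_000.0)
large = np.full(5_000, 100_000.0)

def noisy(values, rel_std=0.15):
    """Add zero-mean Gaussian noise whose std is rel_std times each value."""
    return values + rng.normal(0, rel_std * values)

for name, actual in [("small", small), ("large", large)]:
    abs_err = np.abs(noisy(actual) - actual)
    print(f"{name}: mean abs error INR {abs_err.mean():,.0f}, "
          f"relative {np.mean(abs_err / actual):.1%}")
```

The absolute error scales with order size while the relative error stays about the same in both groups, which is the pattern the simulation is meant to reproduce.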

# Calculate the big 3 regression metrics
mae = mean_absolute_error(actual_revenue, predicted_revenue)
mse = mean_squared_error(actual_revenue, predicted_revenue)
rmse = np.sqrt(mse)  # Root mean squared error
r2 = r2_score(actual_revenue, predicted_revenue)

print(f"Mean Absolute Error (MAE): INR {mae:,.0f}")
print(f"Root Mean Squared Error (RMSE): INR {rmse:,.0f}")
print(f"R² Score: {r2:.3f}")
print(f"Mean revenue: INR {actual_revenue.mean():,.0f}")

Mean Absolute Error (MAE): INR 2,687
Root Mean Squared Error (RMSE): INR 3,421
R² Score: 0.978
Mean revenue: INR 21,847

What just happened?

MAE = INR 2,687 means the average prediction error is INR 2,687. RMSE = INR 3,421 is higher because squaring penalizes larger errors more heavily; RMSE can never be smaller than MAE, and the gap here reflects the larger absolute errors on big orders. R² = 0.978 means our model explains 97.8% of revenue variance. Try this: Compare RMSE to MAE. A large gap indicates a few very large errors.
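To see how a single large error separates RMSE from MAE, the three metrics can be computed straight from their definitions on a toy example. The four revenue values and the helper name regression_metrics are hypothetical; the formulas are the standard ones.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) computed directly from their definitions."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return mae, rmse, 1 - ss_res / ss_tot

actual = [100, 200, 300, 400]            # hypothetical revenues
even_errors = [110, 190, 310, 390]       # every prediction off by 10
one_outlier = [110, 190, 310, 300]       # last prediction off by 100

for name, predicted in [("even errors", even_errors), ("one outlier", one_outlier)]:
    mae, rmse, r2 = regression_metrics(actual, predicted)
    print(f"{name}: MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
# even errors: MAE=10.0  RMSE=10.0  R2=0.992
# one outlier: MAE=32.5  RMSE=50.7  R2=0.794
```

When all errors are the same size, MAE and RMSE coincide; one outlier pulls RMSE well above MAE.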

Points close to diagonal line indicate accurate predictions across all revenue ranges

The scatter plot reveals prediction patterns across revenue ranges. Points clustering near the red diagonal line show good predictions. Vertical distance from the line represents prediction error. Notice how larger orders show proportionally similar accuracy — our model maintains relative performance across order sizes. R² of 0.978 means strong predictive power, but business teams care more about absolute errors. MAE of INR 2,687 tells Zomato their pricing estimates are typically off by under INR 3,000 — acceptable for demand planning but potentially problematic for margin calculations on small orders.

Different metric scales require separate evaluation frameworks for classification vs regression

📊 Data Insight

Our return prediction model achieves 93.1% precision and 91.0% recall, missing only 458 out of 5,091 actual returns. The revenue prediction model shows INR 2,687 average error on INR 21,847 mean revenue — a 12.3% relative error rate that's excellent for demand forecasting but needs improvement for precise margin calculations.

Choosing the Right Metric

Wrong metric choice kills projects. I've seen teams optimize for accuracy on imbalanced datasets and wonder why their model fails in production. The metric you choose determines which models you select, how you tune them, and ultimately which problem you solve.

High Stakes Scenarios

Medical diagnosis, fraud detection, safety systems. Use Recall when missing positives costs lives or money. Better safe than sorry.

Resource Constrained

Marketing campaigns, inventory alerts, manual reviews. Use Precision when false positives waste limited resources.

Common Mistake Alert

Using accuracy on imbalanced datasets — your model gets 95% accuracy by predicting the majority class for everything. Always check class distribution first. For 95% negative class, use precision/recall or AUC-ROC instead of accuracy.

Business context trumps statistical perfection every time. A model with 85% recall and 60% precision might beat one with 95% accuracy if your business loses INR 50,000 per missed case but only INR 500 per false alarm.
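The comparison above can be sketched as a simple cost calculation. The two cost figures come from the text; everything else is an assumption for illustration: the helper name expected_cost, a population of 10,000 cases with 5% positives, and the reading of the "95% accuracy" model as one that always predicts the negative class.

```python
COST_MISSED = 50_000      # INR per missed positive (from the text)
COST_FALSE_ALARM = 500    # INR per false alarm (from the text)

def expected_cost(n_positives, recall, precision):
    """Total error cost implied by recall and precision for n_positives actual positives."""
    true_pos = recall * n_positives
    false_neg = n_positives - true_pos
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_pos = true_pos * (1 - precision) / precision if true_pos else 0.0
    return false_neg * COST_MISSED + false_pos * COST_FALSE_ALARM

# Hypothetical population: 10,000 cases, 5% positive
n_positives = 500

tuned_model = expected_cost(n_positives, recall=0.85, precision=0.60)
# Always predicting "negative" gives 95% accuracy but zero recall.
# Precision is undefined with no positive predictions; 1.0 is only a placeholder.
majority_model = expected_cost(n_positives, recall=0.0, precision=1.0)

print(f"Recall-tuned model: INR {tuned_model:,.0f}")
print(f"95%-accuracy model: INR {majority_model:,.0f}")
```

Under these assumptions the recall-tuned model costs a fraction of what the high-accuracy model does, despite its many false alarms.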

Metric	Best For	Avoid When
Precision	False positives expensive	Missing positives costly
Recall	Missing positives expensive	Limited investigation resources
F1-Score	Balanced cost scenarios	Clear cost imbalance exists
MAE	All errors equally bad	Large errors much worse
RMSE	Large errors very costly	Robustness to outliers needed
R²	Model comparison	Absolute error matters

Pro tip: Calculate multiple metrics but optimize for one primary metric aligned with business cost. Report others for context. Stakeholders love seeing precision AND recall even when you optimize for F1-score.


Up Next

Train/Test/Validation

Learn how to properly split your data to get reliable metric scores that actually predict real-world model performance.
