AI Lesson 45 – Cross-Validation | Dataplexa

Cross-Validation

Cross-Validation is a technique used to evaluate how well a machine learning model will perform on unseen data. Instead of testing a model on just one train-test split, cross-validation tests it multiple times using different portions of the data.

This gives a more reliable estimate of a model’s real-world performance and helps detect overfitting early.

Why Cross-Validation Is Needed

A single train-test split can be misleading. A model might perform well just because it got an “easy” test set. Cross-validation reduces this risk by testing the model across multiple data splits.

  • More reliable performance measurement
  • Better model comparison
  • Early detection of overfitting
  • Better use of limited data

Real-World Connection

Think of hiring an employee based on multiple interviews instead of one. Each interview evaluates different skills. Cross-validation works the same way by testing the model multiple times.

How Cross-Validation Works

The dataset is divided into several equal parts called folds. The model is trained on some folds and tested on the remaining fold. This process repeats until every fold has been used as a test set.

K-Fold Cross-Validation

K-Fold Cross-Validation is the most common method. The value of K determines how many folds the data is split into; with K = 3, for example, the model is trained and evaluated three times.


from sklearn.model_selection import KFold
import numpy as np

data = np.array([1, 2, 3, 4, 5])
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(data):
    print("Train:", train_index, "Test:", test_index)
  
Train: [2 3 4] Test: [0 1]
Train: [0 1 4] Test: [2 3]
Train: [0 1 2 3] Test: [4]

Each fold becomes a test set once, ensuring balanced evaluation.
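
To make the loop concrete, here is a minimal sketch that trains and scores a model inside each fold. The iris dataset and logistic regression model are assumptions for illustration; they do not appear in the snippet above. Note that shuffle=True matters here because the iris samples are ordered by class:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# shuffle=True mixes the samples before splitting; iris is ordered
# by class, so unshuffled folds would be badly unbalanced
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    model.fit(X[train_index], y[train_index])
    # score() returns mean accuracy for classifiers
    print("Fold accuracy:", model.score(X[test_index], y[test_index]))

The cross_val_score helper shown in the next section automates exactly this loop.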

Cross-Validation with a Model

Scikit-learn provides built-in helpers such as cross_val_score that run this whole loop in one call.


from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

scores = cross_val_score(model, X, y, cv=5)
print(scores)
  
[0.96 0.93 0.97 0.95 0.96]

Each value represents accuracy from one fold. The average score gives a stable performance estimate.
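
Since cross_val_score returns a NumPy array, the summary numbers can be computed directly from it as a short follow-up to the snippet above:

print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())

The mean is the headline number; the standard deviation shows how much performance varies from fold to fold.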

Stratified Cross-Validation

For classification problems, it is important that each fold contains a similar class distribution. Stratified Cross-Validation ensures this balance; a short sketch follows the list below.

  • Used mainly for classification
  • Preserves class ratios
  • More reliable for imbalanced data
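
Here is a minimal sketch using scikit-learn's StratifiedKFold, carrying over the dataset and model from the earlier example. For classifiers, passing an integer cv to cross_val_score already uses stratified folds by default; passing a StratifiedKFold object makes the choice explicit and adds shuffling:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# each fold keeps roughly the same class ratios as the full y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=skf)
print(scores)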

When to Use Cross-Validation

  • When dataset size is small
  • When comparing multiple models (see the sketch after this list)
  • When tuning hyperparameters
  • Before deploying a model
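
As one illustration of the model-comparison case, cross-validation puts candidates on an equal footing by scoring each on the same folds. A minimal sketch, assuming a decision tree as the second candidate:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=200),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# score every candidate on the same 5 folds, then compare means
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean accuracy:", scores.mean())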

Practice Questions

Practice 1: What technique evaluates a model using multiple data splits?

Practice 2: What are the data partitions in K-Fold called?

Practice 3: Cross-validation helps detect which problem?

Quick Quiz

Quiz 1: What is the most common cross-validation method?

Quiz 2: If cv=5, how many folds are used?

Quiz 3: Which cross-validation preserves class distribution?

Coming up next: Hyperparameter Tuning — optimizing model performance.