AI Course
Cross-Validation
Cross-Validation is a technique used to evaluate how well a machine learning model will perform on unseen data. Instead of testing a model on just one train-test split, cross-validation tests it multiple times using different portions of the data.
This gives a more reliable estimate of a model’s real-world performance and helps detect overfitting early.
Why Cross-Validation Is Needed
A single train-test split can be misleading. A model might perform well just because it got an “easy” test set. Cross-validation reduces this risk by testing the model across multiple data splits.
- More reliable performance measurement
- Better model comparison
- Early detection of overfitting
- Better use of limited data
Real-World Connection
Think of hiring an employee based on multiple interviews instead of one. Each interview evaluates different skills. Cross-validation works the same way by testing the model multiple times.
How Cross-Validation Works
The dataset is divided into several equal parts called folds. The model is trained on some folds and tested on the remaining fold. This process repeats until every fold has been used as a test set.
K-Fold Cross-Validation
K-Fold Cross-Validation is the most common method. The value of K decides how many folds the data is split into.
from sklearn.model_selection import KFold
import numpy as np
data = np.array([1, 2, 3, 4, 5])
kf = KFold(n_splits=3)
for train_index, test_index in kf.split(data):
    print("Train:", train_index, "Test:", test_index)
Each fold becomes a test set once, ensuring balanced evaluation.
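By default, KFold splits the data in its original order, which can give unrepresentative folds when the dataset is sorted or grouped. A sketch of shuffling before splitting, using KFold's shuffle and random_state parameters (the six-element array is illustrative):

```python
from sklearn.model_selection import KFold
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])

# shuffle=True randomises the order before splitting;
# random_state makes the shuffle reproducible
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_index, test_index in kf.split(data):
    print("Train:", train_index, "Test:", test_index)
```

Every index still appears in exactly one test set; only the assignment of indices to folds is randomised.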
Cross-Validation with a Model
Scikit-learn provides built-in functions to perform cross-validation easily.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
scores = cross_val_score(model, X, y, cv=5)
print(scores)
Each value represents accuracy from one fold. The average score gives a stable performance estimate.
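Building on the example above, the five fold scores can be summarised with their mean and standard deviation (a sketch using NumPy array methods; the mean is the usual headline figure, and the standard deviation indicates how much performance varies across folds):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
scores = cross_val_score(model, X, y, cv=5)

# Summarise the per-fold accuracies into a single estimate
print("Mean accuracy:", scores.mean())
print("Std deviation:", scores.std())
```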
Stratified Cross-Validation
For classification problems, it is important that each fold contains a similar class distribution. Stratified Cross-Validation ensures this balance.
- Used mainly for classification
- Preserves class ratios
- More reliable for imbalanced data
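A minimal sketch of stratified splitting, passing a StratifiedKFold object to cross_val_score (the Iris dataset and logistic regression model are reused from the earlier example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# StratifiedKFold keeps the class ratio of y roughly equal in every fold
skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=skf)
print(scores)
```

Note that when a classifier is evaluated with an integer cv (as in cv=5 above), scikit-learn already applies stratified folds by default, so the explicit StratifiedKFold mainly matters when you want to control its options such as shuffling.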
When to Use Cross-Validation
- When dataset size is small
- When comparing multiple models
- When tuning hyperparameters
- Before deploying a model
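As an illustration of model comparison, the same cross-validation setup can score two candidate models side by side (a sketch; the choice of logistic regression versus a decision tree is just an example pairing):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Evaluate each candidate model with the same 5-fold scheme
for name, model in [("Logistic Regression", LogisticRegression(max_iter=200)),
                    ("Decision Tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Comparing mean fold scores like this is more trustworthy than comparing accuracies from a single train-test split, because each mean is averaged over five different test sets.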
Practice Questions
Practice 1: What technique evaluates a model using multiple data splits?
Practice 2: What are the data partitions in K-Fold called?
Practice 3: Cross-validation helps detect which problem?
Quick Quiz
Quiz 1: The most common cross-validation method is?
Quiz 2: If cv=5, how many folds are used?
Quiz 3: Which cross-validation preserves class distribution?
Coming up next: Hyperparameter Tuning — optimizing model performance.