AI Course
Cross-Validation Methods
When we train a machine learning model, evaluating it on the same data it was trained on gives a misleading sense of performance. Cross-validation is a technique used to test how well a model will perform on unseen data by repeatedly splitting the dataset in different ways.
This lesson explains why cross-validation is necessary, how it works, and the most commonly used cross-validation methods in real-world machine learning projects.
Real-World Connection
Think about judging a student’s knowledge using only one exam. That single test may be too easy or too hard. A better approach is to evaluate the student multiple times with different question sets. Cross-validation works the same way by testing the model multiple times on different data splits.
Why Cross-Validation Is Important
- Provides a more reliable performance estimate
- Reduces dependency on a single train-test split
- Helps detect overfitting
- Improves model selection
Basic Train-Test Split
The simplest validation approach is to split the data once into a training set and a testing set. However, the resulting score depends heavily on which samples land in each set, so a single split may not reflect true performance.
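For comparison, here is a minimal sketch of a single 80/20 split using scikit-learn's train_test_split; the dataset, test size, and random_state are arbitrary choices for illustration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 20% of the data for testing; the split depends on random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# A single accuracy number; a different random_state can give a different result
print(model.score(X_test, y_test))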
K-Fold Cross-Validation
In K-Fold Cross-Validation, the dataset is divided into K roughly equal parts called folds. The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times, so each fold serves as the test set exactly once, and the results are averaged.
- Each data point is used for testing exactly once and for training K-1 times
- Provides stable performance estimates
- Common values of K are 5 or 10
K-Fold Example (Python)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Load the iris dataset as feature matrix X and labels y
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Shuffle the data, then split it into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Train and test 5 times, once per fold; returns one accuracy per fold
scores = cross_val_score(model, X, y, cv=kf)
print(scores)
Understanding the Output
Each value represents model accuracy for one fold. The variation across folds shows how stable the model is. The average of these scores gives a reliable performance estimate.
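Continuing the example above, the mean and standard deviation summarize the five fold scores in one line:
# Mean = overall estimate; standard deviation = stability across folds
print(f"Mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})")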
Stratified K-Fold
Stratified K-Fold ensures that each fold has the same class distribution as the original dataset. This is especially important for imbalanced classification problems.
Stratified K-Fold Example
from sklearn.model_selection import StratifiedKFold

# model, X, and y are reused from the K-Fold example above
# Each fold preserves the class proportions of the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print(scores)
Leave-One-Out Cross-Validation (LOOCV)
Leave-One-Out Cross-Validation is the extreme case of K-Fold where K equals the number of samples: each data point is used once as the test set while all the others are used for training. This gives a nearly unbiased performance estimate but is computationally expensive, since it trains one model per sample.
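As a minimal sketch, scikit-learn's LeaveOneOut splitter can be passed to cross_val_score just like the other splitters; this continues with the model, X, and y defined in the K-Fold example, so on iris it trains 150 models.
from sklearn.model_selection import LeaveOneOut, cross_val_score

# One fold per sample: each iteration tests on a single held-out point
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
# Each fold score is 0 or 1, so the mean is the overall accuracy
print(scores.mean())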
When to Use Each Method
- Train-test split for quick experiments
- K-Fold for most real-world applications
- Stratified K-Fold for imbalanced datasets
- LOOCV for very small datasets
Common Mistakes to Avoid
- Data leakage between folds, for example fitting a scaler on the whole dataset before splitting (see the sketch after this list)
- Using test data during training
- Ignoring class imbalance
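A common source of leakage is fitting a preprocessing step, such as a scaler, on the full dataset before cross-validating. A minimal sketch of the fix, assuming the X, y, and kf from the K-Fold example, wraps preprocessing and model in a scikit-learn Pipeline so the scaler is refit inside each training fold:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit only on each training fold, never on the test fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=kf)
print(scores)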
Practice Questions
Practice 1: What technique evaluates models using multiple data splits?
Practice 2: Which method splits data into K parts?
Practice 3: Which method preserves class distribution?
Quick Quiz
Quiz 1: Cross-validation mainly provides what?
Quiz 2: Stratified K-Fold is best for which type of data?
Quiz 3: LOOCV is considered what?
Coming up next: Introduction to Deep Learning — moving from traditional ML to neural networks.