Cross Validation
When a model performs well on training data but fails on new data, the problem is usually overfitting: the model has memorized the training set instead of learning patterns that generalize. Cross Validation is a technique that estimates how a model will perform on unseen data before we deploy it.
Instead of trusting a single train-test split, cross validation evaluates the model multiple times on different data portions and gives a more reliable performance estimate.
Why Cross Validation Is Needed
A single split can be misleading. If the test set happens to be unusually easy or unusually hard, the performance score will not reflect reality; the sketch after the list below shows this effect. Cross validation, by contrast:
- Provides stable performance estimates
- Reduces overfitting risk
- Uses data efficiently
- Helps compare models fairly
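To see the problem concretely, here is a minimal sketch (using the iris dataset, the same data as the examples later in this lesson): one model, scored on five different random train-test splits, gets a different accuracy each time.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Same model, same data: only the random split changes
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model.fit(X_train, y_train)
    print("Split", seed, "accuracy:", round(model.score(X_test, y_test), 3))
The scores differ even though nothing about the model changed, which is exactly why one split is not enough.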
Real-World Connection
Imagine evaluating a student using only one exam. The result may not represent true ability. Multiple exams across different topics give a fairer evaluation. Cross validation works the same way for models.
What Is K-Fold Cross Validation?
In K-Fold Cross Validation, the dataset is split into K equal-sized (or nearly equal) parts called folds. The model is trained on K-1 folds and tested on the remaining fold, and this process repeats K times.
- Each fold is used once as test data
- Final score is the average of all folds
Simple K-Fold Example
Let’s apply K-Fold Cross Validation using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the iris dataset: 150 samples, 4 features, 3 classes
data = load_iris()
X, y = data.data, data.target

model = LogisticRegression(max_iter=200)

# 5 folds, shuffled once with a fixed seed so the splits are reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Returns one accuracy score per fold
scores = cross_val_score(model, X, y, cv=kf)
print(scores)
Each value represents accuracy from one fold. Notice how scores vary slightly across different splits.
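For intuition, the loop below is a minimal sketch of roughly what cross_val_score does behind the scenes, reusing the X, y, model, and kf defined above: clone the model, fit on K-1 folds, and score on the held-out fold.
from sklearn.base import clone

# Roughly what cross_val_score does internally: train on K-1 folds,
# score on the held-out fold, repeat for each of the K folds
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    fold_model = clone(model)   # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    print("Fold", fold, "accuracy:",
          round(fold_model.score(X[test_idx], y[test_idx]), 3))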
Average Cross Validation Score
print("Average Accuracy:", scores.mean())
The average score gives a more trustworthy estimate of model performance.
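Reporting the spread next to the mean makes the estimate easier to judge, and averaging over the same folds is what lets us compare models fairly. In the sketch below, DecisionTreeClassifier is just an illustrative second candidate, not part of the lesson above.
from sklearn.tree import DecisionTreeClassifier

# Mean plus standard deviation gives a fuller picture than the mean alone
print("Logistic Regression: mean =", round(scores.mean(), 3),
      "std =", round(scores.std(), 3))

# Evaluating a second model on the same folds makes the comparison fair
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=kf)
print("Decision Tree: mean =", round(tree_scores.mean(), 3),
      "std =", round(tree_scores.std(), 3))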
Stratified K-Fold
For classification problems, Stratified K-Fold maintains the same class distribution in each fold. This is especially important for imbalanced datasets.
from sklearn.model_selection import StratifiedKFold

# Like KFold, but every fold keeps the dataset's class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print(scores)
print("Average:", scores.mean())
Stratified K-Fold ensures fair evaluation when classes are uneven.
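A quick way to see the stratification at work is to count the classes in each test fold; iris has 50 samples per class, so with 5 folds each test fold should contain 10 of each.
import numpy as np

# Count each class in every test fold: with stratification the counts
# stay balanced (10 samples of each iris class per fold)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print("Fold", fold, "test-class counts:", np.bincount(y[test_idx]))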
Common Cross Validation Types
- K-Fold: Standard method for general datasets
- Stratified K-Fold: Preserves class balance
- Leave-One-Out: Uses one sample as the test set each time
- Time Series Split: Respects temporal order in sequential data (both are sketched after this list)
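The last two splitters are sketched below, again reusing model, X, and y from earlier; the toy sequence passed to TimeSeriesSplit is hypothetical, just ten ordered values to make the growing training window visible.
from sklearn.model_selection import LeaveOneOut, TimeSeriesSplit
import numpy as np

# Leave-One-Out: one sample held out per fold (150 fits on iris,
# cheap here but expensive on large datasets)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("Leave-One-Out accuracy:", loo_scores.mean())

# Time Series Split on a toy ordered sequence: the training window only
# grows forward, so the model never trains on data from the "future"
toy = np.arange(10)   # stands in for ten time-ordered samples
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(toy):
    print("train:", train_idx, "test:", test_idx)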
When Not to Use Cross Validation
- Very large datasets, where training the model K times becomes computationally expensive (a single well-sized held-out set is often sufficient)
- Time-dependent data with standard shuffled folds, which leak future information into training; use an order-aware splitter such as Time Series Split instead
Practice Questions
Practice 1: What technique evaluates models on multiple data splits?
Practice 2: Which method splits data into K equal parts?
Practice 3: Which cross validation preserves class distribution?
Quick Quiz
Quiz 1: In K-Fold CV, the model is trained on how many folds?
Quiz 2: Stratified K-Fold is mainly used for which datasets?
Quiz 3: What does cross validation primarily improve?
Coming up next: Feature Engineering — turning raw data into powerful signals.