AI Lesson 40 – Cross-Validation Methods

Cross Validation

When a model performs well on training data but fails on new data, the problem is usually poor generalization. Cross Validation is a technique that estimates how a model will perform on unseen data before it is deployed.

Instead of trusting a single train-test split, cross validation evaluates the model multiple times on different data portions and gives a more reliable performance estimate.

Why Cross Validation Is Needed

A single split can be misleading. If the test set happens to be too easy or too hard, the model’s performance score may not reflect reality (the sketch after the list below shows this effect). Cross validation, by contrast:

  • Provides stable performance estimates
  • Reduces overfitting risk
  • Uses data efficiently
  • Helps compare models fairly
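
To see how misleading a single split can be, here is a minimal sketch (using the same Iris dataset and model that appear later in this lesson) that scores one model on three different random splits:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# The score depends on which rows happen to land in the test set
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model.fit(X_tr, y_tr)
    print(f"seed={seed}: accuracy={model.score(X_te, y_te):.3f}")

Depending on the seed, the reported accuracy shifts even though the model and the data never change.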

Real-World Connection

Imagine evaluating a student using only one exam. The result may not represent true ability. Multiple exams across different topics give a fairer evaluation. Cross validation works the same way for models.

What Is K-Fold Cross Validation?

In K-Fold Cross Validation, the dataset is split into K roughly equal parts. The model is trained on K-1 parts and tested on the remaining part. This process repeats K times (see the sketch after the list below).

  • Each fold is used once as test data
  • Final score is the average of all folds
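
A small illustration of how the folds rotate, using a hypothetical ten-sample dataset and no model at all:

import numpy as np
from sklearn.model_selection import KFold

# Ten samples split into five folds of two; each pair is the test set exactly once
X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: test rows {test_idx}")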

Simple K-Fold Example

Let’s apply K-Fold Cross Validation using scikit-learn.


from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset (150 samples, 3 classes)
data = load_iris()
X, y = data.data, data.target

model = LogisticRegression(max_iter=200)

# 5 folds, shuffled for a random but reproducible assignment of rows
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Fit and score the model once per fold
scores = cross_val_score(model, X, y, cv=kf)
print(scores)
  
[0.966 1.000 0.933 0.966 1.000]

Each value represents accuracy from one fold. Notice how scores vary slightly across different splits.

Average Cross Validation Score


print("Average Accuracy:", scores.mean())
  
Average Accuracy: 0.973

The average score gives a more trustworthy estimate of model performance.
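
Alongside the mean, it is common to also report the spread across folds; a quick sketch:

# A large standard deviation across folds hints at an unstable estimate
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")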

Stratified K-Fold

For classification problems, Stratified K-Fold maintains the same class distribution in each fold. This is especially important for imbalanced datasets.


from sklearn.model_selection import StratifiedKFold

# Stratified folds keep the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

print(scores)
print("Average:", scores.mean())
  
[0.966 0.966 1.000 0.933 1.000]
Average: 0.973

Stratified K-Fold ensures fair evaluation when classes are uneven.
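
You can verify the stratification yourself with a short sketch that reuses skf, X, and y from above and counts the classes in each test fold:

import numpy as np

# Each Iris test fold should hold about 10 samples of each of the 3 classes
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold} test class counts:", np.bincount(y[test_idx]))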

Common Cross Validation Types

  • K-Fold: Standard method for general datasets
  • Stratified K-Fold: Preserves class balance
  • Leave-One-Out: Uses one sample as test each time (sketched below)
  • Time Series Split: Used for sequential data (sketched below)
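
The last two types are not shown above, so here is a minimal sketch of both, reusing model, X, and y from earlier (Iris is not sequential data, so the Time Series Split part is purely illustrative):

from sklearn.model_selection import LeaveOneOut, TimeSeriesSplit, cross_val_score

# Leave-One-Out: one fold per sample (150 fits on Iris), thorough but costly
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("Leave-One-Out accuracy:", loo_scores.mean())

# Time Series Split: each training window strictly precedes its test window
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    print(f"train ends at row {train_idx[-1]}, test rows {test_idx[0]}-{test_idx[-1]}")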

When Not to Use Cross Validation

  • Very large datasets, where repeated training is computationally expensive (a cheaper alternative is sketched below)
  • Time-dependent data, where standard K-Fold would leak future information into training; use Time Series Split instead
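
When a full K-Fold pass is too expensive, one common compromise (a general practice, not something this lesson prescribes) is a few random splits with ShuffleSplit; a sketch:

from sklearn.model_selection import ShuffleSplit, cross_val_score

# Three random 80/20 splits: cheaper than 5- or 10-fold, still averages out luck
cheap_cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
print("ShuffleSplit accuracy:", cross_val_score(model, X, y, cv=cheap_cv).mean())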

Practice Questions

Practice 1: What technique evaluates models on multiple data splits?



Practice 2: Which method splits data into K equal parts?



Practice 3: Which cross validation preserves class distribution?



Quick Quiz

Quiz 1: In K-Fold CV, the model is trained on how many folds?





Quiz 2: Stratified K-Fold is mainly used for which datasets?





Quiz 3: What does cross validation primarily improve?





Coming up next: Feature Engineering — turning raw data into powerful signals.