AI Lesson 40 – Cross-Validation Methods

Cross Validation

When a model performs well on training data but fails on new data, the problem is usually poor generalization. Cross Validation is a technique that estimates how a model will perform on unseen data before it is deployed.

Instead of trusting a single train-test split, cross validation evaluates the model multiple times on different data portions and gives a more reliable performance estimate.

Why Cross Validation Is Needed

A single split can be misleading. If the test set happens to be too easy or too hard, the model’s performance score may not reflect reality (the sketch after the list below shows this effect). Cross validation, by contrast:

  • Provides stable performance estimates
  • Reduces overfitting risk
  • Uses data efficiently
  • Helps compare models fairly
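
To see how misleading a single split can be, here is a minimal sketch (using the same Iris dataset and model that appear later in this lesson) that scores one model on three different random splits:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# The score depends on which rows happen to land in the test set
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model.fit(X_tr, y_tr)
    print(f"seed={seed}: accuracy={model.score(X_te, y_te):.3f}")

Depending on the seed, the reported accuracy shifts even though the model and the data never change.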

Real-World Connection

Imagine evaluating a student using only one exam. The result may not represent true ability. Multiple exams across different topics give a fairer evaluation. Cross validation works the same way for models.

What Is K-Fold Cross Validation?

In K-Fold Cross Validation, the dataset is split into K roughly equal parts. The model is trained on K-1 parts and tested on the remaining part. This process repeats K times (see the sketch after the list below).

  • Each fold is used once as test data
  • Final score is the average of all folds
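
A small illustration of how the folds rotate, using a hypothetical ten-sample dataset and no model at all:

import numpy as np
from sklearn.model_selection import KFold

# Ten samples split into five folds of two; each pair is the test set exactly once
X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: test rows {test_idx}")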

Simple K-Fold Example

Let’s apply K-Fold Cross Validation using scikit-learn.


from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset (150 samples, 3 classes)
data = load_iris()
X, y = data.data, data.target

model = LogisticRegression(max_iter=200)

# 5 folds, shuffled for a random but reproducible assignment of rows
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Fit and score the model once per fold
scores = cross_val_score(model, X, y, cv=kf)
print(scores)
  
[0.966 1.000 0.933 0.966 1.000]

Each value represents accuracy from one fold. Notice how scores vary slightly across different splits.

Average Cross Validation Score


print("Average Accuracy:", scores.mean())
  
Average Accuracy: 0.973

The average score gives a more trustworthy estimate of model performance.
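
Alongside the mean, it is common to also report the spread across folds; a quick sketch:

# A large standard deviation across folds hints at an unstable estimate
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")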

Stratified K-Fold

For classification problems, Stratified K-Fold maintains the same class distribution in each fold. This is especially important for imbalanced datasets.


from sklearn.model_selection import StratifiedKFold

# Stratified folds keep the same class proportions as the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

print(scores)
print("Average:", scores.mean())
  
[0.966 0.966 1.000 0.933 1.000]
Average: 0.973

Stratified K-Fold ensures fair evaluation when classes are uneven.
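
You can verify the stratification yourself with a short sketch that reuses skf, X, and y from above and counts the classes in each test fold:

import numpy as np

# Each Iris test fold should hold about 10 samples of each of the 3 classes
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold} test class counts:", np.bincount(y[test_idx]))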

Common Cross Validation Types

  • K-Fold: Standard method for general datasets
  • Stratified K-Fold: Preserves class balance
  • Leave-One-Out: Uses one sample as test each time (sketched below)
  • Time Series Split: Used for sequential data (sketched below)
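
The last two types are not shown above, so here is a minimal sketch of both, reusing model, X, and y from earlier (Iris is not sequential data, so the Time Series Split part is purely illustrative):

from sklearn.model_selection import LeaveOneOut, TimeSeriesSplit, cross_val_score

# Leave-One-Out: one fold per sample (150 fits on Iris), thorough but costly
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("Leave-One-Out accuracy:", loo_scores.mean())

# Time Series Split: each training window strictly precedes its test window
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    print(f"train ends at row {train_idx[-1]}, test rows {test_idx[0]}-{test_idx[-1]}")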

When Not to Use Cross Validation

  • Very large datasets, where repeated training is computationally expensive (a cheaper alternative is sketched below)
  • Time-dependent data, where standard K-Fold would leak future information into training; use Time Series Split instead
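
When a full K-Fold pass is too expensive, one common compromise (a general practice, not something this lesson prescribes) is a few random splits with ShuffleSplit; a sketch:

from sklearn.model_selection import ShuffleSplit, cross_val_score

# Three random 80/20 splits: cheaper than 5- or 10-fold, still averages out luck
cheap_cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
print("ShuffleSplit accuracy:", cross_val_score(model, X, y, cv=cheap_cv).mean())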

Practice Questions

Practice 1: What technique evaluates models on multiple data splits?



Practice 2: Which method splits data into K equal parts?



Practice 3: Which cross validation preserves class distribution?



Quick Quiz

Quiz 1: In K-Fold CV, the model is trained on how many folds?





Quiz 2: Stratified K-Fold is mainly used for which datasets?





Quiz 3: What does cross validation primarily improve?





Coming up next: Feature Engineering — turning raw data into powerful signals.