Cross-Validation
In the previous lesson, we learned how train/test split helps us evaluate a model using unseen data.
Now we move one step further and answer an important question: what if our train/test split itself is biased?
To solve this, we use a powerful technique called Cross-Validation.
Why Train/Test Split Is Not Always Enough
When we split the data only once, the measured performance depends heavily on which rows happen to land in the test set.
If the test data is unusually easy or unusually hard, our evaluation may be misleading.
This means a single split may not represent real-world performance accurately.
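To see this effect, here is a minimal sketch (assuming the same dataplexa_ml_housing_customer_dataset.csv file and purchase_decision target used later in this lesson, with numeric feature columns) that evaluates the same model on three different random splits; the accuracy typically changes from split to split.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
X = df.drop("purchase_decision", axis=1)
y = df["purchase_decision"]

# Evaluate the same model on three different random splits.
for seed in (0, 1, 2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(f"random_state={seed}: test accuracy = {model.score(X_test, y_test):.3f}")

If the three printed accuracies differ noticeably, the single-split estimate is clearly sensitive to how the data was divided.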
What Is Cross-Validation?
Cross-validation is a technique where we evaluate a model multiple times on different subsets of the data.
Instead of training once and testing once, we train and test the model several times.
The final performance is calculated as the average of all evaluations.
k-Fold Cross-Validation (Core Concept)
The most common form of cross-validation is k-fold cross-validation.
Here’s how it works conceptually:
- The dataset is divided into k equal parts, called folds.
- One fold is held out as the test set.
- The remaining k - 1 folds are used as training data.
- The process repeats until each fold has been used as the test set exactly once.
This ensures every data point gets a chance to be tested.
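To make the rotation of folds concrete, here is a small sketch using scikit-learn's KFold on a toy array of 10 row indices (the array is purely illustrative); it prints which rows act as the test fold in each of the 5 iterations.

import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # a toy "dataset" of 10 row indices
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_idx, test_idx) in enumerate(kf.split(rows), start=1):
    print(f"Fold {i}: train rows {train_idx}, test rows {test_idx}")

Each row index appears in exactly one test fold, which is what guarantees that every data point gets tested once.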
Real-World Analogy
Imagine evaluating a student by giving them multiple small exams instead of one final exam.
Each exam tests a different portion of the syllabus.
The final grade is the average of all exams, making the evaluation fair and reliable.
Using Our Dataset
We continue using the same dataset:
Dataplexa ML Housing & Customer Dataset
Cross-validation helps confirm that our model performs well across different segments of customers, not just one particular group.
Implementing Cross-Validation in Python
Scikit-learn provides built-in support for cross-validation.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the dataset and separate the features from the target column.
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
X = df.drop("purchase_decision", axis=1)
y = df["purchase_decision"]

# Logistic regression classifier; max_iter is raised so the solver can converge.
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: the model is trained and scored 5 times,
# each time using a different fold as the test set.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
Each value in scores represents performance on one fold.
The average score gives a more reliable estimate of model quality.
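A common way to report this is the mean together with the spread of the fold scores; as a small follow-up to the code above (reusing the scores array):

# The mean summarises overall quality; the standard deviation shows
# how much performance varies from fold to fold.
print("Fold scores:", scores)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")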
Why Cross-Validation Reduces Overfitting Risk
Because the model is evaluated on several different unseen subsets, a good score is much less likely to be the result of one lucky split.
If a model performs consistently across folds, it is more likely to generalize well.
Choosing the Right Value of k
Common choices include:
- 5-fold cross-validation
- 10-fold cross-validation

A smaller k is faster because the model is trained fewer times; a larger k gives each training run more data and usually a more stable performance estimate, at a higher computational cost.
In practice, 5 or 10 folds work well for most problems.
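If you want to see the trade-off on your own data, a quick and purely illustrative sketch is to rerun cross_val_score for a few values of k and compare the mean score and runtime; this reuses the model, X and y defined above.

import time
from sklearn.model_selection import cross_val_score

for k in (3, 5, 10):
    start = time.perf_counter()
    scores_k = cross_val_score(model, X, y, cv=k)
    elapsed = time.perf_counter() - start
    print(f"k={k}: mean accuracy = {scores_k.mean():.3f}, time = {elapsed:.2f}s")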
Mini Practice
Think about our dataset.
Ask yourself:
- Would one train/test split represent all customers fairly?
- How does cross-validation improve trust in evaluation?
Exercises
Exercise 1:
Why do we use cross-validation instead of a single split?
Exercise 2:
What does k represent in k-fold cross-validation?
Exercise 3:
Why is average score used in cross-validation?
Quick Quiz
Q1. Does cross-validation replace train/test split?
Q2. Is cross-validation computationally expensive?
In the next lesson, we will understand the Bias–Variance Tradeoff, which explains why balancing model complexity is critical.