ML Lesson 13 – Cross-Validation | Dataplexa

Cross-Validation

In the previous lesson, we learned how train/test split helps us evaluate a model using unseen data.

Now we move one step further and answer an important question: what if our train/test split itself is biased?

To solve this, we use a powerful technique called Cross-Validation.


Why Train/Test Split Is Not Always Enough

When we split data once, the model’s performance depends heavily on how the data was split.

If the test data is unusually easy or unusually hard, our evaluation may be misleading.

This means a single split may not represent real-world performance accurately.


What Is Cross-Validation?

Cross-validation is a technique where we evaluate a model multiple times on different subsets of the data.

Instead of training once and testing once, we train and test the model several times.

The final performance is calculated as the average of all evaluations.
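For example, suppose a model is evaluated five times and the five scores are 0.82, 0.78, 0.85, 0.80 and 0.81 (made-up numbers, purely for illustration). The reported performance would be their average, as in this small sketch:

# Hypothetical scores from five separate evaluations
fold_scores = [0.82, 0.78, 0.85, 0.80, 0.81]

# The reported performance is simply their average
average_score = sum(fold_scores) / len(fold_scores)
print(average_score)  # ≈ 0.812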


k-Fold Cross-Validation (Core Concept)

The most common form of cross-validation is k-fold cross-validation.

Here’s how it works conceptually:

1. The dataset is divided into k equal parts (called folds).
2. One fold is held out as the test set; the remaining k − 1 folds are used as training data.
3. The model is trained on the training folds and evaluated on the held-out fold.
4. The process repeats until each fold has been used as the test set once.

This ensures every data point gets a chance to be tested.
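To make the mechanics concrete, here is a minimal sketch using scikit-learn's KFold on a tiny made-up dataset of 10 rows (the data is purely illustrative). It prints which rows act as training data and which act as test data on each round:

import numpy as np
from sklearn.model_selection import KFold

# Tiny illustrative dataset: 10 samples, one feature each
X = np.arange(10).reshape(-1, 1)

# Divide the 10 rows into k = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # On each round one fold is held out for testing,
    # and the remaining folds are used for training
    print(f"Fold {fold}: train rows {train_idx}, test rows {test_idx}")

Notice that every row appears in exactly one test fold across the five rounds.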


Real-World Analogy

Imagine evaluating a student by giving them multiple small exams instead of one final exam.

Each exam tests a different portion of the syllabus.

The final grade is the average of all exams, making the evaluation fair and reliable.


Using Our Dataset

We continue using the same dataset:

Dataplexa ML Housing & Customer Dataset

Cross-validation ensures that our model performs well across different segments of customers, not just a specific group.


Implementing Cross-Validation in Python

Scikit-learn provides built-in support for cross-validation.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the Dataplexa ML Housing & Customer Dataset
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Features (all columns except the target) and the target column
X = df.drop("purchase_decision", axis=1)
y = df["purchase_decision"]

model = LogisticRegression(max_iter=1000)

# Train and evaluate the model on 5 different folds
scores = cross_val_score(model, X, y, cv=5)

print("Per-fold scores:", scores)
print("Average score:  ", scores.mean())

Each value in scores represents performance on one fold.

The average score gives a more reliable estimate of model quality.


Why Cross-Validation Reduces Overfitting Risk

Because the model is evaluated on several different unseen subsets, it is much less likely to score well simply by chance.

If a model performs consistently across folds, it is more likely to generalize well.
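One simple way to judge that consistency is to look at the spread of the fold scores, not just their mean. A minimal sketch, reusing the scores array returned by cross_val_score above (the 0.05 threshold is just an illustrative choice, not a fixed rule):

# scores comes from the cross_val_score call above
print("Per-fold scores:", scores)
print("Mean:           ", scores.mean())
print("Std deviation:  ", scores.std())

# A small spread across folds suggests the model behaves
# consistently on different subsets of the data
if scores.std() < 0.05:
    print("Scores are fairly consistent across folds")
else:
    print("Scores vary noticeably between folds")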


Choosing the Right Value of k

Common choices include:

5-fold cross-validation
10-fold cross-validation

A smaller k means fewer training runs, so computation is faster. A larger k means each model trains on more of the data, which usually gives a more reliable evaluation but takes longer.

In practice, 5 or 10 folds work well for most problems.
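If you want to see the trade-off on our own model, here is a minimal sketch that reuses model, X and y from the code above and compares the two common choices (the exact numbers will depend on your copy of the dataset):

from sklearn.model_selection import cross_val_score

# Compare 5-fold and 10-fold cross-validation on the same model
for k in (5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"{k}-fold: mean = {scores.mean():.3f}, std = {scores.std():.3f}")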


Mini Practice

Think about our dataset.

Ask yourself:

Would one train/test split represent all customers fairly?
How does cross-validation improve trust in evaluation?


Exercises

Exercise 1:
Why do we use cross-validation instead of a single split?

Because a single split may be biased, while cross-validation evaluates performance across multiple data subsets.

Exercise 2:
What does k represent in k-fold cross-validation?

It represents the number of folds the dataset is divided into.

Exercise 3:
Why is average score used in cross-validation?

It provides a more stable and reliable estimate of model performance.

Quick Quiz

Q1. Does cross-validation replace train/test split?

No. A held-out test set is still valuable; cross-validation is usually applied to the training data to get a more reliable estimate before the final test. The sketch after this quiz shows how the two fit together.

Q2. Is cross-validation computationally expensive?

Yes. The model is trained k times instead of once, so evaluation takes roughly k times longer.
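To illustrate the answer to Q1, here is a minimal sketch of how the two techniques are commonly combined, reusing X and y from the earlier code: cross-validation runs on the training portion to evaluate the model, and the held-out test set is touched only once at the end.

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Hold out a final test set first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)

# Cross-validation on the training portion only
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV mean on training data:", cv_scores.mean())

# Final, one-time evaluation on the untouched test set
model.fit(X_train, y_train)
print("Test set accuracy:", model.score(X_test, y_test))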

In the next lesson, we will understand the Bias–Variance Tradeoff, which explains why balancing model complexity is critical.