
Train / Test Split

In the previous lesson, we saw why models fail due to overfitting and underfitting. Now we learn the first practical technique in machine learning for detecting and controlling this problem.

That technique is called Train / Test Split.


Why Do We Need Train / Test Split?

When a machine learning model is trained, it sees data and learns patterns from it. But if we test the model on the same data it has already seen, we get a misleadingly high performance score.

That would be like asking a student questions they have already memorized.

To truly evaluate learning, we must test the model on unseen data.


Training Data vs Testing Data

Training data is the portion of the dataset used to teach the model.

Testing data is the portion of the dataset kept aside and used only for evaluation.

The model never sees the testing data during training.


Real-World Analogy

Imagine preparing for an exam.

You study from textbooks and practice questions. That is your training phase.

The final exam contains new questions. That is the testing phase.

If you perform well on the exam, it means you truly learned the material rather than memorizing it.


Using Our Dataset

We continue using the same dataset introduced earlier:

Dataplexa ML Housing & Customer Dataset

This dataset will now be divided into two parts:

One part for training the model
One part for testing the model


Common Split Ratios

There is no single perfect split, but some ratios are commonly used in practice.

80% training / 20% testing
70% training / 30% testing

The idea is to give the model enough data to learn, while keeping sufficient unseen data for evaluation.
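
As a quick sketch of how the ratio is expressed in code, scikit-learn's train_test_split takes a test_size argument. The toy list below is ours, not part of the dataset:

from sklearn.model_selection import train_test_split

data = list(range(10))  # a toy dataset of 10 samples

# test_size=0.2 keeps 8 samples for training and 2 for testing;
# test_size=0.3 would instead keep 7 and 3
train, test = train_test_split(data, test_size=0.2, random_state=0)
print(len(train), len(test))  # 8 2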


Performing Train / Test Split in Python

We now split our dataset using scikit-learn.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset introduced in the earlier lessons
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Features (X) are every column except the target;
# the target (y) is the purchase_decision column
X = df.drop("purchase_decision", axis=1)
y = df["purchase_decision"]

# Hold out 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape

Here:

80% of the data is used for training
20% is reserved for testing
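
If you want to verify the ratio yourself, a quick check like the following works (the values may differ slightly from 0.8 and 0.2 when the row count is not divisible by five):

# Sanity check: the proportions should match the requested split
print(len(X_train) / len(df))  # approximately 0.8
print(len(X_test) / len(df))   # approximately 0.2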


Why Random State Matters

The random_state parameter ensures reproducibility.

Without it, the split changes every time the code runs.

With it, results remain consistent, which is important for debugging and comparison.
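
A minimal sketch on a toy list makes this concrete:

from sklearn.model_selection import train_test_split

data = list(range(10))

# Same random_state -> identical splits on every run
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)
print(a_test == b_test)  # True

# No random_state -> the shuffle, and therefore the split, can change each run
c_train, c_test = train_test_split(data, test_size=0.2)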


How Train / Test Split Prevents Overfitting

If a model performs very well on training data but poorly on test data, we know it has overfitted.

If it performs poorly on both, it is underfitting.

Train / test split gives us this insight early.
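
As an illustration, a deliberately flexible model such as a deep decision tree often shows exactly this gap. The model choice here is ours for demonstration only, and the sketch assumes the feature columns are numeric:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# An unconstrained tree can memorize the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# A large gap (e.g. near-perfect train accuracy, much lower test accuracy)
# signals overfitting; low accuracy on both signals underfitting.
print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")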


Mini Practice

Think about our dataset.

Ask yourself:

What happens if we test on data the model has already seen?
What happens if the test set is too small?
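
Both questions can be explored empirically. The sketch below uses a synthetic dataset from make_classification purely as a stand-in, so the exact numbers will vary:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=0)

# Question 1: scoring on data the model has already seen is inflated
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(model.score(X_tr, y_tr))  # typically 1.0 -- the tree memorized this data
print(model.score(X_te, y_te))  # a more honest estimate

# Question 2: a tiny test set gives an unstable estimate
for seed in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.02, random_state=seed  # only 4 test samples
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print(model.score(X_te, y_te))  # scores jump around from split to split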


Exercises

Exercise 1:
Why can’t we evaluate a model using training data?

Because the model has already seen the training data, leading to misleading performance.

Exercise 2:
What is the purpose of test data?

To evaluate how well the model performs on unseen data.

Exercise 3:
Why is random_state used?

To ensure reproducible and consistent data splits.

Quick Quiz

Q1. Does the model see test data during training?

No. Test data is kept completely separate.

Q2. Is train/test split the final solution to overfitting?

No. It is the first step; advanced techniques like cross-validation are also used.

In the next lesson, we will move one step deeper and learn Cross-Validation, which improves model evaluation even further.