Train / Test Split
In the previous lesson, we saw why models fail due to overfitting and underfitting. Now we learn the first practical technique in machine learning for detecting and controlling these problems.
That technique is called Train / Test Split.
Why Do We Need Train / Test Split?
When a machine learning model is trained, it sees data and learns patterns from it. But if we test the model on the same data it has already seen, we get a misleadingly high performance estimate.
That would be like asking a student questions they have already memorized.
To truly evaluate learning, we must test the model on unseen data.
Training Data vs Testing Data
Training data is the portion of the dataset used to teach the model.
Testing data is the portion of the dataset kept aside and used only for evaluation.
The model never sees the testing data during training.
Real-World Analogy
Imagine preparing for an exam.
You study from textbooks and practice questions. That is your training phase.
The final exam contains new questions. That is the testing phase.
If you perform well in the exam, it means you truly learned, not memorized.
Using Our Dataset
We continue using the same dataset introduced earlier:
Dataplexa ML Housing & Customer Dataset
This dataset will now be divided into two parts:
One part for training the model
One part for testing the model
Common Split Ratios
There is no single perfect split, but some ratios are commonly used in practice.
80% training / 20% testing
70% training / 30% testing
The idea is to give the model enough data to learn, while keeping sufficient unseen data for evaluation.
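To make the arithmetic concrete, here is a minimal sketch of how an 80/20 split divides a dataset. The row count of 1,000 is a hypothetical number chosen for illustration, not the size of our actual dataset:

```python
# Hypothetical example: how an 80/20 split divides 1,000 rows
n_rows = 1000          # assumed dataset size, for illustration only
test_fraction = 0.2    # 20% reserved for testing

n_test = int(n_rows * test_fraction)   # 200 rows kept aside for evaluation
n_train = n_rows - n_test              # 800 rows used for learning

print(n_train, n_test)  # → 800 200
```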
Performing Train / Test Split in Python
We now split our dataset using scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset introduced earlier
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Separate the features (X) from the target column (y)
X = df.drop("purchase_decision", axis=1)
y = df["purchase_decision"]

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Check the sizes of the two parts
X_train.shape, X_test.shape
Here:
80% of the data is used for training
20% is reserved for testing
Why Random State Matters
The random_state ensures reproducibility.
Without it, the split changes every time the code runs.
With it, results remain consistent, which is important for debugging and comparison.
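A quick way to see this in action. This is a minimal sketch using a small toy array rather than our dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # toy data: [0, 1, ..., 9]

# Same random_state → exactly the same split every time the code runs
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)

print(np.array_equal(a_train, b_train))  # → True
```

Without `random_state`, the two calls above could shuffle the rows differently, and the printed result would not be guaranteed.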
How Train / Test Split Prevents Overfitting
If a model performs very well on training data but poorly on test data, we know it has overfitted.
If it performs poorly on both, it is underfitting.
Train / test split gives us this insight early.
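To illustrate how the gap between training and test performance reveals overfitting, here is a sketch using a synthetic dataset (scikit-learn's `make_classification` stands in for our real data) and an unconstrained decision tree, which tends to memorize its training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the lesson itself uses the Dataplexa dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A tree with no depth limit can memorize the training data
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # typically near 1.0 (memorization)
test_acc = model.score(X_test, y_test)     # usually noticeably lower

print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

A large gap between the two scores is the signature of overfitting; similar (but low) scores on both would suggest underfitting.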
Mini Practice
Think about our dataset.
Ask yourself:
What happens if we test on data the model has already seen?
What happens if the test set is too small?
Exercises
Exercise 1:
Why can’t we evaluate a model using training data?
Exercise 2:
What is the purpose of test data?
Exercise 3:
Why is random_state used?
Quick Quiz
Q1. Does the model see test data during training?
Q2. Is train/test split the final solution to overfitting?
In the next lesson, we will move one step deeper and learn Cross-Validation, which improves model evaluation even further.