Train/Test/Validation
Master data splitting techniques to build ML models that actually work in production and avoid the dreaded overfitting trap.
Why Data Splitting Matters
Your ML model scores 98% accuracy. Champagne time? Not so fast. Deploy it and watch accuracy plummet to 60%. Sound familiar? That's overfitting — your model memorized the training data instead of learning patterns.
Think of it like studying for an exam. If you memorize answers to practice questions, you'll ace them. But the real exam has different questions testing the same concepts. Data splitting creates that "real exam" scenario for your model.
Common Mistake: Testing on Training Data
Never evaluate your model on the same data you trained it on. It's like grading your own exam — of course you'll get 100%. Always hold out fresh data for testing.
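To see why this matters before we touch the real data, here's a minimal sketch on a synthetic dataset (the exact numbers will vary run to run): an unconstrained decision tree can memorize its training rows and score near-perfectly on them, while a held-out split tells the honest story.

# Minimal sketch on synthetic data: training-set accuracy vs held-out accuracy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # deep tree, free to memorize
print("Accuracy on its own training data:", accuracy_score(y_tr, tree.predict(X_tr)))  # close to 1.0
print("Accuracy on held-out data:", accuracy_score(y_ho, tree.predict(X_ho)))          # noticeably lower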
Training Set (60-70%): the model learns patterns here
Validation Set (15-20%): tune hyperparameters here
Test Set (15-20%): final evaluation only
Never touch the test set until the final model is ready
Basic Train-Test Split
The scenario: You're a data scientist at Flipkart analyzing customer purchase patterns. Your manager wants a model to predict which customers will make repeat purchases next month.
# Import essential libraries for data manipulation and model building
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the e-commerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)Dataset shape: (50000, 11)
What just happened?
We loaded 50,000 customer transactions with 11 columns. The train_test_split function will help us split this data properly. Try this: Check df.columns to see all available features.
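If you want to peek at the raw table first, a few standard pandas inspection lines go a long way (nothing here is specific to splitting):

# Quick inspection of the loaded DataFrame
print(df.columns.tolist())   # all 11 column names
print(df.head())             # first five rows
print(df.dtypes)             # data types, useful before choosing features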
The scenario continues: You need to create features and a target variable for predicting repeat purchases. High ratings (4+ stars) indicate satisfied customers who are likely to return.
# Create features from customer data
X = df[['customer_age', 'quantity', 'unit_price', 'revenue']]
# Create target: 1 if customer gave high rating (satisfied), 0 otherwise
y = (df['rating'] >= 4.0).astype(int)
# Check the class distribution
print("High rating customers:", y.sum())
print("Low rating customers:", (y == 0).sum())High rating customers: 32150 Low rating customers: 17850
What just happened?
We have 32,150 satisfied customers (64%) vs 17,850 unsatisfied (36%). This is slightly imbalanced but workable. The astype(int) converts True/False to 1/0. Try this: Print X.head() to see your features.
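Following the "Try this" suggestion, here is a quick look at the features and the class balance before splitting (the proportions in the comment simply restate the counts above):

# Inspect the feature matrix and the class balance
print(X.head())                          # first five rows of the four features
print(y.value_counts(normalize=True))    # roughly 0.64 vs 0.36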
Now for the crucial part — splitting your data. The test_size=0.2 parameter means 20% goes to testing, 80% to training. The random_state ensures reproducible results.
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # For reproducible results
stratify=y # Maintain class distribution in both sets
)
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")Training samples: 40000 Testing samples: 10000
What just happened?
Perfect split: 40,000 training samples and 10,000 test samples. The stratify=y ensures both sets have the same 64%-36% class ratio. Try this: Check y_train.mean() and y_test.mean() — they should be nearly identical.
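Following the "Try this" above, here is the stratification check in code. Because we passed stratify=y, the positive rate should be almost identical in both splits (roughly 0.643 given the counts above):

# Verify that stratify=y preserved the class ratio in both splits
print("Positive rate in y_train:", round(y_train.mean(), 3))
print("Positive rate in y_test: ", round(y_test.mean(), 3))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))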
80-20 split ensures enough data for training while preserving sufficient samples for reliable testing
This visualization shows our 80-20 split strategy. The training set gets the lion's share because ML models are data-hungry. But that 20% test set? That's your truth serum. It reveals how your model performs on genuinely unseen data. The key insight here: never touch your test set until final evaluation. It's like keeping the answer key sealed until exam day. Once you peek, you can't un-see those results, and your model evaluation becomes biased.

Training and Initial Evaluation
Time to train your model. We'll use logistic regression because it's interpretable and fast. But here's where most data scientists make their first mistake — they evaluate on the test set immediately.
# Train the logistic regression model on training data only
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Make predictions on training data to check for obvious issues
train_predictions = model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Training accuracy: {train_accuracy:.3f}")
print("Model coefficients:", model.coef_[0])Training accuracy: 0.742 Model coefficients: [-0.0023 0.1847 -0.0000 0.0000]
What just happened?
Our model achieved 74.2% accuracy on the training data. The coefficients show quantity has the strongest positive effect (0.1847), while customer_age has a slight negative effect. Try this: Use model.predict_proba(X_train[:5]) to see probability predictions.
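Following the "Try this" suggestion, here is what the probability output looks like. predict_proba returns one column per class, in the order given by model.classes_:

# Probability estimates for the first five training rows
proba = model.predict_proba(X_train[:5])
print(model.classes_)   # e.g. [0 1] -- column order of the probabilities
print(proba)            # each row sums to 1; the second column is P(high rating)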
A training accuracy of 74.2% looks reasonable: not so high that it suggests overfitting, nor so low that it suggests underfitting. But this is just training performance. The real test comes with unseen data.
Three-Way Split: Train/Validation/Test
Here's where professionals separate from beginners. A validation set acts as a middle ground — not for training, but for tuning hyperparameters and model selection. Your test set remains completely untouched.
# First split: separate out test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: divide remaining 80% into train (60%) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(f"Train: {len(X_train)} samples")
print(f"Validation: {len(X_val)} samples")
print(f"Test: {len(X_test)} samples")Train: 30000 samples Validation: 10000 samples Test: 10000 samples
What just happened?
Smart splitting! We now have 60% train, 20% validation, 20% test. The trick: test_size=0.25 on the remaining 80% gives us 20% of the original data. Try this: Verify with 30000 + 10000 + 10000 = 50000.
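A quick sanity check on the arithmetic, restating the numbers above in code:

# Verify the three-way split covers the whole dataset and hits the intended ratios
print(len(X_train) + len(X_val) + len(X_test) == len(X))   # True: 30000 + 10000 + 10000 = 50000
print(len(X_val) / len(X))    # 0.2 of the original data
print(0.25 * 0.80)            # 0.2 -- why test_size=0.25 on the remaining 80% works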
60-20-20 split provides adequate training data while maintaining separate validation and test sets
This three-way split follows the Goldilocks principle — not too much in any single bucket. Your training set has enough samples to learn patterns. The validation set is large enough for reliable hyperparameter tuning. And your test set remains pristine for final evaluation. Why 60-20-20 specifically? Experience shows this ratio works well for most datasets above 10,000 samples. Smaller datasets might need 70-15-15. Massive datasets (millions of samples) can afford 80-10-10. But 60-20-20 is your safe default.
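If you want those alternative ratios without redoing the rescaling arithmetic each time, here is a small helper sketch. It is not part of scikit-learn; the function name and defaults are our own, and it simply chains two stratified train_test_split calls:

# Helper sketch: one call for any train/val/test ratio (assumes X, y and train_test_split from above)
def three_way_split(X, y, train_frac=0.6, val_frac=0.2, test_frac=0.2, seed=42):
    assert abs(train_frac + val_frac + test_frac - 1.0) < 1e-9, "fractions must sum to 1"
    # First cut: carve off the test set
    X_rest, X_te, y_rest, y_te = train_test_split(
        X, y, test_size=test_frac, random_state=seed, stratify=y
    )
    # Second cut: split the remainder; val_frac must be rescaled relative to what is left
    rel_val = val_frac / (train_frac + val_frac)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_rest, y_rest, test_size=rel_val, random_state=seed, stratify=y_rest
    )
    return X_tr, X_va, X_te, y_tr, y_va, y_te

# Example: the 70-15-15 ratio suggested for smaller datasets
# X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X, y, 0.70, 0.15, 0.15)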
Model Selection with Validation Set

The scenario escalates: Your Flipkart manager wants to compare different algorithms. This is where validation sets shine — you can test multiple models without contaminating your test set.
# Train multiple models and evaluate on validation set
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Initialize different models
models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': SVC(random_state=42)
}

# Train and evaluate each model
validation_scores = {}
for name, model in models.items():
# Train on training set only
model.fit(X_train, y_train)
# Evaluate on validation set (not test!)
val_predictions = model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_predictions)
validation_scores[name] = val_accuracy
print(f"{name}: {val_accuracy:.3f}")Logistic Regression: 0.739 Random Forest: 0.756 SVM: 0.744
What just happened?
Random Forest wins with 75.6% validation accuracy, beating Logistic Regression (73.9%) and SVM (74.4%). Importantly, we used validation data for this comparison — our test set remains untouched. Try this: Check max(validation_scores, key=validation_scores.get) to find the best model programmatically.
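Following the "Try this" above, selecting the winner programmatically keeps the comparison reproducible when you rerun it with new hyperparameters:

# Pick the best model name from the validation scores dictionary
best_name = max(validation_scores, key=validation_scores.get)
print(best_name, round(validation_scores[best_name], 3))   # expected: Random Forest 0.756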
Random Forest achieves highest validation accuracy at 75.6%, making it our candidate for final testing
Random Forest emerges as the winner, but notice the differences are small — just 1.7 percentage points between best and worst. This suggests our features capture the underlying patterns reasonably well across different algorithms. The ensemble nature of Random Forest gives it a slight edge by combining multiple decision trees. Here's the crucial point: you can run this comparison dozens of times, trying different hyperparameters, feature combinations, even different algorithms. Your validation set can handle it. But your test set? That stays locked away until you've made your final model choice.

📊 Data Insight
Random Forest's 75.6% accuracy on unseen validation data suggests it generalizes well beyond training. The narrow 1.7 percentage point spread between models suggests the features carry similar signal regardless of algorithm.
Final Test Set Evaluation
The moment of truth. You've selected Random Forest based on validation performance. Now — and only now — do you evaluate on the test set. This gives you an unbiased estimate of real-world performance.
# Select the best model (Random Forest) and evaluate on test set
best_model = RandomForestClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)
# Final evaluation on test set - first time we're using it!
test_predictions = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Final test accuracy: {test_accuracy:.3f}")
print(f"Validation accuracy was: {validation_scores['Random Forest']:.3f}")
print(f"Difference: {abs(test_accuracy - validation_scores['Random Forest']):.3f}")Final test accuracy: 0.751 Validation accuracy was: 0.756 Difference: 0.005
What just happened?
Excellent results! Test accuracy (75.1%) closely matches validation accuracy (75.6%), a difference of only 0.5 percentage points. This small gap indicates our model generalizes well and we avoided overfitting. Try this: Run best_model.feature_importances_ to see which features matter most.
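Following the "Try this" suggestion, feature importances are easiest to read when paired with the column names (this reuses the pandas import from the first code block):

# Feature importances of the final Random Forest, highest first
importances = pd.Series(best_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))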
That 0.5 percentage point difference between validation and test performance is exactly what you want to see. It means your validation set was a good proxy for test performance. If the gap were much larger, say 5 percentage points or more, you'd suspect overfitting to the validation set.
| Dataset | Purpose | Usage Frequency | Our Accuracy |
|---|---|---|---|
| Training (60%) | Model Learning | Every Iteration | 74.2% |
| Validation (20%) | Model Selection | Multiple Times | 75.6% |
| Test (20%) | Final Evaluation | Once Only | 75.1% |
Critical Rule: One Test Set Evaluation Only
Once you evaluate on the test set, that's it. Don't go back and tune hyperparameters based on test results. If you must iterate further, create a new test set from fresh data.
Quiz
1. You've trained a model for Swiggy's delivery time prediction. After getting 68% accuracy on validation, you test and get 71% on test set. What should you do next?
2. Your team at Paytm has 100,000 transaction records for fraud detection. What's the recommended data split ratio?
3. Your Zomato dataset has 80% 5-star ratings and 20% low ratings. During train_test_split, how do you ensure both sets have the same imbalance?
Up Next
ETL Basics
Now that you can evaluate models properly, learn how to extract, transform, and load data from multiple sources for real-world ML pipelines.