Train/Test/Validation
Master data splitting techniques to build ML models that actually work in production and avoid the dreaded overfitting trap.
Why Data Splitting Matters
Your ML model scores 98% accuracy. Champagne time? Not so fast. Deploy it and watch accuracy plummet to 60%. Sound familiar? That's overfitting — your model memorized the training data instead of learning patterns.
Think of it like studying for an exam. If you memorize answers to practice questions, you'll ace them. But the real exam has different questions testing the same concepts. Data splitting creates that "real exam" scenario for your model.
Common Mistake: Testing on Training Data
Never evaluate your model on the same data you trained it on. It's like grading your own exam — of course you'll get 100%. Always hold out fresh data for testing.
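To see why this matters before we touch the real data, here's a minimal sketch on a synthetic dataset (the exact numbers will vary run to run): an unconstrained decision tree can memorize its training rows and score near-perfectly on them, while a held-out split tells the honest story.

# Minimal sketch on synthetic data: training-set accuracy vs held-out accuracy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # deep tree, free to memorize
print("Accuracy on its own training data:", accuracy_score(y_tr, tree.predict(X_tr)))  # close to 1.0
print("Accuracy on held-out data:", accuracy_score(y_ho, tree.predict(X_ho)))          # noticeably lower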
Training Set (60-70%): the model learns patterns here
Validation Set (15-20%): tune hyperparameters here
Test Set (15-20%): final evaluation only
Never touch the test set until the final model is ready
Basic Train-Test Split
The scenario: You're a data scientist at Flipkart analyzing customer purchase patterns. Your manager wants a model to predict which customers will make repeat purchases next month.
# Import essential libraries for data manipulation and model building
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the e-commerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)Dataset shape: (50000, 11)
What just happened?
We loaded 50,000 customer transactions with 11 columns. The train_test_split function will help us split this data properly. Try this: Check df.columns to see all available features.
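If you want to peek at the raw table first, a few standard pandas inspection lines go a long way (nothing here is specific to splitting):

# Quick inspection of the loaded DataFrame
print(df.columns.tolist())   # all 11 column names
print(df.head())             # first five rows
print(df.dtypes)             # data types, useful before choosing features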
The scenario continues: You need to create features and a target variable for predicting repeat purchases. High ratings (4+ stars) indicate satisfied customers who are likely to return.
# Create features from customer data
X = df[['customer_age', 'quantity', 'unit_price', 'revenue']]
# Create target: 1 if customer gave high rating (satisfied), 0 otherwise
y = (df['rating'] >= 4.0).astype(int)
# Check the class distribution
print("High rating customers:", y.sum())
print("Low rating customers:", (y == 0).sum())High rating customers: 32150 Low rating customers: 17850
What just happened?
We have 32,150 satisfied customers (64%) vs 17,850 unsatisfied (36%). This is slightly imbalanced but workable. The astype(int) converts True/False to 1/0. Try this: Print X.head() to see your features.
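Following the "Try this" suggestion, here is a quick look at the features and the class balance before splitting (the proportions in the comment simply restate the counts above):

# Inspect the feature matrix and the class balance
print(X.head())                          # first five rows of the four features
print(y.value_counts(normalize=True))    # roughly 0.64 vs 0.36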
Now for the crucial part — splitting your data. The test_size=0.2 parameter means 20% goes to testing, 80% to training. The random_state ensures reproducible results.
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # For reproducible results
stratify=y # Maintain class distribution in both sets
)
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")Training samples: 40000 Testing samples: 10000
What just happened?
Perfect split: 40,000 training samples and 10,000 test samples. The stratify=y ensures both sets have the same 64%-36% class ratio. Try this: Check y_train.mean() and y_test.mean() — they should be nearly identical.
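Following the "Try this" above, here is the stratification check in code. Because we passed stratify=y, the positive rate should be almost identical in both splits (roughly 0.643 given the counts above):

# Verify that stratify=y preserved the class ratio in both splits
print("Positive rate in y_train:", round(y_train.mean(), 3))
print("Positive rate in y_test: ", round(y_test.mean(), 3))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))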
80-20 split ensures enough data for training while preserving sufficient samples for reliable testing
This visualization shows our 80-20 split strategy. The training set gets the lion's share because ML models are data-hungry. But that 20% test set? That's your truth serum. It reveals how your model performs on genuinely unseen data. The key insight here: never touch your test set until final evaluation. It's like keeping the answer key sealed until exam day. Once you peek, you can't un-see those results, and your model evaluation becomes biased.

Training and Initial Evaluation
Time to train your model. We'll use logistic regression because it's interpretable and fast. But here's where most data scientists make their first mistake — they evaluate on the test set immediately.
# Train the logistic regression model on training data only
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Make predictions on training data to check for obvious issues
train_predictions = model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Training accuracy: {train_accuracy:.3f}")
print("Model coefficients:", model.coef_[0])Training accuracy: 0.742 Model coefficients: [-0.0023 0.1847 -0.0000 0.0000]
What just happened?
Our model achieved 74.2% accuracy on the training data. The coefficients show quantity has the strongest positive effect (0.1847), while customer_age has a slight negative effect. Try this: Use model.predict_proba(X_train[:5]) to see probability predictions.
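Following the "Try this" suggestion, here is what the probability output looks like. predict_proba returns one column per class, in the order given by model.classes_:

# Probability estimates for the first five training rows
proba = model.predict_proba(X_train[:5])
print(model.classes_)   # e.g. [0 1] -- column order of the probabilities
print(proba)            # each row sums to 1; the second column is P(high rating)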
A training accuracy of 74.2% looks reasonable: not so high that it suggests overfitting, nor so low that it suggests underfitting. But this is just training performance. The real test comes with unseen data.
Three-Way Split: Train/Validation/Test
Here's where professionals separate from beginners. A validation set acts as a middle ground — not for training, but for tuning hyperparameters and model selection. Your test set remains completely untouched.
# First split: separate out test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: divide remaining 80% into train (60%) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(f"Train: {len(X_train)} samples")
print(f"Validation: {len(X_val)} samples")
print(f"Test: {len(X_test)} samples")Train: 30000 samples Validation: 10000 samples Test: 10000 samples
What just happened?
Smart splitting! We now have 60% train, 20% validation, 20% test. The trick: test_size=0.25 on the remaining 80% gives us 20% of the original data. Try this: Verify with 30000 + 10000 + 10000 = 50000.
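A quick sanity check on the arithmetic, restating the numbers above in code:

# Verify the three-way split covers the whole dataset and hits the intended ratios
print(len(X_train) + len(X_val) + len(X_test) == len(X))   # True: 30000 + 10000 + 10000 = 50000
print(len(X_val) / len(X))    # 0.2 of the original data
print(0.25 * 0.80)            # 0.2 -- why test_size=0.25 on the remaining 80% works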
60-20-20 split provides adequate training data while maintaining separate validation and test sets
This three-way split follows the Goldilocks principle — not too much in any single bucket. Your training set has enough samples to learn patterns. The validation set is large enough for reliable hyperparameter tuning. And your test set remains pristine for final evaluation. Why 60-20-20 specifically? Experience shows this ratio works well for most datasets above 10,000 samples. Smaller datasets might need 70-15-15. Massive datasets (millions of samples) can afford 80-10-10. But 60-20-20 is your safe default.
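If you want those alternative ratios without redoing the rescaling arithmetic each time, here is a small helper sketch. It is not part of scikit-learn; the function name and defaults are our own, and it simply chains two stratified train_test_split calls:

# Helper sketch: one call for any train/val/test ratio (assumes X, y and train_test_split from above)
def three_way_split(X, y, train_frac=0.6, val_frac=0.2, test_frac=0.2, seed=42):
    assert abs(train_frac + val_frac + test_frac - 1.0) < 1e-9, "fractions must sum to 1"
    # First cut: carve off the test set
    X_rest, X_te, y_rest, y_te = train_test_split(
        X, y, test_size=test_frac, random_state=seed, stratify=y
    )
    # Second cut: split the remainder; val_frac must be rescaled relative to what is left
    rel_val = val_frac / (train_frac + val_frac)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_rest, y_rest, test_size=rel_val, random_state=seed, stratify=y_rest
    )
    return X_tr, X_va, X_te, y_tr, y_va, y_te

# Example: the 70-15-15 ratio suggested for smaller datasets
# X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X, y, 0.70, 0.15, 0.15)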
Model Selection with Validation Set

The scenario escalates: Your Flipkart manager wants to compare different algorithms. This is where validation sets shine — you can test multiple models without contaminating your test set.
# Train multiple models and evaluate on validation set
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Initialize different models
models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': SVC(random_state=42)
}

# Train and evaluate each model
validation_scores = {}
for name, model in models.items():
# Train on training set only
model.fit(X_train, y_train)
# Evaluate on validation set (not test!)
val_predictions = model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_predictions)
validation_scores[name] = val_accuracy
print(f"{name}: {val_accuracy:.3f}")Logistic Regression: 0.739 Random Forest: 0.756 SVM: 0.744
What just happened?
Random Forest wins with 75.6% validation accuracy, beating Logistic Regression (73.9%) and SVM (74.4%). Importantly, we used validation data for this comparison — our test set remains untouched. Try this: Check max(validation_scores, key=validation_scores.get) to find the best model programmatically.
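Following the "Try this" above, selecting the winner programmatically keeps the comparison reproducible when you rerun it with new hyperparameters:

# Pick the best model name from the validation scores dictionary
best_name = max(validation_scores, key=validation_scores.get)
print(best_name, round(validation_scores[best_name], 3))   # expected: Random Forest 0.756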
Random Forest achieves highest validation accuracy at 75.6%, making it our candidate for final testing
Random Forest emerges as the winner, but notice the differences are small — just 1.7 percentage points between best and worst. This suggests our features capture the underlying patterns reasonably well across different algorithms. The ensemble nature of Random Forest gives it a slight edge by combining multiple decision trees. Here's the crucial point: you can run this comparison dozens of times, trying different hyperparameters, feature combinations, even different algorithms. Your validation set can handle it. But your test set? That stays locked away until you've made your final model choice.

📊 Data Insight
Random Forest's 75.6% accuracy on unseen validation data suggests it generalizes well beyond training. The narrow 1.7 percentage point spread between models suggests the features carry similar signal regardless of algorithm.
Final Test Set Evaluation
The moment of truth. You've selected Random Forest based on validation performance. Now — and only now — do you evaluate on the test set. This gives you an unbiased estimate of real-world performance.
# Select the best model (Random Forest) and evaluate on test set
best_model = RandomForestClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)
# Final evaluation on test set - first time we're using it!
test_predictions = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Final test accuracy: {test_accuracy:.3f}")
print(f"Validation accuracy was: {validation_scores['Random Forest']:.3f}")
print(f"Difference: {abs(test_accuracy - validation_scores['Random Forest']):.3f}")Final test accuracy: 0.751 Validation accuracy was: 0.756 Difference: 0.005
What just happened?
Excellent results! Test accuracy (75.1%) closely matches validation accuracy (75.6%), a difference of only 0.5 percentage points. This small gap indicates our model generalizes well and we avoided overfitting. Try this: Run best_model.feature_importances_ to see which features matter most.
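Following the "Try this" suggestion, feature importances are easiest to read when paired with the column names (this reuses the pandas import from the first code block):

# Feature importances of the final Random Forest, highest first
importances = pd.Series(best_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))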
That 0.5 percentage point difference between validation and test performance is exactly what you want to see. It means your validation set was a good proxy for test performance. If the gap were much larger, say 5 percentage points or more, you'd suspect overfitting to the validation set.
| Dataset | Purpose | Usage Frequency | Our Accuracy |
|---|---|---|---|
| Training (60%) | Model Learning | Every Iteration | 74.2% |
| Validation (20%) | Model Selection | Multiple Times | 75.6% |
| Test (20%) | Final Evaluation | Once Only | 75.1% |
Critical Rule: One Test Set Evaluation Only
Once you evaluate on the test set, that's it. Don't go back and tune hyperparameters based on test results. If you must iterate further, create a new test set from fresh data.
Quiz
1. You've trained a model for Swiggy's delivery time prediction. After getting 68% accuracy on validation, you test and get 71% on test set. What should you do next?
2. Your team at Paytm has 100,000 transaction records for fraud detection. What's the recommended data split ratio?
3. Your Zomato dataset has 80% 5-star ratings and 20% low ratings. During train_test_split, how do you ensure both sets have the same imbalance?
Up Next
ETL Basics
Now that you can evaluate models properly, learn how to extract, transform, and load data from multiple sources for real-world ML pipelines.