Data Science
Joblib & Pickle
Save trained models to disk and load them instantly across different environments without retraining
Why Model Persistence Matters
Training a machine learning model takes time. Real time. A complex model on a large dataset might run for hours or even days. Nobody wants to retrain a model every time they restart their Python session.
Think of it like saving a Word document. You write content, save it, close the application, then reopen later to continue where you left off. Machine learning models need the same capability.
Without Persistence
Train model → Use predictions → Close Python → Lost forever → Retrain again
With Persistence
Train once → Save to file → Load anywhere → Use immediately → Deploy to production
Python offers two main options for saving models: pickle (built into the standard library) and joblib (optimized for scikit-learn). Honestly, joblib usually wins for ML models because it handles the large NumPy arrays inside them far more efficiently.
Setting Up the Data
The scenario: Flipkart's data science team built a revenue prediction model for their product categories. They need to save this trained model so other teams can use it without rebuilding from scratch.
# Import essential libraries for model persistence
import pandas as pd
import joblib
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
Libraries imported successfully
What just happened?
We imported joblib and pickle alongside scikit-learn components. Both libraries can save Python objects to disk, but joblib specializes in NumPy arrays used by ML models. Try this: Check your joblib version with joblib.__version__
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Display basic information about the data
print("Dataset shape:", df.shape)
# Show first few rows to understand structure
print(df.head())
Dataset shape: (10000, 11)
   order_id        date  customer_age  gender       city product_category  \
0      1001  2023-01-05            32    Male     Mumbai      Electronics
1      1002  2023-01-07            28  Female      Delhi         Clothing
2      1003  2023-01-12            45    Male  Bangalore             Food
3      1004  2023-01-15            35  Female    Chennai            Books
4      1005  2023-01-18            29    Male       Pune             Home

         product_name  quantity  unit_price   revenue  rating  returned
0       iPhone 14 Pro         1    125000.0  125000.0     4.8     False
1  Nike Running Shoes         2      8500.0   17000.0     4.2     False
2   Organic Rice 10kg         3       850.0    2550.0     4.0     False
3   Data Science Book         1      1200.0    1200.0     4.5     False
4      LED Table Lamp         2      2500.0    5000.0     4.1     False

What just happened?
We loaded 10,000 ecommerce records with 11 columns including product_category and revenue which we'll use for our model. The data shows realistic Indian ecommerce transactions. Try this: Explore unique categories with df['product_category'].unique()
Building a Model to Save
Before we can save anything, we need a trained model. We'll build a simple revenue predictor based on customer age and product category.
# Prepare categorical data for machine learning
le = LabelEncoder()
# Convert product categories to numbers (alphabetical: Books=0, Clothing=1, Electronics=2, ...)
df['category_encoded'] = le.fit_transform(df['product_category'])
# Show the mapping that was created
print("Category encoding mapping:")
for i, category in enumerate(le.classes_):
    print(f"{category}: {i}")
Category encoding mapping:
Books: 0
Clothing: 1
Electronics: 2
Food: 3
Home: 4
# Define features and target variable
X = df[['customer_age', 'category_encoded']]
y = df['revenue']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")Training samples: 8000 Testing samples: 2000
# Train a linear regression model
model = LinearRegression()
# Fit the model on training data
model.fit(X_train, y_train)
# Check model performance on test set
score = model.score(X_test, y_test)
print(f"Model R² Score: {score:.4f}")
print("Model trained successfully!")Model R² Score: 0.2145 Model trained successfully!
What just happened?
We trained a linear regression model that explains about 21% of revenue variation using age and category. The model.fit() learned patterns in training data, and model.score() tested accuracy on unseen data. Try this: Check model coefficients with model.coef_
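If you want to see exactly what the model learned, the fitted parameters are stored on the estimator itself. A minimal sketch (your exact numbers will differ):

# Inspect the fitted parameters: one coefficient per feature, plus an intercept
print("Coefficients:", model.coef_)    # weights for customer_age and category_encoded
print("Intercept:", model.intercept_)  # predicted revenue when both features are 0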
Saving Models with Joblib
Now comes the magic. Joblib is specifically designed for scikit-learn models. It compresses NumPy arrays efficiently and handles large models better than standard pickle.
# Save the trained model using joblib
joblib.dump(model, 'flipkart_revenue_model.pkl')
# Also save the label encoder (we'll need it for new predictions)
joblib.dump(le, 'category_encoder.pkl')
print("Model saved successfully with joblib!")
print("Files created: flipkart_revenue_model.pkl, category_encoder.pkl")Model saved successfully with joblib! Files created: flipkart_revenue_model.pkl, category_encoder.pkl
📊 Data Insight
The model file is only ~2KB because linear regression stores just coefficient values. Complex models like RandomForest can be 50-500MB depending on tree count and features.
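For bigger models, joblib can also compress on the way to disk via its compress parameter. A quick sketch (the filename here is illustrative; higher levels mean smaller files but slower saves and loads):

# Save with compression: compress accepts 0-9 (or a (method, level) tuple)
joblib.dump(model, 'flipkart_revenue_model_compressed.pkl', compress=3)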
# Load the saved model back into memory
loaded_model = joblib.load('flipkart_revenue_model.pkl')
# Load the encoder too
loaded_encoder = joblib.load('category_encoder.pkl')
# Test that loaded model works exactly like original
test_prediction = loaded_model.predict([[35, 2]])  # 35-year-old buying Electronics
print(f"Predicted revenue for 35-year-old Electronics customer: ₹{test_prediction[0]:,.2f}")
Predicted revenue for 35-year-old Electronics customer: ₹45,236.78
What just happened?
The joblib.dump() serialized our model to disk, then joblib.load() restored it perfectly. The prediction works identically to the original model. Try this: Compare file sizes using import os; os.path.getsize('flipkart_revenue_model.pkl')
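A useful habit after any load: confirm the restored model reproduces the original's predictions before trusting it. A minimal sketch, assuming the model and X_test from earlier:

import numpy as np
# The loaded model should match the original prediction-for-prediction
original_preds = model.predict(X_test)
restored_preds = loaded_model.predict(X_test)
print("Predictions identical:", np.allclose(original_preds, restored_preds))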
Using Standard Pickle
Python's built-in pickle module works with any Python object, not just ML models. It's the universal serialization tool, but slower for large NumPy arrays.
# Save model using standard pickle
with open('model_pickle.pkl', 'wb') as file:
    # 'wb' means write in binary mode
    pickle.dump(model, file)
print("Model saved with pickle!")
# Show that file was created
import os
print(f"Pickle file size: {os.path.getsize('model_pickle.pkl')} bytes")
Model saved with pickle!
Pickle file size: 1847 bytes
# Load model using standard pickle
with open('model_pickle.pkl', 'rb') as file:
    # 'rb' means read in binary mode
    pickle_model = pickle.load(file)
# Test the loaded model
test_pred = pickle_model.predict([[28, 1]])  # 28-year-old buying Clothing
print(f"Pickle model prediction: ₹{test_pred[0]:,.2f}")
print("Pickle loading successful!")
Pickle model prediction: ₹23,854.12
Pickle loading successful!
Common Mistake: File Mode Confusion
Always use binary mode ('wb' for writing, 'rb' for reading) with pickle. Text mode ('w', 'r') causes encoding errors. The exact fix: Use open('file.pkl', 'wb') not open('file.pkl', 'w')
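A related tip: if you control both the saving and loading environments, you can ask pickle for its newest protocol, which is typically faster and more compact than the default. A short sketch:

# protocol=pickle.HIGHEST_PROTOCOL uses the newest format this Python supports
with open('model_pickle.pkl', 'wb') as file:
    pickle.dump(model, file, protocol=pickle.HIGHEST_PROTOCOL)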
Joblib vs Pickle Comparison
Both methods work, but they have different strengths. The choice depends on your specific use case and performance requirements.
| Aspect | Joblib | Pickle |
|---|---|---|
| Speed (Large Arrays) | ⚡ 3-10x faster | Slower |
| File Size | Compressed, smaller | Larger files |
| ML Model Support | 🎯 Optimized | Generic |
| Installation | Separate library | ✅ Built-in |
| Python Objects | Limited scope | 🌟 Everything |
Chart: Performance metrics for a typical scikit-learn LinearRegression model (lower is better, except compression)
The chart shows joblib's clear advantages for ML models. File sizes are smaller due to compression, and both save/load operations run significantly faster. For production systems processing hundreds of models daily, these differences compound dramatically.
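You can check these claims on your own machine. Below is a minimal, hedged benchmark sketch; absolute numbers depend heavily on hardware, library versions, and the pickle protocol in use, so treat the output as directional:

import time
import numpy as np

big_array = np.random.rand(2000, 2000)  # roughly 30 MB of float64 data

start = time.perf_counter()
joblib.dump(big_array, 'bench_joblib.pkl')
print(f"joblib dump: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
with open('bench_pickle.pkl', 'wb') as f:
    pickle.dump(big_array, f)
print(f"pickle dump: {time.perf_counter() - start:.3f}s")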
But pickle still wins for versatility. Need to save a custom class? A complex nested dictionary? A mix of different object types? Pickle handles them all, and it ships with Python, so there is nothing extra to install.
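For example, here's a sketch of pickling an arbitrary nested structure; all the names and values below are illustrative:

# Pickle happily serializes mixed, nested Python objects
experiment_state = {
    'notes': 'baseline run',
    'thresholds': {('Electronics', 'high'): 0.8},  # tuple keys are fine
    'history': [0.21, 0.22, 0.21],
}
with open('experiment_state.pkl', 'wb') as f:
    pickle.dump(experiment_state, f)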
Real Production Workflow
The scenario: Zomato's recommendation engine team needs to deploy their model across multiple servers. They train once, save the model, then load it on different machines.
# Production-ready model saving with metadata
import datetime
# Create model metadata for tracking
model_info = {
    'model_type': 'LinearRegression',
    'features': ['customer_age', 'product_category'],
    'target': 'revenue',
    'training_date': datetime.datetime.now().isoformat(),
    'r2_score': score
}
Model metadata created successfully
# Save complete model package
model_package = {
    'model': model,
    'encoder': le,
    'metadata': model_info,
    'feature_names': ['customer_age', 'category_encoded']
}
# Save everything in one file
joblib.dump(model_package, 'zomato_complete_model.pkl')
print("Complete model package saved!")
Complete model package saved!
# Load complete package on production server
production_package = joblib.load('zomato_complete_model.pkl')
# Extract components
prod_model = production_package['model']
prod_encoder = production_package['encoder']
metadata = production_package['metadata']
# Show model information
print("Loaded model info:")
print(f"Type: {metadata['model_type']}")
print(f"Trained: {metadata['training_date'][:10]}") # Show just the date
print(f"Accuracy: {metadata['r2_score']:.3f}")Loaded model info: Type: LinearRegression Trained: 2024-01-15 Accuracy: 0.214
What just happened?
We packaged the model, encoder, and metadata into a single dictionary, then saved everything together. This creates a complete deployment-ready package that includes version tracking and model information. Try this: Add more metadata like 'python_version': sys.version
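Building on that, a hypothetical loader helper (not part of the scenario above) can validate the package before any prediction is made, failing fast if the features don't match what the caller expects:

# Hypothetical helper: load a package and sanity-check its feature list
def load_model_package(path, expected_features):
    package = joblib.load(path)
    if package['feature_names'] != expected_features:
        raise ValueError(f"Feature mismatch: expected {expected_features}, "
                         f"got {package['feature_names']}")
    return package

package = load_model_package('zomato_complete_model.pkl',
                             ['customer_age', 'category_encoded'])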
Chart: Model load times improve with caching while prediction volume grows steadily
The performance chart reveals something interesting. Model load times decrease over weeks as the operating system caches frequently accessed files. Meanwhile, prediction requests grow as more applications integrate with the model API.
Pro Tip: Always include model metadata in production. When models fail in production (and they will), you need to know exactly which version, training date, and accuracy to debug issues quickly.
Memory and Security Considerations
Model persistence isn't just about convenience—it has real implications for memory usage and security. Loading large models consumes RAM, and pickle files can execute arbitrary code.
Chart: RAM allocation for a production model with preprocessing pipelines
Memory usage follows a predictable pattern. Model weights dominate, especially for deep learning models. Feature encoders and preprocessing pipelines take their share too. Why does this matter? Because loading ten different models simultaneously might exhaust available RAM.
Security Risk: Never Load Untrusted Pickle Files
Pickle files can contain executable code that runs when loaded. A malicious pickle could delete files, steal data, or install malware. Only load pickle files from trusted sources or your own training scripts.
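One partial mitigation, sketched below, is to record a checksum of each file you save and verify it before loading. Note the limits: this catches tampering and corruption, but it does not make a pickle from an untrusted source safe to load.

import hashlib

def file_checksum(path):
    # SHA-256 hash of the file's raw bytes
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

saved_hash = file_checksum('flipkart_revenue_model.pkl')  # store alongside the model
# Later, before loading: refuse the file if the hash has changed
assert file_checksum('flipkart_revenue_model.pkl') == saved_hash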
# Check memory usage of loaded models
import sys
# Get size of model in memory
model_size = sys.getsizeof(model)
encoder_size = sys.getsizeof(le)
total_size = model_size + encoder_size
print(f"Model memory usage: {model_size:,} bytes")
print(f"Encoder memory usage: {encoder_size:,} bytes")
print(f"Total: {total_size:,} bytes ({total_size/1024:.1f} KB)")Model memory usage: 1,256 bytes Encoder memory usage: 848 bytes Total: 2,104 bytes (2.1 KB)
What just happened?
The sys.getsizeof() function measured the shallow size of each object: the Python object itself, not everything it references, so large NumPy arrays inside a model are not counted. For linear regression the total is tiny either way, since it stores just a small coefficient array. Complex models like neural networks would show much larger real footprints. Try this: Compare with RandomForestRegressor(n_estimators=100)
Our simple linear regression model uses just 2KB of RAM—practically nothing. But a production RandomForest with 100 trees might use 50MB. A deep learning model could easily consume 1-2GB. Always monitor memory when loading multiple models in production.
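If you need a more honest size estimate than sys.getsizeof() gives, here are two quick sketches that account for the data a model actually references (assuming the model from this lesson):

# 1. Bytes held by the model's coefficient array itself
print("Coefficient array:", model.coef_.nbytes, "bytes")
# 2. Size of the fully serialized object, including everything it references
print("Serialized size:", len(pickle.dumps(model)), "bytes")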
Quiz
1. Your team at Swiggy trained a complex RandomForest model with 500 trees for delivery time prediction. The model file is 200MB. Why should you choose joblib over pickle for saving this model?
2. Your colleague sends you a pickle file containing a "pre-trained sentiment analysis model" via email. What's the most important security consideration before loading this file?
3. You're deploying a customer churn prediction model to production at Paytm. The model uses categorical encoding and needs version tracking. What's the best practice for saving everything needed for deployment?
Up Next
Regression
Now that you can save and load models, we'll dive deep into regression algorithms to build predictive models for continuous outcomes like revenue and prices.