Data Science
Joblib & Pickle
Save trained models to disk and load them instantly across different environments without retraining
Why Model Persistence Matters
Training a machine learning model takes time. Real time. A complex model on a large dataset might run for hours or even days. Nobody wants to retrain a model every time they restart their Python session.
Think of it like saving a Word document. You write content, save it, close the application, then reopen later to continue where you left off. Machine learning models need the same capability.
Without Persistence
Train model → Use predictions → Close Python → Lost forever → Retrain again
With Persistence
Train once → Save to file → Load anywhere → Use immediately → Deploy to production
Python offers two main options for saving models: pickle (built into the standard library) and joblib (optimized for scikit-learn). Honestly, joblib usually wins for ML models because it handles the large NumPy arrays inside them far more efficiently.
Setting Up the Data
The scenario: Flipkart's data science team built a revenue prediction model for their product categories. They need to save this trained model so other teams can use it without rebuilding from scratch.
# Import essential libraries for model persistence
import pandas as pd
import joblib
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
Libraries imported successfully
What just happened?
We imported joblib and pickle alongside scikit-learn components. Both libraries can save Python objects to disk, but joblib specializes in NumPy arrays used by ML models. Try this: Check your joblib version with joblib.__version__
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Display basic information about the data
print("Dataset shape:", df.shape)
# Show first few rows to understand structure
print(df.head())
Dataset shape: (10000, 11)
   order_id        date  customer_age  gender       city product_category  \
0      1001  2023-01-05            32    Male     Mumbai      Electronics
1      1002  2023-01-07            28  Female      Delhi         Clothing
2      1003  2023-01-12            45    Male  Bangalore             Food
3      1004  2023-01-15            35  Female    Chennai            Books
4      1005  2023-01-18            29    Male       Pune             Home

         product_name  quantity  unit_price   revenue  rating  returned
0       iPhone 14 Pro         1    125000.0  125000.0     4.8     False
1  Nike Running Shoes         2      8500.0   17000.0     4.2     False
2   Organic Rice 10kg         3       850.0    2550.0     4.0     False
3   Data Science Book         1      1200.0    1200.0     4.5     False
4      LED Table Lamp         2      2500.0    5000.0     4.1     False

What just happened?
We loaded 10,000 ecommerce records with 11 columns including product_category and revenue which we'll use for our model. The data shows realistic Indian ecommerce transactions. Try this: Explore unique categories with df['product_category'].unique()
Building a Model to Save
Before we can save anything, we need a trained model. We'll build a simple revenue predictor based on customer age and product category.
# Prepare categorical data for machine learning
le = LabelEncoder()
# Convert product categories to numbers (alphabetical: Books=0, Clothing=1, Electronics=2, ...)
df['category_encoded'] = le.fit_transform(df['product_category'])
# Show the mapping that was created
print("Category encoding mapping:")
for i, category in enumerate(le.classes_):
    print(f"{category}: {i}")
Category encoding mapping:
Books: 0
Clothing: 1
Electronics: 2
Food: 3
Home: 4
# Define features and target variable
X = df[['customer_age', 'category_encoded']]
y = df['revenue']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")Training samples: 8000 Testing samples: 2000
# Train a linear regression model
model = LinearRegression()
# Fit the model on training data
model.fit(X_train, y_train)
# Check model performance on test set
score = model.score(X_test, y_test)
print(f"Model R² Score: {score:.4f}")
print("Model trained successfully!")Model R² Score: 0.2145 Model trained successfully!
What just happened?
We trained a linear regression model that explains about 21% of revenue variation using age and category. The model.fit() learned patterns in training data, and model.score() tested accuracy on unseen data. Try this: Check model coefficients with model.coef_
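If you want to see exactly what the model learned, the fitted parameters are stored on the estimator itself. A minimal sketch (your exact numbers will differ):

# Inspect the fitted parameters: one coefficient per feature, plus an intercept
print("Coefficients:", model.coef_)    # weights for customer_age and category_encoded
print("Intercept:", model.intercept_)  # predicted revenue when both features are 0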
Saving Models with Joblib
Now comes the magic. Joblib is specifically designed for scikit-learn models. It compresses NumPy arrays efficiently and handles large models better than standard pickle.
# Save the trained model using joblib
joblib.dump(model, 'flipkart_revenue_model.pkl')
# Also save the label encoder (we'll need it for new predictions)
joblib.dump(le, 'category_encoder.pkl')
print("Model saved successfully with joblib!")
print("Files created: flipkart_revenue_model.pkl, category_encoder.pkl")Model saved successfully with joblib! Files created: flipkart_revenue_model.pkl, category_encoder.pkl
📊 Data Insight
The model file is only ~2KB because linear regression stores just coefficient values. Complex models like RandomForest can be 50-500MB depending on tree count and features.
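For bigger models, joblib can also compress on the way to disk via its compress parameter. A quick sketch (the filename here is illustrative; higher levels mean smaller files but slower saves and loads):

# Save with compression: compress accepts 0-9 (or a (method, level) tuple)
joblib.dump(model, 'flipkart_revenue_model_compressed.pkl', compress=3)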
# Load the saved model back into memory
loaded_model = joblib.load('flipkart_revenue_model.pkl')
# Load the encoder too
loaded_encoder = joblib.load('category_encoder.pkl')
# Test that loaded model works exactly like original
test_prediction = loaded_model.predict([[35, 2]])  # 35-year-old buying Electronics
print(f"Predicted revenue for 35-year-old Electronics customer: ₹{test_prediction[0]:,.2f}")
Predicted revenue for 35-year-old Electronics customer: ₹45,236.78
What just happened?
The joblib.dump() serialized our model to disk, then joblib.load() restored it perfectly. The prediction works identically to the original model. Try this: Compare file sizes using import os; os.path.getsize('flipkart_revenue_model.pkl')
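A useful habit after any load: confirm the restored model reproduces the original's predictions before trusting it. A minimal sketch, assuming the model and X_test from earlier:

import numpy as np
# The loaded model should match the original prediction-for-prediction
original_preds = model.predict(X_test)
restored_preds = loaded_model.predict(X_test)
print("Predictions identical:", np.allclose(original_preds, restored_preds))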
Using Standard Pickle
Python's built-in pickle module works with any Python object, not just ML models. It's the universal serialization tool, but slower for large NumPy arrays.
# Save model using standard pickle
with open('model_pickle.pkl', 'wb') as file:
    # 'wb' means write in binary mode
    pickle.dump(model, file)
print("Model saved with pickle!")
# Show that file was created
import os
print(f"Pickle file size: {os.path.getsize('model_pickle.pkl')} bytes")
Model saved with pickle!
Pickle file size: 1847 bytes
# Load model using standard pickle
with open('model_pickle.pkl', 'rb') as file:
    # 'rb' means read in binary mode
    pickle_model = pickle.load(file)
# Test the loaded model
test_pred = pickle_model.predict([[28, 1]])  # 28-year-old buying Clothing
print(f"Pickle model prediction: ₹{test_pred[0]:,.2f}")
print("Pickle loading successful!")
Pickle model prediction: ₹23,854.12
Pickle loading successful!
Common Mistake: File Mode Confusion
Always use binary mode ('wb' for writing, 'rb' for reading) with pickle. Text mode ('w', 'r') causes encoding errors. The exact fix: Use open('file.pkl', 'wb') not open('file.pkl', 'w')
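A related tip: if you control both the saving and loading environments, you can ask pickle for its newest protocol, which is typically faster and more compact than the default. A short sketch:

# protocol=pickle.HIGHEST_PROTOCOL uses the newest format this Python supports
with open('model_pickle.pkl', 'wb') as file:
    pickle.dump(model, file, protocol=pickle.HIGHEST_PROTOCOL)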
Joblib vs Pickle Comparison
Both methods work, but they have different strengths. The choice depends on your specific use case and performance requirements.
| Aspect | Joblib | Pickle |
|---|---|---|
| Speed (Large Arrays) | ⚡ 3-10x faster | Slower |
| File Size | Compressed, smaller | Larger files |
| ML Model Support | 🎯 Optimized | Generic |
| Installation | Separate library | ✅ Built-in |
| Python Objects | Limited scope | 🌟 Everything |
Chart: Performance metrics for a typical scikit-learn LinearRegression model (lower is better, except compression)
The chart shows joblib's clear advantages for ML models. File sizes are smaller due to compression, and both save/load operations run significantly faster. For production systems processing hundreds of models daily, these differences compound dramatically.
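You can check these claims on your own machine. Below is a minimal, hedged benchmark sketch; absolute numbers depend heavily on hardware, library versions, and the pickle protocol in use, so treat the output as directional:

import time
import numpy as np

big_array = np.random.rand(2000, 2000)  # roughly 30 MB of float64 data

start = time.perf_counter()
joblib.dump(big_array, 'bench_joblib.pkl')
print(f"joblib dump: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
with open('bench_pickle.pkl', 'wb') as f:
    pickle.dump(big_array, f)
print(f"pickle dump: {time.perf_counter() - start:.3f}s")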
But pickle still wins for versatility. Need to save a custom class? A complex nested dictionary? A mix of different object types? Pickle handles them all, and it ships with Python, so there is nothing extra to install.
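For example, here's a sketch of pickling an arbitrary nested structure; all the names and values below are illustrative:

# Pickle happily serializes mixed, nested Python objects
experiment_state = {
    'notes': 'baseline run',
    'thresholds': {('Electronics', 'high'): 0.8},  # tuple keys are fine
    'history': [0.21, 0.22, 0.21],
}
with open('experiment_state.pkl', 'wb') as f:
    pickle.dump(experiment_state, f)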
Real Production Workflow
The scenario: Zomato's recommendation engine team needs to deploy their model across multiple servers. They train once, save the model, then load it on different machines.
# Production-ready model saving with metadata
import datetime
# Create model metadata for tracking
model_info = {
    'model_type': 'LinearRegression',
    'features': ['customer_age', 'product_category'],
    'target': 'revenue',
    'training_date': datetime.datetime.now().isoformat(),
    'r2_score': score
}
Model metadata created successfully
# Save complete model package
model_package = {
    'model': model,
    'encoder': le,
    'metadata': model_info,
    'feature_names': ['customer_age', 'category_encoded']
}
# Save everything in one file
joblib.dump(model_package, 'zomato_complete_model.pkl')
print("Complete model package saved!")
Complete model package saved!
# Load complete package on production server
production_package = joblib.load('zomato_complete_model.pkl')
# Extract components
prod_model = production_package['model']
prod_encoder = production_package['encoder']
metadata = production_package['metadata']
# Show model information
print("Loaded model info:")
print(f"Type: {metadata['model_type']}")
print(f"Trained: {metadata['training_date'][:10]}") # Show just the date
print(f"Accuracy: {metadata['r2_score']:.3f}")Loaded model info: Type: LinearRegression Trained: 2024-01-15 Accuracy: 0.214
What just happened?
We packaged the model, encoder, and metadata into a single dictionary, then saved everything together. This creates a complete deployment-ready package that includes version tracking and model information. Try this: Add more metadata like 'python_version': sys.version
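Building on that, a hypothetical loader helper (not part of the scenario above) can validate the package before any prediction is made, failing fast if the features don't match what the caller expects:

# Hypothetical helper: load a package and sanity-check its feature list
def load_model_package(path, expected_features):
    package = joblib.load(path)
    if package['feature_names'] != expected_features:
        raise ValueError(f"Feature mismatch: expected {expected_features}, "
                         f"got {package['feature_names']}")
    return package

package = load_model_package('zomato_complete_model.pkl',
                             ['customer_age', 'category_encoded'])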
Chart: Model load times improve with caching while prediction volume grows steadily
The performance chart reveals something interesting. Model load times decrease over weeks as the operating system caches frequently accessed files. Meanwhile, prediction requests grow as more applications integrate with the model API.
Pro Tip: Always include model metadata in production. When models fail in production (and they will), you need to know exactly which version, training date, and accuracy to debug issues quickly.
Memory and Security Considerations
Model persistence isn't just about convenience—it has real implications for memory usage and security. Loading large models consumes RAM, and pickle files can execute arbitrary code.
Chart: RAM allocation for a production model with preprocessing pipelines
Memory usage follows a predictable pattern. Model weights dominate, especially for deep learning models. Feature encoders and preprocessing pipelines take their share too. Why does this matter? Because loading ten different models simultaneously might exhaust available RAM.
Security Risk: Never Load Untrusted Pickle Files
Pickle files can contain executable code that runs when loaded. A malicious pickle could delete files, steal data, or install malware. Only load pickle files from trusted sources or your own training scripts.
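One partial mitigation, sketched below, is to record a checksum of each file you save and verify it before loading. Note the limits: this catches tampering and corruption, but it does not make a pickle from an untrusted source safe to load.

import hashlib

def file_checksum(path):
    # SHA-256 hash of the file's raw bytes
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

saved_hash = file_checksum('flipkart_revenue_model.pkl')  # store alongside the model
# Later, before loading: refuse the file if the hash has changed
assert file_checksum('flipkart_revenue_model.pkl') == saved_hash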
# Check memory usage of loaded models
import sys
# Get size of model in memory
model_size = sys.getsizeof(model)
encoder_size = sys.getsizeof(le)
total_size = model_size + encoder_size
print(f"Model memory usage: {model_size:,} bytes")
print(f"Encoder memory usage: {encoder_size:,} bytes")
print(f"Total: {total_size:,} bytes ({total_size/1024:.1f} KB)")Model memory usage: 1,256 bytes Encoder memory usage: 848 bytes Total: 2,104 bytes (2.1 KB)
What just happened?
The sys.getsizeof() function measured the shallow size of each object: the Python object itself, not everything it references, so large NumPy arrays inside a model are not counted. For linear regression the total is tiny either way, since it stores just a small coefficient array. Complex models like neural networks would show much larger real footprints. Try this: Compare with RandomForestRegressor(n_estimators=100)
Our simple linear regression model uses just 2KB of RAM—practically nothing. But a production RandomForest with 100 trees might use 50MB. A deep learning model could easily consume 1-2GB. Always monitor memory when loading multiple models in production.
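If you need a more honest size estimate than sys.getsizeof() gives, here are two quick sketches that account for the data a model actually references (assuming the model from this lesson):

# 1. Bytes held by the model's coefficient array itself
print("Coefficient array:", model.coef_.nbytes, "bytes")
# 2. Size of the fully serialized object, including everything it references
print("Serialized size:", len(pickle.dumps(model)), "bytes")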
Quiz
1. Your team at Swiggy trained a complex RandomForest model with 500 trees for delivery time prediction. The model file is 200MB. Why should you choose joblib over pickle for saving this model?
2. Your colleague sends you a pickle file containing a "pre-trained sentiment analysis model" via email. What's the most important security consideration before loading this file?
3. You're deploying a customer churn prediction model to production at Paytm. The model uses categorical encoding and needs version tracking. What's the best practice for saving everything needed for deployment?
Up Next
Regression
Now that you can save and load models, we'll dive deep into regression algorithms to build predictive models for continuous outcomes like revenue and prices.