AI Course
Model Serving
Building a machine learning model is only half the job: a trained model becomes useful only when real users and real systems can actually call it. Model serving is the process of making a trained machine learning model available for predictions in production environments.
In this lesson, you will learn what model serving means, how models are deployed, how predictions are generated in real time, and how serving fits into real-world AI systems.
Real-World Connection
When you search on Google, request a ride on Uber, or receive a movie recommendation on Netflix, a trained model is making predictions instantly. These predictions are not happening in notebooks — they are coming from deployed models running on servers. This is model serving in action.
What Is Model Serving?
Model serving is the process of exposing a trained machine learning model so that it can receive input data and return predictions. This is usually done through APIs, web services, or background systems.
- Model is trained offline
- Model is saved to disk
- Model is loaded into a server
- Predictions are returned on request
Common Ways to Serve Models
- REST APIs
- Batch prediction jobs (see the sketch after this list)
- Streaming prediction systems
- Embedded models in applications
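REST APIs are covered in detail below. As a contrast, here is a minimal sketch of a batch prediction job; it assumes the iris_model.pkl file we save later in this lesson and a hypothetical input.csv with one comma-separated row of features per line.
import joblib
import numpy as np

# Load the trained model once for the whole batch.
model = joblib.load("iris_model.pkl")

# Hypothetical input file: one comma-separated row of features per line.
features = np.loadtxt("input.csv", delimiter=",", ndmin=2)

# Predict for every row at once and write the results back to disk.
predictions = model.predict(features)
np.savetxt("predictions.csv", predictions, fmt="%d")
Unlike an API, a batch job runs on a schedule and has no client waiting, so throughput matters more than latency here.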
Simple Model Serving Flow
A basic model serving pipeline follows these steps, sketched in code after the list:
- Client sends input data
- Server receives the request
- Model processes the input
- Prediction is returned
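To make the flow concrete, here is a minimal, framework-free sketch. The handle_request function is purely illustrative (it is not part of any library), and the model file is the iris_model.pkl we save in the next section.
import joblib

# Load the model once at startup, not on every request.
model = joblib.load("iris_model.pkl")

def handle_request(features):
    # The "server" receives input, the model processes it,
    # and the prediction is returned to the caller.
    prediction = model.predict([features])
    return {"prediction": int(prediction[0])}

# Simulated client request.
print(handle_request([5.1, 3.5, 1.4, 0.2]))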
Saving a Trained Model
Before serving a model, it must be saved after training. Let’s see how to save a simple machine learning model using Python.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import joblib

# Load the Iris dataset and train a simple classifier.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Persist the trained model to disk so a server can load it later.
joblib.dump(model, "iris_model.pkl")
What This Code Does
The model is trained on the Iris dataset and then saved to a file named iris_model.pkl. This file can later be loaded by a server to generate predictions without retraining the model.
Loading a Model for Serving
When a request comes in, the saved model is loaded and used to make predictions.
import joblib

# Load the saved model and predict on one flower's measurements
# (sepal length, sepal width, petal length, petal width).
model = joblib.load("iris_model.pkl")
prediction = model.predict([[5.1, 3.5, 1.4, 0.2]])
print(prediction)  # [0]
Understanding the Output
The output [0] is the predicted class label for the given input features; in the Iris dataset, class 0 corresponds to the setosa species. The prediction is returned in milliseconds, which is exactly the request-and-respond pattern model serving follows in production.
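If you want a human-readable label rather than a class index, scikit-learn ships the Iris class names, so a small sketch like this maps the prediction back to a species name:
from sklearn.datasets import load_iris
import joblib

model = joblib.load("iris_model.pkl")
prediction = model.predict([[5.1, 3.5, 1.4, 0.2]])

# target_names maps class indices 0, 1, 2 to species names.
species = load_iris().target_names[prediction[0]]
print(species)  # setosa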
Serving Models Using APIs
Most real-world systems use APIs to serve models. A client sends data using an HTTP request, and the server responds with a prediction. Frameworks like Flask and FastAPI are commonly used for this purpose.
Simple API-Based Model Serving Example
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the model once at startup so every request reuses it.
model = joblib.load("iris_model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # Read the feature list from the JSON request body.
    data = request.json["features"]
    prediction = model.predict([data])
    return jsonify({"prediction": int(prediction[0])})

if __name__ == "__main__":
    app.run()
What This API Code Means
This code creates a web server with a prediction endpoint. When a request is sent with feature values, the model processes the input and returns the prediction as a JSON response.
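Once the server is running, any HTTP client can request a prediction. Here is a sketch using the requests library, assuming the server listens on Flask's default address of http://127.0.0.1:5000:
import requests

# Send feature values to the /predict endpoint as JSON.
response = requests.post(
    "http://127.0.0.1:5000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
)
print(response.json())  # {"prediction": 0}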
Challenges in Model Serving
- Handling large traffic
- Maintaining low latency (see the timing sketch after this list)
- Model version control
- Monitoring performance
- Handling data drift
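Latency is easy to measure locally. This minimal sketch times a single prediction with Python's time module, giving a baseline before network and serialization overhead are added:
import time
import joblib

model = joblib.load("iris_model.pkl")

# Time a single prediction to get a baseline latency figure.
start = time.perf_counter()
model.predict([[5.1, 3.5, 1.4, 0.2]])
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Prediction latency: {elapsed_ms:.2f} ms")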
Best Practices
- Separate training and serving environments
- Use versioned models (see the sketch after this list)
- Monitor prediction accuracy
- Log inputs and outputs
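As a sketch of several of these practices at once, the following loads a model from a versioned filename and logs every input and output. The filename convention, version string, and log format here are assumptions for illustration, not a standard:
import json
import logging
import joblib

logging.basicConfig(filename="predictions.log", level=logging.INFO)

# Versioned filename: an assumed convention, not a standard.
MODEL_VERSION = "v1"
model = joblib.load(f"iris_model_{MODEL_VERSION}.pkl")

def predict_and_log(features):
    prediction = int(model.predict([features])[0])
    # Record the input and output so accuracy can be audited later.
    logging.info(json.dumps({
        "model_version": MODEL_VERSION,
        "features": features,
        "prediction": prediction,
    }))
    return prediction

# Each call now leaves an auditable trail in predictions.log.
print(predict_and_log([5.1, 3.5, 1.4, 0.2]))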
Practice Questions
Practice 1: What is the process of making a trained model available for predictions?
Practice 2: What is the most common method used to serve models to applications?
Practice 3: Which library is commonly used to save and load models in Python?
Quick Quiz
Quiz 1: What is the main purpose of model serving?
Quiz 2: Which framework is commonly used to create model APIs?
Quiz 3: Which factor is critical for real-time model serving?
Coming up next: Introduction to Natural Language Processing — how machines understand human language.