Serverless vs Containerized Deployments: Which One is Better for Real-Time ML Scoring?
Machine learning (ML) models are becoming increasingly common in production systems. However, deploying ML models to production is not a trivial task: it requires careful consideration of the trade-offs between performance, scalability, reliability, and cost.
One of the main challenges of ML deployment is how to handle real-time scoring requests from users or applications.
Real-time scoring means that the ML model has to provide predictions or recommendations within a short time frame, usually in milliseconds or seconds. This requires the ML model to be available and responsive at all times, and to handle variable and unpredictable workloads.
There are two common approaches for deploying ML models for real-time scoring: serverless and containerized.
In this article, I will compare these two approaches and discuss their pros and cons. I will also provide some examples of how to use them in practice.
What is Serverless Deployment?
Serverless deployment is a cloud computing paradigm that allows you to run code without having to manage servers or infrastructure. The code is executed by a cloud provider on-demand, in response to events or triggers. The cloud provider handles the scaling, load balancing, security, and fault tolerance of the code execution. You only pay for the resources consumed by the code execution, such as CPU time, memory, and network bandwidth.
Serverless deployment is ideal for applications that have short-lived, stateless, and event-driven functions. For example, a serverless function can be triggered by an HTTP request, a message queue, a database change, or a scheduled timer. The function can perform some logic, such as validating input, processing data, calling an API, or sending an email. The function can also invoke other functions or services as part of its logic.
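As a small, simplified illustration, here is what such a function might look like in Python; the event shape follows the "body" convention AWS Lambda uses behind an HTTP trigger, but the same pattern applies to other providers:
import json

def handle_event(event, context):
    # Validate and parse the incoming request body
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    # Perform some simple business logic, then respond
    result = {"received_items": len(payload.get("items", []))}
    return {"statusCode": 200, "body": json.dumps(result)}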
One of the benefits of serverless deployment is that it simplifies the development and deployment process. You do not have to worry about provisioning, configuring, or maintaining servers or infrastructure. You can focus on writing the business logic and testing the functionality. The cloud provider takes care of the rest.
Another benefit of serverless deployment is that it optimizes the resource utilization and cost efficiency. The serverless function only runs when it is needed, and only consumes the resources that it requires. The cloud provider automatically scales the function up or down according to the demand. You only pay for what you use, and do not have to pay for idle or underutilized resources.
What is Containerized Deployment?
Containerized deployment is another cloud computing paradigm that allows you to run code in isolated and portable units called containers. A container packages everything the code needs to run, such as the executable, libraries, dependencies, configuration files, and environment variables. A container can run on any machine that has a container runtime installed, such as Docker or containerd, and containers are typically managed at scale by an orchestrator such as Kubernetes.
Containerized deployment is ideal for applications that have long-running, stateful, and complex services. For example, a container can run a web server, a database server, an ML model server, or a custom application server. The container can expose ports and endpoints for communication with other containers or external clients. It can also store data and state in persistent volumes or external storage services.
One of the benefits of containerized deployment is that it enhances the portability and compatibility of the code. You can run the same container on different machines or platforms without having to worry about the underlying hardware or software differences. You can also reuse existing containers or images from public or private repositories without having to rebuild them from scratch.
Improved performance and reliability of the code is another benefit of containerized deployment. The container runs in isolation from other containers or processes on the same machine, which reduces the interference and contention for resources. The container also has a consistent and predictable behavior across different environments, which reduces the risk of errors or failures.
How to Deploy ML Models for Real-Time Scoring Using Serverless
One of the popular ways to deploy ML models for real-time scoring using serverless is to use cloud functions. Cloud functions are serverless functions that can be triggered by various events or sources. For example, AWS Lambda, Azure Functions, Google Cloud Functions, and IBM Cloud Functions are some of the cloud services that offer this capability. For most of the following steps, I will use AWS Lambda.
To deploy an ML model using cloud functions, you have to follow these steps:
- Create a cloud function that accepts an input (such as an HTTP request) and returns an output (such as an HTTP response).
With the AWS CLI, you can create a cloud function using the aws lambda create-function command. For example:
aws lambda create-function \
  --function-name my-function \
  --runtime python3.10 \
  --role arn:aws:iam::123456789012:role/lambda-role \
  --handler lambda_function.handler \
  --zip-file fileb://my-function.zip \
  --timeout 15 \
  --memory-size 256
This command creates a cloud function named my-function with the Python 3.10 runtime, the lambda-role IAM role, lambda_function.handler as the entry point (Python handlers are specified as file_name.function_name; this assumes the code shown below is saved as lambda_function.py), my-function.zip as the source code package, a 15-second timeout, and 256 MB of memory.
- Write the code for the cloud function that loads the ML model (from your local file or remote storage service), preprocesses the input data, invokes the ML model to make predictions (using scikit-learn), and postprocesses the output data (formatting JSON or adding confidence scores).
For example, here is Python code for a cloud function that loads a scikit-learn linear regression model and makes predictions from the input data:
# Import libraries
import json

import joblib
import numpy as np
# Note: scikit-learn must be included in the deployment package so that
# the pickled model can be loaded

# Load the model from a local file packaged with the function
# (saved beforehand with joblib.dump; scikit-learn estimators do not
# have a built-in load() method)
model = joblib.load("model.pkl")

# Placeholder pre/postprocessing; replace with your own logic
def preprocess(data):
    # e.g., encode categorical variables or scale features
    return data

def postprocess(predictions):
    # e.g., format the result or add confidence scores
    return {"prediction": predictions.tolist()}

# Define the cloud function handler
def handler(event, context):
    # Parse the input data from the HTTP request body
    input_data = json.loads(event["body"])["data"]
    # Preprocess the input data (e.g., encode categorical variables)
    input_data = preprocess(input_data)
    # Convert the input data to a numpy array and reshape it
    # to match the model input shape (a single sample)
    input_data = np.array(input_data).reshape(1, -1)
    # Invoke the model to make predictions
    output_data = model.predict(input_data)
    # Postprocess the output data (e.g., format JSON or add confidence scores)
    output_data = postprocess(output_data)
    # Return the output data as an HTTP response
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json"
        },
        "body": json.dumps(output_data)
    }
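Before packaging and uploading the function, you can exercise the handler locally with a fake API Gateway event; this quick sketch assumes the code above is importable and model.pkl is in the working directory:
# Simulate the API Gateway proxy event that the handler expects
fake_event = {"body": json.dumps({"data": [1.0, 2.0, 3.0]})}
print(handler(fake_event, context=None))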
- Upload the code and the ML model (or a reference to the ML model) to the cloud provider, and configure the cloud function settings (such as memory, timeout, and trigger).
aws lambda update-function-code \
  --function-name my-function \
  --zip-file fileb://my-function.zip

aws lambda update-function-configuration \
  --function-name my-function \
  --memory-size 512 \
  --timeout 30
These commands update the code of my-function with a new ZIP file named my-function.zip, and update its configuration to 512 MB of memory and a 30-second timeout. Note that update-function-configuration does not manage triggers; the API Gateway trigger that invokes the function when a POST request hits the /predict path has to be created separately in API Gateway (or with an infrastructure-as-code tool), as sketched below.
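As one hedged example of wiring up that trigger from code, the following boto3 sketch quick-creates an HTTP API that proxies every request to the function and grants API Gateway permission to invoke it; the API name and function ARN are placeholders, and a production setup would define an explicit POST /predict route and a scoped SourceArn:
import boto3

apigw = boto3.client("apigatewayv2")
lambda_client = boto3.client("lambda")

# Placeholder ARN; use the ARN returned by create-function
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:my-function"

# Quick-create an HTTP API that proxies requests straight to the function
api = apigw.create_api(
    Name="my-function-api",
    ProtocolType="HTTP",
    Target=function_arn,
)

# Allow API Gateway to invoke the function
lambda_client.add_permission(
    FunctionName="my-function",
    StatementId="allow-apigateway",
    Action="lambda:InvokeFunction",
    Principal="apigateway.amazonaws.com",
)

print(api["ApiEndpoint"])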
- Test the cloud function by sending sample requests and verifying the responses.
You can use the aws lambda invoke command to invoke a cloud function synchronously or asynchronously with an optional payload. The command returns metadata such as the status code and the executed function version, and saves the output data to the file given as the final positional argument. With AWS CLI v2 you may also need to pass --cli-binary-format raw-in-base64-out so that a raw JSON payload is accepted. Because the handler above reads event["body"], the sample payload wraps the input in a body field to mimic an API Gateway proxy event. For example, if you want to save the output data to a file named output.json, you can use the following command:
aws lambda invoke \
  --function-name my-function \
  --cli-binary-format raw-in-base64-out \
  --payload '{"body": "{\"data\": [1.0, 2.0, 3.0]}"}' \
  output.json
This command invokes the cloud function named my-function with a sample payload and saves the output data to a file named output.json. You can then open the file and view the prediction.
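You can also test the end-to-end HTTP path once the trigger is in place; here is a minimal sketch, assuming the requests library is installed and using a placeholder API Gateway URL:
import requests

# Placeholder endpoint; replace with the invoke URL of your API Gateway stage
url = "https://abc123.execute-api.us-east-1.amazonaws.com/predict"

# Send a sample scoring request and print the prediction
response = requests.post(url, json={"data": [1.0, 2.0, 3.0]})
print(response.status_code, response.json())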
- Monitor and troubleshoot the cloud function by using the cloud provider's dashboard, logs, metrics, and alerts.
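Beyond the provider's console, you can also pull basic metrics programmatically. Here is a minimal sketch using boto3 (assuming AWS credentials are configured) that fetches the average duration of my-function over the last hour:
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Fetch the average Duration metric for my-function over the last hour
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
print(stats["Datapoints"])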
How to Deploy ML Models for Real-Time Scoring Using Containers
One of the popular ways to deploy ML models for real-time scoring in containers is to use ML model servers. ML model servers are containerized services that expose RESTful APIs or gRPC endpoints for serving ML models. For example, TensorFlow Serving, Seldon Core, TorchServe, and MLflow's model server are some of the open-source frameworks that provide ML model servers.
Follow these steps to deploy an ML model using ML model servers:
- Create a Dockerfile that defines the base image, dependencies, environment variables, and commands for building and running the ML model server.
# Use TensorFlow Serving as the base image
FROM tensorflow/serving
# Copy the SavedModel from the local directory to the container directory
# (TensorFlow Serving expects a versioned layout, e.g. /models/model/1/)
COPY ./model /models/model
# Tell TensorFlow Serving which model to load
ENV MODEL_NAME=model
# Expose the port used for HTTP/REST requests (8500 is used for gRPC)
EXPOSE 8501
# Run TensorFlow Serving as the entrypoint (already set in the base image)
ENTRYPOINT ["/usr/bin/tf_serving_entrypoint.sh"]
This Dockerfile uses tensorflow/serving as the base image, copies the SavedModel from a local directory named model to the container directory /models/model (TensorFlow Serving expects a versioned subdirectory such as /models/model/1; see the export sketch below), sets the MODEL_NAME environment variable so TensorFlow Serving knows which model to serve, exposes port 8501 for HTTP requests, and runs TensorFlow Serving as the entrypoint.
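If your model is a Keras model, here is a minimal sketch (assuming TensorFlow 2.x and a trained model object; the tiny model below exists only to keep the snippet self-contained) for producing the versioned SavedModel directory that the Dockerfile copies:
import tensorflow as tf

# Stand-in for your trained model; replace with the real one
keras_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(1),
])

# Export to ./model/1 so that COPY ./model /models/model gives
# TensorFlow Serving the versioned layout it expects
tf.saved_model.save(keras_model, "model/1")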
- As in the previous example, write the code for the ML model server that loads the ML model, preprocesses the input data, invokes the ML model to make predictions, and postprocesses the output data. (With TensorFlow Serving this step is not needed, because the server loads the SavedModel directly; frameworks such as Seldon Core instead wrap a Python class that you write.)
Here is Python code for an ML model server that uses Seldon Core's Python wrapper to serve a scikit-learn model:
# Import libraries
import joblib
import numpy as np

# Placeholder pre/postprocessing; replace with your own logic
def preprocess(data):
    # e.g., encode categorical variables or scale features
    return data

def postprocess(predictions):
    # e.g., format the result or add confidence scores
    return predictions

# Define the ML model server class (wrapped by Seldon Core's Python server)
class ModelServer(object):
    # Define the initialization method
    def __init__(self):
        # Load the model from a local file saved with joblib.dump
        # (scikit-learn estimators do not have a built-in load() method)
        self.model = joblib.load("model.pkl")

    # Define the prediction method called by Seldon Core
    def predict(self, input_data, features_names=None):
        # Preprocess the input data (e.g., encode categorical variables)
        input_data = preprocess(input_data)
        # Convert the input data to a numpy array and reshape it
        # to match the model input shape (a single sample)
        input_data = np.array(input_data).reshape(1, -1)
        # Invoke the model to make predictions
        output_data = self.model.predict(input_data)
        # Postprocess the output data (e.g., add confidence scores)
        return postprocess(output_data)
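Before building the image, you can sanity-check the wrapper locally; this quick sketch assumes the class above is importable and model.pkl is in the working directory:
# Quick local check of the wrapper before containerizing it
server = ModelServer()
print(server.predict([1.0, 2.0, 3.0], features_names=None))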
- Build the Docker image from the Dockerfile and the code, and push it to a Docker registry.
You can use the docker build command to build a Docker image from a Dockerfile and a code directory, and the docker push command to push the image to a Docker registry. The image tag should include your registry or Docker Hub namespace (my-registry below is a placeholder) so that the push succeeds.
For example:
docker build -t my-registry/my-model-server:latest ./code
docker push my-registry/my-model-server:latest
- Deploy the Docker image to a container orchestration platform (such as Kubernetes or Amazon ECS), and configure the service settings (such as replicas, ports, and health checks).
Using Kubernetes, you can use kubectl to deploy the Docker image to a cluster, and define the deployment and service settings in YAML manifests or Helm charts.
For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-model-server
  template:
    metadata:
      labels:
        app: my-model-server
    spec:
      containers:
        - name: my-model-server
          image: my-registry/my-model-server:latest
          ports:
            - containerPort: 8501
---
apiVersion: v1
kind: Service
metadata:
  name: my-model-server
spec:
  selector:
    app: my-model-server
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8501
- Test the service by sending sample requests and verifying the responses.
Still using Kubernetes, you can use kubectl to look up the service and curl to send HTTP requests to its endpoint (from inside the cluster, or locally via kubectl port-forward) and check the responses. The exact path and payload format depend on the model server; for the TensorFlow Serving image built above, the REST API is exposed at /v1/models/<MODEL_NAME>:predict.
kubectl get svc my-model-server
curl http://my-model-server/v1/models/model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0]]}'
These commands look up the service endpoint of my-model-server and send an HTTP POST request with some sample data to its prediction path; the response contains the prediction result.
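The same request can be sent from a client application in Python; a minimal sketch, assuming the requests library is installed and the service is reachable as my-model-server (for example from another pod in the cluster):
import requests

# Sample scoring request in the TensorFlow Serving REST format
response = requests.post(
    "http://my-model-server/v1/models/model:predict",
    json={"instances": [[1.0, 2.0, 3.0]]},
)
print(response.json())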
- Monitor and troubleshoot the service by using the platform's dashboard, logs, metrics, and alerts.
For this step, you can use kubectl (for example, kubectl logs and kubectl describe) or Grafana dashboards to view the logs, metrics, and alerts for the service, and tools such as Prometheus (metrics) or Jaeger (distributed tracing) to collect and analyze performance data.
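If Prometheus is scraping the cluster, you can also query metrics programmatically through its HTTP API; a rough sketch, where the Prometheus address and the metric expression are assumptions you would adapt to your own monitoring setup:
import requests

# Query Prometheus for the CPU usage of the model server pods;
# the address and expression below are placeholders for your setup
response = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": 'sum(rate(container_cpu_usage_seconds_total{pod=~"my-model-server.*"}[5m]))'},
)
print(response.json())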
So, Which One is Better for Real-Time ML Scoring?
You cannot really say that one is better than the other; the answer is not definitive. It depends on various factors, such as:
- The complexity and size of the ML model. Serverless deployment may hit limits on the memory, disk space, and execution time available for loading and running large or complex models. Containerized deployment gives you more flexibility and control over these resources.
- The frequency and variability of the scoring requests. Serverless deployment has an advantage with sporadic or unpredictable workloads, since it scales up and down automatically and charges only for what is used. Containerized deployment offers more stability and consistency for steady or predictable workloads, since it maintains a fixed number of replicas and avoids cold starts or latency spikes.
- The latency and throughput requirements of the scoring requests. Serverless deployment may struggle to meet strict latency and throughput targets, because cold starts while initializing or scaling the function add delays, and requests may be throttled or queued under high demand or concurrency limits. Containerized deployment keeps a warm pool of replicas, which avoids cold starts and, if provisioned with enough capacity, handles sustained load more predictably.
You have seen that there is no one-size-fits-all solution for deploying ML models for real-time scoring. The choice of the best approach depends on the factors mentioned above coupled with your development and deployment process.
Therefore, I usually recommend that you experiment with different approaches and compare their results and trade-offs before choosing the one to go with.
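A simple way to run that experiment is to measure end-to-end latency against each deployment; here is a rough sketch, with placeholder endpoints and payloads that you would adapt to the formats discussed above:
import statistics
import time

import requests

# Placeholder endpoints for the two deployments under test
endpoints = {
    "serverless": "https://abc123.execute-api.us-east-1.amazonaws.com/predict",
    "containerized": "http://my-model-server/v1/models/model:predict",
}

for name, url in endpoints.items():
    # Adjust the payload to each endpoint's expected format
    payload = {"data": [1.0, 2.0, 3.0]} if name == "serverless" else {"instances": [[1.0, 2.0, 3.0]]}
    latencies = []
    for _ in range(50):
        start = time.perf_counter()
        requests.post(url, json=payload, timeout=10)
        latencies.append(time.perf_counter() - start)
    print(name, "median latency (s):", round(statistics.median(latencies), 4))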