Deploying Convolutional and Transformer-based Generative Models as Microservices on Kubernetes

Interested in AI? I'm adding an AI touch to LHB with this tutorial on deploying convolutional and transformer-based generative models as microservices on Kubernetes, with containerized model serving and periodic retraining.

You'll learn the following in this tutorial:

  • Containerize PyTorch and TensorFlow models for GPU-accelerated inference
  • Implement canary deployments for gradual generative model updates
  • Automate continuous retraining of generative models with new data
  • Integrate generative microservices into disaster recovery workflows

By the end of this article, you will be able to create and deploy your own generative microservices that produce high-quality images, text, audio, or video content on demand, and keep them up to date with the latest data and feedback.

What are generative models?

Generative models are a type of machine learning model that can learn to generate new data that resembles the data they are trained on. For example, a generative model can learn to create realistic images of faces, animals, landscapes, or artworks based on a large dataset of existing images. Generative models can also learn to generate text, audio, or video content, such as captions, stories, songs, or animations.

There are different types of generative models, such as variational autoencoders (VAEs), generative adversarial networks (GANs), and transformers.

In this article, I will focus on convolutional and transformer-based generative models.

Convolutional generative models use convolutional layers to process spatial information in images or videos.

Transformer-based generative models use transformer layers to process sequential information in text or audio.

Why deploy generative models as microservices?

Microservices are small, independent, and loosely coupled services that communicate with each other through well-defined APIs. Microservices offer several benefits for deploying machine learning models in production, such as:

  • Scalability: You can scale each microservice independently according to the demand and resource availability.
  • Availability: You can ensure high availability and fault tolerance by replicating and load balancing each microservice across multiple nodes or regions.
  • Modularity: You can develop, test, deploy, and update each microservice separately without affecting the rest of the system.
  • Diversity: You can use different technologies, frameworks, languages, and tools for each microservice according to the specific requirements and preferences.

Containerize PyTorch and TensorFlow models for GPU-accelerated inference

One of the challenges of deploying machine learning models as microservices is to ensure that they run consistently and efficiently across different environments.

A common solution is to use containers, which are isolated and portable units of software that contain everything needed to run an application: code, libraries, dependencies, configuration, etc.

To containerize your PyTorch or TensorFlow model for GPU-accelerated inference, you need to follow these steps:

  • Choose a base image that contains the operating system and the framework version you need. For example, you can use the official PyTorch or TensorFlow images from Docker Hub or NVIDIA NGC.
  • Install any additional dependencies or packages you need for your model or application. You may need to install torchvision or tensorflow_datasets for data loading and processing.
  • Copy your model file (e.g., .pt or .h5) and any other files (e.g., .json or .txt) you need for your application into the container image.
  • Define an entry point script that loads your model from the file and runs your inference logic. Preprocess the input data, feed it into your model, postprocess the output data, and return it as a response.
  • Expose a port for your application to communicate with other services. You can use Flask or FastAPI to create a RESTful API for your model inference; a minimal app.py sketch follows the build and push commands below.
  • Build your container image using docker build or podman build. The Dockerfile below puts the previous steps together:
# Use PyTorch base image from NVIDIA NGC
FROM nvcr.io/nvidia/pytorch:21.03-py3

# Install torchvision for data loading and Flask for the inference API
RUN pip install torchvision flask

# Copy model file and entrypoint script
COPY model.pt /app/model.pt
COPY app.py /app/app.py

# Expose port 5000 for Flask API
EXPOSE 5000

# Run entrypoint script
CMD ["python", "/app/app.py"]

Then build the image, tagging it with your registry name so it can be pushed later:

docker build -t myregistry/generative-model:v2 .


Push the image to your registry with docker push so your Kubernetes cluster can pull it. For example:

docker push myregistry/generative-model:v2
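
For reference, here is a minimal sketch of the entry point script mentioned above. It assumes model.pt is a TorchScript module that maps a random latent vector to an image tensor; the endpoint name, payload fields, and pre/post-processing are illustrative and will differ for your model.

# app.py - minimal Flask inference sketch
import base64
import io

import torch
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at startup and move it to the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.jit.load("/app/model.pt", map_location=device)
model.eval()

@app.route("/generate", methods=["POST"])
def generate():
    # Read conditioning inputs from the request (latent size is assumed)
    payload = request.get_json(force=True)
    latent_dim = int(payload.get("latent_dim", 128))

    with torch.no_grad():
        z = torch.randn(1, latent_dim, device=device)
        image = model(z).cpu()

    # Serialize the raw tensor; a real service would encode to PNG instead
    buffer = io.BytesIO()
    torch.save(image, buffer)
    return jsonify({"image_b64": base64.b64encode(buffer.getvalue()).decode()})

if __name__ == "__main__":
    # Bind to 0.0.0.0 so the exposed container port is reachable
    app.run(host="0.0.0.0", port=5000)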

Implement canary deployments for gradual generative model updates

One of the advantages of microservices is that you can update each service independently without affecting the rest of the system.

However, updating a machine learning model can be risky, as it may introduce bugs, errors, or performance degradation.

Therefore, it is advisable to use a canary deployment strategy, which allows you to test a new version of your model on a subset of users or traffic before rolling it out fully.

A canary deployment strategy involves the following steps:

  • Deploy a new version of your model as a separate microservice, alongside the existing version. For example, you can use different tags or labels for your container images and deployments to distinguish between the versions.
  • Split the traffic between the two versions using a load balancer or a service mesh. For example, you can use Istio or Linkerd to create routing rules and weight-based distribution for your services.
  • Monitor and compare the performance and behavior of the two versions using metrics, logs, and alerts. For example, you can use Prometheus or Grafana to collect and visualize metrics such as latency, throughput, error rate, accuracy, etc.
  • Gradually increase the traffic to the new version if it meets your expectations and criteria. For example, you can start with 10% of the traffic and increase it by 10% every hour until it reaches 100%.
  • Roll back to the old version if you detect any issues or anomalies with the new version. For example, you can use Kubernetes commands or Helm charts to revert your deployments and services, as shown just below.
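
If the canary misbehaves, rolling back is a one-liner with either tool. These commands assume the release and deployment names from the Helm example below:

helm rollback generative-model 1
kubectl rollout undo deployment/generative-model-v2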

Here is an example of a Helm chart for deploying a TensorFlow transformer-based generative model as a canary. For brevity, the values and the templates are shown together; in a real chart, the values would live in values.yaml and each resource in its own file under templates/:

# Define values for deployment name, image tag, and traffic weight
name: generative-model
imageTag: v2
weight: 10

# Define deployment template
deployment:
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: {{ .Values.name }}-{{ .Values.imageTag }}
    labels:
      app: {{ .Values.name }}
      version: {{ .Values.imageTag }}
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: {{ .Values.name }}
        version: {{ .Values.imageTag }}
    template:
      metadata:
        labels:
          app: {{ .Values.name }}
          version: {{ .Values.imageTag }}
      spec:
        containers:
        - name: {{ .Values.name }}
          image: myregistry/{{ .Values.name }}:{{ .Values.imageTag }}
          ports:
          - containerPort: 5000

# Define service template
service:
  apiVersion: v1
  kind: Service
  metadata:
    name: {{ .Values.name }}
    labels:
      app: {{ .Values.name }}
  spec:
    selector:
      app: {{ .Values.name }}
    ports:
    - port: 80
      targetPort: 5000

# Define virtual service template for Istio routing
virtualService:
  apiVersion: networking.istio.io/v1alpha3
  kind: VirtualService
  metadata:
    name: {{ .Values.name }}
  spec:
    hosts:
    - {{ .Values.name }}
    http:
    - route:
      - destination:
          host: {{ .Values.name }}
          subset: v1
        weight: {{ sub 100 .Values.weight }}
      - destination:
          host: {{ .Values.name }}
          subset: v2
        weight: {{ .Values.weight }}

# Define destination rule template for Istio subsets
destinationRule:
  apiVersion: networking.istio.io/v1alpha3
  kind: DestinationRule
  metadata:
    name: {{ .Values.name }}
  spec:
    host: {{ .Values.name }}
    subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2

To deploy the generative model as a canary using this Helm chart, you can run this command:

helm upgrade generative-model generative-model-chart --set imageTag=v2 --set weight=10
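
Once the canary looks healthy, shift more traffic to it by rerunning the upgrade with a higher weight, repeating until it reaches 100:

helm upgrade generative-model generative-model-chart --set imageTag=v2 --set weight=50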


Automate continuous retraining of generative models with new data

Another challenge of deploying machine learning models as microservices is to keep them up to date with the latest data and feedback. A static model that is trained once and never updated may become obsolete or inaccurate over time due to changes in data distribution, user behavior, or business requirements. Therefore, it is advisable to automate the continuous retraining of generative models with new data.

Automating continuous retraining of generative models involves the following steps:

  • Collect and store new data from various sources, such as user feedback, logs, sensors, etc. For example, you can use Kafka or Azure Event Hubs to stream data into a data lake or a data warehouse.
  • Preprocess and transform the new data into a suitable format for training. For example, you can use Spark or Databricks to perform data cleaning, feature engineering, and data augmentation.
  • Train a new version of your generative model using the new data and the existing model as a starting point. For example, you can use PyTorch Lightning or TensorFlow Extended to create a training pipeline that runs on a distributed cluster or a cloud platform.
  • Evaluate and validate the new version of your generative model using various metrics and tests. For example, you can use TensorBoard or MLflow to track and compare the performance of different versions of your model on different datasets and tasks.
  • Deploy the new version of your generative model as a microservice using a canary deployment strategy. For example, you can use Helm or Kustomize to create and update your Kubernetes deployments and services.

To automate these steps, you can use tools such as Airflow or Kubeflow to create and orchestrate workflows that run periodically or are triggered by events. You can also use tools such as Argo CD or Flux to implement GitOps, which is the practice of using Git as a single source of truth for your code and configuration.

Here is an example of an Airflow DAG for automating continuous retraining of a PyTorch convolutional generative model:

# Import modules
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

# Define DAG parameters
dag = DAG(
    dag_id='generative_model_retraining',
    schedule_interval='@daily',
    start_date=days_ago(1),
    catchup=False
)

# Define tasks
collect_data = BashOperator(
    task_id='collect_data',
    bash_command='python collect_data.py',
    dag=dag
)

preprocess_data = BashOperator(
    task_id='preprocess_data',
    bash_command='python preprocess_data.py',
    dag=dag
)

train_model = BashOperator(
    task_id='train_model',
    bash_command='python train_model.py',
    dag=dag
)

evaluate_model = BashOperator(
    task_id='evaluate_model',
    bash_command='python evaluate_model.py',
    dag=dag
)

deploy_model = BashOperator(
    task_id='deploy_model',
    bash_command='helm upgrade generative-model generative-model-chart --set imageTag=v2 --set weight=10',
    dag=dag
)

# Define dependencies
collect_data >> preprocess_data >> train_model >> evaluate_model >> deploy_model
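
The DAG above deploys unconditionally. In practice you would gate the deploy step on the evaluation result; since a BashOperator marks its task failed on a non-zero exit code, and failed tasks block their downstream dependencies by default, evaluate_model.py can act as that gate. Here is a minimal sketch, in which the metric file layout and the FID threshold are assumptions:

# evaluate_model.py - quality-gate sketch: a non-zero exit fails the
# Airflow task and prevents deploy_model from running.
import json
import sys

THRESHOLD = 0.02  # assumed maximum tolerated FID regression

def load_fid(path):
    # Assumed metric file layout: {"fid": <float>}
    with open(path) as f:
        return json.load(f)["fid"]

current = load_fid("metrics/current.json")
candidate = load_fid("metrics/candidate.json")

# Lower FID is better; abort the pipeline if the candidate regresses
if candidate > current + THRESHOLD:
    print(f"Candidate FID {candidate:.3f} regressed vs {current:.3f}; aborting deploy")
    sys.exit(1)

print(f"Candidate FID {candidate:.3f} acceptable (current {current:.3f})")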

Integrate generative microservices into disaster recovery workflows

The last step of deploying generative models as microservices is to ensure that they are resilient and reliable in case of disasters or failures.

To integrate generative microservices into disaster recovery workflows, you need to follow these steps:

  • Backup your data and models regularly and store them in a secure and accessible location. For example, you can use Azure Blob Storage or Amazon S3 to store your data and models in the cloud.
  • Implement backup and restore procedures for your data and models using tools such as Velero or KubeDB. For example, you can use Velero to backup and restore your Kubernetes cluster state and resources, including your deployments, services, volumes, etc.
  • Implement health checks and liveness probes for your microservices using tools such as Kubernetes or Istio. For example, you can use Kubernetes liveness probes to check if your microservice is running and restart it if it fails.
  • Implement readiness probes and circuit breakers for your microservices using tools such as Kubernetes or Istio. For example, you can use Kubernetes readiness probes to check if your microservice is ready to receive traffic and remove it from the load balancer if it is not (see the probe sketch after this list).
  • Implement retry and timeout policies for your microservices using tools such as Istio or Resilience4j. For example, you can use Istio retry policy to retry failed requests to your microservice up to a certain number of times or until a certain timeout is reached.
  • Implement fallback and recovery strategies for your microservices using tools such as Istio or Hystrix. For example, you can use Istio fallback policy to route requests to an alternative service if your primary service is unavailable.
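
As a sketch, the liveness and readiness probes might look like this in the container spec of the Deployment template. The /healthz and /ready endpoints are assumptions that app.py would need to serve:

# Excerpt from the container spec in the Deployment template
livenessProbe:
  httpGet:
    path: /healthz         # assumed endpoint served by app.py
    port: 5000
  initialDelaySeconds: 30  # generative models can take a while to load
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready           # assumed endpoint; return 200 once the model is loaded
    port: 5000
  periodSeconds: 5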

Here is an example of an Istio destination rule implementing connection pooling and circuit breaking (outlier detection) for a TensorFlow transformer-based generative model. Note that mirroring and fallback routing are configured on a VirtualService rather than a DestinationRule, as shown after this example:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: generative-model
spec:
  host: generative-model
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutiveErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    loadBalancer:
      simple: LEAST_CONN
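
To mirror a slice of live traffic to a standby service and retry transient failures, extend the VirtualService from the canary example above. This is a hedged sketch: the generative-model-fallback host is an assumed standby deployment you would need to create separately:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: generative-model
spec:
  hosts:
  - generative-model
  http:
  - route:
    - destination:
        host: generative-model
    # Retry transient failures before giving up on the primary service
    retries:
      attempts: 3
      perTryTimeout: 2s
    # Mirror a slice of live traffic to the standby for shadow testing
    mirror:
      host: generative-model-fallback   # assumed standby service
    mirrorPercentage:
      value: 10.0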

Conclusion

In this tutorial, you learned how to deploy convolutional and transformer-based generative models as microservices on Kubernetes, with containerized model serving and periodic retraining.

You also learned how to implement canary deployments, backup and restore procedures, health checks, liveness and readiness probes, retry and timeout policies, circuit breakers, and fallback and recovery strategies for your generative microservices, with example code and configuration for each step.

I hope you learned something new and useful. If you have any questions or feedback, please feel free to leave a comment below.

✍🏻
Barry Ugochukwu is a data scientist who likes sharing his knowledge on data, AI, and ML.