Overview

Deploying ML models with FastAPI, Docker, and Kubernetes

By: Sayak Paul and Chansung Park

This project shows how to serve an ONNX-optimized image classification model as a RESTful web service with FastAPI, Docker, and Kubernetes (k8s). The idea is to first Dockerize the API and then deploy it on a k8s cluster running on Google Kubernetes Engine (GKE). The build and deployment steps are automated with GitHub Actions.

👋 Note: Even though this project uses an image classification model, its structure and techniques can be used to serve other models as well.

Deploying the model as a service with k8s

  • We decouple the model optimization part from our API code. The optimization part is available within the notebooks/TF_to_ONNX.ipynb notebook (a command-line sketch of the conversion follows this list).

  • Then we test the API locally. You can find the instructions within the api directory.

  • To deploy the API, we define our deployment.yaml workflow file inside .github/workflows (a rough command-level sketch of what it automates also follows this list). It does the following tasks:

    • Looks for any changes in the specified directory. If there are any changes:
    • Builds and pushes the latest Docker image to Google Container Registry (GCR).
    • Deploys the Docker container on the k8s cluster running on GKE.
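
For reference, the TF-to-ONNX conversion done in the notebook boils down to a tf2onnx invocation along these lines (a sketch; the SavedModel directory, output file name, and opset are placeholders, not values taken from the notebook):

# Convert a TensorFlow SavedModel to ONNX (requires: pip install tf2onnx).
$ python -m tf2onnx.convert \
    --saved-model <SAVED_MODEL_DIR> \
    --output model.onnx \
    --opset 13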
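
Under the hood, the deployment workflow roughly automates the commands below (a sketch, not the actual workflow file; the image name fastapi-k8s, cluster name, zone, and manifest directory are placeholders, and the Dockerfile is assumed to live in the api directory):

# Build the API image and push it to Google Container Registry.
$ gcloud auth configure-docker
$ docker build -t gcr.io/<PROJECT_ID>/fastapi-k8s:latest api/
$ docker push gcr.io/<PROJECT_ID>/fastapi-k8s:latest

# Fetch cluster credentials and roll out the manifests.
$ gcloud container clusters get-credentials <CLUSTER_NAME> --zone <ZONE> --project <PROJECT_ID>
$ kubectl apply -k <MANIFEST_DIR>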

Configurations needed beforehand

  • Create a k8s cluster on GKE. Here's a relevant resource.

  • Create a service account key (JSON) file. It's good practice to grant it only the roles required for the project. For example, for this project, we created a fresh service account and granted it permissions for the following: Storage Admin, GKE Developer, and GCR Developer. A CLI sketch for this follows this list.

  • Create a secret named GCP_CREDENTIALS in your GitHub repository and paste the contents of the service account key file into it (the sketch after this list shows how to do this with the GitHub CLI).

  • Configure bucket storage related permissions for the service account:

    $ export PROJECT_ID=<PROJECT_ID>
    $ export ACCOUNT=<ACCOUNT>
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.admin
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.objectAdmin
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.objectCreator
  • If you're already on the main branch, then upon a new push the workflow defined in .github/workflows/deployment.yaml should run automatically. Here's how the final output should look (run link):
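
As referenced above, here is a CLI sketch for creating the service account, its JSON key, and the GCP_CREDENTIALS secret (the account name ml-deploy-sa is a placeholder, the roles shown are examples to adjust to your project, and the last step assumes you have the GitHub CLI installed):

$ export PROJECT_ID=<PROJECT_ID>
$ export ACCOUNT=ml-deploy-sa

# Create a fresh service account and grant it only the roles the project needs.
$ gcloud iam service-accounts create ${ACCOUNT} --project ${PROJECT_ID}
$ gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
    --role roles/container.developer
$ gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
    --role roles/storage.admin

# Download a JSON key and store it as the GCP_CREDENTIALS secret on the repository.
$ gcloud iam service-accounts keys create key.json \
    --iam-account=${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com
$ gh secret set GCP_CREDENTIALS < key.json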

Notes

  • Since we use CPU-based pods within the k8s cluster, we use ONNX optimizations, which are known to provide speed-ups in CPU-based environments. If you are using GPU-based pods, then look into TensorRT.
  • We use Kustomize to manage the deployment on k8s (a sketch of manual usage follows).
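
If you want to inspect or apply the Kustomize-managed manifests by hand, the following is a minimal sketch (the manifest directory .kube is an assumption; use whichever directory holds your kustomization.yaml):

# Preview the fully rendered manifests without applying them.
$ kubectl kustomize .kube/

# Apply them to the cluster kubectl currently points at.
$ kubectl apply -k .kube/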

Querying the API endpoint

From the workflow outputs, you should see something like this:

NAME             TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
fastapi-server   LoadBalancer   xxxxxxxxxx   xxxxxxxxxx    80:30768/TCP   23m
kubernetes       ClusterIP      xxxxxxxxxx   <none>        443/TCP        160m
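
If you would rather check this from your own machine than from the workflow logs, the same output can be reproduced locally (a sketch; cluster name, zone, and project ID are placeholders):

# Point kubectl at the GKE cluster, then list the services.
$ gcloud container clusters get-credentials <CLUSTER_NAME> --zone <ZONE> --project <PROJECT_ID>
$ kubectl get services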

Note the EXTERNAL-IP corresponding to fastapi-server (if you have named your service like so). Then cURL it:

curl -X POST -F image_file=@cat.jpg -F with_resize=True -F with_post_process=True http://{EXTERNAL-IP}:80/predict/image

You should get the following output (if you're using the cat.jpg image present in the api directory):

"{\"Label\": \"tabby\", \"Score\": \"0.538\"}"

The request assumes that you have a file called cat.jpg present in your working directory and that the form field name matches the one defined by the API.

TODOs

  • Set up logging for the k8s pods.
  • Find a better way to report the latest API endpoint.

Acknowledgements

ML-GDE program for providing GCP credit support.

Comments
  • Feat/locust grpc

    @deep-diver currently, the load test runs into:

    [Screenshot of the load-test failure omitted.]

    I have ensured https://github.com/sayakpaul/ml-deployment-k8s-fastapi/blob/feat/locust-grpc/locust/grpc/locustfile.py#L49 returns the correct output. But after a few requests, I run into the above problem.

    Also, I should mention that the gRPC client currently does not take care of image resizing, which makes it a bit less comparable to the REST client, which handles preprocessing as well as postprocessing.

    opened by sayakpaul 18
  • Setup TF Serving based deployment

    In this new feature, the following work is expected:

    • Create a new notebook with the TF Serving prototype based on both gRPC (Ref) and REST API (Ref).

    • Update the newly created notebook to check the %%timeit results on the TF Serving server locally.

    • Build/commit a Docker image based on the TF Serving base image using this method.

    • Deploy the built Docker image on the GKE cluster.

    • Check the deployed model's performance under various scenarios (maybe the same ones applied to the ONNX + FastAPI setup).

    new feature 
    opened by deep-diver 11
  • Perform load testing with Locust

    Resources:

    • https://towardsdatascience.com/performance-testing-an-ml-serving-api-with-locust-ecd98ab9b7f7
    • https://microsoft.github.io/PartsUnlimitedMRP/pandp/200.1x-PandP-LocustTest.html
    • https://github.com/https-deeplearning-ai/machine-learning-engineering-for-production-public/tree/main/course4/week2-ungraded-labs/C4_W2_Lab_3_Latency_Test_Compose
    opened by sayakpaul 10
  • 4 dockerize

    fix

    • move api/utils/requirements.txt to /api
    • add missing dependency python-multipart to the requirements.txt

    add

    • Dockerfile

    Closes https://github.com/sayakpaul/ml-deployment-k8s-fastapi/issues/4

    opened by deep-diver 4
  • Deployment on GKE with GitHub Actions

    Closes https://github.com/sayakpaul/ml-deployment-k8s-fastapi/issues/5, https://github.com/sayakpaul/ml-deployment-k8s-fastapi/issues/7, and https://github.com/sayakpaul/ml-deployment-k8s-fastapi/issues/6.

    opened by sayakpaul 2
  • chore: refactored the colab notebook.

    Just added a text cell explaining why it's better to include the preprocessing function in the final exported model. Also added a cell to check that the TF and ONNX outputs match with np.testing.assert_allclose().

    opened by sayakpaul 2