Convert BART models to ONNX with quantization. 3X reduction in size, and up to 3X boost in inference speed

Last update: Dec 09, 2022

Related tags

Deep Learning fast-Bart

Overview

fast-Bart

Reduces BART model size by 3X and boosts inference speed by up to 3X

BART implementation of the fastT5 library (https://github.com/Ki6an/fastT5)

PyTorch model -> ONNX model -> Quantized ONNX model

Install

Install using the requirements.txt file:

git clone https://github.com/siddharth-sharma7/fast-Bart
cd fast-Bart
pip install -r requirements.txt

Usage

The export_and_get_onnx_model() method exports the given pretrained BART model to ONNX, quantizes it, and runs it on ONNX Runtime with default settings. The returned model supports the Hugging Face generate() method.

If you don't want to quantize the model, pass quantized=False to the method.

from fastBart import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 'facebook/bart-base'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Named "text" rather than "input" to avoid shadowing the Python builtin
text = "This is a very long sentence and needs to be summarized."
token = tokenizer(text, return_tensors='pt')

tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               num_beams=3)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)

To run an already exported model, use get_onnx_model().

You can customize the whole pipeline, as shown in the code example below:

from fastBart import (OnnxBart, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 'facebook/bart-base'

# Step 1. Convert the Hugging Face BART model to ONNX
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (Recommended) Quantize the converted model for faster inference and a smaller model size.
# The process is slow for the decoder and init-decoder ONNX files (can take up to 15 mins)
quant_model_paths = quantize(onnx_model_paths)

# Step 3. Set up the ONNX Runtime sessions
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# Step 4. Get the ONNX model
model = OnnxBart(model_or_model_path, model_sessions)

                      ...
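
The size reduction from Step 2 comes from storing weights in fewer bits. The source does not say which scheme quantize() uses, so the following is only a generic sketch of 8-bit affine quantization in plain Python, not the library's implementation. The function names quantize_uint8 and dequantize are hypothetical and are not part of fastBart.

```python
import struct


def quantize_uint8(weights):
    """Map floats to integers in [0, 255] using one scale and one zero point."""
    lo, hi = min(weights), max(weights)
    # Widen the range to include 0.0 so the zero point falls inside [0, 255].
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-lo / scale)
    # Clamp so rounding at the edges never leaves the uint8 range.
    q = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(v - zero_point) * scale for v in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)

# Storage cost: 4 bytes per float32 value versus 1 byte per uint8 value.
float32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
uint8_bytes = len(bytes(q))
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"float32: {float32_bytes} bytes, uint8: {uint8_bytes} bytes")
print(f"largest round-trip error: {max_error:.5f} (scale = {scale:.5f})")
```

Each quantized value takes a quarter of the space of a float32 value, at the cost of a rounding error bounded by the scale. The README reports roughly 3X for the model as a whole.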

Custom output paths

By default, fastBart creates a models-bart folder in the current directory and stores all the models there. You can pass a custom folder path to store the exported models elsewhere. To run already exported models stored in a custom folder, use get_onnx_model(onnx_models_path="/path/to/custom/folder/").

from fastBart import export_and_get_onnx_model, get_onnx_model

model_name = "facebook/bart-base"
custom_output_path = "/path/to/custom/folder/"

# 1. Store the exported models in custom_output_path
model = export_and_get_onnx_model(model_name, custom_output_path)

# 2. Run already exported models that are stored in the custom path
# model = get_onnx_model(model_name, custom_output_path)
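
To check what an export actually wrote, and how much disk space it takes, a small standard-library helper may be handy. This is a sketch under one assumption: that the exported files in the output folder carry a .onnx extension, which the README does not state. The helper names list_model_files and total_size_mb are hypothetical and not part of fastBart.

```python
from pathlib import Path


def list_model_files(folder, pattern="*.onnx"):
    """Return (file name, size in MB) pairs for matching files, sorted by name."""
    root = Path(folder)
    if not root.is_dir():
        return []  # nothing exported yet, or the path is wrong
    files = sorted(f for f in root.glob(pattern) if f.is_file())
    return [(f.name, f.stat().st_size / 1e6) for f in files]


def total_size_mb(folder, pattern="*.onnx"):
    """Sum the sizes of all matching files in the folder, in MB."""
    return sum(size for _, size in list_model_files(folder, pattern))


# "models-bart" is the default output folder named in the README.
for name, size_mb in list_model_files("models-bart"):
    print(f"{name}: {size_mb:.1f} MB")
print(f"total: {total_size_mb('models-bart'):.1f} MB")
```

Running it once on a folder exported with quantization and once on a folder exported with quantized=False would give a direct comparison of the two sizes.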

Functionalities

Export any pretrained BART model to ONNX easily.
The exported model supports beam search, greedy search, and more via the generate() method.
Reduce the model size by 3X using quantization.
Up to 3X speedup over PyTorch execution for greedy search, and 2-3X for beam search.
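
The speedup figures above depend on hardware, model, and input length, so it may be worth measuring them on your own setup. The following is a minimal timing sketch using only the standard library. The helpers median_latency and speedup are hypothetical names, not part of fastBart, and the two lambdas are stand-in workloads rather than real models.

```python
import statistics
import time


def median_latency(fn, warmup=2, repeats=10):
    """Median wall-clock seconds for one call of fn(), after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_fn, candidate_fn, **kwargs):
    """How many times faster candidate_fn runs compared with baseline_fn."""
    return median_latency(baseline_fn, **kwargs) / median_latency(candidate_fn, **kwargs)


# Stand-in workloads; replace them with calls to the two models being compared.
slow = lambda: sum(range(200_000))
fast = lambda: sum(range(20_000))
print(f"speedup: {speedup(slow, fast):.1f}x")
```

For a real comparison, baseline_fn could wrap generate() on the original PyTorch model and candidate_fn could wrap generate() on the model returned by export_and_get_onnx_model(), both called with the same tokenized input and the same num_beams.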


Owner

Siddharth Sharma
