๐ŸŽ๏ธ Accelerate training and inference of ๐Ÿค— Transformers with easy to use hardware optimization tools


Hugging Face Optimum

🤗 Optimum is an extension of 🤗 Transformers, providing a set of performance optimization tools to train and run models on targeted hardware with maximum efficiency.

The AI ecosystem evolves quickly, and more and more specialized hardware, along with its own optimizations, emerges every day. Optimum therefore enables users to efficiently use any of these platforms with the same ease inherent to 🤗 Transformers.

Integration with Hardware Partners

🤗 Optimum aims to provide more diversity in the kinds of hardware users can target to train and fine-tune their models.

To achieve this, we are collaborating with the following hardware manufacturers in order to provide the best transformers integration, currently Graphcore IPUs and Habana Gaudi processors (see the installation table below).

Optimizing models for inference

Along with supporting dedicated AI hardware for training, Optimum also provides inference optimizations for various frameworks and platforms.

We currently support ONNX Runtime along with Intel Neural Compressor (INC).

Features                           | ONNX Runtime   | Intel Neural Compressor
Post-training Dynamic Quantization | ✔️             | ✔️
Post-training Static Quantization  | ✔️             | ✔️
Quantization Aware Training (QAT)  | Stay tuned! ⭐ | ✔️
Pruning                            | N/A            | ✔️

Installation

🤗 Optimum can be installed using pip as follows:

python -m pip install optimum

If you'd like to use the accelerator-specific features of 🤗 Optimum, you can install the required dependencies according to the table below:

Accelerator                   | Installation
ONNX Runtime                  | python -m pip install optimum[onnxruntime]
Intel Neural Compressor (INC) | python -m pip install optimum[intel]
Graphcore IPU                 | python -m pip install optimum[graphcore]
Habana Gaudi Processor (HPU)  | python -m pip install optimum[habana]

If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you can install the base library from source as follows:

python -m pip install git+https://github.com/huggingface/optimum.git

For the accelerator-specific features, you can install them by appending #egg=optimum[accelerator_type] to the pip command, e.g.

python -m pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime]

Quickstart

At its core, 🤗 Optimum uses configuration objects to define parameters for optimization on different accelerators. These objects are then used to instantiate dedicated optimizers, quantizers, and pruners.

Quantization

For example, here's how you can apply dynamic quantization with ONNX Runtime:

from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer

# The model we wish to quantize
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# The type of quantization to apply
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="sequence-classification")

# Quantize the model!
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)

In this example, we've quantized a model from the Hugging Face Hub, but it could also be a path to a local model directory. The feature argument in the from_pretrained() method corresponds to the type of task that we wish to quantize the model for. The result from applying the export() method is a model-quantized.onnx file that can be used to run inference. Here's an example of how to load an ONNX Runtime model and generate predictions with it:

from functools import partial
from datasets import Dataset
from optimum.onnxruntime.model import ORTModel

# Load quantized model
ort_model = ORTModel("model-quantized.onnx", quantizer._onnx_config)
# Create a dataset or load one from the Hub
ds = Dataset.from_dict({"sentence": ["I love burritos!"]})
# Tokenize the inputs
def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["sentence"])

tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=quantizer.tokenizer))
ort_outputs = ort_model.evaluation_loop(tokenized_ds)
# Extract logits!
ort_outputs.predictions

Similarly, you can apply static quantization by simply setting is_static to True when instantiating the QuantizationConfig object:

qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)

Static quantization relies on feeding batches of data through the model to estimate the activation quantization parameters ahead of inference time. To support this, 🤗 Optimum allows you to provide a calibration dataset. The calibration dataset can be a simple Dataset object from the 🤗 Datasets library, or any dataset that's hosted on the Hugging Face Hub. For this example, we'll pick the sst2 dataset that the model was originally trained on:

from optimum.onnxruntime.configuration import AutoCalibrationConfig

# Create the calibration dataset
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=quantizer.tokenizer),
    num_samples=50,
    dataset_split="train",
)
# Create the calibration configuration containing the parameters related to calibration.
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
# Perform the calibration step: computes the activations quantization ranges
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    onnx_model_path="model.onnx",
    operators_to_quantize=qconfig.operators_to_quantize,
)
# Quantize the same way we did for dynamic quantization!
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)

Graph optimization

Now let's take a look at applying graph optimization techniques such as operator fusion and constant folding. As before, we load a configuration object, but this time by setting the optimization level instead of the quantization approach:

from optimum.onnxruntime.configuration import OptimizationConfig

# optimization_level=99 enables all available graph optimizations
optimization_config = OptimizationConfig(optimization_level=99)

Next, we load an optimizer to apply these optimizations to our model:

from optimum.onnxruntime import ORTOptimizer

optimizer = ORTOptimizer.from_pretrained(
    model_checkpoint,
    feature="sequence-classification",
)

# Export the optimized model
optimizer.export(
    onnx_model_path="model.onnx",
    onnx_optimized_model_output_path="model-optimized.onnx",
    optimization_config=optimization_config,
)

And that's it - the model is now optimized and ready for inference!

As you can see, the process is similar in each case:

  1. Define the optimization / quantization strategies via an OptimizationConfig / QuantizationConfig object
  2. Instantiate an ORTQuantizer or ORTOptimizer class
  3. Apply the export() method
  4. Run inference (see the sketch below)
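
Beyond the evaluation loop shown earlier, a common way to run the result is to load the exported ONNX file into an ORTModel class and wrap it in a pipeline. Here is a minimal sketch, assuming your installed version of 🤗 Optimum exposes ORTModelForSequenceClassification and optimum.pipelines, and that model-quantized.onnx from the quickstart sits in the current directory:

from transformers import AutoConfig, AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.pipelines import pipeline

model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# ORTModel expects a config.json next to the ONNX file
AutoConfig.from_pretrained(model_checkpoint).save_pretrained(".")

ort_model = ORTModelForSequenceClassification.from_pretrained(".", file_name="model-quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

classifier = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer)
print(classifier("I love burritos!"))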

Training

Besides supporting ONNX Runtime inference, 🤗 Optimum also supports ONNX Runtime training, reducing the memory and computation needed during training. This can be achieved with the ORTTrainer class, which behaves similarly to the Trainer of 🤗 Transformers:

-from transformers import Trainer
+from optimum.onnxruntime import ORTTrainer

# Step 1: Create your ONNX Runtime Trainer
-trainer = Trainer(
+trainer = ORTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    feature="sequence-classification",
)

# Step 2: Use ONNX Runtime for training and evaluation! 🤗
train_result = trainer.train()
eval_metrics = trainer.evaluate()

By replacing Trainer with ORTTrainer, you will be able to leverage ONNX Runtime for fine-tuning tasks.
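
The snippet above assumes that model, training_args, the datasets, compute_metrics and the tokenizer are already defined, exactly as they would be for the regular Trainer. For completeness, here is a minimal sketch of such a setup built from standard 🤗 Transformers and 🤗 Datasets pieces (depending on your version of 🤗 Optimum, you may need ORTTrainingArguments from optimum.onnxruntime instead of TrainingArguments):

import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    default_data_collator,  # used as data_collator in the ORTTrainer snippet above
)

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

# Tokenize a small text classification dataset
dataset = load_dataset("glue", "sst2")
encoded_dataset = dataset.map(
    lambda examples: tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)
train_dataset = encoded_dataset["train"]
eval_dataset = encoded_dataset["validation"]

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

training_args = TrainingArguments(output_dir="ort_sst2_output", num_train_epochs=1, per_device_train_batch_size=8)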

Check out the examples directory for more sophisticated usage.

Happy optimizing 🤗 !

Comments
  • Handling ONNX models with external data


    This PR aims to handle loading and exporting ONNX models with external data, locally and from the Hub. We can also now use FORCE_ONNX_EXTERNAL_DATA=1 to force the external data format even for small models (see the sketch after the checklist).

    • [X] Saving/loading a model with external data locally
    • [X] Saving external data in a single file (ends with .onnx_data for easy loading from hub)
    • [X] Saving/loading a model with external data from the hub
    • [X] Writing tests
    • [X] Apply the same changes for other models besides seq2seq
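
    A rough sketch of how the new flag could be exercised (an illustration under assumptions, not the PR's test code; it presumes the environment variable is read by the export path used below):

    import os

    os.environ["FORCE_ONNX_EXTERNAL_DATA"] = "1"  # force the external data format even for a small model

    from optimum.onnxruntime import ORTModelForSequenceClassification

    # With the flag set, the exported weights should end up in a separate external data file
    model = ORTModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True
    )
    model.save_pretrained("distilbert_onnx")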

    cc @fxmarty @mht-sharma @michaelbenayoun

    Fixes https://github.com/huggingface/optimum/issues/254 and https://github.com/huggingface/optimum/issues/377

    opened by NouamaneTazi 32
  • add mt5 to ORTConfigManager conf list


    What does this PR do?

    Add MT5 to ORTConfigManager.

    Fixes #321

    I re-arranged all available models in alphabetical order. I can put it back the way it was if needed. 🤗

    @JingyaHuang

    Aside from this PR, I was wondering whether it would be nice to open an issue like https://github.com/huggingface/transformers/issues/16308 to track implementing all available ONNX models in the ORTConfigManager?

    opened by ChainYo 24
  • [BT] Add `Bettertransformer` support for FSMT


    What does this PR do?

    Fixes # (issue)

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [ ] Did you make sure to update the documentation with your changes?
    • [ ] Did you write any new necessary tests?
    opened by Sumanth077 18
  • Add the ORTModelForSemanticSegmentation class


    What does this PR do?

    This PR aims to implement the ORTModelForImageSegmentation class to provide support for image segmentation .onnx models, and full integration of such models through transformers pipelines for CPU or GPU onnxruntime inference (see Issue #382)

    Implementation details

    The ORTModelForImageSegmentation was based on the already implemented ORTModelForImageClassification in optimum/onnxruntime/modeling_ort.py with several modifications:

    1. For CPU and GPU inference:
    • the class was added to optimum/onnxruntime/__init__.py
    • the self.forward method returns a SemanticSegmenterOutput instead of an ImageClassifierOutput
    • the correct auto_model_class and export_feature are referenced
    • copied all tests from the ORTModelForImageClassificationIntegrationTest in tests/onnxruntime/test_modeling.py
    2. For GPU inference:
    • logits_shape was changed in ORTModelForImageSegmentation.prepare_logits_buffer to return a 4-dimensional tensor of shape (input_batch_size, self.config.num_labels, output_height, output_width). The issue is that I did not find a way to get the model output size, which is different from the input size in config.json, from any other attribute of ORTModelForImageSegmentation or ORTModelForImageSegmentation.model.

    CPU inference works as follows:

    from optimum.onnxruntime.modeling_ort import ORTModelForImageSegmentation
    session = ORTModelForImageSegmentation.load_model(onnx_path)
    onnx_model = ORTModelForImageSegmentation(session)
    inputs = feature_extractor(pil_image, return_tensors="pt")
    outputs = onnx_model(**inputs)
    

    I could not test GPU inference because I could not manage to make onnxruntime-gpu work:

    onnx_model.to('cuda:0')
    >>>  File "C:\Users\theol\Documents\GitHub\Repositories\optimum\optimum\onnxruntime\modeling_ort.py", line 202, in to
        validate_provider_availability(provider)  # raise error if the provider is not available
    >>>  File "C:\Users\theol\Documents\GitHub\Repositories\optimum\optimum\onnxruntime\utils.py", line 227, in validate_provider_availability
        raise ImportError(
    >>>ImportError: Asked to use CUDAExecutionProvider, but `onnxruntime-gpu` package was not found. Make sure to install `onnxruntime-gpu` package instead of `onnxruntime`.
    

    Might be because of local venv setup issues on my side. My CUDA installation is working for transformers with torch models. Still, it probably would not work properly yet because of the wrong output size in prepare_logits_buffer

    Remaining tasks

    • Fixing proper output size for io binding
    • Uploading a .onnx segmentation model to https://huggingface.co/hf-internal-testing and modify IMAGE_SEGMENTATION_EXAMPLE checkpoint name and image url to appropriate example. (See two comments at optimum/onnxruntime/modeling_ort.py lines 1463 and 1533)
    • Modify test class model to a SemanticSegmentation model in order to get working tests

    @michaelbenayoun @JingyaHuang your help would be appreciated 👍

    opened by TheoMrc 17
  • Saving external data for large ONNX models


    What does this PR do?

    Fixes #254 and https://github.com/huggingface/optimum/issues/377

    We can now load and save ORT models that have external data 🚀

    opened by NouamaneTazi 16
  • onnx speed is even slower


    System Info

    win10, python 3.8.4, pytorch 12.1 (cpu), transformers 4.22.2, optimum 1.4.0, onnxruntime 1.12.1

    Who can help?

    @Narsil @patil-suraj

    Information

    • [X] The official example scripts
    • [ ] My own modified scripts

    Tasks

    • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • [ ] My own task or dataset (give details below)

    Reproduction

    from transformers import AutoTokenizer, pipeline
    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    import warnings
    import time

    warnings.filterwarnings("ignore")

    text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
    textlists = [text, text, text, text, text]

    model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
    model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    model.save_pretrained("onnx")
    tokenizer.save_pretrained("onnx")

    onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
    t1 = time.time()
    result = onnx_translation(textlists)
    print(result, time.time() - t1)

    from transformers import MarianTokenizer, MarianMTModel

    modchoice = "Helsinki-NLP/opus-mt-en-zh"
    tokenizer = MarianTokenizer.from_pretrained(modchoice)
    model = MarianMTModel.from_pretrained(modchoice)

    t1 = time.time()
    encoded = tokenizer.prepare_seq2seq_batch(textlists, truncation=True, padding="longest", return_tensors="pt")
    encoded.to(device)
    translated = model.generate(**encoded)

    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    print(tgt_text, time.time() - t1)

    Batch processing is much slower, and single processing is only a little faster

    Expected behavior

    Faster batch processing

    inference onnxruntime 
    opened by chaodreaming 14
  • Issue to use GPT2 ONNX export with past key values


    System Info

    python: 3.10.6
    platform: Ubuntu 22.10
    optimum version: 1.5.1
    onnxruntime: 1.13.1
    

    Who can help?

    @JingyaHuang @ec

    Information

    • [ ] The official example scripts
    • [X] My own modified scripts

    Tasks

    • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • [X] My own task or dataset (give details below)

    Reproduction

    Command line to export a GPT2 model:

    python -m optimum.exporters.onnx --model gpt2 --task causal-lm-with-past output/
    

    Gives the following output logs:

    Framework not specified. Using pt to export to ONNX.
    Using framework PyTorch: 1.13.0+cu117
    Overriding 2 configuration item(s)
    	- use_cache -> True
    	- pad_token_id -> 0
    /home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py:796: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
      if batch_size <= 0:
    /home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py:185: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
      attn_weights = attn_weights / torch.tensor(
    /home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py:185: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
      attn_weights = attn_weights / torch.tensor(
    /home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py:200: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
      mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
    Validating ONNX model...
        -[✓] ONNX model output names match reference model (present.1.value, present.0.key, present.6.key, present.6.value, present.5.value, present.8.key, present.0.value, present.2.key, present.5.key, present.10.key, present.9.value, present.10.value, logits, present.4.value, present.7.key, present.11.value, present.3.value, present.3.key, present.4.key, present.2.value, present.1.key, present.9.key, present.11.key, present.8.value, present.7.value)
        - Validating ONNX Model output "logits":
            -[✓] (2, 16, 50257) matches (2, 16, 50257)
            -[x] values not close enough, max diff: 0.0013427734375 (atol: 1e-05)
        - Validating ONNX Model output "present.0.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.0.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.1.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.1.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.2.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.2.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.3.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.3.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.4.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.4.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.5.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.5.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.6.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.6.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.7.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.7.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.8.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.8.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.9.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.9.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.10.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.10.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.11.key":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.11.value":
            -[✓] (2, 12, 32, 64) matches (2, 12, 32, 64)
            -[✓] all values close (atol: 1e-05)
    An error occured, but the model was saved at: model_repository/gpt2/1/model.onnx
    

    Even though there is an error in the values-close validation, that's OK. Now I would like to run the model with the following Python:

    from optimum.onnxruntime import ORTModelForCausalLM
    from transformers import GPT2Tokenizer
    
    model = ORTModelForCausalLM.from_pretrained("output/", from_transformers=False, use_cache=True)
    tokenizer = GPT2Tokenizer.from_pretrained("output/")
    tokens = tokenizer("My name is Julien and I like", return_tensors="pt")
    outputs_model = model.generate(**tokens)
    

    And I get the following error:

    /home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/generation_utils.py:1359: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 20 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
      warnings.warn(
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/generation_utils.py", line 1490, in generate
        return self.greedy_search(
      File "/home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/generation_utils.py", line 2233, in greedy_search
        outputs = self(
      File "/home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/optimum/modeling_base.py", line 60, in __call__
        return self.forward(*args, **kwargs)
      File "/home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 1454, in forward
        outputs = self.model.run(None, onnx_inputs)
      File "/home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 196, in run
        raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
    ValueError: Model requires 26 inputs. Input Feed contains 2
    

    Do I have to feed the past_key_values.X.key and past_key_values.X.value inputs myself?

    When I try to do this directly with onnxruntime, I also get an error. Here is what I do:

    import onnxruntime as ort
    from transformers import GPT2Tokenizer
    import numpy as np
    
    sess = ort.InferenceSession('output/model.onnx', providers=["CPUExecutionProvider"])
    tokenizer = GPT2Tokenizer.from_pretrained("output/")
    tokens = dict(tokenizer("My name is Julien and I like", return_tensors="np"))
    shape = (1, 12, len(tokens["input_ids"][0]), 64)
    
    for i in range(12):
        tokens[f"past_key_values.{i}.key"] = np.random.uniform(0, 1, shape).astype(np.float32)
        tokens[f"past_key_values.{i}.value"] = np.random.uniform(0, 1, shape).astype(np.float32)
    
    sess.run(None, tokens)
    

    And I get the following error:

    2022-12-06 16:42:17.603173515 [E:onnxruntime:, sequential_executor.cc:369 Execute] Non-zero status code returned while running Add node. Name:'/transformer/h.0/attn/Add' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:503 void onnxruntime::BroadcastIterator::Init(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 8 by 16
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/jplu/anaconda3/envs/transformers/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
        return self._sess.run(output_names, input_feed, run_options)
    onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'/transformer/h.0/attn/Add' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:503 void onnxruntime::BroadcastIterator::Init(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 8 by 16
    

    Expected behavior

    I expect to have a proper generation and usage with onnxruntime. The final goal is to use it through a Triton server.

    I am certainly missing something, but the documentation is not clear on how to properly use seq2seq and causal-lm models with past key values, either directly with onnxruntime or with optimum.

    Thanks a lot in advance for any advice you could provide :)

    bug 
    opened by jplu 13
  • Added support for Tapas Model


    What does this PR do?

    Fixes #20372

    Before submitting

    • This PR adds new support for BetterTransformer integration for the Tapas model.
    • This PR adds documentation indicating that BetterTransformer integration for Tapas has been added.

    Questions

    • Can I ask how I can test the BetterTransformer feature added for the Tapas model?

    To: @younesbelkada, @sgugger

    opened by JuheonChu 13
  • Inference worse with onnxruntime-gpu than native pytorch for seq2seq model


    System Info

    Optimum: 1.4.1.dev0
    torch: 1.12.1+cu116
    onnx: 1.12.0
    onnxruntime-gpu: 1.12.1
    python: 3.8.13
    CUDA: 11.6
    cudnn: 8.4.1
    RTX 3090
    

    Who can help?

    @JingyaHuang @echarlaix

    Information

    • [X] The official example scripts
    • [ ] My own modified scripts

    Tasks

    • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • [ ] My own task or dataset (give details below)

    Reproduction

    I compared inference on GPU of a native torch Helsinki-NLP/opus-mt-fr-en model with that of the ONNX model optimized thanks to the Optimum library. So, I have defined a FastAPI microservice based on the two classes below, for GPU inference with native torch and optimized ONNX respectively:

    class Seq2SeqModel:
        tokenizer: Optional[MarianTokenizer]
        model: Optional[MarianMTModel]
    
        def load_model(self):
            """Loads the model"""
            # model_id="Helsinki-NLP/opus-mt-fr-en"
            model_path = Path("./app/artifacts/HF")
            tokenizer = AutoTokenizer.from_pretrained(model_path)
            model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to("cuda")
            self.tokenizer = tokenizer
            self.model = model
    
        async def predict(self, input: PredictionInput) -> PredictionOutput:
            """Runs a prediction"""
            if not self.tokenizer or not self.model:
                raise RuntimeError("Model is not loaded")
            tokens = self.tokenizer(input.text, return_tensors="pt").to("cuda")
            translated = self.model.generate(**tokens, num_beams=beam_size)
            return PredictionOutput(translated_text=self.tokenizer.decode(translated[0], skip_special_tokens=True))
    
    class OnnxOptimizedSeq2SeqModel:
        tokenizer: Optional[MarianTokenizer]
        model: Optional[ORTModelForSeq2SeqLM]
    
        def load_model(self):
            """Loads the model"""
            # model_id="Helsinki-NLP/opus-mt-fr-en"
            onnx_path = Path("./app/artifacts/OL_1")
            tokenizer = AutoTokenizer.from_pretrained(onnx_path)
            optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
                onnx_path,
                encoder_file_name="encoder_model_optimized.onnx",
                decoder_file_name="decoder_model_optimized.onnx",
                decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
                provider="CUDAExecutionProvider"
            )
            self.tokenizer = tokenizer
            self.model = optimized_model
    
    app = FastAPI()
    seq2seq_model = Seq2SeqModel()
    onnx_optimized_seq2seq_model = OnnxOptimizedSeq2SeqModel()
    beam_size = 3
    
    @app.on_event("startup")
    async def startup():
        seq2seq_model.load_model()
        onnx_optimized_seq2seq_model.load_model()
    
    @app.post("/prediction")
    async def prediction(
        output: PredictionOutput = Depends(seq2seq_model.predict),
    ) -> PredictionOutput:
        return output
    
    @app.post("/prediction_onnx_optimized")
    async def prediction(
        output: PredictionOutput = Depends(onnx_optimized_seq2seq_model.predict),
    ) -> PredictionOutput:
        return output
    

    Expected behavior

    When load testing the model on my local computer, I was surprised by two things:

    1. The performance on GPU of the optimized ONNX model is worse than with native torch (maybe linked to #365 and #396?):

    (benchmark screenshots: GPU_optimized_onnxruntime vs. GPU_torch)

    2. When running this FastAPI service in a docker image, I got the following warning:

    2022-09-28 08:20:21.214094612 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:566 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.

    Does this mean the CUDAExecutionProvider is not working even though I set it here?:

            optimized_model = ORTModelForSeq2SeqLM.from_pretrained(
                onnx_path,
                encoder_file_name="encoder_model_optimized.onnx",
                decoder_file_name="decoder_model_optimized.onnx",
                decoder_file_with_past_name="decoder_with_past_model_optimized.onnx",
                provider="CUDAExecutionProvider"
            )
    

    What could have caused that? I saw in https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html that CUDA 11.6 is not mentioned, could it be this?

    bug inference onnxruntime 
    opened by Matthieu-Tinycoaching 12
  • [ORT] Filter out invalid inputs in ORTModelForXXX forward pass


    Context

    TL;DR

    Transformers #17617

    Long story to tell... For the DeBERTa model, the tokenizer gives out token_type_ids by default. However, the exported IR might not contain token_type_ids (e.g. the case when config.type_vocab_size=0 if exported by transformers.onnx.export). In this situation:

    1. The forward pass will fail if the user directly passes the tokenizer output as input (as our snippet does).
    2. Otherwise, they need to add another line to filter out the invalid inputs themselves, which requires a deeper understanding of the model and its tokenizer.

    Considering the user experience, I think we should add this filter directly in ORTModelForXXX.
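
    For illustration, here is a minimal sketch of the kind of filtering meant here, done by hand against the session's declared inputs (this is not the actual ORTModelForXXX implementation, and the ONNX file path is a placeholder):

    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

    # Only keep the inputs that the exported graph actually declares
    expected_inputs = {model_input.name for model_input in session.get_inputs()}
    encoded = tokenizer("Hello world!", return_tensors="np")  # may contain token_type_ids
    onnx_inputs = {name: value for name, value in encoded.items() if name in expected_inputs}

    outputs = session.run(None, onnx_inputs)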

    What does this PR do?

    • Filter out invalid inputs in ORTModelForXXX.

    Fixes #207

    opened by JingyaHuang 12
  • Unable to use GPU accelerated Optimum Onnx transformer model for inference


    System Info

    Optimum Version: 1.5.0
    Ubuntu 20.04 Linux 
    Python version 3.8
    

    Who can help?

    @JingyaHuang @echarlaix When following the documentation at https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/gpu for optimum version 1.5.0, we get the following error:


    RuntimeError                              Traceback (most recent call last)
         19 "education",
         20 "music"]
    ---> 21 pred = onnx_z0(sequence_to_classify, candidate_labels, multi_class=False)

    8 frames
    /usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in bind_input(self, name, device_type, device_id, element_type, shape, buffer_ptr)
        454     :param buffer_ptr: memory pointer to input data
        455     """
    --> 456     self._iobinding.bind_input(
        457         name,
        458         C.OrtDevice(

    RuntimeError: Error when binding input: There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]

    This is reproducible on a Google Colab GPU instance as well. It is observed from version 1.5.0 only; 1.4.1 works as expected.

    Information

    • [X] The official example scripts
    • [ ] My own modified scripts

    Tasks

    • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • [ ] My own task or dataset (give details below)

    Reproduction

    !pip install optimum[onnxruntime-gpu]==1.5.1
    !pip install transformers onnx

    from optimum.onnxruntime import ORTModelForSequenceClassification
    from optimum.pipelines import pipeline
    from transformers import AutoTokenizer

    ort_model = ORTModelForSequenceClassification.from_pretrained(
        "philschmid/tiny-bert-sst2-distilled",
        from_transformers=True,
        provider="CUDAExecutionProvider",
    )

    tokenizer = AutoTokenizer.from_pretrained("philschmid/tiny-bert-sst2-distilled")

    pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer)
    result = pipe("Both the music and visual were astounding, not to mention the actors performance.")
    print(result)

    Expected behavior

    Inference fails due to device error, which is not expected.

    bug 
    opened by smiraldr 11
  • ONNX transformation to cast int64 constants to int32 when possible


    As per title.

    Partially fixes #627; we still need to integrate this in the CLI, and to document and test it.

    Try with:

    import onnx
    from pathlib import Path
    from optimum.onnx import model_to_int32
    
    path = "/path/to/decoder_model.onnx"
    model = onnx.load(path)
    
    model = model_to_int32(model)
    
    onnx.save(
        model,
        path,
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location=Path(path).name + "_data",
    )
    
    onnx.checker.check_model(path)
    

    Inspect the "Slice" nodes of the original and transformed models.

    opened by fxmarty 2
  • Fix provider options when several providers are passed


    When several providers are passed to the InferenceSession, which is the case when TensorrtExecutionProvider is chosen, the provider_options argument needs to be of the same length as providers, otherwise ONNX Runtime raises:

    EP Error using ['TensorrtExecutionProvider', 'CUDAExecutionProvider']
    Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
    

    Reference: https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession
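
    For reference, a minimal sketch of a consistent call (the file name and the TensorRT option are only examples):

    import onnxruntime as ort

    # provider_options must contain one entry per provider, empty dicts included
    session = ort.InferenceSession(
        "model.onnx",
        providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
        provider_options=[{"trt_fp16_enable": True}, {}, {}],
    )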

    This was untested up to now. Still need to add a test for this PR.

    In a next PR: remove all the code duplication for load_model() in modeling_ort.py, modeling_decoder.py, modeling_seq2seq.py. But I won't do it in this PR.

    This should fix https://github.com/huggingface/optimum/issues/606 https://github.com/huggingface/optimum/issues/605

    opened by fxmarty 1
  • Support generation config in ORTModel


    This PR adds support for generation config in ORTModel, following https://github.com/huggingface/transformers/pull/20388

    Note: we should really add nightly tests tracking on transformers/diffusers main.

    opened by fxmarty 3
  • Fix uninformative message when passing `use_cache=True` to ORTModel and no ONNX with cache is available


    As per title,

    from optimum.onnxruntime import ORTModelForCausalLM
    
    ort_model = ORTModelForCausalLM.from_pretrained("/path/to/gpt2_onnx", use_cache=True)
    

    raises

      File "/home/fxmarty/hf_internship/optimum/optimum/onnxruntime/modeling_decoder.py", line 536, in _from_pretrained
        decoder_with_past_path = ORTModelDecoder.infer_onnx_filename(
      File "/home/fxmarty/hf_internship/optimum/optimum/onnxruntime/modeling_ort.py", line 351, in infer_onnx_filename
        raise FileNotFoundError(f"Could not find any ONNX model file in {path}")
    FileNotFoundError: Could not find any ONNX model file in /home/fxmarty/hf_internship/optimum/gpt2_onnx
    

    which is not informative. With this PR, we get:

    FileNotFoundError: The parameter `use_cache=True` was passed to ORTModelDecoder.from_pretrained() but no ONNX file using past key values could be found in /home/fxmarty/hf_internship/optimum/gpt2_onnx, with the error:
        Could not find any ONNX model file for the regex (.*)?decoder(.*)?with_past(.*)?\.onnx in /home/fxmarty/hf_internship/optimum/gpt2_onnx.
    
    opened by fxmarty 1
  • Added mapping for prophetnet


    What does this PR do?

    Opening up draft PR to start discussion on how to add Better Transformer support for ProphetNet

    Fixes #488

    Before submitting

    • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
    • [ ] Did you make sure to update the documentation with your changes?
    • [ ] Did you write any new necessary tests?
    opened by adit299 2
  • Enable merged decoder in ORTModel


    What does this PR do?

    Enable the use of merged decoders in ORT modeling.

    • [x] Check if it works for large proto, and add a saving option.
    • [ ] Adapt ORTModels to be able to use merged model (New input use_cache, dummy inputs for past_key_values...)
    • [ ] Check if merged ONNX model works for IOBinding (introduces new input use_cache, but dlpack doesn't support dtype=bool )

    To discuss:

    • Where should the merging be applied?
    • Shall it be automatically applied?
    opened by JingyaHuang 1
Releases (latest: v1.6.1)
  • v1.6.1(Dec 23, 2022)

    Hotfixes

    • Revert breaking removal of EncoderOnnxConfig, DecoderOnnxConfig, _DecoderWithLMhead by @fxmarty in https://github.com/huggingface/optimum/pull/643
    • Fix item access of some _TASKS_TO_AUTOMODELS by @fxmarty in https://github.com/huggingface/optimum/pull/642

    Full Changelog: https://github.com/huggingface/optimum/compare/v1.6.0...v1.6.1

  • v1.6.0(Dec 23, 2022)

    Optimum CLI

    The Optimum command line interface is introduced, and is now the official entrypoint for the ONNX export. Example commands:

    optimum-cli --help
    optimum-cli export onnx --help
    optimum-cli export onnx --model bert-base-uncased --task sequence-classification bert_onnx/
    
    • Add Optimum CLI backbone by @fxmarty in https://github.com/huggingface/optimum/pull/593

    Stable Diffusion ONNX export

    Optimum now supports the ONNX export of Stable Diffusion models from the diffusers library:

    optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/
    
    • Add Stable Diffusion ONNX export by @echarlaix in https://github.com/huggingface/optimum/pull/570

    BetterTransformer support for more architectures

    BetterTransformer integration includes new models in this release: CLIP, RemBERT, mBART, ViLT, FSMT

    The complete list of supported models is available in the documentation.

    • [BT] Add Bettertransformer support for FSMT by @Sumanth077 in https://github.com/huggingface/optimum/pull/494
    • [BT] add BetterTransformer support for ViLT architecture by @ka00ri in https://github.com/huggingface/optimum/pull/508
    • Add MBart support for BetterTransformer by @ravenouse in https://github.com/huggingface/optimum/pull/516
    • Add CLIP BetterTransformer by @fxmarty in https://github.com/huggingface/optimum/pull/534
    • Add BetterTransformer support for RemBERT by @hchings in https://github.com/huggingface/optimum/pull/545
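
    For instance, FSMT, one of the architectures added above, can be converted like any other supported model (a minimal sketch; the checkpoint is only an example):

    from transformers import AutoModel
    from optimum.bettertransformer import BetterTransformer

    model = AutoModel.from_pretrained("facebook/wmt19-en-de")  # an FSMT checkpoint
    model = BetterTransformer.transform(model)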

    ONNX export for more architectures

    The ONNX export now supports Swin, MobileNet-v1, MobileNet-v2.

    • Add Swin support in exporters.onnx by @fxmarty in https://github.com/huggingface/optimum/pull/528
    • [ONNX] add mobilenet support by @younesbelkada in https://github.com/huggingface/optimum/pull/633

    Extended ONNX export for encoder-decoder and decoder models

    Encoder-decoder or decoder-only models that normally make use of the generate() method in transformers can now be exported as several files using the --for-ort argument:

    optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_small_onnx
    

    yielding:

    .
    └── t5_small_onnx
        ├── config.json
        ├── decoder_model.onnx
        ├── decoder_with_past_model.onnx
        ├── encoder_model.onnx
        ├── special_tokens_map.json
        ├── spiece.model
        ├── tokenizer_config.json
        └── tokenizer.json
    

    When exported with --for-ort, models are expected to be loadable directly into an ORTModel.
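
    For example, the t5_small_onnx folder above can then be loaded as follows (a minimal sketch, assuming the export above succeeded):

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5_small_onnx")
    model = ORTModelForSeq2SeqLM.from_pretrained("t5_small_onnx")

    inputs = tokenizer("translate English to French: Hello, how are you?", return_tensors="pt")
    print(tokenizer.batch_decode(model.generate(**inputs), skip_special_tokens=True))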

    • Add ort export in exporters for encoder-decoder models by @mht-sharma in https://github.com/huggingface/optimum/pull/497
    • Support decoder generated with --for-ort from optimum.exporters.onnx in ORTDecoder by @fxmarty in https://github.com/huggingface/optimum/pull/554

    Support for ONNX models with external data at export, optimization, quantization

    The ONNX export from PyTorch normally creates external data when the exported model is larger than 2 GB. This release introduces better support for the export and use of large models, writing all external data into a .onnx_data file if necessary.

    • Handling ONNX models with external data by @NouamaneTazi in https://github.com/huggingface/optimum/pull/586
    • Improve the compatibility dealing with large ONNX proto in ORTOptimizer and ORTQuantizer by @JingyaHuang in https://github.com/huggingface/optimum/pull/332

    ONNX Runtime API improvement

    Various improvements to allow for a better user experience in the ONNX Runtime integration:

    • ORTModel, ORTModelDecoder and ORTModelForConditionalGeneration can now load any ONNX model file regardless of its name, making it possible to load optimized and quantized models without having to specify a file name argument (see the sketch after this list).

    • ORTModel.from_pretrained() with from_transformers=True now downloads and loads the model in a temporary directory instead of the cache, which was not the right place to store it.

    • ORTQuantizer.save_pretrained() now saves the model configuration and the preprocessor, making the exported directory usable end-to-end.

    • ORTOptimizer.save_pretrained() now saves the preprocessor, making the exported directory usable end-to-end.

    • ONNX Runtime integration API improvement by @michaelbenayoun in https://github.com/huggingface/optimum/pull/515
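
    As an illustration of the first point, a sketch (the directory name is a placeholder, assumed to contain a single, arbitrarily named ONNX file):

    from optimum.onnxruntime import ORTModelForSequenceClassification

    # No file_name argument needed: the ONNX file found in the directory is used,
    # whether it is called model.onnx, model_optimized.onnx or model_quantized.onnx.
    model = ORTModelForSequenceClassification.from_pretrained("path/to/onnx_dir")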

    Custom shapes support at ONNX export

    The shape of the example input provided for the ONNX export can be overridden in case the validity of the ONNX model is sensitive to the shapes used during the export.

    Read more: optimum-cli export onnx --help

    • Support custom shapes for dummy inputs by @fxmarty in https://github.com/huggingface/optimum/pull/522
    • Support for custom input shapes in exporters onnx by @fxmarty in https://github.com/huggingface/optimum/pull/575

    Enable use_cache=True for ORTModelForCausalLM

    Reusing past key values for models using ORTModelForCausalLM (e.g. gpt2) is now possible with use_cache=True, avoiding recomputing them at each decoding iteration:

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForCausalLM
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True, use_cache=True)
    
    inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
    
    gen_tokens = model.generate(**inputs)
    tokenizer.batch_decode(gen_tokens)
    
    • Enable past_key_values for ORTModelForCausalLM by @echarlaix in https://github.com/huggingface/optimum/pull/326

    IO binding support for ORTModelForCustomTasks

    ORTModelForCustomTasks now supports IO Binding when using CUDAExecutionProvider.
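
    A minimal sketch (the checkpoint is only an example exposing a custom pooler output, and onnxruntime-gpu is assumed to be installed):

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTModelForCustomTasks

    model_id = "optimum/sbert-all-MiniLM-L6-with-pooler"
    model = ORTModelForCustomTasks.from_pretrained(model_id, provider="CUDAExecutionProvider")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("I love burritos!", return_tensors="pt").to("cuda")
    outputs = model(**inputs)  # use_io_binding defaults to True on CUDA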

    • Add IO binding support for custom ORTModel by @JingyaHuang in https://github.com/huggingface/optimum/pull/447

    Experimental support to merge ONNX decoder with/without past key values

    Along with --for-ort, passing --task causal-lm-with-past, --task seq2seq-lm-with-past or --task speech2seq-lm-with-past during the ONNX export produces two models: one not using the previously computed keys/values, and one using them.

    Experimental support is introduced to merge the two models into one. Example:

    optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_onnx/
    
    import onnx
    from optimum.onnx import merge_decoders
    
    decoder = onnx.load("t5_onnx/decoder_model.onnx")
    decoder_with_past = onnx.load("t5_onnx/decoder_with_past_model.onnx")
    
    merged_model = merge_decoders(decoder, decoder_with_past)
    onnx.save(merged_model, "t5_onnx/decoder_merged_model.onnx")
    
    • Merge ONNX decoder models by @JingyaHuang in https://github.com/huggingface/optimum/pull/587

    Major bugs fixed

    • Fix BetterTransformer with padding="max_length" by @fxmarty in https://github.com/huggingface/optimum/pull/543
    • Fix non-nesting bug in BetterTransformer integration by @younesbelkada in https://github.com/huggingface/optimum/pull/637

    Other changes, bugfixes and improvements

    • Fix doc-builder premission error by @mishig25 in https://github.com/huggingface/optimum/pull/482
    • Fix doc build pr premissions by @mishig25 in https://github.com/huggingface/optimum/pull/484
    • Re-order the task manager doc by @michaelbenayoun in https://github.com/huggingface/optimum/pull/483
    • Fix whisper device for gpu test by @fxmarty in https://github.com/huggingface/optimum/pull/486
    • Fix tensorflow CI by @fxmarty in https://github.com/huggingface/optimum/pull/489
    • Fix PR doc generation by @regisss in https://github.com/huggingface/optimum/pull/495
    • Fix broken links in the doc by @fxmarty in https://github.com/huggingface/optimum/pull/499
    • Update iobinding ORT encoder whisper by @mht-sharma in https://github.com/huggingface/optimum/pull/498
    • fix NormalizedConfig init error message by @PaulQbFeng in https://github.com/huggingface/optimum/pull/500
    • Change import structure for ORTModel by @fxmarty in https://github.com/huggingface/optimum/pull/456
    • [BT] Fix failing CI tests by @younesbelkada in https://github.com/huggingface/optimum/pull/501
    • Remove redundant condition statement in ORTDecoder(Seq2seq) by @JingyaHuang in https://github.com/huggingface/optimum/pull/504
    • [BT] put decorator on the correct place by @younesbelkada in https://github.com/huggingface/optimum/pull/509
    • [BT] clearer error message for norm_first by @younesbelkada in https://github.com/huggingface/optimum/pull/510
    • Deprecate PyTorch 1.12. for BetterTransformer by @fxmarty in https://github.com/huggingface/optimum/pull/513
    • Fix ORTModelForSeq2SeqLM test by @fxmarty in https://github.com/huggingface/optimum/pull/455
    • Clearer error messages when initilizing the requested ONNX Runtime execution provider fails by @fxmarty in https://github.com/huggingface/optimum/pull/514
    • [BT] Fix doc bugs by @younesbelkada in https://github.com/huggingface/optimum/pull/517
    • Replace sklearn by scikit-learn by @lesteve in https://github.com/huggingface/optimum/pull/502
    • ORTModel uses optimum.exporters.onnx by @michaelbenayoun in https://github.com/huggingface/optimum/pull/490
    • Cleanup deprecated ONNX Runtime training docker files by @JingyaHuang in https://github.com/huggingface/optimum/pull/523
    • Added support for Tapas Model by @JuheonChu in https://github.com/huggingface/optimum/pull/520
    • Add benchmark results to gpu doc by @JingyaHuang in https://github.com/huggingface/optimum/pull/525
    • ORTModelForConditionalGeneration uses optimum.exporters.onnx by @mht-sharma in https://github.com/huggingface/optimum/pull/529
    • Better error message when wrong task is given to exporters by @fxmarty in https://github.com/huggingface/optimum/pull/531
    • Add OrtModelForSpeechSeq2Seq to doc by @fxmarty in https://github.com/huggingface/optimum/pull/533
    • Fold sections by default in the documentation's side-bar by @regisss in https://github.com/huggingface/optimum/pull/535
    • Import GenerationMixin from transformers.generation if transformers >= 4.25.0 by @regisss in https://github.com/huggingface/optimum/pull/536
    • Add check_if_transformers_greater to manage different versions of transformers by @regisss in https://github.com/huggingface/optimum/pull/537
    • Enable to push some sections to the end of the TOC in the doc by @regisss in https://github.com/huggingface/optimum/pull/532
    • Fix import in ONNX export CLI by @fxmarty in https://github.com/huggingface/optimum/pull/553
    • Update readme by @echarlaix in https://github.com/huggingface/optimum/pull/550
    • Refactor of 2 functions used in ORTModel by @michaelbenayoun in https://github.com/huggingface/optimum/pull/551
    • Update readme by @echarlaix in https://github.com/huggingface/optimum/pull/556
    • Fix ORTTrainer wrapper duplication / PyTorch evaluate / update with transformers 4.25.1 by @JingyaHuang in https://github.com/huggingface/optimum/pull/561
    • Fix flaky BetterTransformer test by @fxmarty in https://github.com/huggingface/optimum/pull/564
    • enable FP16Optimizer for fp16 deepspeed training. by @AdamLouly in https://github.com/huggingface/optimum/pull/547
    • Update documentation quick tour section by @echarlaix in https://github.com/huggingface/optimum/pull/574
    • Move custom IOBinding to IOBindingHelper by @JingyaHuang in https://github.com/huggingface/optimum/pull/571
    • Add test for exporters.onnx CLI by @fxmarty in https://github.com/huggingface/optimum/pull/573
    • Documentation on quantization by @michaelbenayoun in https://github.com/huggingface/optimum/pull/565
    • More robust tests for ORTModel using decoders and use_cache=True by @fxmarty in https://github.com/huggingface/optimum/pull/576
    • Fix errors in onnxruntime modeling tests by @fxmarty in https://github.com/huggingface/optimum/pull/585
    • [BT] fix flaky test by @younesbelkada in https://github.com/huggingface/optimum/pull/591
    • Fix exporters onnx shapes by @fxmarty in https://github.com/huggingface/optimum/pull/581
    • Fix exporters.onnx tests by @fxmarty in https://github.com/huggingface/optimum/pull/584
    • Update on the ONNX Runtime documentation by @michaelbenayoun in https://github.com/huggingface/optimum/pull/567
    • Add the ORTModelForSemanticSegmentation class by @TheoMrc in https://github.com/huggingface/optimum/pull/539
    • Refactor BetterTransformer to be able to raise more informative error messages by @fxmarty in https://github.com/huggingface/optimum/pull/594
    • Constraint temprarily NumPy version to save CIs by @JingyaHuang in https://github.com/huggingface/optimum/pull/614
    • Add encoder_last_hidden_state as an output for encoder-decoder models by @fxmarty in https://github.com/huggingface/optimum/pull/601
    • Update dev version by @fxmarty in https://github.com/huggingface/optimum/pull/617
    • Fix documentation example by @echarlaix in https://github.com/huggingface/optimum/pull/603
    • Documentation improvements by @fxmarty in https://github.com/huggingface/optimum/pull/598
    • More informative message at ONNX export by @fxmarty in https://github.com/huggingface/optimum/pull/609
    • Use optimum exporter for current weight sharing test by @JingyaHuang in https://github.com/huggingface/optimum/pull/616
    • OnnxConfig now handle the export to encoder / decoder / decoder_with_past themselves by @michaelbenayoun in https://github.com/huggingface/optimum/pull/590
    • Set explictly the device index by @JingyaHuang in https://github.com/huggingface/optimum/pull/613
    • Fix ORT GPU test by @JingyaHuang in https://github.com/huggingface/optimum/pull/624
    • Add GPT-J normalized config by @fxmarty in https://github.com/huggingface/optimum/pull/623
    • Remove diffusers dependency in onnxruntime code by @fxmarty in https://github.com/huggingface/optimum/pull/619
    • Use exporters in ORTTrainer by @mht-sharma in https://github.com/huggingface/optimum/pull/546
    • Improve use_io_binding default value for different execution providers by @JingyaHuang in https://github.com/huggingface/optimum/pull/604
    • fixed FuseBiasInLinear by specifying device by @IlyasMoutawwakil in https://github.com/huggingface/optimum/pull/630
    • Fixed GPU documentation for HF pipelines by @smiraldr in https://github.com/huggingface/optimum/pull/602
    • Add argument in the CLI to specify device to do the ONNX export on by @fxmarty in https://github.com/huggingface/optimum/pull/634
    • Allow kwargs in all generate_dummy_inputs() methods by @fxmarty in https://github.com/huggingface/optimum/pull/638

    Full Changelog: https://github.com/huggingface/optimum/compare/v1.5.2...v1.6.0

    Significant community contributions

    The following contributors have made significant changes to the library over the last release:

    • @TheoMrc
      • Add ORTModelForSemanticSegmentation https://github.com/huggingface/optimum/pull/539
    • @ravenouse
      • Add MBart support for BetterTransformer https://github.com/huggingface/optimum/pull/516
    • @ka00ri
      • Add BetterTransformer support for ViLT architecture https://github.com/huggingface/optimum/pull/508
    • @Sumanth077
      • Add Bettertransformer support for FSMT https://github.com/huggingface/optimum/pull/494
  • v1.5.2(Dec 19, 2022)

  • v1.5.1(Nov 24, 2022)

  • v1.5.0(Nov 17, 2022)

    BetterTransformer

    Convert your model into its PyTorch BetterTransformer format with a one-liner, thanks to the new BetterTransformer integration, for faster inference on CPU and GPU!

    from optimum.bettertransformer import BetterTransformer
    
    model = BetterTransformer.transform(model)
    

    Check the full list of supported models in the documentation, and check out the Google Colab demo.

    Contributions

    • BetterTransformer integration (#423)
    • ViT and Wav2Vec2 support (#470)

    ONNX Runtime IOBinding support

    ORT models (except for ORTModelForCustomTasks) now support IOBinding to avoid data copying overheads between the host and device, which brings a significant inference speedup during the decoding process on GPU.

    By default, use_io_binding is set to True when using CUDA. You can turn off IOBinding if you run into memory issues:

    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    
    model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small", use_io_binding=False)
    

    Contributions

    • Add IOBinding support to ONNX Runtime module (#421)

    Optimum Exporters

    optimum.exporters is a new module that handles the export of PyTorch and TensorFlow models to several backends. Only ONNX is supported for now, and more than 50 architectures can already be exported, among them BERT, GPT-Neo, Bloom, T5, ViT, Whisper and CLIP.

    The export can be done via the CLI:

    python -m optimum.exporters.onnx --model openai/whisper-tiny.en whisper_onnx/
    

    For more information, check the documentation.

    Contributions

    • optimum.exporters creation (#403)
    • Automatic task detection (#445)

    Whisper

    • Whisper can be exported to ONNX using optimum.exporters.
    • Whisper can also be exported and run using optimum.onnxruntime; IOBinding is supported as well.

    Note: For now, the export produced by optimum.exporters cannot be used by ORTModelForSpeechSeq2Seq. To run inference, export Whisper directly with ORTModelForSpeechSeq2Seq, as in the sketch below. This will be solved in the next release.
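
    For reference, here is a minimal sketch of that direct export path. It assumes the from_transformers=True flag used by the other ORTModel classes of this release, and reuses the openai/whisper-tiny.en checkpoint from the CLI example above; the output directory name is arbitrary.

    from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
    
    # Export Whisper to ONNX and load it for ONNX Runtime inference in a single call;
    # from_transformers=True triggers the export, as in the other ORTModel examples.
    model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny.en", from_transformers=True)
    
    # Save the exported model for later use
    model.save_pretrained("whisper_onnx")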

    Contributions

    • Whisper support with optimum.onnxruntime and optimum.exporters (#420)

    Other contributions

    • ONNX Runtime training now supports ORT 1.13.1 and transformers 4.23.1 (#434)
    • ORTModel can load models from subfolders in a similar fashion as in transformers (#443)
    • ORTOptimizer has been refactored, and a factory class has been added to create common OptimizationConfigs (#457)
    • Fixes and updates in the documentation (#411, #432, #437, #441)
    • Fixes IOBinding (#454, #461)
  • v1.4.1(Oct 26, 2022)

    • Add inference with ORTModel to ORTTrainer and ORTSeq2SeqTrainer #189
    • Add InferenceSession options and provider to ORTModel #271
    • Add mT5 (#341) and Marian (#393) support to ORTOptimizer
    • Add batchnorm folding torch.fx transformations #348
    • The torch.fx transformations now use the marking methods mark_as_transformed, mark_as_restored, get_transformed_nodes #385
    • Update BaseConfig for transformers 4.22.0 release #386
    • Update ORTTrainer for transformers 4.22.1 release #388
    • Add extra ONNX Runtime quantization options #398
    • Add possibility to pass provider_options to ORTModel #401 (see the sketch after this list)
    • Add support to pass a specific device for ORTModel, as transformers does for pipelines #427
    • Fixes to support onnxruntime 1.13.1 #430
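
    Putting several of these additions together, below is a hedged sketch of selecting an execution provider and passing session and provider options to an ORTModel. The keyword names follow #271, #401 and #427, and the values shown are purely illustrative.

    import onnxruntime
    from optimum.onnxruntime import ORTModelForSequenceClassification
    
    # InferenceSession options (#271); limiting the number of threads is just an example
    session_options = onnxruntime.SessionOptions()
    session_options.intra_op_num_threads = 1
    
    model = ORTModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english",
        from_transformers=True,                    # export the checkpoint to ONNX on the fly
        provider="CUDAExecutionProvider",          # execution provider (#271)
        session_options=session_options,           # InferenceSession options (#271)
        provider_options={"device_id": 0},         # provider-specific options (#401)
    )
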
  • v1.4.0(Sep 8, 2022)

    ONNX Runtime

    • Refactorization of ORTQuantizer (#270) and ORTOptimizer (#294)
    • Add ONNX Runtime fused Adam Optimizer (#295)
    • Add ORTModelForCustomTasks allowing ONNX Runtime inference support for custom tasks (#303)
    • Add ORTModelForMultipleChoice allowing ONNX Runtime inference for models with multiple choice classification head (#358)

    Torch FX

    • Add FuseBiasInLinear, a transformation that fuses the weight and the bias of linear modules (#253)

    Improvements and bugfixes

    • Enable the possibility to disregard the precomputed past_key_values during ONNX Runtime inference of Seq2Seq models (#241)
    • Enable node exclusion from quantization for benchmark suite (#284)
    • Enable possibility to use a token authentication when loading a calibration dataset (#289)
    • Fix optimum pipeline when no model is given (#301)
  • v1.3.0(Jul 12, 2022)

    Torch FX

    The optimum.fx.optimization module (#232) provides a set of torch.fx graph transformations, along with classes and functions to write your own transformations and compose them.

    • The Transformation and ReversibleTransformation represent non-reversible and reversible transformations, and it is possible to write such transformations by inheriting from those classes
    • The compose utility function enables transformation composition (see the sketch after this list)
    • Two reversible transformations were added:
      • MergeLinears: merges linear layers that have the same input
      • ChangeTrueDivToMulByInverse: changes a division by a static value to a multiplication by its inverse
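
    Below is a minimal sketch of composing and applying these transformations. The BERT checkpoint and the use of transformers' symbolic_trace to obtain a torch.fx graph are illustrative assumptions, not part of the release notes.

    from transformers import BertModel
    from transformers.utils.fx import symbolic_trace
    from optimum.fx.optimization import ChangeTrueDivToMulByInverse, MergeLinears, compose
    
    # Trace the model to a torch.fx GraphModule (input names chosen for a BERT encoder)
    model = BertModel.from_pretrained("bert-base-uncased")
    traced_model = symbolic_trace(model, input_names=["input_ids", "attention_mask", "token_type_ids"])
    
    # Compose the two reversible transformations and apply them to the traced graph
    transformation = compose(MergeLinears(), ChangeTrueDivToMulByInverse())
    transformed_model = transformation(traced_model)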

    ORTModelForSeq2SeqLM

    ORTModelForSeq2SeqLM (#199) allows ONNX export and ONNX Runtime inference for Seq2Seq models.

    • When exported, Seq2Seq models are decomposed into three parts: the encoder, the decoder (consisting of the decoder with the language modeling head), and the decoder with pre-computed key/values as additional inputs.
    • This specific export comes from the fact that, during the first pass, the decoder has no pre-computed key/value hidden states, while during the rest of the generation past key/values are used to speed up sequential decoding.

    Below is an example that downloads a T5 model from the Hugging Face Hub, exports it through the ONNX format and saves it:

    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    
    # Load model from hub and export it through the ONNX format 
    model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)
    
    # Save the exported model in the given directory
    model.save_pretrained(output_dir)
    

    ORTModelForImageClassification

    ORTModelForImageClassification (#226) allows ONNX Runtime inference for models with an image classification head.

    Below is an example that downloads a ViT model from the Hugging Face Hub, exports it through the ONNX format and saves it:

    from optimum.onnxruntime import ORTModelForImageClassification
    
    # Load model from hub and export it through the ONNX format 
    model = ORTModelForImageClassification.from_pretrained("google/vit-base-patch16-224", from_transformers=True)
    
    # Save the exported model in the given directory
    model.save_pretrained(output_dir)
    

    ORTOptimizer

    Adds support for converting model weights from fp32 to fp16 by adding a new optimization parameter (fp16) to OptimizationConfig (#273).
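
    As a minimal sketch of how this flag can be set; the fp16 parameter comes from #273, while the optimization_level value is an assumption used here for illustration.

    from optimum.onnxruntime.configuration import OptimizationConfig
    
    # Enable graph optimization together with fp32 -> fp16 weight conversion (#273);
    # optimization_level=99 (all available graph optimizations) is an illustrative choice.
    optimization_config = OptimizationConfig(optimization_level=99, fp16=True)
    # The resulting configuration is then passed to an ORTOptimizer, as with the other
    # configuration objects of the library.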

    Pipelines

    Additional pipeline tasks are now supported; here is a list of the supported tasks along with the default model for each:

    Below is an example that downloads a T5 small model from the Hub and loads it with the transformers pipeline for translation:

    from transformers import AutoTokenizer, pipeline
    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    
    tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
    model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
    onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
    
    text = "What a beautiful day !"
    pred = onnx_translation(text)
    # [{'translation_text': "C'est une belle journรฉe !"}]
    

    Breaking change

    The default execution provider of the ORTModelForXXX classes is now set to CPUExecutionProvider (#203). Previously, if no execution provider was specified, it was set to CUDAExecutionProvider when a GPU was detected, and to CPUExecutionProvider otherwise.

  • v1.2.3(Jun 15, 2022)

  • v1.2.2(Jun 2, 2022)

    • Extend QuantizationPreprocessor to dynamic quantization (https://github.com/huggingface/optimum/pull/196)
    • Introduce unified approach to create transformers vs optimized models benchmark (https://github.com/huggingface/optimum/pull/194)
    • Bump huggingface_hub version and protobuf fix (https://github.com/huggingface/optimum/pull/205)
  • v1.2.1(May 13, 2022)

  • v1.2.0(May 10, 2022)

    ORTModel

    The ORTModelForXXX classes, such as ORTModelForSequenceClassification, were integrated with the Hugging Face Hub in order to easily export models through the ONNX format, load ONNX models, and save the resulting model and push it to the ๐Ÿค— Hub, using the save_pretrained and push_to_hub methods respectively. An already optimized and/or quantized ONNX model can also be loaded with the ORTModelForXXX classes using the from_pretrained method.

    Below is an example that downloads a DistilBERT model from the Hub, exports it through the ONNX format and saves it:

    from optimum.onnxruntime import ORTModelForSequenceClassification
    
    # Load model from hub and export it through the ONNX format 
    model = ORTModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english", 
        from_transformers=True
    )
    
    # Save the exported model
    model.save_pretrained("a_local_path_for_convert_onnx_model")
    

    Pipelines

    Built-in support for transformers pipelines was added. This allows us to leverage the same API as Transformers, with the power of accelerated runtimes such as ONNX Runtime.

    The currently supported tasks, along with the default model for each, are the following:

    • Text Classification (DistilBERT model fine-tuned on SST-2)
    • Question Answering (DistilBERT model fine-tuned on SQuAD v1.1)
    • Token Classification (BERT large fine-tuned on CoNLL2003)
    • Feature Extraction (DistilBERT)
    • Zero Shot Classification (BART model fine-tuned on MNLI)
    • Text Generation (DistilGPT2)

    Below is an example that downloads a RoBERTa model from the Hub, exports it through the ONNX format and loads it with transformers pipeline for question-answering.

    from transformers import AutoTokenizer, pipeline
    from optimum.onnxruntime import ORTModelForQuestionAnswering
    
    # load vanilla transformers and convert to onnx
    model = ORTModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2", from_transformers=True)
    tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
    
    # test the model using the transformers pipeline, with handle_impossible_answer for SQuAD v2
    optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, handle_impossible_answer=True)
    prediction = optimum_qa(
      question="What's my name?", context="My name is Philipp and I live in Nuremberg."
    )
    
    print(prediction)
    # {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}
    

    Improvements

    • Add the loss when performing the evaluation step with an instance of ORTTrainer, which was previously not available when inference was performed with ONNX Runtime (#152)
  • v1.1.1(Apr 26, 2022)

    Habana

    ONNX Runtime

    • Add the possibility to specify the execution provider in ORTModel.
    • Add the IncludeFullyConnectedNodes class to find the nodes composing the fully connected layers, so that only those are targeted for quantization, limiting the accuracy drop.
    • Update QuantizationPreprocessor so that the intersection of the set of nodes to quantize and the set of nodes to exclude from quantization is empty.
    • Rename Seq2SeqORTTrainer to ORTSeq2SeqTrainer for clarity and consistency.
    • Add ORTOptimizer support for ELECTRA models.
    • Fix the loading of pretrained ORTConfig which contains optimization and quantization config.
  • v1.1.0(Apr 1, 2022)

    ORTTrainer and Seq2SeqORTTrainer

    The ORTTrainer and Seq2SeqORTTrainer are two new experimental classes.

    • Both ORTTrainer and Seq2SeqORTTrainer were created to offer a user-facing API similar to that of the Trainer and Seq2SeqTrainer classes of the Transformers library (see the sketch after this list).
    • ORTTrainer allows the usage of the ONNX Runtime backend to train a given PyTorch model in order to accelerate training. ONNX Runtime will run the forward and backward passes using an optimized automatically-exported ONNX computation graph, while the rest of the training loop is executed by native PyTorch.
    • ORTTrainer allows the usage of ONNX Runtime inferencing during both the evaluation and the prediction step.
    • For Seq2SeqORTTrainer, ONNX Runtime inferencing is incompatible with --predict_with_generate, as the generate method is not supported yet.
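
    Below is a hedged sketch of the intended usage, mirroring the transformers Trainer API. The checkpoint and training arguments are illustrative, train_dataset / eval_dataset are placeholders for tokenized datasets prepared as for the regular Trainer, and the exact constructor signature may differ slightly in this release.

    from transformers import AutoModelForSequenceClassification, TrainingArguments
    from optimum.onnxruntime import ORTTrainer
    
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
    training_args = TrainingArguments(output_dir="ort_output", num_train_epochs=1)
    
    # train_dataset / eval_dataset stand for tokenized datasets prepared exactly as for
    # the regular transformers Trainer (not defined in this sketch).
    trainer = ORTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()     # forward and backward passes run on the ONNX Runtime backend
    trainer.evaluate()  # evaluation can also use ONNX Runtime inferencing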

    ONNX Runtime optimization and quantization APIs improvements

    The ORTQuantizer and ORTOptimizer classes underwent a massive refactoring that should allow a simpler and more flexible user-facing API.

    • Addition of the possibility to iteratively compute the quantization activation ranges when applying static quantization, by using the ORTQuantizer method partial_fit (see the sketch after this list). This is especially useful when using memory-hungry calibration methods such as Entropy and Percentile.
    • When using the MinMax calibration method, it is now possible to compute the moving average of the minimum and maximum values representing the activations quantization ranges instead of the global minimum and maximum (feature available with onnxruntime v1.11.0 or higher).
    • The classes OptimizationConfig, QuantizationConfig and CalibrationConfig were added in order to better segment the different ONNX Runtime related parameters instead of having one unique configuration ORTConfig.
    • The QuantizationPreprocessor class was added in order to find the nodes to include and/or exclude from quantization, by finding the nodes following a given pattern (such as the nodes forming LayerNorm). This is particularly useful in the context of static quantization, where the quantization of modules such as LayerNorm or GELU is responsible for an important drop in accuracy.
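
    To make these pieces concrete, here is a minimal sketch of iterative static-quantization calibration with partial_fit and a MinMax moving-average calibration config. It follows the ORTQuantizer configuration style used elsewhere in this document; the helpers AutoCalibrationConfig and get_calibration_dataset, the compute_ranges/export steps and the calibration dataset are assumptions for illustration and may differ in this exact release.

    from transformers import AutoTokenizer
    from optimum.onnxruntime import ORTQuantizer
    from optimum.onnxruntime.configuration import AutoCalibrationConfig, AutoQuantizationConfig
    
    model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="sequence-classification")
    
    # Static quantization configuration (the arm64 target is chosen arbitrarily for the sketch)
    qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)
    
    # A small calibration set; get_calibration_dataset is taken from the library's quantization utilities
    calibration_dataset = quantizer.get_calibration_dataset(
        "glue",
        dataset_config_name="sst2",
        preprocess_function=lambda ex: tokenizer(ex["sentence"], padding="max_length", truncation=True),
        num_samples=40,
        dataset_split="train",
    )
    
    # MinMax calibration with a moving average of the activation ranges (onnxruntime >= 1.11)
    calibration_config = AutoCalibrationConfig.minmax(calibration_dataset, moving_average=True)
    
    # Compute the activation ranges iteratively over shards of the calibration set (partial_fit)
    for i in range(2):
        quantizer.partial_fit(
            dataset=calibration_dataset.shard(num_shards=2, index=i),
            calibration_config=calibration_config,
        )
    ranges = quantizer.compute_ranges()
    
    # Apply static quantization using the computed ranges
    quantizer.export(
        onnx_model_path="model.onnx",
        onnx_quantized_model_output_path="model-quantized.onnx",
        calibration_tensors_range=ranges,
        quantization_config=qconfig,
    )
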
  • v1.0.0(Feb 24, 2022)

    ONNX Runtime support

    • An ORTConfig class was introduced, allowing the user to define the desired export, optimization and quantization strategies.
    • The ORTOptimizer class takes care of the model's ONNX export as well as the graph optimization provided by ONNX Runtime. In order to create an instance of ORTOptimizer, the user needs to provide an ORTConfig object defining the export and graph-level transformation information. Optimization can then be performed by calling the ORTOptimizer.fit method.
    • ONNX Runtime static and dynamic quantization can also be applied to a model by using the newly added ORTQuantizer class. In order to create an instance of ORTQuantizer, the user needs to provide an ORTConfig object defining the export and quantization information, such as the quantization approach to use or the activation and weight data types. Quantization can then be applied by calling the ORTQuantizer.fit method.

    Additional features for Intel Neural Compressor

    We have also added a new class called IncOptimizer, which takes care of combining the pruning and quantization processes.

  • v0.1.2(Feb 2, 2022)

    With this release, we enable Intel Neural Compressor v1.8 magnitude pruning for a variety of NLP tasks, with the introduction of IncTrainer, which handles the pruning process.

  • v0.1.1(Nov 10, 2021)

    With this release, we enable Intel Neural Compressor v1.7 PyTorch dynamic quantization, post-training quantization and quantization-aware training for a variety of NLP tasks. This support covers the overall process, from applying quantization to loading the resulting quantized model, the latter enabled by the introduction of the IncQuantizedModel class.

  • v0.0.1(Sep 14, 2021)
