ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

Overview

ONNX Runtime is a cross-platform inference and training machine-learning accelerator.

ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms. Learn more →

ONNX Runtime training can accelerate the model training time on multi-node NVIDIA GPUs for transformer models with a one-line addition for existing PyTorch training scripts. Learn more →

Get Started

General Information: onnxruntime.ai

Usage documentation and tutorials: onnxruntime.ai/docs

Companion sample repositories:

Build Pipeline Status

(Build status badges for the Windows, Linux, Mac, Android, iOS, and WebAssembly CI pipelines, covering CPU, GPU, and other execution providers. Linux additionally publishes several execution-provider-specific pipelines.)

Data/Telemetry

Windows distributions of this project may collect usage data and send it to Microsoft to help improve our products and services. See the privacy statement for more details.

Contributions and Feedback

We welcome contributions! Please see the contribution guidelines.

For feature requests or bug reports, please file a GitHub Issue.

For general discussion or questions, please use GitHub Discussions.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

License

This project is licensed under the MIT License.

Comments
  • Openvino ep 2021.4 v3.3

    Openvino ep 2021.4 v3.3

    Changes enabled in the OpenVINO EP for IO buffer optimization and the Auto plugin feature

    Motivation and Context

    • Change was required to enable IO buffer optimization
    • Change was required to enable the Auto plugin and to fix the Multi and Hetero flows
    • Adds an ONNX Runtime API to get the device location for an ORT value tensor
    opened by sfatimar 79
  • Java API for onnxruntime

    Java API for onnxruntime

    Description: This pull request provides a Java 8 API using JNI. It has unit tests ported from the v0.5.0 release of the C# API; I'll work on porting the new tests from the master branch over the next few weeks. I assume there will be some design & naming discussion on this PR, so we can have that while I work on the unit tests.

    Currently it builds using a separate gradle project which I've tested on Mac & Linux. The build process involves running gradle clean build -x test; gradle build as the combination of a JNI and Java project in Gradle 5 isn't properly supported. I could do with some help integrating it into the CMake build system, but I've not used CMake much before. Integrating it into CMake will make it simpler to put in the appropriate provider compilation flags and fix the oddities in the build (as CMake has all the information necessary).

    opened by Craigacp 75
  • Support CUDA Graph

    Support CUDA Graph

    Description

    This PR adds support for CUDA Graphs. This feature can significantly reduce the CPU overhead of calling CUDA APIs by submitting the entire graph to the GPU with a single call to cudaGraphLaunch.

    Motivation and Context

    • Why is this change required? What problem does it solve? This feature is very helpful for reducing model latency, especially for online inference, where the above CPU overhead is a bottleneck. For example, it reduces the 95th-percentile latency of a transformer-based online inference model (with 148 million parameters) from 4.3ms to 2.1ms.
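    The 95th-percentile figure quoted above can be computed from raw per-request timings. A minimal sketch with the standard library (the sample data here is hypothetical, purely for illustration):

```python
import statistics

def p95_latency_ms(samples_ms):
    """Return the 95th-percentile latency from per-request timings (ms).

    statistics.quantiles(..., n=100) returns the 1st..99th percentile
    cut points; index 94 is the 95th percentile.
    """
    return statistics.quantiles(samples_ms, n=100)[94]

# Hypothetical timings: most requests are fast, a slow tail dominates p95.
timings = [2.0] * 95 + [4.3] * 5
print(p95_latency_ms(timings))
```

    Tail percentiles like p95 are the usual metric for online serving because a small fraction of slow requests (here, the 4.3ms tail) dominates the user-visible latency even when the median is low.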
    opened by feihugis 72
  • Resolve Optim Params Issues

    Resolve Optim Params Issues

    • Includes a test of Optimizer Parameter Groups for the ONNX BERT Model (3 variations)
    • Resolves the issue of not passing default hyperparameters for parameters not in a group
    • Resolves the issue of sending 'lambda_coef' instead of 'lambda' to the backend
    • Resolves the issue of sending lr to the backend as a hyperparameter
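    The defaulting behavior the fix describes — parameters not listed in any group fall back to the optimizer-wide defaults — can be sketched generically. This is a hypothetical illustration of the merge logic, not the ORT training API:

```python
def resolve_param_groups(all_params, groups, defaults):
    """Give every parameter a complete hyperparameter dict.

    Parameters listed in a group use that group's overrides layered on top
    of the defaults; parameters in no group get the defaults unchanged.
    """
    resolved = {}
    for group in groups:
        overrides = {k: v for k, v in group.items() if k != "params"}
        for name in group["params"]:
            resolved[name] = {**defaults, **overrides}
    for name in all_params:
        # Fix described above: ungrouped params still get the defaults.
        resolved.setdefault(name, dict(defaults))
    return resolved

defaults = {"lr": 1e-3, "lambda": 0.01}  # note the key 'lambda', not 'lambda_coef'
groups = [{"params": ["bert.embeddings.weight"], "lambda": 0.0}]
out = resolve_param_groups(
    ["bert.embeddings.weight", "classifier.weight"], groups, defaults)
```

    The hyperparameter names and parameter names here are invented for the example; the point is only the merge order (group overrides win over defaults, and ungrouped parameters are never left without values).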
    opened by rayankrish 68
  • Upgrade GIST memory compression nodes, kernels, optimizer rule, and cli

    Upgrade GIST memory compression nodes, kernels, optimizer rule, and cli

    Description: Extend Gist memory compression to support additional compression formats, the new priority execution order, and other upgrades:

    • New Feature: GistPack1 compression. It compresses from float32/bool to 1 bit. It is used for lossless compression for dropout and relu nodes.
    • New Feature: GistPack8 compression. It compresses from 32 bits/16 bits to 8 bits. It is used for lossy compression for any operator.
    • New Feature: GistPackMsfp15 compression. It compresses 8 (or tile size) values each 32 bits wide to 8 (or tile size) values each 7 bits wide (sign and mantissa) and a single 8 bits shared exponent. It is used for lossy compression for any operator.
    • New Feature: GistPack16 compression. It compresses from 32 bits to 16 bits. It is used for lossy compression for any operator.
    • We also upgraded the Gist rule to support different operators. We created a generic Gist rule that works as long as a pattern map is provided. The pattern map uses the target operator as the key and the destination operator as the value (e.g. PATTERN_MAP[Softmax] = {“SoftmaxGrad”}). The rule is operator-agnostic, which makes Gist robust enough to support new operators in the future.
    • New test for Priority execution order for nested compression.
    • Gist upgrade to support priority execution order to trigger encoder (compression) and decoder (decompression) accordingly.
    • Gist CLI: --use_gist, --op <which operator is being targeted, e.g. Softmax is op 1> --gist_compr <GistPack1|GistPack8|GistPack16|GistPackMsfp15>
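    As a rough illustration of the lossy 32-bit-to-8-bit idea behind GistPack8, here is a simplified linear quantizer in plain Python. This is a sketch of the general technique only, not the actual Gist kernels (which operate on tensors in the execution graph):

```python
def pack8(values):
    """Lossily compress float values to 8-bit codes via linear quantization."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]  # each code fits in 0..255
    return codes, lo, scale

def unpack8(codes, lo, scale):
    """Decompress 8-bit codes back to approximate float values."""
    return [lo + c * scale for c in codes]

vals = [0.0, 0.5, 1.0, 2.0]
codes, lo, scale = pack8(vals)
approx = unpack8(codes, lo, scale)
# Reconstruction error is bounded by half a quantization step (scale / 2).
```

    The compression is lossy: values are snapped to one of 256 levels spanning the observed range, trading precision for a 4x reduction from float32. The lossless formats (e.g. GistPack1 for dropout/relu masks) instead exploit the fact that the stashed activations only carry one bit of information.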

    Motivation and Context

    • Why is this change required? What problem does it solve? It fixes and improves the Gist optimizer rule by changing Gist operators to handle 1 input and 1 output without needing an early encoder input or a late decoder output. It also adds new compression formats (Pack1, Pack8).
    training 
    opened by fninaparavecino 61
  • Multi-stream executor

    Multi-stream executor

    Description: This PR includes the following work:

    1. provide stream and related synchronization abstractions in onnxruntime.
    2. enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel.
    3. deprecate the parallel executor for CPU.
    4. deprecate the Fence mechanism.
    5. update the CUDA / TensorRT EPs to support the stream mechanism, supporting running different requests in different CUDA streams.

    Motivation and Context

    • Why is this change required? Currently, the execution plan is just a linear list of execution primitives that ORT executes step by step. For any given graph, ORT serializes it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations:
    1. it is difficult to enable inter-node parallelization; we have a half-baked parallel executor, but it is very difficult to make it work with GPUs.
    2. the Fence mechanism works for the single GPU stream + CPU thread case, but when extended to multiple streams, it is difficult to manage cross-stream synchronization.
    3. our CUDA EP relies on the BFCArena to make memory management work with asynchronous GPU kernels, but the current BFCArena is not stream-aware, so it does not behave correctly when run with multiple streams.

    This PR enhances our existing execution plan and executor to support multi-stream execution. We use a unified algorithm to manage both the single-stream and multi-stream scenarios. This PR mainly focuses on the infrastructure support for multi-stream execution; that is, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will come in a future PR.
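    The stream-assignment idea above can be modeled with plain threads: each "stream" executes its assigned nodes in order, and cross-stream edges become wait/notify synchronization points. This is a toy model of the concept (using threading.Event in place of CUDA events), not the ORT executor:

```python
import threading

def run_streams(streams, deps, results):
    """Execute a multi-stream assignment.

    streams: {stream_id: [(node_name, fn), ...]} -- nodes run in order per stream.
    deps:    {node_name: [upstream node_names]}  -- cross-stream edges to wait on.
    results: dict filled with each node's output.
    """
    done = {node: threading.Event() for nodes in streams.values() for node, _ in nodes}

    def worker(nodes):
        for node, fn in nodes:
            for dep in deps.get(node, []):
                done[dep].wait()      # block until the upstream node finishes
            results[node] = fn(results)
            done[node].set()          # signal downstream nodes, possibly on other streams

    threads = [threading.Thread(target=worker, args=(nodes,))
               for nodes in streams.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Node "c" on stream 0 must wait for "b", which runs concurrently on stream 1.
results = {}
streams = {
    0: [("a", lambda r: 1), ("c", lambda r: r["a"] + r["b"])],
    1: [("b", lambda r: 2)],
}
run_streams(streams, {"c": ["a", "b"]}, results)
```

    The correctness condition mirrors the PR's framing: as long as the given stream assignment plus synchronization edges respect the graph's data dependencies, per-stream in-order execution yields a correct result regardless of how nodes are partitioned across streams.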

    opened by souptc 60
  • Amdmigraphx fix build error

    Amdmigraphx fix build error

    Description: Fixes a build error related to EP API changes.

    Motivation and Context

    1. The ORT EP infrastructure was changed to use a shared library, and the EP APIs changed, so AMD MIGraphX needs corresponding changes to work as an EP.
    2. Added a few operators that AMDMIGraphX implemented recently.
    • Why is this change required? What problem does it solve? See above explanation

    • If it fixes an open issue, please link to the issue here. No

    opened by scxiao 60
  • Python MacOS arm64 release binaries

    Python MacOS arm64 release binaries

    Describe the bug

    ONNX Runtime does not install using pip on M1.

    System information

    • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 11.2.1
    • ONNX Runtime installed from (source or binary): pip
    • Python version: 3.9.1

    To Reproduce

    ~: uname -v
    Darwin Kernel Version 20.3.0: Thu Jan 21 00:06:51 PST 2021; root:xnu-7195.81.3~1/RELEASE_ARM64_T8101
    ~: which python3
    /opt/homebrew/bin/python3
    ~: which pip
    /opt/homebrew/bin/pip
    ~: python3 --version
    Python 3.9.1
    ~: pip install onnxruntime
    ERROR: Could not find a version that satisfies the requirement onnxruntime
    ERROR: No matching distribution found for onnxruntime
    
    feature request 
    opened by lutzroeder 59
  • Bump numpy from 1.21.0 to 1.22.0 in /tools/ci_build/github/linux/docker/scripts/training/ortmodule/stage1/requirements_torch1.11.0_rocm4.3.1

    Bump numpy from 1.21.0 to 1.22.0 in /tools/ci_build/github/linux/docker/scripts/training/ortmodule/stage1/requirements_torch1.11.0_rocm4.3.1

    Bumps numpy from 1.21.0 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    api 
    opened by dependabot[bot] 55
  • [Java] Adds support for DNNL, OpenVINO, TensorRT shared providers and refactors the CUDA shared provider loader

    [Java] Adds support for DNNL, OpenVINO, TensorRT shared providers and refactors the CUDA shared provider loader

    Description:

    Refactors the native library loading in Java to allow CUDA to be loaded on demand, fixing #7044. Then expands the shared provider library loading to DNNL, OpenVINO, TensorRT, fixing #6553.

    Added a flag to the native library loading to allow users to supply a directory which contains all the native libraries, fixing #8003. This is also the only way to make the shared library providers load from a different place than the jar, as the individual library path specification conflicts with the way that the ONNX Runtime native code loads the shared library providers.

    I also slightly refactored the Java cmake bits, and added the --console=plain flag to the gradle executions to stop gradle writing over cmake's output.

    Motivation and Context

    • Why is this change required? What problem does it solve? Re-enables DNNL, OpenVINO and TensorRT in Java by allowing them to be packaged in the jar and dynamically loaded in the same way CUDA is.
    • If it fixes an open issue, please link to the issue here. Fixes #6553. Fixes #7044. Fixes #8003.
    opened by Craigacp 54
  • Jetson Xavier - building from source

    Jetson Xavier - building from source

    1. I tried the solution proposed here:

       ../build.sh --config Release --update --build --build_wheel --use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu --tensorrt_home /usr/lib/aarch64-linux-gnu

    The build syncs and updates the git submodules, then generates the CMake build tree with /usr/local/bin/cmake (CUDA and TensorRT enabled, -Donnxruntime_TENSORRT_HOME=/usr/lib/aarch64-linux-gnu, -DCMAKE_BUILD_TYPE=Release). CMake identifies the CUDA compiler as NVIDIA 10.0.326 and then fails its compiler check:

       -- Check for working CUDA compiler: /usr/local/cuda-10.0/bin/nvcc - broken
       CMake Error at /usr/local/share/cmake-3.17/Modules/CMakeTestCUDACompiler.cmake:46 (message): The CUDA compiler

      "/usr/local/cuda-10.0/bin/nvcc"

    is not able to compile a simple test program.

    It fails with the following output:

    Change Dir: /code/onnxruntime/build/Linux/Release/CMakeFiles/CMakeTmp
    
    Run Build Command(s):/usr/bin/make cmTC_bb43d/fast && /usr/bin/make -f CMakeFiles/cmTC_bb43d.dir/build.make CMakeFiles/cmTC_bb43d.dir/build
    make[1]: Entering directory '/code/onnxruntime/build/Linux/Release/CMakeFiles/CMakeTmp'
    Building CUDA object CMakeFiles/cmTC_bb43d.dir/main.cu.o
    /usr/local/cuda-10.0/bin/nvcc    -cudart shared  -Xcompiler=-fPIE   -x cu -c /code/onnxruntime/build/Linux/Release/CMakeFiles/CMakeTmp/main.cu -o CMakeFiles/cmTC_bb43d.dir/main.cu.o
    Linking CUDA executable cmTC_bb43d
    /usr/local/bin/cmake -E cmake_link_script CMakeFiles/cmTC_bb43d.dir/link.txt --verbose=1
    /usr/bin/g++   CMakeFiles/cmTC_bb43d.dir/main.cu.o -o cmTC_bb43d  -lcudadevrt -lcudart_static  -L"/usr/local/cuda-10.0/targets/aarch64-linux/lib/stubs" -L"/usr/local/cuda-10.0/targets/aarch64-linux/lib" -lcudadevrt -lcudart
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::globalState::initializeDriverEntrypoints()':
    :(.text+0x23488): undefined reference to `dlsym'
    :(.text+0x234b0): undefined reference to `dlsym'
    :(.text+0x234d4): undefined reference to `dlsym'
    :(.text+0x234f8): undefined reference to `dlsym'
    :(.text+0x2351c): undefined reference to `dlsym'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o)::(.text+0x23540): more undefined references to `dlsym' follow
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::globalState::loadDriverInternal()':
    :(.text+0x288cc): undefined reference to `dlopen'
    :(.text+0x28904): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::__loadDriverInternalUtil()':
    :(.text+0x289e0): undefined reference to `dlopen'
    :(.text+0x28a14): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::globalState::initializeDriverInternal()':
    :(.text+0x2b664): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosInit()':
    :(.text+0x5c7bc): undefined reference to `dlerror'
    :(.text+0x5c7c8): undefined reference to `dlopen'
    :(.text+0x5c7dc): undefined reference to `dlsym'
    :(.text+0x5c7e4): undefined reference to `dlerror'
    :(.text+0x5c7f4): undefined reference to `dlclose'
    :(.text+0x5c838): undefined reference to `dlerror'
    :(.text+0x5c844): undefined reference to `dlopen'
    :(.text+0x5c858): undefined reference to `dlsym'
    :(.text+0x5c860): undefined reference to `dlerror'
    :(.text+0x5c870): undefined reference to `dlclose'
    :(.text+0x5c8b4): undefined reference to `dlerror'
    :(.text+0x5c8c0): undefined reference to `dlopen'
    :(.text+0x5c8d4): undefined reference to `dlsym'
    :(.text+0x5c8dc): undefined reference to `dlerror'
    :(.text+0x5c8ec): undefined reference to `dlclose'
    :(.text+0x5c930): undefined reference to `dlerror'
    :(.text+0x5c93c): undefined reference to `dlopen'
    :(.text+0x5c950): undefined reference to `dlsym'
    :(.text+0x5c958): undefined reference to `dlerror'
    :(.text+0x5c968): undefined reference to `dlclose'
    :(.text+0x5c9a0): undefined reference to `dlerror'
    :(.text+0x5c9ac): undefined reference to `dlopen'
    :(.text+0x5c9c0): undefined reference to `dlsym'
    :(.text+0x5c9c8): undefined reference to `dlerror'
    :(.text+0x5c9d8): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosSemaphoreCreate(sem_t*, int)':
    :(.text+0x5d910): undefined reference to `sem_init'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosSemaphoreDestroy(sem_t*)':
    :(.text+0x5d92c): undefined reference to `sem_destroy'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosSemaphoreWait(sem_t*, unsigned int)':
    :(.text+0x5da10): undefined reference to `sem_timedwait'
    :(.text+0x5da48): undefined reference to `sem_wait'
    :(.text+0x5da60): undefined reference to `sem_trywait'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosSemaphoreSignal(sem_t*)':
    :(.text+0x5dab0): undefined reference to `sem_post'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosVirtualReserveInRangeBug1778973WARInit()':
    :(.text+0x5f448): undefined reference to `pthread_mutexattr_init'
    :(.text+0x5f464): undefined reference to `pthread_mutexattr_settype'
    :(.text+0x5f474): undefined reference to `pthread_mutexattr_setpshared'
    :(.text+0x5f484): undefined reference to `pthread_mutexattr_setprotocol'
    :(.text+0x5f4a4): undefined reference to `pthread_mutexattr_destroy'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosPosixInit()':
    :(.text+0x5f4f0): undefined reference to `dlerror'
    :(.text+0x5f4fc): undefined reference to `dlopen'
    :(.text+0x5f510): undefined reference to `dlsym'
    :(.text+0x5f518): undefined reference to `dlerror'
    :(.text+0x5f528): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosVirtualReserveInRange(unsigned long, void*, void*, unsigned long)':
    :(.text+0x5f768): undefined reference to `pthread_once'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosLoadLibrary(char const*)':
    :(.text+0x5fc8c): undefined reference to `dlerror'
    :(.text+0x5fca0): undefined reference to `dlopen'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosLoadLibraryUnsafe(char const*)':
    :(.text+0x5fcb4): undefined reference to `dlerror'
    :(.text+0x5fcc8): undefined reference to `dlopen'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosFreeLibrary(void*)':
    :(.text+0x5fcd4): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosGetProcAddress(void*, char const*)':
    :(.text+0x5fce8): undefined reference to `dlsym'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosTlsAlloc(void (*)(void*))':
    :(.text+0x5fdec): undefined reference to `pthread_key_create'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosTlsFree(unsigned int)':
    :(.text+0x5fe10): undefined reference to `pthread_key_delete'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosTlsGetValue(unsigned int)':
    :(.text+0x5fe18): undefined reference to `pthread_getspecific'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosTlsSetValue(unsigned int, void*)':
    :(.text+0x5fe28): undefined reference to `pthread_setspecific'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosInitializeCriticalSectionWithSharedFlag(pthread_mutex_t*, int)':
    :(.text+0x5fef4): undefined reference to `pthread_mutexattr_init'
    :(.text+0x5ff14): undefined reference to `pthread_mutexattr_settype'
    :(.text+0x5ff24): undefined reference to `pthread_mutexattr_setpshared'
    :(.text+0x5ff34): undefined reference to `pthread_mutexattr_setprotocol'
    :(.text+0x5ff50): undefined reference to `pthread_mutexattr_destroy'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosInitializeCriticalSection(pthread_mutex_t*)':
    :(.text+0x5ff70): undefined reference to `pthread_mutexattr_init'
    :(.text+0x5ff8c): undefined reference to `pthread_mutexattr_settype'
    :(.text+0x5ff9c): undefined reference to `pthread_mutexattr_setpshared'
    :(.text+0x5ffac): undefined reference to `pthread_mutexattr_setprotocol'
    :(.text+0x5ffc8): undefined reference to `pthread_mutexattr_destroy'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosInitializeCriticalSectionShared(pthread_mutex_t*)':
    :(.text+0x5ffe8): undefined reference to `pthread_mutexattr_init'
    :(.text+0x60004): undefined reference to `pthread_mutexattr_settype'
    :(.text+0x60014): undefined reference to `pthread_mutexattr_setpshared'
    :(.text+0x60024): undefined reference to `pthread_mutexattr_setprotocol'
    :(.text+0x60040): undefined reference to `pthread_mutexattr_destroy'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosTryEnterCriticalSection(pthread_mutex_t*)':
    :(.text+0x60058): undefined reference to `pthread_mutex_trylock'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosInitRWLockEx(void**, void*, unsigned long)':
    :(.text+0x600b4): undefined reference to `pthread_rwlockattr_init'
    :(.text+0x600c4): undefined reference to `pthread_rwlockattr_setpshared'
    :(.text+0x600d4): undefined reference to `pthread_rwlock_init'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosInitRWLock(void**)':
    :(.text+0x60114): undefined reference to `pthread_rwlockattr_init'
    :(.text+0x60144): undefined reference to `pthread_rwlockattr_setpshared'
    :(.text+0x60154): undefined reference to `pthread_rwlock_init'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosAcquireReaderLock(void**)':
    :(.text+0x60164): undefined reference to `pthread_rwlock_rdlock'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosAcquireWriterLock(void**)':
    :(.text+0x6016c): undefined reference to `pthread_rwlock_wrlock'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosTryAcquireReaderLock(void**)':
    :(.text+0x6017c): undefined reference to `pthread_rwlock_tryrdlock'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosTryAcquireWriterLock(void**)':
    :(.text+0x601a4): undefined reference to `pthread_rwlock_trywrlock'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosReleaseReaderLock(void**)':
    :(.text+0x601c4): undefined reference to `pthread_rwlock_unlock'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosReleaseWriterLock(void**)':
    :(.text+0x601cc): undefined reference to `pthread_rwlock_unlock'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosDestroyRWLockEx(void**)':
    :(.text+0x601d4): undefined reference to `pthread_rwlock_destroy'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosDestroyRWLock(void**)':
    :(.text+0x601ec): undefined reference to `pthread_rwlock_destroy'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosOnce(int*, void (*)())':
    :(.text+0x60210): undefined reference to `pthread_once'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosCondCreateWithSharedFlag(pthread_cond_t*, int)':
    :(.text+0x60250): undefined reference to `pthread_condattr_setpshared'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosCondCreate(pthread_cond_t*)':
    :(.text+0x602b0): undefined reference to `pthread_condattr_setpshared'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosCondCreateShared(pthread_cond_t*)':
    :(.text+0x60310): undefined reference to `pthread_condattr_setpshared'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosThreadCreateWithName(cudart::CUOSthread_st**, int (*)(void*), void*, char const*)':
    :(.text+0x60564): undefined reference to `pthread_create'
    :(.text+0x60578): undefined reference to `pthread_setname_np'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosThreadCreate(cudart::CUOSthread_st**, int (*)(void*), void*)':
    :(.text+0x60640): undefined reference to `pthread_create'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosThreadJoin(cudart::CUOSthread_st*, int*)':
    :(.text+0x606a8): undefined reference to `pthread_join'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosThreadDetach(cudart::CUOSthread_st*)':
    :(.text+0x60708): undefined reference to `pthread_detach'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosHasThreadExited(cudart::CUOSthread_st*)':
    :(.text+0x60758): undefined reference to `pthread_kill'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosShmCreateNamedEx(void*, char const*, unsigned long, cudart::cuosShmInfoEx_st**)':
    :(.text+0x60ee0): undefined reference to `shm_unlink'
    :(.text+0x60ef8): undefined reference to `shm_open'
    :(.text+0x60f98): undefined reference to `shm_unlink'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosShmOpenNamedEx(void*, char const*, unsigned long, cudart::cuosShmInfoEx_st**)':
    :(.text+0x61124): undefined reference to `shm_open'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosShmCloseEx(cudart::cuosShmInfoEx_st*, unsigned int, unsigned int)':
    :(.text+0x61370): undefined reference to `shm_unlink'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `cudart::cuosSetThreadName(cudart::CUOSthread_st*, char const*)':
    :(.text+0x62294): undefined reference to `pthread_setname_np'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `CUOSdlsymLoader<int (*)(int, sockaddr*, unsigned int*, int)>::~CUOSdlsymLoader()':
    :(.text._ZN15CUOSdlsymLoaderIPFiiP8sockaddrPjiEED2Ev[_ZN15CUOSdlsymLoaderIPFiiP8sockaddrPjiEED5Ev]+0x18): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `CUOSdlsymLoader<int (*)(int*, int)>::~CUOSdlsymLoader()':
    :(.text._ZN15CUOSdlsymLoaderIPFiPiiEED2Ev[_ZN15CUOSdlsymLoaderIPFiPiiEED5Ev]+0x18): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `CUOSdlsymLoader<int (*)(unsigned long, unsigned long, unsigned long const*)>::~CUOSdlsymLoader()':
    :(.text._ZN15CUOSdlsymLoaderIPFimmPKmEED2Ev[_ZN15CUOSdlsymLoaderIPFimmPKmEED5Ev]+0x18): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `CUOSdlsymLoader<int (*)(unsigned long, unsigned long, unsigned long*)>::~CUOSdlsymLoader()':
    :(.text._ZN15CUOSdlsymLoaderIPFimmPmEED2Ev[_ZN15CUOSdlsymLoaderIPFimmPmEED5Ev]+0x18): undefined reference to `dlclose'
    /usr/local/cuda-10.0/targets/aarch64-linux/lib/libcudart_static.a(libcudart_static.a.o): In function `CUOSdlsymLoader<int (*)()>::~CUOSdlsymLoader()':
    :(.text._ZN15CUOSdlsymLoaderIPFivEED2Ev[_ZN15CUOSdlsymLoaderIPFivEED5Ev]+0x18): undefined reference to `dlclose'
    collect2: error: ld returned 1 exit status
    CMakeFiles/cmTC_bb43d.dir/build.make:103: recipe for target 'cmTC_bb43d' failed
    make[1]: *** [cmTC_bb43d] Error 1
    make[1]: Leaving directory '/code/onnxruntime/build/Linux/Release/CMakeFiles/CMakeTmp'
    Makefile:138: recipe for target 'cmTC_bb43d/fast' failed
    make: *** [cmTC_bb43d/fast] Error 2
    

    CMake will not be able to correctly generate this project.
    Call Stack (most recent call first):
      CMakeLists.txt:715 (enable_language)

    -- Configuring incomplete, errors occurred!
    See also "/code/onnxruntime/build/Linux/Release/CMakeFiles/CMakeOutput.log".
    See also "/code/onnxruntime/build/Linux/Release/CMakeFiles/CMakeError.log".
    Traceback (most recent call last):
      File "/code/onnxruntime/tools/ci_build/build.py", line 1043, in <module>
        sys.exit(main())
      File "/code/onnxruntime/tools/ci_build/build.py", line 972, in main
        args, cmake_extra_args)
      File "/code/onnxruntime/tools/ci_build/build.py", line 422, in generate_build_tree
        run_subprocess(cmake_args + ["-DCMAKE_BUILD_TYPE={}".format(config)], cwd=config_build_dir)
      File "/code/onnxruntime/tools/ci_build/build.py", line 196, in run_subprocess
        return subprocess.run(args, cwd=cwd, check=True, stdout=stdout, stderr=stderr, env=my_env, shell=shell)
      File "/usr/lib/python3.6/subprocess.py", line 438, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command '['/usr/local/bin/cmake', '/code/onnxruntime/cmake', '-Donnxruntime_RUN_ONNX_TESTS=OFF', '-Donnxruntime_GENERATE_TEST_REPORTS=ON', '-Donnxruntime_DEV_MODE=OFF', '-DPYTHON_EXECUTABLE=/usr/bin/python3', '-Donnxruntime_USE_CUDA=ON', '-Donnxruntime_USE_NSYNC=OFF', '-Donnxruntime_CUDNN_HOME=/usr/lib/aarch64-linux-gnu', '-Donnxruntime_USE_AUTOML=OFF', '-Donnxruntime_CUDA_HOME=/usr/local/cuda', '-Donnxruntime_USE_JEMALLOC=OFF', '-Donnxruntime_USE_MIMALLOC=OFF', '-Donnxruntime_ENABLE_PYTHON=ON', '-Donnxruntime_BUILD_CSHARP=OFF', '-Donnxruntime_BUILD_SHARED_LIB=OFF', '-Donnxruntime_USE_EIGEN_FOR_BLAS=ON', '-Donnxruntime_USE_OPENBLAS=OFF', '-Donnxruntime_USE_MKLDNN=OFF', '-Donnxruntime_USE_MKLML=OFF', '-Donnxruntime_USE_GEMMLOWP=OFF', '-Donnxruntime_USE_NGRAPH=OFF', '-Donnxruntime_USE_OPENVINO=OFF', '-Donnxruntime_USE_OPENVINO_BINARY=OFF', '-Donnxruntime_USE_OPENVINO_SOURCE=OFF', '-Donnxruntime_USE_OPENVINO_MYRIAD=OFF', '-Donnxruntime_USE_OPENVINO_GPU_FP32=OFF', '-Donnxruntime_USE_OPENVINO_GPU_FP16=OFF', '-Donnxruntime_USE_OPENVINO_CPU_FP32=OFF', '-Donnxruntime_USE_OPENVINO_VAD_M=OFF', '-Donnxruntime_USE_OPENVINO_VAD_F=OFF', '-Donnxruntime_USE_NNAPI=OFF', '-Donnxruntime_USE_OPENMP=ON', '-Donnxruntime_USE_TVM=OFF', '-Donnxruntime_USE_LLVM=OFF', '-Donnxruntime_ENABLE_MICROSOFT_INTERNAL=OFF', '-Donnxruntime_USE_BRAINSLICE=OFF', '-Donnxruntime_USE_NUPHAR=OFF', '-Donnxruntime_USE_EIGEN_THREADPOOL=OFF', '-Donnxruntime_USE_TENSORRT=ON', '-Donnxruntime_TENSORRT_HOME=/usr/lib/aarch64-linux-gnu', '-Donnxruntime_CROSS_COMPILING=OFF', '-Donnxruntime_BUILD_SERVER=OFF', '-Donnxruntime_BUILD_x86=OFF', '-Donnxruntime_USE_FULL_PROTOBUF=ON', '-Donnxruntime_DISABLE_CONTRIB_OPS=OFF', '-Donnxruntime_MSVC_STATIC_RUNTIME=OFF', '-Donnxruntime_ENABLE_LANGUAGE_INTEROP_OPS=OFF', '-Donnxruntime_USE_DML=OFF', '-DCUDA_CUDA_LIBRARY=/usr/local/cuda/lib64/stubs', '-Donnxruntime_PYBIND_EXPORT_OPSCHEMA=OFF', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
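For context, every undefined reference in the log above (pthread_*, dl*, shm_*) comes from libcudart_static.a's dependencies on the POSIX threading, dynamic-loading, and shared-memory APIs, which on Linux live in libpthread, libdl, and librt. A hedged sketch of the usual fix in a consuming CMakeLists (the target name `my_app` is hypothetical; the `CUDA::cudart_static` imported target requires CMake 3.17+ with `find_package(CUDAToolkit)`):

```cmake
find_package(Threads REQUIRED)
find_package(CUDAToolkit REQUIRED)

# Link order matters for static archives: the system libraries
# must come after libcudart_static.a on the link line.
target_link_libraries(my_app PRIVATE
    CUDA::cudart_static   # static CUDA runtime
    Threads::Threads      # pthread_* symbols
    ${CMAKE_DL_LIBS}      # dlopen/dlsym/dlclose/dlerror
    rt)                   # shm_open/shm_unlink
```

With a plain compiler invocation the equivalent is appending `-lpthread -ldl -lrt` after the static cudart archive.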

    opened by AndreV84 52
  • make WITHCACHE as an option in MacOS workflow

    Description

    1. Set the WithCache default value to false in the macOS CI workflow as well.
    2. Add today's date to the cache key so the cache size does not keep growing.

    With the cache enabled, the pipeline duration drops from 70-odd minutes to 10-odd minutes.
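For illustration only (this is not the repository's actual workflow file, and the paths and key names are made up), a date-stamped cache key in a GitHub-Actions-style step might look like:

```yaml
- name: Compute cache key suffix
  run: echo "TODAY=$(date +%Y%m%d)" >> "$GITHUB_ENV"

- name: Cache build artifacts
  uses: actions/cache@v3
  with:
    path: ~/.cache/onnxruntime      # hypothetical cache path
    key: macos-build-${{ env.TODAY }}
```

Because the key changes daily, each day starts a fresh cache entry instead of growing one entry forever.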

    opened by mszhanyi 0
  • please reopen the issue

    Describe the issue

    Could you please reopen this issue? We hit the same problem with opset_version=16. Original issue: https://github.com/microsoft/onnxruntime/issues/2756#issue-543199292.

    Urgency

    No response

    Target platform

    Windows

    Build script

    .

    Error / output

    onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_123' Status Message: D:\a_work\1\s\onnxruntime\core\framework\op_kernel.cc:81 onnxruntime::OpKernelContext::OutputMLValue status.IsOK() was false. Shape mismatch attempting to re-use buffer. {1,3,256,192} != {1,6,256,192}. Validate usage of dim_value (values should be > 0) and dim_param (all values with the same string should equate to the same size) in shapes in the model.

    Visual Studio Version

    No response

    GCC / Compiler Version

    No response

    build platform:windows 
    opened by shu0o0yX 0
  • CUDNN error executing cudnnConvolutionForward

    Describe the issue

    Hi, I'm running the same ONNX model on many different machines in Azure (all of the same type, same configuration, Docker image, etc.), and on some of them I get the following error on the first batch executed:

    <class 'onnxruntime.capi.onnxruntime_pybind11_state.Fail'>
    
    [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Conv node. Name:'efficientnetb4/stem_conv/Conv2D' Status Message: CUDNN error executing cudnnConvolutionForward(s_.handle, &alpha, s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.algo, workspace.get(), s_.workspace_bytes, &beta, s_.y_tensor, s_.y_data)
    

    It happens only on some of the machines, and only on the first message.

    To reproduce

    onnxruntime-gpu==1.10.0

    ONNX_PROVIDERS = [
        ('CUDAExecutionProvider', {
            'device_id': 0,
            'cudnn_conv_algo_search': 'DEFAULT',
        }),
    ]
    ONNX_SESSION_OPTIONS = onnxruntime.SessionOptions()
    ONNX_SESSION_OPTIONS.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    feature_extractor = onnxruntime.InferenceSession(str(fe_net_weights),
                                                     sess_options=ONNX_SESSION_OPTIONS,
                                                     providers=ONNX_PROVIDERS)

    feature_extractor.run([output_layer], {"input": input})
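When the failure shows up only on some machines, two documented CUDAExecutionProvider options worth experimenting with (the values below are examples, not a verified fix) are a different conv algorithm search mode and an explicit GPU memory cap:

```python
# Alternative provider options to try when cudnnConvolutionForward
# fails sporadically on otherwise identical machines.
providers = [
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'cudnn_conv_algo_search': 'EXHAUSTIVE',   # instead of 'DEFAULT'
        'gpu_mem_limit': 6 * 1024 * 1024 * 1024,  # cap the arena at 6 GiB
    }),
]
```

'EXHAUSTIVE' benchmarks all conv algorithms up front, which can avoid a workspace-allocation failure that 'DEFAULT' hits on the first batch.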
    

    Urgency

    No response

    Platform

    Linux

    OS Version

    Ubuntu 20.04

    ONNX Runtime Installation

    Released Package

    ONNX Runtime Version or Commit ID

    onnxruntime-gpu==1.10.0

    ONNX Runtime API

    Python

    Architecture

    X64

    Execution Provider

    CUDA

    Execution Provider Library Version

    cuda 11.3.0, cudnn8

    ep:CUDA 
    opened by kfirgoldwsc 0
  • How to save inference onnx model?

    Describe the issue

    I can now build my own training session from a PyTorch net, but when I save the ONNX model after training, BatchNormalization is in training mode and cannot be fused into Conv. What should I do to save an inference model?

    current format: 1

    expect format: 0

    To reproduce

    2

    Urgency

    No response

    ONNX Runtime Installation

    Built from Source

    ONNX Runtime Version or Commit ID

    1.8.1

    PyTorch Version

    3.7

    Execution Provider

    CUDA

    Execution Provider Library Version

    No response

    training ep:CUDA 
    opened by ArtyZe 0
  • [MIGraphX] update the MIGraphX version used in ORT to rocm-5.4.0

    Description

    Update the MIGraphX version used in ORT to rocm-5.4.0

    Motivation and Context

    The previous branch migraphx_for_ort is no longer updated and has fallen far behind the latest MIGraphX release branch. More discussion here: https://github.com/microsoft/onnxruntime/issues/14126#issuecomment-1373201049

    opened by PeixuanZuo 0
  • Update HistogramCalibrater.collect_data method to reduce memory consumption

    Description

    Updated the HistogramCalibrater.collect_data method.

    Inference results are no longer appended to the self.intermediate_outputs list. Instead, the self.collector.collect method is called inside a while loop.

    Motivation and Context

    When CalibrationMethod.Entropy or CalibrationMethod.Percentile is specified, the HistogramCalibrater class is used.

    In the HistogramCalibrater.collect_data method, all intermediate outputs were gathered before histograms were collected with the HistogramCollector class. This two-pass scheme consumes a lot of memory when a network has many intermediate output nodes and the CalibrationDataReader provides a lot of data.

    Note that quantized models are not bit-identical after this change; I don't expect it to cause harmful results, though.
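The change can be pictured as replacing accumulate-then-collect with a streaming loop (hypothetical names that mirror the PR description, not the actual ORT source):

```python
def collect_data_streaming(data_reader, run_inference, collector):
    """Feed each batch's intermediate outputs straight into the histogram
    collector instead of accumulating them all in a list first."""
    while True:
        inputs = data_reader.get_next()
        if inputs is None:
            break
        # Only one batch of intermediate outputs is alive at a time,
        # so peak memory no longer scales with dataset size.
        collector.collect(run_inference(inputs))
```

The trade-off is that the collector sees batches one at a time, which is why the resulting histograms (and hence the quantized models) need not be bit-identical to the old two-pass results.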

    opened by beru 0
Releases(v1.13.1)
Owner
Microsoft
Open source projects and samples from Microsoft