A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

Last update: Jan 08, 2023

Overview

NVIDIA DALI

The NVIDIA Data Loading Library (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It provides a collection of highly optimized building blocks for loading and processing image, video and audio data. It can be used as a portable drop-in replacement for built-in data loaders and data iterators in popular deep learning frameworks.

Deep learning applications require complex, multi-stage data processing pipelines that include loading, decoding, cropping, resizing, and many other augmentations. These data processing pipelines, which are currently executed on the CPU, have become a bottleneck, limiting the performance and scalability of training and inference.

DALI addresses the problem of the CPU bottleneck by offloading data preprocessing to the GPU. Additionally, DALI relies on its own execution engine, built to maximize the throughput of the input pipeline. Features such as prefetching, parallel execution, and batch processing are handled transparently for the user.
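
To make the idea of prefetching concrete, here is a minimal, framework-independent sketch in plain Python. It is not DALI's execution engine; the function name `prefetching_batches` and its parameters are hypothetical and only illustrate how a background worker can prepare the next batches while the consumer is still busy with the current one.

```python
import queue
import threading


def prefetching_batches(load_batch, num_batches, prefetch_depth=2):
    """Yield batches while a background thread loads up to `prefetch_depth` ahead."""
    buffer = queue.Queue(maxsize=prefetch_depth)
    sentinel = object()  # marks the end of the stream

    def producer():
        for index in range(num_batches):
            # Blocks when the buffer is full, so at most `prefetch_depth`
            # finished batches wait in the queue at any time.
            buffer.put(load_batch(index))
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


# Hypothetical loader: batch `i` is just a short list derived from its index.
batches = list(prefetching_batches(lambda i: [i, i + 1], num_batches=3))
```

The bounded queue is the design choice that matters here: it lets loading overlap with consumption while capping how much memory the waiting batches can occupy.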

In addition, the deep learning frameworks have multiple data pre-processing implementations, resulting in challenges such as portability of training and inference workflows, and code maintainability. Data processing pipelines implemented using DALI are portable because they can easily be retargeted to TensorFlow, PyTorch, MXNet and PaddlePaddle.

Highlights

Easy-to-use functional style Python API.
Support for multiple data formats: LMDB, RecordIO, TFRecord, COCO, JPEG, JPEG 2000, WAV, FLAC, OGG, H.264, VP9 and HEVC.
Portable across popular deep learning frameworks: TensorFlow, PyTorch, MXNet, PaddlePaddle.
Supports CPU and GPU execution.
Scalable across multiple GPUs.
Flexible graphs let developers create custom pipelines.
Extensible for user-specific needs with custom operators.
Accelerates image classification (ResNet-50) and object detection (SSD) workloads, as well as ASR models (Jasper, RNN-T).
Allows direct data path between storage and GPU memory with GPUDirect Storage.
Easy integration with NVIDIA Triton Inference Server with DALI TRITON Backend.
Open source.

Installing DALI

To install the latest DALI release for the latest CUDA version (11.x):

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110

DALI comes preinstalled in the TensorFlow, PyTorch, and MXNet containers on NVIDIA GPU Cloud (versions 18.07 and later).

For other installation paths (TensorFlow plugin, older CUDA version, nightly and weekly builds, etc.), please refer to the Installation Guide.

To build DALI from source, please refer to the Compilation Guide.

Examples and Tutorials

An introduction to DALI can be found in the Getting Started page.

More advanced examples can be found in the Examples and Tutorials page.

For an interactive version (Jupyter notebook) of the examples, go to the docs/examples directory.

Note: Select the Latest Release Documentation or the Nightly Release Documentation, which stays in sync with the main branch, depending on your version.

Additional Resources

GPU Technology Conference 2018; Fast data pipeline for deep learning training, T. Gale, S. Layton and P. Trędak: slides, recording.
GPU Technology Conference 2019; Fast AI data pre-processing with DALI; Janusz Lisiecki, Michał Zientkiewicz: slides, recording.
GPU Technology Conference 2019; Integration of DALI with TensorRT on Xavier; Josh Park and Anurag Dixit: slides, recording.
GPU Technology Conference 2020; Fast Data Pre-Processing with NVIDIA Data Loading Library (DALI); Albert Wolant, Joaquin Anton Guirao: recording.
Developer Page.
Blog Posts.

Contributing to DALI

We welcome contributions to DALI. To contribute to DALI and make pull requests, follow the guidelines outlined in the Contributing document.

If you are looking for a good first task, please check the issues marked with the "external contribution welcome" label.

Reporting Problems, Asking Questions

We appreciate feedback, questions and bug reports. When you need help with the code, follow the process outlined in the Stack Overflow document on minimal, reproducible examples (https://stackoverflow.com/help/mcve). Ensure that the posted examples are:

minimal: Use as little code as possible that still produces the same problem.
complete: Provide all parts needed to reproduce the problem. Check if you can strip external dependencies and still show the problem. The less time we spend on reproducing the problems, the more time we can dedicate to the fixes.
verifiable: Test the code you are about to provide, to make sure that it reproduces the problem. Remove all other problems that are not related to your request.

Acknowledgements

DALI was originally built with major contributions from Trevor Gale, Przemek Tredak, Simon Layton, Andrei Ivanov and Serge Panev.

Comments

Rework tutorials general
Why we need this PR?

Refactoring to improve docs in Tutorials/General section

What happened in this PR?

What solution was applied: Applied comments from the docs review by the Technical Writer. Changed the docs to use the functional API.

Affected modules and functionalities: Docs in Tutorials/General section

Key points relevant for the review: Are docs correct and understandable

Documentation (including examples): [ Describe here if documentation and examples were updated. ]

JIRA TASK: [Use DALI-1716]
opened by awolant 113
Optimize test build.

Why we need this PR?

We want to avoid downloading packages during CI build.

What happened in this PR?

Script for listing pip package configurations was extended to provide a list of all versions of all packages needed to be predownloaded. All pip installs now use /pip-packages directory to find predownloaded packages.

Signed-off-by: Rafal [email protected]

opened by banasraf 110
Add tutorials for Parallel External Source.
Description

[ ] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Refactoring (Redesign of existing code that doesn't affect functionality)

[x] Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

Two-part tutorial for the parallel external source. It includes two workarounds for the limitations of Jupyter Notebook:

writing the callable definition to a temporary file, so that it can be loaded and serialized.

using a clean notebook without a CUDA context, so that the 'fork' start method can be used.
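
The first workaround can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not the tutorial's code: the module name `es_callback_tmp`, the callback `sample_callback`, and the helper `load_callable_from_temp_module` are all hypothetical. It shows why a function that lives in an importable file can be pickled by reference, which is what process-based workers need.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Source of a hypothetical per-sample callback, as it might be typed in a notebook cell.
CALLBACK_SOURCE = '''
def sample_callback(sample_index):
    """Return a small, deterministic 'sample' derived from the index."""
    return [sample_index] * 4
'''


def load_callable_from_temp_module(source, module_name, attribute):
    """Write `source` to <tmp>/<module_name>.py, import it, and return one attribute."""
    tmp_dir = tempfile.mkdtemp()  # left on disk so the module stays importable
    Path(tmp_dir, module_name + ".py").write_text(source)
    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


callback = load_callable_from_temp_module(
    CALLBACK_SOURCE, "es_callback_tmp", "sample_callback"
)

# Functions are pickled by reference (module name + qualified name), so this
# round trip only works because the definition lives in an importable module.
restored = pickle.loads(pickle.dumps(callback))
```

A function defined directly in a notebook cell belongs to `__main__`, which a freshly started worker process cannot re-import; moving the definition into a file sidesteps that.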

Additional information

Affected modules and functionalities: Only tutorials

Key points relevant for the review: Mostly the wording, the intro might be longish.

Checklist

Tests

[x] Existing tests apply

[ ] New tests added

[ ] Python tests

[ ] GTests

[ ] Benchmark

[ ] Other

[ ] N/A

Documentation

[ ] Existing documentation applies

[x] Documentation updated

[ ] Docstring

[ ] Doxygen

[ ] RST

[x] Jupyter

[ ] Other

[ ] N/A

DALI team only

Requirements

[ ] Implements new requirements

[ ] Affects existing requirements

[x] N/A

REQ IDs: N/A

JIRA TASK: DALI-2215

Signed-off-by: Krzysztof Lecki [email protected]
opened by klecki 107
Improve how iterators count padded samples based on the reader
Signed-off-by: Janusz Lisiecki [email protected]

Why we need this PR?

Pick one, remove the rest

It adds a way to couple the DALI FW iterator with the reader and to pass the necessary information about shards and padding without user interaction

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

What solution was applied: Adds the ability to specify by name which reader drives the framework iterator. If a reader is provided, the iterator queries it for whether padding is used, whether the reader should stick to its shard, the shard ID, the number of shards, and the data set size. Based on that, the iterator calculates the shard size with padding and how many samples in the last batch of each shard are padded. ToDo: the TensorFlow operator deserves a set of options similar to the other FW iterators, but adding it would only enlarge this already big PR. Also, the TF operator doesn't support any of these options now, so this PR won't change anything (for good or bad) for the user
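
The arithmetic described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names are hypothetical, and the floor-based shard boundaries are an assumption about how a data set is split into shards.

```python
import math


def shard_bounds(dataset_size, num_shards, shard_id):
    """Return [start, end) sample indices of one shard (assumed floor-based split)."""
    start = dataset_size * shard_id // num_shards
    end = dataset_size * (shard_id + 1) // num_shards
    return start, end


def padded_samples_in_last_batch(dataset_size, num_shards, shard_id, batch_size):
    """Count the samples that only fill up the last batch of a shard."""
    start, end = shard_bounds(dataset_size, num_shards, shard_id)
    shard_size = end - start
    num_batches = math.ceil(shard_size / batch_size)
    return num_batches * batch_size - shard_size


# 10 samples split across 3 shards, consumed in batches of 4.
sizes = [shard_bounds(10, 3, s)[1] - shard_bounds(10, 3, s)[0] for s in range(3)]
padding = [padded_samples_in_last_batch(10, 3, s, 4) for s in range(3)]
```

With these numbers the shards hold 3, 3 and 4 samples, so the first two shards each need one padded sample to complete their only batch and the third needs none.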

Affected modules and functionalities: FW iterators, examples, reader API

Key points relevant for the review: New reader API, iterators logic

Validation and testing: New tests added

Documentation (including examples): Updated examples to use new API. Documented new methods.

JIRA TASK: [DALI-1417]
opened by JanuszL 106
Rework getting started
Why we need this PR?

Refactoring to improve Getting Started docs page

What happened in this PR?

What solution was applied:

Rework Getting Started to use fn API, decorator.

Move the old Getting Started page to a legacy API example

Affected modules and functionalities: Docs

JIRA TASK: [Use DALI-1822,1871]
opened by awolant 103
Extend conda testing
changes conda build version from the git sha to version+timestamp for regular builds

adds more conda test variants

adds build number to conda build

makes DALI use custom OpenCV (without threads and CUDA, with libjpeg-turbo) and FFmpeg (with mpeg4_unpack_bframes) builds, as the Conda-provided ones (with all features enabled) don't work well with mpeg4_unpack_bframes

makes all gtest and python basic tests pass

sets libvorbis version to 1.3.5 due to https://github.com/conda-forge/libvorbis-feedstock/issues/14

Signed-off-by: Janusz Lisiecki [email protected]

Why we need this PR?

Pick one, remove the rest

It adds more test variants for conda build and improves DALI conda build process

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

What solution was applied: changes conda build version from the git sha to version+timestamp for regular builds; adds build number to conda build; adds more conda test variants; makes DALI use custom OpenCV and FFmpeg builds

Affected modules and functionalities: conda build tests

Key points relevant for the review: NA

Validation and testing: CI

Documentation (including examples): NA

JIRA TASK: [NA]
opened by JanuszL 89
Make DALI buildable for Python 3.8
updates manylinux to build Python 3.8.1 and removes unsupported Python versions

bumps up pybind11 version to 2.4.2

adjusts tests to python 3.8

Signed-off-by: Janusz Lisiecki [email protected]

Why we need this PR?

Pick one, remove the rest

It adds the ability to build DALI for Python 3.8.x

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

What solution was applied: updates manylinux to build Python 3.8.1 and removes unsupported Python versions; bumps up pybind11 version to 2.4.2

Affected modules and functionalities: manylinux, pybind11

Key points relevant for the review: NA

Validation and testing: CI build

Documentation (including examples): NA

JIRA TASK: [DALI-1302]
opened by JanuszL 82
Add tutorial about TF DALI Dataset input handling
Description

[ ] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Refactoring (Redesign of existing code that doesn't affect functionality)

[x] Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

Example added.

Additional information

Affected modules and functionalities: TF docs

Key points relevant for the review: Read it through, check if I didn't mix the stuff that I wrap into Dataset or put into run/display.

Checklist

Tests

[ ] Existing tests apply

[ ] New tests added

[ ] Python tests

[ ] GTests

[ ] Benchmark

[ ] Other

[x] N/A

Documentation

[ ] Existing documentation applies

[x] Documentation updated

[ ] Docstring

[ ] Doxygen

[ ] RST

[x] Jupyter

[ ] Other

[ ] N/A

DALI team only

Requirements

[ ] Implements new requirements

[ ] Affects existing requirements

[x] N/A

REQ IDs: N/A

JIRA TASK: DALI-2229
opened by klecki 81
Enable DALI build and tests for SBSA
adds support for DALI build for SBSA (server base system architecture)

Signed-off-by: Janusz Lisiecki [email protected]

Why we need this PR?

Pick one, remove the rest

It enables DALI build and tests for SBSA

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

What solution was applied: adds support for DALI build for SBSA (server base system architecture)

Affected modules and functionalities: cmake, build.sh, video test

Key points relevant for the review: NA

Validation and testing: CI

Documentation (including examples): updated docs to reflect a new build configuration

JIRA TASK: [DALI-1102]
opened by JanuszL 80
Run external source callback in parallel
Why we need this PR?

It adds an option to run per-sample external source callbacks in process-based Python workers.

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

What solution was applied: Added process-based workers using the multiprocessing module, added a custom wrapper around shared memory and mmap to avoid unnecessary copies when passing data between workers, utilized the no_copy mode of external source, added prefetching of batches.
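
The idea of passing batches through shared memory can be illustrated with Python's standard `multiprocessing.shared_memory` module. This sketch is not the custom wrapper added in the PR; the helper names are hypothetical, and for simplicity both sides run in one process, whereas in practice the writer would be a worker process and the reader the main process.

```python
import numpy as np
from multiprocessing import shared_memory


def write_batch_to_shared_memory(batch):
    """Producer side: place a NumPy batch in a new shared memory block."""
    shm = shared_memory.SharedMemory(create=True, size=batch.nbytes)
    view = np.ndarray(batch.shape, dtype=batch.dtype, buffer=shm.buf)
    view[...] = batch  # the only copy on the producer side
    del view  # drop the buffer export so the block can be closed later
    return shm


def read_batch_from_shared_memory(name, shape, dtype):
    """Consumer side: attach to the block by name and view it as an array."""
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    # A real consumer could hand `view` onward without copying; the copy here
    # only exists so this sketch can close the block right away.
    result = view.copy()
    del view
    shm.close()
    return result


batch = np.arange(12, dtype=np.float32).reshape(3, 4)
block = write_batch_to_shared_memory(batch)
restored = read_batch_from_shared_memory(block.name, batch.shape, batch.dtype)
block.close()
block.unlink()  # release the underlying memory
```

Only the block name, shape and dtype need to travel between processes, which is far cheaper than pickling the array contents.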

Affected modules and functionalities: Mostly python wrappers around pipeline and external source. Added shared_mem.cc util.

Key points relevant for the review: [ Describe here what is the most important part that reviewers should focus on. ]

Validation and testing: Prepared a benchmark test to compare the parallelized external source with the CPU FileReader and the sequential external source, both in training and as a plain pipeline just augmenting the data

Documentation (including examples): Added relevant parameters description to ExternalSource and Pipeline. Documented shared_mem, shared_batch, worker and pool modules.

DALI-1651
opened by stiepan 79
Use default resources for allocating tensors
Why we need this PR?

Pick one, remove the rest I'll take two ;)

It adds new feature needed for better memory management

Refactoring to improve memory management

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

What solution was applied:

Use core alloc to allocate backend-specific memory.

Remove old allocation infrastructure entirely.

Affected modules and functionalities:

pipeline/backend

pipeline/init

pipeline/allocator (removed)

pipeline buffer (simplified)

Key points relevant for the review:

Everything!

Validation and testing:

Existing tests apply

Documentation (including examples):

N/A

JIRA TASK: DALI-2027
opened by mzient 78

[QUESTION] Image histogram example

Hi, thank you very much for your work, DALI is really a great library. DALI accelerated my model training by almost 19 times compared to another augmentation library, which is really impressive. I'm still missing a few things, though. For example, I want to use a custom augmentation and I need to calculate the image histogram in the following way (I need it for Otsu thresholding):

import cv2
import numpy as np

IMAGE_PATH = "image.jpeg"


if __name__ == "__main__":
    image = cv2.imread(IMAGE_PATH)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    ...
    # Find normalized_histogram, and its cumulative distribution function
    hist = cv2.calcHist([image], [0], None, [256], [0, 256])
    hist_norm = hist.ravel() / hist.sum()
    Q = hist_norm.cumsum()

Could you please guide me? I've tried something like this:

def dali_calc_hist(decoded_images: DALITensorList, image_sizes: DALITensorList):
    gray = fn.color_space_conversion(
        decoded_images,
        image_type=dali_types.RGB,
        output_type=dali_types.GRAY,
    )
    # One slot per gray level 0..255
    hist = [0] * 256
    cumsum = [0] * 256
    n = image_sizes[0] * image_sizes[1]
    for gray_lvl in range(256):
        hist[gray_lvl] = fn.reductions.sum(
            fn.cast(gray == gray_lvl, dtype=dali_types.UINT8)
        )
        cumsum[gray_lvl] = hist[gray_lvl] / n

but I really feel that I'm doing something wrong, because in hist and cumsum I actually have nested DataNodes.
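
For reference, the quantities that the OpenCV snippet computes (the normalized histogram and its cumulative distribution), together with the Otsu threshold derived from them, can be sketched in plain NumPy. This only illustrates the math and is not a DALI operator graph; the function names are hypothetical, and how best to run such code inside a DALI pipeline is left open.

```python
import numpy as np


def normalized_hist_and_cdf(gray):
    """Return the normalized 256-bin histogram of a uint8 image and its CDF."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    hist_norm = hist / hist.sum()
    return hist_norm, hist_norm.cumsum()


def otsu_threshold(gray):
    """Pick the gray level that maximizes the between-class variance."""
    hist_norm, q = normalized_hist_and_cdf(gray)
    mu = (hist_norm * np.arange(256)).cumsum()  # cumulative mean up to each level
    mu_total = mu[-1]
    denom = q * (1.0 - q)
    valid = denom > 0  # skip levels where one of the two classes is empty
    sigma_b = np.zeros(256)
    sigma_b[valid] = (mu_total * q[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


# Synthetic two-level image: half of the pixels are 50, half are 200.
image = np.array([[50, 50, 200, 200]] * 4, dtype=np.uint8)
hist_norm, Q = normalized_hist_and_cdf(image)
threshold = otsu_threshold(image)
```

Computing the histogram in a single vectorized pass avoids the 256 separate comparisons and reductions of the loop-based attempt.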

opened by xevolesi 0

Move CheckAxes to utils.h
Signed-off-by: Joaquin Anton [email protected]

Category:

Refactoring

Description:

Extracting CheckAxes and AdjustAxes to common kernel utilities (to be reused in tensor resize)

Additional information:

Affected modules and functionalities:

Reduction kernels

Key points relevant for the review:

NA

Tests:

[x] Existing tests apply

[ ] New tests added

[ ] Python tests

[ ] GTests

[ ] Benchmark

[ ] Other

[ ] N/A

Checklist

Documentation

[x] Existing documentation applies

[ ] Documentation updated

[ ] Docstring

[ ] Doxygen

[ ] RST

[ ] Jupyter

[ ] Other

[ ] N/A

DALI team only

Requirements

[ ] Implements new requirements

[ ] Affects existing requirements

[x] N/A

REQ IDs: N/A

JIRA TASK: N/A
opened by jantonguirao 15

Process a single image into multiple patches

Hello, I'm working on a Data Loader optimization towards a super resolution problem.

I need to feed Low-Resolution and High-Resolution patches (e.g., 100x100 and 200x200 patches). My first version of the Data Loader decodes an input image, extracts a random crop (High-Resolution patch) and generates the Low-Resolution patch. The code snippet below shows the implemented logic:

    @pipeline_def(device_id=0, batch_size=8)
    def get_dali_pipeline(self):
        """! Builds Nvidia Dali pipeline for dataset iteration."""
        hr_images, _ = fn.readers.file(
            files=self.paths, read_ahead=False,
            shuffle_after_epoch=True, name="file_reader", device="cpu"
        )
        hr_images = fn.decoders.image(hr_images, device="cpu")

        # Generates HR and LR Patches
        hr_patches = fn.crop(
            hr_images,
            crop=[self.hr_height, self.hr_width],
            crop_pos_x=fn.random.uniform(range=(0.0, 1.0)),
            crop_pos_y=fn.random.uniform(range=(0.0, 1.0)),
            dtype=types.FLOAT,
            out_of_bounds_policy="trim_to_shape",
            device="cpu"
        ).gpu()
        # out = scale * (in - mean) / stddev + shift
        # To normalize as usual (0 - 255) -> (0 - 1)
        # scale = 1 | mean = 0 | stddev = 255 | shift = 0
        # out = 1 * (in - 0) / 255 + 0
        hr_patches = fn.normalize(
            hr_patches,
            dtype=types.FLOAT,
            scale=1.0, mean=0.0, stddev=255, shift=0,
            device="gpu"
        )
        lr_patches = fn.resize(
            hr_patches,
            size=[self.lr_height, self.lr_width],
            interp_type=types.INTERP_LINEAR, # Instead of TensorFlow Bilinear
            antialias=False,
            dtype=types.FLOAT,
            device="gpu"
        )

        return lr_patches, hr_patches

However, the Data Loader is still too slow. According to my investigations, decoding each image took approximately 300ms. Since each image generates a single patch, to feed a batch of size 8, the data loader needs to decode 8 images.

Therefore, I want to modify the data loader to feed N patches for each image. For example, by extracting 64 patches from an image, the data loader will feed 8 batches decoding a single image. By doing that, I expect to speed up the data loader.

However, after looking through the NVIDIA DALI library, I could not find a way to implement this logic, that is, to repeat the random crop operation on a single image. How do you suggest I tackle this problem?
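
Independently of DALI, the desired logic (decode once, then cut many random patches from the same image) can be sketched in NumPy. This is an illustration only: the function names are hypothetical, and the 2x average pooling used to derive the low-resolution patches is a simple stand-in for the linear resize in the pipeline above, not an equivalent of it.

```python
import numpy as np


def extract_random_patches(image, num_patches, patch_h, patch_w, rng):
    """Cut `num_patches` random crops of size patch_h x patch_w from one HWC image."""
    height, width = image.shape[:2]
    ys = rng.integers(0, height - patch_h + 1, size=num_patches)
    xs = rng.integers(0, width - patch_w + 1, size=num_patches)
    return np.stack([image[y:y + patch_h, x:x + patch_w] for y, x in zip(ys, xs)])


def downscale_by_2(patches):
    """Halve height and width of NHWC patches by averaging 2x2 pixel blocks."""
    n, h, w, c = patches.shape
    return patches.reshape(n, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))


rng = np.random.default_rng(0)
# Stand-in for one decoded image, already scaled to the 0..1 range.
decoded = rng.random((300, 400, 3), dtype=np.float32)

hr_patches = extract_random_patches(decoded, 64, 200, 200, rng)  # 64 patches per decode
lr_patches = downscale_by_2(hr_patches)
```

With 64 patches per image and a batch size of 8, a single decode feeds 8 batches, which is the amortization described in the question.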

help wanted

opened by Fcsalvagnini 1

Inplace operator support
Hello, I wanted to ask whether it is possible to create in-place operations. I have a pretty big DALI pipeline (in terms of image size) and I have to preprocess data, but each operation creates a copy of the data, which results in a DALI preprocessing pipeline that consumes around 8 GB of memory.

DALI version: 1.22.0dev

My neural network has an input size of 3 images with batchx3x5000x10000.

The pipeline consists of these steps:

3 encoded 16-bit TIFF images (900 MB)

nvidia.dali.fn.experimental.decoders.image (900 MB)

nvidia.dali.fn.transpose (900 MB)

nvidia.dali.fn.cast (1,800 MB)

division operator (1,800 MB)

nvidia.dali.fn.stack (1,800 MB)

This takes around 8.1 GB of GPU memory just for pre-processing.
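
The per-stage figures can be reproduced with a few lines of arithmetic. The sketch below assumes a batch size of 1, 16-bit data up to the transpose, and a cast to 32-bit floats (which matches the doubling to 1,800 MB); the encoded input size is taken as reported rather than computed.

```python
import numpy as np


def stage_megabytes(num_images, shape, dtype):
    """Size in MB (10**6 bytes) of `num_images` dense tensors of the given shape."""
    elements = num_images * int(np.prod(shape))
    return elements * np.dtype(dtype).itemsize / 1e6


shape = (3, 5000, 10000)  # channels x height x width of one image
stages = {
    "encoded TIFF input (as reported)": 900.0,
    "decoders.image": stage_megabytes(3, shape, np.uint16),
    "transpose": stage_megabytes(3, shape, np.uint16),
    "cast": stage_megabytes(3, shape, np.float32),  # assumed float32 target
    "division": stage_megabytes(3, shape, np.float32),
    "stack": stage_megabytes(3, shape, np.float32),
}
total_mb = sum(stages.values())
```

Every stage that writes a new output buffer adds its full size to the total, which is why in-place variants of the last three steps alone would cut the footprint noticeably.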

I am using DALI with Triton Inference Server, and this is an issue because the TensorRT model takes only around 1 GB of memory while the pre-processing takes 8 times more. If some of the operations were in-place, it would greatly improve memory usage on the server side. Is there a plan or a way to enable this?

Thanks in advance
question
opened by appearancefnp 2

DALI compatibility with TF2.11.0

Hi, I wanted to ask if DALI is currently compatible with the latest version of TensorFlow, 2.11.0. I installed TensorFlow via pip in a conda environment (see here for more information). When I use TensorFlow 2.10.0, NVIDIA DALI works perfectly, but with version 2.11.0 I get this error on import:

NotFoundError                             Traceback (most recent call last)
Cell In [1], line 6
      4 import os
      5 from nvidia.dali import pipeline_def
----> 6 import nvidia.dali.plugin.tf as dali_tf
      7 import tensorflow.compat.v1 as tf_v1
      8 import logging

File ~/anaconda3/envs/tf-2.11.0/lib/python3.9/site-packages/nvidia/dali/plugin/tf.py:36
     32 from nvidia.dali_tf_plugin import dali_tf_plugin
     34 from collections.abc import Mapping, Iterable
---> 36 _dali_tf_module = dali_tf_plugin.load_dali_tf_plugin()
     37 _dali_tf = _dali_tf_module.dali
     38 _dali_tf.__doc__ = _dali_tf.__doc__ + """
     39 
     40     Please keep in mind that TensorFlow allocates almost all available device memory by default.
     41     This might cause errors in DALI due to insufficient memory. On how to change this behaviour
     42     please look into the TensorFlow documentation, as it may differ based on your use case.
     43 """

File ~/anaconda3/envs/tf-2.11.0/lib/python3.9/site-packages/nvidia/dali_tf_plugin/dali_tf_plugin.py:52, in load_dali_tf_plugin()
     50             first_error = error
     51 else:
---> 52     raise first_error or Exception(
     53         'No matching DALI plugin found for installed TensorFlow version')
     55 return _dali_tf_module

File ~/anaconda3/envs/tf-2.11.0/lib/python3.9/site-packages/nvidia/dali_tf_plugin/dali_tf_plugin.py:45, in load_dali_tf_plugin()
     43 for libdali_tf in processed_tf_plugins:
     44     try:
---> 45         _dali_tf_module = tf.load_op_library(libdali_tf)
     46         break
     47     # if plugin is not compatible skip it

File ~/anaconda3/envs/tf-2.11.0/lib/python3.9/site-packages/tensorflow/python/framework/load_library.py:54, in load_op_library(library_filename)
     31 @tf_export('load_op_library')
     32 def load_op_library(library_filename):
     33   """Loads a TensorFlow plugin, containing custom ops and kernels.
     34 
     35   Pass "library_filename" to a platform-specific mechanism for dynamically
   (...)
     52     RuntimeError: when unable to load the library or get the python wrappers.
     53   """
---> 54   lib_handle = py_tf.TF_LoadLibrary(library_filename)
     55   try:
     56     wrappers = _pywrap_python_op_gen.GetPythonWrappers(
     57         py_tf.TF_GetOpList(lib_handle))

NotFoundError: /home/pietro/anaconda3/envs/tf-2.11.0/lib/python3.9/site-packages/nvidia/dali_tf_plugin/libdali_tf_2_10.so: undefined symbol: _ZNK10tensorflow4data11DatasetBase8FinalizeEPNS_15OpKernelContextESt8functionIFNS_8StatusOrISt10unique_ptrIS1_NS_4core15RefCountDeleterEEEEvEE

Is this error because Nvidia DALI is not yet compatible with TF2.11.0 or is it for some other reason?

question

opened by pietroorlandi 3

Supporting prefetch between pipeline stages

Hi all,

recently we've been exploring a change in our Cassandra-DALI plugin to allow the list of UUIDs (i.e., the "names" of the images) to be provided as an input to the reader, instead of as a function argument. This would allow more flexibility in the pipeline, which we could exploit in a particular application we are now targeting.

As a comparison, this would be similar to having the standard FileReader accept the list of files as an input from the previous module, instead of using the file_root argument.

The problem with this change is that it destroys our ability to concurrently prefetch images before passing them to the rest of the pipeline. The pipeline-level prefetching, as I understand it, always runs the whole pipeline; while it can help mitigate the variance in loading time, it doesn't allow more loads to run in parallel.

It would be useful if there were a way to allow prefetching also at a stage level, where each module can directly request inputs from the previous one. In this way a module could, at setup time, fill its own prefetch queue before starting to produce outputs, as it is done now internally by the standard FileReader or, in our case, by our Cassandra reader.

I understand that this might require some major changes in the pipeline design, so I'd like to know whether you're considering supporting this kind of finer-grained, per-stage prefetch sometime in the future, or whether you've already discussed it and ruled it out.
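
To make the request concrete, here is a rough, DALI-independent sketch of what a stage with its own prefetch queue could look like. Everything in it is hypothetical (the name `prefetching_stage`, the `load` callable, the `depth` parameter); it only shows a stage that pulls keys from the previous stage and keeps several loads in flight while yielding results in order.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetching_stage(keys, load, depth=4):
    """Consume keys from upstream, keep up to `depth` loads in flight, yield in order."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        in_flight = deque()
        for key in keys:
            in_flight.append(pool.submit(load, key))
            if len(in_flight) >= depth:
                # The queue is full: hand the oldest result downstream
                # before requesting another key from the previous stage.
                yield in_flight.popleft().result()
        while in_flight:  # drain what is left once upstream is exhausted
            yield in_flight.popleft().result()


# Hypothetical upstream stage producing identifiers, and a trivial "load".
upstream = iter(["a", "b", "c", "d", "e"])
loaded = list(prefetching_stage(upstream, str.upper, depth=2))
```

The stage fills its queue lazily from the upstream iterator, so the list of identifiers no longer has to be known when the reader is constructed.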

Thanks!

opened by fversaci 2

Releases(v1.21.0)

v1.21.0(Dec 28, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements:

Added the following experimental image decoding operators with support for higher dynamic ranges (#4223):

experimental.decoders.image

experimental.decoders.image_crop

experimental.decoders.image_random_crop

experimental.decoders.image_slice

Added the GPU debayer operator (#4495, #4486).

Fixed Issues

The following issues were fixed in this release:

Fixed the issue where the GPU numpy reader was crashing on a DALI process teardown with cufile 1.4.0 (#4466).

Fixed the issue where the GPU video decoder was failing in multi-GPU settings (#4517).

Improvements

Optimizing ShiftPixelCenter kernel configuration (#4430).

Update "Compiling from source" tutorial (#4010).

Imgcodec's decode operator (#4223).

Move to use CMake in DALI deps where possible (#4445).

Bump supported tf version (#4459).

Optimize inflate tests (#4456).

Execute whole Keras code in the expected device scope (#4462).

Update the TensorFlow test to work with 2.11.x (#4460).

Crop rounding argument to control the conversion of anchors to integral values (#4461).

Make Transpose's perm argument optional (by default, reverse dims) (#4465).

Add CastLike operator (#4467).

Accept negative axis in Cat and Stack operators (#4468).

Code drop AutoGraph based on TensorFlow 2.10.0 (#4485).

Remove build and doc files from AutoGraph (#4489).

Rearrange AutoGraph tests (#4490).

Adjust the documentation template for the latest sphinx_rtd_theme (#4481).

Bump the nvidia-tensorflow to 22.11 in tests (#4472).

Improve error reporting in the video decoder (#4484).

Move to generic CUDA_CALL for nvCOMP (#4474).

Extend the warning about the lack of the necessary CUDA libraries (#4473).

Allow negative axes in reductions module (#4470).

Add kernel-wrapper around NPP debayer calls (#4486).

Remove TF-specific codepaths from AutoGraph (#4491).

Lint the AutoGraph code (#4494).

Add bytes_per_sample_hint parameter to parallel external source (#4155).

Add debayer operator (#4495).

Remove trailing comments from .flake.ag (#4497).

Update DALI_DEPS_VERSION (#4496).

Deprecate CUDA 10.2 (#4503).

Extract CachingList from ExternalSource (#4501).

Bug Fixes

Do not call nvcomp with no input (#4434).

Fix libtiff CVE-2022-3970 (#4448).

TL3 SSD Install pycocotools from latest NVIDIA cocoapi repo (#4457).

Fix numpy reader crash (#4466).

Fix stub generation for dynamic linking (#4478).

Fix issues found by static analysis (#4477).

Fix PES tests with Python3.6/3.7 (#4500).

Patch FFmpeg for CVE-2022-3965, CVE-2022-3964 (#4499).

Fix video decoder cache for multiple GPUs (#4517).

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

DALI 1.21 is the final release that will support CUDA 10.2.

Known issues:

The GPU numpy reader might crash during the DALI process teardown with cufile 1.4.0.

The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream.
If the key frames occur at a frequency that is less than 10-15 frames, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.21.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.21.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.21.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.21.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.21.0-6799317-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.21.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.21.0-6799315-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.21.0-6799315-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.21.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.20.0(Nov 30, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements:

Added the fn.experimental.remap operator for generic geometric transformation of images and video (#4379, #4419, #4365, #4374, #4425).

Added MPEG4 support to the GPU video decoder (#4424, #4327).

Added the fn.experimental.inflate operator that enables decompression of LZ4 compressed input (#4366).

Added support for broadcasting in arithmetic operators (CPU and GPU) (#4348).

Added experimental split and merge operators for conditional execution (#4359, #4405, #4358).

Made the following optimizations in GPU operators:

Optimized MelScale kernel (#4395).

Optimizations in the GPU decoder (#4351).

Simplified arithmetic GPU operator (#4411).

Split reduction kernels (#4383).

Avoided copying from non-pinned memory in the PreemphasisFilter operator (#4380).

Refactored the ConvertTimeMajorSpectrogram kernel (#4389).
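
To make the broadcasting support listed above (#4348) more concrete, here is a minimal sketch of how an output shape is derived, assuming NumPy-style broadcasting rules (shapes are aligned from the innermost dimension, and an extent of 1 stretches to match). The helper broadcast_shape is hypothetical and is not a DALI function; the DALI documentation remains the authority on the exact rules.

```python
from itertools import zip_longest


def broadcast_shape(*shapes):
    """Return the shape produced by NumPy-style broadcasting of `shapes`."""
    result = []
    # Walk all shapes from the innermost (last) dimension outwards;
    # missing outer dimensions are treated as extent 1.
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        extents = {d for d in dims if d != 1}
        if len(extents) > 1:
            raise ValueError(f"shapes {shapes} cannot be broadcast together")
        result.append(extents.pop() if extents else 1)
    return tuple(reversed(result))


# An HWC image combined with a per-channel vector keeps the image shape.
print(broadcast_shape((480, 640, 3), (3,)))  # (480, 640, 3)
# A column and a row expand to a full grid.
print(broadcast_shape((2, 1), (1, 5)))       # (2, 5)
```

Under these assumed rules, a per-channel scale of shape (3,) can be applied to an HWC image without first tiling it to the full image shape.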

Fixed Issues

The following issues were fixed in this release:

Fixed TensorList copy synchronization issues (#4458, #4453).

Fixed an issue with hint grid size in OpticalFlow (#4443).

Fixed the ES synchronization issues in integrated memory devices (#4321, #4423).

Added a missing CUDA stream synchronization before cuvidUnmapVideoFrame in nvDecoder (#4426).

Fixed the pipeline initialization in Python after deserialization (#4350).

Fixed issues with serialization of functions in recent notebook versions (#4406).

Fixed the integration with newer TensorFlow versions by replacing Status::OK() with Status() in the TF plugin (#4442).

Improvements

Update dependencies 22/11 (#4427)

fn.experimental.remap optimizations (#4419)

Add mkv support (#4424)

Add inflate operator (#4366)

Include nvCOMP's license and notice in the acknowledgements (#4368)

Use numpy instead of naive loops in remap test. (#4425)

MelScale kernel optimization (#4395)

Optimize GPU decoder (#4351)

Simplify arithmetic operator GPU implementation (#4411)

Add CVE reporting guideline to the repo and readme (#4385)

Add internal Split and Merge operators (#4359)

Fix fstring usage for warning in pipeline (#4401)

Add fn.experimental.remap operator (#4379)

Divide expression_impl to avoid recompiling all ops when touching a detail in the impl (#4412)

Refactor ConvertTimeMajorSpectrogram kernel (#4389)

Remove documentation about data_layout argument for paddle and pytorch iterators (#4409)

Serialize failing global functions by value (#4406)

Limit the TF memory usage in test_dali_tf_dataset_shape.py tests (#4400)

Split reduction kernels (#4383)

Add convenient conversions from a list of arrays to DALI TensorList (#4391)

Add permute_in_place function with tests. (#4387)

Split cuda utils.h & fix includes (#4386)

Enable MPEG4 GPU decoding (#4327)

Update CUDA toolkit for Jetson build to 11.8 (#4376)

Remove TensorFlow 1.15 support from CUDA 11 (#4377)

Avoid copying from non-pinned memory in PreemphasisFilter operator (#4380)

Support broadcasting in arithmetic operators (CPU & GPU) (#4348)

Remove unnecessary reset in the PyTorch SSD example (#4373)

Remap kernel implementation with NPP (#4365)

Utils and prerequisites for NppRemapKernel implementation (#4374)

Extend DALIInterpType to_string (#4370)

Validate ROI in imgcodec (#4279)

Workspace unification (#4339)

Extend and relax TensorList sample APIs (#4358)

Remove the Pipeline/Executor completion callback APIs (#4345)

Bug Fixes

Fix H2H copy in HW NVJPEG. (#4458)

Fix an issue with improper hint grid size in OpticalFlow (#4443)

Enable support for full-swing videos (#4447)

Fix TensorList copy ordering issues (#4453)

Replace Status::OK() with Status() for TF plugin (#4442)

Add a CUDA stream synchronization before cuvidUnmapVideoFrame in nvDecoder (#4426)

Fix ES synchronization issues in integrated memory devices (#4321)

Fix debug build warnings in the inflate op (#4433)

Fix ExecutorSyncTest that runs the SimpleExecutor twice (#4432)

Fix setting the pinned status of the tensor list in Python (#4431)

Pinned resource test fix: reset the device buffer on a proper stream. (#4428)

Fix libtiff CVEs (#4414)

Fix pinned resource test on integrated GPUs (#4423)

Fix builtin test - do not use operators lib (#4420)

Harden the code against ODR violations (#4421)

Unroll nested namespaces (#4415)

Add proper validation for empty batch in External Source (#4404)

Fix video decoder test for aarch64 (#4402)

Fix to enable leading underscore in op name (#4405)

Serialize failing global functions by value (#4406)

Add cuh files to linter (#4384)

Avoid reading out of bounds (#4398)

Fix namespace resolution for CUDA and STL math functions (#4378)

Fix unnecessary copy of the workspace object. (#4371)

Fix pipeline initialization in python after deserialization (#4350)

Fix misleading video example with timestamps (#4364)

Fix sanitizer build tests (#4367)

Breaking API changes

Removed the Pipeline/Executor completion callback APIs (#4345).

[C++ API] Workspace unification: C++ workspace is no longer templated with backend type (#4339).

Deprecated features

DALI will drop support for CUDA 10.2 in an upcoming release.

Known issues:

The GPU numpy reader might crash during the DALI process teardown with cufile 1.4.0.

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream.
If key frames occur less frequently than that, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker.
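
The key-frame requirement in the known issues above can be checked ahead of time. The sketch below is an illustration only: the function names are hypothetical, the default threshold of 15 is taken from the upper end of the range in the note, and the key-frame indices are assumed to have been extracted beforehand with a separate tool (ffprobe, for instance, can report which frames are key frames).

```python
def max_key_frame_gap(key_frames, num_frames):
    """Largest number of frames covered by a single key frame.

    `key_frames` is a sorted list of key-frame indices, starting at 0;
    `num_frames` is the total number of frames in the stream.
    """
    if not key_frames or key_frames[0] != 0:
        raise ValueError("expected sorted key-frame indices starting at 0")
    # Append the stream length so the last group of frames is measured too.
    boundaries = list(key_frames) + [num_frames]
    return max(b - a for a, b in zip(boundaries, boundaries[1:]))


def key_frames_dense_enough(key_frames, num_frames, max_gap=15):
    """Return True if no stretch between key frames exceeds `max_gap` frames."""
    return max_key_frame_gap(key_frames, num_frames) <= max_gap


print(key_frames_dense_enough([0, 12, 24, 36], 48))  # True
print(key_frames_dense_enough([0, 30, 60], 90))      # False
```

A stream that fails this check may need to be re-encoded with a shorter key-frame interval before it is passed to the video loader.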

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.20.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.20.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.20.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.20.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.20.0-6562492-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.20.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.20.0-6562491-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.20.0-6562491-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.20.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.19.0 (Nov 2, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements:

Added the experimental.decoders.video stand-alone video decoder, which decodes video provided as an in-memory buffer (for example, through an external source) on the GPU or CPU (#4354, #4296).

Added support for decoding indexless videos (#4347, #4302, and #4335).

Fixed Issues

The following issues were fixed in this release:

Fixed the handling of Caffe LMDB empty samples (without data or labels) (#4266).

Improvements

Exclude HEVC files from video decoder test. (#4357)

Fix a typo in Debug Mode documentation (#4355)

Parallelize gpu video decoding (#4354)

Make tests for DALI linked dynamically with CUDA more flexible (#4341)

Update MXNet version used in tests (#4342)

Enable indexless video decoding for GPU (#4347)

Prevent obtaining handle values from dead unique handles and stream leases. (#4346)

Update broadcasting shape simplification logic (#4314)

Add warning about the end of support for CUDA 10.2 (#4334)

Frames decoder gpu without index (#4302)

Enable indexless decoding in CPU video decoder (#4335)

Update outdated links in the documentation (#4329)

Add Mixed VideoDecoder (#4296)

Update cutlass and DALI_deps revision. (#4328)

Fixes and performance improvements in imgcodec/nvjpeg (#4318)

Update Jetson build env to support CUDA 11.4 and Orin (#4250)

Update nvJPEG2k version to 0.6.0 (#4320)

Add missing documentation to (Future)DecodingResult(Promise). (#4310)

Update libcudacxx target macros for clang and SM90. (#4315)

Don't use nvjpegGetHardwareDecoderInfo in pre-11.8 toolkits. (#4325)

Prune static cuda libraries DALI links with from unused archs (#4317)

Fix clang warnings (#4312)

Add pass-through tracking to auto-pinning buffers (#4294)

Update protobuf (v21.5 to v21.7) (#4313)

Extended ImageDecoder tests (#4297)

Refactor OpSchema - move implementation to one translation unit (#4293)

Emit the warning about the default value change only when using the default. (#4214)

Reduce the batch size in RN50 data pipeline tests. (#4304)

Enable ROI adjustment for multi-frame inputs + cleanup. (#4303)

Use GPU Convert in nvJPEG decoder (#4247)

Aggregating ImageDecoder (#4224)

Support palette TIFFs (#4206)

Refactor video decoder for reusability (#4290)

Add ROI support to nvJPEG (#4244)

RemapKernel API (#4284)

Presteps to image_decoder.* APIs (#4277)

Add frames decoder CPU without index (#4278)

Add experimental.decoders.video for CPU (#4270)

Fix a typo in the documentation (#4258)

Add orientation to GPU image data Convert (#4232)

Fix hang in TL1_tensorflow-dali_test (#4255)

Make test_dltensor_operator.py consistent when the HW decoder is available (#4272)

Fix issues in DALI in action snippet (#4268)

Assure operator documentation links to enum types (#4264)

Support applying orientation in Convert (#4219)

Add image decoder registry. (#4261)

Support tiled TIFFs (#4201)

Bump up TensorFlow version in tests (#4238)

Bug Fixes

Fix coverity issues (#4349)

Revert pruning of unused architectures (#4336)

Fix order of access order waiting in TL's set_order (#4338)

Fix NVJPEG pinned buffer synchronization. (#4337)

Change the default order of data storage objects (#4276)

Fix checking of the return status of the bundle lib tests (#4330)

Fix executor test - add test operators (#4323)

Fix parameter propagation in ImageDecoder. (#4309)

Fix normalization when running GPU color space conversion (#4285)

Fix support for ANY_DATA in nvJPEG2K (#4299)

Fix inconsistent tensor recreation in TensorList (#4286)

Fix no ffmpeg build (#4288)

Fix libtiff error handling (#4274)

Fix imgcodec batched APIs and tests (#4263)

Fix handling of Caffe LMDB without valid data (#4266)

Move params in PerThreadResources move constructor (#4265)

Fix fusing the dimensions in SliceFlipNormalizePermutePadGpu (#4234)

Improve error handling in LibTiffDecoder (#4210)

Fix exception handling in BatchParallelDecoderImpl (#4262)

Make nvjpeg decoder use its own thread pool (#4241)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

DALI will drop support for CUDA 10.2 in an upcoming release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream.
If key frames occur less frequently than that, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker.
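
The workaround for the GPU external source issue listed above amounts to synchronizing the device after the callback has produced its data and before that data is returned. A minimal sketch of that pattern follows. The decorator name is hypothetical and not part of DALI, and the synchronization function is supplied by the caller; in a real pipeline it might be torch.cuda.synchronize or a CuPy device synchronization, depending on which library produces the GPU data. A stand-in is used here so the example runs without a GPU.

```python
import functools


def synchronize_before_return(synchronize):
    """Decorator factory: call `synchronize()` once the wrapped callback has
    produced its data and before that data is handed back to the caller."""
    def decorator(callback):
        @functools.wraps(callback)
        def wrapper(*args, **kwargs):
            data = callback(*args, **kwargs)
            synchronize()  # on a real system, e.g. torch.cuda.synchronize
            return data
        return wrapper
    return decorator


# Stand-in demonstration: record the order of events instead of touching a GPU.
events = []


@synchronize_before_return(lambda: events.append("sync"))
def produce_batch():
    events.append("produce")
    return [1, 2, 3]


batch = produce_batch()
print(events)  # ['produce', 'sync']
```

Keeping the synchronization in a wrapper leaves the callback itself unchanged, so the wrapper can be dropped once the underlying issue no longer applies.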

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.19.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.19.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.19.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.19.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.19.0-6205437-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.19.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.19.0-6205436-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.19.0-6205436-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.19.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.18.0 (Oct 5, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements:

Unified batch representation in the GPU and CPU stages of the pipeline (effort towards conditional execution) (#4253, #4236, #4220, #4189).

Added support to specify the fill_value argument for each sample in the fn.erase operator (#4182).

Added support for in-memory video files in FramesDecoder (#4184).

Moved the audio_resample operator out of the experimental module (#4194).

Fixed Issues

The following issues were fixed in this release:

Fixed an unnecessary synchronization in MakeContiguous (#4248).

Fixed the Python tool to create the webdataset index (#4226).

Added a fix to prevent DALI from allocating GPU memory when constructing CPU TensorList (#4203).

Fixed a PyTorch example to comply with newer PyTorch versions (#4213).

Improvements

GPU image data conversion (#4208)

Fix libtiff and libtar vulnerabilities (#4245)

Update third party dependencies (#4233)

Reduce batch size in the WebDataset integration using External Source example (#4240)

Rename the set and copy sample APIs in TensorList (#4236)

Move nvjpeg decoder files to imgcodec/decoders/nvjpeg/ (#4235)

Add Nvjpeg decoder (#4178)

Rename TensorVector to TensorList (#4220)

Make JPEG HW decoder test to fully use HW and not hybrid approach (#4222)

Add bulk parameter passing to decoders and factories. (#4212)

Support any bitdepth in TIFF (#4180)

Remove TensorList and use only TensorVector (#4189)

[imgcodec] API adjustments (#4205)

ROI support for nvjpeg2k decoder (#4175)

Use deprecated PIL resampling import for Python 3.6, due to lack of availability of a newer version of PIL (#4200)

Add arithmetic expression broadcasting utils (#4188)

Support higher TIFF bitdepths (#4174)

Enable per-sample fill_value argument in Erase operator (#4182)

Fix python linter errors for the qa/ directory (#4117)

Fix usage of deprecated np.float in tests (#4192)

Adjust PIL interpolation types to module PIL.Image.Resampling (#4195)

Move audio_resample out of experimental module (#4194)

Support different layouts in imgcodec's Convert (#4157)

Fix typos in iterator last_batch_policy argument documentation (#4170)

Fix synchronization in external source tests (#4153)

Add support for memory video file in FramesDecoder (#4184)

Support outputting YCbCr in libjpeg-turbo decoder (#4156)

Use std::exchange in move operator for Tensors (#4183)

Bug Fixes

Unify buffers caching in CPU/GPU external source (#4253)

Fix builds without nvJPEG (#4252)

Separate nvjpeg lib wrapper and stub from the decoder (#4249)

Prevent unnecessary synchronization in MakeContiguous. (#4248)

Do not leak DecodeParams (#4242)

Fix AssertClose bug in Imgcodec tests (#4243)

Fix bug in CPU Convert (#4237)

Fix webdataset python index creation script (#4226)

Fix In memory video decoding tests (#4216)

Fix UnpackBits (#4227)

Fix issues detected by Coverity. (#4221)

Make TensorList constructor for CPU not using GPU memory (#4203)

Fix the indexing for newer PyTorch (#4213)

Fix possibly incorrect parallel write access to vector (#4211)

Fix Layout propagation in TensorVector (#4202)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream.
If key frames occur less frequently than that, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.18.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.18.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.18.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.18.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.18.0-5920075-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.18.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.18.0-5920076-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.18.0-5920076-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.18.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.17.0 (Oct 5, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements:

Added CUDA 11.8 support.

Improved color conversion performance and precision (#4139).

Laid the groundwork for ongoing conditional execution effort (#4149, #4124, #4083, #3827, #4049).

Laid the groundwork for ongoing effort on improved decoding and processing of images.

Documentation improvements (#4168, #4102, #4059, #4094).

Fixed Issues

The following issues were fixed in this release:

Fixed the default dtype in the color twist family of operators (#4067).

Fixed the handling of TIFFs with a palette (#4089).

Improvements

Separating nvjpeg2k utils in imgcodec (#4160)

Add NvJpeg2000Decoder (#4114)

Port operators Python tests to nose2 (#4037)

Refactor Tensor Vector (#4149)

Rename ImageDecoder to ImageDecoderFactory. (#4169)

Add section on deferred setup and shm limit to PES docs (#4168)

Change pinned version of matplotlib (#4167)

Add LibTIFF decoder (#4109)

Make decoder_test_helper.h accept TensorView (#4154)

Update dependencies (#4152)

Add color conversion support (#4143)

Extend the ImageDecoder testing framework to support GPU decoders (#4142)

Add color space conversion to imgcodec (#4121)

Fix CVE-2022-34526 (#4133)

Copy nvjpeg utils into imgcodec (#4148)

Fix linter for files inside the dali_tf_plugin directory (#4118)

Add LibJpegTurboDecoder (#4099)

Color conversion - optimizations and tests (#4139)

Move to CUDA 11.7U1 (#4137)

Remove pageable copies from Convolution, Transpose and Warp kernels. (#4141)

Add AsTensor and related APIs to Tensor Vector (#4124)

[imgcodec] Add thread index and cuda stream to Decode APIs (#4128)

Move operator test files (#4125)

Silence some constexpr-related warnings in NVCC 10. (#4131)

Move libjpeg-turbo utils/impl to imgcodec directory (#4129)

Add missing constexpr to vec and mat. (#4130)

Parse EXIF metadata in PNG imgcodec parser (#4122)

Add parenthesis to assert to avoid using \ (#4123)

Fix error reported by flake8 5.0.1 (#4120)

Turn Python linter on by default (#3997)

Add decoder test framework (#4103)

Add dali namespace to third_party copy of OpenCV's exif (#4112)

Parsing EXIF metadata in WebP images (#4087)

Add PNG parser (#4052)

Fix OpenCV warning in jpeg compression distortion tests (#4107)

Document unsupported external source arguments in TF Dataset (#4102)

Add boilerplate synchronization for batch copying (#4083)

Pin Numba version to 0.55.2 (#4108)

Example image decoder using OpenCV (#4036)

Remove signal handler for SIGKILL (#4015)

Extract common functions from numpy reader (#4100)

Add JPEG EXIF parser (#4073)

Remove video reader warning that a frame has been seen twice (#4092)

Remove unnecessary logging from resize checkerboard tests (#4086)

Add Jpeg2000 parser (#4068)

Fix flake8 warnings (#4074)

Fix & extend formatting of collections. (#4082)

Add inherited members to the Pytorch plugin docs (#4094)

Adjust Doxygen configuration (#4088)

Add imgcodec compatibility tests (#4057)

Add restrictions to set_type (#4071)

Add WebP parser (#4053)

Add JPEG Parser (#4050)

Silence buggy GCC warning about freeing non-heap objects. (#4077)

Add a tool for testing Imgcodec against ImageMagick (#4058)

BMP parser (#4062)

Make endian swapping work with ADL. (#4075)

Add utilities for swapping endianness. (#4069)

Add PNM parser (#4044)

Add references to image_processing/index. Add optional ordering to references. (#4059)

Extract EXIF parser from OpenCV (#4063)

Fix ifndef guards to be at the end of the file (#4064)

Stop exposing internal contiguous TV storage (#3827)

ReadValue extension to support enums (#4060)

Propagate device_id in ShareData and SetSample APIs (#4049)

Add TIFF parser (#4040)

Make the DALI video reader throw an exception when the VFR video is decoded (#4022)

Add ReadHeader util to parser baseclass (#4042)

Bug Fixes

Prevent excessive synchronization in MakeContiguous (#4228)

Prevent overflow in random_resized_crop tests (#4187)

Fix invalid destruction order in decoder test helper (#4186)

Added missing const in for loops (#4185)

Fix coverity issues (#4164)

Conditional compilation of TIFF Codec (#4166)

Fix zlib CVE-2022-37434 (#4150)

Pin matplotlib version to 3.5.2 (#4159)

Fix parsing of grayscale bitmaps (#4147)

Install flake8 for xavier builds (#4127)

Fix handling of TIFFs with palette (#4089)

Fix missing override in decoder test (#4105)

Disable HEVC tests for FramesDecoderGpu when it is not supported by the GPU (#4084)

Fix default dtype in color twist family of operators (#4067)

Fix libtiff CVE-2022-2058, CVE-2022-2057, CVE-2022-2056 (#4047)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream.
If key frames occur less frequently than that, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.17.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.17.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.17.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.17.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.17.0-5838887-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.17.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.17.0-5838886-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.17.0-5838886-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.17.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.16.1 (Aug 26, 2022)
Key Features and Enhancements

This release includes bug fixes, so there are no new features or enhancements.

Fixed Issues

The following issues were fixed in this release:

Fixed a memory leak in fn.decoders.image that occurred on corrupted images (#4138). The leak was in the libjpeg-turbo decoder implementation.

Fixed a crash in fn.readers.numpy that occurred when pad_last_batch is set and more than one thread is used by DALI (#4056).

Fixed a faulty check that prevented the feed_input method from working after the pipeline was deserialized (#4096).

Improvements

None

Bug Fixes

Fix pad_last_batch in GPU NumpyReader (#4056)

Fix feed_input after deserialization (#4096)

Fix memory leak in libjpeg-turbo decoder implementation in case of corrupted images (#4138)

Add zlib to conda recipe (#4173)

Fix Numba versions in tests (#4111)

Fix device pick in Numpy reader tests (#4104)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream.
If key frames occur less frequently than that, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

In experimental debug and eager modes, GPU external source is not properly synchronized with DALI internal streams. As a workaround, the user may manually synchronize the device before returning the data from the callback.

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker.
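The key-frame requirement listed in the known issues above can be checked before a video is handed to the loader. The sketch below is a minimal, hedged example: the helper names are hypothetical, the key-frame indices are assumed to come from an external inspection tool, and the default limit of 10 is simply the conservative end of the "10 to 15 frames" range.

```python
def max_keyframe_gap(keyframe_indices, total_frames):
    """Return the longest run of frames from one key frame to the next."""
    if not keyframe_indices:
        return total_frames
    positions = sorted(keyframe_indices)
    # Treat the end of the stream as a final boundary so a long tail counts.
    boundaries = positions + [total_frames]
    gaps = [b - a for a, b in zip(boundaries, boundaries[1:])]
    # Frames before the first key frame form a gap of their own.
    return max([positions[0]] + gaps)


def keyframes_dense_enough(keyframe_indices, total_frames, max_gap=10):
    """True if no stretch of the stream exceeds max_gap frames (assumed limit)."""
    return max_keyframe_gap(keyframe_indices, total_frames) <= max_gap


print(keyframes_dense_enough([0, 10, 20], 30))
print(keyframes_dense_enough([0, 25], 30))
```

A stream that fails this check would be a candidate for re-encoding with a shorter key-frame interval before use.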

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.16.1

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.16.1

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.16.1

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.16.1

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.16.1-5688170-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.16.1.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.16.1-5688171-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.16.1-5688171-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.16.1.tar.gz
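The wheel links above follow a regular naming pattern, which can be handy when scripting downloads. The sketch below only reproduces the pattern visible in these links. The function name wheel_url is hypothetical, and the build number differs per CUDA variant and release, so it has to be copied from the release notes rather than guessed.

```python
BASE_URL = "https://developer.download.nvidia.com/compute/redist"


def wheel_url(cuda, version, build, arch):
    """Assemble a direct wheel link from the parts seen in the listed URLs."""
    package = f"nvidia-dali-cuda{cuda}"
    # Wheel file names use underscores where the package name uses hyphens.
    filename = (
        f"nvidia_dali_cuda{cuda}-{version}-{build}"
        f"-py3-none-manylinux2014_{arch}.whl"
    )
    return f"{BASE_URL}/{package}/{filename}"


print(wheel_url("110", "1.16.1", "5688171", "aarch64"))
```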

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.16.0 (Jul 25, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements:

Added GPU non-silent region detection operator (#3944, #4001).

Added experimental support for the eager execution of stateful operators and arithmetic operators (#4016, #3952, #3969, #3990).

Added antialias flag to Resize operator for improved control over resampling mode used (#4032).

Added experimental support for custom GPU Numba operators (#3891, #3998, #4006, #4013).

Added support for processing video and handling of temporal arguments to color-manipulation operators and affine transform operators (#3937, #3946, #3917).

Fixed Issues

The following issues were fixed in this release:

Fixed DALI + PyTorch Lightning iterator issue resulting in subsequent epochs terminating too early (#3923, #4048).

Fixed scalars handling by the readers.tfrecord operator (#4024).

Fixed variable batch size handling by the crop and coord_transform operators (#4045, #3958).
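The PyTorch Lightning fix above relies on the iterator resetting itself whenever iter() is called on it (#3923, #4048). The following toy sketch shows that behavior in pure Python. ToyEpochIterator is a hypothetical stand-in written for this example, not DALI's iterator class, and it ignores batching and pipelines entirely.

```python
class ToyEpochIterator:
    """Iterator over a fixed list that rewinds each time iter() is called."""

    def __init__(self, samples):
        self._samples = list(samples)
        self._pos = 0

    def reset(self):
        # Rewind to the start of the epoch.
        self._pos = 0

    def __iter__(self):
        # Without this reset, a second `for` loop over a partly or fully
        # consumed iterator would end too early.
        self.reset()
        return self

    def __next__(self):
        if self._pos >= len(self._samples):
            raise StopIteration
        item = self._samples[self._pos]
        self._pos += 1
        return item


it = ToyEpochIterator([1, 2, 3])
first_epoch = list(it)
second_epoch = list(it)
print(first_epoch, second_epoch)
```

Because each `for` loop calls iter() first, every epoch starts from the beginning even if the previous one was abandoned partway through.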

Improvements

Add little-endian and big-endian read functions for InputStreams (#4038)

Add antialias flag to Resize (#4032)

Reformat python files (#4026)

Python formatting (#4035)

Enable nose2 in Python Tests (#4033)

Imgcodec module boilerplate (interfaces/placeholders/basic logic) (#4029)

Remove deprecated option options.experimental_optimization.map_vectorization.enabled (#4027)

Guided contribution tutorial (#4011)

Fix python formatting (#3982)

Add eager mode stateful operators (#4016)

Disable Numba GPU op for incompatible Numba versions (#4025)

Add missing quote marks to the DALI_AFFINITY_MASK usage example (#4020)

Add abstract InputStream. Refactor existing FileStreams to use it. (#4019)

Make DALI iterator to call reset() when iter() is called upon it (#3923)

Add eager mode operators coverage test (#3952)

Add ack for Numba GPU op (#3998)

Add eager mode arithm ops (#3969)

Reduce DALI conda package installation time (#3995)

Add Non-silent region GPU operator (#3944)

Workaround for nosetests in Python 3.10 (#3986)

Numba cuda operator (#3891)

Fix Python formatting (#3992)

Fix Python formatting (#3988)

Add examples of processing video that utilize per-frame operator (#3917)

Per frame affine transforms (#3946)

Handle partially pruned multi-output external sources (#3975)

Dependencies update (#3979)

Doxygen typo (#3989)

Add per frame parameters support to brightness_contrast and color_twist families (#3937)

Fix missing return (#3985)

Support vector alike output for OpSpec::TryGetRepeatedArgument (#3851)

Fix Python formatting (#3962)

Fix and reenable optimized Cast kernel (#3976)

Bug Fixes

Fix lack of reset when iter() is called on the DALI framework iterator (#4048)

Use actual batch size instead of max batch size in crop_attr.h (#4045)

Support scalars in readers.tfrecord (#4024)

Add const char* ctor to ThreadPool (#4005)

Remove unconditional float16 type mapping in Numba GPU op (#4013)

Change flake8 config (#4004)

Fix Numba CI issues (#4006)

Fix and simplify moving mean squares CPU kernel. (#4001)

Fix nan check and unused external source arguments in debug mode (#3990)

Fix fn.coord_transform handling of a default matrix in variable batch case (#3958)

Fix test_dali_tf_dataset_mnist_eager test (#3991)

Fix test_dali_tf_dataset_mnist_eager.py and test_dali_tf_dataset_mnist_graph.py tests (#3987)

Improve handling of "dtype" arguments in OpSchema/OpSpec (#3981)

Breaking API changes

The shape of scalars read by the readers.tfrecord operator is now () instead of (1,).

For cubic and linear interpolation modes, the resize operator applies the antialiasing filter by default now. The antialiasing can be turned off with the antialias flag.

Deprecated features

The triangular interpolation for resize operator has been deprecated as it is equivalent to linear interpolation with antialiasing on.
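As a rough illustration of why antialiasing matters when downscaling, and why a widened triangle filter amounts to linear interpolation with antialiasing, here is a generic one-dimensional sketch in pure Python. It is not DALI's kernel: the function name resize_linear_1d, the pixel-center mapping, and the border clamping are assumptions made for this example. Plain linear interpolation uses a triangle (tent) filter of radius 1. With antialiasing on, the tent is widened by the downscale factor so that every input sample contributes.

```python
import math


def resize_linear_1d(signal, out_len, antialias):
    """Resample a 1-D signal with a triangle (tent) filter."""
    in_len = len(signal)
    scale = in_len / out_len
    # Radius 1 is plain linear interpolation; when antialiasing a downscale,
    # the tent is widened by the scale factor (illustrative convention).
    radius = max(1.0, scale) if antialias else 1.0
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale - 0.5  # assumed pixel-center mapping
        first = math.ceil(center - radius)
        last = math.floor(center + radius)
        acc = 0.0
        weight_sum = 0.0
        for j in range(first, last + 1):
            weight = max(0.0, 1.0 - abs(j - center) / radius)
            clamped = min(max(j, 0), in_len - 1)  # repeat border samples
            acc += weight * signal[clamped]
            weight_sum += weight
        out.append(acc / weight_sum)
    return out


signal = [1.0, 0.0, 0.0] * 4  # period-3 pattern with mean 1/3
aliased = resize_linear_1d(signal, 4, antialias=False)
smoothed = resize_linear_1d(signal, 4, antialias=True)
print(aliased)
print(smoothed)
```

With this period-3 input downscaled by a factor of 3, the non-antialiased result samples only zeros, while the antialiased interior samples come out at the signal mean of 1/3. The border samples differ slightly because of the clamping.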

Known issues:

The video loader operator requires that key frames occur at least every 10 to 15 frames of the video stream.
If key frames occur less frequently, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

In experimental debug and eager modes, GPU external source is not properly synchronized with DALI internal streams. As a workaround, the user may manually synchronize the device before returning the data from the callback.

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.16.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.16.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.16.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.16.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.16.0-5323000-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.16.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.16.0-5322998-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.16.0-5322998-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.16.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.15.0 (Jun 22, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements:

Added the GPU audio resampling operator (#3884, #3914 and #3911).

Improved the performance of the GPU fn.readers.numpy by custom GDS staging (#3894, #3905).

Added support for video processing and per-frame (temporal) arguments to the warp_affine operator (#3879, #3900).

Added HEVC support to the GPU frames decoder (#3896).

Added experimental support for the eager execution of stateless operators as Python functions and readers as iterators (#3887, #3930).

Added CUDA 11.7 support (#3906).

Profiling improvements:

Added more NVTX ranges to the executor (#3928)

Added thread names to all DALI threads (#3912)

Fixed Issues

The following issues were fixed in this release:

Added the missing device/device synchronization when copying pipeline outputs with copy_to_external (#3953).

Fixed the buffer synchronization between default and custom stream in a multi-GPU case (#3957).

Improvements

Fix Python formatting (#3961)

Fix coverity issues (#3974)

Add FindReduceGPU and FindRegionGPU kernels (#3951)

Fix Python formatting (#3965)

Add .style.yapf file (#3970)

Update Optical Flow example (#3971)

Fix per frame pass through (#3959)

Fixing Python code formatting (#3948)

Suppress the use of a staging buffer for nvJPEG input if it's already pinned. (#3956)

Fix cyclic dependency import problem in fn.py in python 3.6 (#3955)

Refactor qa test scripts (#3933)

Change thread pool creation for eager operators to lazy (#3931)

Fix sequence shape test (#3949)

Expose readers as iterators in eager mode (#3930)

Add Python linter (#3929)

Remove redundant quote marks from the protobuf version specifier (#3945)

Skip GDS tests when the GPU is incompatible. (#3941)

Add sequence processing to warp operator (#3879)

Add MovingMeanSquareGpu kernel (#3922)

Pin protobuf to <4 for PaddlePaddle (#3940)

Update compilation flags for the DALI TensorFlow plugin (#3943)

Change MultiDevice to MultiGpu test suffix (#3942)

Bump up the nvidia-tensorflow version to 20.05 in tests (#3938)

Add FindFirstLastGPU kernel (#3932)

Adjust PR template to ask for listing existing tests that apply (#3939)

Pin protobuf to <4 (#3934)

Add VFR detection (#3921)

Fix CVE-2022-0562 in libtiff (#3925)

Update RNN-T pipeline tests to include GPU resampling and silence detection (#3920)

Add more NVTX ranges to the executor (#3928)

Add HEVC support for FramesDecoderGpu (#3896)

Add a thread name to all DALI threads (#3912)

Add dataclasses pip package to tests deps to fix Python3.6 operator tests (#3926)

Add fn.experimental.audio_resample GPU (#3911)

Custom staging for GDS (#3894)

Update the readme roadmap link to use 2022 one (#3918)

Support specifying per-frame positional arguments in sequence processing test utility (#3901)

Move audio resampler CPU implementation to a single compilation unit (#3914)

Add stateless CPU eager operators (#3887)

Add CUDA 11.7 support (#3906)

Add VideoReaderDecoder test for missing labels (#3908)

Add signal resampling GPU kernel (#3884)

Optimize parameter passing for ScatterGather GPU (#3905)

Add references to ops documentation in the tutorials (#3904)

Enable per-frame operator on GPU (#3900)

Bug Fixes

Fix dltensor operator tests (#3984)

Prevent clobbering of outputs before non-blocking copy_to_external finishes. (#3953)

Fix a bug in AccessOrder when synchronizing with a default stream on the same device, which is not the current device. (#3957)

Workaround GDS memory leak in GDSMem tests. (#3936)

Fix circular imports in eager mode (#3919)

Remove intermediate Tensor and use DynamicScratchpad for op tile descriptors. (#3915)

Add missing moving of order in TensorVector's move assignment/constructor (#3899)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires that key frames occur at least every 10 to 15 frames of the video stream.
If key frames occur less frequently, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker.

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.15.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.15.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.15.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.15.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.15.0-5080387-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.15.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.15.0-5080390-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.15.0-5080390-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.15.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.14.0 (May 30, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Added HEVC support to the CPU frames decoder (#3885).

Added the CPU audio resampling operator (#3840).

Added support for video processing and per-frame (temporal) arguments to the rotate operator (#3820).

Added support for variable batch size in the debug mode (#3799).

Performance optimizations:

Optimized tiled transposition algorithm on small data types (#3730).

Improved CropMirrorNormalize operator performance (#3771).

Fixed Issues

Fixed the compatibility with TensorFlow 2.9 by adding type propagation to DALIDataset (#3875).

Added a missing check that the number of files matches the number of labels in the experimental video reader (#3903).

Added a missing check that the number of samples is greater than or equal to the number of shards in readers (#3856).

Fixed scalars handling in the GPU cast operator (#3924).
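The shard check mentioned above (#3856) guards against shards that would end up empty. The sketch below illustrates the idea with a floor-based split of sample indices across shards. That formula and both helper names are assumptions of this example, and DALI's readers may differ in detail.

```python
def shard_range(shard_id, num_shards, dataset_size):
    """Return the [start, end) sample range of one shard (floor-based split)."""
    start = shard_id * dataset_size // num_shards
    end = (shard_id + 1) * dataset_size // num_shards
    return start, end


def check_enough_samples(num_shards, dataset_size):
    """Reject configurations in which at least one shard would be empty."""
    if dataset_size < num_shards:
        raise ValueError(
            f"{dataset_size} samples cannot fill {num_shards} shards"
        )


# 10 samples over 4 shards: every shard gets 2 or 3 samples.
print([shard_range(i, 4, 10) for i in range(4)])
# 2 samples over 4 shards: shard 0 is empty, which the check would reject.
print(shard_range(0, 4, 2))
```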

Improvements

Add support for TensorFlow 2.9. (#3909)

Remove deprecated usage of numpy types int and long (#3898)

Add output_dtype and output_ndim arguments to Pipeline constructor (#3877)

Add hevc support cpu frames decoder (#3885)

Add a C API call to get the max batch size (#3890)

Add bool to Pad supported types (#3895)

Adjust eps in test comparing readers (#3892)

Fix coverity issues. Do not re-throw worker thread error in the destructor. (#3886)

Fix memory leak in C API test (#3889)

Add tutorials references to ops docs - general section (#3869)

Refactor video tests (#3864)

Add NonsilentRegion GPU, implemented in terms of the CPU version (#3874)

Add a check of the decoding progress in the VideoReader (#3858)

Reduce libaviutils log verbosity to errors and above (#3871)

Extend C Api to fetch the layout and ndim from External Source (#3862)

Updated PyTorch-Lightning example with new strategy keyword for Trainer. (#3867)

Update clang version to 14.02 (#3863)

Improve cast operator performance (#3783)

Update CUTLASS to v2.9.0 (#3860)

Change the way how CUDA pub key is installed (#3866)

Audio resampling operator for CPU backend (#3840)

Dependencies update (#3831)

Optimization of tiled transposition algorithm on small data types (#3730)

Improve CropMirrorNormalize operator performance (#3771)

Fix typo (model -> module) (#3848)

Add a check against changing layout in ES (#3839)

Add cpu only and variable batch size tests to per-frame operator (#3850)

Fix missing f prefix on f-strings (#3847)

Fix handling of arguments with trailing newlines when generating operator docs (#3841)

Add support for sequence processing to rotate (#3820)

Fix TF DALIDataset tests that changed layout between iterations (#3836)

Add ndim argument to the external source operator (#3755)

Add operators cross-referencing to data loading index (#3823)

Features required for autoserialization in DALI Backend (#3795)

Remove gtest RandomBBoxCropTest tests (#3822)

Update user documentation footer copyright date (#3819)

Add operator cross-referencing to custom operators tutorials (#3818)

Fix the default value of resize min_filter in the documentation (#3816)

Benchmark for Transpose operator (#3785)

Add operator cross-referencing to data loading section (#3809)

Update shields.io badges in README.rst. (#3815)

Add operator cross-referencing to audio processing tutorials (#3806)

Add operator cross-referencing to video processing tutorials (#3808)

Add support for variable batch size and NVTX ranges in debug mode (#3799)

Shutdown() a WorkerThread in the destructor (#3810)

Improve the redirect (#3801)

Bug Fixes

Add tests for operator cast. Revert to plain batched cast kernel until the optimized one is fixed. (#3927)

Fix scalar handling in GPU cast. (#3924)

Adds check to the experimental video reader if the number of files and labels match (#3903)

Add type propagation implementation introduced in TF 2.8 (#3875)

Fix corruption: Change bool to int when querying pointer attributes. (#3873)

Make libtar and libsnd root paths customizable. (#3872)

Add check if the number of samples is greater or equal to the number of shards in readers (#3856)

Fix transposition kernel tests (#3859)

Fix default argument handling in cuda_vm_resource constructor (#3857)

Fixes test_coverage case in test_dali_cpu_only.py and test_dali_variable_batch_size.py (#3849)

Fix rotate assertion warning (#3852)

Make failure in curl to fail Dockerfile.build.aarch64-linux image build (#3821)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires that key frames occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames might be out of sync.

Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.

The DALI TensorFlow plug-in might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have the prebuilt plug-in binary that is shipped with DALI, ensure that the compiler that is used to build TensorFlow exists on the system during the plug-in installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows the best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.14.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.14.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.14.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.14.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.14.0-4921279-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.14.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.14.0-4921308-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.14.0-4921308-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.14.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz

v1.13.0 (Apr 22, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Added support for per-frame (temporal) arguments to the Gaussian Blur and Laplacian operators (#3715 and #3723).

Optimized audio decoder resampling for ARM (#3745).

Improved the debug (immediate execution) mode:

Added direct operator calls in debug mode (#3734).

Added a debug mode benchmark (#3762).

Added support for GPU positional arguments in the Slice operator (#3741).

Documentation improvements:

Split the operator documentation into separate pages (#3794).

Added a mechanism for cross-referencing examples and operators (#3748).

Added an FAQ section to the DALI user guide (#3761).

Added new GTC talks (#3757).

Added shuffling and shards handling snippets to the parallel external source examples (#3744).

Fixed Issues

Fixed the handling of samples that exceed 2 GB in the parallel external source (#3768).

Improvements

Add per-frame operator (#3723)

Add support for per-frame arguments to Gaussian Blur and Laplacian operators (#3715)

Separate the documentation pages! (#3794)

Update zlib to 1.2.12 version (#3787)

Trim TL0_tensorflow_plugin and TL0_python-self-test-readers-decoders tests (#3796)

Add _schema_name attribute in fn API (#3798)

Add resize checkerboard tests, comparing to ONNX reference precomputed data (#3792)

Update nvJPEG2000 to 0.5.0 version (#3791)

Fix header in parallel external source notebook (#3790)

Update documentation link to the '22 roadmap (#3786)

Bump Nvidia TF1 version used in tests to 22.03 (#3769)

Add mechanism for crossreferencing examples and operators (#3748)

Add direct operator calls in debug mode (#3734)

Make number of samples in batch signed (#3789)

Add debug mode benchmark (#3762)

Fix the cuBLAS version to one compatible with nvTF 22.01 (#3781)

Apply changes from TV sample encapsulation in NVJPEG2K (#3780)

Ensure sample encapsulation in Tensor Vector (#3701)

Add a TL0 test that runs on more than 1 GPU (#3772)

Add FAQ section to the DALI documentation (#3761)

Remove the compose operator from the fn API table (#3767)

Add new GTC talks. Update old link (#3757)

Update to CUDA 11.6u2 (#3764)

RNG to use pinned memory for kernel launch args (#3765)

Revert "Pin webdataset version to the last compatible with python 3.6 (#3746)" (#3763)

Fix the wrong patch for CVE-2022-0907 which by mistake duplicated CVE-2022-0909 (#3760)

Quantize GDS chunk size to 1 MB. (#3759)

Add GDS-compatible allocator with 4k alignment. (#3754)

Update error messaging of nvJPEG (#3756)

Allow GPU slice arguments (#3741)

Add filename to the error message in the numpy reader (#3753)

Fix libtiff vulnerabilities (#3752)

Update parallel external source notebook and include shuffling example. (#3744)

Add supported python version classifier to DALI TF plugin setup.py (#3751)

Vectorize audio resampling for ARM NEON. (#3745)

Remove prints from the regular DALI execution flow (#3740)

Pin webdataset version to the last compatible with python 3.6 (#3746)

Align test expectations with slice implementation rounding logic (#3738)

Update RapidJSON (#3737)

Regenerate getting started jupyter examples (#3732)

Improve documentation for AccessOrder wait and set_order. (#3736)

Bug Fixes

Add missing copying of pinned prop when sharing buffer (#3797)

Disable PES large sample test on Xavier runner (#3788)

Fix source device in PyTorch cross-device test. (#3775)

Fix large mini-batch handling in parallel external source (#3768)

Fix Yolo v4 example non-fatal teardown error (#3739)

Rework Image Decoder example (#3731)

Check return value of a CUDA function call. (#3733)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires that key frames occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames might be out of sync.

The DALI TensorFlow plug-in might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have the prebuilt plug-in binary that is shipped with DALI, ensure that the compiler that is used to build TensorFlow exists on the system during the plug-in installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows the best performance when running in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.13.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.13.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.13.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.13.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.13.0-4481322-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.13.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.13.0-4481327-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.13.0-4481327-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.13.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.12.0 (Mar 24, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Added support for the GPU-accelerated decoding of videos with a variable frame rate (experimental.readers.video) (#3668).

Reduced the binary size (#3680 and #3682).

Improved the TensorFlow plug-in installation: when none of the prebuilt binaries matches the exact TensorFlow version, the installer now tries to build the plug-in from source (#3720).

Improved performance by increasing the usage of pinned memory in argument input buffers (#3728).

Documentation improvements (#3722, #3684, and #3674).

Fixed Issues

Fixed the TensorFlow plug-in issue that prevented it from working in the CPU-only mode (#3719).

Improvements

[DALI TF] Try building from source when TF version doesn't match exactly. Add test step to installation script. (#3720)

Add supported layouts to Crop, CropMirrorNormalize (#3722)

Make output buffers for argument inputs to GPU operators pinned. (#3728)

Bump up TensorFlow version used in tests (#3688)

Fix coverity issues (#3679)

Bump up CUDA to 11.6U1 (#3709)

Add test to check if importing DALI doesn't break Torch process forking (#3669)

Add non-owning SampleView (#3706)

Use pinned buffers for kernel parameters and for ToContiguousGPU. (#3689)

Update deps version for libtiff-CVE-2022-0561 fix (#3693)

Update documentation regarding GDS being part of CUDA toolkit (#3684)

Add VideoReaderDecoder GPU (#3668)

Custom build: subset of file patterns for kernel and operators (#3672)

Remove lineinfo from RelWithDebInfo DALI builds (#3680)

Build DALI only for major arch versions (#3682)

Remove mpiexec affinity binding in TensorFlow TL1 and TL3 RN50 test (#3681)

Remove Scratchpad from KernelManager (#3678)

Update dependencies (#3677)

Use DynamicScratchpad in KernelManager. (#3670)

Add an info about fill_values being used by pad_output in crop_mirror_normalize (#3674)

Bug Fixes

Fix CVE-2022-0626 in libtiff (#3727)

Fix TensorFlow plugin operation without GPU (#3719)

Synchronize at the end of BoxEncoder's constructor. (#3724)

Fix ES debug mode test failing with missing batch (#3712)

Add missing import nose.SkipTest in optical flow tests (#3707)

Fix stream handling in video loader and nvdecoder. (#3705)

Fix typos found in tensor_shape.h docs (#3695)

Fix optical flow tests for Turing (#3685)

Fix Slice's adaptive tiling for smaller output types (#3687)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.
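
One way to check a stream against this constraint before passing it to the reader is to measure the longest distance between key frames. The sketch below is a hypothetical helper, not part of DALI. It assumes a per-frame key-frame flag list has already been obtained by other means (for example, from a container inspection tool), and the name max_keyframe_interval is illustrative only.

```python
from typing import Sequence


def max_keyframe_interval(is_key: Sequence[bool]) -> int:
    """Return the longest distance, in frames, between consecutive key frames.

    The run of frames after the last key frame is counted as well. A stream
    without any key frame is reported as one interval covering all frames.
    """
    longest = 0
    last_key = None
    for index, key in enumerate(is_key):
        if key:
            if last_key is not None:
                longest = max(longest, index - last_key)
            last_key = index
    if last_key is None:
        return len(is_key)
    # Account for the trailing frames that follow the last key frame.
    return max(longest, len(is_key) - last_key)


# Hypothetical stream: key frames at positions 0 and 10, 30 frames in total.
flags = [True] + [False] * 9 + [True] + [False] * 19
if max_keyframe_interval(flags) > 15:
    print("key frames are too sparse for the video loader")
```

A result above the 10 to 15 frame range quoted above suggests re-encoding the video with a shorter key-frame interval.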

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.12.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.12.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.12.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.12.0
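
The pip commands in these release notes follow a fixed pattern: the package name carries a CUDA tag (cuda102 or cuda110) and is pinned to the release version. The following sketch only assembles those command lines for scripting purposes. The helper name dali_pip_commands and its parameters are hypothetical; the index URL and package naming are taken from the commands above.

```python
from typing import List

# Extra index URL used by the install commands in these release notes.
INDEX_URL = "https://developer.download.nvidia.com/compute/redist/"


def dali_pip_commands(cuda_tag: str, version: str,
                      with_tf_plugin: bool = True) -> List[List[str]]:
    """Return pip argument lists for DALI and, optionally, its TensorFlow plugin."""
    packages = [f"nvidia-dali-{cuda_tag}=={version}"]
    if with_tf_plugin:
        packages.append(f"nvidia-dali-tf-plugin-{cuda_tag}=={version}")
    return [["pip", "install", "--extra-index-url", INDEX_URL, pkg]
            for pkg in packages]


# Print the commands for the CUDA 11.0 build of release 1.12.0.
for cmd in dali_pip_commands("cuda110", "1.12.0"):
    print(" ".join(cmd))
```

Each returned list can be passed to subprocess.run as is, which avoids shell quoting issues.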

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.12.0-4144186-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.12.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.12.0-4144197-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.12.0-4144197-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.12.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.11.1(Mar 4, 2022)
Key Features and Enhancements

This is a patch release.

Fixed Issues

Fixed wrong handling of input data by GPU external source in multi-GPU scenario

Fixed wrong usage of streams in C API

Improvements

None

Bug Fixes

Fix multi-device GPU external source. (#3710)

Fix constructing GPU Tensor from DLPack capsule (#3711)

Fix stream usage in C API (#3713)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

The experimental.readers.video operator causes a crash during the process teardown with driver versions 460 to 470.21

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.11.1

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.11.1

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.11.1

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.11.1

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.11.1-4069476-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.11.1.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.11.1-4069477-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.11.1-4069477-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.11.1.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.11.0(Feb 28, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Added the GPU laplacian operator (#3644, #3618).

Updated the optical_flow operator to use the latest SDK capabilities (#3625).

Extended the readers.webdataset operator to support pax POSIX.1-2001 tar format. (#3645).

Improved the performance of the slice operator (#3604, #3600).

Improved the debug (immediate execution) mode:

Added the direct use of external sources (#3605).

Extended the API and added a string representation and the .shape method to data nodes (#3647, #3591).

Added support for deterministic seed generation (#3589).

Added a tutorial notebook (#3648).
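
Deterministic seed generation means that the same starting seed always yields the same sequence of per-operator seeds, so repeated runs produce the same random augmentations. The sketch below illustrates that idea with Python's standard random module. It is not DALI's implementation, and the function derive_seeds is a hypothetical name.

```python
import random
from typing import List


def derive_seeds(base_seed: int, count: int) -> List[int]:
    """Derive `count` reproducible seeds from a single base seed."""
    rng = random.Random(base_seed)  # private generator, global state untouched
    return [rng.getrandbits(63) for _ in range(count)]


# Two runs with the same base seed give identical per-operator seeds.
first_run = derive_seeds(1234, 4)
second_run = derive_seeds(1234, 4)
print(first_run == second_run)
```

Using a private random.Random instance keeps the derived seeds independent of any other code that touches the global generator.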

Fixed Issues

Fixed the incorrect construction of TensorList from a list of tensors (#3626).

Fixed an issue in the CPU readers.video operator that prevented it from working in the CPU-only mode (#3660).

Improvements

Improve checking if it is safe to fork the DALI process (#3671)

Add debug mode tutorial notebook (#3648)

Dynamic & stream-aware scratchpad (#3667)

Use fn API in non-silent tests (#3666)

Frames decoder gpu (#3615)

Add Laplacian GPU operator (#3644)

Update third party (#3632)

Improve the documentation about CPU tensors and named arguments (#3655)

Update docs for the parallel option in external source (#3654)

Update optical flow operator to use the latest OF SDK capabilities (#3625)

Remove deprecated usage of .dtype() method (#3650)

Update pattern used to generate TFRecord idx files (#3653)

Add one_hot benchmark (#3553)

Add str and repr for Tensor, TensorList and DataNode[Debug] (#3647)

Relax test tolerance in DisplacementTest/Sphere and Water (#3649)

Update warp_affine test and docs (#3639)

Remove unnecessary Dockerfile.cuda116.x86_64deps file (#3642)

Updates FindNVJPEG.cmake (#3643)

Add JPEG compression distortion to augmentation gallery (#3633)

Use index slicing in geometric transformation notebook (#3635)

Add support for tar pax POSIX.1-2001 WebDataset (#3645)

Remove redundant tests (#3634)

Add dtype member for TensorList and modify dtype for Tensor (#3628)

Remove dependency between dali_test.bin and dali_operators lib (#3637)

Add Laplacian GPU kernel (#3618)

Updated PR template (#3619)

Remove synchronization from deallocate. (#3497)

ArgHelper tests to not depend on operators from dali_operators lib (#3631)

Add dtype argument to ExternalSource in examples (#3611)

Add CUDA 11.6 support (#3623)

Make data objects stream-aware (#3536)

Changing WDS Reader source_info property (#3614)

Relax test tolerance in DisplacementTest/Sphere (#3621)

Video tests utils and refactor (#3620)

Debug mode direct ExternalSource (#3605)

Remove Buffer inheritance from TensorList (#3576)

Relax test tolerance in DisplacementTest/Water (#3616)

Improve Slice's adaptive tiling (#3604)

Explicitly coalesce stores in Slice for smaller output types (#3600)

Add an upper bound for the video decoder workaround (#3609)

Deterministic seeds in debug mode (#3589)

Move from zlib to zlib-ng optimized fork (#3570)

TensorList shape (#3591)

Bug Fixes

Fix frames decoder destruction (#3662)

Removes check of CUDA runtime and linked libs from the backend (#3664)

Remove CUDA call from CUDAStreamPool's constructor (#3663)

Fix librosa bugs after 0.9 release (#3665)

Fix VideoReader CPU only variant (#3660)

Add a separate initialization method to OpticalFlowAdapter (#3657)

Fix get-pip.py for python 3.6 (#3652)

Fix sphinx warnings in the docs (#3651)

Fix synchronization bug in operator benchmark (#3638)

Replace calls to exp2 with std::exp2f (#3646)

Fix null_stream constant evaluation fallback (#3630)

Fix CVE-2021-4156 in libsnd (#3624)

Fix TensorList constructor from list of tensors. (#3626)

Fix CVE-2022-22844 in libtiff (#3612)

Fix dtype in external_source with multiple outputs. (#3608)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

The experimental.readers.video operator causes a crash during the process teardown with driver versions 460 to 470.21

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.11.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.11.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.11.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.11.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.11.0-3985923-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.11.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.11.0-3985922-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.11.0-3985922-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.11.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.10.0(Jan 25, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

New operators:

The get_property operator (CPU and GPU) that is used to fetch tensor metadata, such as the source file name (#3572).

The laplacian operator (CPU) (#3563).

Color-based augmentations were extended to support video data (#3580).

Improved performance of the slice operator (#3584, #3573, and #3568).

Added an experimental debug (immediate execution) mode (#3586 and #3531).
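
For readers unfamiliar with the Laplacian mentioned above, the following plain-Python sketch computes the textbook discrete Laplacian of a 2D image with the 4-neighbour stencil. It is only an illustration of the mathematical operation. The DALI operator is configurable and may use different window sizes and border handling, and the replicated border used here is an assumption of this sketch.

```python
from typing import List

Image = List[List[float]]


def laplacian_2d(img: Image) -> Image:
    """Apply the 4-neighbour Laplacian stencil with a replicated border."""
    height, width = len(img), len(img[0])

    def at(y: int, x: int) -> float:
        # Clamp coordinates so that out-of-range reads repeat the edge pixel.
        return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    return [
        [at(y - 1, x) + at(y + 1, x) + at(y, x - 1) + at(y, x + 1) - 4 * img[y][x]
         for x in range(width)]
        for y in range(height)
    ]


# A single bright pixel produces the stencil itself as the response.
impulse = [[0, 0, 0],
           [0, 1, 0],
           [0, 0, 0]]
for row in laplacian_2d(impulse):
    print(row)
```

A constant image yields all zeros, which is why the Laplacian is commonly used to highlight edges and fine detail.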

Fixed Issues

No major issues were fixed in this release.

Improvements

Adds video support to color based augmentations (#3580)

Fixed cmake error (#3601)

Fix debug build failures in benchmark code (#3585)

Make sanitizers tests fail when it encounters the first issue (#3583)

Use proper attribute filters for nosetests (#3592)

Fix wrong parameter name in Laplacian docs (#3593)

QA script fix: Add an empty negative branch to a conditional to prevent automatic error (#3588)

Small refactoring in Slice GPU kernel (#3584)

GetProperty operator CPU+GPU (#3572)

Add comments about scale argument (#3581)

Fix coverity issues (#3579)

Check when using ES source and feed_input (#3574)

Prototype of the debug mode (#3531)

Enable tests for dynamically loaded cuda libraries (#3540)

Add Laplacian operator [CPU] (#3563)

Add CUDAStreamPool & CUDAStreamLease. (#3569)

Coalesce stores in Slice for smaller output types (#3568)

Turn off OpticalFlow test on aarch64 platform for driver r495.x and newer (#3566)

Bug Fixes

Fixing typos in WDS's source_info (#3602)

Fix handling of scalar argument in slice operator (#3596)

Use the same device for debug mode test and baseline (#3594)

Fix JPEG distortion GPU quality argument handling for sequences (#3590)

Use current device in _as_gpu (#3586)

Fix version_ge: command not found error in TL0_python-self-test-base-cuda (#3582)

Disable coalescing values in Slice for CUDA 10 (#3573)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.10.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.10.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.10.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.10.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.10.0-3728184-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.10.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.10.0-3728186-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.10.0-3728186-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.10.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.9.0(Jan 3, 2022)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Extended the jpeg_compression_distortion operator to support video inputs (#3482 and #3447).

Added the file_filter argument to the readers.file operator that allows you to filter files by names (#3459).

Extended the slice operator to support per-sample axes arguments and negative axis indexing (#3516).

Extended the pad operator to support per-sample axes, fill_value arguments, and negative axis indexing (#3534).

Improved the performance of the slice operator for small batch sizes (#3557).

Added the Laplacian CPU kernel (#3565, #3535, and #3518).
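
Two of the items above, filtering files by name and negative axis indexing, can be illustrated in plain Python. The sketch below does not call DALI. It assumes glob-style name patterns for the file filter and the usual Python convention in which axis -1 refers to the last dimension. The helper names filter_files and normalize_axes are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def filter_files(names: Sequence[str], patterns: Sequence[str]) -> List[str]:
    """Keep only the file names that match at least one glob pattern."""
    return [name for name in names
            if any(fnmatchcase(name, pattern) for pattern in patterns)]


def normalize_axes(axes: Sequence[int], ndim: int) -> List[int]:
    """Map negative axis indices to their non-negative equivalents."""
    result = []
    for axis in axes:
        if not -ndim <= axis < ndim:
            raise ValueError(f"axis {axis} is out of range for {ndim} dimensions")
        result.append(axis + ndim if axis < 0 else axis)
    return result


print(filter_files(["cat.jpg", "dog.png", "notes.txt"], ["*.jpg", "*.png"]))
print(normalize_axes([-1, 0], ndim=3))  # -1 refers to the last of 3 dimensions
```

fnmatchcase is used instead of fnmatch so that matching is case-sensitive on every platform.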

Fixed Issues

This DALI release includes the following fixes:

Fixed a race condition that randomly caused incorrect outputs in the TensorFlow plugin (#3547).

Fixed synchronization issues in the PaddlePaddle plugin that may have caused incorrect results (#3498 and #3487).

Improvements

Make Slice kernel tiling adaptive (#3557)

Add Laplacian CPU kernel (#3518)

Allows DALI to dlopen dependent CUDA toolkit libraries: NPP, cuFFT and nvJPEG (#3519)

Fix test code to be compatible with python 3.6 (#3550)

Fix a typo in warp jupyter notebook. (#3554)

Add Cast and CoinFlip GPU benchmarks (#3541)

Fix DALI TL3 test for 21.11 (#3529)

Pad operator: Add support for per-sample axes and fill_value arguments, and negative axes (#3534)

Add FlipGPU and GaussianBlurGPU benchmarks (#3538)

Make bundle-wheel.sh more configurable (#3539)

Enable DALI test on python 3.9 and add 3.10 support (#3522)

Add transform parameter to convolution cpu (#3535)

Remove nvJPEG leak sanitizer workaround in tests (#3532)

Dependency update Nov 2021 (#3523)

Add support for per-sample axes and negative axes in Slice (#3516)

Refactor ArgValue to support empty samples and batch shape expectations (#3528)

Move to CUDA 11.5 update 1 (#3526)

Add Copy GPU benchmark (#3517)

Move to CUDA_CALL for nvJPEG, nvJPEG2k, and NPP (#3521)

Silence warning in LookupTable (#3508)

Move unfold_outer_dim to common utilities. (#3486)

Remove Context from memory resources. (#3485)

Set minimum python version to 3.7 for TF 2.7 (#3489)

Allow video inputs to JpegCompressionDistortion (#3482)

Bump up TensorFlow version to 2.7 in tests (#3475)

Change the way how NVML wrapper is linked internally (#3481)

Add support for file_filters in FileReader (#3459)

Allow video inputs to JpegCompressionDistortion (#3447)

Move to Ubuntu 20.04 for cuda 10.2 toolkit image (#3477)

Move to Ubuntu 20.04 for cuda toolkit image (#3476)

Pin Keras version for TensorFlow 2.6 (#3474)

Add support for BatchInfo in experimental TF DALI Dataset (#3468)

Bug Fixes

Replace equality with EqualEpsRel in Laplacian kernel tests (#3565)

Synchronize CUDA stream once in operator benchmark (#3525)

Ensure that num_devices and device are stored in correct order. (#3560)

Fix conda test for CUDA 10.x (#3556)

Fix race condition when initializing per-device default memory resources (#3555)

Fix data race when copying outputs in TF plugin (#3547)

CUDA VM resource bugfixes (#3545)

Fix build of DALI TensorFlow plugin during installation (#3546)

Fix issues found during static analysis (#3524)

Fix lack of proper device id used to obtain relevant cuda stream in paddle plugin (#3498)

Add type check to last_batch_policy argument (#3490)

Fix DALI paddle plugin stream synchronization error (#3487)

Reuse GaussianBlur windows between iterations (#3484)

Add synchronization when destroying the Executor. Make all destructors noexcept. (#3492)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.9.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.9.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.9.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.9.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.9.0-3647996-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.9.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.9.0-3647997-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.9.0-3647997-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.9.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.8.0(Nov 22, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Added batch mode support to external_source operator with parallel callback. (#3420 and #3397)

Extended crop_mirror_normalize operator to support per-sample normalization parameters. (#3455)

Improved error messages when trying to decode images with unsupported format. (#3445)

Documentation improvements. (#3448 and #3439)
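
Per-sample normalization parameters mean that each sample in a batch can be normalized with its own mean and standard deviation rather than one shared pair. The plain-Python sketch below shows the arithmetic only, (value - mean) / std, on flat lists of numbers. It does not use DALI, it simplifies the parameters to one scalar pair per sample, and the name normalize_batch is hypothetical.

```python
from typing import List, Sequence


def normalize_batch(batch: Sequence[Sequence[float]],
                    means: Sequence[float],
                    stds: Sequence[float]) -> List[List[float]]:
    """Normalize every sample with its own mean and standard deviation."""
    if not (len(batch) == len(means) == len(stds)):
        raise ValueError("one mean and one std are required per sample")
    return [[(value - mean) / std for value in sample]
            for sample, mean, std in zip(batch, means, stds)]


# Two samples on different scales end up in the same range.
batch = [[2.0, 4.0], [10.0, 20.0]]
print(normalize_batch(batch, means=[2.0, 10.0], stds=[2.0, 10.0]))
```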

Fixed Issues

This DALI release includes the following fixes:

Fixed unsound interpretation of the aspect ratio parameter in the random_bbox_crop operator, when input shape is provided. (#3425)

Fixed incorrect output shape in the experimental.readers.video operator. (#3460)

Improvements

Remove reseeding of numpy in RandomlyShapedDataIterator (#3466)

Add indexing information to TF external source tests (#3467)

Extend setup_packages.py to bring package with its dependencies (#3464)

Update dependency versions (#3457)

Optionally load plugins global symbols. (#3462)

Add NVIDIA Video Codec SDK - NVDECODE API (#3458)

CropMirrorNormalize: Add support for per-sample normalization arguments (#3455)

Support batch mode in parallel external source (#3397)

Turn off part of TL0_FW_iterators tests when sanitizers are enabled (#3456)

Read ArgValue constant arguments only once (#3453)

Rename InputRef/OutputRef to Input/Output in workspace API (#3451)

Reduce number of Workspace Input/Output APIs (#3446)

Fix error reporting in image factory (#3445)

Update custom op example for newer CMake (#3448)

Update TF dataset to 2.8 (#3442)

Fix documentation of CropMirrorNormalize dtype argument (#3439)

Bump up nvJPEG2k version to 0.4 (#3440)

Enable CUDA 11.5 builds (#3436)

Enable sanitizers in regular CI runs (#3422)

Improve the way how available python version is available (#3438)

RandomBBoxCrop: Fix interpretation of aspect ratio, when input shape is provided (#3425)

Change the permute function to infer the output size from the indices. (#3434)

Move to the upstream deb packages for JetPack compilation (#3432)

Change C++ standard to c++17 for non-CUDA sources (#3423)

Add epoch number to SampleInfo and introduce BatchInfo (#3420)

Separate type setting from data access in Buffer (#3414)

Make SBSA build compatible with all armv8-a CPUs (#3417)

Update TF plugin for future API change (#3415)

Replace pointers with references for ShareData parameter (#3408)

Code cleanup: remove unused variables, fix buffer overflow (#3410)

Enable usage of sanitizers in tests (#3377)

Bug Fixes

Update tensorflow version in conda build (#3471)

Fix STRING_VEC default arguments presentation in docs (#3470)

Remove broken class method from DALI Dataset (#3465)

Fix experimental.readers.video output shape (#3460)

Fix static analysis detected issues (#3444)

Silence output from build_per_python_lib cmake utility (#3454)

Make Workspace::Input return const reference (#3452)

Update imports from collections to collections.abc where needed (#3429)

Install boost/preprocessor headers (#3443)

Fix ShareData for TensorVector with no elements (#3435)

Update GCC version in conda recipe to 7.5 to workaround GCC bug 82461. (#3431)

Add a missing state destruction for the NVJPEG HW decoder (#3416)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires key frames to occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.8.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.8.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.8.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.8.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.8.0-3362432-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.8.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.8.0-3362434-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.8.0-3362434-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.8.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.7.0 (Oct 25, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

New operators:

readers.webdataset, which is a reader for the Webdataset format (#3395, #3385, #3375, #3372, #3360, and #3306).

experimental.readers.video (CPU), which is an experimental video reader and decoder that includes support for variable frame rate videos (#3412, #3411, #3391, and #3362).

Performance improvements:

warp_affine performance has been improved for some common cases (#3370).

Other minor general performance improvements (#3363 and #3338).

Added the DALI_DISABLE_NVML and DALI_RESTRICT_PINNED_MEM environment variables. These variables allow you to limit the use of NVML and pinned memory and enable DALI on more platforms (#3404 and #3382).
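As a minimal sketch, the two environment variables could be set from Python before DALI is loaded. The variable names come from the note above; the value `"1"` and the need to set them before importing DALI are assumptions, and the helper name is hypothetical.

```python
import os


def dali_platform_env(disable_nvml=False, restrict_pinned_mem=False):
    """Return the environment variables to set for a restricted platform.

    Assumption: the value "1" enables each option.
    """
    env = {}
    if disable_nvml:
        env["DALI_DISABLE_NVML"] = "1"
    if restrict_pinned_mem:
        env["DALI_RESTRICT_PINNED_MEM"] = "1"
    return env


# Assumption: the variables must be in place before DALI is imported.
os.environ.update(dali_platform_env(disable_nvml=True, restrict_pinned_mem=True))
print({k: v for k, v in os.environ.items() if k.startswith("DALI_")})
```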

Fixed Issues

This DALI release includes the following fixes:

Fixed an issue in the pad operator that caused a crash when the operator was used with a variable batch size (#3354).

Fixed a race condition that occurred in the readers.video operator (#3355).

Fixed a bug in the C API that caused invalid memory access in some use cases (#3350).

Improvements

Add more logging to FramesDecoder (#3412)

Reduce the TensorList and TensorVector API scope (#3403)

Add an env variable DALI_DISABLE_NVML to disable NVML usage on demand (#3404)

Enable BUILD_LMDB by default (#3406)

Add error message checking into existing python tests (#3401)

Bump up Nvidia TensorFlow version in tests to 21.09 (#3383)

Add VideoReaderDecoder (#3391)

Webdataset automatic index file inference (#3385)

Add an environment variable that determines whether pinned memory usage should be restricted. (#3382)

Notebook with an example of webdataset usage (#3372)

Add frames decoder (#3362)

Move to libtar fork - https://github.com/tklauser/libtar (#3375)

Remove possibility of access to contiguous TL buffer (#3373)

Add error message checks (#3371)

Update libcudacxx to include fix for build with ASAN. (#3374)

Specialize warp kernels for common numbers of channels. (#3370)

Webdataset performance and cosmetic optimizations (#3360)

Update documentation about enabling sanitizers (#3365)

general perf changes alongside WDS perf (#3363)

Update CUTLASS and Google Benchmark (#3361)

Remove access to contiguous TL buffer from Coco Reader tests (#3351)

Remove access to contiguous TL buffer from BoxEncoder, Resize, Shapes and Warp (#3339)

Bump clang version to 12.0.1 in deps image (#3342)

Use DALIDataType where possible. (#3338)

Update asserts in python tests (#3336)

Webdataset reader operator implementation (#3306)

Work around PyTorch internal fragmentation in L3 SSD test. (#3343)

Make view converters operate on samples only (#3325)

Add an ability to avoid class remapping in coco reader (#3333)

Remove access to underlying contiguous TL buffer from tests (#3319)

Bug Fixes

Fix the Webdataset documentation formatting (#3395)

Fix documentation formatting (#3369)

Fix sharding and shuffling in VideoLoaderDecoder (#3411)

Fix pool process tracking in parallel ES tests, cleanup batches properly (#3400)

Fix ownership issues in Share APIs for Tensor, TL and TV (#3407)

Fix memory leak in async_pool destructor. (#3402)

Fix off build (#3399)

Fix HW decoder overwriting growth factor for CPU buffers (#3398)

Fix libtiff build (#3392)

Fix the memory kind stored in AllocInfo in nvjpeg memory. (#3393)

Fix bug in TensorList test (#3388)

Adjust default eps in video test (#3389)

Fix FFMPEG conda build (#3386)

Fix errors in TF YOLO example (#3379)

Adjust growth and shrink threshold for cpu buffers (#3378)

Fix error reporting in TL3_EfficientDet_convergence and TL3_YOLO_convergence (#3376)

Fix problems detected by asan and lsan (#3367)

Fix Coverity issues (#3366)

Fix EfficientDet docs link (#3364)

Fix Video reader race condition (#3355)

Fix variable batch size handling in pad operator (#3354)

Fix bugs in C API and refactor tests (#3350)

Fix and optimize name handling in TypeInfo. (#3349)

Fix sequence rearrange python test (#3353)

Handle SIGV situation when trying to load prebuilt DALI TF Plugin (#3347)

Fix DeviceBuffer copy - use proper copy function. (#3344)

Skip Keras TF tests in versions with broken exception handling (#3341)

Fix squeeze operator test on Python3.7 and earlier (#3337)

Use memory resources in DeviceBuffer and TestTensorList. (#3334)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires that key frames occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker
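The key-frame requirement in the first known issue above can be checked ahead of time when the key frame positions of a video are known. The following is a minimal sketch under that assumption: the function names are hypothetical, the key frame indices must be obtained separately with a video inspection tool, and the default threshold of 10 is the conservative end of the "10 to 15 frames" range.

```python
def max_key_frame_gap(key_frame_indices, total_frames):
    """Return the longest stretch of frames served by a single key frame.

    `key_frame_indices` are 0-based positions of key frames in the stream;
    the stretch from the last key frame to the end of the stream counts too.
    """
    if not key_frame_indices:
        return total_frames
    positions = sorted(key_frame_indices)
    boundaries = positions + [total_frames]
    gaps = [b - a for a, b in zip(boundaries, boundaries[1:])]
    # Also count any frames that precede the first key frame.
    return max(gaps + [positions[0]])


def key_frames_dense_enough(key_frame_indices, total_frames, max_gap=10):
    """True if a key frame occurs at least every `max_gap` frames."""
    return max_key_frame_gap(key_frame_indices, total_frames) <= max_gap


print(key_frames_dense_enough([0, 10, 20, 30], total_frames=40))
print(key_frames_dense_enough([0, 30], total_frames=60))
```

Videos that fail such a check are candidates for re-encoding with a shorter key frame interval before being used with the video loader.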

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.7.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.7.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.7.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.7.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.7.0-3161365-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.7.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.7.0-3161358-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.7.0-3161358-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.7.0.tar.gz
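Before installing one of the CUDA 11.0 wheels above, the driver requirement stated in the note (450.80 or later) can be verified with a simple version comparison. This is a minimal sketch; the function names are hypothetical, and the driver version string is assumed to be obtained separately (for example from the output of `nvidia-smi`).

```python
def parse_driver_version(text):
    """Turn a driver version string such as "450.80.02" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum driver for the CUDA 11.0 build, as stated in the note above.
MIN_DRIVER = (450, 80)


def driver_supports_cuda110_build(driver_version):
    """True if the driver version is at least 450.80."""
    return parse_driver_version(driver_version) >= MIN_DRIVER


print(driver_supports_cuda110_build("450.80.02"))
print(driver_supports_cuda110_build("440.33.01"))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexicographically.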

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.6.0 (Sep 24, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Added support for lambdas and local functions as callbacks in the parallel external_source operator (#3270, #3269).

Added the following tutorials:

TensorFlow DALI Dataset input handling (#3212).

Parallel external_source operator (#3199).

Added DALI preprocessing to the EfficientDet example (#3118).
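The lambda and local-function support for parallel external_source callbacks listed above is notable because Python's standard `pickle` module cannot serialize such callables, which matters whenever a callback has to be sent to another process. The sketch below demonstrates only that standard-library limitation; it does not use DALI and does not show how DALI works around it.

```python
import math
import pickle


def make_local_callback(scale):
    # A local (nested) function that closes over `scale`.
    def callback(sample_index):
        return sample_index * scale
    return callback


def picklable(obj):
    """Report whether the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print("importable function:", picklable(math.sqrt))
print("lambda:", picklable(lambda i: i * 2))
print("local function:", picklable(make_local_callback(2)))
```

Functions that can be imported by name are pickled by reference, whereas lambdas and nested functions have no importable name for the standard pickler to record.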

Fixed issues

This DALI release includes the following fixes:

Fixed a crash that happened in the gaussian_blur operator for inputs where one of the dimensions equals 1 (#3291).

Fixed random Python crashes on the process teardown when the external_source operator was used (#3245).

Fixed readers.video hanging on some HEVC samples (#3247).

Improvements

Add error message checking in python tests (#3324)

Optimize bundling wheel by using multiprocessing in build_helper.sh (#3323)

Changed "accross" to "across" in README.rst (#3329)

Move to CUDA 11.4 update 2 (#3322)

Fix FFmpeg vulnerabilities (CVE-2020-22037, CVE-2021-38171, CVE-2021-38291) (#3315)

Rework displacement filter to sample-based approach (#3311)

Remove kernels/alloc.h (#3309)

Adjust usage of raises and assert_raises in tests (#3318)

Move static UserStream variable to the Get function inside the class (#3242)

Adjust usage of raise and assert_raises (#3316)

Update README with third parties dependencies (#3320)

Add input type validation to feed_ndarray in MXNet and PyTorch (#3308)

Add parameters checks when deserializing a pipeline (#3253)

Extend BlockSetup with 1-dim specialization (#3304)

Move back to upstream libtar from conda (#3301)

Rework LUT to batch processing and remove access to TL buffer (#3298)

Add checking a message of the expected exception against a pattern in nose tests (#3302)

Use libcu++ interfaces. (#3297)

Update third party dependencies (#3300)

Pin nvJPEG2000 and GPU Direct dependencies (#3299)

Bump up nvidia tensorflow version to 21.08 in tests (#3296)

Implement InputDatasets for DALIDataset (#3292)

Remove access to underlying contiguous TL buffer in bb_flip op (#3283)

Make memory kind a tag type instead of an enum value. (#3290)

Add examples on serialization to parallel external source notebook (#3270)

Support lambdas and local functions as callbacks in parallel ExternalSource (#3269)

TarArchive::TellArchive implementation + renaming (#3286)

Remove access to underlying contiguous TL buffer in Flip op (#3280)

Remove access to underlying contiguous TL buffer in Normalize op (#3281)

Use default resources for allocating tensors (#2948)

Remove access to underlying contiguous TL buffer in Constant op (#3276)

TarArchive additional features (#3273)

Add ScatterGatherCPU and rework Copy op to batch processing (#3266)

Change the way how start and end timestamps are converted to frames (#3252)

Update RMM to an up-to-date version with the interface rework applied. (#3254)

Test fused decoder out-of-bounds error (#3175)

Bump supported tested TensorFlow versions (#3250)

Update supported CUDA version in docker/build.sh (#3248)

Adjust capitalization in tutorials (#3246)

Remove inapplicable explanatory note from PyTorch and Paddle iterators (#3235)

Add tutorial about TF DALI Dataset input handling (#3212)

Add tutorials for Parallel External Source (#3199)

Add DALI to EfficientDet example (#3118)

Use fn.random module in tests and examples (#3174)

Bug Fixes

Improve tests for expected errors + fix PythonFunction (#3332)

Fix incorrect use of a global variable in the test of operator Shapes. (#3310)

Rework Cast to batch processing (#3278)

Fix HEVC video handling (#3247)

Fix infinite loop for convolution with extent equal 1 (#3291)

Add yaml as a Webdataset test dependency, adjust to new WDS API (#3295)

Fix missing condition variable include (#3289)

Remove the inclusion of scatter_gather.h from types.h (#3288)

Fix cast warning in ScatterGather (#3284)

Clear to_dealloc and notify under a lock. (#3282)

Fix notification method in deferred deallocation. (#3279)

Fix race condition when initializing plain host memory resource. (#3268)

Fix alignment constraints in CUDA VM resource. (#3274)

Fix missing sizeof in Tensor Test (#3267)

Fix hw decoder tests disabled on old drivers (#3257)

Don't increase alignment to upstream alignment when retrying to allocate (#3264)

Avoid creating primary context for synchronization. (#3263)

Avoid upstream allocation stampede by retrying to allocate from free after gaining the upstream lock. (#3258)

Remove excessive synchronization in AsyncPool. (#3256)

Ensure keeping py_pool alive until pipeline is garbage collected (#3245)

Fix running Python core tests (#3249)

Fix an assignment of py::none() to py::dict in backend_impl.cc (#3244)

Fix interoperation between DALI and PyTorch lightning due to buffering (#3239)

Reduce number of iterations in L0 tests (#3173)

Fix memory leak in backend_impl.cc caused by PyObject_GetAttr (#3233)

Fix FFmpeg CVE-2021-38114 (#3231)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires that key frames occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.6.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.6.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.6.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.6.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.6.0-2993095-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.6.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.6.0-2993096-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.6.0-2993096-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.6.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.5.0 (Aug 23, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Extended decoders.image to support WebP decoding (#3206)

Added indexing (NumPy-like) API for tensor slicing (#3200 and #3195)

Extended external_source to support source argument in TensorFlow DALI Dataset (#3215, #3193, #3177 and #3176)

Added examples:

Tensorflow YOLOv4 (#2883)

WebDataset usage with external_source (#3153)
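For readers unfamiliar with the WebDataset format referenced in the example above: a dataset is stored as tar archives in which files that share a base name and differ only in extension form one sample. The following minimal sketch groups archive members that way using only the standard library. It is not the DALI example itself, and the file names and payloads are made-up placeholders.

```python
import io
import tarfile
from collections import defaultdict


def group_samples(tar_bytes):
    """Group tar members into samples keyed by their shared base name."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Split the file name at its first dot: "0001.jpg" -> ("0001", "jpg").
            directory, _, base = member.name.rpartition("/")
            stem, _, extension = base.partition(".")
            key = f"{directory}/{stem}" if directory else stem
            samples[key][extension] = archive.extractfile(member).read()
    return dict(samples)


def make_demo_archive():
    """Create a tiny in-memory archive with two samples (placeholder payloads)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in [("0001.jpg", b"img-1"), ("0001.cls", b"3"),
                              ("0002.jpg", b"img-2"), ("0002.cls", b"7")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


samples = group_samples(make_demo_archive())
print(sorted(samples))
print(samples["0001"]["cls"])
```

A callable that yields such grouped samples one at a time is the kind of source that could then be handed to external_source.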

Fixed issues

This DALI release includes the following fixes:

Fixed include paths that prevented including some parts of DALI in other C/C++ projects (#3210)

Fixed a crash when only anchors and no shapes were provided in multi_paste (#3166)

In the spectrogram operator, extracted windows are now correctly centered before FFT calculation, when the nfft argument is bigger than length of the window. (#3180)

Fixed a minor memory leak in decoders.image (#3148)
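The spectrogram fix above (#3180) concerns windows that are shorter than nfft. A minimal sketch of what centering means in that case follows; it is plain Python rather than DALI's kernel, and the choice of which side receives the extra sample when the difference is odd is an assumption.

```python
def center_window(window, nfft):
    """Place `window` in the middle of a zero-filled buffer of length `nfft`."""
    if nfft < len(window):
        raise ValueError("nfft must be at least the window length")
    # Assumption: any odd leftover sample goes to the right-hand side.
    left = (nfft - len(window)) // 2
    right = nfft - len(window) - left
    return [0.0] * left + list(window) + [0.0] * right


padded = center_window([1.0, 2.0, 3.0, 4.0], nfft=8)
print(padded)
```

Without centering, the window samples would sit at the start of the buffer, which shifts the phase of the resulting FFT relative to the centered layout.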

Improvements

Add documentation for indexing. (#3200)

Move to CUDA 11.4U1 (#3213)

Add WebP support to image decoder (#3206)

libtar API implementation (#3198)

Tensor indexing (#3195)

Make TF graph-mode tests faster (#3204)

Add support for ES source in TF DALI Dataset (#3177)

Add tensorflow YOLOv4 example (#2883)

Refactor Python External Source code (#3176)

Update third party dependencies to latest release versions (#3184)

Add deferred deallocation to cuda_vm_resource. (#3154)

Adjust test scripts and section header for webdataset notebook (#3162)

Add Webdataset-ExternalSource Jupyter notebook (#3153)

Update PR template (#3150)

Update PR template (#3129)

Bug Fixes

Fix failing TarArchive tests (#3226)

Build custom libtar in conda (#3223)

Improve validation in DALIDataset (#3215)

Update DALI_DEPS_VERSION to include https://github.com/NVIDIA/DALI_deps/pull/19 (#3224)

Fix identity check in _is_generator_function. Add test. (#3216)

Fix unused imports in test_utils.py (#3214)

Remove the usage of ManagedMemory from the OpticalFlow tests (#3211)

Suppress test using unified memory when it is not supported (#3209)

Remove include prefix from include paths (#3210)

Fix CVE-2021-3246 in libsnd (#3208)

Fix pytorch-lightning test (#3196)

Fix coverity issues + skip tests involving managed memory when not supported. (#3190)

Disable NVJPEG HW decoder for driver < 455 due to performance reason (#3189)

Fix compilation with newer GCC (#3188)

Disallow some types of sources for parallel ES explicitly (#3193)

Center windows when extracting windows to a bigger output window (#3180)

Add a compute cap value before running the GDS test (#3185)

MultiPaste to adjust the region shape to cover up to the end of the input shape (#3166)

Fix wording in docs (#3165)

Fix image decode (#3148)

Fix LastBatchPolicy doc and update Parallel ES wording (#3152)

Fix some errors (#3147)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires that key frames occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.5.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.5.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.5.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.5.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.5.0-2725759-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.5.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.5.0-2725760-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.5.0-2725760-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.5.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.4.0 (Jul 26, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

readers.numpy improvements:

Added ROI support in the GPU operator (#3034 and #3040).

Parallelized reading in the CPU operator (#3077).

Added a tutorial (#3095 and #3139).

DALI Dataset improvements:

Added batch support (#3063 and #3089).

Enabled no_copy mode (#3041, #3058, and #3097).

Video reader improvements:

Added an option to pad missing frames at the end of sequence (#3002).

Added support for the VP8 and MJPEG formats (#3045).

Added CPU parallelization to the Slice and SliceFlipNormalizePermutePad kernels. (#3062, #3068, and #3080)

Added an option to readers.nemo_asr to return indices of the entries in the manifest (#3085).

Improved the performance in the GPU image decoder by optimizing the memory allocations (#3067).
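The video reader option for padding missing frames at the end of a sequence, listed above, can be pictured with the following minimal sketch. It is plain Python for illustration, not DALI code; the function name is hypothetical, frames are represented as flat lists, and filling the missing frames with zeros is an assumption.

```python
def pad_sequence(frames, sequence_length, frame_size):
    """Pad `frames` with zero-filled frames up to `sequence_length`.

    Each frame is a flat list of `frame_size` values. Assumption: missing
    frames are filled with zeros.
    """
    missing = sequence_length - len(frames)
    if missing < 0:
        raise ValueError("sequence is longer than sequence_length")
    return list(frames) + [[0] * frame_size for _ in range(missing)]


# The last 2 frames of a video, requested as a sequence of 4 frames.
tail = [[5, 5, 5], [6, 6, 6]]
print(pad_sequence(tail, sequence_length=4, frame_size=3))
```

Padding lets the final, incomplete sequence of a video be returned with the same shape as all other sequences instead of being dropped.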

Fixed issues

This DALI release includes the following fixes:

Fixed a crash that happened when a functools.partial result was passed as a source to external_source (#3143).

Fixed the hardware image decoder to fall back to the hybrid implementation for unsupported file formats instead of throwing an error (#3086).

Improvements

Add NumpyReader tutorial to the rendered documentation page (#3139)

Update docs analytics tracking (#3135)

VM async_pool - refactoring & tests (#3117)

Extend the video loader error message for vfr videos on how to disable the check in case of false positives (#3125)

Integer literal suffixes (#3122)

SliceCPU kernel to run plain memcpy when applicable (#3110)

CUDA VM memory resource (#3114)

Add Numpy Reader Tutorial (#3095)

Bump TensorFlow version in tests (#3107)

Efficient det code drop (#3115)

Move to CUDA 11.4 build (#3109)

Add batch support to DALI Dataset (#3089)

Update third party dependencies (#3093)

Add bitmask::append. (#3101)

Free list API cleanup. (#3100)

NemoAsrReader to optionally return indices of the entries in the manifest. (#3085)

Parallelize reading in NumpyReader CPU (#3077)

Bit mask utility (#3083)

Add ExecutionEngine to SliceFlipNormalizePermutePad CPU kernel, to allow parallel execution (#3080)

Add an ability to pad missing frames in the Video reader sequence (#3002)

Rework the TF DALIDataset input API (#3063)

Add ExecutionEngine to Slice CPU kernel, to allow parallel execution (#3068)

Use HW NVJPEG decoder memory pool even if size hint is not set (#3067)

CUDA Virtual Memory API wrappers. (#3064)

Add information about installing CUDA 10.2 DALI version (#3066)

Add image decoder memory hints for nvJPEG in DALI examples (#3029)

Add split shape utility (#3062)

Add ROI support to NumpyReader GPU (#3034)

Enable no_copy mode handling in TF DALI Dataset (#3058)

Add support for VP8 and MJPEG videos (#3045)

Make pytorch lightning example work with multiple GPUs (#3037)

Add override flags for no_copy option of External Source (#3041)

Add NumpyFileWrapper to numpy loader (#3054)

Add a mention of CPU-only arguments inputs in docs (#3039)

Minor changes in Slice GPU kernels, before reusing them in NumpyReader GPU (#3040)

Bug fixes

Fix hint handling: (#3145)

Add support for functools.partial in ExternalSource. (#3143)

Install libcufile (for GDS) as a part of the cuda base build step (#3142)

Add check of strerror_r return value in CUFile HandleIOError (#3141)

Disable VMAsyncPool CrossStream test on incompatible platforms. (#3140)

Fix the lack of execution of variable batch size test (#3134)

Throw std::bad_alloc when ordinary host memory runs out + tests for xxx_malloc resources. (#3131)

Fix allocation hint handling in CUDA VM resource (#3128)

Revert change from python to Python_EXECUTABLE (#3126)

Coverity issue fixes - bulk drop, July 2021 (#3124)

Make nvJPEG detect corrupted stream before offloading to HW decoder (#3113)

Add --no-index option to TL1_tensorflow-dali_test test (#3112)

Minor fixes (#3119)

DALI TF install tool: Copy files for import check, rather than symlink (#3116)

minor fixes (#3108)

Dali TF installation: check import before completing the installation (#3104)

Remove no longer applicable sed command from RN50 MXNet test (#3103)

Use DALI_extra instead of example_audio_file in the spectrogram example (#3106)

Unify apt-get invocations (#3094)

Make DALI extra download optional in tests (#3102)

Remove pre CUDA 10.0 support in TL1_tensorflow-dali_test (#3099)

Bug fixes (#3096)

MMUtilFixes: (#3098)

Fix override no copy flags for External Source C API (#3097)

Fix HW decoder fallback to the hybrid decoder (#3086)

Fix DALI installation for python 3.9 version (#3092)

Fix python test on aarch64 platform (#3091)

Move pycocotools to regular pip packages in SSD test (#3090)

Use PEP 503 compatible extra url index to install PyTorch (#3079)

Remove compiler name subdirectory in DALI TF prebuilt directory (#3078)

Disable MNIST dataset download for DALI pipelines (#3075)

Fix known FFmpeg n4.4 vulnerabilities (#3071)

Fix DALI TF Plugin build in TF 2.6 (#3074)

Fix error handling in Executor (#3069)

Fix typo inout -> input (#3070)

Fix error message when creating a TensorShape from iterators with more elements than expected (#3060)

Add warning about not using external_inputs in proto (#3057)

Fix usage of removed _ExternalSource in test (#3059)

Make the Python test utilities have local random state (#3055)

Fix batch size handling in PermuteBatch. (#3026)

Update FFmpeg to address CVE-2021-33815 (#3053)

Remove duplicated ExternalSource implementation (#3033)

Build the latest clang from source (#3025)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires that key frames occur at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Note: Starting from version 1.4.0, DALI will be providing CUDA 10.2 builds instead of CUDA 10.0

Install via pip for CUDA 10.2:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.4.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.4.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.4.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.4.0

Or use direct download links (CUDA 10.2):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.4.0-2575284-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.4.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.4.0-2575285-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.4.0-2575285-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.4.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.3.0 (Jun 30, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

New operator:

Salt and Pepper noise (noise.salt_and_pepper) for CPU and GPU (#2889, #2934, #2956, and #2976).

Added experimental support for inputs via external_source in TensorFlow DALIDataset (#2949, #2993, and #2997).

Numpy reader improvements:

ROI reading for CPU (#3011).

Intra-sample threading on GPU (#3010).

Improved CPU color_space_conversion operator performance (#2987).

Improved brightness and contrast operators performance (#2981).

Added a C API call to check backend of an operator (#3031 and #3050).

Documentation improvements (#2936, #2960, #2979, #2972, #3013, and #3035).
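To illustrate the noise model behind the new noise.salt_and_pepper operator listed above, the following minimal sketch applies salt-and-pepper noise to a flat list of pixel values. It is a plain Python illustration, not DALI's implementation; the function signature and parameter names are chosen for this example and are assumptions.

```python
import random


def salt_and_pepper(pixels, prob, salt_vs_pepper=0.5, salt=255, pepper=0, seed=None):
    """Replace each pixel with salt or pepper with probability `prob`.

    `salt_vs_pepper` is the chance that a replaced pixel becomes salt.
    """
    rng = random.Random(seed)
    noisy = []
    for value in pixels:
        if rng.random() < prob:
            noisy.append(salt if rng.random() < salt_vs_pepper else pepper)
        else:
            noisy.append(value)
    return noisy


image = [128] * 20  # a flat list standing in for a grayscale image
print(salt_and_pepper(image, prob=0.2, seed=0))
```

Each output pixel is either unchanged or forced to one of the two extreme values, which is what distinguishes this noise from additive noise such as Gaussian noise.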

Fixed issues

This DALI release includes the following fixes:

Fixed an issue in readers.nemo_asr that caused a system error due to keeping too many open files (#3003).

Fixed a bug that caused out of bound memory access in mel_filter_bank (#2986).

Fixed a cudaErrorLaunchOutOfResources error that appeared in transpose operator on some GPUs (#2971).

Fixed handling of non-existing entries in readers.tfrecord (#2952).

Improvements

Rework numpy reader tests (#3036)

Extend HW decoder bench tool (#3043)

Remove space from file name (#3038)

Add experimental input support to TF DALIDataset (#2997)

Use BrightnessContrast as implementation of Brightness and Contrast ops (#2981)

Add C API call to check backend of an operator (#3031)

Fix Video reader documentation (#3035)

Enable DALI to build for CUDA 10.2 (#3007)

NumpyReader: Add support for ROI (#3016)

Add git hooks (#3023)

Update third party (#3009)

Add channel count checking in Dump Image (#3020)

Add parallel chunking support in GPU variant of the numpy reader operator (#3010)

NumpyReader to use HostWorkspace (#3011)

Update documentation of random.uniform to reflect data type conversion behavior (#3013)

Adjust tf code for experimental Dataset with inputs (#2993)

Add best-fit free tree. (#2996)

Refine torch audio pipeline tests: adding frame splicing, fix sequence length calculation, reflect pad start/end of the signal (#2992)

Rename free_tree to coalescing_free_tree. (#2995)

Use thread_pool in ColorSpaceConversion (#2987)

Move to CUDA 11.3 update 1 (#2990)

pool_resource: upstream lock & refactoring (#2988)

Add tests to cover OGG Vorbis, and FLAC audio formats (#2980)

Add synchronization and deferred deallocation to pool_resource (#2983)

Update FFmpeg, fix video container tests (#2918)

Add Preemphasis border policy (#2984)

Numba function operator, docs update (#2972)

Add a link to the DALI roadmap in the main readme and the documentation (#2979)

Add BOOL_SWITCH (#2974)

Add libopus to the binaries distributed with the wheel (#2969)

Add SaltAndPepper GPU operator (#2956)

Update documentation about supported TensorFlow versions by DALI (#2960)

Guard changes to default resources with a mutex. (#2955)

Add Salt and Pepper noise CPU operator (#2889)

Core allocation functions - improve alignment handling (#2947)

Add portable FP16 type & tests. (#2941)

RNGBase: Separate noise generation and application steps (#2934)

Add information about Open-CE effort that provides DALI (#2936)

Bug fixes

Remove mixed image decoder from GetBackendTest (#3050)

Fix pip download folder usage (#3028)

Avoid pre-commit hook for merge commits (#3032)

Coverity issue fixes. (#3021)

Add more connection attempts in setup_packages.py and increase the timeout to 100s (#3024)

Add 60s timeout for URL request in setup_packages.py (#3018)

Check CUDA API return values in device-side test helper. (#3017)

Run baseline pipelines on separate devices (#3012)

Multi paste refactor & fix (#3008)

Remove outdated warning about not supported ROI HW decoding (#2998)

NemoAsrLoader: Close file handles after reading metadata (#3003)

Improve Element Extract Op (#3004)

Temporarily disable test due to incompatible free list. (#3001)

Work around large alignas bug - align manually. (#3000)

Lifts the sm limitation that is tested in the numpy reader test (#2994)

MultiPaste: Fix in_ids argument type in the schema (#2965)

Fix a buffer overrun when the trailing dimension is collapsed. (#2986)

Add missing #include (#2985)

Enable SaltAndPepper GPU variable batch size tests (#2976)

Add missing tests to test_dali_variable_batch_size.py (#2982)

Change all reference to the master branch in the documentation (#2977)

Add missing tests to test_dali_cpu_only.py (#2964)

Add launch bounds to TransposeBatch kernel to avoid cudaErrorLaunchOutOfResources (#2971)

Fix deps docker with custom DALI_deps SHA (#2970)

Add coverage test for CPU only and variable batch size test (#2962)

Enable variable batch size tests (#2957)

Fix returning memory to upstream from pool resource (#2961)

Fix handling of non_existing entries in TFRecord reader (#2952)

Enable pool to return memory to the upstream upon Out-of-Memory. (#2951)

Fix mixed indent in tf.py (#2949)

Fix bug in default constructed curand_uniform_dist (#2946)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires a key frame at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow is available on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to known issues with Meltdown/Spectre mitigations, DALI shows its best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker
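The key-frame requirement in the first known issue can be checked ahead of time once the key-frame positions of a stream are known. The sketch below is a hypothetical helper, not part of DALI. It assumes the caller has already obtained the 0-based key-frame indices with some external tool, and it takes the upper end of the stated range (15 frames) as the default limit.

```python
def max_key_frame_gap(key_frame_indices, total_frames):
    """Longest distance, in frames, from one key frame to the next one
    (or to the end of the stream for the last key frame)."""
    positions = sorted(key_frame_indices)
    if not positions:
        # No key frames at all: the whole stream is one gap.
        return total_frames
    boundaries = positions + [total_frames]
    return max(b - a for a, b in zip(boundaries, boundaries[1:]))


def meets_key_frame_requirement(key_frame_indices, total_frames, max_interval=15):
    # max_interval=15 is an assumption taken from the "10 to 15 frames" wording.
    return max_key_frame_gap(key_frame_indices, total_frames) <= max_interval
```

For a 30-frame clip with key frames at 0, 12 and 24 the largest gap is 12, so the check passes; with key frames only at 0 and 40 in a 50-frame clip the gap is 40 and it fails.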

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==1.3.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==1.3.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.3.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.3.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-1.3.0-2471498-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-1.3.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.3.0-2471497-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.3.0-2471497-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.3.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.2.0 (May 24, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

New operators:

noise.shot CPU and GPU operators (#2861)

noise.gaussian CPU and GPU operators (#2846)

jpeg_compression_distortion CPU and GPU operators (#2823)

New mathematical operations (#2853):

Square root, reciprocal square root, and cube root (sqrt, rsqrt, and cbrt)

Logarithms of different bases (log2 and log10)

Power (** operator and pow function)

Absolute value (abs and fabs)

Roundings (ceil and floor)

Trigonometric functions (sin, cos, and tan)

Inverse trigonometric functions (asin, acos, atan, and atan2)

Hyperbolic functions (sinh, cosh, and tanh)

Inverse hyperbolic functions (asinh, acosh, and atanh)

Added a Python wrapper for the fn.experimental.numba_function (#2886, #2835, #2903, #2893, and #2887)

Image decoder improvements:

Enabled ROI decoding in the hardware decoder (#2734).

Added support for the alpha channel in PNG and JP2 decoding (#2867).

Added support for YCbCr and BGR in JP2 decoding (#2867).

Updated the CUDA version to 11.3 (#2870).

Improved the documentation (#2915, #2911, #2927, #2862, and #2858).
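For readers unfamiliar with some of the names in the list of new mathematical operations above, the following plain-Python reference states the scalar meaning of a representative subset using the standard math module. It is only a sketch for orientation: the UNARY and BINARY tables and the apply_unary helper are hypothetical, and how DALI evaluates these operations on batches is not shown.

```python
import math

# Scalar meaning of a subset of the one-argument operations named above.
UNARY = {
    "sqrt": math.sqrt,
    "rsqrt": lambda x: 1.0 / math.sqrt(x),  # reciprocal square root
    "cbrt": lambda x: math.copysign(abs(x) ** (1.0 / 3.0), x),  # cube root, sign kept
    "log2": math.log2,
    "log10": math.log10,
    "abs": abs,
    "fabs": math.fabs,  # absolute value, always a float
    "ceil": math.ceil,
    "floor": math.floor,
    "sin": math.sin,
    "asin": math.asin,
    "sinh": math.sinh,
    "asinh": math.asinh,
}

# Two-argument operations from the same list.
BINARY = {
    "pow": math.pow,      # same result as the ** operator for floats
    "atan2": math.atan2,  # angle of the point (x, y), called as atan2(y, x)
}


def apply_unary(name, values):
    """Apply one named operation to every element of a list."""
    fn = UNARY[name]
    return [fn(v) for v in values]
```

The remaining trigonometric and hyperbolic functions in the list follow the same pattern as sin, asin, sinh and asinh.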

Fixed issues

This DALI release includes the following fixes:

Fixed the readers.numpy cache issue (#2932).

Fixed an error in readers.nemo_asr (#2928).

Fixed a bug that caused the video reader to hang (#2916).

Improvements

Improve Tensors docs (#2915)

DALI core allocation functions (#2930)

Update FFmpeg build guide and update DALI_deps version (#2911)

Default memory resources (#2890)

Better error message when insufficient data in cache (#2924)

Add a link to the TensorFlow ResNet50 training script in the Readme (#2927)

Numba func notebook (#2886)

Enable HW decoder ROI support (#2734)

Use a custom color space conversion kernel for all conversions (#2907)

Update packages used for DALI tests (#2906)

Refactor TF Dataset code and lint it (#2909)

Add ShotNoise CPU and GPU operators (#2861)

Remove workaround for the problem with patchelf changing TLS alignment for CUDA < 10.2 and > 11.1 (#2879)

Add dali_data_type_vec (#2887)

Composite resource + renaming. (#2891)

Update deps in third_party and conda (#2878)

Python wrapper for numba (#2835)

Image Decoder: Unified behavior across backends, alpha channel support in PNG and JP2, YCbCr support in JP2 (#2867)

Better error handling in pipeline.py (#2864)

Update DALI deps (#2876)

Enable CUDA 11.3 based builds (#2870)

Updates MXNet plugin documentation regarding last_batch_policy (#2862)

README update with GTC2021 materials (#2860)

RNGBase to be used as base for noise augmentations + Add GaussianNoise operator (as an example) (#2846)

Pinned async resource (#2858)

Add more mathematical operations (#2853)

Add JpegCompressionDistortion CPU and GPU operators (#2823)

Split Python tests into smaller chunks (#2847)

Asynchronous pool memory resource (#2814)

Bug fixes

Add missing opencv-python dependency to TL2_FW_iterators_perf test (#2939)

Fix numpy reader header cache (#2932)

NemoAsrReader: Call Reset() on tensor vector holding the batch, to clear any previous shared data pointer. (#2928)

Fix DALI compilation for CUDA 11 pre 11.3 version (#2925)

Make dynlink_xxx use statically linked functions to load symbols. (#2931)

Fix test_detection_pipeline.py (#2929)

Add a missing av_bsf_flush call to a VideoReader seek function (#2916)

Run Optical Flow on stream 0 when running driver > 460. (#2914)

Fix nvcc warning about unused arguments in ResampleDepth_Channels (#2913)

Fix CUDA 10.0 compilation (#2917)

Use stream 0 in VideoDecoder when running driver >460 / CUDA >= 11.3. (#2902)

Fix docs and rename numba_func to numba_function (#2903)

Allow to specify optional args of Python-only types (#2898)

DALI TF install tool: Verify that a compatible prebuilt plugin is available for the required TF version before proceeding to attempt installation (#2882)

Fix coverity issues by adding lacking CUDA_CALL (#2888)

Fix failing test for Numba Func (#2893)

Fix double accumulation in horizontal resampling. Add test. (#2871)

Add epsilon to math function tests and adjust epsilon for rsqrt. (#2865)

Do not schedule any pipeline run when the iterator has prepare_first_batch=False (#2859)

Adjust the filenames of decoder test files and update licenses (#2844)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires a key frame at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow is available on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to known issues with Meltdown/Spectre mitigations, DALI shows its best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==1.2.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==1.2.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.2.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.2.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-1.2.0-2353277-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-1.2.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.2.0-2356513-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.2.0-2356513-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.2.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.31.tar.gz

v1.1.0 (Apr 15, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Documentation improvements (#2834, #2824, #2831, #2758, #2820, and #2822).

The following operators were added:

The experimental numba_func operator that allows the use of Numba functions in the DALI pipeline (#2804).

The expand_dims and squeeze operators for shape manipulation (GPU and CPU) (#2800, #2791, #2792).

The multi_paste operator (GPU) (#2681).

The following kernels were added:

JPEG compression distortion (GPU) (#2801, #2830, and #2839).

JPEG color conversion and chroma subsampling (GPU) (#2771).

Enabled CUDA kernels compression to decrease the DALI binaries size (#2833).

Added the src_dims argument to the reshape operator (#2788).

Fixed issues

This DALI release includes the following fixes:

Fixed a race condition in readers.nemo_asr when pad_last_batch is set to True (#2828).

Fixed the optical flow initialization issue (#2816).

Fixed a race condition in the data loader (#2773).

Improvements

Remove 0 default value from mean/std arguments of normalize. (#2834)

Add JpegCompressionDistortionGPU kernel (#2830)

Updates the pipeline docs page (#2824)

Enable CUDA kernels compression in the final binary (#2833)

Updates build documentation (#2831)

Update key visual (#2822)

Add NumbaFunc operator (#2804)

Add JPEG distortion kernel (#2801)

Add AddArg overloads for enum types (#2819)

Update third party dependencies to latest release versions (#2811)

Add an ability to provide a custom DALI_extra sha via env variable (#2810)

Move all deps into subrepos (#2756)

Reshape, Reinterpret, Squeeze and ExpandDims tutorial. (#2791)

Separate dependency creation from CUDA installation (#2786)

Remove intermediate stage from CUDA toolkit dockerfile (#2803)

Add Expand dims operator (#2800)

Update TensorFlow ResNet50 example to the latest horovod 21.03 (#2793)

Add squeeze operator (#2792)

Add JPEG color conversion and chroma subsampling kernel (#2771)

Add src_dims to reshape operator (#2788)

GPU MultiPaste (#2681)

Add --upgrade to pip install commands in documentation (#2758)

Use flattened view of the array for copying to shared memory. (#2783)

Bug fixes

Fix JPEG distortion kernel quality parameter handling (#2839)

Fix typo "funcions" -> "functions" in math doc (#2820)

Update DALI_deps to include FLAC security patch (#2826)

Fix coverity issues (#2812)

Fix optical flow parameter initialization. (#2816)

Add host fallback when nvjpegDecodeJpegDevice and nvjpegDecodeJpegHost fail (#2805)

ExternalSource - discard data from all callbacks when one raises StopIteration (#2784)

Exclude PyTorch Lightning test with MNIST (#2785)

Fix iteration number tracking with pipeline.reset (#2777)

Fix a race where the loader starts reading before the metadata is ready (#2773)

Fix race condition in NemoAsrReader when pad_last_batch is set to True (#2828)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

There are no deprecated features in this DALI release.

Known issues:

The video loader operator requires a key frame at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow is available on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to known issues with Meltdown/Spectre mitigations, DALI shows its best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==1.1.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==1.1.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.1.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.1.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-1.1.0-2159051-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-1.1.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.1.0-2159930-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.1.0-2159930-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.1.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz

v1.0.0 (Mar 24, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

The API documentation has been improved:

The functional API has become the main DALI API (#2653).

Rewrote all examples to use the functional API (#2761, #2755, #2744, #2748, #2745, and #2716).

Applied layout and editorial changes (#2729, #2730, #2713, #2710, #2703, and #2694).

New operators:

A GridMask GPU operator for GridMask data augmentation (#2652).

A RandomObjectBBox operator with caching to randomly select a bounding box (#2718, #2696, #2677, and #2657).

A MultiPaste operator, which is required to implement Mosaic augmentation (#2583).

External Source can now run the per-sample callbacks in parallel (#2543).

Added the pipeline_def decorator, which is an easier way to define a pipeline with the functional API (#2757 and #2629).

Moved all decoders to a dedicated Python module (#2741, #2743, and #2725).

Moved all readers to a dedicated Python module (#2720, #2721, #2717, #2715, and #2722).

Exposed the pipeline output names in the C API (#2665).

Introduced the following named Slice operator arguments (#2625):

start/rel_start

end/rel_end

shape/rel_shape

Enabled additional codecs and demuxers in FFmpeg (#2651).

Added an option to disable the first batch preparation during the iterator construction (#2664).
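The named Slice arguments listed above come in absolute and relative pairs. The sketch below shows one plausible way such pairs resolve to an absolute range along a single dimension. It is an illustration under stated assumptions, not DALI's implementation: relative values are taken to be fractions of the input extent, rounding to the nearest integer is assumed, and treating end and shape as mutually exclusive is also an assumption. The function names are hypothetical.

```python
def _resolve(absolute, relative, extent):
    """Return an absolute coordinate from an absolute/relative pair."""
    if absolute is not None and relative is not None:
        raise ValueError("give either the absolute or the relative value, not both")
    if relative is not None:
        # Assumption: relative values are fractions of the extent, rounded.
        return int(round(relative * extent))
    return absolute


def resolve_slice_1d(extent, start=None, rel_start=None, end=None,
                     rel_end=None, shape=None, rel_shape=None):
    """Resolve the named arguments to an absolute (begin, stop) pair."""
    begin = _resolve(start, rel_start, extent)
    if begin is None:
        begin = 0  # no start given: slice from the beginning
    stop = _resolve(end, rel_end, extent)
    length = _resolve(shape, rel_shape, extent)
    if stop is not None and length is not None:
        raise ValueError("give either an end or a shape, not both")
    if length is not None:
        stop = begin + length
    if stop is None:
        stop = extent  # no end or shape given: slice to the end
    return begin, stop
```

Under these assumptions, the middle half of a 200-element dimension can be requested in three equivalent ways: rel_start=0.25 with rel_end=0.75, start=50 with shape=100, or rel_start=0.25 with rel_shape=0.5.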

Fixed issues

This DALI release includes the following fixes:

Fixed the JPEG 2000 ROI decoding (#2692).

Fixed the layout length check in Transpose (#2693).

Fixed the .gpu() usage detection and error for CPU-only pipelines (#2682).

Improvements

Rework frameworks notebooks to fn API (#2761)

Bump up OpenCV-python version in tests (#2749)

Enhance deprecated argument documentation (#2755)

Convert notebooks to fn API: audio_processing, custom_operator, serialization (#2744)

Expose all pipeline constructor arguments as properties. (#2757)

Convert notebooks to fn API: sequence_processing (#2748)

Gridmask Gpu (#2652)

Run external source callback in parallel (#2543)

Bump up nvidia-tensorflow version to 1.15.5 21.02 (#2738)

Rewrite image processing examples to fn api. (#2745)

Update augmentation gallery (#2716)

Remove dynlink CUDA libs from the build image (#2739)

Rework getting started (#2729)

Adjust Python decoders tests to decoders module (#2741)

Adjust notebooks to new decoder module (#2743)

Update memory resource interfaces. (#2742)

Move decoders to decoders module (#2725)

Add Examples and Tutorials metadata title (#2730)

Adjust test to new readers module (#2720)

Adjust examples to new readers module (#2721)

Documentation home update (#2713)

Move tfrecord reader to readers module (#2722)

Move readers to dedicated submodule (#2717)

Add hash-based caching to RandomObjectBBox. (#2718)

Add break of VideoReader loop when keyframe past requested has been reached (#2706)

Improve set_outputs to accept list or tuple of data nodes as well (#2698)

Documentation: New layout of Examples and Tutorials section (#2710)

Rename test files for readers (#2715)

Add error checking if provided shape to tfrecord can house underlying data (#2705)

Documentation editorial changes: Init caps for all headings, Copyright update (#2703)

Add documentation to functional API (all fn.*) + New documentation layout (#2653)

Parallel random object BBox (#2677)

Rework ThreadPool and spinlock (#2696)

Improvements in Dockerfile.deps so that RUN commands are easily run in a non-docker environment (#2686)

Fix formatting of Resnet-N with Tensorflow example (#2694)

Operator RandomObjectBBox (#2657)

MultiPaste operator (#2583)

Add better exception granularity to memory::alloc_shared and memory::alloc_unique (#2683)

Make DALI pipeline use default seed (-1) when None is set to seed (#2676)

Make preparation of the first batch during the iterator construction optional (#2664)

Parallelize commands in bundle-wheel.sh (#2672)

Pipeline decorator (#2629)

Move to CUDA 11.2 update 1 (#2668)

Make sure that OpenCV decoding fallback follows EXIF information handling (#2666)

Expose names of Pipeline outputs in C API (#2665)

Enable named Slice arguments: start/rel_start, end/rel_end, shape/rel_shape (#2625)

Update nvidia-tensorflow in qa scripts to 20.12 (#2654)

Enable more codecs and demuxers in FFmpeg (#2651)

Bug fixes

Fix paddle ssd (#2765)

Fix Gluon example (#2764)

Remove redundant dimension from Optical Flow example. (#2762)

Fix 403 error when downloading MNIST dataset in PyTorch Lightning example (#2759)

Fix documentation instances of deprecated fn.image_decoder (#2754)

Shutdown executor when an error occurs in the executor itself, not in one of operators. (#2750)

Fix libcufile.so name to have *.0 suffix (#2735)

Fix test exclude pattern for Xavier (#2731)

Fix auto replacement of deprecated args for schema inheritance (#2733)

Fix constant input promotion for mixed backend. (#2726)

Fix type of slice's rel_shape argument (#2714)

Fix a regression in RandomObjectBBox: weights not set to default. (#2719)

Update TensorFlow ResNet50 example to work with the latest TF 2.4.x version (#2704)

Add auto generated docs files to .gitignore (#2711)

Update DALI PyTorch Lightning example to work with the newest Lightning (#2697)

Fix JPEG2K fused decoding (with ROI), add native tests for JP2k decoding (#2692)

Fix TL1_tensorflow-dali_test (#2687)

Remove unnecessary cuda runtime dependency from alloc.h (#2691)

Fix layout length check in Transpose. (#2693)

Replace eval with safer ast.literal_eval (#2690)

Fix .gpu usage detection and error for CPU only pipelines (#2682)

Add support for TensorFlow 2.4.1 in tests and for TF plugin (#2679)

Fix wrong early exit in function inside bundle-wheel.sh (#2675)

Fix apex compilation on Ubuntu 20.04 in TL1_ssd_training (#2671)

Fix cmake installation in TL1 for Ubuntu 20.04 (#2669)

Remove the split stages implementation of the hybrid image decoder (#2753)

Breaking API changes

There are no breaking changes in this DALI release.

Deprecated features

fn.audio_decoder / ops.AudioDecoder has been renamed to fn.decoders.audio / ops.decoders.Audio.

fn.image_decoder / ops.ImageDecoder has been renamed to fn.decoders.image / ops.decoders.Image.

fn.image_decoder_crop / ops.ImageDecoderCrop has been renamed to fn.decoders.image_crop / ops.decoders.ImageCrop.

fn.image_decoder_random_crop / ops.ImageDecoderRandomCrop has been renamed to fn.decoders.image_random_crop / ops.decoders.ImageRandomCrop.

fn.image_decoder_slice / ops.ImageDecoderSlice has been renamed to fn.decoders.image_slice / ops.decoders.ImageSlice.

fn.caffe2_reader / ops.Caffe2Reader has been renamed to fn.readers.caffe2 / ops.readers.Caffe2.

fn.caffe_reader / ops.CaffeReader has been renamed to fn.readers.caffe / ops.readers.Caffe.

fn.coco_reader / ops.CocoReader has been renamed to fn.readers.coco / ops.readers.Coco.

fn.file_reader / ops.FileReader has been renamed to fn.readers.file / ops.readers.File.

fn.mxnet_reader / ops.MXNetReader has been renamed to fn.readers.mxnet / ops.readers.MXNet.

fn.nemo_asr_reader / ops.NemoAsrReader has been renamed to fn.readers.nemo_asr / ops.readers.NemoAsr.

fn.numpy_reader / ops.NumpyReader has been renamed to fn.readers.numpy / ops.readers.Numpy.

fn.sequence_reader / ops.SequenceReader has been renamed to fn.readers.sequence / ops.readers.Sequence.

fn.tfrecord_reader / ops.TFRecordReader has been renamed to fn.readers.tfrecord / ops.readers.TFRecord.

fn.video_reader / ops.VideoReader has been renamed to fn.readers.video / ops.readers.Video.

fn.video_reader_resize / ops.VideoReaderResize has been renamed to fn.readers.video_resize / ops.readers.VideoResize.
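The fn.* renames listed above are mechanical, so existing pipeline code can be updated with a simple text substitution. The following sketch collects the sixteen functional-API renames from this list into a table and applies them to a source string. The RENAMED_FN table and the modernize helper are hypothetical names introduced here for illustration and are not shipped with DALI; the ops.* class renames would need an analogous table.

```python
import re

# Old fn.* name -> new fn.* name, copied from the list above.
RENAMED_FN = {
    "audio_decoder": "decoders.audio",
    "image_decoder": "decoders.image",
    "image_decoder_crop": "decoders.image_crop",
    "image_decoder_random_crop": "decoders.image_random_crop",
    "image_decoder_slice": "decoders.image_slice",
    "caffe2_reader": "readers.caffe2",
    "caffe_reader": "readers.caffe",
    "coco_reader": "readers.coco",
    "file_reader": "readers.file",
    "mxnet_reader": "readers.mxnet",
    "nemo_asr_reader": "readers.nemo_asr",
    "numpy_reader": "readers.numpy",
    "sequence_reader": "readers.sequence",
    "tfrecord_reader": "readers.tfrecord",
    "video_reader": "readers.video",
    "video_reader_resize": "readers.video_resize",
}

# Longest names first, so that e.g. image_decoder_crop is tried before image_decoder.
_OLD_NAMES = sorted(RENAMED_FN, key=len, reverse=True)
_PATTERN = re.compile(r"\bfn\.(" + "|".join(_OLD_NAMES) + r")\b")


def modernize(source):
    """Rewrite deprecated fn.* names in a source string to their new names."""
    return _PATTERN.sub(lambda m: "fn." + RENAMED_FN[m.group(1)], source)
```

Names that already use the new modules, such as fn.readers.file, do not match the pattern and are left untouched.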

Known issues:

The video loader operator requires a key frame at least every 10 to 15 frames of the video stream. If key frames occur less frequently, the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow is available on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to known issues with Meltdown/Spectre mitigations, DALI shows its best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==1.0.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==1.0.0

or for CUDA 11:

CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.0.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.0.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-1.0.0-2159051-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-1.0.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.0.0-2159930-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.0.0-2159930-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.0.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz

v0.31.0 (Feb 25, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

New operators:

GridMask CPU, which implements the GridMask data augmentation (https://arxiv.org/abs/2001.04086) and is useful for the EfficientNet pipeline (#2582).

ROIRandomCrop CPU, an operator required to perform biased random crops in segmentation applications (#2638).

Added support for the variable batch size in ExternalSource (#2481, #2641).

Added support for the time-major layout in the following spectrogram processing operators:

GPU and CPU Spectrogram (#2619, #2617)

GPU and CPU MelFilterBank (#2620)

Refactored and unified the following RNG operators:

Uniform (#2531)

CoinFlip (#2577)

Reworked the custom operators documentation (#2568).

Applied performance improvements in the JPEG decoder (#2655, #2610).
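As background for the GridMask operator mentioned above, the augmentation removes a regular grid of square regions from an image. The plain-Python sketch below builds such a mask and applies it to a small single-channel image. It is a simplified illustration, not the DALI operator: the parameter names tile and ratio and their meaning (tile is the grid period in pixels, ratio is the masked fraction of each tile side) are assumptions, and rotation and offsets of the grid are left out.

```python
def grid_mask(height, width, tile, ratio):
    """Return a height x width mask: 0 inside the dropped squares, 1 elsewhere."""
    square = int(tile * ratio)  # side of the dropped square within each tile
    return [
        [0 if (y % tile) < square and (x % tile) < square else 1
         for x in range(width)]
        for y in range(height)
    ]


def apply_mask(image, mask):
    """Zero out the masked pixels of a single-channel image given as nested lists."""
    return [
        [pixel * keep for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]
```

With tile=2 and ratio=0.5 on a 4x4 image, every second pixel of every second row is dropped, starting at the top-left corner.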

Fixed issues

Fixed the length that was reported by DALI FW iterators when the DROP policy is used (#2611).

Provided a workaround for a compiler problem that caused an Invalid device function error (#2656).

Fixed RandomBBoxCrop errors while using the crop_shape argument (#2605).

Improvements

Use pinned memory for staging buffer for HW nvJPEG decoder (#2655)

Find bounding boxes of multiple labels (#2650)

Add ROIRandomCrop operator (#2638)

Add FW iterators handling of variable batch size and improve ES examples (#2641)

Connected components (#2640)

Gridmask Cpu (#2582)

Iter-to-iter variable batch size (#2481)

Enable support for different layouts in the MelFilterBank (#2620)

Rework ops.random.CoinFlip (#2577)

Enable time-major layout in Spectrogram CPU (#2619)

Update clang format (#2524)

Improve Optical Flow error verbosity (#2618)

TF dataset tests rework (#2539)

Time major Spectrogram (GPU-only) (#2617)

Integrate RMM (#2609)

Propagate scalar in transform.scale (#2581)

Remove redundant JPEG decoder initialization from peeking shape function (#2610)

Rework ops.random.Uniform (#2531)

Rework custom operator docs (#2568)

Bug fixes

Workaround a compiler problem that caused Invalid device function error. (#2656)

Python fixes: argument inputs, external source, docs (#2646)

Fix SeparateQueuePolicy handling of the CPU stage (#2636)

Fix variable batch size for list of tensors. Make constants constant again. (#2637)

Fix Uniform discrete distribution (#2635)

Fix a double set of preserve schema arg and uninitialized var (#2632)

Add handling of empty inputs and tiny outputs in Resize op and Resampling kernels. (#2634)

Refactor functions that extract a range of samples from TLS and TLV. (#2628)

Fix RandomBBoxCrop errors while using crop_shape argument (#2605)

Update ResNet50 example to work with TensorFlow 2.x (#2537)

Keep reference to owner of data in Python Tensor and TensorList (#2606)

Enable nvJPEG2K for CUDA 11.2 builds (#2614)

Disable mmap based test for Xavier (#2612)

Fix length reported by DALI FW iterators when DROP policy is used (#2611)

Use smaller block in Warp (#2613)

Breaking API changes

Deprecated features

ops.Uniform was moved to ops.random.Uniform

ops.CoinFlip was moved to ops.random.CoinFlip
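For code bases that still use the old operator paths, the two renames above can be applied mechanically. The helper below is a hypothetical sketch in plain Python (it is not shipped with DALI) that rewrites the deprecated names in a source string and leaves already-migrated code untouched.

```python
import re

# Old operator path -> new path, as listed in the deprecation notes above.
RENAMED_OPS = {
    "ops.Uniform": "ops.random.Uniform",
    "ops.CoinFlip": "ops.random.CoinFlip",
}


def migrate_source(source: str) -> str:
    """Rewrite deprecated operator paths in a source string (illustrative)."""
    for old, new in RENAMED_OPS.items():
        # Match the old path only as a whole name, so identifiers such as
        # `myops.Uniform` or `ops.UniformNoise` are not rewritten.
        pattern = r"(?<!\w)" + re.escape(old) + r"(?!\w)"
        source = re.sub(pattern, new, source)
    return source
```

Because the new paths do not contain the old ones as substrings, running the helper twice gives the same result as running it once.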

Known issues:

The video loader operator requires that the key frames occur at a minimum every 10 to 15 frames of the video stream. If the key frames occur at a lesser frequency, then the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker
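The key-frame known issue above puts a bound on key-frame spacing. A pre-flight check can be sketched as follows, assuming the key-frame indices have already been extracted with an external tool. The helpers max_keyframe_gap and keyframes_ok are hypothetical and not part of DALI, and the default limit of 15 is simply the upper end of the range quoted above.

```python
from typing import Sequence


def max_keyframe_gap(keyframe_indices: Sequence[int]) -> int:
    """Largest distance, in frames, between consecutive key frames."""
    ordered = sorted(keyframe_indices)
    if len(ordered) < 2:
        raise ValueError("need at least two key frames to measure a gap")
    return max(b - a for a, b in zip(ordered, ordered[1:]))


def keyframes_ok(keyframe_indices: Sequence[int], limit: int = 15) -> bool:
    """True when no gap between key frames exceeds `limit` frames."""
    return max_keyframe_gap(keyframe_indices) <= limit
```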

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==0.31.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==0.31.0

or for CUDA 11:

The CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit but can run on the latest stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in the enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==0.31.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==0.31.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-0.31.0-2055431-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-0.31.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.31.0-2054952-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.31.0-2054952-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-0.31.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz

v0.30.0(Jan 27, 2021)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Optimized CPU resampling (#2540).

Added the following mathematical expressions:

Disallowed unwanted __bool__ conversions (#2538).

Added exp and log math functions (#2555).

Added the images argument for the COCOReader, which allows for the custom ordering of images and fixed a bug in the segmentation data parsing (#2548, #2597).

Added support for the nvJPEG preallocate API for a batched hardware decoder (#2544).

Added support for surfaces with strides over 2G (#2600).

Enabled CUDA 11.2 builds (#2553).

Documentation improvements:

Added a support matrix to the documentation (#2519).

Added a geometric transform tutorial (#2530).

Allowed DALI to be compiled with Clang (#2416).

Added CUDA API checks in utility functions (#2517) and tests (#2516).
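The `__bool__` restriction listed above guards against a common mistake with lazily evaluated expressions: an expression object describes a computation and has no truth value while the graph is being built. The idea can be sketched in plain Python. The Expr class here is hypothetical and far simpler than DALI's arithmetic expressions.

```python
class Expr:
    """A node in a lazily evaluated expression graph (illustrative only)."""

    def __init__(self, description: str):
        self.description = description

    def __add__(self, other: "Expr") -> "Expr":
        return Expr(f"({self.description} + {other.description})")

    def __gt__(self, other: "Expr") -> "Expr":
        # Comparisons build a new node instead of returning True or False.
        return Expr(f"({self.description} > {other.description})")

    def __bool__(self) -> bool:
        # Without this override, `if a > b:` would silently treat the
        # node as True, because Python objects are truthy by default.
        raise TypeError(
            "an expression has no truth value at graph-construction time"
        )
```

With this in place, writing `if a > b:` on two expression nodes fails loudly with a TypeError rather than always taking the true branch.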

Fixed issues

Fixed the autoreset option in the iterator for the DROP policy (#2567).

Improvements

Make Nvjpeg2kTest more verbose (#2509)

Compile DALI with Clang (#2416)

Try to actually find the library instead of arbitrarily deciding it can't be there (#2511)

Enable GDS for conda build by default (#2515)

Pool memory resource (#2518)

Add GTest Event Listener with CUDA validation after TEST (#2516)

Disable GPU numpy reader test for sm < 6.0 (#2514)

Mention WarpAffine in transforms.* documentation (#2527)

Ops rework to prepare iter-to-iter batch size variability (#2408)

Fix unchecked CUDA API calls in utility functions (#2517)

Bump up nvidia-tensorflow version in tests (#2526)

Cleanup warnings in CUDA code (#2523)

Add debug info to RN50 pipeline (#2522)

Add a supported matrix to the documentation (#2519)

Add ArgValue utility (#2528)

Remove pinning numpy version in TL1_ssd_training test (#2536)

Remove unreachable return statement (#2541)

Vectorize CPU resampling (#2540)

Remove constraint on input type for RandomResizedCrop. Update tests. (#2549)

Hide ArithmeticGenericOp doc and disallow bool (#2538)

Support for nvJPEG preallocate API for batched HW decoder (#2544)

Add exp and log math functions (#2555)

Add COCOReader files arg support and fix bug in the segmentation data parsing (#2548)

Event pool (#2520)

Rework random number generators. RNGBase operator template and NormalDistribution. (#2513)

Enable CUDA 11.2 builds (#2553)

Adjust range of tested log inputs (#2564)

Add geometric transform tutorial. (#2530)

Add synchronization after randomizer construction. (#2565)

Move to the upstream version of paddle paddle (#2561)

Move examples to fn api (#2566)

Remove legacy API based nvJPEG decoder implementation (#2591)

Support surfaces with strides over 2G (#2600)

COCOReader images argument can be used to provide a custom order of images (#2597)

Bug fixes

Fix build for Jetson platform (#2512)

Fix aarch64 build errors (#2529)

Fix broken uniform operator python tests (#2556)

Fix Clang build (#2560)

Fix Xavier test crash caused by NumPy faulty build (#2596)

Fix autoreset option in iterator for DROP policy (#2567)

Fix uniform distribution test expectations (#2589)

Breaking API changes

Deprecated features

Known issues:

The video loader operator requires that the key frames occur at a minimum every 10 to 15 frames of the video stream. If the key frames occur at a lesser frequency, then the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==0.30.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==0.30.0

or for CUDA 11:

The CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit but can run on the latest stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in the enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==0.30.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==0.30.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-0.30.0-1983576-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-0.30.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.30.0-1983575-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.30.0-1983575-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-0.30.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz

v0.29.0(Dec 30, 2020)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

New operators:

NumpyReader GPU Operator with the support of GPU Direct Storage (#2477)

NvJpeg2K decoding was enabled in ImageDecoder operator (#2501)

segmentation.RandomMaskPixel operator for creating random masks containing foreground pixels (#2445)

OneHot for GPU (#2436)

Move all NVTX infrastructure into core and create DALI domain (#2472)
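For readers unfamiliar with the OneHot operator mentioned above, one-hot encoding turns a class index into a vector with a single distinguished entry. A minimal sketch in plain Python follows. The function and its on_value and off_value parameters are illustrative assumptions, not a statement of the operator's signature.

```python
from typing import List


def one_hot(label: int, num_classes: int,
            on_value: float = 1.0, off_value: float = 0.0) -> List[float]:
    """Encode `label` as a vector of length `num_classes` (sketch)."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    encoded = [off_value] * num_classes
    encoded[label] = on_value
    return encoded
```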

New Examples:

Add mask processing to COCO Reader with Augmentations example (#2426)

Add reductions example (#2457)

Example of random_mask_pixel to perform biased random crop (#2474)

Update ExternalSource framework examples (#2482)

Operator Improvements:

Pad: Add support for per-sample shape and alignment requirements (#2432)

RandomResizedCrop: enable channel-first and video support + add tests (#2430)

PythonFunction Operator: support for output layouts (#2486)

Optimize the DCT GPU kernel. (#2471)

COCOReader: Support for uncompressed RLE masks (#2478)

transforms.Rotation to accept scalar inputs (#2494)

Move to CUDA 11.1 update 1 (#2419)
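The Pad improvement above mentions per-sample shape and alignment requirements. The arithmetic can be sketched as follows, under the assumption that an alignment requirement rounds each extent up to the next multiple of the given value and that a minimum shape acts as a lower bound. The helper padded_shape is hypothetical and is not the operator itself.

```python
from typing import Sequence, Tuple


def padded_shape(shape: Sequence[int],
                 align: Sequence[int],
                 min_shape: Sequence[int]) -> Tuple[int, ...]:
    """Output shape after applying a lower bound and an alignment (sketch)."""
    if not (len(shape) == len(align) == len(min_shape)):
        raise ValueError("all arguments must have the same rank")
    result = []
    for extent, a, lower in zip(shape, align, min_shape):
        if a <= 0:
            raise ValueError("alignment must be positive")
        target = max(extent, lower)
        # Round up to the next multiple of `a`; a == 1 means no alignment.
        result.append(-(-target // a) * a)
    return tuple(result)
```

Because the arguments are passed per call, each sample in a batch can carry its own requirements, which is the per-sample aspect the note refers to.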

Fixed issues

NumpyReader: Replace std::regex with a custom implementation to fix ABI incompatibility issues (#2489)

Fix the dimensionality of labels in SSDRandomCrop. (#2488)

Improvements

Move to CUDA 11.1 update 1 (#2419)

RandomResizedCrop: enable channel-first and video support + add tests (#2430)

Pad operator: Add support for per-sample shape and alignment requirements (#2432)

Update clang to 10.0 (#2424)

Add mask processing to COCO Reader with Augmentations example (#2426)

Make custom nvJPEG allocator return a relevant allocation status (#2438)

Make the custom nvJPEG allocator not throw and return only the status (#2443)

Add SearchableRLEMask utility (#2441)

Add GPU support to OneHot operator (#2436)

Reduce axes names (#2425)

Remove CUDA headers and generate stubs in runtime (#2420)

TensorVector update for iter-to-iter variable batch size (#2435)

Fix build with all options off, relax libclang required version (#2455)

Add support for UINT8 and INT8 outputs in CMN + scale and shift arguments (#2458)

COCOReader: Parse RLE masks only when pixelwise masks are requested (#2462)

Add reductions example (#2457)

Enables direct linking with libcuda.so instead of dlopen (#2459)

Add segmentation.RandomMaskPixel operator (#2445)

Skips the building of prebuilt DALI package for nvidia-tensorflow (#2451)

Pad to square tests (#2442)

Enable compile time generation of dynlink wrappers for nvml (#2463)

Deprecate squeeze_labels option from MXNet iterator and enhance .squeeze function to match numpy style interface (#2450)

Hide hidden ops and improve Enum docs quality (#2470)

Enforce uniform rank and type of the outputs read by CPU DataReader. (#2476)

Move all NVTX infrastructure into core and create DALI domain (#2472)

MXNet Iterator: Revert to squeeze_labels=True behavior by default (#2479)

Example of random_mask_pixel to perform biased random crop (#2474)

Update DALI dependency (#2483)

Update ExternalSource framework examples (#2482)

Optimize the DCT GPU kernel. (#2471)

Support the output layouts in the PythonFunction Operator (#2486)

transforms.Rotation to accept scalar inputs (#2494)

Rework tutorials general (#2480)

Add support for GPU based numpy reader (#2477)

Per sample ExternalSource (#2469)

Use atol instead of rtol (#2499)

Lifts the restriction and enables enable_frame_num and enable_timestamps for filenames (#2468)

Reenable nvJPEG2000 (#2501)

Disables GDS for the default build configuration (#2502)

COCOReader: Support for uncompressed RLE masks (#2478)

Memory manager - interfaces, utilities, monotonic resources, malloc resource (#2497)

Update Jetson compilation guide (#2508)

Makes sure that cuFile and nvJPEG2k are not possible to set when not supported (#2510)
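The uncompressed RLE masks mentioned in the COCOReader entries above can be decoded with a few lines of code. The sketch below assumes the COCO convention, in which the counts alternate between runs of background and foreground pixels, start with background, and traverse the mask in column-major order. The function name is hypothetical and this is not the reader's implementation.

```python
from typing import List, Sequence


def decode_uncompressed_rle(counts: Sequence[int],
                            height: int, width: int) -> List[List[int]]:
    """Decode alternating run lengths (zeros first) into a height x width mask."""
    if sum(counts) != height * width:
        raise ValueError("run lengths do not cover the mask")
    flat: List[int] = []
    value = 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    # Pixels are stored column by column, so index as column * height + row.
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]
```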

Bug fixes

Fix seed in RandomResizedCrop test. (#2437)

QNX build fix (#2440)

Fix lack of proper loading of best_prec1 from the checkpoint (#2466)

Fix the dimensionality of labels in SSDRandomCrop. (#2488)

NumpyReader : Replace std::regex with custom implementation (#2489)

Fix CPU only mode in C API (#2496)

Fix bugs reported by static analysis (#2491)

Fix typo in STYLE_GUIDE.md (#2503)

Fix NVJPEG2K_ENABLED test macros (#2504)

Breaking API changes

Deprecated features

Deprecate squeeze_labels option from MXNet iterator and enhance .squeeze function to match numpy style interface (#2450)

Known issues:

The video loader operator requires that the key frames occur at a minimum every 10 to 15 frames of the video stream. If the key frames occur at a lesser frequency, then the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==0.29.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==0.29.0

or for CUDA 11:

The CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit but can run on the latest stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in the enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==0.29.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==0.29.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-0.29.0-1852439-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-0.29.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.29.0-1852440-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.29.0-1852440-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-0.29.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz

v0.28.0(Nov 30, 2020)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

New operators:

Affine transform generators, which are operators that generate scale, rotate, shear, translate, crop transform matrices (#2309).

You can use the transforms.Combine operator to combine these matrices (#2317).

These transformations can be applied to data by using the CoordTransform operator.

Added min, max, and clamp arithmetic operators (#2298).

Cat and Stack Operators to concatenate and stack Tensors for the CPU and the GPU (#2301, #2339, #2350).

The following reductions for the CPU and the GPU (#2342, #2379, #2395):

Min, Max, Sum, Mean, MeanSquare, RootMeanSquare, Std, Variance

The MFCC operator for the GPU (#2423).

The SelectMasks operator (#2381).

Add operators for batch reordering:

BatchPermutation for generating random reordering of the batch.

PermuteBatch, which reorders tensors in a batch, based on a list of provided indices (#2417).

Operator Compose: PyTorch-style API to compose the operators (#2393).
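The affine transform generators, transforms.Combine and CoordTransform described above fit together as "build matrices, multiply them, apply the result to points". The sketch below shows that flow in plain Python with 3x3 homogeneous matrices for 2D points. All function names are hypothetical, and the convention that the first argument of combine is applied first is an assumption of this sketch rather than a statement about DALI.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def scale(sx: float, sy: float) -> Matrix:
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]


def translate(dx: float, dy: float) -> Matrix:
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


def rotate(angle_deg: float) -> Matrix:
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def combine(*transforms: Matrix) -> Matrix:
    """Compose transforms so that the first argument is applied first."""
    result = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for t in transforms:
        # Left-multiply the accumulated matrix: result = t @ result.
        result = [[sum(t[i][k] * result[k][j] for k in range(3))
                   for j in range(3)] for i in range(3)]
    return result


def apply(m: Matrix, point: Sequence[float]) -> Tuple[float, float]:
    """Apply a homogeneous 2D affine matrix to an (x, y) point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

Combining the matrices once and applying the product is equivalent to applying each transform in turn, but it touches every point only once.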

Improvements in existing operators:

Added SeekFrames to the audio decoder. The redesign allows you to decide the decoded data type at runtime (#2334).

Added the ability to handle UTF8 text to the NemoAsrReader (#2358).

Added explicit file list support to the FileReader (#2389).

Improvements in the COCO reader API (#2406).

The COCOReader API now outputs relative mask polygon coordinates when the option ratio is set to True (#2375).

RandomBBoxCrop now optionally outputs the indices of the bounding boxes that passed the centroid filter (#2374).

The late initialization of torch_gpu_device in the PyTorch plugin (#2411).

The automatic constant-to-input promotion (#2361) and generalized handling of operator arguments (#2393).

Added a MNIST example for DALI and PyTorch Lightning (#2360).

Added the last_batch_policy to the framework iterator (#2269).
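The ratio behaviour described above for the COCOReader mask polygons amounts to expressing coordinates as fractions of the image size. A minimal sketch, assuming polygon vertices are (x, y) pairs in pixels; to_relative is a hypothetical helper, not part of the reader.

```python
from typing import List, Sequence, Tuple


def to_relative(vertices: Sequence[Tuple[float, float]],
                image_width: int,
                image_height: int) -> List[Tuple[float, float]]:
    """Convert pixel coordinates to fractions of the image size (sketch)."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    return [(x / image_width, y / image_height) for x, y in vertices]
```

Relative coordinates stay valid when the image is later resized, which is why they are convenient in augmentation pipelines.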

New builds:

Python 3.9 is now enabled (#2333).

The DALI wheels for CUDA 11 are built with CUDA 11.1 and use Enhanced Compatibility to work with CUDA 11.0 (#2302, #2356, #2367, and #2413).

Added support for the SM_86 architecture (#2364).

Added the ability to cross-build Python wheels for Jetson (#2313).

Bug fixes

Fix error when VideoReader is prematurely terminated (#2336)

Fix failure in affine transforms tests (#2337)

Fix the problem of output outliving the pipeline in python (#2341)

Fix lack of proper layout setting in the VideoReader (#2346)

Fix uniform generator operator (#2352)

Bugfixes: Default nfft value and to_snake_case implementation (#2353)

Fixes problems in the weekly build (#2372)

Fix a problem with reference to "incomplete" type (error in Clang/CUDA). (#2377)

Fix how DALI handles StopIteration from the ExternalSource (#2373)

Fix TL1_nodeps_build and TL0_cpu_only (#2391)

Fix CPU only mode for arithm operators (#2400)

Preserve shape of pseudoscalars in arithmetic ops. (#2359)

Improvements

Add affine transform generators: TransformScale, TransformRotation, TransformShear, TransformCrop (#2309)

Change code/docs language to be more inclusive (#2322)

Update nvidia-tensorflow test package to 20.9 and bump tensorflow-gpu minor versions (#2320)

Update example usage of DALIClassificationIterator in docs strings (#2306)

Reduce video reader memory consumption (#2308)

TensorJoin kernel for CPU (#2301)

Enable automatic python modules for operator (#2329)

Split GaussianBlur Python test (#2332)

Add CombineTransforms operator (#2317)

Append TensorListShapes (#2291)

Enable CUDA 11.1 builds (#2302)

Add min, max and clamp arithmetic ops (#2298)

Update TensorFlow plugin documentation (#2328)

Remove Python 3.5 support, enable Python 3.9 (#2333)

Enable nvJPEG2k build for CUDA 11.1 (#2343)

Add BUILD_DALI_NODEPS to allow building dali_core and dali_kernels without extra third party libraries present in the system (#2321)

Add SeekFrames to audio decoder. Redesign to allow deciding decoded data type at runtime. (#2334)

Add discrete mode to Uniform operator (#2340)

Test for utility CMake function (find_dali) (#2325)

Propagate new build options to other build utilities (#2349)

Add support for N-dim tensors to OneHot (#2345)

Adds a separate option to preallocate nvjPEG2k memory (#2347)

Tensor join GPU (#2339)

Reductions: min, max (#2342)

Tensor concatenation and stacking (#2350)

Use inverse (source-to-destination) matrix in WarpAffine operator (#2338)

Disable more dependencies for nodeps build (#2355)

Update DALI trademark information (#2351)

Reduce GPU memory fraction in TF tests to 0.5. (#2357)

Automatic constant-to-input promotion. (#2361)

Add support for SM_86 architecture (#2364)

Use current class next implementation in init, to avoid special handling of first batch in child classes (#2363)

Add ability to cross-build Python wheels for Jetson (#2313)

Add NemoAsrReader handling of UTF8 text (#2358)

Enable CUDA 11 compatibility mode (#2356)

Add MNIST example for DALI and PyTorch Lightning (#2360)

Add last_batch_policy to the framework iterator (#2269)

COCOReader to output relative mask polygon coordinates when the option ratio is set to True (#2375)

RandomBBoxCrop to optionally output the indices of the bounding boxes that passed the centroid filter (#2374)

Enable compatibility layer in tests for CUDA 11 (#2367)

Reduce Sum Op (#2379)

Install DALI license, copyright and acknowledgments explicitly (#2392)

Add layout support to OneHot operator (#2388)

Generalized handling of operator arguments + operator Compose. (#2393)

GPU DCT kernel (#2398)

Bump up Nvidia TF version to 20.10 (#2397)

More reductions (#2395)

Late initialization of torch_gpu_device in pytorch plugin (#2411)

Add a link to CUDA Enhanced Compatibility Across Minor Releases guide (#2410)

Add explicit file list support to FileReader. (#2389)

Add TransformTranslation deprecation placeholder Op (#2412)

Bump up the CuPy to one that supports CUDA 11.0 (#2413)

Add a missing include in filesystem.cc (#2414)

Add a warning about the Python function incompatibility with TensorFlow (#2415)

Improvements in COCO reader API (#2406)

Add operators for batch reordering (#2417)

Add SelectMasks operator (#2381)

GPU MFCC operator. (#2423)

Make base image for dockers customizable at the build time (#2427)
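The batch reordering operators listed above pair naturally: one produces a random permutation of sample indices, and the other reorders a batch according to a list of indices. A minimal sketch of that pairing in plain Python follows. The function names are hypothetical, and the seed parameter is an assumption added to make the example reproducible.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def batch_permutation(batch_size: int, seed: Optional[int] = None) -> List[int]:
    """Random reordering of the indices 0 .. batch_size - 1."""
    indices = list(range(batch_size))
    random.Random(seed).shuffle(indices)
    return indices


def permute_batch(batch: Sequence[T], indices: Sequence[int]) -> List[T]:
    """Output sample i is input sample indices[i]."""
    return [batch[i] for i in indices]
```

Applying the same index list to several batches (for example images and their labels) keeps the samples paired after shuffling.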

Breaking API changes

Python 3.5 is no longer supported by the official DALI wheels.

Deprecated feature

Known issues:

The video loader operator requires that the key frames occur at a minimum every 10 to 15 frames of the video stream. If the key frames occur at a lesser frequency, then the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==0.28.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==0.28.0

or for CUDA 11:

The CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit but can run on the latest stable CUDA 11.0 capable drivers (450.80 or later). Using the latest driver may enable additional functionality. More details can be found in the enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==0.28.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==0.28.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-0.28.0-1761993-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-0.28.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.28.0-1758882-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.28.0-1758882-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-0.28.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz

v0.27.0(Oct 29, 2020)
Key Features and Enhancements

This DALI release includes the following key features and enhancements.

New operators:

CoordTransform Operator for applying a linear transformation to points or vectors (#2288)

GaussianBlur Gpu Operator (#2314, #2311, #2254)

Nemo ASR Reader (#2234)

Resize 3D - operator can now process 3D inputs (#2226)

Add Translate affine transform generator (#2297) - in the next release it will be moved to a dedicated module.

Use true scalars (except in classification readers) - 0-dim Tensors represent scalar values (#2318)

Adjust documentation after review (#2175)

Support for ZSTD compression for TIFF files (#2273)

Support for Run-Length Encodings and Pixelwise Masks in COCO Reader (#2248)

Support more types in Lookup table (#2290)
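The move to true scalars above distinguishes a 0-dimensional tensor from a 1-element vector. Both hold exactly one value, but they differ in rank. The sketch below illustrates the distinction with shapes written as tuples; rank and volume are hypothetical helpers.

```python
import math
from typing import Tuple


def rank(shape: Tuple[int, ...]) -> int:
    """Number of dimensions; 0 for a true scalar."""
    return len(shape)


def volume(shape: Tuple[int, ...]) -> int:
    """Number of elements; the empty product is 1, so a scalar has one."""
    return math.prod(shape)
```

A shape of () and a shape of (1,) therefore both have volume 1, yet only the former has rank 0, which matters to code that indexes or broadcasts by dimension.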

Bug fixes

Fixes crash in RandomBBoxCrop when no labels are provided (#2265)

Fix minor issues reported by static analysis (#2276)

Fix detection pipeline test on Ampere (#2304)

Fix BUILD_LIBSND=OFF build (#2316)

Fix build for LMDB disabled (#2319)

Improvements

Update build and test deps to the latest version (#2250)

Resize 3D + resize tests (#2226)

Allow passing values <= 0 in the file list to allow more flexible frame indexing (#2264)

Extend host decoder to support jpeg2000 (#2270)

Add file_list argument support to the Numpy reader operator (#2274)

Allow Slice to silently assume absolute anchor and shape when those are represented by an integer (#2282)

TransformPoints kernel (#2287)

Add inline to LookaheadParser methods (#2289)

Add deprecation handling in backend (#2279)

Support more types in Lookup table (#2290)

Adjust documentation after review (#2175)

Transform points op (#2288)

Support for ZSTD compression for TIFF files (#2273)

Support for Run-Length Encodings and Pixelwise Masks in COCO Reader (#2248)

Extract a DecodeAudio implementation from Audio decoder operator (#2294)

Extend test_RN50_data_pipeline.py test (#2295)

Add ConvolutionGPU kernel based on CUTLASS (#2254)

Add Translate affine transform generator (#2297)

Add *.cuh and *.inl to list of headers to bundle (#2307)

Add Nemo ASR reader (#2234)

Add SeparableConvolutionGPU kernel (#2311)

Add GaussianBlur Gpu Operator (#2314)

Use true scalars (except in classification readers) + bug fixes (#2318)

Add nvjpeg2k support to GPU Image Decoder. Extend nvjpeg memory pool to support nvjpeg2k allocators.

Adds a separate option to preallocate nvjPEG2k memory (#2347)

Disable nvJPEG2K support by default for now because of some decoding problems

Breaking API changes

Deprecated feature

Known issues:

The video loader operator requires that the key frames occur at a minimum every 10 to 15 frames of the video stream. If the key frames occur at a lesser frequency, then the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==0.27.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==0.27.0

or for CUDA 11:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==0.27.0

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==0.27.0

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-0.27.0-1699645-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-0.27.0.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.27.0-1699648-py3-none-manylinux2014_aarch64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.27.0-1699648-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-0.27.0.tar.gz

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz

v0.25.1(Sep 11, 2020)
Key Features and Enhancements

This is a patch release that contains only fixes.

Bug fixes

Fixed a crash that occurred when DALI CUDA 11 runs on pre 450.x driver with the compatibility layer (#2208, #2230).
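The fix above concerns drivers older than the 450 series. A simple version gate can be sketched as follows, assuming the driver version is available as a dotted string such as "440.33.01". The helper names are hypothetical, and obtaining the version string is left to the caller.

```python
def driver_major(version: str) -> int:
    """Major component of a dotted driver version string."""
    return int(version.strip().split(".")[0])


def is_pre_450(version: str, minimum_major: int = 450) -> bool:
    """True when the driver is older than the 450 series named above."""
    return driver_major(version) < minimum_major
```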

Known issues

The video loader operator requires that the key frames occur at a minimum every 10 to 15 frames of the video stream. If the key frames occur at a lesser frequency, then the returned frames may be out of sync.

The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)

Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when run in Docker with escalated privileges, for example:

privileged=yes in Extra Settings for AWS data points

--privileged or --security-opt seccomp=unconfined for bare Docker

Binary builds

Install via pip for CUDA 10:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==0.25.1
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==0.25.1

or for CUDA 11:

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==0.25.1
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==0.25.1
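The CUDA 10 and CUDA 11 install commands differ only in the package suffix (`cuda100` or `cuda110`). As an illustration, the sketch below generates both commands from a version and a suffix; the helper name `dali_pip_commands` is hypothetical, and the package naming pattern is taken from the commands in these release notes, so it may not hold for other DALI releases.

```python
from typing import List

# Extra index URL as given in the install commands of these release notes.
NVIDIA_INDEX = "https://developer.download.nvidia.com/compute/redist/"


def dali_pip_commands(dali_version: str, cuda_tag: str) -> List[str]:
    """Return the pip commands for DALI and its TensorFlow plugin.

    cuda_tag is the package suffix, e.g. "100" for CUDA 10 or "110" for CUDA 11.
    """
    packages = [
        f"nvidia-dali-cuda{cuda_tag}",
        f"nvidia-dali-tf-plugin-cuda{cuda_tag}",
    ]
    return [
        f"pip install --extra-index-url {NVIDIA_INDEX} {pkg}=={dali_version}"
        for pkg in packages
    ]


if __name__ == "__main__":
    for command in dali_pip_commands("0.25.1", "110"):
        print(command)
```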

Or use direct download links (CUDA 10.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda100/nvidia_dali_cuda100-0.25.1-1612464-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda100/nvidia-dali-tf-plugin-cuda100-0.25.1.tar.gz

Or use direct download links (CUDA 11.0):

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.25.1-1612461-py3-none-manylinux2014_x86_64.whl

https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-0.25.1.tar.gz

SBSA aarch64 CUDA 11.0 direct download link:

https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-0.25.1-1612461-py3-none-manylinux2014_aarch64.whl

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz


Owner

NVIDIA Corporation

GitHub Repository https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/index.html

cuDF - GPU DataFrame Library

cuDF - GPU DataFrames NOTE: For the latest stable README.md ensure you are on the main branch. Resources cuDF Reference Documentation: Python API refe

5.2k Jan 08, 2023

Python interface to GPU-powered libraries

Package Description scikit-cuda provides Python interfaces to many of the functions in the CUDA device/runtime, CUBLAS, CUFFT, and CUSOLVER libraries

924 Dec 26, 2022

Python 3 Bindings for NVML library. Get NVIDIA GPU status inside your program.

py3nvml Documentation also available at readthedocs. Python 3 compatible bindings to the NVIDIA Management Library. Can be used to query the state of

212 Jan 04, 2023

Conda package for artifact creation that enables offline environments. Ideal for air-gapped deployments.

Conda-Vendor Conda Vendor is a tool to create local conda channels and manifests for vendored deployments Installation To install with pip, run: pip i

13 Nov 17, 2022

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.

A lightweight, GPU accelerated, SQL engine built on the RAPIDS.ai ecosystem. Get Started on app.blazingsql.com Getting Started | Documentation | Examp

1.8k Jan 02, 2023

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Introduction This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code her

6.9k Dec 28, 2022

Library for faster pinned CPU <-> GPU transfer in Pytorch

SpeedTorch Faster pinned CPU tensor - GPU Pytorch variable transfer and GPU tensor - GPU Pytorch variable transfer, in certain cases. Update 9-29-1

657 Dec 19, 2022

CUDA integration for Python, plus shiny features

PyCUDA lets you access Nvidia's CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist-so what's so special about P

1.4k Jan 02, 2023

cuSignal - RAPIDS Signal Processing Library

cuSignal The RAPIDS cuSignal project leverages CuPy, Numba, and the RAPIDS ecosystem for GPU accelerated signal processing. In some cases, cuSignal is

646 Dec 30, 2022

ArrayFire: a general purpose GPU library.

ArrayFire is a general-purpose library that simplifies the process of developing software that targets parallel and massively-parallel architectures i

4k Dec 29, 2022

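QPT - Quick packaging tool: a tool for quickly packaging Python environments

QPT - Quick packaging tool. GitHub home page | Gitee home page. QPT is a multi-purpose packaging tool that can "simulate" a development environment. A single command packages an ordinary Python script into an EXE executable, and it can also easily pull in deep learning acceleration libraries such as CUDA, reproducing your development environment as closely as possible on the user's machine.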

545 Dec 28, 2022

📊 A simple command-line utility for querying and monitoring GPU status

gpustat Just less than nvidia-smi? NOTE: This works with NVIDIA Graphics Devices only, no AMD support as of now. Contributions are welcome! Self-Promo

3.2k Jan 04, 2023

Python 3 Bindings for the NVIDIA Management Library

====== pyNVML ====== *** Patched to support Python 3 (and Python 2) *** ------------------------------------------------ Python bindings to the NVID

95 Jan 01, 2023

General purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing usecases.

Vulkan Kompute The general purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabl

1k Dec 26, 2022
