WarpRNNT loss ported in Numba CPU/CUDA for Pytorch

Overview

RNNT loss in Pytorch - Numba JIT compiled (warprnnt_numba) Test-CPU

Warp RNN Transducer Loss for ASR in Pytorch, ported from HawkAaron/warp-transducer and a replica of the stable version in NVIDIA Neural Module repository (NVIDIA NeMo).

NOTE: The code here will have experimental extensions and may be potentially unstable, use the version in NeMo for long term supported loss version of RNNT for PyTorch.

Supported Features

Currently supports :

  1. WarpRNNT loss in pytorch for CPU / CUDA (jit compiled)
  2. FastEmit
  3. Gradient Clipping (from Torch Audio)

Installation

You will need PyTorch (usually the latest version should be used), plus installation of Numba in a Conda environment (pip only environment is untested but may work).

# Follow installation instructions to install pytorch from website (with cuda if required)
conda install -c conda-force numba or conda update -c conda-forge numba (to get latest version)

# Then install this library
pip install --upgrade git+https://github.com/titu1994/warprnnt_numba.git

Usage

Import warprnnt_numba and use RNNTLossNumba. If attempting to use CUDA version of loss, it is advisable to test that your installed CUDA version is compatible with numba version using numba_utils.

There is also included a very slow numpy/pytorch explicit-loop based loss implementation for verification of exact correct results.

import torch
import numpy as np
import warprnnt_numba

# Define the loss function
fastemit_lambda = 0.001  # any float >= 0.0
loss_pt = warprnnt_numba.RNNTLossNumba(blank=4, reduction='sum', fastemit_lambda=fastemit_lambda)

# --------------
# Example usage

device = "cuda"
torch.random.manual_seed(0)

# Assume Batchsize=2, Acoustic Timesteps = 8, Label Timesteps = 5 (including BLANK=BOS token),
# and Vocabulary size of 5 tokens (including RNNT BLANK)
acts = torch.randn(2, 8, 5, 5, device=device, requires_grad=True)
sequence_length = torch.tensor([5, 8], dtype=torch.int32,
                               device=device)  # acoustic sequence length. One element must be == acts.shape[1].

# Let 0 be MASK/PAD value, 1-3 be token ids, and 4 represent RNNT BLANK token
# The BLANK token is overloaded for BOS token as well here, but can be different token.
# Let first sample be padded with 0 (actual length = 3). Loss is computed according to supplied `label_lengths`.
# and gradients for the 4th index onwards (0 based indexing).
labels = torch.tensor([[4, 1, 1, 3, 0], [4, 2, 2, 3, 1]], dtype=torch.int32, device=device)
label_lengths = torch.tensor([3, 4], dtype=torch.int32,
                             device=device)  # Lengths here must be WITHOUT the BOS token.

# If on CUDA, log_softmax is computed internally efficiently (preserving memory and speed)
# Compute it explicitly for CPU, this is done automatically for you inside forward() of the loss.
# -1-th vocab index is RNNT blank token here.
loss_func = warprnnt_numba.RNNTLossNumba(blank=4, reduction='none',
                                         fastemit_lambda=0.0, clamp=0.0)
loss = loss_func(acts, labels, sequence_length, label_lengths)
print("Loss :", loss)
loss.sum().backward()

# When parsing the gradients, look at grads[0] -
# Since it was padded in T (sequence_length=5 < T=8), there are gradients only for grads[0, :5, :, :].
# Since it was padded in U (label_lengths=3+1 < U=5), there are gradeints only for grads[0, :5, :3+1, :].
grads = acts.grad
print("Gradients of activations :")
print(grads)

Tests

Tests will perform CPU only checks if there are no GPUs. If GPUs are present, will run all tests once for cuda:0 as well.

pytest tests/

Requirements

  • pytorch >= 1.10. Older versions might work, not tested.
  • numba - Minimum required version is 0.53.0, preferred is 0.54+.
You might also like...
This Repo is the official CUDA implementation of ICCV 2019 Oral paper for CARAFE: Content-Aware ReAssembly of FEatures

Introduction This Repo is the official CUDA implementation of ICCV 2019 Oral paper for CARAFE: Content-Aware ReAssembly of FEatures. @inproceedings{Wa

Example repository for custom C++/CUDA operators for TorchScript

Custom TorchScript Operators Example This repository contains examples for writing, compiling and using custom TorchScript operators. See here for the

Convert Python 3 code to CUDA code.

Py2CUDA Convert python code to CUDA. Usage To convert a python file say named py_file.py to CUDA, run python generate_cuda.py --file py_file.py --arch

This demo showcase the use of onnxruntime-rs with a GPU on CUDA 11 to run Bert in a data pipeline with Rust.

Demo BERT ONNX pipeline written in rust This demo showcase the use of onnxruntime-rs with a GPU on CUDA 11 to run Bert in a data pipeline with Rust. R

LightSeq is a high performance training and inference library for sequence processing and generation implemented in CUDA
CUDA Python Low-level Bindings

CUDA Python Low-level Bindings

A dead simple python wrapper for darknet that works with OpenCV 4.1, CUDA 10.1

What Dead simple python wrapper for Yolo V3 using AlexyAB's darknet fork. Works with CUDA 10.1 and OpenCV 4.1 or later (I use OpenCV master as of Jun

Prevent `CUDA error: out of memory` in just 1 line of code.
Prevent `CUDA error: out of memory` in just 1 line of code.

🐨 Koila Koila solves CUDA error: out of memory error painlessly. Fix it with just one line of code, and forget it. 🚀 Features 🙅 Prevents CUDA error

An addernet CUDA version

Training addernet accelerated by CUDA Usage cd adder_cuda python setup.py install cd .. python main.py Environment pytorch 1.10.0 CUDA 11.3 benchmark

Comments
  • GPU under utilization due to low occupancy.

    GPU under utilization due to low occupancy.

    Thank you for the warprnnt_numba, I got the warnning (show blow) when I use this loss in my code. 1650880807(1) Is this known issue? How can it be debugged and solved?

    Thank you!

    opened by jiay7 2
  • Fix runtime speed

    Fix runtime speed

    Improve runtime speed of numba loss

    • Fix issue with data movement of costs tensor from llForward to pytorch data view in numba
    • This alone costs a linear loop (scaling with batch size) that is roughly 10x the kernel costs themselves.
    • Fix by writing a small kernel to copy the data and update the costs.
    opened by titu1994 0
Releases(v0.4.0)
  • v0.4.0(Jan 30, 2022)

    Supports

    • Simple RNNT loss with Atomic Locks implementation

    Improvements

    • Improve runtime speed of numba loss
      • Fix issue with data movement of costs tensor from llForward to pytorch data view in numba
      • This alone costs a linear loop (scaling with batch size) that is roughly 10x the kernel costs themselves.
      • Fix by writing a small kernel to copy the data and update the costs.
    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(Jan 24, 2022)

    Initial release of Warp RNNT loss with Numba JIT compile (CPU/CUDA)

    Supports:

    1. Pytorch RNNT loss (CPU and JIT compiled CUDA)
    2. FastEmit
    3. Gradient clipping
    Source code(tar.gz)
    Source code(zip)
Owner
Somshubra Majumdar
Interested in Machine Learning, Deep Learning and Data Science in general
Somshubra Majumdar
Code for “ACE-HGNN: Adaptive Curvature ExplorationHyperbolic Graph Neural Network”

ACE-HGNN: Adaptive Curvature Exploration Hyperbolic Graph Neural Network This repository is the implementation of ACE-HGNN in PyTorch. Environment pyt

9 Nov 28, 2022
SpiroMask: Measuring Lung Function Using Consumer-Grade Masks

SpiroMask: Measuring Lung Function Using Consumer-Grade Masks Anonymised repository for paper submitted for peer review at ACM HEALTH (October 2021).

0 May 10, 2022
Unsupervised Image to Image Translation with Generative Adversarial Networks

Unsupervised Image to Image Translation with Generative Adversarial Networks Paper: Unsupervised Image to Image Translation with Generative Adversaria

Hao 71 Oct 30, 2022
This's an implementation of deepmind Visual Interaction Networks paper using pytorch

Visual-Interaction-Networks An implementation of Deepmind visual interaction networks in Pytorch. Introduction For the purpose of understanding the ch

Mahmoud Gamal Salem 166 Dec 06, 2022
harmonic-percussive-residual separation algorithm wrapped as a VST3 plugin (iPlug2)

Harmonic-percussive-residual separation plug-in This work is a study on the plausibility of a sines-transients-noise decomposition inspired algorithm

Derp Learning 9 Sep 01, 2022
PyTorch Implementation of Backbone of PicoDet

PicoDet-Backbone PyTorch Implementation of Backbone of PicoDet Original Implementation is implemented on PaddlePaddle. Example picodet_l_backbone = ES

Yonghye Kwon 7 Jul 12, 2022
A TensorFlow implementation of the Mnemonic Descent Method.

MDM A Tensorflow implementation of the Mnemonic Descent Method. Mnemonic Descent Method: A recurrent process applied for end-to-end face alignment G.

123 Oct 07, 2022
Official code of Team Yao at Multi-Modal-Fact-Verification-2022

Official code of Team Yao at Multi-Modal-Fact-Verification-2022 A Multi-Modal Fact Verification dataset released as part of the De-Factify workshop in

Wei-Yao Wang 11 Nov 15, 2022
NeurIPS 2021, "Fine Samples for Learning with Noisy Labels"

[Official] FINE Samples for Learning with Noisy Labels This repository is the official implementation of "FINE Samples for Learning with Noisy Labels"

mythbuster 27 Dec 23, 2022
This is a Tensorflow implementation of Learning to See in the Dark in CVPR 2018

Learning-to-See-in-the-Dark This is a Tensorflow implementation of Learning to See in the Dark in CVPR 2018, by Chen Chen, Qifeng Chen, Jia Xu, and Vl

5.3k Jan 01, 2023
Regularized Frank-Wolfe for Dense CRFs: Generalizing Mean Field and Beyond

CRF - Conditional Random Fields A library for dense conditional random fields (CRFs). This is the official accompanying code for the paper Regularized

Đ.Khuê Lê-Huu 21 Nov 26, 2022
PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages

PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages Abstract NLP applications for code-mixed (CM) or mix-li

Mohsin Ali, Mohammed 1 Nov 12, 2021
Near-Optimal Sparse Allreduce for Distributed Deep Learning (published in PPoPP'22)

Near-Optimal Sparse Allreduce for Distributed Deep Learning (published in PPoPP'22) Ok-Topk is a scheme for distributed training with sparse gradients

Shigang Li 9 Oct 29, 2022
Build tensorflow keras model pipelines in a single line of code. Created by Ram Seshadri. Collaborators welcome. Permission granted upon request.

deep_autoviml Build keras pipelines and models in a single line of code! Table of Contents Motivation How it works Technology Install Usage API Image

AutoViz and Auto_ViML 102 Dec 17, 2022
The official implementation of CSG-Stump: A Learning Friendly CSG-Like Representation for Interpretable Shape Parsing

CSGStumpNet The official implementation of CSG-Stump: A Learning Friendly CSG-Like Representation for Interpretable Shape Parsing Paper | Project page

Daxuan 39 Dec 26, 2022
From Canonical Correlation Analysis to Self-supervised Graph Neural Networks

Code for CCA-SSG model proposed in the NeurIPS 2021 paper From Canonical Correlation Analysis to Self-supervised Graph Neural Networks.

Hengrui Zhang 44 Nov 27, 2022
A PyTorch library and evaluation platform for end-to-end compression research

CompressAI CompressAI (compress-ay) is a PyTorch library and evaluation platform for end-to-end compression research. CompressAI currently provides: c

InterDigital 680 Jan 06, 2023
Implements Gradient Centralization and allows it to use as a Python package in TensorFlow

Gradient Centralization TensorFlow This Python package implements Gradient Centralization in TensorFlow, a simple and effective optimization technique

Rishit Dagli 101 Nov 01, 2022
Matlab Python Heuristic Battery Opt - SMOP conversion and manual conversion

SMOP is Small Matlab and Octave to Python compiler. SMOP translates matlab to py

Tom Xu 1 Jan 12, 2022
Official implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis https://arxiv.org/abs/2011.13775

CIPS -- Official Pytorch Implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis Requirements pip install -r requi

Multimodal Lab @ Samsung AI Center Moscow 201 Dec 21, 2022