DiffWave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech through iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. a log-scaled Mel spectrogram). The model and architecture details are described in "DiffWave: A Versatile Diffusion Model for Audio Synthesis".
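
The iterative refinement described above can be sketched as a DDPM-style reverse loop. This is a toy illustration under stated assumptions, not this repository's code: predict_noise is a hypothetical stand-in for the trained network, the linear 50-step noise schedule is chosen only for the example, and the conditioning signal is omitted for brevity.

```python
import math
import random


def linear_schedule(steps=50, beta_start=1e-4, beta_end=0.05):
    """Assumed linear noise schedule; a real model defines its own values."""
    if steps == 1:
        return [beta_start]
    return [beta_start + (beta_end - beta_start) * i / (steps - 1) for i in range(steps)]


def reverse_diffusion(predict_noise, num_samples, betas, rng=None):
    """Refine Gaussian noise into a waveform with a DDPM-style reverse process.

    predict_noise(audio, step) is a hypothetical callable that returns the
    estimated noise for every sample at the given diffusion step.
    """
    rng = rng or random.Random()
    alphas = [1.0 - b for b in betas]
    alpha_cum = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_cum.append(running)

    # Start from pure Gaussian noise.
    audio = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

    for n in range(len(betas) - 1, -1, -1):
        c1 = 1.0 / math.sqrt(alphas[n])
        c2 = betas[n] / math.sqrt(1.0 - alpha_cum[n])
        eps = predict_noise(audio, n)
        # Remove the noise the model attributes to this step.
        audio = [c1 * (x - c2 * e) for x, e in zip(audio, eps)]
        if n > 0:
            # Re-inject a smaller amount of noise on every step except the last.
            sigma = math.sqrt((1.0 - alpha_cum[n - 1]) / (1.0 - alpha_cum[n]) * betas[n])
            audio = [x + sigma * rng.gauss(0.0, 1.0) for x in audio]
        # Keep samples inside the valid waveform range.
        audio = [max(-1.0, min(1.0, x)) for x in audio]
    return audio


# Dummy predictor so the sketch runs without a trained model.
waveform = reverse_diffusion(lambda audio, step: [0.0] * len(audio), 16, linear_schedule())
```

With a trained network in place of the dummy predictor, each pass through the loop removes part of the noise, which is why fewer steps mean faster synthesis.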

What's new (2021-11-09)

unconditional waveform synthesis (thanks to Andrechang!)

What's new (2021-04-01)

fast sampling algorithm based on v3 of the DiffWave paper

What's new (2020-10-14)

new pretrained model trained for 1M steps
updated audio samples with output from new model

Status (2021-11-09)

Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.

Audio samples

22.05 kHz audio samples

Pretrained models

22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)
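
The published SHA256 digest can be used to check that a download is intact. Below is a minimal sketch using only the Python standard library; the helper names and the checkpoint path in the usage note are hypothetical.

```python
import hashlib

# Digest published above for the 22.05 kHz pretrained model.
EXPECTED_SHA256 = "d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of_file(path) == expected.lower()
```

For example, verify_checkpoint('/path/to/downloaded/checkpoint') should return True for an uncorrupted download of the model listed above.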

This pretrained model synthesizes speech with a real-time factor of 0.87 (smaller is faster).
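
As a worked example of what this number means, assume the conventional definition of real-time factor: synthesis time divided by the duration of the audio produced. The source only states that smaller is faster, so the definition and the helper names below are assumptions.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate=22050):
    """Synthesis time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds, rtf=0.87):
    """Approximate wall-clock time needed to synthesize audio_seconds of speech."""
    return audio_seconds * rtf
```

Under that definition, a factor of 0.87 means 10 seconds of speech take roughly 8.7 seconds to generate, which is slightly faster than real time.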

Pretrained model details

trained on 4x 1080Ti
default parameters
single precision floating point (FP32)
trained on LJSpeech dataset excluding LJ001* and LJ002*
trained for 1000578 steps (1273 epochs)

Install

Install using pip:

pip install diffwave

or from GitHub:

git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .

Training

Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.
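
A quick way to confirm that a dataset meets these requirements before preprocessing is to inspect each file's header. The sketch below uses Python's standard wave module; the function names are hypothetical and not part of this package, and wave raises wave.Error for files that are not plain PCM.

```python
import wave
from pathlib import Path


def check_wav(path, expected_rate=22050):
    """Return a list of problems found in one .wav file (empty if it looks fine)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels (expected mono)")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples (expected 16-bit)")
        if wav.getframerate() != expected_rate:
            problems.append(f"{wav.getframerate()} Hz (expected {expected_rate} Hz)")
    return problems


def check_dataset(root, expected_rate=22050):
    """Map each problematic .wav file under root to its list of problems."""
    report = {}
    for path in sorted(Path(root).rglob("*.wav")):
        problems = check_wav(path, expected_rate)
        if problems:
            report[str(path)] = problems
    return report
```

An empty dictionary from check_dataset('/path/to/dir/containing/wavs') means every file it found is 16-bit mono at the expected sample rate. Once the files pass, preprocessing and training are run with: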

python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs

# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all

You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).

Multi-GPU training

By default, this implementation trains in parallel on all GPUs reported by torch.cuda.device_count(). You can restrict which GPUs are used by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.

Inference API

Basic usage:

from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # replace with a spectrogram tensor in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N,T] format.
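
To listen to the result, the samples need to be moved off the GPU and written to a file. The sketch below is one way to do that with only the Python standard library; write_wav is a hypothetical helper, and it assumes the samples are floats in the range [-1, 1].

```python
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write float samples in [-1, 1] as a 16-bit mono PCM .wav file."""
    pcm = bytearray()
    for x in samples:
        # Clip out-of-range values, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(x)))
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(int(sample_rate))
        wav.writeframes(bytes(pcm))
```

For the first item in the batch this could be called as write_wav('output.wav', audio[0].cpu().tolist(), sample_rate). If torchaudio is installed, torchaudio.save is a more direct alternative.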

Inference CLI

python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav
