DiffWave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g., a log-scaled Mel spectrogram). The model and architecture details are described in the paper "DiffWave: A Versatile Diffusion Model for Audio Synthesis".
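To make "iterative refinement" concrete, here is a minimal sketch of a DDPM-style reverse loop in plain Python. It is not this repository's implementation: the linear noise schedule values are assumptions, and `oracle_noise` is a hypothetical stand-in for the trained network that cheats by knowing the clean signal, so the loop can be checked end to end without a model.

```python
import math
import random


def linear_schedule(beta_start=1e-4, beta_end=0.05, steps=50):
    """Evenly spaced noise levels beta_1..beta_T (the values are assumptions)."""
    return [beta_start + (beta_end - beta_start) * i / (steps - 1) for i in range(steps)]


def reverse_diffusion(predict_noise, length, betas, rng):
    """Start from Gaussian noise and refine it one step at a time."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a  # cumulative product of the alphas
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # pure noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t, alpha_bars[t])
        c1 = 1.0 / math.sqrt(alphas[t])
        c2 = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Remove the predicted noise component for this step.
        x = [c1 * (xi - c2 * ei) for xi, ei in zip(x, eps)]
        if t > 0:
            # Every step except the last re-injects a smaller amount of noise.
            sigma = math.sqrt((1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


# Hypothetical stand-in for the trained network: it knows the clean signal
# and returns the exact noise that separates it from the current sample.
target = [math.sin(2 * math.pi * 440 * n / 22050) for n in range(64)]


def oracle_noise(x, t, alpha_bar):
    scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(xi - scale * ti) / noise_scale for xi, ti in zip(x, target)]


audio = reverse_diffusion(oracle_noise, len(target), linear_schedule(), random.Random(0))
max_error = max(abs(a - b) for a, b in zip(audio, target))
```

With a perfect noise predictor the final step recovers the clean signal exactly (up to floating-point error). A trained model only approximates that prediction, and in the conditional setting it also receives the spectrogram as input.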

What's new (2021-11-09)

unconditional waveform synthesis (thanks to Andrechang!)

What's new (2021-04-01)

fast sampling algorithm based on v3 of the DiffWave paper

What's new (2020-10-14)

new pretrained model trained for 1M steps
updated audio samples with output from new model

Status (2021-11-09)

Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.

Audio samples

22.05 kHz audio samples

Pretrained models

22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)

This pretrained model synthesizes speech with a real-time factor of 0.87 (smaller is faster).
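The real-time factor is not defined here; the sketch below assumes the common definition (synthesis time divided by the duration of the audio produced). The timing numbers are hypothetical and chosen only to reproduce a value of 0.87.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate=22050):
    """Ratio of compute time to audio duration; below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Hypothetical measurement: 8.7 s of compute for 10 s of 22.05 kHz audio.
rtf = real_time_factor(synthesis_seconds=8.7, num_samples=220500)
```

Under this definition, an RTF of 0.87 means that one second of audio takes about 0.87 seconds to generate.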

Pretrained model details

trained on 4x 1080Ti
default parameters
single precision floating point (FP32)
trained on LJSpeech dataset excluding LJ001* and LJ002*
trained for 1000578 steps (1273 epochs)

Install

Install using pip:

pip install diffwave

or from GitHub:

git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .

Training

Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.

python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs

# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all

You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).
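Before preprocessing, it can help to confirm that a dataset meets the requirements described above (16-bit, mono, 22.05 kHz by default). The helper below is a sketch using only the Python standard library; `find_incompatible_wavs` is a hypothetical name and is not part of the diffwave package. The demo at the end writes two small files into a temporary directory to exercise it.

```python
import tempfile
import wave
from pathlib import Path


def find_incompatible_wavs(root, sample_rate=22050):
    """Return (path, issues) for every .wav under root that is not 16-bit mono at sample_rate."""
    problems = []
    for path in sorted(Path(root).rglob("*.wav")):
        issues = []
        try:
            with wave.open(str(path), "rb") as wav:
                if wav.getnchannels() != 1:
                    issues.append(f"{wav.getnchannels()} channels (expected mono)")
                if wav.getsampwidth() != 2:
                    issues.append(f"{8 * wav.getsampwidth()}-bit samples (expected 16-bit)")
                if wav.getframerate() != sample_rate:
                    issues.append(f"{wav.getframerate()} Hz (expected {sample_rate} Hz)")
        except wave.Error as err:
            # The wave module only reads PCM files; report anything else.
            issues.append(f"unreadable as PCM WAV: {err}")
        if issues:
            problems.append((path, issues))
    return problems


# Demo: one compliant file and one stereo 16 kHz file.
with tempfile.TemporaryDirectory() as tmp:
    def write_silence(name, channels, rate):
        with wave.open(str(Path(tmp) / name), "wb") as wav:
            wav.setnchannels(channels)
            wav.setsampwidth(2)
            wav.setframerate(rate)
            wav.writeframes(b"\x00\x00" * channels * 100)

    write_silence("ok.wav", channels=1, rate=22050)
    write_silence("stereo_16k.wav", channels=2, rate=16000)
    report = [(p.name, issues) for p, issues in find_incompatible_wavs(tmp)]
```

If you change the sample rate in params.py, pass the same value as `sample_rate` so the check matches your configuration.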

Multi-GPU training

By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
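As a sketch of restricting training to specific GPUs, the helper below builds the training command from the Training section together with an environment that sets CUDA_VISIBLE_DEVICES, the standard CUDA variable that limits which devices a process (and therefore torch.cuda.device_count()) can see. The function names are hypothetical and the GPU indices are examples.

```python
import os
import subprocess
import sys


def build_training_launch(model_dir, data_dir, gpu_ids):
    """Return (command, env) for a training run limited to the given GPU indices."""
    env = dict(os.environ)
    # The variable must be set before the training process initializes CUDA.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    command = [sys.executable, "-m", "diffwave", model_dir, data_dir]
    return command, env


def launch_training(model_dir, data_dir, gpu_ids):
    """Start training in a child process (not called in this example)."""
    command, env = build_training_launch(model_dir, data_dir, gpu_ids)
    return subprocess.run(command, env=env, check=True)


command, env = build_training_launch(
    "/path/to/model/dir", "/path/to/dir/containing/wavs", gpu_ids=[0, 2]
)
```

From a shell, the equivalent is to prefix the training command with the variable assignment, for example CUDA_VISIBLE_DEVICES=0,2.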

Inference API

Basic usage:

from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # placeholder: supply a spectrogram tensor in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N,T] format.
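To save the result from Python without extra dependencies, one option is the standard-library wave module. This is a sketch, not the package's own saving code: `write_pcm16_wav` is a hypothetical helper, and it assumes the samples are floats in the range [-1, 1] (values outside that range are clipped). The demo writes a synthetic tone rather than model output.

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write a sequence of floats in [-1, 1] as a 16-bit mono WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Demo with one second of a synthetic 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
out_path = os.path.join(tempfile.mkdtemp(), "output.wav")
write_pcm16_wav(out_path, tone, 22050)

with wave.open(out_path, "rb") as wav:
    written = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate(), wav.getnframes())
```

With the variables from the basic usage above, the first item in the batch could be saved with write_pcm16_wav('output.wav', audio[0].cpu().tolist(), sample_rate).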

Inference CLI

python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav


