Unofficial PyTorch Implementation of WaveGrad 2

Last update: Nov 29, 2022

Overview

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Unofficial PyTorch + Lightning implementation of WaveGrad 2 by Chen et al. (JHU, Google Brain).
Audio Samples: https://mindslab-ai.github.io/wavegrad2/

TODO

More training for WaveGrad-Base setup
Checkpoint release
WaveGrad-Large Decoder
Inference by reduced sampling steps

Requirements

PyTorch
PyTorch-Lightning==1.2.10
The full list of requirements is in requirements.txt.
We also provide a Docker setup in Dockerfile.

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
etc.

We take LJSpeech as an example hereafter.

Preprocessing

Adjust preprocess.yaml, especially the path section.

path:
  corpus_path: '/DATA1/LJSpeech-1.1' # LJSpeech corpus path
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  raw_path: './raw_data/LJSpeech'
  preprocessed_path: './preprocessed_data/LJSpeech'
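
Before running the preprocessing scripts, it can help to confirm that the configured input paths exist. The sketch below is not part of the repository: `missing_inputs` is a hypothetical helper, and it takes the path section as a plain dict (in practice you would first load preprocess.yaml with a YAML parser). It assumes that corpus_path and lexicon_path are inputs that must already exist, while raw_path and preprocessed_path are outputs written later.

```python
from pathlib import Path

# Assumption: these two keys name inputs that must exist before preprocessing;
# raw_path and preprocessed_path are treated as outputs written later.
INPUT_KEYS = ("corpus_path", "lexicon_path")


def missing_inputs(path_cfg):
    """Return the input keys whose configured path does not exist on disk."""
    return [key for key in INPUT_KEYS if not Path(path_cfg[key]).exists()]


if __name__ == "__main__":
    # Same values as the path section above, written as a plain dict.
    path_cfg = {
        "corpus_path": "/DATA1/LJSpeech-1.1",
        "lexicon_path": "lexicon/librispeech-lexicon.txt",
        "raw_path": "./raw_data/LJSpeech",
        "preprocessed_path": "./preprocessed_data/LJSpeech",
    }
    for key in missing_inputs(path_cfg):
        print(f"{key} not found: {path_cfg[key]}")
```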

Run prepare_align.py to prepare the corpus for alignment.

python prepare_align.py -c preprocess.yaml

Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech and AISHELL-3 datasets are provided here. Unzip the files into preprocessed_data/LJSpeech/TextGrid/.
After that, run preprocess.py.

python preprocess.py -c preprocess.yaml
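
A quick way to confirm that the alignments landed in the right place is to count the .TextGrid files before running preprocess.py. This is a minimal sketch, not part of the repository: `count_textgrids` is a hypothetical helper, and the recursive search assumes the archive may unpack into subdirectories.

```python
from pathlib import Path


def count_textgrids(textgrid_dir):
    """Count .TextGrid files under textgrid_dir, searching subdirectories too."""
    root = Path(textgrid_dir)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob("*.TextGrid"))


if __name__ == "__main__":
    n = count_textgrids("preprocessed_data/LJSpeech/TextGrid")
    # LJSpeech has 13100 clips, so a much smaller count suggests a bad unzip.
    print(f"found {n} TextGrid files")
```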

Alternatively, you can align the corpus yourself.
Download the official MFA package and run it to align the corpus.

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

Then run preprocess.py.

python preprocess.py -c preprocess.yaml

Training

Adjust hparameter.yaml, especially the train section.

train:
  batch_size: 12 # Dependent on GPU memory size
  adam:
    lr: 3e-4
    weight_decay: 1e-6
  decay:
    rate: 0.05
    start: 25000
    end: 100000
  num_workers: 16 # Dependent on CPU cores
  gpus: 2 # number of GPUs
  loss_rate:
    dur: 1.0
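
The decay block above does not say how its three values combine. One plausible reading, assumed here rather than confirmed by this README, is an exponential decay: the learning-rate multiplier stays at 1 until step `start`, falls smoothly to `rate` at step `end`, and stays there afterwards. `lr_multiplier` is a hypothetical name.

```python
def lr_multiplier(step, rate=0.05, start=25000, end=100000):
    """Multiplier applied to the base learning rate at a given training step.

    Assumed schedule: rate ** progress, where progress rises linearly
    from 0 at `start` to 1 at `end` and is clamped outside that range.
    """
    progress = (step - start) / (end - start)
    progress = min(max(progress, 0.0), 1.0)
    return rate ** progress


if __name__ == "__main__":
    base_lr = 3e-4  # train.adam.lr from the config above
    for step in (0, 25000, 62500, 100000, 200000):
        print(step, base_lr * lr_multiplier(step))
```

A function of this shape could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned value at each step.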

If you want to train with another dataset, adjust the data section in hparameter.yaml.

data:
  lang: 'eng'
  text_cleaners: ['english_cleaners'] # korean_cleaners, english_cleaners, chinese_cleaners
  speakers: ['LJSpeech']
  train_dir: 'preprocessed_data/LJSpeech'
  train_meta: 'train.txt'  # relative path of metadata file from train_dir
  val_dir: 'preprocessed_data/LJSpeech'
  val_meta: 'val.txt'  # relative path of metadata file from val_dir
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
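
Per the comments in the data section, train_meta and val_meta are resolved relative to train_dir and val_dir. The sketch below spells that out, which can be handy when pointing the config at another dataset. `metadata_paths` is a hypothetical helper, and the dict stands in for the parsed YAML.

```python
import os


def metadata_paths(data_cfg):
    """Return (train, val) metadata file paths, each joined onto its directory."""
    train = os.path.join(data_cfg["train_dir"], data_cfg["train_meta"])
    val = os.path.join(data_cfg["val_dir"], data_cfg["val_meta"])
    return train, val


if __name__ == "__main__":
    data_cfg = {
        "train_dir": "preprocessed_data/LJSpeech",
        "train_meta": "train.txt",
        "val_dir": "preprocessed_data/LJSpeech",
        "val_meta": "val.txt",
    }
    print(metadata_paths(data_cfg))
```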

Run trainer.py.

python trainer.py

To resume training from a checkpoint, check the parser arguments.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-r', '--resume_from', type=int,
                    required=False, help="Resume Checkpoint epoch number")
parser.add_argument('-s', '--restart', action="store_true",
                    required=False, help="Significant change occurred, use this")
parser.add_argument('-e', '--ema', action="store_true",
                    required=False, help="Start from ema checkpoint")
args = parser.parse_args()
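
To see how these flags parse, the sketch below rebuilds the same options and feeds them an explicit argument list instead of reading the command line. The epoch number 20 is an arbitrary example value, and `build_parser` is a hypothetical wrapper that does not exist in the repository.

```python
import argparse


def build_parser():
    """Recreate the three resume-related options shown above."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-r', '--resume_from', type=int, required=False,
                        help="Resume Checkpoint epoch number")
    parser.add_argument('-s', '--restart', action="store_true",
                        help="Significant change occurred, use this")
    parser.add_argument('-e', '--ema', action="store_true",
                        help="Start from ema checkpoint")
    return parser


if __name__ == "__main__":
    # Equivalent to the command line: python trainer.py -r 20 -e
    args = build_parser().parse_args(['-r', '20', '-e'])
    print(args.resume_from, args.restart, args.ema)  # prints: 20 False True
```

With no arguments, resume_from is None and both flags are False.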

During training, the TensorBoard logger records the loss, spectrograms, and audio.

tensorboard --logdir=./tensorboard --bind_all

Inference

Run inference.py.

python inference.py -c <checkpoint_path> --text <'text'>

Or you can run inference.ipynb.

Checkpoint file will be released!

Note

Since this repo is an unofficial implementation and the WaveGrad 2 paper leaves several details unspecified, slight differences from the paper may exist.
The modifications and arbitrary choices are listed below.

Normal LSTM without ZoneOut is applied for encoder.
g2p_en is applied instead of Google's undisclosed G2P.
Trained with the LJSpeech dataset instead of Google's proprietary dataset.
- Due to dataset replacement, output audio's sampling rate becomes 22.05kHz instead of 24kHz.
MT + SpecAug are not implemented.
Hyperparameters:
- train.batch_size: 12 for 2 A100 (40GB) GPUs
- train.adam.lr: 3e-4 and train.adam.weight_decay: 1e-6
- train.decay: learning rate decay is applied during training
- train.loss_rate: 1 as total_loss = 1 * L1_loss + 1 * duration_loss
- ddpm.ddpm_noise_schedule: torch.linspace(1e-6, 0.01, hparams.ddpm.max_step)
- encoder.channel is reduced to 512 from 1024 or 2048
Current sample page only contains samples from WaveGrad-Base decoder.
Items in the TODO section are not yet done.
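
The ddpm.ddpm_noise_schedule entry above defines a linear sequence of noise variances (betas). In the DDPM formulation of Ho et al., these determine how much of the clean signal survives at each step through the running product of (1 - beta). The sketch below reproduces that arithmetic in plain Python so it runs without PyTorch. `linspace` is a stand-in for torch.linspace, `noise_levels` is a hypothetical helper, and max_step = 1000 is an assumed value, since this README does not state hparams.ddpm.max_step.

```python
import math


def linspace(start, end, steps):
    """Evenly spaced values from start to end inclusive, like torch.linspace."""
    if steps == 1:
        return [start]
    width = (end - start) / (steps - 1)
    return [start + i * width for i in range(steps)]


def noise_levels(betas):
    """Return sqrt of the running product of (1 - beta) for each step."""
    levels, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        levels.append(math.sqrt(alpha_bar))
    return levels


if __name__ == "__main__":
    max_step = 1000  # assumed; not given in this README
    betas = linspace(1e-6, 0.01, max_step)
    levels = noise_levels(betas)
    # A noisy sample at step t is levels[t] * clean + sqrt(1 - levels[t]**2) * noise.
    print(f"first level {levels[0]:.6f}, last level {levels[-1]:.6f}")
```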

Tree

.
├── Dockerfile
├── README.md
├── dataloader.py
├── docs
│   ├── spec.png
│   ├── tb.png
│   └── tblogger.png
├── hparameter.yaml
├── inference.py
├── lexicon
│   ├── librispeech-lexicon.txt
│   └── pinyin-lexicon-r.txt
├── lightning_model.py
├── model
│   ├── base.py
│   ├── downsampling.py
│   ├── encoder.py
│   ├── gaussian_upsampling.py
│   ├── interpolation.py
│   ├── layers.py
│   ├── linear_modulation.py
│   ├── nn.py
│   ├── resampling.py
│   ├── upsampling.py
│   └── window.py
├── prepare_align.py
├── preprocess.py
├── preprocess.yaml
├── preprocessor
│   ├── ljspeech.py
│   └── preprocessor.py
├── text
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── trainer.py
├── utils
│   ├── mel.py
│   ├── stft.py
│   ├── tblogger.py
│   └── utils.py
└── wavegrad2_tester.ipynb

Author

This code is implemented by

Seungu Han at MINDs Lab
Junhyeok Lee at MINDs Lab

Special thanks to

Kang-wook Kim at MINDs Lab
Wonbin Jung at MINDs Lab
Sang Hoon Woo at MINDs Lab

References

Chen et al., WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Chen et al., WaveGrad: Estimating Gradients for Waveform Generation
Ho et al., Denoising Diffusion Probabilistic Models
Shen et al., Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

This implementation uses code from the following repositories:

The webpage for the audio samples uses a template from:

WaveGrad2 Official Github.io

The audio samples on our webpage (TBD) are partially derived from:

LJSpeech: a single-speaker English dataset consisting of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
WaveGrad2 Official Github.io
