Unofficial PyTorch Implementation of WaveGrad2

Last update: Nov 29, 2022

Overview

WaveGrad 2 — Unofficial PyTorch Implementation

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Unofficial PyTorch + Lightning implementation of WaveGrad 2 by Chen et al. (JHU, Google Brain).
Audio Samples: https://mindslab-ai.github.io/wavegrad2/

TODO

More training for WaveGrad-Base setup
Checkpoint release
WaveGrad-Large Decoder
Inference by reduced sampling steps

Requirements

PyTorch
Pytorch-Lightning==1.2.10
The full requirements are listed in requirements.txt.
We also provide a Docker setup in Dockerfile.

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
etc.

We take LJSpeech as an example hereafter.

Preprocessing

Adjust preprocess.yaml, especially the path section.

path:
  corpus_path: '/DATA1/LJSpeech-1.1' # LJSpeech corpus path
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  raw_path: './raw_data/LJSpeech'
  preprocessed_path: './preprocessed_data/LJSpeech'

Run prepare_align.py to prepare the corpus for alignment.

python prepare_align.py -c preprocess.yaml

Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech and AISHELL-3 datasets are provided here. You have to unzip the files in preprocessed_data/LJSpeech/TextGrid/.
After that, run preprocess.py.

python preprocess.py -c preprocess.yaml
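Before running preprocess.py, it can help to confirm that the alignment files landed where the preprocessor expects them. The following is a minimal sketch and not part of this repo: the helper name count_textgrids is hypothetical, and it assumes the alignments are .TextGrid files somewhere under preprocessed_data/LJSpeech/TextGrid/, possibly nested in per-speaker subfolders.

```python
from pathlib import Path


def count_textgrids(preprocessed_path):
    """Count .TextGrid files under <preprocessed_path>/TextGrid (hypothetical helper)."""
    textgrid_dir = Path(preprocessed_path) / "TextGrid"
    if not textgrid_dir.is_dir():
        return 0
    # rglob also finds files nested in per-speaker subfolders.
    return sum(1 for _ in textgrid_dir.rglob("*.TextGrid"))


if __name__ == "__main__":
    n = count_textgrids("./preprocessed_data/LJSpeech")
    print(f"Found {n} TextGrid files")
```

A count of 0 most likely means the archive was unzipped into the wrong directory.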

Alternatively, you can align the corpus yourself.
Download the official MFA package and run it to align the corpus.

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

And then run preprocess.py.

python preprocess.py -c preprocess.yaml

Training

Adjust hparameter.yaml, especially the train section.

train:
  batch_size: 12 # Dependent on GPU memory size
  adam:
    lr: 3e-4
    weight_decay: 1e-6
  decay:
    rate: 0.05
    start: 25000
    end: 100000
  num_workers: 16 # Dependent on CPU cores
  gpus: 2 # number of GPUs
  loss_rate:
    dur: 1.0
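The decay block gives a rate and a start/end step, but this README does not say how they are combined. One plausible reading, shown here purely as an assumption, is that the learning rate stays at adam.lr until step start, then shrinks exponentially until it reaches rate * lr at step end. The function name decayed_lr is hypothetical; check lightning_model.py for the actual schedule.

```python
def decayed_lr(step, base_lr=3e-4, rate=0.05, start=25000, end=100000):
    """Assumed schedule: base_lr * rate ** progress, with progress clamped to [0, 1]."""
    progress = (step - start) / (end - start)
    progress = min(max(progress, 0.0), 1.0)
    return base_lr * rate ** progress


# Print the assumed learning rate before, during, and after the decay window.
for step in (0, 25000, 62500, 100000, 200000):
    print(step, decayed_lr(step))
```

Under this assumption the learning rate ends at 0.05 * 3e-4 = 1.5e-5 and stays there after step 100000.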

If you want to train with another dataset, adjust the data section in hparameter.yaml.

data:
  lang: 'eng'
  text_cleaners: ['english_cleaners'] # korean_cleaners, english_cleaners, chinese_cleaners
  speakers: ['LJSpeech']
  train_dir: 'preprocessed_data/LJSpeech'
  train_meta: 'train.txt'  # relative path of metadata file from train_dir
  val_dir: 'preprocessed_data/LJSpeech'
  val_meta: 'val.txt'  # relative path of metadata file from val_dir
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
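
The comments above say that train_meta and val_meta are resolved relative to train_dir and val_dir. A minimal sketch of that resolution, using a plain dict that mirrors the data section (the helper name meta_path is hypothetical, and the repo's own loader may differ):

```python
import os

# Mirrors the relevant keys of the data section in hparameter.yaml.
data = {
    "train_dir": "preprocessed_data/LJSpeech",
    "train_meta": "train.txt",
    "val_dir": "preprocessed_data/LJSpeech",
    "val_meta": "val.txt",
}


def meta_path(data_cfg, split):
    """Join <split>_dir and <split>_meta into one metadata file path."""
    return os.path.join(data_cfg[f"{split}_dir"], data_cfg[f"{split}_meta"])


print(meta_path(data, "train"))
print(meta_path(data, "val"))
```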

Run trainer.py.

python trainer.py

If you want to resume training from a checkpoint, see the argument parser in trainer.py.

import argparse

parser = argparse.ArgumentParser()
# Epoch number of the checkpoint to resume from.
parser.add_argument('-r', '--resume_from', type=int,
                    required=False, help="Resume Checkpoint epoch number")
parser.add_argument('-s', '--restart', action="store_true",
                    required=False, help="Significant change occurred, use this")
parser.add_argument('-e', '--ema', action="store_true",
                    required=False, help="Start from ema checkpoint")
args = parser.parse_args()
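
To see what these flags produce, the same parser can be fed an explicit argument list. This sketch rebuilds the parser from the snippet above inside a hypothetical build_parser function and parses a sample command line; the epoch number 10 is only an illustrative value.

```python
import argparse


def build_parser():
    """Rebuild the trainer's parser (wrapper function is hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-r', '--resume_from', type=int, required=False,
                        help="Resume Checkpoint epoch number")
    parser.add_argument('-s', '--restart', action="store_true",
                        help="Significant change occurred, use this")
    parser.add_argument('-e', '--ema', action="store_true",
                        help="Start from ema checkpoint")
    return parser


# Equivalent to: python trainer.py -r 10 -e
args = build_parser().parse_args(['-r', '10', '-e'])
print(args.resume_from, args.restart, args.ema)  # 10 False True
```

With no arguments, resume_from is None and both flags are False, so training starts from scratch.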

During training, the TensorBoard logger records the loss, spectrograms, and audio.

tensorboard --logdir=./tensorboard --bind_all

Inference

Run inference.py.

python inference.py -c <checkpoint_path> --text <'text'>

Or you can run inference.ipynb.

Checkpoint file will be released!

Note

Since this repo is an unofficial implementation and the WaveGrad 2 paper omits several details, there may be slight differences from the paper.
Modifications and arbitrary setups are listed below.

Normal LSTM without ZoneOut is applied for encoder.
g2p_en is applied instead of Google's unknown G2P.
Trained with the LJSpeech dataset instead of Google's proprietary dataset.
- Due to dataset replacement, output audio's sampling rate becomes 22.05kHz instead of 24kHz.
MT + SpecAug are not implemented.
Hyperparameters:
- train.batch_size: 12 for 2 A100 (40GB) GPUs
- train.adam.lr: 3e-4 and train.adam.weight_decay: 1e-6
- train.decay: learning rate decay is applied during training
- train.loss_rate: 1 as total_loss = 1 * L1_loss + 1 * duration_loss
- ddpm.ddpm_noise_schedule: torch.linspace(1e-6, 0.01, hparams.ddpm.max_step)
- encoder.channel is reduced to 512 from 1024 or 2048
Current sample page only contains samples from WaveGrad-Base decoder.
Items in the TODO list above are still pending.
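
The linear noise schedule listed above can be made concrete. The sketch below reproduces torch.linspace(1e-6, 0.01, max_step) in plain Python and derives the cumulative product of (1 - beta) that DDPM-style models (Ho et al.) use to set the noise level at each step. max_step = 1000 is an assumed value for illustration; the real one is set by ddpm.max_step in hparameter.yaml. The function names are hypothetical.

```python
import math


def linear_betas(beta_start=1e-6, beta_end=0.01, max_step=1000):
    """Evenly spaced betas, endpoints included (like torch.linspace)."""
    if max_step == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (max_step - 1)
    return [beta_start + i * step for i in range(max_step)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s), as defined in DDPM."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_betas()
alpha_bars = cumulative_alphas(betas)
# sqrt(alpha_bar) scales the clean signal; it shrinks as the step index grows.
print(math.sqrt(alpha_bars[0]), math.sqrt(alpha_bars[-1]))
```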

Tree

.
├── Dockerfile
├── README.md
├── dataloader.py
├── docs
│   ├── spec.png
│   ├── tb.png
│   └── tblogger.png
├── hparameter.yaml
├── inference.py
├── lexicon
│   ├── librispeech-lexicon.txt
│   └── pinyin-lexicon-r.txt
├── lightning_model.py
├── model
│   ├── base.py
│   ├── downsampling.py
│   ├── encoder.py
│   ├── gaussian_upsampling.py
│   ├── interpolation.py
│   ├── layers.py
│   ├── linear_modulation.py
│   ├── nn.py
│   ├── resampling.py
│   ├── upsampling.py
│   └── window.py
├── prepare_align.py
├── preprocess.py
├── preprocess.yaml
├── preprocessor
│   ├── ljspeech.py
│   └── preprocessor.py
├── text
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── trainer.py
├── utils
│   ├── mel.py
│   ├── stft.py
│   ├── tblogger.py
│   └── utils.py
└── wavegrad2_tester.ipynb

Author

This code is implemented by

Seungu Han at MINDs Lab
Junhyeok Lee at MINDs Lab

Special thanks to

Kang-wook Kim at MINDs Lab
Wonbin Jung at MINDs Lab
Sang Hoon Woo at MINDs Lab

References

Chen et al., WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Chen et al., WaveGrad: Estimating Gradients for Waveform Generation
Ho et al., Denoising Diffusion Probabilistic Models
Shen et al., Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

This implementation uses code from the following repositories:

The webpage for the audio samples uses a template from:

WaveGrad2 Official Github.io

The audio samples on our webpage (TBD) are partially derived from:

LJSpeech: a single-speaker English dataset consisting of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
WaveGrad2 Official Github.io
