VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Last update: Jan 08, 2023

Related tags

Overview

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Jaehyeon Kim, Jungil Kong, and Juhee Son

In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Visit our demo for audio samples.

We also provide the pretrained models.

VITS at training	VITS at inference

Pre-requisites

Python >= 3.6
Clone this repository
Install python requirements. Please refer requirements.txt
1. You may need to install espeak first: apt-get install espeak
Download datasets
1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
2. For mult-speaker setting, download and extract the VCTK dataset, and downsample wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
Build Monotonic Alignment Search and run preprocessing if you use your own datasets.

# Cython-version Monotonoic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have been already provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt

Training Exmaple

# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base

Inference Example

See inference.ipynb

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Related tags

Overview

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Jaehyeon Kim, Jungil Kong, and Juhee Son

Pre-requisites

Training Exmaple

Inference Example

Owner

Jaehyeon Kim

Source code for our Paper "Learning in High-Dimensional Feature Spaces Using ANOVA-Based Matrix-Vector Multiplication"

Scalable implementation of Lee / Mykland (2012) and Ait-Sahalia / Jacod (2012) Jump tests for noisy high frequency data

This is a TensorFlow implementation for C2-Rec

Solution to the Weather4cast 2021 challenge

Code for CVPR 2018 paper --- Texture Mapping for 3D Reconstruction with RGB-D Sensor

Code for reproducible experiments presented in KSD Aggregated Goodness-of-fit Test.

Code for the paper "Next Generation Reservoir Computing"

Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training

simple_pytorch_example project is a toy example of a python script that instantiates and trains a PyTorch neural network on the FashionMNIST dataset

The code is for the paper "A Self-Distillation Embedded Supervised Affinity Attention Model for Few-Shot Segmentation"

A PyTorch Implementation of ViT (Vision Transformer)

[NeurIPS-2021] Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data

A cool little repl-based simulation written in Python

Official PyTorch implementation of Retrieve in Style: Unsupervised Facial Feature Transfer and Retrieval.

A PyTorch Toolbox for Face Recognition

Introduction to AI assignment 1 HCM University of Technology, term 211

HCQ: Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval

Adds timm pretrained backbone to pytorch's FasterRcnn model

Code for sound field predictions in domains with impedance boundaries. Used for generating results from the paper

Official implementation of the paper Label-Efficient Semantic Segmentation with Diffusion Models