YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Last update: Dec 29, 2022

Overview


In our recent paper we propose the YourTTS model. YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality. This is important for enabling synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.

Audio samples

Visit our website for audio samples.

Implementation

All of our experiments were implemented in the Coqui TTS repo (still a PR).

Colab Demos

Demo	URL
Zero-Shot TTS	link
Zero-Shot VC	link

Checkpoints

All the released checkpoints are licensed under CC BY-NC-ND 4.0.

Model	URL
Speaker Encoder	link
Exp 1. YourTTS-EN(VCTK)	link
Exp 1. YourTTS-EN(VCTK) + SCL	link
Exp 2. YourTTS-EN(VCTK)-PT	link
Exp 2. YourTTS-EN(VCTK)-PT + SCL	link
Exp 3. YourTTS-EN(VCTK)-PT-FR	link
Exp 3. YourTTS-EN(VCTK)-PT-FR SCL	link
Exp 4. YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL	link

Results replicability

To ensure replicability, we make the audio files used to generate the MOS available here. In addition, we provide the MOS for each audio file here.

To re-generate our MOS results, follow the instructions here. To predict the test sentences and generate the SECS, please use the Jupyter Notebooks available here.
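
For readers unfamiliar with the metric, SECS (Speaker Encoder Cosine Similarity) is the cosine similarity between the speaker embeddings of two audio clips, so it ranges from -1 to 1. The following is a minimal pure-Python sketch of that computation, not the code from the notebooks. The function name `secs` and the toy three-dimensional vectors are hypothetical; real embeddings would come from a speaker encoder.

```python
import math


def secs(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (sequences of floats)."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of a reference and a synthesized clip.
reference = [0.2, 0.1, 0.7]
synthesized = [0.25, 0.05, 0.65]
print(round(secs(reference, synthesized), 3))
```

A value close to 1 indicates that the speaker encoder places the two clips close together, which is read as high voice similarity.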

Comments

Languages other than PT, FR, EN

As YourTTS is a multilingual TTS model, it seems that other languages could be supported by training on additional datasets. However, YourTTS's checkpoint structure seems distinctive. Is there a training procedure that I can refer to?

opened by papercore-dev 7

Issue with Input type and weight type should be the same

Hi,

I am trying to train YourTTS on my own dataset. So I followed your helpful guide with the latest stable version of Coqui TTS (0.8.0).

After computing the embeddings (on GPU) without issue, I run into this RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same.

I have already trained a VITS model with this dataset, so everything is already set up. I understand that the input tensor resides on the GPU whereas the weight tensor resides on the CPU, but how can I solve this? Should I downgrade to Coqui TTS 0.6.2?

Here is the full traceback:

  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1533, in fit
    self._fit()
  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1517, in _fit
    self.train_epoch()
  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1282, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1135, in train_step
    outputs, loss_dict_new, step_time = self._optimize(
  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 996, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion, optimizer_idx=optimizer_idx)
  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 954, in _model_train_step
    return model.train_step(*input_args)
  File "/home/caraduf/YourTTS/TTS/TTS/tts/models/vits.py", line 1250, in train_step
    outputs = self.forward(
  File "/home/caraduf/YourTTS/TTS/TTS/tts/models/vits.py", line 1049, in forward
    pred_embs = self.speaker_manager.encoder.forward(wavs_batch, l2_norm=True)
  File "/home/caraduf/YourTTS/TTS/TTS/encoder/models/resnet.py", line 169, in forward
    x = self.torch_spec(x)
  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/caraduf/YourTTS/TTS/TTS/encoder/models/base_encoder.py", line 22, in forward
    return torch.nn.functional.conv1d(x, self.filter).squeeze(1)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Thanks for helping me out!

opened by Ca-ressemble-a-du-fake 6

Speaker Encoder train on new language

Hi, can you elaborate on where you got the Speaker Encoder from, and how you trained it with additional languages? How do you use the Wav2Vec model that was trained with fairseq? In config_se.json: "run_description": "resnet speaker encoder trained with commonvoice all languages dev and train, Voxceleb 1 dev and Voxceleb 2 dev". Which languages are included in this CV? Which version of CV was used in this training? Thanks.

opened by ikcla 5

YourTTS_zeroshot_VC_demo.ipynb

Hi! I am trying to run YourTTS_zeroshot_VC_demo.ipynb, and the access rights for the file best_model.pth.tar seem to have changed. I am downloading it right now and will upload it manually so that I can run the notebook, but could you kindly fix the access rights so that others can run it as easily as before? Thank you in advance!

opened by stalevna 5

train our own voice model

Hi ,

I have found your repo very interesting, so I am trying it out. I would like to know how to train on our own voice files to create a checkpoint without involving text (as I have seen in previous issues, which refer to Coqui model training) and without altering config.json. Can you please guide us on how to proceed?

opened by chandrakanthlns 4

Train YourTTS on another language

Good day!

I have several questions, could you please help?

Do I understand correctly that if I want to train the model on another language, it is better to fine-tune this model (YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL): https://drive.google.com/drive/folders/15G-QS5tYQPkqiXfAdialJjmuqZV0azQV? Or is it better to use other checkpoints?

How many hours of audio are needed to get appropriate quality?

I plan to use the Common Voice Corpus to fine-tune the model on a new language; however, the audio format is mp3, not wav. Do I need to convert all the audio files, or can I use the mp3 format? If I need to convert them, how?
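
One common way to handle the mp3 question above is to batch-convert the clips to mono WAV with ffmpeg before training. The sketch below is only an illustration under stated assumptions, not a procedure confirmed by the authors: it assumes ffmpeg is installed and on the PATH, the helper names `build_ffmpeg_command` and `convert_all` are hypothetical, and the 16000 Hz default is an assumed target that should be checked against the sample rate in the model's config.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(mp3_path, wav_dir, sample_rate=16000):
    """Build the ffmpeg argument list that converts one mp3 file to mono WAV."""
    mp3_path = Path(mp3_path)
    wav_path = Path(wav_dir) / (mp3_path.stem + ".wav")
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", str(mp3_path),       # input file
        "-ar", str(sample_rate),   # output sample rate in Hz (assumed value)
        "-ac", "1",                # downmix to a single channel
        str(wav_path),
    ]


def convert_all(mp3_dir, wav_dir, sample_rate=16000):
    """Convert every *.mp3 file in mp3_dir, writing the WAV files to wav_dir."""
    Path(wav_dir).mkdir(parents=True, exist_ok=True)
    for mp3_path in sorted(Path(mp3_dir).glob("*.mp3")):
        command = build_ffmpeg_command(mp3_path, wav_dir, sample_rate)
        subprocess.run(command, check=True)  # raises if ffmpeg fails
```

Calling `convert_all("clips", "wavs")` would then convert a hypothetical `clips` directory; the dataset metadata would also need to point at the new `.wav` file names.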

Thank you for your time in advance!

opened by annaklyueva 4

Select Speakers for Zero Shot TTS

Hi ,

First, great work on the project. I am spending time trying to understand the repo more clearly. I wanted to know how I can select different speakers for different sections of text.

Thanks in advance.

opened by dipanjannC 4

From which version does Coqui TTS start supporting voice conversion and cloning?

Hi @Edresson, I am fairly new to the field, so please forgive the naive question. I am trying to use the voice cloning feature. I trained a model with Coqui TTS version 0.6, in an environment with that version installed. I am using the command below to do the cloning, but it fails with an error saying that the tts command does not expect "reference_wav":

tts --model_path trained_model/best_model.pth.tar --config_path trained_model/config.json --speaker_idx "icici" --out_path output.wav --reference_wav target_content/asura_10secs.wav

This might be because that version did not support voice conversion yet. Can you please confirm? Also, the model trained on version 0.6 doesn't run with the latest version and ends in a dimension mismatch error, which I assume is due to a change in the model structure. Please shed some light on this; it would be really helpful.

opened by tieincred 3

finetune VC on my voice

I would like to finetune yourTTS voice conversion on my own voice, and compare it to the zero-shot model. Could you provide the finetuning procedure for VC?

opened by odeliazavlianovSC 3

Exp 1. YourTTS-EN(VCTK) + SCL (speaker encoder layers are not initialized)

I tried to run an experiment similar to Exp 1. YourTTS-EN(VCTK) + SCL, setting use_speaker_encoder_as_loss=true, speaker_encoder_loss_alpha=9.0, speaker_encoder_config_path and speaker_encoder_model_path (I downloaded them from your Google Drive).

So my config file is almost identical to the one you have for the experiment. (I don't have fine_tuning_mode=0, but I checked and 0 means disabled, so it shouldn't affect anything. Also, use_speaker_embedding=false; otherwise it complains that vectors are initialized.)

My problem is that when I print the model weight keys of your model and mine, the speaker encoder layers are missing from mine. They are not initialized for some reason. Unfortunately, I have no idea why this could be happening :( Could you point me in a direction and tell me what I could check?

    "use_sdp": true,
    "noise_scale": 1.0,
    "inference_noise_scale": 0.667,
    "length_scale": 1,
    "noise_scale_dp": 1.0,
    "inference_noise_scale_dp": 0.8,
    "max_inference_len": null,
    "init_discriminator": true,
    "use_spectral_norm_disriminator": false,
    "use_speaker_embedding": true,
    "num_speakers": 97,
    "speakers_file": null,
    "d_vector_file": "../speaker_embeddings/new-SE/VCTK+TTS-PT+MAILABS-FR/speakers.json",
    "speaker_embedding_channels": 512,
    "use_d_vector_file": true,
    "d_vector_dim": 512,
    "detach_dp_input": true,
    "use_language_embedding": false,
    "embedded_language_dim": 4,
    "num_languages": 0,
    "use_speaker_encoder_as_loss": true,
    "speaker_encoder_config_path": "../checkpoints/Speaker_Encoder/Resnet-original-paper/config.json",
    "speaker_encoder_model_path": "../checkpoints/Speaker_Encoder/Resnet-original-paper/converted_checkpoint.pth.tar",
    "fine_tuning_mode": 0,
    "freeze_encoder": false,
    "freeze_DP": false,
    "freeze_PE": false,
    "freeze_flow_decoder": false,
    "freeze_waveform_decoder": false
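
The comparison of weight keys described in this issue can be made systematic with a small helper that lists which keys of one checkpoint are absent from another. This is a minimal sketch under stated assumptions: the function name `missing_keys` is hypothetical, the key names in the example are invented for illustration, and in practice the two key lists would be taken from the state dicts of the loaded checkpoints.

```python
def missing_keys(reference_keys, candidate_keys, prefix=""):
    """Return the sorted reference keys starting with prefix that the candidate lacks."""
    candidate = set(candidate_keys)
    return sorted(
        key for key in reference_keys
        if key.startswith(prefix) and key not in candidate
    )


# Invented key names standing in for the released model and a re-trained one.
released = [
    "text_encoder.emb.weight",
    "speaker_encoder.layer1.weight",
    "speaker_encoder.layer1.bias",
]
mine = ["text_encoder.emb.weight"]

for key in missing_keys(released, mine, prefix="speaker_encoder."):
    print("missing:", key)
```

Filtering by a prefix keeps the report focused on one sub-module, which makes it easier to see whether a whole component was never created rather than a few layers being renamed.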

opened by stalevna 3

Zeroshot TTS notebook no longer working

Hi @Edresson @WeberJulian

The demo notebook no longer works with the current TTS master repo.

I'm having a hard time getting things to run.

Do you intend to update it? Thanks.

opened by vince62s 3

Releases(MOS)

MOS(Nov 19, 2021)

Source code(tar.gz)
Source code(zip)
Audios_MOS.zip(311.92 MB)

Owner

Edresson Casanova

Computer Science PhD Student

GitHub Repository

Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP"

DiLBERT Repo for the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP" Pretrained Model The pretrained model presented in the paper is

2 Dec 15, 2022

The fastest way to visualize GradCAM with your Keras models.

VizGradCAM VizGradCam is the fastest way to visualize GradCAM in Keras models. GradCAM helps with providing visual explainability of trained models an

58 Nov 19, 2022

Synthesize photos from PhotoDNA using machine learning 🌱

Ribosome Synthesize photos from PhotoDNA. See the blog post for more information. Installation Dependencies You can install Python dependencies using

112 Nov 23, 2022

Tensorflow implementation of "Learning Deep Features for Discriminative Localization"

Weakly_detector Tensorflow implementation of "Learning Deep Features for Discriminative Localization" B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and

363 Jun 29, 2022

Denoising Diffusion Implicit Models

Denoising Diffusion Implicit Models (DDIM) Jiaming Song, Chenlin Meng and Stefano Ermon, Stanford Implements sampling from an implicit model that is t

465 Jan 05, 2023

A task-agnostic vision-language architecture as a step towards General Purpose Vision

Towards General Purpose Vision Systems By Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem Overview Welcome to the official code base f

79 Dec 23, 2022

phylotorch-bito is a package providing an interface to BITO for phylotorch

phylotorch-bito phylotorch-bito is a package providing an interface to BITO for phylotorch Dependencies phylotorch BITO Installation Get the source co

2 Sep 01, 2022

Distributional Sliced-Wasserstein distance code

Distributional Sliced Wasserstein distance This is a pytorch implementation of the paper "Distributional Sliced-Wasserstein and Applications to Genera

39 Jan 01, 2023

[NeurIPS-2021] Slow Learning and Fast Inference: Efficient Graph Similarity Computation via Knowledge Distillation

Efficient Graph Similarity Computation - (EGSC) This repo contains the source code and dataset for our paper: Slow Learning and Fast Inference: Effici

24 Dec 31, 2022

Homepage of paper: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, ICCV 2021.

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [Paper] [Official Paddle Implementation] [Huggingface Gradio Demo] [Unofficial

442 Dec 16, 2022

A Pytorch implementation of "LegoNet: Efficient Convolutional Neural Networks with Lego Filters" (ICML 2019).

LegoNet This code is the implementation of ICML2019 paper LegoNet: Efficient Convolutional Neural Networks with Lego Filters Run python train.py You c

140 Sep 26, 2022

An open source AutoML toolkit for automate machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

NNI Doc | 简体中文 NNI (Neural Network Intelligence) is a lightweight but powerful toolkit to help users automate Feature Engineering, Neural Architecture

12.4k Dec 31, 2022

Resources related to our paper "CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain"

CLIN-X (CLIN-X-ES) & (CLIN-X-EN) This repository holds the companion code for the system reported in the paper: "CLIN-X: pre-trained language models a

4 Dec 05, 2022

Async API for controlling Hue Lights

Hue API Async API for controlling Hue Lights Documentation: hue-api.nirantak.com Source: github.com/nirantak/hue-api Installation This is an async cli

4 Nov 16, 2022

A numpy-based implementation of RANSAC for fundamental matrix and homography estimation. The degeneracy updating and local optimization components are included and optional.

Description A numpy-based implementation of RANSAC for fundamental matrix and homography estimation. The degeneracy updating and local optimization co

9 Nov 10, 2022

How to Become More Salient? Surfacing Representation Biases of the Saliency Prediction Model

49 Nov 05, 2022

Code and real data for the paper "Counterfactual Temporal Point Processes", available at arXiv.

counterfactual-tpp This is a repository containing code and real data for the paper Counterfactual Temporal Point Processes. Pre-requisites This code

11 Dec 09, 2022

Anime Face Detector using mmdet and mmpose

Anime Face Detector This is an anime face detector using mmdetection and mmpose. (To avoid copyright issues, I use generated images by the TADNE model

198 Jan 07, 2023

Official Pytorch implementation of Online Continual Learning on Class Incremental Blurry Task Configuration with Anytime Inference (ICLR 2022)

The Official Implementation of CLIB (Continual Learning for i-Blurry) Online Continual Learning on Class Incremental Blurry Task Configuration with An

34 Oct 26, 2022

CLEAR algorithm for multi-view data association

CLEAR: Consistent Lifting, Embedding, and Alignment Rectification Algorithm The Matlab, Python, and C++ implementation of the CLEAR algorithm, as desc

30 Jan 02, 2023
