TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, Korean, Chinese, and German, and is easy to adapt to other languages)

Last update: Jan 04, 2023

Overview

😋 TensorFlowTTS

Real-Time State-of-the-art Speech Synthesis for Tensorflow 2

🤪 TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multiband-MelGAN, FastSpeech, and FastSpeech2, all based on TensorFlow 2. With TensorFlow 2, we can speed up training and inference, optimize further with fake-quantization-aware training and pruning, and make TTS models run faster than real time and deployable on mobile devices or embedded systems.

What's new

2020/12/02 (NEW!) Support German TTS with Thorsten dataset. See the Colab. Thanks thorstenMueller and monatis.
2020/11/24 (NEW!) Add HiFi-GAN vocoder. See here
2020/11/19 (NEW!) Add Multi-GPU gradient accumulator. See here
2020/08/23 Add Parallel WaveGAN tensorflow implementation. See here
2020/08/23 Add MBMelGAN G + ParallelWaveGAN G example. See here
2020/08/20 Add C++ inference code. Thanks to @ZDisket. See here
2020/08/18 Update new base processor. Add AutoProcessor and pretrained processor json file
2020/08/14 Support Chinese TTS. Please see the Colab. Thanks to @azraelkuan
2020/08/05 Support Korean TTS. Please see the Colab. Thanks to @crux153
2020/07/17 Support multi-GPU for all trainers
2020/07/05 Support converting Tacotron-2 and FastSpeech to TFLite. Please see the Colab. Thanks to @jaeyoo from the TFLite team for his support
2020/06/20 FastSpeech2 implementation with Tensorflow is supported.
2020/06/07 Multi-band MelGAN (MB MelGAN) implementation with Tensorflow is supported

Features

High performance on speech synthesis.
Can be fine-tuned on other languages.
Fast, scalable, and reliable.
Suitable for deployment.
Easy to implement a new model, based on an abstract class.
Mixed precision to speed up training where possible.
Supports single/multi-GPU gradient accumulation.
Supports both single and multi-GPU in the base trainer class.
TFLite conversion for all supported models.
Android example.
Supports many languages (currently Chinese, Korean, English, and German).
Supports C++ inference.
Supports converting weights for some models from PyTorch to TensorFlow to accelerate speed.

Requirements

This repository is tested on Ubuntu 18.04 with:

Python 3.7+
Cuda 10.1
CuDNN 7.6.5
Tensorflow 2.2/2.3
Tensorflow Addons >= 0.10.0

Other TensorFlow versions should work but have not been tested yet. This repo will try to work with the latest stable TensorFlow version. We recommend installing TensorFlow 2.3.0 for training if you want to use multi-GPU.

Installation

With pip

$ pip install TensorFlowTTS

From source

Examples are included in the repository but are not shipped with the framework. Therefore, to run the latest version of the examples, you need to install from source as shown below.

$ git clone https://github.com/TensorSpeech/TensorFlowTTS.git
$ cd TensorFlowTTS
$ pip install .

If you want to upgrade the repository and its dependencies:

$ git pull
$ pip install --upgrade .

Supported Model architectures

TensorFlowTTS currently provides the following architectures:

MelGAN released with the paper MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis by Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville.
Tacotron-2 released with the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions by Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu.
FastSpeech released with the paper FastSpeech: Fast, Robust, and Controllable Text to Speech by Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
Multi-band MelGAN released with the paper Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech by Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie.
FastSpeech2 released with the paper FastSpeech 2: Fast and High-Quality End-to-End Text to Speech by Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
Parallel WaveGAN released with the paper Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram by Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim.
HiFi-GAN released with the paper HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis by Jungil Kong, Jaehyeon Kim, Jaekyoung Bae.

We are also implementing some techniques to improve quality and convergence speed from the following papers:

Guided Attention Loss released with the paper Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention by Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara.

Audio Samples

Audio samples from the validation set are available for: tacotron-2, fastspeech, melgan, melgan.stft, fastspeech2, multiband_melgan

Tutorial End-to-End

Prepare Dataset

Prepare a dataset in the following format:

|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...

Where metadata.csv has the following format: id|transcription. This is an LJSpeech-like format; you can skip the preprocessing steps if your datasets use other formats.

Note that NAME_DATASET should be [ljspeech/kss/baker/libritts] for example.
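
Before running preprocessing, it can help to confirm that every row of metadata.csv points at an audio file. The sketch below assumes the layout shown above (a pipe-delimited metadata.csv next to a wavs/ folder); the function name load_metadata is illustrative and is not part of TensorFlowTTS.

```python
import csv
from pathlib import Path


def load_metadata(dataset_dir):
    """Read `id|transcription` rows and confirm each id has a wav file."""
    dataset_dir = Path(dataset_dir)
    entries, missing = [], []
    with open(dataset_dir / "metadata.csv", encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcriptions untouched.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            utt_id, text = row[0], row[1]
            if (dataset_dir / "wavs" / f"{utt_id}.wav").is_file():
                entries.append((utt_id, text))
            else:
                missing.append(utt_id)
    return entries, missing
```

Calling load_metadata("./ljspeech") returns the usable (id, transcription) pairs together with the IDs whose audio file could not be found.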

Preprocessing

The preprocessing has two steps:

1. Preprocess audio features
   - Convert characters to IDs
   - Compute mel spectrograms
   - Normalize mel spectrograms to [-1, 1] range
   - Split the dataset into train and validation
   - Compute the mean and standard deviation of multiple features from the training split
2. Standardize mel spectrograms based on the computed statistics
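
The idea behind the statistics and standardization steps can be illustrated with NumPy. This is a minimal sketch on random toy arrays, assuming a per-mel-bin mean and standard deviation; compute_stats and standardize are hypothetical helpers, and the toolkit's own commands may differ in detail.

```python
import numpy as np


def compute_stats(train_mels):
    """Stack all training frames and return per-bin (mean, std)."""
    frames = np.concatenate(train_mels, axis=0)  # [total_frames, n_mels]
    return frames.mean(axis=0), frames.std(axis=0)


def standardize(mel, mean, std, eps=1e-8):
    """Apply the training-split statistics to one [frames, n_mels] array."""
    return (mel - mean) / (std + eps)


# Toy data standing in for real mel spectrograms (80 bins, variable length).
rng = np.random.default_rng(0)
train = [rng.normal(3.0, 2.0, size=(n, 80)) for n in (120, 95, 140)]
valid = [rng.normal(3.0, 2.0, size=(110, 80))]

mean, std = compute_stats(train)
train_norm = [standardize(m, mean, std) for m in train]
valid_norm = [standardize(m, mean, std) for m in valid]  # reuse train stats
```

The validation split reuses the training statistics, so no information from validation data enters the normalization.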

To reproduce the steps above:

tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/libritts/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]

Right now we only support ljspeech, kss, baker, libritts and thorsten for the dataset argument. In the future, we intend to support more datasets.

Note: To run libritts preprocessing, please first read the instructions in examples/fastspeech2_libritts. The dataset needs to be reformatted before running preprocessing.

After preprocessing, the structure of the project folder should be:

|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...
|- dump_[ljspeech/kss/baker/libritts/thorsten]/
|   |- train/
|       |- ids/
|           |- LJ001-0001-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0001-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0001-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0001-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0001-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0001-wave.npy
|           |- ...
|   |- valid/
|       |- ids/
|           |- LJ001-0009-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0009-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0009-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0009-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0009-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0009-wave.npy
|           |- ...
|   |- stats.npy
|   |- stats_f0.npy
|   |- stats_energy.npy
|   |- train_utt_ids.npy
|   |- valid_utt_ids.npy
|- examples/
|   |- melgan/
|   |- fastspeech/
|   |- tacotron2/
|   ...

stats.npy contains the mean and std from the training split mel spectrograms
stats_energy.npy contains the mean and std of energy values from the training split
stats_f0.npy contains the mean and std of F0 values in the training split
train_utt_ids.npy / valid_utt_ids.npy contain the training and validation utterance IDs, respectively

We use a suffix (ids, raw-feats, raw-energy, raw-f0, norm-feats, and wave) for each input type.
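
Because every utterance is expected to appear once in each sub-folder, a quick consistency check can catch mismatched file counts before training starts. The following is a sketch under the assumption that the dump folder matches the tree above; SUFFIXES and find_incomplete_utterances are illustrative names, not part of the toolkit.

```python
from pathlib import Path

# Sub-folder name -> file suffix, as listed in the tree above.
SUFFIXES = {
    "ids": "ids",
    "raw-feats": "raw-feats",
    "raw-f0": "raw-f0",
    "raw-energies": "raw-energy",
    "norm-feats": "norm-feats",
    "wavs": "wave",
}


def find_incomplete_utterances(split_dir):
    """Return {utt_id: [missing sub-folders]} for one split (train/ or valid/)."""
    split_dir = Path(split_dir)
    found = {}
    for folder, suffix in SUFFIXES.items():
        tail = f"-{suffix}.npy"
        for path in (split_dir / folder).glob(f"*{tail}"):
            utt_id = path.name[: -len(tail)]
            found.setdefault(utt_id, set()).add(folder)
    return {
        utt_id: sorted(set(SUFFIXES) - folders)
        for utt_id, folders in found.items()
        if folders != set(SUFFIXES)
    }
```

An empty result for both ./dump_ljspeech/train and ./dump_ljspeech/valid means every utterance has all six files.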

IMPORTANT NOTES:

This preprocessing step is based on ESPnet so you can combine all models here with other models from ESPnet repository.
Regardless of how your dataset is formatted, the final structure of the dump folder SHOULD follow the above structure to be able to use the training script, or you can modify it by yourself 😄 .

Training models

To learn how to train a model from scratch or fine-tune it on other datasets/languages, please see the details in the examples directory.

For Tacotron-2 tutorial, please see examples/tacotron2
For FastSpeech tutorial, please see examples/fastspeech
For FastSpeech2 tutorial, please see examples/fastspeech2
For FastSpeech2 + MFA tutorial, please see examples/fastspeech2_libritts
For MelGAN tutorial, please see examples/melgan
For MelGAN + STFT Loss tutorial, please see examples/melgan.stft
For Multiband-MelGAN tutorial, please see examples/multiband_melgan
For Parallel WaveGAN tutorial, please see examples/parallel_wavegan
For Multiband-MelGAN Generator + Parallel WaveGAN Discriminator tutorial, please see examples/multiband_pwgan
For HiFi-GAN tutorial, please see examples/hifigan

Abstract Class Explanation

Abstract DataLoader Tensorflow-based dataset

A detailed implementation of the abstract dataset class is in tensorflow_tts/dataset/abstract_dataset. There are some functions you need to override and understand:

get_args: Returns the arguments for the generator function, normally utt_ids.
generator: Takes its inputs from get_args and returns the inputs for the model. Note that every generator function returns a dictionary whose keys exactly match the model's parameters, because base_trainer uses model(**batch) for the forward step.
get_output_dtypes: Must return the dtype of each element produced by the generator function.
get_len_dataset: Returns the length of the dataset, normally len(utt_ids).
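
To make the contract between these four functions concrete, here is a plain-Python stand-in. It is only a sketch: ToyCharMelDataset does not subclass the real abstract dataset, the key names input_ids and mel_gts are hypothetical, and dtypes are written as strings where a real implementation would return TensorFlow dtypes.

```python
class ToyCharMelDataset:
    """Stand-in that mirrors the four methods described above."""

    def __init__(self, utt_ids, char_ids, mels):
        self.utt_ids = utt_ids
        self.char_ids = char_ids  # utt_id -> list of character IDs
        self.mels = mels          # utt_id -> list of mel frames

    def get_args(self):
        # Arguments handed to `generator`; normally just the utterance IDs.
        return [self.utt_ids]

    def generator(self, utt_ids):
        # Keys must match the model's parameter names, because the trainer
        # calls model(**batch).
        for utt_id in utt_ids:
            yield {
                "input_ids": self.char_ids[utt_id],
                "mel_gts": self.mels[utt_id],
            }

    def get_output_dtypes(self):
        # One dtype per key yielded by `generator` (strings as placeholders).
        return {"input_ids": "int32", "mel_gts": "float32"}

    def get_len_dataset(self):
        return len(self.utt_ids)


def toy_model(input_ids, mel_gts):
    """Hypothetical model whose parameters match the generator's keys."""
    return len(input_ids), len(mel_gts)


dataset = ToyCharMelDataset(
    utt_ids=["utt1", "utt2"],
    char_ids={"utt1": [5, 9, 2], "utt2": [7, 1]},
    mels={"utt1": [[0.0] * 80] * 12, "utt2": [[0.0] * 80] * 8},
)
outputs = [toy_model(**item) for item in dataset.generator(*dataset.get_args())]
```

If a key yielded by generator did not match a parameter of toy_model, the model(**item) call would raise a TypeError, which is why the names have to agree exactly.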

IMPORTANT NOTES:

The dataset pipeline should be built in this order: cache -> shuffle -> map_fn -> get_batch -> prefetch.
If you shuffle before cache, the dataset will not be reshuffled when it is iterated again.
You should apply map_fn so that the elements returned by the generator function have the same length before they are batched and fed into a model.
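
Why the order of cache and shuffle matters can be shown with a toy model in plain Python. This sketch does not use tf.data; the two functions are hypothetical and only imitate the behaviour described in the notes above.

```python
import random


def epochs_shuffle_then_cache(items, num_epochs, seed=0):
    """Shuffle once, then cache: the cached order is replayed every epoch."""
    rng = random.Random(seed)
    cached = list(items)
    rng.shuffle(cached)  # the shuffle happens before the cache is filled
    return [list(cached) for _ in range(num_epochs)]


def epochs_cache_then_shuffle(items, num_epochs, seed=0):
    """Cache first, then shuffle: each epoch draws a fresh order."""
    rng = random.Random(seed)
    cached = list(items)  # the cache stores the unshuffled elements
    orders = []
    for _ in range(num_epochs):
        epoch = list(cached)
        rng.shuffle(epoch)  # reshuffled on every pass over the cache
        orders.append(epoch)
    return orders


same = epochs_shuffle_then_cache(range(100), 3)
fresh = epochs_cache_then_shuffle(range(100), 3)
```

In this toy, every epoch in same has the identical order, while the epochs in fresh contain the same elements in different orders.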

Examples that use this abstract_dataset include tacotron_dataset.py, fastspeech_dataset.py, melgan_dataset.py, and fastspeech2_dataset.py.

Abstract Trainer Class

A detailed implementation of base_trainer is in tensorflow_tts/trainer/base_trainer.py. It includes Seq2SeqBasedTrainer and GanBasedTrainer, which inherit from BasedTrainer. All trainers support both single and multi-GPU. There are some functions you MUST override when implementing a new trainer:

compile: Defines the models and the losses.
generate_and_save_intermediate_result: Saves intermediate results, such as alignment plots, generated audio, and mel-spectrogram plots.
compute_per_example_losses: Computes the per-example loss for the model. Note that every element of the loss MUST have shape [batch_size].
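
The [batch_size] requirement means a loss is reduced over the time and feature axes but not over the batch axis. A minimal NumPy sketch of that reduction follows; per_example_mae is a hypothetical helper, not the trainer's own code.

```python
import numpy as np


def per_example_mae(mel_true, mel_pred):
    """Mean absolute error reduced over time and mel bins, not over batch."""
    return np.mean(np.abs(mel_true - mel_pred), axis=(1, 2))  # [batch_size]


batch_size, frames, n_mels = 4, 50, 80
mel_true = np.zeros((batch_size, frames, n_mels), dtype=np.float32)
mel_pred = np.ones((batch_size, frames, n_mels), dtype=np.float32)
mel_pred[0] *= 2.0  # make the first example's error larger

losses = per_example_mae(mel_true, mel_pred)
```

Keeping one value per example leaves the final averaging over the batch to the caller, which is presumably what lets the same loss code serve both single and multi-GPU training.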

All models in this repo are trained with GanBasedTrainer (see train_melgan.py, train_melgan_stft.py, train_multiband_melgan.py) or Seq2SeqBasedTrainer (see train_tacotron2.py, train_fastspeech.py).

End-to-End Examples

To learn how to run inference with each model, see the notebooks, a Colab (for English), or a Colab (for Korean). Here is example code for end-to-end inference with FastSpeech and MelGAN.

import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech model.
fs_config = AutoConfig.from_pretrained('./examples/fastspeech/conf/fastspeech.v1.yaml')
fastspeech = TFAutoModel.from_pretrained(
    config=fs_config,
    pretrained_path="./examples/fastspeech/pretrained/model-195000.h5"
)


# initialize melgan model
melgan_config = AutoConfig.from_pretrained('./examples/melgan/conf/melgan.v1.yaml')
melgan = TFAutoModel.from_pretrained(
    config=melgan_config,
    pretrained_path="./examples/melgan/checkpoint/generator-1500000.h5"
)


# inference
processor = AutoProcessor.from_pretrained(pretrained_path="./test/files/ljspeech_mapper.json")

ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
ids = tf.expand_dims(ids, 0)
# fastspeech inference

masked_mel_before, masked_mel_after, duration_outputs = fastspeech.inference(
    ids,
    speaker_ids=tf.zeros(shape=[tf.shape(ids)[0]], dtype=tf.int32),
    speed_ratios=tf.constant([1.0], dtype=tf.float32)
)

# melgan inference
audio_before = melgan.inference(masked_mel_before)[0, :, 0]
audio_after = melgan.inference(masked_mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")

Contact

Minh Nguyen Quan Anh, erogol, Kuan Chen, Dawid Kobus, Takuya Ebata, Trinh Le Quang, Yunchao He, Alejandro Miguel Velasquez

License

Overall, almost all models here are licensed under Apache 2.0 for all countries in the world, except in Viet Nam, where this framework cannot be used for production in any way without permission from the TensorFlowTTS authors. There is one exception: Tacotron-2 can be used for any purpose. If you are Vietnamese and want to use this framework for production, you must contact us in advance.

Acknowledgement

We want to thank Tomoki Hayashi, who discussed MelGAN, Multi-band MelGAN, FastSpeech, and Tacotron with us at length. This framework is based on his great open-source ParallelWaveGAN project.

Comments

FastSpeech2 training with MFA and Phoneme-based

When training FastSpeech2 (fastspeech2_v2) with phonetic alignments extracted from MFA, I get the following error:

/content/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py in run(self)
     65         )
     66         while True:
---> 67             self._train_epoch()
     68 
     69             if self.finish_train:

/content/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py in _train_epoch(self)
     87         for train_steps_per_epoch, batch in enumerate(self.train_data_loader, 1):
     88             # one step training
---> 89             self._train_step(batch)
     90 
     91             # check interval

<ipython-input-39-dd452e77975e> in _train_step(self, batch)
     75         """Train model one step."""
     76         charactor, duration, f0, energy, mel = batch
---> 77         self._one_step_fastspeech2(charactor, duration, f0, energy, mel)
     78 
     79         # update counts

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    578         xla_context.Exit()
    579     else:
--> 580       result = self._call(*args, **kwds)
    581 
    582     if tracing_count == self._get_tracing_count():

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    642         # Lifting succeeded, so variables are initialized and we can run the
    643         # stateless function.
--> 644         return self._stateless_fn(*args, **kwds)
    645     else:
    646       canon_args, canon_kwds = \

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
   2418     with self._lock:
   2419       graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2420     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   2421 
   2422   @property

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _filtered_call(self, args, kwargs)
   1663          if isinstance(t, (ops.Tensor,
   1664                            resource_variable_ops.BaseResourceVariable))),
-> 1665         self.captured_inputs)
   1666 
   1667   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1744       # No tape is watching; skip to running the function.
   1745       return self._build_call_outputs(self._inference_function.call(
-> 1746           ctx, args, cancellation_manager=cancellation_manager))
   1747     forward_backward = self._select_forward_and_backward_functions(
   1748         args,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
    596               inputs=args,
    597               attrs=attrs,
--> 598               ctx=ctx)
    599         else:
    600           outputs = execute.execute_with_cancellation(

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

InvalidArgumentError:  Incompatible shapes: [16,823,80] vs. [16,867,80]
	 [[node mean_absolute_error/sub (defined at <ipython-input-39-dd452e77975e>:115) ]] [Op:__inference__one_step_fastspeech2_341496]

Errors may have originated from an input operation.
Input Source operations connected to node mean_absolute_error/sub:
 mel (defined at <ipython-input-39-dd452e77975e>:77)	
 tf_fast_speech2_2/mel_before/BiasAdd (defined at /content/TensorflowTTS/tensorflow_tts/models/fastspeech2.py:196)

Function call stack:
_one_step_fastspeech2

I did everything I could think of to rule out my durations as the problem including verification that length is the same, so I don't know what happened. Interestingly enough, when training with mixed_precision off the same error happens but with different values:

InvalidArgumentError:  Incompatible shapes: [16,763,80] vs. [16,806,80]
	 [[node mean_absolute_error/sub (defined at <ipython-input-39-dd452e77975e>:115) ]] [Op:__inference__one_step_fastspeech2_449871]

Errors may have originated from an input operation.
Input Source operations connected to node mean_absolute_error/sub:
 tf_fast_speech2_3/mel_before/BiasAdd (defined at /content/TensorflowTTS/tensorflow_tts/models/fastspeech2.py:196)	
 mel (defined at <ipython-input-39-dd452e77975e>:77)

Function call stack:
_one_step_fastspeech2

Am I missing something?

enhancement 🚀 question ❓ Feature Request 🤗 FastSpeech Discussion 😁

opened by ZDisket 169

Fine-Tuning with a small dataset
Hello!

I'm trying to evaluate ways to achieve TTS for individuals that have lost their ability to speak, the idea is to allow them to regain speech via TTS but using the voice they had prior to losing their voice. This could happen from various causes such as cancer of the larynx, motor neurone disease, etc.

These patients have recorded voice banks, a small dataset of phrases recorded prior to losing their ability to speak.

Conceptually, I wanted to take a pre-trained model and fine-tune it with the individual's voice bank data.

I'd love some guidance.

There are a few constraints:

The patient-specific data bank is not a large dataset, it's approximately 100 recorded phrases.

Latency must be low, we hope for real-time TTS. Some approaches use a pre-trained model followed by vocoders, in our experience, this has been too slow, with latencies of about 5 seconds.

The trained model must work on an Android app (I see there is already an Android example, which has been helpful)

I'd love your guidance on the steps required to achieve this, and any recommendations on which choices would give good results...

Which model architectures will tolerate tuning with a small dataset?

The patients have British accents, whereas most pre-trained models have American accents. Will this be a problem?

Do you have any tutorials or examples that show how to achieve a customised voice via fine-tuning?

question ❓

opened by OscarVanL 127

Tacotron2: Everything become nan at 53k steps

Hi, I am not that experienced in TTS, so I faced many problems before getting the code running with my non-English dataset, which has about 10k sentences (~26 hours long). However, I still have some issues and questions.

When training reaches 53.5k steps, the model seems to lose "everything". The train and eval losses and the model predictions become nan (but training continues without reporting an exception).

So I stopped training and resumed from 50k; I will wait until 53.5k and see if it happens again. By the way, do my figures look fine? looks like model is overfitting; should I wait for a "surprise"?

My language is somehow under-resourced and there is no (at least I couldn't find one) phoneme dictionary to train a G2P and MFA model. However, unlike English, a character roughly represents a phone, except some vowels sound longer or shorter according to meaning of host word. So character-based model seems fine with me. This tacotron2 has been trained just for duration extraction.

Which step seems best for duration extraction so far?

How can I improve the quality of duration extraction? extract_duration.py extracts durations from model predictions, but they are supposed to be used with ground-truth mels. Although the sum of the tacotron2-extracted durations is forced to match the length of the ground-truth mels by alignment = alignment[:real_char_length, :real_mel_length], this is just based on an assumption that predicted mels and their ground-truth counterparts are roughly one-to-one (from index 0).
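
For reference, the counting step behind duration extraction can be sketched as follows. This is an illustration in NumPy, assuming the alignment is laid out as [char_length, mel_length] as the slice above suggests; durations_from_alignment is a hypothetical helper and is not copied from extract_duration.py.

```python
import numpy as np


def durations_from_alignment(alignment):
    """Count, per character, the mel frames whose attention peak lands on it.

    `alignment` has shape [char_length, mel_length], already cropped to the
    real (unpadded) lengths.
    """
    char_length = alignment.shape[0]
    best_char = np.argmax(alignment, axis=0)  # one character index per frame
    return np.bincount(best_char, minlength=char_length)


# Toy monotonic alignment: 3 characters over 7 mel frames.
alignment = np.zeros((3, 7), dtype=np.float32)
alignment[0, 0:2] = 1.0
alignment[1, 2:6] = 1.0
alignment[2, 6:7] = 1.0

durations = durations_from_alignment(alignment)
```

Since each mel frame is assigned to exactly one character, the durations always sum to the number of frames in the cropped alignment, regardless of how good the alignment itself is.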

So, when the goal of training a tacotron2 is only to extract good durations, is it a good idea to use the whole dataset for training and make a severely over-fitted model (maybe up to 200k steps or more in my case)?

Any ideas on MFA model training for a language with no phone dictionary available? Has anyone tried making a fake phone dictionary like this to force MFA to align characters instead of phonemes?

....
hello h e l l o
nice n i c e
....
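
A character-level dictionary of this kind could be generated from the transcripts with a few lines of Python. This is only a sketch of the idea: build_char_lexicon is a hypothetical helper, punctuation handling is left out, and whether MFA aligns well with such a dictionary is not something the sketch can confirm.

```python
def build_char_lexicon(words):
    """Map each unique word to its space-separated characters."""
    lines = []
    for word in sorted(set(w.lower() for w in words)):
        lines.append(f"{word} {' '.join(word)}")
    return lines


transcripts = ["Hello nice", "nice hello world"]
vocab = [w for line in transcripts for w in line.split()]
lexicon = build_char_lexicon(vocab)
```

Each entry in lexicon is one dictionary line, with the word followed by its characters in place of phones.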

Thanks.

question ❓ performance 🏍 Tacotron Discussion 😁 wontfix

opened by tekinek 57

Error Preprocessing KeyError: 'eos'

I am getting this error when trying to preprocess:

Traceback (most recent call last):
  File "/home/zak/venv/bin/tensorflow-tts-preprocess", line 8, in <module>
    sys.exit(preprocess())
  File "/home/zak/venv/lib/python3.8/site-packages/tensorflow_tts/bin/preprocess.py", line 442, in preprocess
    for result, mel, energy, f0, features in train_map:
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 448, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 865, in next
    raise value
KeyError: 'eos'

I didn't have this error before the latest updates. After I reinstalled TensorFlowTTS and tried to preprocess, I got this. Any ideas?

Thanks

bug 🐛

opened by Zak-SA 52

🇨🇳 Chinese TTS now available 😘

Chinese TTS is now available, thanks to @azraelkuan for his support :D. The model uses the Baker dataset (https://www.data-baker.com/open_source.htmlt). The pretrained model is licensed under CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/) since the dataset is non-commercial :D

Please check out the Colab below and enjoy :D.

https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing

Note: these are just initial results; more can be done to make the model better.

cc: @candlewill @l4zyf9x @machineko

enhancement 🚀 good first issue 🤔 Feature Request 🤗 wontfix

opened by dathudeptrai 46

RuntimeError when trying to inference From TFlite for Fastspeech2

Hi, I converted a FastSpeech2 model to TFLite, and when I tried to run inference from TFLite I got this error:

    decoder_output_tflite, mel_output_tflite = infer(input_text)
    interpreter.invoke()
  File "/home/zak/venv/lib/python3.8/site-packages/tensorflow/lite/python/interpreter.py", line 539, in invoke
    self._interpreter.Invoke()
RuntimeError: tensorflow/lite/kernels/reshape.cc:55 stretch_dim != -1 (0 != -1)
Node number 83 (RESHAPE) failed to prepare.

the code I used for this purpose is

import numpy as np
import yaml
import tensorflow as tf

from tensorflow_tts.processor import ZAKSpeechProcessor
from tensorflow_tts.processor.ZAKspeech import ZAKSPEECH_SYMBOLS

from tensorflow_tts.configs import FastSpeechConfig, FastSpeech2Config
from tensorflow_tts.configs import MultiBandMelGANGeneratorConfig

from tensorflow_tts.models import TFFastSpeech, TFFastSpeech2
from tensorflow_tts.models import TFMBMelGANGenerator

from IPython.display import Audio

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path='fastspeech2_quant.tflite')

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input data.
def prepare_input(input_ids):
    input_ids = tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0)
    return (input_ids,
            tf.convert_to_tensor([0], tf.int32),
            tf.convert_to_tensor([1.0], dtype=tf.float32),
            tf.convert_to_tensor([1.0], dtype=tf.float32),
            tf.convert_to_tensor([1.0], dtype=tf.float32))

# Test the model on random input data.
def infer(input_text):
    for x in input_details:
        print(x)
    for x in output_details:
        print(x)
    processor = ZAKSpeechProcessor(data_dir=None, symbols=ZAKSPEECH_SYMBOLS, cleaner_names="arabic_cleaners")
    input_ids = processor.text_to_sequence(input_text.lower())
    interpreter.resize_tensor_input(input_details[0]['index'], [1, len(input_ids)])
    interpreter.resize_tensor_input(input_details[1]['index'], [1])
    interpreter.resize_tensor_input(input_details[2]['index'], [1])
    interpreter.resize_tensor_input(input_details[3]['index'], [1])
    interpreter.resize_tensor_input(input_details[4]['index'], [1])
    interpreter.allocate_tensors()
    input_data = prepare_input(input_ids)
    for i, detail in enumerate(input_details):
        input_shape = detail['shape']
        interpreter.set_tensor(detail['index'], input_data[i])

    interpreter.invoke()

    # The function `get_tensor()` returns a copy of the tensor data.
    # Use `tensor()` in order to get a pointer to the tensor.
    return (interpreter.get_tensor(output_details[0]['index']),
            interpreter.get_tensor(output_details[1]['index']))

# initialize melgan model
with open('../examples/multiband_melgan/conf/multiband_melgan.v1.yaml') as f:
    mb_melgan_config = yaml.load(f, Loader=yaml.Loader)
mb_melgan_config = MultiBandMelGANGeneratorConfig(**mb_melgan_config["multiband_melgan_generator_params"])
mb_melgan = TFMBMelGANGenerator(config=mb_melgan_config, name='mb_melgan_generator')
mb_melgan._build()
mb_melgan.load_weights("../examples/multiband_melgan/exp/train.multiband_melgan.v1/checkpoints/generator-1000000.h5")

input_text = ""

decoder_output_tflite, mel_output_tflite = infer(input_text)
audio_before_tflite = mb_melgan(decoder_output_tflite)[0, :, 0]
audio_after_tflite = mb_melgan(mel_output_tflite)[0, :, 0]

appreciate your help

bug 🐛 wontfix

opened by Zak-SA 44

fastspeech2 training error

I have already created durations with MFA, and both preprocessing scripts (tensorflow-tts-preprocess, tensorflow-tts-normalize) ran without errors. But when I ran the training script, the following error occurred:

2020-08-12 02:19:06,034 (train_fastspeech2:289) INFO: batch_size = 16
2020-08-12 02:19:06,034 (train_fastspeech2:289) INFO: remove_short_samples = True
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: allow_cache = True
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: mel_length_threshold = 32
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: is_shuffle = True
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 5e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: train_max_steps = 200000
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: save_interval_steps = 5000
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: eval_interval_steps = 500
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: log_interval_steps = 200
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: num_save_intermediate_results = 1
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: train_dir = ./dump/train/
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: dev_dir = ./dump/valid/
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: use_norm = True
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: f0_stat = ./dump/stats_f0.npy
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: energy_stat = ./dump/stats_energy.npy
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: outdir = ./examples/fastspeech2/exp/train.fastspeech2.v1/
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: config = ./examples/fastspeech2/conf/fastspeech2.v1.yaml
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: resume =
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: verbose = 1
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: mixed_precision = True
2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: version = 0.6.1
Traceback (most recent call last):
  File "examples/fastspeech2/train_fastspeech2.py", line 400, in <module>
    main()
  File "examples/fastspeech2/train_fastspeech2.py", line 316, in main
    mel_length_threshold=mel_length_threshold,
  File "/home/speechlab/TensorflowTTS/examples/fastspeech2/fastspeech2_dataset.py", line 104, in __init__
    ), f"Number of charactor, mel, duration, f0 and energy files are different"
AssertionError: Number of charactor, mel, duration, f0 and energy files are different

How do I solve this problem? Can anybody help me? Thanks a lot!
bug 🐛 question ❓

opened by mataym 44

Tacotron2 produces random mel outputs during inference (french dataset)

Hi! I have trained tacotron2 for 52k steps on the SynPaFlex French dataset. I deleted sentences longer than 20 seconds from the dataset and ended up with around 30 hours of single-speaker data.

I made a custom synpaflex.py processor in ./tensorflow_tts/processor/ with these symbols (adapted to french without arpabet) :

_pad = "pad"
_eos = "eos"
_punctuation = "!/\'(),-.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzéèàùâêîôûçäëïöüÿœæ"

# Export all symbols:
SYNPAFLEX_SYMBOLS = (
    [_pad] + list(_punctuation) + list(_letters) + [_eos]
)

I used basic_cleaners for text cleaning.

In #182 the issue was similar, but the problem came from using tacotron2.v1.yaml as the configuration file. I am using my own tacotron2.synpaflex.v1.yaml for both training and inference.

During synthesis, mel outputs are completely random: the output is different even if the sentence is kept exactly the same. The audio signals sound like a French version of the WaveNet examples where no text has been provided during training, in the "Knowing What to Say" section of this page.

Here are my TensorBoard results:

I must be doing something wrong somehow, as I have been able to train on LJSpeech successfully... Any ideas?

bug 🐛

opened by samuel-lunii 41

Long sentences issue with FS2

It seems my FastSpeech2 implementation can't handle long sentences in some datasets such as KSS. For LJSpeech and other datasets, other people report that it is still fine. I'm thinking about the maximum length in the training set that my FS2 needs in order to handle long sentences. On my private dataset, it is always fine. Maybe 15s is enough.

question ❓ Discussion 😁 wontfix

opened by dathudeptrai 41

Pretrained fastspeech2 libritts model for testing?

Hi,

Thanks for the nice work. Is there a pretrained fastspeech2 libritts model for testing, like the one trained with ljspeech data? https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing
question ❓ wontfix

opened by ronggong 38
Add C++ inference example and code

This is complete C++ code (from text processing to saving audio) for inference with TensorflowTTS/FastSpeech2 (phonetic, MFA-aligned, from my fork) and Multi-Band MelGAN using the Tensorflow C API. It compiles and runs on Windows 64-bit out of the box (solution and project), but the code is cross-platform, assuming one provides the required libraries. The project builds a simple command-line program where one inputs sentences and they are generated and saved as WAVs.

There's a link for compiled binaries, libraries, and a sample model required to compile for Win64 in the README.

It will allow deploying TensorflowTTS models in a portable way into desktop environments.
enhancement 🚀 Feature Request 🤗

opened by ZDisket 38
Tacotron2 Pre-training have difficulties

Hello, I am a student who is learning with the Tacotron2 Kss dataset.

After pre-training Tacotron2 on KSS for 120k steps, TensorBoard shows the following results.

The loss in the "val" section keeps getting higher and higher.

When I export the output as a wav file, the speech cannot be made out.

I'd like to ask for your advice on this matter.

opened by Gyuub 0
Support Arabic Language

Are you open to supporting the Arabic language? You could use Dr. Nawar Halabi's dataset: https://www.kaggle.com/datasets/bc297d8ca0753cd21cdcacd7bd324c0c607361a14471c801f09b028a1ecb098e

opened by Muhammad-Abdelsattar 1
Fastspeech 2 Training error

When training a FastSpeech2 model with python "examples\fastspeech2\train_fastspeech2.py" and valid arguments, the script runs but then fails with: AssertionError: Number of charactor, mel, duration, f0 and energy files are different

I've looked at other issues, but none of them solved my problem. The dataset is on a different hard drive, but I don't get any "File not found" errors. Is there any way to fix this? I preprocessed and normalized with "ljspeech" as the config and dataset, and I'm training with "Fastspeech2.v1" as the config.
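The assertion only says that the file counts differ, not which utterances are affected. A small diagnostic like the sketch below can list the utterance IDs missing from each feature folder. The helper names, folder names and filename suffixes are all assumptions made for illustration; adjust them to whatever the preprocessing step actually wrote to disk.

```python
from pathlib import Path


def utterance_ids(folder, suffix):
    """Collect utterance IDs by stripping a feature-specific filename suffix."""
    return {p.name[: -len(suffix)] for p in Path(folder).glob(f"*{suffix}")}


def report_mismatches(feature_dirs):
    """Return {feature name: sorted IDs missing from that feature's folder}.

    `feature_dirs` maps a feature name to a (folder, suffix) pair.
    Features with no missing IDs are left out of the result.
    """
    ids = {
        name: utterance_ids(folder, suffix)
        for name, (folder, suffix) in feature_dirs.items()
    }
    all_ids = set().union(*ids.values())
    return {
        name: sorted(all_ids - found)
        for name, found in ids.items()
        if all_ids - found
    }


# Example call (hypothetical layout, not confirmed by the issue):
# report_mismatches({
#     "charactor": ("dump/train/ids", "-ids.npy"),
#     "mel": ("dump/train/norm-feats", "-norm-feats.npy"),
#     "duration": ("dump/train/durations", "-durations.npy"),
# })
```

An empty result means every ID is present for every listed feature, in which case the mismatch probably comes from a path or glob pattern passed to the training script rather than from missing files.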

opened by LxtteDev 6
not working with large text.

I want to synthesize long text with FastSpeech, but as I understand it, I need to change the configs. I can't find the exact place to make the change. I tried changing some parameters in the config files, but it did not work for me. Which file and which parameter exactly do I need to change?

mels, audios = do_synthesis(input_text, fastspeech, mb_melgan, "FASTSPEECH", "MB-MELGAN")

but i got error:

InvalidArgumentError: indices[0,2048] = 2049 is not in [0, 2049) [[node decoder/position_embeddings/Gather (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_tts/models/fastspeech.py:76) ]] [Op:__inference__inference_63215]

Errors may have originated from an input operation. Input Source operations connected to node decoder/position_embeddings/Gather: In[0] decoder/position_embeddings/Gather/resource: In[1] mul_1 (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_tts/models/fastspeech.py:872)
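The message indicates that the decoder's position-embedding lookup received index 2049 while the table only covers [0, 2049), so the generated mel sequence was longer than the model's position table. One possible workaround, sketched below under assumptions, is to split the input into shorter chunks and synthesize them one at a time. The function split_into_chunks and the max_chars value are hypothetical; the real limit is on mel frames, not characters, so the threshold has to be tuned by trial.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept as its own chunk
    rather than being cut in the middle.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?;])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the do_synthesis call shown in the issue, with the resulting audio arrays joined afterwards. Splitting at sentence boundaries keeps prosody more natural than cutting at a fixed character count.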
wontfix

opened by avraamya 1

Releases(v1.8)

v1.8(Aug 21, 2021)
Support Tacotron2/Mb-Melgan for French. See pull request and colab. Many thanks to Samuel Delalez.

Integrated with Huggingface Gradio web demo. See pull request

Upgrade TF2.3.1 to TF2.6.0 since some users confirm that it works fine.

v1.6.1(Jun 1, 2021)
Fix a bug when loading model_weights in TFAutoModel.

v1.6(Jun 1, 2021)
Release Notes

Support TFlite C++ inference. (https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/cpptflite)

Add an example for FastSpeech2 and MB-Melgan on IOS. (https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/ios)

Integrated with Huggingface Hub. (PR #555 #564 #566). Our pretrained models uploaded in https://huggingface.co/tensorspeech

Fix convergence problem with hifigan caused by large learning rate (#571)

v1.1(Jan 12, 2021)
Release Notes

Support German TTS with the Thorsten dataset (#405). Note that users should install german_transliterate.

Fix Savable bug in Tacotron2 and FastSpeech/FastSpeech2 (#446)

v0.11(Nov 25, 2020)
Release Notes

Released with TensorFlow 2.3.1

Support multi-GPU gradient accumulator.

Support HiFi-GAN vocoder.

Fix some bugs.

v0.9(Oct 4, 2020)
Release Notes

Supported both TensorFlow 2.2/2.

Faster Tacotron-2 training.

Stable training fastspeech/fastspeech2/tacotron2/mb-melgan.

Supported Eng/Chinese/Korean.

Supported ParallelWaveGAN.

Added C++ inference code.

v0.8(Aug 23, 2020)

Edit later ...
v0.7(Jul 11, 2020)
Release Notes

First release of TensorflowTTS.

Built against TensorFlow 2.2

Changelog

Apply black formatter.

Use pytest as default test runner.

TensorflowTTS Core

tensorflow_tts.bin

Multi-preprocess to calculate mel-spectrogram, f0, energy

Add code to calculate mean/std of mel-spectrogram, f0, energy

Add code to normalize mel-spectrogram, f0, energy based on its mean/std value

tensorflow_tts.config

Add configuration for FastSpeech

Add configuration for FastSpeech2

Add configuration for Tacotron-2

Add configuration for MelGAN

Add configuration for Multiband-MelGAN

tensorflow_tts.datasets

Add dataset abstract based on tf.data

Add dataloader for mel-spectrogram

Add dataloader for audio

tensorflow_tts.losses

Add MultiScale STFT Loss

Add Mel-spectrogram Loss

tensorflow_tts.models

Add FastSpeech modeling

Add FastSpeech2 modeling

Add Melgan modeling

Add Multiband-melgan modeling

Add Tacotron-2 modeling

tensorflow_tts.optimizers

Add adam-weightdecay optimizers

tensorflow_tts.processor

Add LJSpeech processor for English (character-based).

tensorflow_tts.trainers

Add base trainer including GanBasedTrainer and Seq2SeqTrainer

tensorflow_tts.utils

Add seq2seq dynamic decoder

Add cleaner for english text

Add group convolution for melgan

Add batch Griffin-Lim version based on librosa and Tensorflow

Add number normalization

Add function to detect outlier from 1D array

Add weight-norm layer

NoteBooks

Add notebook for GL inference

Add notebook for converting FastSpeech/FastSpeech2/Melgan/Mb-melgan/Tacotron-2 to pb and running inference

Add notebook for converting FastSpeech/FastSpeech2/Tacotron-2 to tflite and running inference

Examples

Add example for training fastspeech

Add example for training fastspeech2

Add example for training tacotron-2

Add example for training melgan

Add example for training melgan.stft

Add example for training multiband melgan

Thanks to our Contributors

@erogol @azraelkuan @l4zyf9x @myagues @sujeendran @MokkeMeguru @jaeyoo @dathudeptrai



