Byte-based multilingual transformer TTS for low-resource/few-shot language adaptation.

Overview

One model to speak them all 🌎

Audio Language Text
Chinese 人人生而自由,在尊严和权利上一律平等。
English All human beings are born free and equal in dignity and rights.
Japanese すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについてびょうどうである。
Korean 모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.
German Alle Menschen sind frei und gleich an Würde und Rechten geboren.
Russian Все люди рождаются свободными и равными в своем достоинстве и правах.
Spanish Todos los seres humanos nacen libres e iguales en dignidad y derechos.
Gujarati પ્રતિષ્ઠા અને અધિકારોની દૃષ્ટિએ સર્વ માનવો જન્મથી સ્વતંત્ર અને સમાન હોય છે.
...even when there are only 30 utterances for training
Norwegian Alle mennesker er født frie og med samme menneskeverd og menneskerettigheter.
Romanian Toate ființele umane se nasc libere și egale în demnitate și în drepturi.
Greek Όλοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα.

This is an implementation of the paper Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis, which can handle 40+ languages in a single model, and learn a brand new language in few shots or minutes of recordings. The code is partially based on the open-source Tacotron2 and Transformer-TTS. More audio samples of the paper are available here.

Quickstart

We follow the paper's training recipe, but with open datasets instead. By a combination of 15 speech datasets with 572 speakers in 38 languages, we can reach results similar to what we demonstrated in the paper to an extent, as shown by the audio samples above. These datasets are listed below, the preprocessor scripts below are located at corpora/. Locations and details to download the data are also given in the respective preprocessor.

Name Preprocessor script name Languages
M-AILABS caito es-es, fr-fr, de-de, uk-ua, ru-ru, pl-pl, it-it, en-us, en-uk
CSS-10 css10 es-es, fr-fr, ja-jp, de-de, fi-fi, hu-hu, ja-jp, nl-nl, ru-ru, zh-cn
SIWIS siwis fr-fr
JSUT jsut ja-jp
KSS kss ko-kr
Databaker databaker zh-cn
LJSpeech ljspeech en-us
NST nst da-dk, nb-no
TTS-Portuguese portuguese pt-br
Thorsten Mueller thorsten de-de
Google google bn-bd, bn-in, ca-es, eu-es, gl-es, gu-in, jv-id, km-kh, kn-in, ml-in, mr-in, my-mm, ne-np, si-lk, su-id, ta-in, te-in, yo-ng
RuLS lsru ru-ru
English Bible enbible en-us
Hifi-TTS hifitts en-us, en-uk
RSS rss ro-ro

Preprocessing

  1. Please download and extract these datasets to the dataset_path specified in corpora/__init__.py. You can change the dataset_path, transformed_path and packed_path to your own.
  2. Run the preprocessor for each dataset given in corpora. The results are saved to transformed_path. include_corpus in corpora/__init__.py could be modified to add or remove datasets to be used. Particularly, you may refer to the preprocessors to include your own datasets to the training,
    and then add the dataset to include_corpus and dataset_language in corpora/__init__.py.
  3. Run the corpora/process_corpus.py, which filters the dataset, trims the audios, produces the metadata, generates the mel spectrograms, and pack all the features into a single zip file. The processed dataset will be put at packed_path, which uses around 100GB space. See the script for details.

Training

Similarly, we split the dataset into three tiers. Below are the commands to train and evaluate on each tier. Please substitute the directories with your own. The evaluation script can be run simultaneously with the training script. You may also use the evaluation script to synthesize samples from pretrained models. Please refer to the help of the arguments for their meanings.

Besides, to report CER, you need to create azure_key.json with your own Azure STT subscription, with content of {"subscription": "YOUR_KEY", "region": "YOUR_REGION"}, see utils/transcribe.py. Due to significant differences of the datasets used, the implementation is for demonstration only and could not fully reproduce the results in the paper.

T1

python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:ja-jp:es-es --warmup_languages=en-us --ddp=True --eval_steps=40000:100000

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=100000 --eval_languages=en-us:de-de:ja-jp

T2

python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn --ddp=True --hparams="warmup_steps=350000" --restore_from=T1_MODEL_DIR/model.ckpt-350000 --eval_steps=400000:450000 --eval_languages=zh-cn:ru-ru:it-it

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=400000 --eval_languages=zh-cn:ru-ru:it-it

T3

python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn:nl-nl:fi-fi: ko-kr:eu-es:pt-br:hu-hu:jv-id:gl-es:gu-in:kn-in:da-dk:su-id:ta-in:ca-es:ml-in:te-in:my-mm:yo-ng:km-kh:mr-in:ne-np:bn-bd: bn-in:si-lk --ddp=True --hparams="warmup_steps=650000,batch_frame_quad_limit=6500000" --restore_from=T2_MODEL_DIR/model.ckpt-650000 --eval_steps=700000:750000 --eval_languages=ko-kr:da-dk:te-in

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=700000 --eval_languages=ko-kr:da-dk:te-in

Few-shot adaptation

Norwegian Bokmal (nb-no), Greek (el-gr), and Romanian (ro-ro) are excluded from the training dataset and can be used for few-shot/low-resource adaptation. The command below gives an example for adaptation to el-gr with 100 samples, and you may substitute the --adapt_languages and --downsample_languages with your own.

python -m torch.distributed.launch --nproc_per_node=NGPU train.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --training_languages=en-us:de-de:fr-fr:ru-ru:en-uk:es-es:uk-ua:pl-pl:it-it:ja-jp:zh-cn:nl-nl:fi-fi: ko-kr:eu-es:pt-br:hu-hu:jv-id:gl-es:gu-in:kn-in:da-dk:su-id:ta-in:ca-es:ml-in:te-in:my-mm:yo-ng:km-kh:mr-in:ne-np: bn-bd:bn-in:si-lk --adapt_languages=el-gr --downsample_languages=el-gr:100 --ddp=True --hparams="warmup_steps=800000" --restore_from=T3_MODEL_DIR/model.ckpt-700000

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=DATA_DIR --start_step=700000 --eval_languages=el-gr

Performance

Below listed the best CERs of selected languages reached by models from each tier on these open datasets, as well as the CERs on few-shot adaptation. The CERs are based on Azure Speech-to-Text.

T1 en-us de-de ja-jp
2.68% 2.17% 19.06%
T2 it-it ru-ru zh-cn
1.95% 3.21% 7.30%
T3 da-dk ko-kr te-in
1.31% 0.94% 4.41%

Adaptation

#Samples nb-no el-gr ro-ro
30 9.18% 5.71% 5.58%
100 3.63% 4.63% 4.89%

Pretrained Models

The pretrained models are available at OneDrive Link. Metadata for eval are also given to aid fast reproduction. Below listed are the models provided.

Base models

  • T1 350k steps, ready for T2
  • T2 650k steps, ready for T3
  • T3 700k steps, ready for adaptation
  • T3 1.16M steps, which reaches satisfactory performances on most languages

Few-shot adaptation

  • nb-no, 30 samples, at 710k steps
  • nb-no, 100 samples, at 750k steps
  • el-gr, 30 samples, at 1M steps
  • el-gr, 100 samples, at 820k steps
  • ro-ro, 30 samples, at 970k steps
  • ro-ro, 100 samples, at 910k steps

Synthesis

To synthesize audios from the pretrained models, download the models along with the metadata files (lang_id.json and spk_id.json). Since there are no ground truth mels, you need to create metadata with dummy mel targets information , and run eval.py without neither --zipfilepath specified nor mels.zip present in --data-dir. The metadata file takes the form of SPEAKERNAME_FILEID|DUMMY_LENGTH|TEXT|LANG for each line of the file. For example, you can generate the audio examples above by saving the following metadata to script.txt:

databaker_0|500|人人生而自由,在尊严和权利上一律平等。|zh-cn
ljspeech_0|500|All human beings are born free and equal in dignity and rights.|en-us
jsut_0|500|すべての人間は、生まれながらにして自由であり、かつ、尊厳と権利とについてびょうどうである。|ja-jp
kss_0|500|모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.|ko-kr
thorsten_0|500|Alle Menschen sind frei und gleich an Würde und Rechten geboren.|de-de
hajdurova_0|500|Все люди рождаются свободными и равными в своем достоинстве и правах.|ru-ru
tux_0|500|Todos los seres humanos nacen libres e iguales en dignidad y derechos.|es-es
guf02858_0|500|પ્રતિષ્ઠા અને અધિકારોની દૃષ્ટિએ સર્વ માનવો જન્મથી સ્વતંત્ર અને સમાન હોય છે.|gu-in

, and with the command python eval.py --model-dir=T3_MODEL_DIR --log-dir=OUTPUT_DIR --data-dir=METADATA_DIR --eval_meta=script.txt --eval_step=1160000 --no_wait=True. You may refer to lang_id.json and spk_id.json to synthesize audios with other languages or speakers.

The waveforms are produced by Griffin-Lim, while mel spectrograms are also saved to SPEAKERNAME_FILEID.npy, which are normalized to a [-4, 4] range. Pretrained vocoders like Wavenet can be used to reach better quality. Those using recipes similar to Tacotron2 should be applicable to these mels, although you need to map mels to a range of [0, 1], simply by mels = (mels + 8) / 4.

Owner
Mutian He
Mutian He
Official PyTorch implementation of Data-free Knowledge Distillation for Object Detection, WACV 2021.

Introduction This repository is the official PyTorch implementation of Data-free Knowledge Distillation for Object Detection, WACV 2021. Data-free Kno

NVIDIA Research Projects 50 Jan 05, 2023
Awesome AI Learning with +100 AI Cheat-Sheets, Free online Books, Top Courses, Best Videos and Lectures, Papers, Tutorials, +99 Researchers, Premium Websites, +121 Datasets, Conferences, Frameworks, Tools

All about AI with Cheat-Sheets(+100 Cheat-sheets), Free Online Books, Courses, Videos and Lectures, Papers, Tutorials, Researchers, Websites, Datasets

Niraj Lunavat 1.2k Jan 01, 2023
Python PID Tuner - Makes a model of the System from a Process Reaction Curve and calculates PID Gains

PythonPID_Tuner_SOPDT Step 1: Takes a Process Reaction Curve in csv format - assumes data at 100ms interval (column names CV and PV) Step 2: Makes a r

1 Jan 18, 2022
Differentiable Wavetable Synthesis

Differentiable Wavetable Synthesis

4 Feb 11, 2022
This is the source code for the experiments related to the paper Unsupervised Audio Source Separation Using Differentiable Parametric Source Models

Unsupervised Audio Source Separation Using Differentiable Parametric Source Models This is the source code for the experiments related to the paper Un

30 Oct 19, 2022
Rafael Project- Classifying rockets to different types using data science algorithms.

Rocket-Classify Rafael Project- Classifying rockets to different types using data science algorithms. In this project we received data base with data

Hadassah Engel 5 Sep 18, 2021
Norm-based Analysis of Transformer

Norm-based Analysis of Transformer Implementations for 2 papers introducing to analyze Transformers using vector norms: Kobayashi+'20 Attention is Not

Goro Kobayashi 52 Dec 05, 2022
DCGAN LSGAN WGAN-GP DRAGAN PyTorch

Recommendation Our GAN based work for facial attribute editing - AttGAN. News 8 April 2019: We re-implement these GANs by Tensorflow 2! The old versio

Zhenliang He 408 Nov 30, 2022
code for Grapadora research paper experimentation

Road feature embedding selection method Code for research paper experimentation Abstract Traffic forecasting models rely on data that needs to be sens

Eric López Manibardo 0 May 26, 2022
Python script that analyses the given datasets and comes up with the best polynomial regression representation with the smallest polynomial degree possible

Python script that analyses the given datasets and comes up with the best polynomial regression representation with the smallest polynomial degree possible, to be the most reliable with the least com

Nikolas B Virionis 2 Aug 01, 2022
Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization

FAC-Net Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization Linjiang Huang (CUHK), Liang Wang (CASIA), Hongsheng

21 Nov 22, 2022
Training DALL-E with volunteers from all over the Internet using hivemind and dalle-pytorch (NeurIPS 2021 demo)

Training DALL-E with volunteers from all over the Internet This repository is a part of the NeurIPS 2021 demonstration "Training Transformers Together

<a href=[email protected]"> 19 Dec 13, 2022
Depression Asisstant GDSC Challenge Solution

Depression Asisstant can help you give solution. Please using Python version 3.9.5 for contribute.

Ananda Rauf 1 Jan 30, 2022
Train Scene Graph Generation for Visual Genome and GQA in PyTorch >= 1.2 with improved zero and few-shot generalization.

Scene Graph Generation Object Detections Ground truth Scene Graph Generated Scene Graph In this visualization, woman sitting on rock is a zero-shot tr

Boris Knyazev 93 Dec 28, 2022
Trains an agent with stochastic policy gradient ascent to solve the Lunar Lander challenge from OpenAI

Introduction This script trains an agent with stochastic policy gradient ascent to solve the Lunar Lander challenge from OpenAI. In order to run this

Momin Haider 0 Jan 02, 2022
Code and data for "Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning" (EMNLP 2021).

GD-VCR Code for Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning (EMNLP 2021). Research Questions and Aims: How well can a model perform o

Da Yin 24 Oct 13, 2022
Radar-to-Lidar: Heterogeneous Place Recognition via Joint Learning

radar-to-lidar-place-recognition This page is the coder of a pre-print, implemented by PyTorch. If you have some questions on this project, please fee

Huan Yin 37 Oct 09, 2022
Source code for CVPR 2021 paper "Riggable 3D Face Reconstruction via In-Network Optimization"

Riggable 3D Face Reconstruction via In-Network Optimization Source code for CVPR 2021 paper "Riggable 3D Face Reconstruction via In-Network Optimizati

130 Jan 02, 2023
The official implementation of NeurIPS 2021 paper: Finding Optimal Tangent Points for Reducing Distortions of Hard-label Attacks

Introduction This repository includes the source code for "Finding Optimal Tangent Points for Reducing Distortions of Hard-label Attacks", which is pu

machen 11 Nov 27, 2022
Official PyTorch Implementation of "AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting".

AgentFormer This repo contains the official implementation of our paper: AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecast

Ye Yuan 161 Dec 23, 2022