Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech


This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.



SyntaSpeech builds on PortaSpeech (NeurIPS 2021) and adds three new features:

  1. We propose a Syntactic Graph Builder (Sec. 3.1) and a Syntactic Graph Encoder (Sec. 3.2), which together prove effective at extracting syntactic features that improve the prosody modeling and duration accuracy of the TTS model.
  2. We introduce Multi-Length Adversarial Training (Sec. 3.3), which can replace the flow-based post-net in PortaSpeech, speeding up inference and improving the naturalness of the audio.
  3. We support three datasets: LJSpeech (single-speaker English), Biaobei (single-speaker Chinese), and LibriTTS (multi-speaker English).

Environments

conda create -n synta python=3.7
conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install DGL for graph neural networks; dgl-cu102 supports the RTX 2080, dgl-cu113 supports the RTX 3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install the forced-alignment tool (MFA)
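
After installation, a quick import check can confirm that the key packages are visible to the interpreter. The sketch below is not part of the repository: the helper name missing_packages and the package list are assumptions chosen for illustration, and it only tests importability, not version or CUDA compatibility.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list based on the install commands above; adjust as needed.
REQUIRED = ["numpy", "torch", "dgl"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Run it inside the activated synta environment; an empty result suggests the pip steps completed, although it does not prove that the DGL build matches your GPU.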

Run SyntaSpeech!

Follow the steps below to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

For LibriTTS, download the raw dataset and process it with our data_gen modules. Detailed instructions can be found in docs/prepare_data.

Vocoder Preparation

We provide pre-trained vocoders for all three datasets: HiFi-GAN for LJSpeech and Biaobei, and ParallelWaveGAN for LibriTTS. Download and unzip them into the checkpoints/ folder.

2. Training Example

You can then train SyntaSpeech on any of the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # training in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset # training in LibriTTS

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference in LibriTTS
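
The training and inference commands differ only in the config path, the experiment name, and the --infer flag, so they can be assembled programmatically to avoid copy-paste slips. The following is a minimal sketch, not part of the repository: build_run_command and launch are hypothetical helpers, and only the flags shown in the commands above are used.

```python
import os
import subprocess


def build_run_command(config, exp_name, infer=False):
    """Assemble the argument list for tasks/run.py from the pieces that vary."""
    argv = ["python", "tasks/run.py",
            "--config", config,
            "--exp_name", exp_name,
            "--reset"]
    if infer:
        argv.append("--infer")
    return argv


def launch(config, exp_name, infer=False, gpu="0", dry_run=True):
    """Print the command (dry run) or execute it from the repository root."""
    argv = build_run_command(config, exp_name, infer)
    if dry_run:
        print(f"CUDA_VISIBLE_DEVICES={gpu}", " ".join(argv))
        return argv
    # Mirror the environment variables set in the shell examples.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu, PYTHONPATH="./")
    subprocess.run(argv, env=env, check=True)
    return argv


train_cmd = launch("egs/tts/lj/synta.yaml", "lj_synta")
infer_cmd = launch("egs/tts/lj/synta.yaml", "lj_synta", infer=True)
```

With dry_run=True the helper only prints the command, which makes it easy to review the config and experiment name before starting a long run.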

Audio Demos

Audio samples from the paper can be found on our demo page.

We also provide a Hugging Face demo page for LJSpeech, where you can try your own sentences.

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our code is based on the following repos:

Comments
  • pinyin preprocess problem

    005804 你当#1我傻啊#3?脑子#1那么大#2怎么#1塞进去#4? ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

    txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

    ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']

    What is 'a_?_n_ao3'?

    Entries such as ch_a1_d_ou1 and a_?_n_ao3 appear in the mfa_dict.

    opened by windowxiaoming 2
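
In the txt_struct above, the phonemes drift out of step with the characters once the first "?" is reached, which suggests that punctuation is consuming a syllable slot. One possible way to keep the two sequences aligned is sketched below. It is an illustration only, not the repository's preprocessing: align_chars_to_syllables and the PUNCTUATION set are hypothetical, and the sketch assumes exactly one pinyin syllable per non-punctuation character.

```python
import re

# Hypothetical punctuation set; extend it to match the dataset.
PUNCTUATION = set("?!,.;:、，。？！；：")


def align_chars_to_syllables(text, syllables, punctuation=PUNCTUATION):
    """Pair each character with one syllable; punctuation is paired with None."""
    pairs = []
    remaining = iter(syllables)
    for ch in text:
        if ch in punctuation:
            pairs.append((ch, None))
            continue
        try:
            pairs.append((ch, next(remaining)))
        except StopIteration:
            raise ValueError(f"ran out of syllables at character {ch!r}") from None
    leftover = list(remaining)
    if leftover:
        raise ValueError(f"unused syllables: {leftover}")
    return pairs


raw = "你当#1我傻啊#3?脑子#1那么大#2怎么#1塞进去#4?"
text = re.sub(r"#\d", "", raw)  # drop the prosody markers (#1 to #4)
syllables = "ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4".split()

pairs = align_chars_to_syllables(text, syllables)
for ch, syllable in pairs:
    print(ch, syllable)
```

The explicit errors make a length mismatch visible at preprocessing time instead of silently shifting phonemes onto the wrong characters.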
  • discriminator output['y_c'] never used

    The discriminator's output['y_c'] is never used, and it is never computed in the discriminator's forward function. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

    opened by mayfool 2
  • A question of KL divergence calculation

    In modules/tts/portaspeech/fvae.py, SyntaFVAE computes loss_kl (line 121) as loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1]. Can someone explain why? I think loss_kl should be computed as loss_kl = logqx.exp() * (logqx - logpx). I would be very grateful for a reply!

    opened by JiaYK 2
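
A possible explanation, not confirmed by the authors: KL(q || p) is the expectation of log q(z) - log p(z) under z ~ q. If the latent z in that code is sampled from q, as is usual for a VAE, then averaging logqx - logpx over the samples is already a Monte Carlo estimate of the KL term, and multiplying by q(z) again would weight each sample twice. The sketch below checks this numerically for two 1-D Gaussians using only the standard library; all function names and parameter values are illustrative and unrelated to the repository code.

```python
import math
import random


def log_normal_pdf(z, mean, std):
    """Log density of a 1-D Gaussian."""
    return (-0.5 * math.log(2 * math.pi) - math.log(std)
            - 0.5 * ((z - mean) / std) ** 2)


def mc_kl(mean_q, std_q, mean_p, std_p, n_samples, seed=0):
    """Estimate KL(q || p) by averaging log q(z) - log p(z) over z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mean_q, std_q)  # sample from q, not from p
        total += (log_normal_pdf(z, mean_q, std_q)
                  - log_normal_pdf(z, mean_p, std_p))
    return total / n_samples


def exact_kl(mean_q, std_q, mean_p, std_p):
    """Closed-form KL divergence between two 1-D Gaussians."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2 * std_p ** 2)
            - 0.5)


estimate = mc_kl(0.5, 0.8, 0.0, 1.0, n_samples=200_000)
reference = exact_kl(0.5, 0.8, 0.0, 1.0)
print(f"Monte Carlo: {estimate:.4f}  closed form: {reference:.4f}")
```

The two values should agree closely, which illustrates why a plain masked average of logqx - logpx can serve as the KL loss when the samples come from q.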
  • mfa for multi speaker.

    The code groups MFA inputs for better parallelism. For multi-speaker data, this may go wrong. For the input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1, the resulting TextGrid is:

    	item [1]:
    		class = "IntervalTier"
    		name = "words"
    		xmin = 0.0
    		xmax = 9.4444
    		intervals: size = 56
    			intervals [1]:
    				xmin = 0
    				xmax = 0.5700000000000001
    				text = ""
    			intervals [2]:
    				xmin = 0.5700000000000001
    				xmax = 0.61
    				text = "eng"
    			intervals [3]:
    				xmin = 0.61
    				xmax = 0.79
    				text = "s_an1"
    			intervals [4]:
    				xmin = 0.79
    				xmax = 0.89
    				text = "eng"
    			intervals [5]:
    				xmin = 0.89
    				xmax = 1.06
    				text = "i1"
    			intervals [6]:
    				xmin = 1.06
    				xmax = 1.24
    				text = "eng"
    			intervals [7]:
    				xmin = 1.24
    				xmax = 1.3
    				text = ""
    			intervals [8]:
    				xmin = 1.3
    				xmax = 1.36
    				text = "s_an1"
    			intervals [9]:
    				xmin = 1.36
    				xmax = 1.42
    				text = ""
    			intervals [10]:
    				xmin = 1.42
    				xmax = 1.49
    				text = "eng"
    			intervals [11]:
    				xmin = 1.49
    				xmax = 1.67
    				text = "s_i4"
    			intervals [12]:
    				xmin = 1.67
    				xmax = 1.78
    				text = "eng"
    			intervals [13]:
    				xmin = 1.78
    				xmax = 1.91
    				text = ""
    			intervals [14]:
    				xmin = 1.91
    				xmax = 1.96
    				text = "er4"
    			intervals [15]:
    				xmin = 1.96
    				xmax = 2.06
    				text = "eng"
    			intervals [16]:
    				xmin = 2.06
    				xmax = 2.19
    				text = ""
    			intervals [17]:
    				xmin = 2.19
    				xmax = 2.35
    				text = "i1"
    			intervals [18]:
    				xmin = 2.35
    				xmax = 2.53
    				text = "eng"
    			intervals [19]:
    				xmin = 2.53
    				xmax = 3.03
    				text = "i1"
    			intervals [20]:
    				xmin = 3.03
    				xmax = 3.42
    				text = "eng"
    			intervals [21]:
    				xmin = 3.42
    				xmax = 3.48
    				text = "i1"
    			intervals [22]:
    				xmin = 3.48
    				xmax = 3.6
    				text = ""
    			intervals [23]:
    				xmin = 3.6
    				xmax = 3.64
    				text = "eng"
    			intervals [24]:
    				xmin = 3.64
    				xmax = 3.86
    				text = "i1"
    			intervals [25]:
    				xmin = 3.86
    				xmax = 3.99
    				text = "eng"
    			intervals [26]:
    				xmin = 3.99
    				xmax = 4.59
    				text = ""
    			intervals [27]:
    				xmin = 4.59
    				xmax = 4.869999999999999
    				text = "er4"
    			intervals [28]:
    				xmin = 4.869999999999999
    				xmax = 4.9799999999999995
    				text = "eng"
    			intervals [29]:
    				xmin = 4.9799999999999995
    				xmax = 5.1899999999999995
    				text = "s_i4"
    			intervals [30]:
    				xmin = 5.1899999999999995
    				xmax = 5.34
    				text = ""
    			intervals [31]:
    				xmin = 5.34
    				xmax = 5.43
    				text = "eng"
    			intervals [32]:
    				xmin = 5.43
    				xmax = 5.6
    				text = ""
    			intervals [33]:
    				xmin = 5.6
    				xmax = 5.76
    				text = "i1"
    			intervals [34]:
    				xmin = 5.76
    				xmax = 6.279999999999999
    				text = "eng"
    			intervals [35]:
    				xmin = 6.279999999999999
    				xmax = 6.359999999999999
    				text = "s_an1"
    			intervals [36]:
    				xmin = 6.359999999999999
    				xmax = 6.47
    				text = ""
    			intervals [37]:
    				xmin = 6.47
    				xmax = 6.6
    				text = "eng"
    			intervals [38]:
    				xmin = 6.6
    				xmax = 6.9399999999999995
    				text = "i1"
    			intervals [39]:
    				xmin = 6.9399999999999995
    				xmax = 7.039999999999999
    				text = "eng"
    			intervals [40]:
    				xmin = 7.039999999999999
    				xmax = 7.289999999999999
    				text = "s_an1"
    			intervals [41]:
    				xmin = 7.289999999999999
    				xmax = 7.369999999999999
    				text = "eng"
    			intervals [42]:
    				xmin = 7.369999999999999
    				xmax = 7.6
    				text = "s_i4"
    			intervals [43]:
    				xmin = 7.6
    				xmax = 7.699999999999999
    				text = "eng"
    			intervals [44]:
    				xmin = 7.699999999999999
    				xmax = 7.869999999999999
    				text = ""
    			intervals [45]:
    				xmin = 7.869999999999999
    				xmax = 8.049999999999999
    				text = "er4"
    			intervals [46]:
    				xmin = 8.049999999999999
    				xmax = 8.26
    				text = ""
    			intervals [47]:
    				xmin = 8.26
    				xmax = 8.299999999999999
    				text = "eng"
    			intervals [48]:
    				xmin = 8.299999999999999
    				xmax = 8.36
    				text = "s_i4"
    			intervals [49]:
    				xmin = 8.36
    				xmax = 8.389999999999999
    				text = ""
    			intervals [50]:
    				xmin = 8.389999999999999
    				xmax = 8.42
    				text = "eng"
    			intervals [51]:
    				xmin = 8.42
    				xmax = 8.45
    				text = ""
    			intervals [52]:
    				xmin = 8.45
    				xmax = 8.59
    				text = "s_an1"
    			intervals [53]:
    				xmin = 8.59
    				xmax = 8.83
    				text = ""
    			intervals [54]:
    				xmin = 8.83
    				xmax = 9.1
    				text = "eng"
    			intervals [55]:
    				xmin = 9.1
    				xmax = 9.44
    				text = "i1"
    			intervals [56]:
    				xmin = 9.44
    				xmax = 9.4444
    				text = ""
    
    opened by leon2milan 2
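
A mismatch like the one above can be caught automatically by comparing the non-empty labels in the TextGrid word tier with the words that were sent to MFA. The following is a rough sketch under stated assumptions: textgrid_labels and alignment_matches are hypothetical helpers, the parser is a simple regular expression over the long TextGrid text format rather than a full parser, and it expects to receive only the "words" tier. The sample is abbreviated from the listing above.

```python
import re

# Matches lines such as:  text = "s_an1"
TEXT_LINE = re.compile(r'^\s*text = "(.*)"\s*$')


def textgrid_labels(tier_text):
    """Return the non-empty interval labels found in a TextGrid tier."""
    labels = []
    for line in tier_text.splitlines():
        match = TEXT_LINE.match(line)
        if match and match.group(1):
            labels.append(match.group(1))
    return labels


def alignment_matches(tier_text, expected_words):
    """True when the aligned labels equal the expected word sequence."""
    return textgrid_labels(tier_text) == list(expected_words)


sample = '''
intervals [1]:
    xmin = 0
    xmax = 0.57
    text = ""
intervals [2]:
    xmin = 0.57
    xmax = 0.61
    text = "eng"
intervals [3]:
    xmin = 0.61
    xmax = 0.79
    text = "s_an1"
'''
expected = "g_uang3 zh_ou1".split()

print(textgrid_labels(sample))
print(alignment_matches(sample, expected))
```

Running such a check per utterance after alignment would flag files whose labels belong to a different transcript, which is the symptom reported here.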
  • Problem with DDP

    Hello, I have been experimenting with this repo, which is excellent work. However, DDP does not seem to take effect. Am I launching it incorrectly?

    CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset

    opened by zhazl 0
Releases(v1.0.0)
Owner
Zhenhui YE
I am currently a second-year computer science Ph.D student at Zhejiang University, working on deep learning and reinforcement learning.