PyTorch Implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Overview

DiffGAN-TTS - PyTorch Implementation

Repository Status

  • Naive Version of DiffGAN-TTS
  • Active Shallow Diffusion Mechanism: DiffGAN-TTS (two-stage)

Audio Samples

Audio samples are available at /demo.

Quickstart

In the instructions below, DATASET refers to the name of a dataset, such as LJSpeech or VCTK.

MODEL refers to the model type (one of 'naive', 'aux', 'shallow').

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

Download the pretrained models and put them in

  • output/ckpt/DATASET_naive/ for 'naive' model.
  • output/ckpt/DATASET_shallow/ for 'shallow' model. Please note that the checkpoint of the 'shallow' model contains both 'shallow' and 'aux' models, and these two models will share all directories except results throughout the whole process.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
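
The two single-utterance commands above differ only in the --speaker_id flag. As a rough sketch (the helper name build_synthesize_cmd is hypothetical and not part of the repository; only the flags shown above are used), a small wrapper can assemble either form:

```python
import shlex


def build_synthesize_cmd(text, model, restore_step, dataset, speaker_id=None):
    """Return the argv list for a single-utterance synthesize.py call.

    Hypothetical helper: it only mirrors the flags documented in this README.
    """
    if model not in ("naive", "aux", "shallow"):
        raise ValueError(f"unknown model type: {model!r}")
    cmd = [
        "python3", "synthesize.py",
        "--text", text,
        "--model", model,
        "--restore_step", str(restore_step),
        "--mode", "single",
        "--dataset", dataset,
    ]
    if speaker_id is not None:
        # Only the multi-speaker command passes a speaker ID.
        cmd += ["--speaker_id", str(speaker_id)]
    return cmd


# Example values only; use the step of the checkpoint you downloaded.
single_cmd = build_synthesize_cmd("Hello world", "naive", 300000, "LJSpeech")
print(shlex.join(single_cmd))
```

The resulting list can be passed to subprocess.run from the repository root.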

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, one can speed up the speech by scaling durations to 0.8x and lower the volume by scaling energy to 0.8x with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Please note that this controllability originates from FastSpeech2 and is not a core focus of DiffGAN-TTS.
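
To make the meaning of a control ratio concrete, here is a minimal sketch that assumes a FastSpeech2-style variance adaptor, where each ratio is a multiplier applied to the predicted values. The function names and the rounding rule are illustrative assumptions, not the repository's code:

```python
def scale_durations(durations, duration_control=1.0):
    """Scale per-phoneme frame counts; a ratio below 1.0 shortens the utterance.

    Assumption: durations are integer frame counts and are rounded after scaling.
    """
    return [max(0, round(d * duration_control)) for d in durations]


def scale_values(values, control=1.0):
    """Scale pitch or energy values by a constant ratio."""
    return [v * control for v in values]


durations = [5, 10, 20]  # made-up frames per phoneme
fast = scale_durations(durations, duration_control=0.8)
quiet = scale_values([1.0, 2.0], control=0.8)
print(sum(durations), "frames ->", sum(fast), "frames")
```

With a ratio of 0.8 the same phonemes occupy fewer frames, which is what makes the speech faster.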

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.

Preprocessing

  • For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.

  • Run

    python3 prepare_align.py --dataset DATASET
    

    for some preparations.

    For forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DATASET
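
Since preprocessing depends on the TextGrid files being in place, a quick check before running preprocess.py can save a failed run. The sketch below assumes the raw corpus is laid out as RAW_DIR/SPEAKER/BASENAME.lab and that alignments live at preprocessed_data/DATASET/TextGrid/SPEAKER/BASENAME.TextGrid. Both the layout and the helper name missing_textgrids are assumptions for illustration:

```python
import tempfile
from pathlib import Path


def missing_textgrids(raw_dir, preprocessed_dir):
    """Return the TextGrid paths that are expected but not present."""
    tg_root = Path(preprocessed_dir) / "TextGrid"
    missing = []
    for lab in sorted(Path(raw_dir).glob("*/*.lab")):
        speaker, basename = lab.parent.name, lab.stem
        tg_path = tg_root / speaker / f"{basename}.TextGrid"
        if not tg_path.is_file():
            missing.append(tg_path)
    return missing


# Demo on a throwaway tree with one aligned and one unaligned utterance.
with tempfile.TemporaryDirectory() as tmp:
    raw, pre = Path(tmp) / "raw", Path(tmp) / "pre"
    (raw / "p225").mkdir(parents=True)
    (pre / "TextGrid" / "p225").mkdir(parents=True)
    for name in ("p225_001", "p225_002"):
        (raw / "p225" / f"{name}.lab").write_text("hello")
    (pre / "TextGrid" / "p225" / "p225_001.TextGrid").write_text("")
    demo_missing = [p.name for p in missing_textgrids(raw, pre)]
    print(demo_missing)
```

An empty result suggests every utterance has an alignment; any listed path points at an archive that was not unzipped or an aligner run that did not finish.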
    

Training

You can train three types of model: 'naive', 'aux', and 'shallow'.

  • Training Naive Version ('naive'):

    Train the naive version with

    python3 train.py --model naive --dataset DATASET
    
  • Training Basic Acoustic Model for Shallow Version ('aux'):

    To train the shallow version, we need a pre-trained FastSpeech2. The command below trains the FastSpeech2 modules, including the Auxiliary (Mel) Decoder.

    python3 train.py --model aux --dataset DATASET
    
  • Training Shallow Version ('shallow'):

    To leverage the pre-trained FastSpeech2, including the Auxiliary (Mel) Decoder, you must pass --restore_step with the final step of the auxiliary FastSpeech2 training, as in the following command.

    python3 train.py --model shallow --restore_step RESTORE_STEP --dataset DATASET
    

    For example, if the last checkpoint was saved at 200000 steps during the auxiliary training, set --restore_step to 200000. The script then loads and freezes the aux model and continues training under the active shallow diffusion mechanism.
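
Finding the value for --restore_step can be automated. The sketch below scans a checkpoint directory for the highest saved step. The file name pattern STEP.pth.tar is an assumption about how checkpoints are named, and latest_step is a hypothetical helper, so adjust the pattern to whatever your output/ckpt/ directory actually contains:

```python
import re
import tempfile
from pathlib import Path


def latest_step(ckpt_dir, pattern=r"^(\d+)\.pth\.tar$"):
    """Return the largest step number among checkpoint files, or None."""
    steps = []
    for path in Path(ckpt_dir).iterdir():
        match = re.match(pattern, path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# Demo with empty placeholder files instead of real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("100000.pth.tar", "200000.pth.tar", "notes.txt"):
        (Path(tmp) / name).write_text("")
    demo_step = latest_step(tmp)
    print(demo_step)
```

The returned number is what would be passed as RESTORE_STEP when starting the 'shallow' stage.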

TensorBoard

Use

tensorboard --logdir output/log/DATASET

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

Naive Diffusion

Notes

  • In addition to the Diffusion Decoder, the Variance Adaptor is also conditioned on speaker information.
  • The unconditional and conditional outputs of the JCU discriminator are averaged in each loss calculation, as in VocGAN.
  • Differences in data and preprocessing compared to the original paper:
    • VCTK (109 speakers) is used instead of the Mandarin Chinese dataset of 228 speakers.
    • The audio config follows DiffSpeech, e.g., the sample rate is 22,050 Hz rather than 24,000 Hz.
    • Variance extraction and modeling also follow DiffSpeech.
  • lambda_fm is fixed to 10 because the dynamically scaled scalar computed as L_recon/L_fm makes the model explode.
  • There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle between them in the config ('none' or 'DeepSpeaker').
  • DeepSpeaker on the VCTK dataset shows clear separation among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings.

  • Use HiFi-GAN instead of Parallel WaveGAN (PWG) for vocoding.
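
The note on lambda_fm can be illustrated with plain numbers. The sketch below assumes the generator objective is a sum of adversarial, reconstruction, and weighted feature matching terms, which is one common form and is not confirmed by this README. The loss values are made up. It shows how a weight computed as L_recon/L_fm grows without bound as the feature matching loss shrinks, while the fixed weight stays at 10:

```python
def generator_loss(l_adv, l_recon, l_fm, lambda_fm=10.0):
    """Combine loss terms with a fixed feature matching weight (assumed form)."""
    return l_adv + l_recon + lambda_fm * l_fm


def dynamic_lambda(l_recon, l_fm):
    """The dynamically scaled weight mentioned in the notes: L_recon / L_fm."""
    return l_recon / l_fm


l_recon = 0.5  # illustrative value only
for l_fm in (0.1, 0.001, 0.00001):
    print(f"L_fm={l_fm:g}  dynamic weight={dynamic_lambda(l_recon, l_fm):g}  fixed weight=10")

example_total = generator_loss(l_adv=1.0, l_recon=0.5, l_fm=0.1)
```

A very large weight multiplies the gradient of the feature matching term by the same factor, which is one plausible reason for the instability the note describes.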

Citation

Please cite this repository using the "Cite this repository" option in the About section (top right of the main page).

Comments
  • About question of code and synthesis

    Thank you for your suggestions these days. I successfully integrated the PortaSpeech model on top of this model, and I have some questions. Thank you!

    1. In DiffGAN-TTS, get_mask_from_lengths returns mask, while in PortaSpeech it returns ~mask. What is the difference between them?
    2. In DiffGAN-TTS's def diffuse_trace(self, x_start, mask), what is the ~ meant to do? In my integrated model, get_mask_from_lengths returns ~mask. If I delete the ~ in diffuse_trace, the synthesized mel is wrong and the voice sounds like running water. If I keep the ~ in diffuse_trace, the mel is also wrong and the voice sounds electronic. Thank you very much!
    • Deng Yan
    • 2022.5.9
    • GuangXi University
    opened by qw1260497397 8
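
The two mask conventions in the question above can be compared on a toy batch. This sketch uses plain Python lists in place of tensors and assumes that mask marks padded positions with True, so ~mask marks valid frames. Which convention each codebase actually uses should be checked in its own get_mask_from_lengths:

```python
def padding_mask(lengths, max_len=None):
    """True marks padded positions (the convention assumed here for mask)."""
    max_len = max(lengths) if max_len is None else max_len
    return [[t >= n for t in range(max_len)] for n in lengths]


def invert(mask):
    """Element-wise NOT, the list analogue of ~mask on a boolean tensor."""
    return [[not m for m in row] for row in mask]


def zero_where(values, mask):
    """Zero every position where the mask is True."""
    return [[0.0 if m else v for v, m in zip(v_row, m_row)]
            for v_row, m_row in zip(values, mask)]


lengths = [3, 1]
frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mask = padding_mask(lengths)
valid = invert(mask)

pad_zeroed = zero_where(frames, mask)     # intended: padding removed
valid_zeroed = zero_where(frames, valid)  # wrong convention: real frames removed
print(pad_zeroed)
print(valid_zeroed)
```

Passing a mask with the opposite convention silently erases the real frames and keeps the padding, so every call site has to agree on what True means.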
  • ERROR

    File "train.py", line 320, in <module>
      main(args, configs)
    File "train.py", line 196, in main
      figs, wav_reconstruction, wav_prediction, tag = synth_one_sample(
    File "/data/workspace/liukaiyang/TTS/DiffGAN-TTS-main/utils/tools.py", line 227, in synth_one_sample
      mels = [mel_pred[0, :mel_len].float().detach().transpose(0, 1) for mel_pred in diffusion.sampling()]
    File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
      return func(*args, **kwargs)
    File "/data/workspace/liukaiyang/TTS/DiffGAN-TTS-main/model/diffusion.py", line 157, in sampling
      b, *_, device = *self.cond.shape, self.cond.device
    File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
      raise AttributeError("'{}' object has no attribute '{}'".format(
    AttributeError: 'GaussianDiffusion' object has no attribute 'cond'

    Thanks for your work! I seem to get an error.

    opened by FlyToYourMooN 6
  • TypeError: 'NoneType' object is not subscriptable

    Traceback (most recent call last):
      File "train.py", line 307, in <module>
        main(args, configs)
      File "train.py", line 99, in main
        output = model(*(batch[2:]))
    TypeError: 'NoneType' object is not subscriptable

    How can I solve this problem? Thank you!

    opened by qw1260497397 5
  • VCTK generation fails

    Hello, thank you very much for your brilliant open-source project. I have been able to do single and batch generations using the LJSpeech dataset. However, when I try to replicate the results for the VCTK dataset, it fails.

    I run the following command, !python3 synthesize.py --text "Hello World" --model naive --restore_step 300000 --mode single --dataset VCTK

    I obtain the following output:

    [nltk_data] Downloading package averaged_perceptron_tagger to
    [nltk_data]     /root/nltk_data...
    [nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
    [nltk_data] Downloading package cmudict to /root/nltk_data...
    [nltk_data]   Unzipping corpora/cmudict.zip.
    
    ==================================== Inference Configuration ====================================
     ---> Type of Modeling: naive
     ---> Total Batch Size: 32
     ---> Path of ckpt: ./output/ckpt/VCTK_naive
     ---> Path of log: ./output/log/VCTK_naive
     ---> Path of result: ./output/result/VCTK_naive
    ================================================================================================
    Removing weight norm...
    Traceback (most recent call last):
      File "synthesize.py", line 264, in <module>
        )) if load_spker_embed else None
      File "/usr/local/lib/python3.7/dist-packages/numpy/lib/npyio.py", line 416, in load
        fid = stack.enter_context(open(os_fspath(file), "rb"))
    FileNotFoundError: [Errno 2] No such file or directory: './preprocessed_data/VCTK/spker_embed/p225-spker_embed.npy' 
    

    I tried to investigate further and discovered that the specific speaker embedding folder and file did not exist in my directory. Any pointer to how I can solve the issue will be appreciated.

    opened by KwekuYamoah 2
  • About preprocess

    Hi, I want to run "python3 preprocess.py --dataset VCTK" after "python3 prepare_align.py --dataset VCTK", but line 115 of ./preprocessor/preprocessor.py reads

    tg_path = os.path.join(self.out_dir, "TextGrid", speaker, "{}.TextGrid".format(basename))

    and I cannot find any "*.TextGrid" files. When are they created?

    After the step "python3 prepare_align.py --dataset VCTK" I only get ".lab" and ".wav" files, and no ".TextGrid" files.

    Thanks

    opened by CathyW77 2
  • What does the mlp and Mish function in modules.py do

    self.mlp = nn.Sequential(
        LinearNorm(residual_channels, residual_channels * 4),
        Mish(),  # return x * torch.tanh(F.softplus(x))
        LinearNorm(residual_channels * 4, residual_channels)
    )

    class Mish(nn.Module):
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))

    opened by qw1260497397 2
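
As a small numeric illustration of the formula quoted in the question above, x * tanh(softplus(x)), the sketch below evaluates Mish on scalars using only the standard library. It is a scalar re-implementation for intuition, not the repository's module:

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"{x:+.1f} -> {mish(x):+.4f}")
```

Mish is a smooth activation that is close to the identity for large positive inputs and close to zero for large negative ones. In the quoted snippet it sits between two linear layers that expand the channels by 4x and project them back.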
  • stft

    Hello, thank you very much for the open-source project. I ran into a problem: the model converged during training and the generated mel-spectrogram looked very good, but when I fed it into my own HiFi-GAN vocoder, the resulting wav was an unintelligible murmur. I am sure that the HiFi-GAN sample rate, hop length, and window length are consistent with the DiffGAN model, so I suspect the problem lies in how the audio data is converted into a mel-spectrogram. I noticed that you use pytorch-stft for this. Is its output very different from that of librosa.stft?

    opened by KMzuka 2
  • Some of the problems that occur in training

    I encountered some problems during training. The loss occasionally fluctuates a lot, jumping from around 3 to tens or even hundreds. After I enabled shuffling of the training set, the problem sometimes occurs and sometimes does not. I saw it in the naive, aux, and shallow stages. Thank you, my friend! Best wishes to you!

    opened by qw1260497397 1
  • Why minimize l1(\hat{x_0}, x_0)+l1(\hat{x_1}, x_0) when optimizing aux model?

    Hi, keonlee. Thanks for sharing the code! I found that when training the aux model, we get \hat{x_0} from G, then diffuse it to \hat{x_1}, and finally obtain a prediction list [\hat{x_0}, \hat{x_1}]. The mel loss adds the L1 losses of both predictions against the target. This confuses me. I understand l1(x_0, \hat{x_0}), but why not l1(x_1, \hat{x_1})?

    opened by caisikai 0
  • Is adversarial training actually necessary?

    I realised that when I remove the adversarial loss and the feature matching loss, the model still works well, with no degradation in performance. This makes me question the role of adversarial training in reducing the number of inference steps, or whether this task is simple enough to learn directly with a denoising model. Here are samples from the two models: https://drive.google.com/drive/folders/1uvURiQkOrP9n1jJsKyNe9NcSO4AfdFID?usp=sharing

    opened by nguyenhungquang 3
  • Can I ask you some questions about mel-spectrogram?

    I have some questions about the mel-spectrogram. In the attached picture, the alignment of the mel-spectrogram has already formed, but the horizontal details have not appeared yet. What problem do you think causes this?

    opened by qw1260497397 3
  • Can we just use FastSpeech for inference as baseline result

    Hi Keon, thanks so much for sharing this wonderful project. I am wondering whether we can use only the FastSpeech part for inference. Looking forward to your reply.

    opened by Maoshuiyang 1
Releases(v0.1.1)
Owner
Keon Lee
Everything towards conversational AI