STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

Overview

Hits

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

Keon Lee, Kyumin Park, Daeyoung Kim

In our paper, we propose STYLER, a non-autoregressive TTS framework with style factor modeling that achieves rapidity, robustness, expressivity, and controllability at the same time.

Abstract: Previous works on neural text-to-speech (TTS) have been addressed on limited speed in training and inference time, robustness for difficult synthesis conditions, expressiveness, and controllability. Although several approaches resolve some limitations, there has been no attempt to solve all weaknesses at once. In this paper, we propose STYLER, an expressive and controllable TTS framework with high-speed and robust synthesis. Our novel audio-text aligning method called Mel Calibrator and excluding autoregressive decoding enable rapid training and inference and robust synthesis on unseen data. Also, disentangled style factor modeling under supervision enlarges the controllability in synthesizing process leading to expressive TTS. On top of it, a novel noise modeling pipeline using domain adversarial training and Residual Decoding empowers noise-robust style transfer, decomposing the noise without any additional label. Various experiments demonstrate that STYLER is more effective in speed and robustness than expressive TTS with autoregressive decoding and more expressive and controllable than reading style non-autoregressive TTS. Synthesis samples and experiment results are provided via our demo page, and code is available publicly.

Dependencies

Please install the python dependencies given in requirements.txt.

pip3 install -r requirements.txt

Training

Preparation

Clean Data

  1. Download VCTK dataset and resample audios to a 22050Hz sampling rate.
  2. We provide a bash script for the resampling. Refer to data/resample.sh for the detail.
  3. Put audio files and corresponding text (transcript) files in the same directory. Both audio and text files must have the same name, excluding the extension.
  4. You may need to trim the audio for stable model convergence. Refer to Yeongtae's preprocess_audio.py for helpful preprocessing, including the trimming.
  5. Modify the hp.data_dir in hparams.py.

Noisy Data

  1. Download WHAM! dataset and resample audios to a 22050Hz sampling rate.
  2. Modify the hp.noise_dir in hparams.py.

Vocoder

  1. Unzip hifigan/generator_universal.pth.tar.zip in the same directory.

Preprocess

First, download ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding as described in our paper and locate it in hp.speaker_embedder_dir.

Second, download the Montreal Forced Aligner(MFA) package and the pretrained (LibriSpeech) lexicon file through the following commands. MFA is used to obtain the alignments between the utterances and the phoneme sequences as FastSpeech2.

wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.1.0-beta.2/montreal-forced-aligner_linux.tar.gz
tar -zxvf montreal-forced-aligner_linux.tar.gz

wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -O montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt

Then, process all the necessary features. You will get a stat.txt file in your hp.preprocessed_path/. You have to modify the f0 and energy parameters in the hparams.py according to the content of stat.txt.

python3 preprocess.py

Finally, get the noisy data separately from the clean data by mixing each utterance with a randomly selected piece of background noise from WHAM! dataset.

python3 preprocess_noisy.py

Train

Now you have all the prerequisites! Train the model using the following command:

python3 train.py

Inference

Prepare Texts

Create sentences.py in data/ which has a python list named sentences of texts to be synthesized. Note that sentences can contain more than one text.

# In 'data/sentences.py',
sentences = [
    "Nothing is lost, everything is recycled."
]

Prepare Reference Audios

Reference audio preparation has a similar process to training data preparation. There could be two kinds of references: clean and noisy.

First, put clean audios with corresponding texts in a single directory and modify the hp.ref_audio_dir in hparams.py and process all the necessary features. Refer to the Clean Data section of Train Preparation.

python3 preprocess_refs.py

Then, get the noisy references.

python3 preprocess_noisy.py --refs

Synthesize

The following command will synthesize all combinations of texts in data/sentences.py and audios in hp.ref_audio_dir.

python3 synthesize.py --ckpt CHECKPOINT_PATH

Or you can specify single reference audio in hp.ref_audio_dir as follows.

python3 synthesize.py --ckpt CHECKPOINT_PATH --ref_name AUDIO_FILENAME

Also, there are several useful options.

  1. --speaker_id will specify the speaker. The specified speaker's embedding should be in hp.preprocessed_path/spker_embed. The default value is None, and the speaker embedding is calculated at runtime on each input audio.

  2. --inspection will give you additional outputs that show the effects of each encoder of STYLER. The samples are the same as the Style Factor Modeling section on our demo page.

  3. --cont will generate the samples as the Style Factor Control section on our demo page.

    python3 synthesize.py --ckpt CHECKPOINT_PATH --cont --r1 AUDIO_FILENAME_1 --r2 AUDIO_FILENAME_1

    Note that --cont option is only working on preprocessed data. In detail, the audios' name should have the same format as VCTK dataset (e.g., p323_229), and the preprocessed data must be existing in hp.preprocessed_path.

TensorBoard

The TensorBoard loggers are stored in the log directory. Use

tensorboard --logdir log

to serve the TensorBoard on your localhost. Here are some logging views of the model training on VCTK for 560k steps.

Notes

  1. There were too many noise data where extraction was not possible through pyworld as in clean data. To resolve this, pysptk was applied to extract log f0 for the noisy data's fundamental frequency. The --noisy_input option will automate this process during synthesizing.

  2. If MFA-related problems occur during running preprocess.py, try to manually run MFA by the following command.

    # Replace $data_dir and $PREPROCESSED_PATH with ./VCTK-Corpus-92/wav48_silence_trimmed and ./preprocessed/VCTK/TextGrid, for example
    ./montreal-forced-aligner/bin/mfa_align $YOUR_data_dir montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt english $YOUR_PREPROCESSED_PATH -j 8
  3. DeepSpeaker on VCTK dataset shows clear identification among speakers. The following figure shows the T-SNE plot of extracted speaker embedding in our experiments.

  4. Currently, preprocess.py divides the dataset into two subsets: train and validation set. If you need other sets, such as a test set, the only thing to do is modifying the text files (train.txt or val.txt) in hp.preprocessed_path/.

Citation

If you would like to use or refer to this implementation, please cite our paper with the repo.

@article{lee2021styler,
  title={STYLER: Style Modeling with Rapidity and Robustness via SpeechDecomposition for Expressive and Controllable Neural Text to Speech},
  author={Lee, Keon and Park, Kyumin and Kim, Daeyoung},
  journal={arXiv preprint arXiv:2103.09474},
  year={2021}
}

References

Comments
  • some questions

    some questions

    1.in paper, why output of encoder (text_encoding) upsample and downsample? 2. what is the meaning of text_encoding_neck+pitch_encoding、text_encoding_neck+energy_encoding? why not cat?

    opened by Pydataman 3
  • Low resource languages that won't work with MFA?

    Low resource languages that won't work with MFA?

    Is there a way to fine tune a model or training two languages side by side such that a very low resource language can be trained with the voices of a high resource language?

    opened by michael-conrad 3
  • Undefined names

    Undefined names

    Hi, I noticed some undefined names around the code:

    synthesize.py:495:67: F821 undefined name 'reference' noise_mixer_refs.py:56:42: F821 undefined name 'eps' noise_mixer_refs.py:59:40: F821 undefined name 'eps'

    opened by L3str4nge 2
  • About the pre-process

    About the pre-process

    Hi, I want to ask the trimming operation whether is very important for training your model? Furthermore, can you share the scripts to trimming VCTK dataset?

    opened by yangdongchao 0
  • Bump tensorflow from 2.4.0 to 2.5.1

    Bump tensorflow from 2.4.0 to 2.5.1

    Bumps tensorflow from 2.4.0 to 2.5.1.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.5.1

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse (CVE-2021-37656)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixDiagV* ops (CVE-2021-37657)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixSetDiagV* ops (CVE-2021-37658)
    • Fixes an undefined behavior arising from reference binding to nullptr and heap OOB in binary cwise ops (CVE-2021-37659)
    • Fixes a division by 0 in inplace operations (CVE-2021-37660)
    • Fixes a crash caused by integer conversion to unsigned (CVE-2021-37661)
    • Fixes an undefined behavior arising from reference binding to nullptr in boosted trees (CVE-2021-37662)
    • Fixes a heap OOB in boosted trees (CVE-2021-37664)
    • Fixes vulnerabilities arising from incomplete validation in QuantizeV2 (CVE-2021-37663)
    • Fixes vulnerabilities arising from incomplete validation in MKL requantization (CVE-2021-37665)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToVariant (CVE-2021-37666)
    • Fixes an undefined behavior arising from reference binding to nullptr in unicode encoding (CVE-2021-37667)
    • Fixes an FPE in tf.raw_ops.UnravelIndex (CVE-2021-37668)
    • Fixes a crash in NMS ops caused by integer conversion to unsigned (CVE-2021-37669)
    • Fixes a heap OOB in UpperBound and LowerBound (CVE-2021-37670)
    • Fixes an undefined behavior arising from reference binding to nullptr in map operations (CVE-2021-37671)
    • Fixes a heap OOB in SdcaOptimizerV2 (CVE-2021-37672)
    • Fixes a CHECK-fail in MapStage (CVE-2021-37673)
    • Fixes a vulnerability arising from incomplete validation in MaxPoolGrad (CVE-2021-37674)
    • Fixes an undefined behavior arising from reference binding to nullptr in shape inference (CVE-2021-37676)
    • Fixes a division by 0 in most convolution operators (CVE-2021-37675)
    • Fixes vulnerabilities arising from missing validation in shape inference for Dequantize (CVE-2021-37677)
    • Fixes an arbitrary code execution due to YAML deserialization (CVE-2021-37678)
    • Fixes a heap OOB in nested tf.map_fn with RaggedTensors (CVE-2021-37679)

    ... (truncated)

    Changelog

    Sourced from tensorflow's changelog.

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse

    ... (truncated)

    Commits
    • 8222c1c Merge pull request #51381 from tensorflow/mm-fix-r2.5-build
    • d584260 Disable broken/flaky test
    • f6c6ce3 Merge pull request #51367 from tensorflow-jenkins/version-numbers-2.5.1-17468
    • 3ca7812 Update version numbers to 2.5.1
    • 4fdf683 Merge pull request #51361 from tensorflow/mm-update-relnotes-on-r2.5
    • 05fc01a Put CVE numbers for fixes in parentheses
    • bee1dc4 Update release notes for the new patch release
    • 47beb4c Merge pull request #50597 from kruglov-dmitry/v2.5.0-sync-abseil-cmake-bazel
    • 6f39597 Merge pull request #49383 from ashahab/abin-load-segfault-r2.5
    • 0539b34 Merge pull request #48979 from liufengdb/r2.5-cherrypick
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
Releases(v1.0.0)
Owner
Keon Lee
Expressive Speech Synthesis | Disentangled Representation | Generative Models | NLP | HCI
Keon Lee
No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency

This repository contains the implementation for the paper: No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consiste

Alireza Golestaneh 75 Dec 30, 2022
Machine Learning Framework for Operating Systems - Brings ML to Linux kernel

KML: A Machine Learning Framework for Operating Systems & Storage Systems Storage systems and their OS components are designed to accommodate a wide v

File systems and Storage Lab (FSL) 186 Nov 24, 2022
A PyTorch implementation of SlowFast based on ICCV 2019 paper "SlowFast Networks for Video Recognition"

SlowFast A PyTorch implementation of SlowFast based on ICCV 2019 paper SlowFast Networks for Video Recognition. Requirements Anaconda PyTorch conda in

Hao Ren 8 Dec 23, 2022
Adversarial Graph Representation Adaptation for Cross-Domain Facial Expression Recognition (AGRA, ACM 2020, Oral)

Cross Domain Facial Expression Recognition Benchmark Implementation of papers: Cross-Domain Facial Expression Recognition: A Unified Evaluation Benchm

89 Dec 09, 2022
This project deploys a yolo fastest model in the form of tflite on raspberry 3b+. The model is from another repository of mine called -Trash-Classification-Car

Deploy-yolo-fastest-tflite-on-raspberry 觉得有用的话可以顺手点个star嗷 这个项目将垃圾分类小车中的tflite模型移植到了树莓派3b+上面。 该项目主要是为了记录在树莓派部署yolo fastest tflite的流程 (之后有时间会尝试用C++部署来提升

7 Aug 16, 2022
Group Fisher Pruning for Practical Network Compression(ICML2021)

Group Fisher Pruning for Practical Network Compression (ICML2021) By Liyang Liu*, Shilong Zhang*, Zhanghui Kuang, Jing-Hao Xue, Aojun Zhou, Xinjiang W

Shilong Zhang 129 Dec 13, 2022
TeachMyAgent is a testbed platform for Automatic Curriculum Learning methods in Deep RL.

TeachMyAgent: a Benchmark for Automatic Curriculum Learning in Deep RL Paper Website Documentation TeachMyAgent is a testbed platform for Automatic Cu

Flowers Team 51 Dec 25, 2022
DeepMind Alchemy task environment: a meta-reinforcement learning benchmark

The DeepMind Alchemy environment is a meta-reinforcement learning benchmark that presents tasks sampled from a task distribution with deep underlying structure.

DeepMind 188 Dec 25, 2022
Training a deep learning model on the noisy CIFAR dataset

Training-a-deep-learning-model-on-the-noisy-CIFAR-dataset This repository contai

1 Jun 14, 2022
FMA: A Dataset For Music Analysis

FMA: A Dataset For Music Analysis Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson. International Society for Music Information

Michaël Defferrard 1.8k Dec 29, 2022
Pytorch implementation of MaskFlownet

MaskFlownet-Pytorch Unofficial PyTorch implementation of MaskFlownet (https://github.com/microsoft/MaskFlownet). Tested with: PyTorch 1.5.0 CUDA 10.1

Daniele Cattaneo 84 Nov 02, 2022
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

880 Jan 07, 2023
WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking

WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking [Paper Link] Abstract In this work, we contribute a new million-scale Un

25 Jan 01, 2023
Source code of our BMVC 2021 paper: AniFormer: Data-driven 3D Animation with Transformer

AniFormer This is the PyTorch implementation of our BMVC 2021 paper AniFormer: Data-driven 3D Animation with Transformer. Haoyu Chen, Hao Tang, Nicu S

24 Nov 02, 2022
R3Det based on mmdet 2.19.0

R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object Installation # install mmdetection first if you haven't installed it

SJTU-Thinklab-Det 38 Dec 15, 2022
Computing Shapley values using VAEAC

Shapley values and the VAEAC method In this GitHub repository, we present the implementation of the VAEAC approach from our paper "Using Shapley Value

3 Nov 23, 2022
Volumetric parameterization of the placenta to a flattened template

placenta-flattening A MATLAB algorithm for volumetric mesh parameterization. Developed for mapping a placenta segmentation derived from an MRI image t

Mazdak Abulnaga 12 Mar 14, 2022
Fast Differentiable Matrix Sqrt Root

Fast Differentiable Matrix Sqrt Root Geometric Interpretation of Matrix Square Root and Inverse Square Root This repository constains the official Pyt

YueSong 42 Dec 30, 2022
AdaDM: Enabling Normalization for Image Super-Resolution

AdaDM AdaDM: Enabling Normalization for Image Super-Resolution. You can apply BN, LN or GN in SR networks with our AdaDM. Pretrained models (EDSR*/RDN

58 Jan 08, 2023
NAS-Bench-x11 and the Power of Learning Curves

NAS-Bench-x11 NAS-Bench-x11 and the Power of Learning Curves Shen Yan, Colin White, Yash Savani, Frank Hutter. NeurIPS 2021. Surrogate NAS benchmarks

AutoML-Freiburg-Hannover 13 Nov 18, 2022