PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Last update: Dec 30, 2022

Overview

Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Julian Zaïdi, Hugo Seuté, Benjamin van Niekerk, Marc-André Carbonneau

In our recent paper we propose Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art on inter-speaker and inter-text prosody transfer. This improvement is achieved using FiLM conditioning layers, alongside adversarial training that encourages disentanglement between prosodic information and speaker identity. The acoustic model inherits attractive qualities from FastSpeech 2, such as fast inference and local prosody attributes prediction for finer grained control over generation. Moreover, results indicate that adversarial training effectively discards speaker identity information from the prosody representation, which ensures Daft-Exprt will consistently generate speech with the desired voice.

Experimental results show that Daft-Exprt accurately transfers prosody, while yielding naturalness comparable to state-of-the-art expressive models. Visit our demo page for audio samples related to the paper experiments.
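As a rough illustration of the FiLM conditioning mentioned above, the sketch below applies a feature-wise linear modulation (a per-channel scale `gamma` and shift `beta`) to a small sequence of frames. It is not taken from the repository: the function name `film` and the toy values are assumptions, and in a real model `gamma` and `beta` would be predicted by a network from the conditioning input rather than written by hand.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel.

    features: list of frames, each frame a list of channel values.
    gamma, beta: one scale and one shift per channel (hand-written here;
    a conditioning network would normally predict them).
    """
    return [
        [g * x + b for x, g, b in zip(frame, gamma, beta)]
        for frame in features
    ]


# Two frames with three channels each (toy values).
frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
modulated = film(frames, gamma=[1.0, 0.5, 2.0], beta=[0.0, 1.0, -1.0])
print(modulated)
```

The same `gamma` and `beta` are applied to every frame, which is what makes the modulation feature-wise rather than position-wise.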

Pre-trained model

Full disclosure: The model provided in this repository is not the same as in the paper evaluation. The model of the paper was trained with proprietary data, which prevents us from releasing it publicly.
We pre-trained Daft-Exprt on a combination of the LJ Speech dataset and the Emotional Speech Dataset (ESD) from Zhou et al.
Visit the releases of this repository to download the pre-trained model and to listen to prosody transfer examples generated with this same model.

Table of Contents

Installation
- Local Environment
- Docker Image
Quick Start Example
Citation
Contributing

Installation

Local Environment

Requirements:

Ubuntu >= 20.04
Python >= 3.8
NVIDIA Driver >= 450.80.02
CUDA Toolkit >= 11.1
CuDNN >= v8.0.5

We recommend using conda for Python environment management; for example, download and install Miniconda.
Create your Python environment and install the dependencies using the Makefile:

conda create -n daft_exprt python=3.8 -y
conda activate daft_exprt
cd environment
make

All Linux/Conda/Python dependencies will be installed by the Makefile, and the repository will be installed as a pip package in editable mode.

Docker Image

Requirements:

NVIDIA Docker
NVIDIA Driver >= 450.80.02

Build the Docker image using the associated Dockerfile:

docker build -f environment/Dockerfile -t daft_exprt .

Quick Start Example

Introduction

This quick start guide will illustrate how to use the different scripts of this repository to:

Format datasets
Pre-process these datasets
Train Daft-Exprt on the pre-processed data
Generate a dataset for vocoder fine-tuning
Use Daft-Exprt for TTS synthesis

All scripts are located in the scripts directory.
The Daft-Exprt source code is located in the daft_exprt directory.
Config parameters used in the scripts are all instantiated in hparams.py.

As a quick start example, we use the 22 kHz LJ Speech dataset and the 16 kHz Emotional Speech Dataset (ESD) from Zhou et al.
Together they contain 11 speakers. All speaker datasets must be in the same root directory. For example:

/data_dir
    LJ_Speech
    ESD
        spk_1
        ...
        spk_N

In this example, we use the Docker image built in the previous section:

docker run -it --gpus all -v /path/to/data_dir:/workdir/data_dir -v /path/to/repo_dir:/workdir/repo_dir IMAGE_ID

Dataset Formatting

The source code expects a specific tree structure for each speaker dataset:

/speaker_dir
    metadata.csv
    /wavs
        wav_file_name_1.wav
        ...
        wav_file_name_N.wav

metadata.csv must be formatted as follows:

wav_file_name_1|text_1
...
wav_file_name_N|text_N
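Before running the formatting or pre-processing scripts, the layout and metadata.csv format above can be sanity-checked. The following is a minimal sketch, not part of the repository: the helper name `check_speaker_dir` is hypothetical, and it assumes that each metadata entry names a wav file without its `.wav` extension, as the example above suggests.

```python
import tempfile
from pathlib import Path


def check_speaker_dir(speaker_dir):
    """Return a list of problems found in one speaker directory.

    Hypothetical helper: it only checks the layout described above
    (metadata.csv with `wav_file_name|text` lines and a wavs/ folder).
    """
    speaker_dir = Path(speaker_dir)
    metadata = speaker_dir / "metadata.csv"
    wavs_dir = speaker_dir / "wavs"
    if not metadata.is_file():
        return [f"missing {metadata}"]

    problems = []
    lines = metadata.read_text(encoding="utf-8").splitlines()
    for line_nb, line in enumerate(lines, 1):
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first "|" only, so the text may itself contain "|".
        name, sep, text = line.partition("|")
        if not sep or not name.strip() or not text.strip():
            problems.append(f"line {line_nb}: expected 'wav_file_name|text'")
        elif not (wavs_dir / f"{name.strip()}.wav").is_file():
            problems.append(f"line {line_nb}: wavs/{name.strip()}.wav not found")
    return problems


# Small self-contained demo on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "speaker_dir"
    (demo_dir / "wavs").mkdir(parents=True)
    (demo_dir / "wavs" / "utt_1.wav").write_bytes(b"")  # placeholder file
    (demo_dir / "metadata.csv").write_text(
        "utt_1|Hello world.\nutt_2|This file is missing.\n", encoding="utf-8"
    )
    demo_problems = check_speaker_dir(demo_dir)
    print(demo_problems)
```

An empty list means that every metadata line is well formed and points to an existing wav file.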

Since each dataset has its own naming conventions, this project does not provide a universal ready-made script.
However, the script format_dataset.py already provides code to format LJ and ESD:

python format_dataset.py \
    --data_set_dir /workdir/data_dir/LJ_Speech \
    LJ

python format_dataset.py \
    --data_set_dir /workdir/data_dir/ESD \
    ESD \
    --language english

Data Pre-Processing

In this section, the code will:

Align data using MFA
Extract features for training
Create train and validation sets
Extract features stats on the train set for speaker standardization
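The last step above, speaker standardization, can be pictured with the sketch below: statistics are computed per speaker on the train set only, then used to standardize that speaker's feature values. This is an illustration under stated assumptions, not the repository's code. The function names, the toy pitch values, and the choice of the population standard deviation are all hypothetical.

```python
from statistics import mean, pstdev


def compute_speaker_stats(train_values):
    """Map each speaker to the (mean, std) of a feature over the train set."""
    stats = {}
    for speaker, values in train_values.items():
        stats[speaker] = (mean(values), pstdev(values))
    return stats


def standardize(value, speaker, stats):
    """Standardize one feature value with the stats of its speaker."""
    mu, sigma = stats[speaker]
    return (value - mu) / sigma if sigma > 0 else 0.0


# Toy pitch values in Hz; speaker names and numbers are made up.
train_pitch = {
    "spk_low": [100.0, 110.0, 120.0],
    "spk_high": [200.0, 220.0, 240.0],
}
stats = compute_speaker_stats(train_pitch)
z_low = standardize(120.0, "spk_low", stats)
z_high = standardize(240.0, "spk_high", stats)
print(z_low, z_high)
```

After standardization, the highest pitch of the low-pitched speaker and the highest pitch of the high-pitched speaker map to the same value, which is the point of normalizing per speaker rather than globally.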

To pre-process all available formatted data (i.e. LJ and ESD in this example):

python training.py \
    --experiment_name EXPERIMENT_NAME \
    --data_set_dir /workdir/data_dir \
    pre_process

This will pre-process the data using the default hyper-parameters, which are set for 22 kHz audio.
All outputs related to the experiment will be stored in /workdir/repo_dir/trainings/EXPERIMENT_NAME.
You can also target specific speakers for data pre-processing. For example, to consider only ESD speakers:

python training.py \
    --experiment_name EXPERIMENT_NAME \
    --speakers ESD/spk_1 ... ESD/spk_N \
    --data_set_dir /workdir/data_dir \
    pre_process

The pre-process function takes several arguments:

--features_dir: absolute path where pre-processed data will be stored. Defaults to /workdir/repo_dir/datasets.
--proportion_validation: proportion of examples that will be in the validation set. Defaults to 0.1% per speaker.
--nb_jobs: number of cores to use for Python multi-processing. If set to max, all CPU cores are used. Defaults to 6.
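To make the effect of --proportion_validation concrete, here is a minimal sketch of a per-speaker train/validation split. It is not the repository's implementation: the function name `split_per_speaker`, the fixed seed, the rule of keeping at least one validation utterance per speaker, and the reading of the proportion as a fraction between 0 and 1 are all assumptions made for illustration.

```python
import random


def split_per_speaker(utterances, proportion_validation, seed=0):
    """Split utterance names into train/validation sets, speaker by speaker.

    utterances: dict mapping a speaker name to a list of utterance names.
    proportion_validation: fraction in [0, 1] (an assumption of this sketch).
    """
    rng = random.Random(seed)
    train, validation = {}, {}
    for speaker, names in utterances.items():
        names = sorted(names)
        rng.shuffle(names)
        # Keep at least one validation utterance per speaker (assumption).
        nb_val = max(1, round(len(names) * proportion_validation))
        validation[speaker] = names[:nb_val]
        train[speaker] = names[nb_val:]
    return train, validation


# Toy example with two speakers of different sizes.
utterances = {
    "spk_1": [f"utt_{i}" for i in range(20)],
    "spk_2": [f"utt_{i}" for i in range(10)],
}
train_set, val_set = split_per_speaker(utterances, proportion_validation=0.1)
print({spk: len(names) for spk, names in val_set.items()})
```

Splitting speaker by speaker keeps every speaker represented in both sets, whatever the size of its dataset.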

Note that if it is the first time that you pre-process the data, this step will take several hours.
You can decrease computing time by increasing the --nb_jobs parameter.

Training

Once pre-processing is finished, launch training. To train on all pre-processed data:

python training.py \
    --experiment_name EXPERIMENT_NAME \
    --data_set_dir /workdir/data_dir \
    train

Or if you targeted specific speakers during pre-processing (e.g. ESD speakers):

python training.py \
    --experiment_name EXPERIMENT_NAME \
    --speakers ESD/spk_1 ... ESD/spk_N \
    --data_set_dir /workdir/data_dir \
    train

All outputs related to the experiment will be stored in /workdir/repo_dir/trainings/EXPERIMENT_NAME.

The train function takes several arguments:

--checkpoint: absolute path of a Daft-Exprt checkpoint. Defaults to "".
--no_multiprocessing_distributed: disable PyTorch multi-processing distributed training. Defaults to False.
--world_size: number of nodes for distributed training. Defaults to 1.
--rank: node rank for distributed training. Defaults to 0.
--master: URL used to set up distributed training. Defaults to tcp://localhost:54321.

These default values will launch a new training starting at iteration 0, using all available GPUs on the machine.
The default hyper-parameters assume that only 1 GPU is available on the machine:
the default batch size and gradient accumulation values are set to reproduce the batch size of 48 from the paper.
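The reason gradient accumulation can reproduce a larger batch size is sketched below on a toy one-parameter model: averaging the gradients of equally sized micro-batches gives the same gradient as one large batch. The code is illustrative only. The function names are hypothetical, and the split of 48 examples into 3 micro-batches of 16 is an assumption, not the repository's default values.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, micro_batch_size):
    """Average micro-batch gradients, as gradient accumulation does."""
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    # Each micro-batch gradient is scaled by 1 / nb_steps before being summed,
    # so the total matches the gradient of one large batch when all
    # micro-batches have the same size.
    nb_steps = len(micro_batches)
    return sum(mse_gradient(w, mb) / nb_steps for mb in micro_batches)


# 48 toy (x, y) pairs; the 16 x 3 split is illustrative only.
data = [(float(i), 3.0 * i + 1.0) for i in range(48)]
full = mse_gradient(0.5, data)
accumulated = accumulated_gradient(0.5, data, micro_batch_size=16)
print(full, accumulated)
```

With several GPUs, the effective batch size is also multiplied by the number of GPUs, so the batch size or the number of accumulation steps would need to be reduced accordingly to keep the same total.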

The code also supports TensorBoard logging. To display logging outputs:

tensorboard --logdir_spec=EXPERIMENT_NAME:/workdir/repo_dir/trainings/EXPERIMENT_NAME/logs

Vocoder Fine-Tuning

Once training is finished, you can create a dataset for vocoder fine-tuning:

python training.py \
    --experiment_name EXPERIMENT_NAME \
    --data_set_dir /workdir/data_dir \
    fine_tune \
    --checkpoint CHECKPOINT_PATH

Or if you targeted specific speakers during pre-processing and training (e.g. ESD speakers):

python training.py \
    --experiment_name EXPERIMENT_NAME \
    --speakers ESD/spk_1 ... ESD/spk_N \
    --data_set_dir /workdir/data_dir \
    fine_tune \
    --checkpoint CHECKPOINT_PATH

The fine-tuning dataset will be stored in /workdir/repo_dir/trainings/EXPERIMENT_NAME/fine_tuning_dataset.

TTS Synthesis

For an example of how to use Daft-Exprt for TTS synthesis, run the script synthesize.py:

python synthesize.py \
    --output_dir OUTPUT_DIR \
    --checkpoint CHECKPOINT

Default sentences and reference utterances are used in the script.

The script also offers the following options:

--batch_size: process batch of sentences in parallel
--real_time_factor: estimate Daft-Exprt real time factor performance given the chosen batch size
--control: perform local prosody control
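For readers unfamiliar with the real time factor, the sketch below shows one common way to compute it: synthesis time divided by the duration of the generated audio, so that a value below 1 means faster than real time. This is an assumption about the definition, not a description of what synthesize.py measures. The names `real_time_factor` and `fake_synthesize` are hypothetical, and the stand-in synthesizer only returns silence.

```python
import time


def real_time_factor(synthesis_seconds, nb_audio_samples, sampling_rate):
    """Synthesis time divided by the duration of the generated audio."""
    audio_seconds = nb_audio_samples / sampling_rate
    return synthesis_seconds / audio_seconds


def fake_synthesize(nb_samples):
    """Stand-in for a TTS call: returns a silent waveform of nb_samples."""
    return [0.0] * nb_samples


start = time.perf_counter()
waveform = fake_synthesize(nb_samples=22050 * 2)  # 2 seconds at 22.05 kHz
elapsed = time.perf_counter() - start
rtf = real_time_factor(elapsed, len(waveform), sampling_rate=22050)
print(f"RTF = {rtf:.6f}")
```

When sentences are processed in batches, the elapsed time covers the whole batch, so the audio duration in the denominator should be the total duration of all generated utterances.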

Citation

@article{Zaidi2021,
abstract = {},
journal = {arXiv},
arxivId = {2108.02271},
author = {Za{\"{i}}di, Julian and Seut{\'{e}}, Hugo and van Niekerk, Benjamin and Carbonneau, Marc-Andr{\'{e}}},
eprint = {2108.02271},
title = {{Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis}},
url = {https://arxiv.org/pdf/2108.02271.pdf},
year = {2021}
}

Contributing

Any contribution to this repository is more than welcome!
If you have any feedback, please send it to [email protected].

Comments

Error while running Pretrained model

Hi @julianzaidi, I pointed the --checkpoint argument to that file (archive/data.pkl) but got an unpickling error. Could you please explain how to run this pretrained model?

python synthesize.py --output_dir OUTPUT_DIR --checkpoint "archive/data.pkl"

Traceback (most recent call last):
  File "synthesize.py", line 148, in <module>
    file_names, refs, speaker_ids = synthesize(args, use_griffin_lim=True)
  File "synthesize.py", line 38, in synthesize
    checkpoint_dict = torch.load(args.checkpoint, map_location=f'cuda:{0}')
  File "/home/saomya/miniconda3/envs/daft_exprt/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/saomya/miniconda3/envs/daft_exprt/lib/python3.8/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified.

opened by anushvst 12

ldd version

Hi, when I run python training.py pre_process, it raises Exception: REAPER binary -- Unsupported ldd version: 2.27 < 2.29. However, I cannot update the glibc version on my machine. Are there any alternatives? Thanks!

opened by inconnu11 3

How to run the Pre-trained model

Hi @julianzaidi, we tried to run your pre-trained model. However, it is unclear to us which parameter values we need to pass, for instance which checkpoint. We also ran into CUDA out-of-memory errors. We would like to run the pre-trained model on Windows instead of Linux. How could we do this?

opened by saomya-seasia 2

Automatic aligner like in FastPitch?

Hello! Do you think it's possible to incorporate an automatic aligner as in FastPitch (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch), as described in the paper "One TTS Alignment To Rule Them All"? This aligner essentially only requires graphemes or phonemes and is learned jointly with the rest of the network. It would make it possible to skip the Montreal Forced Aligner pre-processing and decrease pre-processing time. If it's possible, what should be changed to allow the use of such an aligner?

opened by juliakorovsky 2

Position-dependent prosody transfer result

It seems that a position-dependent prosody transfer result is obtained from an utterance-level embedding. Why would location information be embedded in a representation obtained by a mean-pooling operation?

opened by hyzhan 2

np.frombuffer

Hi, when I extract the F0 using REAPER, I get the error "ValueError: buffer size must be a multiple of element size". Could you please help me out?

opened by inconnu11 0

Able to train on LJ & ESD dataset but error in training the model on custom dataset

Hi @julianzaidi @macarbonneau, hope you guys are doing well. I just want to ask a few questions about training.

I tried to train the model on my voice

Formatted the dataset successfully

In pre_processing step, got the error: ValueError: zero-size array to reduction operation minimum which has no identity

Created directories in this format: work_dir/data_dir/LJ_Speech/wavs

In the wavs folder I put around 10 audio clips, each about 2-3 minutes long

Prepared the metadata according to the instructions in the repository

Should we use short audio clips to train the model?

Any suggestions would be much appreciated.

opened by anushvst 0

Problems regarding pretrained model of the daft exprt model

Hi @julianzaidi @macarbonneau, hope you guys are doing well. I just want to ask a few questions about the model.

I want to use the model to generate audio in the voice of a hip hop artist (he passed away a few years ago), given a certain prosody in the reference voice and lyrics as the text.

I am curious about the answers to these questions, as I am trying to generate audio clips longer than 30 seconds.

When I run the pretrained model with a reference voice and text, the output sounds robotic/unnatural.

I gave my reference voice (24 sec)

Text: "Hello John, my name is Don with marketing dot com and I actually just recently came across micro soft and I thought there were some interesting things that we might be able to do together. Um, we do a lot of work in retail and I'm actually coming to New York next week for a conference. So, if you're around I would love to meet with you, buy you a cup of coffee and tell you a little bit more about what we're thinking that we can do for you. Alright, hope to see you soon."

got this output

https://user-images.githubusercontent.com/92500349/201936746-6b7760a1-fbca-465a-ab27-96ae648564a8.mp4

Also, the output voice sounds robotic or unnatural up to 18 seconds. After that, the model generates a distorted voice. Any idea what causes the distortion?

Should we give the model a short reference voice and text?

Can the model produce output longer than 1 minute, or only short utterances?

Is punctuation necessary? Also, will it work if we write "7" instead of "seven" in the text file?

I want to clarify whose voice the model produces in the output: the reference speaker's voice, or the voice of a speaker the model was trained on (LJ, ESD)?

I am still getting an unnatural (but better than before) voice after training on the LJ dataset. Any tips on how to get natural-sounding output?

Reference voice - LJ's voice

Text: Hello John, my name is Don with marketing dot com. I actually just recently came across microsoft.

The output I got was:

https://user-images.githubusercontent.com/92500349/202087380-1858ecab-b32f-4db7-9021-885a185222e0.mp4

Is it because the model architecture used to generate the audio on the demo page is different from the model architecture in the repository?

Are there any methods to reduce noise in the output voice?

opened by anushvst 0

Releases (1.0.0)

1.0.0 (Sep 10, 2021)
Release contents:

Daft-Exprt model pre-trained on the LJ Speech Dataset and the Emotional Speech Dataset from Zhou et al.

Prosody transfer examples synthesized using this pre-trained model and the Griffin-Lim algorithm

Full disclosure: The model provided in this release is not the same as in the paper evaluation. The model of the paper was trained with proprietary data, which prevents us from releasing it publicly.
Source code (tar.gz)
Source code (zip)
DaftExprt_LJ_ESD_22kHz (168.73 MB)
demo.zip (13.51 MB)

Owner

Ubisoft

Ubisoft open source projects.
