Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Last update: Dec 25, 2022

Overview

EdiTTS: Score-based Editing for Controllable Text-to-Speech

Audio samples are available on our demo page.

Abstract

We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows for targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model, while applying masks and softening kernels to ensure that iterative edits are applied only to the target region. Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
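To make the idea of masks and softening kernels concrete, here is a minimal, dependency-free sketch. It is not the code in this repository: the Gaussian kernel, the function names (`gaussian_kernel`, `soften_mask`, `blend`), and the toy frame values are all assumptions chosen for illustration. It shows how a hard 0/1 mask over frames could be smoothed so that an edit fades in and out at the boundaries of the target region instead of switching on abruptly.

```python
import math


def gaussian_kernel(radius: int, sigma: float) -> list[float]:
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [
        math.exp(-(i * i) / (2.0 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def soften_mask(mask: list[float], kernel: list[float]) -> list[float]:
    """Smooth a hard 0/1 mask by convolving it with the kernel.

    Frames past either end are treated as copies of the edge frame.
    """
    radius = len(kernel) // 2
    last = len(mask) - 1
    return [
        sum(
            k * mask[min(max(t + j - radius, 0), last)]
            for j, k in enumerate(kernel)
        )
        for t in range(len(mask))
    ]


def blend(
    original: list[float], edited: list[float], soft_mask: list[float]
) -> list[float]:
    """Mix per-frame values; the edit dominates where the mask is near 1."""
    return [m * e + (1.0 - m) * o for o, e, m in zip(original, edited, soft_mask)]


# Hypothetical toy values: 10 frames, with frames 3..6 marked for editing.
hard_mask = [0.0] * 3 + [1.0] * 4 + [0.0] * 3
soft_mask = soften_mask(hard_mask, gaussian_kernel(radius=2, sigma=1.0))

original = [0.0] * 10  # stand-in for unedited per-frame values
edited = [1.0] * 10    # stand-in for perturbed per-frame values
blended = blend(original, edited, soft_mask)
```

Frames far from the marked region keep their original values, while frames near its boundary receive a weighted mix of the two.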

Citation

Please cite this work as follows.

@misc{tae&kim2021editts,
      title={EdiTTS: Score-based Editing for Controllable Text-to-Speech}, 
      author={Jaesung Tae and Hyeongju Kim and Taesu Kim},
      year={2021}
}

Setup

Create a Python virtual environment (venv or conda) and install package requirements as specified in requirements.txt.
```
python -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```

Build the monotonic alignment module.

cd model/monotonic_align
python setup.py build_ext --inplace

For more information, refer to the official repository of Grad-TTS.

Checkpoints

Checkpoints are already included as part of this repository, under checkpts. The inference commands below load checkpts/grad-tts-old.pt.

Pitch Shifting

Prepare an input file containing samples for speech generation. Enclose the segment to be edited between vertical bar separators, |. For instance, a single sample might look like

In | the face of impediments confessedly discouraging |

We provide a sample input file in resources/filelists/edit_pitch_example.txt.
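As a quick sanity check on the input format, the sketch below splits one line into the text before, inside, and after the marked segment. It is only an illustration of the format described above, not how edit_pitch.py reads its input; the helper name `split_marked_segment` and the assumption that every line contains exactly two separators are ours.

```python
def split_marked_segment(line: str, sep: str = "|") -> tuple[str, str, str]:
    """Split a line into (before, marked, after) around two separators."""
    parts = [part.strip() for part in line.split(sep)]
    if len(parts) != 3:
        raise ValueError(
            f"expected exactly two {sep!r} separators, got {len(parts) - 1}"
        )
    before, marked, after = parts
    return before, marked, after


before, marked, after = split_marked_segment(
    "In | the face of impediments confessedly discouraging |"
)
print(repr(before), repr(marked), repr(after))
```

For the sample line, the marked segment runs to the end of the sentence, so the trailing part is empty.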

To run inference, type

CUDA_VISIBLE_DEVICES=0 python edit_pitch.py \
    -f resources/filelists/edit_pitch_example.txt \
    -c checkpts/grad-tts-old.pt -t 1000 \
    -s out/pitch/wavs

Adjust CUDA_VISIBLE_DEVICES as appropriate.

Content Replacement

Prepare an input file containing pairs of sentences. Join the two sentences of each pair with # and, in each sentence, enclose the part to be replaced between vertical bar separators. For instance, a single pair might look like

Three others subsequently | identified | Oswald from a photograph. #Three others subsequently | recognized | Oswald from a photograph.

We provide a sample input file in resources/filelists/edit_content_example.txt.
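The pair format can be checked with a few lines of Python before running inference. This is a sketch under stated assumptions, not the parser used by edit_content.py: the helper names `split_marked` and `parse_pair` are hypothetical, and we assume exactly one # per line and exactly two vertical bars per sentence.

```python
Segments = tuple[str, str, str]


def split_marked(sentence: str) -> Segments:
    """Split a sentence into (before, marked, after) around two '|' bars."""
    parts = [part.strip() for part in sentence.split("|")]
    if len(parts) != 3:
        raise ValueError("expected exactly two '|' separators per sentence")
    return parts[0], parts[1], parts[2]


def parse_pair(line: str) -> tuple[Segments, Segments]:
    """Split a line on '#' into the source and target sentences."""
    sentences = line.split("#")
    if len(sentences) != 2:
        raise ValueError("expected exactly one '#' between the two sentences")
    return split_marked(sentences[0]), split_marked(sentences[1])


source, target = parse_pair(
    "Three others subsequently | identified | Oswald from a photograph. "
    "#Three others subsequently | recognized | Oswald from a photograph."
)
print(source[1], "->", target[1])
```

In the sample pair, the text outside the bars is identical in both sentences and only the marked word differs.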

To run inference, type

CUDA_VISIBLE_DEVICES=0 python edit_content.py \
    -f resources/filelists/edit_content_example.txt \
    -c checkpts/grad-tts-old.pt -t 1000 \
    -s out/content/wavs

License

Released under the modified GNU General Public License.

Owner

Neosapience
