Non-Attentive-Tacotron - A PyTorch implementation of Google's Non-attentive Tacotron.

Last update: Dec 19, 2022

Overview

Non-attentive Tacotron - PyTorch Implementation

This is a PyTorch implementation of Google's Non-attentive Tacotron, a text-to-speech system. There are some minor modifications to the original paper. We use graphemes directly rather than phonemes, so we use a grapheme-based forced aligner built on Wav2vec 2.0. We also separate special characters from basic characters and embed each group separately. This project is based on NVIDIA tacotron2. Feel free to use this code.

Install

Before running the code, check that your environment has python>=3.6, torch>=1.10.1, and torchaudio>=0.10.0.
The torchaudio version requirement is strict because of recent changes in the library.
We provide the Docker image file that we used for this implementation.
Alternatively, you can install the package with the commands below:

## download the git repository
git clone https://github.com/JoungheeKim/Non-Attentive-Tacotron.git
cd Non-Attentive-Tacotron

## install python dependency
pip install -r requirements.txt

## install this implementation locally for further development
python setup.py develop
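
The version requirements listed above can be checked with a short script before training or inference. This is a minimal sketch and not part of the repository: the helper names `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical, and only the minimum versions are taken from the text above.

```python
import re
import sys

# Minimum versions quoted in the Install section.
MINIMUMS = {"torch": "1.10.1", "torchaudio": "0.10.0"}


def version_tuple(version):
    """Turn a string such as '1.10.1+cu113' into (1, 10, 1).

    Local build tags after '+' and anything past the third number are ignored.
    """
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("python>=3.6 is required")
    for name, minimum in MINIMUMS.items():
        try:
            module = __import__(name)
        except ImportError:
            problems.append("{} is not installed".format(name))
            continue
        if not meets_minimum(module.__version__, minimum):
            problems.append(
                "{}>={} is required, found {}".format(name, minimum, module.__version__)
            )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

An empty output means the interpreter and both packages satisfy the stated minimums.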

Quickstart

Install the package.
Download a pretrained Tacotron model from the links below:
- LJSpeech-1.1 (English, single female speaker)
  - trained for 40,000 steps (batch size 32, 8 accumulation steps) [LINK]
- KSS Dataset (Korean, single female speaker)
  - trained for 40,000 steps (batch size 32, 8 accumulation steps) [LINK]
  - trained for 110,000 steps (batch size 32, 8 accumulation steps) [LINK]
Download the pretrained VocGAN vocoder that corresponds to the Tacotron model from this [LINK]
Run the Python code below:

## import library
from tacotron import get_vocgan
from tacotron.model import NonAttentiveTacotron
from tacotron.tokenizer import BaseTokenizer
import torch

## set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## set pretrained model path
generator_path = '???'
tacotron_path = '???'

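## load generator model and move it to the selected device
generator = get_vocgan(generator_path)
generator.to(device)
generator.eval()

## load tacotron model and move it to the same device as the inputs
tacotron = NonAttentiveTacotron.from_pretrained(tacotron_path)
tacotron.to(device)
tacotron.eval()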

## load tokenizer
tokenizer = BaseTokenizer.from_pretrained(tacotron_path)

## Inference
text = 'This is a non attentive tacotron.'
encoded_text = tokenizer.encode(text)
encoded_torch_text = {key: torch.tensor(item, dtype=torch.long).unsqueeze(0).to(device) for key, item in encoded_text.items()}

with torch.no_grad():
    ## make log mel-spectrogram
    tacotron_output = tacotron.inference(**encoded_torch_text)
    
    ## make audio
    audio = generator.generate_audio(**tacotron_output)

More details are available in our tutorials.

Preprocess & Train

1. Download Dataset

First, download the dataset you want to train on.
We tested our code on the LJSpeech-1.1 and KSS ver 1.4 datasets.

2. Build Forced Aligned Information.

Non-Attentive Tacotron is a duration-based model, so alignment information between graphemes and audio is essential.
We build the alignment information using Wav2vec 2.0, released in fairseq.
We also provide a pretrained Wav2vec 2.0 model for Korean at this [LINK].
The Korean Wav2vec 2.0 model was trained on the AIHub Korean dialog dataset to produce grapheme-based predictions, as described in K-Wav2vec 2.0.
The English model is downloaded automatically when you run the code.
Run the commands below:

## 1. LJSpeech example
## set your audio path and script path (examples below)
AUDIO_PATH=/code/gitRepo/data/LJSpeech-1.1/wavs
SCRIPT_PATH=/code/gitRepo/data/LJSpeech-1.1/metadata.csv

## ljspeech forced aligner
## check config options in [configs/preprocess_ljspeech.yaml]
python build_aligned_info.py \
    base.audio_path=${AUDIO_PATH} \
    base.script_path=${SCRIPT_PATH} \
    --config-name preprocess_ljspeech
    
    
## 2. KSS Dataset 
## set your audio path, script path, and pretrained Wav2vec 2.0 model (examples below)
AUDIO_PATH=/code/gitRepo/data/kss
SCRIPT_PATH=/code/gitRepo/data/kss/transcript.v.1.4.txt
PRETRAINED_WAV2VEC=korean_wav2vec2

## kss forced aligner
## check config options in [configs/preprocess_kss.yaml]
python build_aligned_info.py \
    base.audio_path=${AUDIO_PATH} \
    base.script_path=${SCRIPT_PATH} \
    base.pretrained_model=${PRETRAINED_WAV2VEC} \
    --config-name preprocess_kss

We also provide our preprocessed forced-aligned files for the KSS ver 1.4 and LJSpeech-1.1 datasets.
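
To illustrate why a duration-based model needs this alignment, the sketch below turns a frame-level alignment into one duration per grapheme. It is an assumption-laden illustration: the list `frame_to_char` and the function `durations_from_alignment` are hypothetical, and the actual file format written by build_aligned_info.py is not shown in this document.

```python
from collections import Counter


def durations_from_alignment(frame_to_char, num_chars):
    """Count how many audio frames are aligned to each character position.

    frame_to_char: one entry per audio frame, holding the index of the
                   character in the transcript that the frame is aligned to
                   (hypothetical format, assumed for this example).
    num_chars:     length of the transcript.
    """
    counts = Counter(frame_to_char)
    # Characters that received no frame get a duration of 0.
    return [counts.get(index, 0) for index in range(num_chars)]


text = "hi!"
# Nine frames: three for 'h', two for 'i', four for '!'.
frame_to_char = [0, 0, 0, 1, 1, 2, 2, 2, 2]

durations = durations_from_alignment(frame_to_char, len(text))
print(durations)  # [3, 2, 4]
```

The durations always sum to the number of frames, which is the property a duration predictor is trained to reproduce.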

3. Train & Evaluate

We recommend downloading the pre-trained vocoder before training the Non-Attentive Tacotron model, so that model performance can be evaluated during the training phase.
You can download the pre-trained VocGAN from this [LINK].
We have only tested our code on a single GPU, such as a 2080 Ti or TITAN RTX.
The robotic sounds disappeared when I used a batch size of 32 with 8 gradient accumulation steps, which corresponds to an effective batch size of 256.
Run the commands below:

## 1. LJSpeech example
## set your generator (vocoder) path and save path (examples below)
GENERATOR_PATH=checkpoints_g/ljspeech_29de09d_4000.pt
SAVE_PATH=results/ljspeech

## train ljspeech non-attentive tacotron
## check config options in [configs/train_ljspeech.yaml]
python train.py \
    base.generator_path=${GENERATOR_PATH} \
    base.save_path=${SAVE_PATH} \
    --config-name train_ljspeech
  
  
    
## 2. KSS Dataset   
## set your generator (vocoder) path and save path (examples below)
GENERATOR_PATH=checkpoints_g/vocgan_kss_pretrained_model_epoch_4500.pt
SAVE_PATH=results/kss

## train kss non-attentive tacotron
## check config options in [configs/train_kss.yaml]
python train.py \
    base.generator_path=${GENERATOR_PATH} \
    base.save_path=${SAVE_PATH} \
    --config-name train_kss

Parameter information is stored in tacotron/configs.py
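
The note about batch size 32 with 8 accumulation steps can be made concrete with a small framework-free example. It is a sketch under simple assumptions (a one-parameter linear model and mean squared error, both chosen only for illustration) and is not taken from train.py. It shows that averaging the gradients of 8 micro-batches of 32 samples gives the same gradient as one batch of 256. In PyTorch, the usual equivalent is to divide each micro-batch loss by the number of accumulation steps, call `loss.backward()` on every micro-batch, and call `optimizer.step()` once per 8 micro-batches.

```python
import random


def mse_gradient(weight, batch):
    """Gradient of the mean squared error of y = weight * x over one batch."""
    return sum(2.0 * (weight * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(weight, data, micro_batch_size, accumulation_steps):
    """Average the gradients of consecutive micro-batches."""
    total = 0.0
    for step in range(accumulation_steps):
        start = step * micro_batch_size
        micro_batch = data[start:start + micro_batch_size]
        # Dividing by the number of steps turns the sum into an average.
        total += mse_gradient(weight, micro_batch) / accumulation_steps
    return total


random.seed(0)
inputs = [random.uniform(-1.0, 1.0) for _ in range(256)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.1)) for x in inputs]

full = mse_gradient(0.5, data)
accumulated = accumulated_gradient(0.5, data, micro_batch_size=32, accumulation_steps=8)

# The two gradients agree up to floating-point rounding.
print(abs(full - accumulated) < 1e-9)
```

Accumulation trades extra forward and backward passes for lower memory use, which is why it fits on a single 2080 Ti or TITAN RTX.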

Audio Examples

Language	Text with Accent(bold)	Audio Sample
Korean	이 타코트론은 잘 작동한다.	Sample
Korean	이 타코트론은 잘 작동한다.	Sample
Korean	이 타코트론은 잘 작동한다.	Sample
Korean	이 타코트론은 잘 작동한다.	Sample

Forced Aligned Information Examples

ToDo

Training sometimes fails with torch NaN errors (help wanted).
Remove robotic sounds in synthesized audio.

Owner

Jounghee Kim
