Refactored version of FastSpeech2

Last update: May 26, 2022

Overview

FastSpeech2

This repository is a refactored version of ming024's FastSpeech2 implementation. I focused on restructuring the code to fit my use cases and on parallelizing the pre-processing code. I also wrote an installation guide for the latest version of MFA (Montreal Forced Aligner).

Installation

Tested on Python 3.8, Ubuntu 20.04
- Note: installing MFA requires Miniconda.
- If you run MFA on Ubuntu 16.04 or earlier, you will hit a compile error.
In your system
- To install pyworld, run "sudo apt-get install python3.x-dev" (x is your Python minor version).
- To install sndfile, run "sudo apt-get install libsndfile-dev".
- To use MFA, run "sudo apt-get install libopenblas-base".
Install requirements

# install pytorch_sound
pip install git+https://github.com/appleholic/pytorch_sound
# install this repository in editable mode (run from the repository root)
pip install -e .

Download datasets

VCTK
- Download the dataset from https://datashare.is.ed.ac.uk/handle/10283/2651
- Move the archive to "./data" and extract it.
  - If you want to store the dataset in another directory, you must change the paths in the configuration files.
LibriTTS
- To be updated
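
If the VCTK dataset lives outside "./data", the paths in the configuration files have to be updated, as noted above. The following is a minimal sketch of one way to do that, assuming the configs are plain JSON and that dataset paths appear as string values beginning with a common prefix such as "data/". The key names inside the configs are not documented here, so the helper rewrites every matching string instead of targeting specific keys; the function names are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def rewrite_paths(node, old_root, new_root):
    """Recursively replace a leading old_root in every string value."""
    if isinstance(node, dict):
        return {key: rewrite_paths(value, old_root, new_root) for key, value in node.items()}
    if isinstance(node, list):
        return [rewrite_paths(item, old_root, new_root) for item in node]
    if isinstance(node, str) and node.startswith(old_root):
        return new_root + node[len(old_root):]
    # Numbers, booleans, None and non-matching strings are kept as-is.
    return node


def relocate_config(src, dst, old_root, new_root):
    """Write a copy of the JSON config at src to dst with rewritten paths."""
    config = json.loads(Path(src).read_text(encoding="utf-8"))
    updated = rewrite_paths(config, old_root, new_root)
    Path(dst).write_text(json.dumps(updated, indent=2), encoding="utf-8")
    return updated
```

For example, relocate_config("configs/vctk_preprocess.json", "my_preprocess.json", "data/", "/mnt/datasets/") would write a modified copy (the output file name and the new root are placeholders). Review the result before using it, since any string that happens to start with the prefix is rewritten too.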

Install MFA
- Follow the guide on the MFA installation website.
- Additional installation steps
  - mfa thirdparty download
  - mfa download acoustic english
Pre-trained checkpoint
- VCTK, 400k steps : Google Drive Link

Preprocess (VCTK case)

Prepare MFA

python fastspeech2/scripts/prepare_align.py configs/vctk_prepare_align.json

Run MFA for making alignments

# Set the number of threads for MFA at the end of the command: "-j [number of threads]"
mfa align data/fastspeech2/vctk lexicons/librispeech-lexicon.txt english data/fastspeech2/vctk-pre -j 24
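
The thread count above is hard-coded to 24. As a minimal sketch, the same command can be assembled in Python so that "-j" follows the number of CPUs on the machine. The argument order mirrors the command shown above; the function names are hypothetical and not part of this repository, and the sketch assumes "mfa" is on the PATH of the active conda environment.

```python
import os
import subprocess


def build_mfa_align_command(corpus_dir, lexicon_path, acoustic_model, output_dir, num_jobs=None):
    """Return the argument list for `mfa align ... -j N`."""
    if num_jobs is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        num_jobs = os.cpu_count() or 1
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    return [
        "mfa", "align",
        str(corpus_dir), str(lexicon_path), acoustic_model, str(output_dir),
        "-j", str(num_jobs),
    ]


def run_mfa_align(corpus_dir, lexicon_path, acoustic_model, output_dir, num_jobs=None):
    """Run the alignment and raise CalledProcessError if MFA exits with an error."""
    command = build_mfa_align_command(corpus_dir, lexicon_path, acoustic_model, output_dir, num_jobs)
    subprocess.run(command, check=True)
```

Calling run_mfa_align("data/fastspeech2/vctk", "lexicons/librispeech-lexicon.txt", "english", "data/fastspeech2/vctk-pre") is then equivalent to the command above with the thread count chosen automatically.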

Feature preprocessing

python fastspeech2/scripts/preprocess.py configs/vctk_preprocess.json

Train

Multi-speaker FastSpeech2

python fastspeech2/scripts/train.py configs/fastspeech2_vctk_tts.json

To change the FastSpeech2 training parameters, check the training code and set the corresponding options in the configuration file.
- train code : fastspeech2/scripts/train.py
- config : configs/fastspeech2_vctk_tts.json

FastSpeech2 with reference encoder (To be updated)

Synthesize

Multi-speaker model

In code

from fastspeech2.inference import Inferencer
from speech_interface.interfaces.hifi_gan import InterfaceHifiGAN

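# arguments
# chk_path: str, lexicon_path: str, device: str = 'cuda'
# chk_path, lexicon_path and device must be defined before this call
# (paths to your checkpoint and lexicon files, and the target device)
inferencer = Inferencer(chk_path=chk_path, lexicon_path=lexicon_path, device=device)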

# initialize hifigan
interface = InterfaceHifiGAN(model_name='hifi_gan_v1_universal', device='cuda')

# arguments
# text: str, speaker: int = 0, pitch_control: float = 1., energy_control: float = 1., duration_control: float = 1.
txt = 'Hello, I am a programmer.'
mel_spectrogram = inferencer.tts(txt, speaker=0)

# Reconstruct the waveform from the mel spectrogram with HiFi-GAN
pred_wav = interface.decode(mel_spectrogram.transpose(1, 2)).squeeze()

# If you are testing in a Jupyter notebook
from IPython.display import Audio
Audio(pred_wav.cpu().numpy(), rate=22050)
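
The comment above lists pitch_control, energy_control and duration_control as optional arguments of tts. The sketch below shows one way to compare several duration_control values for the same sentence. It relies only on the keyword names listed in that comment; the helper name is hypothetical, and how each value affects speaking rate is not documented here, so listen to the outputs to confirm.

```python
def synthesize_with_durations(inferencer, text, speaker=0, duration_controls=(0.8, 1.0, 1.2)):
    """Call inferencer.tts once per duration_control value.

    Returns a dict mapping each control value to the output of tts
    (a mel spectrogram when used with the Inferencer above).
    """
    results = {}
    for value in duration_controls:
        results[value] = inferencer.tts(text, speaker=speaker, duration_control=value)
    return results
```

Each returned mel spectrogram can then be passed to interface.decode in the same way as in the example above. The same pattern applies to pitch_control and energy_control.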

From the command line

python fastspeech2/scripts/synthesize.py [TEXT] [OUTPUT PATH] [CHECKPOINT PATH] [LEXICON PATH] [[DEVICE]] [[SPEAKER]]

Reference encoder (not updated)

Reference

ming024/FastSpeech2

Owner

ILJI CHOI
