Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Last update: Mar 11, 2022

Related tags

Deep Learning hifi-ecg

Overview

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.

Quick Links

Samples
Setup
Training
Inference

Setup

Software

Requirements:

Python >= 3.6
PyTorch v1.8

Install dependencies

git clone https://github.com/facebookresearch/speech-resynthesis.git
cd speech-resynthesis
pip install -r requirements.txt

Data

For LJSpeech:

Download LJSpeech dataset from here into data/LJSpeech-1.1 folder.

Downsample audio from 22.05 kHz to 16 kHz and pad

bash
python ./scripts/preprocess.py \
--srcdir data/LJSpeech-1.1/wavs \
--outdir data/LJSpeech-1.1/wavs_16khz \
--pad

For VCTK:

Download VCTK dataset from here into data/VCTK-Corpus folder.

Downsample audio from 48 kHz to 16 kHz, trim trailing silences and pad

python ./scripts/preprocess.py \
--srcdir data/VCTK-Corpus/wav48_silence_trimmed \
--outdir data/VCTK-Corpus/wav16_silence_trimmed_padded \
--pad --postfix mic2.flac

Training

F0 Quantizer Model

To train F0 quantizer model, use the following command:

python -m torch.distributed.launch --nproc_per_node 8 train_f0_vq.py \
--checkpoint_path checkpoints/lj_f0_vq \
--config configs/LJSpeech/f0_vqvae.json

Set to the number of availalbe GPUs on your machine.

Resynthesis Model

To train a resynthesis model, use the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/lj_vqvae \
--config configs/LJSpeech/vqvae256_lut.json

Supported Configurations

Currently, we support the following training schemes:

Dataset	SSL Method	Dictionary Size	Config Path
LJSpeech	HuBERT	100	`configs/LJSpeech/hubert100_lut.json`
LJSpeech	CPC	100	`configs/LJSpeech/cpc100_lut.json`
LJSpeech	VQVAE	256	`configs/LJSpeech/vqvae256_lut.json`
VCTK	HuBERT	100	`configs/VCTK/hubert100_lut.json`
VCTK	CPC	100	`configs/VCTK/cpc100_lut.json`
VCTK	VQVAE	256	`configs/VCTK/vqvae256_lut.json`

Inference

To generate, simply run:

python inference.py \
--checkpoint_file checkpoints/vctk_cpc100 \
-n 10 \
--output_dir generations

To synthesize multiple speakers:

python inference.py \
--checkpoint_file checkpoints/vctk_cpc100 \
-n 10 \
--vc \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_multispkr

You can also generate with codes from a different dataset:

python inference.py \
--checkpoint_file checkpoints/lj_cpc100 \
-n 10 \
--input_code_file datasets/VCTK/cpc100/test.txt \
--output_dir generations_vctk_to_lj

Preprocessing New Datasets

CPC / HuBERT Coding

To quantize new datasets with CPC or HuBERT follow the instructions described in the GSLM code.

To parse CPC output:

python scripts/parse_cpc_codes.py \
--manifest cpc_output_file \
--wav-root wav_root_dir \
--outdir parsed_cpc

To parse HuBERT output:

python parse_hubert_codes.py \
--codes hubert_output_file \
--manifest hubert_tsv_file \
--outdir parsed_hubert

VQVAE Coding

First, you will need to download LibriLight dataset and move it to data/LibriLight.

For VQVAE, train a vqvae model using the following command:

python -m torch.distributed.launch --nproc_per_node <NUM_GPUS> train.py \
--checkpoint_path checkpoints/ll_vq \
--config configs/LibriLight/vqvae256.json

To extract VQVAE codes:

python infer_vqvae_codes.py \
--input_dir folder_with_wavs_to_code \
--output_dir vqvae_output_folder \
--checkpoint_file checkpoints/ll_vq

To parse VQVAE output:

 python parse_vqvae_codes.py \
 --manifest vqvae_output_file \
 --outdir parsed_vqvae

License

You may find out more about the license here.

Citation

@inproceedings{polyak21_interspeech,
  author={Adam Polyak and Yossi Adi and Jade Copet and 
          Eugene Kharitonov and Kushal Lakhotia and 
          Wei-Ning Hsu and Abdelrahman Mohamed and Emmanuel Dupoux},
  title={{Speech Resynthesis from Discrete Disentangled Self-Supervised Representations}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
}

Acknowledgements

This implementation uses code from the following repos: HiFi-GAN and Jukebox, as described in our code.

Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Related tags

Overview

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Quick Links

Setup

Software

Data

For LJSpeech:

For VCTK:

Training

F0 Quantizer Model

Resynthesis Model

Supported Configurations

Inference

Preprocessing New Datasets

CPC / HuBERT Coding

VQVAE Coding

License

Citation

Acknowledgements

Owner

unofficial pytorch implementation of RefineGAN

Breaching - Breaching privacy in federated learning scenarios for vision and text

Source code, datasets and trained models for the paper Learning Advanced Mathematical Computations from Examples (ICLR 2021), by François Charton, Amaury Hayat (ENPC-Rutgers) and Guillaume Lample

Official repository for "Restormer: Efficient Transformer for High-Resolution Image Restoration". SOTA results for single-image motion deblurring, image deraining, image denoising (synthetic and real data), and dual-pixel defocus deblurring.

Points2Surf: Learning Implicit Surfaces from Point Clouds (ECCV 2020 Spotlight)

[arXiv] What-If Motion Prediction for Autonomous Driving ❓🚗💨

Let's create a tool to convert Thailand budget from PDF to CSV.

PyTorch Implementation of Fully Convolutional Networks. (Training code to reproduce the original result is available.)

Code release for SLIP Self-supervision meets Language-Image Pre-training

The AugNet Python module contains functions for the fast computation of image similarity.

U^2-Net - Portrait matting This repository explores possibilities of using the original u^2-net model for portrait matting.

SASM - simple crossplatform IDE for NASM, MASM, GAS and FASM assembly languages

PyTorch common framework to accelerate network implementation, training and validation

Neural Caption Generator with Attention

List some popular DeepFake models e.g. DeepFake, FaceSwap-MarekKowal, IPGAN, FaceShifter, FaceSwap-Nirkin, FSGAN, SimSwap, CihaNet, etc.

Efficient Speech Processing Tookit for Automatic Speaker Recognition

Analysis of Antarctica sequencing samples contaminated with SARS-CoV-2

Hand Gesture Volume Control | Open CV | Computer Vision

The 2nd place solution of 2021 google landmark retrieval on kaggle.

Miscellaneous and lightweight network tools