Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

Last update: Jan 05, 2023

Related tags

Overview

Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

This repo contains the official implementation of the VAE-GAN from the INTERSPEECH 2020 paper Voice Conversion Using Speech-to-Speech Neuro-Style Transfer.

Examples of generated audio using the Flickr8k Audio Corpus: https://ebadawy.github.io/post/speech_style_transfer. Note that these examples are a result of feeding audio reconstructions of this VAE-GAN to an implementation of WaveNet.

1. Data Preperation

Dataset file structure:

/path/to/database
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# The directory under each speaker cannot be nested.

Here is an example script for setting up data preparation from the Flickr8k Audio Corpus. The speakers of interest are the same as in the paper, but may be modified to other speakers if desirable.

2. Data Preprocessing

The prepared dataset is organised into a train/eval/test split, the audio is preprocessed and melspectrograms are computed.

python preprocess.py --dataset [path/to/dataset] --test-size [float] --eval-size [float]

3. Training

The VAE-GAN model uses the melspectrograms to learn style transfer between two speakers.

python train.py --model_name [name of the model] --dataset [path/to/dataset]

3.1. Visualization

By default, the code plots a batch of input and output melspectrograms every epoch. You may add --plot-interval -1 to the above command to disable it. Alternatively you may add --plot-interval 20 to plot every 20 epochs.

3.2. Saving Models

By default, models are saved every epoch. With smaller datasets than Flickr8k it may be more appropriate to save less frequently by adding --checkpoint_interval 20 for 20 epochs.

3.3. Epochs

The max number of epochs may be set with --n_epochs. For smaller datasets, you may want to increase this to more than the default 100. To load a pretrained model you can use --epoch and set it to the epoch number of the saved model.

3.4. Pretrained Model

You can access pretrained model files here. By downloading and storing them in a directory src/saved_models/pretrained, you may call it for training or inference with:

--model_name pretrained --epoch 99

Note that for inference the discriminator files D1 and D2 are not required (meanwhile for training further they are). Also here, G1 refers to the decoding generator for speaker 1 (female) and G2 for speaker 2 (male).

4. Inference

The trained VAE-GAN is used for inference on a specified audio file. It works by; sliding a window over a full melspectrogram, locally inferring melspectrogram subsamples, and averaging the overlap. The script then uses Griffin-Lim to reconstruct audio from the generated melspectrogram.

python inference.py --model_name [name of the model] --epoch [epoch number] --trg_id [id of target generator] --wav [path/to/source_audio.wav]

For achieving high quality results like the paper you can feed the reconstructed audio to trained vocoders such as WaveNet. An example pipeline of using this model with wavenet can be found here.

4.1. Directory Input

Instead of a single .wav as input you may specify a whole directory of .wav files by using --wavdir instead of --wav.

4.2. Visualization

By default, plotting input and output melspectrograms is enabled. This is useful for a visual comparison between trained models. To disable set --plot -1

4.3. Reconstructive Evaluation

Alongside the process of generating, components for reconstruction and cyclic reconstruction may be enabled by specifying the generator id of the source audio --src_id [id of source generator].

When set, SSIM metrics for reconstructed melspectrograms and cyclically reconstructed melspectrograms are computed and printed at the end of inference.

This is an extra feature to help with comparing the reconstructive capabilities of different models. The higher the SSIM, the higher quality the reconstruction.

References

Citation

If you find this code useful please cite us in your work:

@inproceedings{AlBadawy2020,
  author={Ehab A. AlBadawy and Siwei Lyu},
  title={{Voice Conversion Using Speech-to-Speech Neuro-Style Transfer}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4726--4730},
  doi={10.21437/Interspeech.2020-3056},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3056}
}

TODO:

Rewrite preprocess.py to handle:
- multi-process feature extraction
- display error messages for failed cases
Create:
- Notebook for data visualisation
Want to add something else? Please feel free to submit a PR with your changes or open an issue for that.

Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

Related tags

Overview

Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

1. Data Preperation

2. Data Preprocessing

3. Training

3.1. Visualization

3.2. Saving Models

3.3. Epochs

3.4. Pretrained Model

4. Inference

4.1. Directory Input

4.2. Visualization

4.3. Reconstructive Evaluation

References

Citation

TODO:

Owner

Ehab AlBadawy

A two-stage U-Net for high-fidelity denoising of historical recordings

STEM: An approach to Multi-source Domain Adaptation with Guarantees

Fast and simple implementation of RL algorithms, designed to run fully on GPU.

The source code for CATSETMAT: Cross Attention for Set Matching in Bipartite Hypergraphs

The Few-Shot Bot: Prompt-Based Learning for Dialogue Systems

Code basis for the paper "Camera Condition Monitoring and Readjustment by means of Noise and Blur" (2021)

Official repo for QHack—the quantum machine learning hackathon

YOLOv7 - Framework Beyond Detection

An example project demonstrating how the Autonomous Learning Library can be used to build new reinforcement learning agents.

g2o: A General Framework for Graph Optimization

Styled text-to-drawing synthesis method. Featured at the 2021 NeurIPS Workshop on Machine Learning for Creativity and Design

docTR by Mindee (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

pytorch implementation of the ICCV'21 paper "MVTN: Multi-View Transformation Network for 3D Shape Recognition"

PyTorch implementation of "A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement."

This is the official code for the paper "Tracker Meets Night: A Transformer Enhancer for UAV Tracking".

Implementation of the paper Recurrent Glimpse-based Decoder for Detection with Transformer.

Two-stage CenterNet

Scripts and a shader to get you started on setting up an exported Koikatsu character in Blender.

Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite.

Official code of Team Yao at Multi-Modal-Fact-Verification-2022