Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

Last update: Dec 25, 2022

Overview

MediumVC

MediumVC is an utterance-level method for any-to-any (A2A) voice conversion (VC). We first propose SingleVC to perform any-to-one (A2O) VC (X_i → Ŷ_i, where X_i denotes utterance i spoken by speaker X). The Ŷ_i are treated as synthetic specific-speaker speeches used as intermedium features (SSIF). To build SingleVC, we employ a novel data augmentation strategy, pitch-shifted and duration-remained (PSDR), to produce paired asymmetrical training data. Then, building on the pre-trained SingleVC, MediumVC performs an asymmetrical reconstruction task (Ŷ_i → X̂_i). Owing to this asymmetrical reconstruction mode, MediumVC achieves more efficient feature decoupling and fusion. Experiments demonstrate that MediumVC is robust to unseen speakers across multiple public datasets. This repository is the official implementation of the paper MediumVC.
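The idea behind PSDR (change the pitch, keep the duration) can be illustrated with a crude sketch. This is not the augmentation code used in the paper, which this README does not describe. It assumes a plain overlap-add time stretch followed by linear resampling back to the original length, and all function names and the `frame` parameter are hypothetical. Real pipelines would normally use a WSOLA or phase-vocoder based tool for better quality.

```python
import math


def hann(n):
    # Periodic Hann window; at 50% overlap neighbouring windows sum to 1.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def ola_time_stretch(signal, rate, frame=512):
    """Make the signal about `rate` times longer with plain overlap-add.

    Pitch is only roughly preserved (no phase alignment between frames).
    `frame` should be even so that the windows overlap by exactly 50%.
    """
    hop_out = frame // 2          # synthesis hop (fixed)
    hop_in = hop_out / rate       # analysis hop (smaller when stretching)
    window = hann(frame)
    n_frames = max(1, int((len(signal) - frame) / hop_in) + 1)
    out = [0.0] * ((n_frames - 1) * hop_out + frame)
    for k in range(n_frames):
        start_in = int(round(k * hop_in))
        start_out = k * hop_out
        for i in range(frame):
            j = start_in + i
            if j < len(signal):
                out[start_out + i] += signal[j] * window[i]
    return out


def resample_linear(signal, target_len):
    """Resample to exactly `target_len` samples by linear interpolation."""
    if target_len <= 0 or not signal:
        return []
    if target_len == 1 or len(signal) == 1:
        return [signal[0]] * target_len
    scale = (len(signal) - 1) / (target_len - 1)
    out = []
    for n in range(target_len):
        pos = n * scale
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1.0 - frac) + signal[hi] * frac)
    return out


def pitch_shift_keep_duration(signal, semitones, frame=512):
    """Shift pitch by `semitones` while returning len(signal) samples."""
    rate = 2.0 ** (semitones / 12.0)
    stretched = ola_time_stretch(signal, rate, frame)
    # Squeezing the stretched signal back to the original length raises
    # (or lowers) every frequency by roughly `rate`.
    return resample_linear(stretched, len(signal))
```

Calling `pitch_shift_keep_duration(samples, 4)` on a list of float samples returns a list of the same length whose pitch is raised by roughly four semitones, which is the property a source/target training pair needs to stay aligned frame by frame.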

The overall model architecture is shown below.

For audio samples, please refer to our demo page. More converted speech samples can be found in "Demo/ConvertedSpeeches/".

Envs

You can install the dependencies with

pip install -r requirements.txt

Speaker Encoder

Dvector is a robust speaker verification (SV) system pre-trained on VoxCeleb1 with the GE2E loss; it produces 256-dimensional speaker embeddings. We evaluated it on multiple datasets (VCTK with 30,000 pairs, LibriSpeech with 30,000 pairs, and VCC2020 with 10,000 pairs); the equal error rates (EERs) and thresholds (THRs) are recorded in the table below. Dvector with these THRs is also used to calculate the SV accuracy (ACC) of pairs produced by MediumVC and the other compared methods for objective evaluation. See the paper for more details.

Dataset	VCTK	LibriSpeech	VCC2020
EER(%)/THR	7.71/0.462	7.95/0.337	1.06/0.432
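A minimal sketch of how a threshold from the table above could turn embedding pairs into an SV accuracy. It assumes that the score is the cosine similarity between two speaker embeddings and that ACC is the fraction of (converted, target) pairs whose score reaches the threshold; the README does not spell out either detail, and the function names are hypothetical.

```python
import math

# Thresholds (THR) copied from the table above.
THRESHOLDS = {"VCTK": 0.462, "LibriSpeech": 0.337, "VCC2020": 0.432}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sv_accuracy(pairs, threshold):
    """Fraction of (converted_emb, target_emb) pairs accepted as the same speaker.

    Assumption: a pair is accepted when its cosine similarity is >= threshold.
    """
    if not pairs:
        return 0.0
    accepted = sum(1 for conv, tgt in pairs if cosine_similarity(conv, tgt) >= threshold)
    return accepted / len(pairs)
```

In practice the embeddings would be the 256-dimensional Dvector outputs for the converted utterance and a reference utterance of the target speaker, and the threshold would be the one matching the evaluation dataset, for example `THRESHOLDS["VCTK"]`.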

Vocoder

The HiFi-GAN vocoder is employed to convert log mel-spectrograms to waveforms. The model has 13.93M parameters and is trained on universal datasets. In our evaluation, it synthesizes 22.05 kHz high-fidelity speech with a MOS above 4.0, even in cross-language or noisy conditions.

Infer

Download the pretrained model, then edit "Any2Any/infer/infer_config.yaml". Test samples should be organized as "wav22050/$figure$/*.wav".

python Any2Any/infer/infer.py
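Before running inference it can help to confirm that the test samples follow the expected layout. The sketch below assumes that "$figure$" stands for one sub-directory per speaker; the helper name `group_wavs_by_speaker` is hypothetical and is not part of this repository.

```python
from collections import defaultdict
from pathlib import Path


def group_wavs_by_speaker(root):
    """Map each first-level sub-directory name to its sorted list of .wav paths.

    Only files matching <root>/<subdir>/*.wav are collected; .wav files placed
    directly in <root> or nested deeper are ignored.
    """
    groups = defaultdict(list)
    for wav in sorted(Path(root).glob("*/*.wav")):
        groups[wav.parent.name].append(wav)
    return dict(groups)
```

For example, `group_wavs_by_speaker("wav22050")` returns a dictionary whose keys are the sub-directory names, so an empty result or a missing key points to a misplaced folder.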

Train from scratch

Preprocessing

The corpus should be organized as "VCTK22050/$figure$/*.wav". Then edit the config file "Any2Any/pre_feature/preprocess_config.yaml". The output "spk_emb_mel_label.pkl" will be used for training.

python Any2Any/pre_feature/figure_spkemb_mel.py
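The directory name "VCTK22050" and the 22.05 kHz vocoder suggest that the corpus is expected at a 22,050 Hz sample rate, although the README does not state this explicitly. Under that assumption, the following sketch lists files with a different rate using only the standard library; `find_wrong_rate` is a hypothetical helper, and the `wave` module reads PCM WAV files only.

```python
import wave
from pathlib import Path


def find_wrong_rate(root, expected_rate=22050):
    """Return (path, rate) for every <root>/<subdir>/*.wav not at expected_rate."""
    mismatches = []
    for path in sorted(Path(root).glob("*/*.wav")):
        # wave.open raises wave.Error for non-PCM files (for example float WAV).
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatches.append((path, rate))
    return mismatches
```

An empty list from `find_wrong_rate("VCTK22050")` means every readable file already matches the expected rate; any listed file would need resampling before preprocessing.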

Training

First, edit the paths of the pretrained HiFi-GAN model, wav2mel, Dvector, and SingleVC in the config file "Any2Any/config.yaml".

python Any2Any/solver.py


Owner

谷下雨
