Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

Overview

MLP Singer

Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis. Audio samples are available on our demo page.

Abstract

Recent developments in deep learning have significantly improved the quality of synthesized singing voice audio. However, prominent neural singing voice synthesis systems suffer from slow inference speed due to their autoregressive design. Inspired by MLP-Mixer, a novel architecture introduced in the vision literature for attention-free image classification, we propose MLP Singer, a parallel Korean singing voice synthesis system. To the best of our knowledge, this is the first work that uses an entirely MLP-based architecture for voice synthesis. Listening tests demonstrate that MLP Singer outperforms a larger autoregressive GAN-based system, both in terms of audio quality and synthesis speed. In particular, MLP Singer achieves a real-time factor of up to 200 and 3400 on CPUs and GPUs respectively, enabling order of magnitude faster generation on both environments.

Citation

Please cite this work as follows.

@misc{tae2021mlp,
      title={MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis}, 
      author={Jaesung Tae and Hyeongju Kim and Younggun Lee},
      year={2021},
}

Quickstart

  1. Clone the repository including the git submodule.

    git clone --recurse-submodules https://github.com/neosapience/mlp-singer.git
  2. Install package requirements.

cd mlp-singer
pip install -r requirements.txt
  1. To generate audio files with the trained model checkpoint, download the HiFi-GAN checkpoint along with its configuration file and place them in hifi-gan.

  2. Run inference using the following command. Generated audio samples are saved in the samples directory by default.

    python inference.py --checkpoint_path checkpoints/default/model.pt

Dataset

We used the Children Song Dataset, an open-source singing voice dataset comprised of 100 annotated Korean and English children songs sung by a single professional singer. We used only the Korean subset of the dataset to train the model.

You can train the model on any custom dataset of your choice, as long as it includes lyrics text, midi transcriptions, and monophonic a capella audio file triplets. These files should be titled identically, and should also be placed in specific directory locations as shown below.

├── data
│   └── raw
│       ├── mid
│       ├── txt
│       └── wav

The directory names correspond to file extensions. We have included a sample as reference.

Preprocessing

Once you have prepared the dataset, run

python -m data.serialize

from the root directory. This will create data/bin that contains binary files used for training. This repository already contains example binary files created from the sample in data/raw.

Training

To train the model, run

python train.py

This will read the default configuration file located in configs/model.json to initialize the model. Alternatively, you can also create a new configuration and train the model via

python train.py --config_path PATH/TO/CONFIG.json

Running this command will create a folder under the checkpoints directory according to the name field specified in the configuration file.

You can also continue training from a checkpoint. For example, to resume training from the provided pretrained model checkpoint, run

python train.py --checkpoint_path /checkpoints/default/model.pt

Unless a --config_path flag is explicitly provided, the script will read config.json in the checkpoint directory. In both cases, model checkpoints will be saved regularly according to the interval defined in the configuration file.

Inference

MLP Singer produces mel-spectrograms, which are then fed into a neural vocoder to generate raw waveforms. This repository uses HiFi-GAN as the vocoder backend, but you can also plug other vocoders like WaveGlow. To generate samples, run

python inference.py --checkpoint_path PATH/TO/CHECKPOINT.pt --song little_star

This will create .wav samples in the samples directory, and save mel-spectrogram files as .npy files in hifi-gan/test_mel_dirs.

You can also specify any song you want to perform inference on, as long as the song is present in data/raw. The argument to the --song flag should match the title of the song as it is saved in data/raw.

Note

For demo and internal experiments, we used a variant of HiFi-GAN that used different mel-spectrogram configurations. As such, the provided checkpoint for MLP Singer is different from the one referred to in the paper. Moreover, the vocoder used in the demo was further fine-tuned on the Children's Song Dataset.

Acknowledgements

This implementation was inspired by the following repositories.

License

Released under the MIT License.

Owner
Neosapience
Neosapience, an artificial being enabled by artificial intelligence, will soon be everywhere in our daily lives.
Neosapience
An automated program that helps customers of Pizza Palour place their pizza orders

PIzza_Order_Assistant Introduction An automated program that helps customers of Pizza Palour place their pizza orders. The program uses voice commands

Tindi Sommers 1 Dec 26, 2021
Repository for Graph2Pix: A Graph-Based Image to Image Translation Framework

Graph2Pix: A Graph-Based Image to Image Translation Framework Installation Install the dependencies in env.yml $ conda env create -f env.yml $ conda a

18 Nov 17, 2022
Code for Findings at EMNLP 2021 paper: "Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning"

Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning This repo is for Findings at EMNLP 2021 paper: Learn Cont

INK Lab @ USC 6 Sep 02, 2022
NLP made easy

GluonNLP: Your Choice of Deep Learning for NLP GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you l

Distributed (Deep) Machine Learning Community 2.5k Jan 04, 2023
A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

GuwenModels: 古文自然语言处理模型合集, 收录互联网上的古文相关模型及资源. A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Ethan 66 Dec 26, 2022
A minimal code for fairseq vq-wav2vec model inference.

vq-wav2vec inference A minimal code for fairseq vq-wav2vec model inference. Runs without installing the fairseq toolkit and its dependencies. Usage ex

Vladimir Larin 7 Nov 15, 2022
Textlesslib - Library for Textless Spoken Language Processing

textlesslib Textless NLP is an active area of research that aims to extend NLP t

Meta Research 379 Dec 27, 2022
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
Official code for "Parser-Free Virtual Try-on via Distilling Appearance Flows", CVPR 2021

Parser-Free Virtual Try-on via Distilling Appearance Flows, CVPR 2021 Official code for CVPR 2021 paper 'Parser-Free Virtual Try-on via Distilling App

395 Jan 03, 2023
An attempt to map the areas with active conflict in Ukraine using open source twitter data.

Live Action Map (LAM) An attempt to use open source data on Twitter to map areas with active conflict. Right now it is used for the Ukraine-Russia con

Kinshuk Dua 171 Nov 21, 2022
Official code repository of the paper Linear Transformers Are Secretly Fast Weight Programmers.

Linear Transformers Are Secretly Fast Weight Programmers This repository contains the code accompanying the paper Linear Transformers Are Secretly Fas

Imanol Schlag 77 Dec 19, 2022
Transformer - A TensorFlow Implementation of the Transformer: Attention Is All You Need

[UPDATED] A TensorFlow Implementation of Attention Is All You Need When I opened this repository in 2017, there was no official code yet. I tried to i

Kyubyong Park 3.8k Dec 26, 2022
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of the document image, which

Clova AI Research 94 Dec 30, 2022
IMDB film review sentiment classification based on BERT's supervised learning model.

IMDB film review sentiment classification based on BERT's supervised learning model. On the other hand, the model can be extended to other natural language multi-classification tasks.

Paris 1 Apr 17, 2022
Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

This repo provides the code of the following papers: (GAR) "Generation-Augmented Retrieval for Open-domain Question Answering", ACL 2021 (RIDER) "Read

morning 49 Dec 26, 2022
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
SAVI2I: Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors

SAVI2I: Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors [Paper] [Project Website] Pytorch implementation for SAVI2I. We

Qi Mao 44 Dec 30, 2022
(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

10 Jul 01, 2022
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023
A Multi-modal Model Chinese Spell Checker Released on ACL2021.

ReaLiSe ReaLiSe is a multi-modal Chinese spell checking model. This the office code for the paper Read, Listen, and See: Leveraging Multimodal Informa

DaDa 106 Dec 29, 2022