Unofficial PyTorch implementation of Google AI's VoiceFilter system

Overview

VoiceFilter

Note from Seung-won (2020.10.25)

Hi everyone! It's Seung-won from MINDs Lab, Inc. It's been a long time since I've released this open-source, and I didn't expect this repository to grab such a great amount of attention for a long time. I would like to thank everyone for giving such attention, and also Mr. Quan Wang (the first author of the VoiceFilter paper) for referring this project in his paper.

Actually, this project was done by me when it was only 3 months after I started studying deep learning & speech separation without a supervisor in the relevant field. Back then, I didn't know what is a power-law compression, and the correct way to validate/test the models. Now that I've spent more time on deep learning & speech since then (I also wrote a paper published at Interspeech 2020 😊 ), I can observe some obvious mistakes that I've made. Those issues were kindly raised by GitHub users; please refer to the Issues and Pull Requests for that. That being said, this repository can be quite unreliable, and I would like to remind everyone to use this code at their own risk (as specified in LICENSE).

Unfortunately, I can't afford extra time on revising this project or reviewing the Issues / Pull Requests. Instead, I would like to offer some pointers to newer, more reliable resources:

  • VoiceFilter-Lite: This is a newer version of VoiceFilter presented at Interspeech 2020, which is also written by Mr. Quan Wang (and his colleagues at Google). I highly recommend checking this paper, since it focused on a more realistic situation where VoiceFilter is needed.
  • List of VoiceFilter implementation available on GitHub: In March 2019, this repository was the only available open-source implementation of VoiceFilter. However, much better implementations that deserve more attention became available across GitHub. Please check them, and choose the one that meets your demand.
  • PyTorch Lightning: Back in 2019, I could not find a great deep-learning project template for myself, so I and my colleagues had used this project as a template for other new projects. For people who are searching for such project template, I would like to strongly recommend PyTorch Lightning. Even though I had done a lot of effort into developing my own template during 2019 (VoiceFilter -> RandWireNN -> MelNet -> MelGAN), I found PyTorch Lightning much better than my own template.

Thanks for reading, and I wish everyone good health during the global pandemic situation.

Best regards, Seung-won Park


Unofficial PyTorch implementation of Google AI's: VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

Result

  • Training took about 20 hours on AWS p3.2xlarge(NVIDIA V100).

Audio Sample

Metric

Median SDR Paper Ours
before VoiceFilter 2.5 1.9
after VoiceFilter 12.6 10.2

  • SDR converged at 10, which is slightly lower than paper's.

Dependencies

  1. Python and packages

    This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed by:

    pip install -r requirements.txt
  2. Miscellaneous

    ffmpeg-normalize is used for resampling and normalizing wav files. See README.md of ffmpeg-normalize for installation.

Prepare Dataset

  1. Download LibriSpeech dataset

    To replicate VoiceFilter paper, get LibriSpeech dataset at http://www.openslr.org/12/. train-clear-100.tar.gz(6.3G) contains speech of 252 speakers, and train-clear-360.tar.gz(23G) contains 922 speakers. You may use either, but the more speakers you have in dataset, the more better VoiceFilter will be.

  2. Resample & Normalize wav files

    First, unzip tar.gz file to desired folder:

    tar -xvzf train-clear-360.tar.gz

    Next, copy utils/normalize-resample.sh to root directory of unzipped data folder. Then:

    vim normalize-resample.sh # set "N" as your CPU core number.
    chmod a+x normalize-resample.sh
    ./normalize-resample.sh # this may take long
  3. Edit config.yaml

    cd config
    cp default.yaml config.yaml
    vim config.yaml
  4. Preprocess wav files

    In order to boost training speed, perform STFT for each files before training by:

    python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

    This will create 100,000(train) + 1000(test) data. (About 160G)

Train VoiceFilter

  1. Get pretrained model for speaker recognition system

    VoiceFilter utilizes speaker recognition system (d-vector embeddings). Here, we provide pretrained model for obtaining d-vector embeddings.

    This model was trained with VoxCeleb2 dataset, where utterances are randomly fit to time length [70, 90] frames. Tests are done with window 80 / hop 40 and have shown equal error rate about 1%. Data used for test were selected from first 8 speakers of VoxCeleb1 test dataset, where 10 utterances per each speakers are randomly selected.

    Update: Evaluation on VoxCeleb1 selected pair showed 7.4% EER.

    The model can be downloaded at this GDrive link.

  2. Run

    After specifying train_dir, test_dir at config.yaml, run:

    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

    This will create chkpt/name and logs/name at base directory(-b option, . in default)

  3. View tensorboardX

    tensorboard --logdir ./logs

  4. Resuming from checkpoint

    python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m name

Evaluate

python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]

Possible improvments

  • Try power-law compressed reconstruction error as loss function, instead of MSE. (See #14)

Author

Seungwon Park at MINDsLab ([email protected], [email protected])

License

Apache License 2.0

This repository contains codes adapted/copied from the followings:

Owner
MINDs Lab
MINDsLab provides AI platform and various AI engines based on deep machine learning.
MINDs Lab
NLP library designed for reproducible experimentation management

Welcome to the Transfer NLP library, a framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP You can

Feedly 290 Dec 20, 2022
Official code for "Parser-Free Virtual Try-on via Distilling Appearance Flows", CVPR 2021

Parser-Free Virtual Try-on via Distilling Appearance Flows, CVPR 2021 Official code for CVPR 2021 paper 'Parser-Free Virtual Try-on via Distilling App

395 Jan 03, 2023
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specia

Zihan Liu 89 Nov 10, 2022
A website which allows you to play with the GPT-2 transformer

transformers A website which allows you to play with the GPT-2 model Built with ❤️ by raphtlw Table of contents Model Setup About Contributors Model T

raphtlw 2 Jan 27, 2022
A natural language processing model for sequential sentence classification in medical abstracts.

NLP PubMed Medical Research Paper Abstract (Randomized Controlled Trial) A natural language processing model for sequential sentence classification in

Hemanth Chandran 1 Jan 17, 2022
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
Journey is a NLP-Powered Developer assistant

Journey Journey is a NLP-Powered Developer assistant Using on the powerful Natural Language Processing library Mindmeld, this projects aims to assist

Christian Eilers 21 Dec 11, 2022
novel deep learning research works with PaddlePaddle

Research 发布基于飞桨的前沿研究工作,包括CV、NLP、KG、STDM等领域的顶会论文和比赛冠军模型。 目录 计算机视觉(Computer Vision) 自然语言处理(Natrual Language Processing) 知识图谱(Knowledge Graph) 时空数据挖掘(Spa

1.5k Jan 03, 2023
Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

MLP Singer Official implementation of MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis. Audio samples are available on our demo page.

Neosapience 103 Dec 23, 2022
Snips Python library to extract meaning from text

Snips NLU Snips NLU (Natural Language Understanding) is a Python library that allows to extract structured information from sentences written in natur

Snips 3.7k Dec 30, 2022
Simple program that translates the name of files into English

Simple program that translates the name of files into English. Useful for when editing/inspecting programs that were developed in a foreign language.

0 Dec 22, 2021
Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields

Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields [project page][paper][cite] Geometry-Consistent Neural Shape Represe

Yifan Wang 100 Dec 19, 2022
Tracking Progress in Natural Language Processing

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Sebastian Ruder 21.2k Dec 30, 2022
[Preprint] Escaping the Big Data Paradigm with Compact Transformers, 2021

Compact Transformers Preprint Link: Escaping the Big Data Paradigm with Compact Transformers By Ali Hassani[1]*, Steven Walton[1]*, Nikhil Shah[1], Ab

SHI Lab 367 Dec 31, 2022
Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

Tao Lei 14 Dec 12, 2022
[ICLR'19] Trellis Networks for Sequence Modeling

TrellisNet for Sequence Modeling This repository contains the experiments done in paper Trellis Networks for Sequence Modeling by Shaojie Bai, J. Zico

CMU Locus Lab 460 Oct 13, 2022
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 169 Dec 21, 2022
Conversational text Analysis using various NLP techniques

Conversational text Analysis using various NLP techniques

Rita Anjana 159 Jan 06, 2023
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
precise iris segmentation

PI-DECODER Introduction PI-DECODER, a decoder structure designed for Precise Iris Segmentation and Location. The decoder structure is shown below: Ple

8 Aug 08, 2022