An implementation of the Contrast Predictive Coding (CPC) method to train audio features in an unsupervised fashion.

Overview

CPC_audio

This code implements the Contrast Predictive Coding algorithm on audio data, as described in the paper Unsupervised Pretraining Transfers well Across Languages. This is an unsupervised method to train audio features directly from the raw waveform.

Moreover, this code also implements all the evaluation metrics used in the paper:

Setup instructions

The installation is a tiny bit involved due to the torch-audio dependency.

0/ Clone the repo: git clone [email protected]:facebookresearch/CPC_audio.git && cd CPC_audio

1/ Install libraries which would be required for torch-audio https://github.com/pytorch/audio :

  • MacOS: brew install sox
  • Linux: sudo apt-get install sox libsox-dev libsox-fmt-all

2/ conda env create -f environment.yml && conda activate cpc37

3/ Run setup.py python setup.py develop

You can test your installation with: nosetests -d

CUDA driver

This setup is given for CUDA 9.2 if you use a different version of CUDA then please change the version of cudatoolkit in environment.yml. For more information on the cudatoolkit version to use, please check https://pytorch.org/

Standard datasets

We suggest to train the model either on Librispeech or libri-light.

How to run a session

To run a new training session, use:

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION

Where:

  • $PATH_AUDIO_FILES is the directory containing the audio files. The files should be arranged as below:
PATH_AUDIO_FILES  
│
└───speaker1
│   └───...
│         │   seq_11.{$EXTENSION}
│         │   seq_12.{$EXTENSION}
│         │   ...
│   
└───speaker2
    └───...
          │   seq_21.{$EXTENSION}
          │   seq_22.{$EXTENSION}

Please note that each speaker directory can contain an arbitrary number of subdirectories: the speaker label will always be retrieved from the top one. The name of the files isn't relevant. For a concrete example, you can look at the organization of the Librispeech dataset.

  • $PATH_CHECKPOINT_DIR in the directory where the checkpoints will be saved
  • $TRAINING_SET is a path to a .txt file containing the list of the training sequences (see here for example)
  • $VALIDATION_SET is a path to a .txt file containing the list of the validation sequences
  • $EXTENSION is the extension of each audio file

Custom architectures

The code allows you to train a wide range of architectures. For example, to train the CPC method as described in Van Den Oord's paper just run:

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION --normMode batchNorm --rnnMode linear

Or if you want to train a model with a FFD prediction network instead of a transformer:

python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION --rnnMode ffd --schedulerRamp 10

The --schedulerRamp option add a learning rate ramp at the beginning of the training: it barely affects the performance of a model with a transformer predictor but is necessary with other models.

Launch cpc/train.py -h to see all the possible options.

How to restart a session

To restart a session from the last saved checkpoint just run

python cpc/train.py --pathCheckpoint $PATH_CHECKPOINT_DIR

How to run an evaluation session

All evaluation scripts can be found in cpc/eval/.

Linear separability:

After training, the CPC model can output high level features for a variety of tasks. For an input audio file sampled at 16kHz, the provided baseline model will output 256 dimensional output features every 10ms. We provide two linear separability tests one for speaker, one for phonemes, in which a linear classifier is trained on top of the CPC features with aligned labels, and evaluated on a held-out test set.

Train / Val splits as well as phone alignments for librispeech-100h can be found here.

Speaker separability:

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT

Phone separability:

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --pathPhone $PATH_TO_PHONE_LABELS

You can also concatenate the output features of several model by providing several checkpoint to the --load option. For example the following command line:

python cpc/eval/linear_separability.py -$PATH_DB $TRAINING_SET $VAL_SET model1.pt model2.pt --pathCheckpoint $PATH_CHECKPOINT

Will evaluate the speaker separability of the concatenation of the features from model1 and model2.

--gru_level controls from which layer of autoregressive part of CPC to extract the features. By default it's the last one.

Nullspaces:

To conduct the nullspace experiment, first classify speakers using two factorized matrices A (DIM_EMBEDDING x DIM_INBETWEEN) and B (DIM_INBETWEEN x SPEAKERS). You'll want to extract A', the nullspace of matrix A (of size DIM_EMBEDDING x (DIM_EMBEDDING - DIM_INBETWEEN)), to make the embeddings less sensitive to speakers.

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --mode speakers_factorized  --model cpc --dim_inter $DIM_INBETWEEN --gru_level 2

Next, you evaluate the phone and speaker separabilities of the embeddings from CPC projected into the nullspace A'.

python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --mode phonemes_nullspace --model cpc --pathPhone $PATH_TO_PHONE_LABELS --path_speakers_factorized $PATH_CHECKPOINT_SPEAKERS_FACTORIZED --dim_inter $DIM_INBETWEEN --gru_level 2
python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --mode speakers_nullspace --model cpc --path_speakers_factorized $PATH_CHECKPOINT_SPEAKERS_FACTORIZED --dim_inter $DIM_INBETWEEN --gru_level 2

ABX score:

You can run the ABX score on the Zerospeech2017 dataset. To begin, download the dataset here. Then run the ABX evaluation on a given checkpoint with:

python ABX.py from_checkpoint $PATH_CHECKPOINT $PATH_ITEM_FILE $DATASET_PATH --seq_norm --strict --file_extension .wav --out $PATH_OUT

Where:

  • $PATH_CHECKPOINT is the path pointing to the checkpoint to evaluate
  • $PATH_ITEM_FILE is the path to the .item file containing the triplet annotations
  • $DATASET_PATH path to the directory containing the audio files
  • $PATH_OUT path to the directory into which the results should be dumped
  • --seq_norm normalize each batch of features across the time channel before computing ABX
  • --strict forces each batch of features to contain exactly the same number of frames.

Cross lingual transfer

To begin download the common voices datasets here, you will also need to download our phonem annotations and our train / val / test splits for each language here. Then unzip your data at PATH_COMMON_VOICES. Unfortunately, the audio files in common voices don't have the same sampling rate as in Librispeech. Thus you'll need to convert them into 16kH audio using the command:

DIR_CC=$PATH_COMMON_VOICES
for x in fr zh it ru nl sv es tr tt ky; do python cpc/eval/utils/adjust_sample_rate.py ${DIR_CC}/${x}/clips ${DIR_CC}/${x}/validated_phones_reduced.txt ${DIR_CC}/${x}/clips_16k; done

You can now run the experiments described in the paper. To begin, you must train the linear classifier. You will find below the instructions for the Spanish dataset: you can run the experiments on any other dataset in the same fashion.

Frozen features

To run the training on frozen features with the one hour dataset, just run:

python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt  --pathVal $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt --freeze -o $OUTPUT_DIR

Fine tuning

The command is quite similar to run the fine-tuning experiments on the 5 hours dataset. For example in French you need to run:

python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt --pathVal $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt --freeze -o $OUTPUT_DIR

PER

Once the training is done, you can compute the associated phone error rate (PER) on the test subset. To do so, just run:

python cpc/eval/common_voices_eval.py per $OUTPUT_DIR --pathVal $PATH_COMMON_VOICES/es/testSeqs_uniform_new_version.txt --pathPhone $PATH_COMMON_VOICES/es/validated_phones_reduced.txt

torch hub

To begin download the common voices datasets here, you will also need to download our phonem annotations and our train / val / test splits for each language here. Then unzip your data at PATH_COMMON_VOICES. Unfortunately, the audio files in common voices don't have the same sampling rate as in Librispeech. Thus you'll need to convert them into 16kH audio using the command:

DIR_CC=$PATH_COMMON_VOICES
for x in fr zh it ru nl sv es tr tt ky; do python cpc/eval/utils/adjust_sample_rate.py ${DIR_CC}/${x}/clips ${DIR_CC}/${x}/validated_phones_reduced.txt ${DIR_CC}/${x}/clips_16k; done

You can now run the experiments described in the paper. To begin, you must train the linear classifier. You will find below the instructions for the Spanish dataset: you can run the experiments on any other dataset in the same fashion.

Frozen features

To run the training on frozen features with the one hour dataset, just run:

python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt  --pathVal $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt --freeze -o $OUTPUT_DIR

Fine tuning

The command is quite similar to run the fine-tuning experiments on the 5 hours dataset. For example in French you need to run:

python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt --pathVal $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt --freeze -o $OUTPUT_DIR

PER

Once the training is done, you can compute the associated phone error rate (PER) on the test subset. To do so, just run:

python cpc/eval/common_voices_eval.py per $OUTPUT_DIR --pathVal $PATH_COMMON_VOICES/es/testSeqs_uniform_new_version.txt --pathPhone $PATH_COMMON_VOICES/es/validated_phones_reduced.txt

torch hub

This model is also available via torch.hub. For more details, have a look at hubconf.py.

Citations

Please consider citing this project in your publications if it helps your research.

@misc{rivire2020unsupervised,
    title={Unsupervised pretraining transfers well across languages},
    author={Morgane Rivière and Armand Joulin and Pierre-Emmanuel Mazaré and Emmanuel Dupoux},
    year={2020},
    eprint={2002.02848},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

License

CPC_audio is MIT licensed, as found in the LICENSE file.

CIFAR-10 Photo Classification

Image-Classification CIFAR-10 Photo Classification CIFAR-10_Dataset_Classfication CIFAR-10 Photo Classification Dataset CIFAR is an acronym that stand

ADITYA SHAH 1 Jan 05, 2022
PantheonRL is a package for training and testing multi-agent reinforcement learning environments.

PantheonRL is a package for training and testing multi-agent reinforcement learning environments. PantheonRL supports cross-play, fine-tuning, ad-hoc coordination, and more.

Stanford Intelligent and Interactive Autonomous Systems Group 57 Dec 28, 2022
最新版本yolov5+deepsort目标检测和追踪,支持5.0版本可训练自己数据集

使用YOLOv5+Deepsort实现车辆行人追踪和计数,代码封装成一个Detector类,更容易嵌入到自己的项目中。

422 Dec 30, 2022
An example showing how to use jax to train resnet50 on multi-node multi-GPU

jax-multi-gpu-resnet50-example This repo shows how to use jax for multi-node multi-GPU training. The example is adapted from the resnet50 example in d

Yangzihao Wang 20 Jul 04, 2022
CoReNet is a technique for joint multi-object 3D reconstruction from a single RGB image.

CoReNet CoReNet is a technique for joint multi-object 3D reconstruction from a single RGB image. It produces coherent reconstructions, where all objec

Google Research 80 Dec 25, 2022
Shallow Convolutional Neural Networks for Human Activity Recognition using Wearable Sensors

-IEEE-TIM-2021-1-Shallow-CNN-for-HAR [IEEE TIM 2021-1] Shallow Convolutional Neural Networks for Human Activity Recognition using Wearable Sensors All

Wenbo Huang 1 May 17, 2022
Experiments with the Robust Binary Interval Search (RBIS) algorithm, a Query-Based prediction algorithm for the Online Search problem.

OnlineSearchRBIS Online Search with Best-Price and Query-Based Predictions This is the implementation of the Robust Binary Interval Search (RBIS) algo

S. K. 1 Apr 16, 2022
Code of Periodic Activation Functions Induce Stationarity

Periodic Activation Functions Induce Stationarity This repository is the official implementation of the methods in the publication: L. Meronen, M. Tra

AaltoML 12 Jun 07, 2022
The project of phase's key role in complex and real NN

Phase-in-NN This is the code for our project at Princeton (co-authors: Yuqi Nie, Hui Yuan). The paper title is: "Neural Network is heterogeneous: Phas

YuqiNie-lab 1 Nov 04, 2021
Code for Talk-to-Edit (ICCV2021). Paper: Talk-to-Edit: Fine-Grained Facial Editing via Dialog.

Talk-to-Edit (ICCV2021) This repository contains the implementation of the following paper: Talk-to-Edit: Fine-Grained Facial Editing via Dialog Yumin

Yuming Jiang 221 Jan 07, 2023
A multi-entity Transformer for multi-agent spatiotemporal modeling.

baller2vec This is the repository for the paper: Michael A. Alcorn and Anh Nguyen. baller2vec: A Multi-Entity Transformer For Multi-Agent Spatiotempor

Michael A. Alcorn 56 Nov 15, 2022
SlideGraph+: Whole Slide Image Level Graphs to Predict HER2 Status in Breast Cancer

SlideGraph+: Whole Slide Image Level Graphs to Predict HER2 Status in Breast Cancer A novel graph neural network (GNN) based model (termed SlideGraph+

28 Dec 24, 2022
An open source app to help calm you down when needed.

By: Seanpm2001, Et; Al. Top README.md Read this article in a different language Sorted by: A-Z Sorting options unavailable ( af Afrikaans Afrikaans |

Sean P. Myrick V19.1.7.2 2 Oct 24, 2022
System Combination for Grammatical Error Correction Based on Integer Programming

System Combination for Grammatical Error Correction Based on Integer Programming This repository contains the code and scripts that implement the syst

NUS NLP Group 0 Mar 29, 2022
DexterRedTool - Dexter's Red Team Tool that creates cronjob/task scheduler to consistently creates users

DexterRedTool Author: Dexter Delandro CSEC 473 - Spring 2022 This tool persisten

2 Feb 16, 2022
This dlib-based facial login system

Facial-Login-System This dlib-based facial login system is a technology capable of matching a human face from a digital webcam frame capture against a

Mushahid Ali 3 Apr 23, 2022
Huawei Hackathon 2021 - Sweden (Stockholm)

huawei-hackathon-2021 Contributors DrakeAxelrod Challenge Requirements: python=3.8.10 Standard libraries (no importing) Important factors: Data depend

Drake Axelrod 32 Nov 08, 2022
Distinguishing Commercial from Editorial Content in News

Distinguishing Commercial from Editorial Content in News In this repository you can find the following: An anonymized version of the data used for my

Timo Kats 3 Sep 26, 2022
Implementation of "Learning to Match Features with Seeded Graph Matching Network" ICCV2021

SGMNet Implementation PyTorch implementation of SGMNet for ICCV'21 paper "Learning to Match Features with Seeded Graph Matching Network", by Hongkai C

87 Dec 11, 2022
MMdet2-based reposity about lightweight detection model: Nanodet, PicoDet.

Lightweight-Detection-and-KD MMdet2-based reposity about lightweight detection model: Nanodet, PicoDet. This repo also includes detection knowledge di

Egqawkq 12 Jan 05, 2023