Code for the paper "Unsupervised Contrastive Learning of Sound Event Representations", ICASSP 2021.

Related tags

Deep Learninguclser20
Overview

Unsupervised Contrastive Learning of
Sound Event Representations

This repository contains the code for the following paper. If you use this code or part of it, please cite:

Eduardo Fonseca, Diego Ortego, Kevin McGuinness, Noel E. O'Connor, Xavier Serra, "Unsupervised Contrastive Learning of Sound Event Representations", ICASSP 2021.

arXiv slides poster blog post video

We propose to learn sound event representations using the proxy task of contrasting differently augmented views of sound events, inspired by SimCLR [1]. The different views are computed by:

  • sampling TF patches at random within every input clip,
  • mixing resulting patches with unrelated background clips (mix-back), and
  • other data augmentations (DAs) (RRC, compression, noise addition, SpecAugment [2]).

Our proposed system is illustrated in the figure.

system

Our results suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels. Please check our paper for more details, or have a quicker look at our slide deck, poster, blog post, or video presentation (see links above).

This repository contains the framework that we used for our paper. It comprises the basic stages to learn an audio representation via unsupervised contrastive learning, and then evaluate the representation via supervised sound event classifcation. The system is implemented in PyTorch.

Dependencies

This framework is tested on Ubuntu 18.04 using a conda environment. To duplicate the conda environment:

conda create --name <envname> --file spec-file.txt

Directories and files

FSDnoisy18k/ includes folders to locate the FSDnoisy18k dataset and a FSDnoisy18k.py to load the dataset (train, val, test), including the data loader for contrastive and supervised training, applying transforms or mix-back when appropriate
config/ includes *.yaml files defining parameters for the different training modes
da/ contains data augmentation code, including augmentations mentioned in our paper and more
extract/ contains feature extraction code. Computes an .hdf5 file containing log-mel spectrograms and associated labels for a given subset of data
logs/ folder for output logs
models/ contains definitions for the architectures used (ResNet-18, VGG-like and CRNN)
pth/ contains provided pre-trained models for ResNet-18, VGG-like and CRNN
src/ contains functions for training and evaluation in both supervised and unsupervised fashion
main_train.py is the main script
spec-file.txt contains conda environment specs

Usage

(0) Download the dataset

Download FSDnoisy18k [3] from Zenodo through the dataset companion site, unzip it and locate it in a given directory. Fix paths to dataset in ctrl section of *.yaml. It can be useful to have a look at the different training sets of FSDnoisy18k: a larger set of noisy labels and a small set of clean data [3]. We use them for training/validation in different ways.

(1) Prepare the dataset

Create an .hdf5 file containing log-mel spectrograms and associated labels for each subset of data:

python extract/wav2spec.py -m test -s config/params_unsupervised_cl.yaml

Use -m with train, val or test to extract features from each subset. All the extraction parameters are listed in params_unsupervised_cl.yaml. Fix path to .hdf5 files in ctrl section of *.yaml.

(2) Run experiment

Our paper comprises three training modes. For convenience, we provide yaml files defining the setup for each of them.

  1. Unsupervised contrastive representation learning by comparing differently augmented views of sound events. The outcome of this stage is a trained encoder to produce low-dimensional representations. Trained encoders are saved under results_models/ using a folder name based on the string experiment_name in the corresponding yaml (make sure to change it).
CUDA_VISIBLE_DEVICES=0 python main_train.py -p config/params_unsupervised_cl.yaml &> logs/output_unsup_cl.out
  1. Evaluation of the representation using a previously trained encoder. Here, we do supervised learning by minimizing cross entropy loss without data agumentation. Currently, we load the provided pre-trained models sitting in pth/ (you can change this in main_train.py, search for select model). We follow two evaluation methods:

    • Linear Evaluation: train an additional linear classifier on top of the pre-trained unsupervised embeddings.

      CUDA_VISIBLE_DEVICES=0 python main_train.py -p config/params_supervised_lineval.yaml &> logs/output_lineval.out
      
    • End-to-end Fine Tuning: fine-tune entire model on two relevant downstream tasks after initializing with pre-trained weights. The two downstream tasks are:

      • training on the larger set of noisy labels and validate on train_clean. This is chosen by selecting train_on_clean: 0 in the yaml.
      • training on the small set of clean data (allowing 15% for validation). This is chosen by selecting train_on_clean: 1 in the yaml.

      After choosing the training set for the downstream task, run:

      CUDA_VISIBLE_DEVICES=0 python main_train.py -p config/params_supervised_finetune.yaml &> logs/output_finetune.out
      

The setup in the yaml files should provide the best results reported in our paper. JFYI, the main flags that determine the training mode are downstream, lin_eval and method in the corresponding yaml (they are already adequately set in each yaml).

(3) See results:

Check the logs/*.out for printed results at the end. Main evaluation metric is balanced (macro) top-1 accuracy. Trained models are saved under results_models/models* and some metrics are saved under results_models/metrics*.

Model Zoo

We provide pre-trained encoders as described in our paper, for ResNet-18, VGG-like and CRNN architectures. See pth/ folder. Note that better encoders could likely be obtained through a more exhaustive exploration of the data augmentation compositions, thus defining a more challenging proxy task. Also, we trained on FSDnoisy18k due to our limited compute resources at the time, yet this framework can be directly applied to other larger datasets such as FSD50K or AudioSet.

Citation

@inproceedings{fonseca2021unsupervised,
  title={Unsupervised Contrastive Learning of Sound Event Representations},
  author={Fonseca, Eduardo and Ortego, Diego and McGuinness, Kevin and O'Connor, Noel E. and Serra, Xavier},
  booktitle={2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  organization={IEEE}
}

Contact

You are welcome to contact [email protected] should you have any question/suggestion. You can also create an issue.

Acknowledgment

This work is a collaboration between the MTG-UPF and Dublin City University's Insight Centre. This work is partially supported by Science Foundation Ireland (SFI) under grant number SFI/15/SIRG/3283 and by the Young European Research University Network under a 2020 mobility award. Eduardo Fonseca is partially supported by a Google Faculty Research Award 2018. The authors are grateful for the GPUs donated by NVIDIA.

References

[1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A Simple Framework for Contrastive Learning of Visual Representations,” in Int. Conf. on Mach. Learn. (ICML), 2020

[2] Park et al., SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. InterSpeech 2019

[3] E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, X. Serra, "Learning Sound Event Classifiers from Web Audio with Noisy Labels", In proceedings of ICASSP 2019, Brighton, UK

Owner
Eduardo Fonseca
Returning research intern at Google Research | PhD candidate at Music Technology Group, Universitat Pompeu Fabra
Eduardo Fonseca
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Super Resolution Examples We run this script under TensorFlow 2.0 and the TensorLayer2.0+. For TensorLayer 1.4 version, please check release. 🚀 🚀 🚀

TensorLayer Community 2.9k Jan 08, 2023
Chunkmogrify: Real image inversion via Segments

Chunkmogrify: Real image inversion via Segments Teaser video with live editing sessions can be found here This code demonstrates the ideas discussed i

David Futschik 112 Jan 04, 2023
Raptor-Multi-Tool - Raptor Multi Tool With Python

Promises 🔥 20 Stars and I'll fix every error that there is 50 Stars and we will

Aran 44 Jan 04, 2023
Keras-retinanet - Keras implementation of RetinaNet object detection.

Keras RetinaNet Keras implementation of RetinaNet object detection as described in Focal Loss for Dense Object Detection by Tsung-Yi Lin, Priya Goyal,

Fizyr 4.3k Jan 01, 2023
Cards Against Humanity AI

cah-ai This is a Cards Against Humanity AI implemented using a pre-trained Semantic Search model. How it works A player is described by a combination

Alex Nichol 2 Aug 22, 2022
(ICCV 2021) ProHMR - Probabilistic Modeling for Human Mesh Recovery

ProHMR - Probabilistic Modeling for Human Mesh Recovery Code repository for the paper: Probabilistic Modeling for Human Mesh Recovery Nikos Kolotouros

Nikos Kolotouros 209 Dec 13, 2022
An updated version of virtual model making

Model-Swap-Face v2   这个项目是基于stylegan2 pSp制作的,比v1版本Model-Swap-Face在推理速度和图像质量上有一定提升。主要的功能是将虚拟模特进行环球不同区域的风格转换,目前转换器提供西欧模特、东亚模特和北非模特三种主流的风格样式,可帮我们实现生产资料零成

seeprettyface.com 62 Dec 09, 2022
A PyTorch implementation of "Predict then Propagate: Graph Neural Networks meet Personalized PageRank" (ICLR 2019).

APPNP ⠀ A PyTorch implementation of Predict then Propagate: Graph Neural Networks meet Personalized PageRank (ICLR 2019). Abstract Neural message pass

Benedek Rozemberczki 329 Dec 30, 2022
📝 Wrapper library for text generation / language models at char and word level with RNN in TensorFlow

tensorlm Generate Shakespeare poems with 4 lines of code. Installation tensorlm is written in / for Python 3.4+ and TensorFlow 1.1+ pip3 install tenso

Kilian Batzner 63 May 22, 2021
Implementation of Restricted Boltzmann Machine (RBM) and its variants in Tensorflow

xRBM Library Implementation of Restricted Boltzmann Machine (RBM) and its variants in Tensorflow Installation Using pip: pip install xrbm Examples Tut

Omid Alemi 55 Dec 29, 2022
Weakly Supervised Scene Text Detection using Deep Reinforcement Learning

Weakly Supervised Scene Text Detection using Deep Reinforcement Learning This repository contains the setup for all experiments performed in our Paper

Emanuel Metzenthin 3 Dec 16, 2022
The pytorch implementation of SOKD (BMVC2021).

Semi-Online Knowledge Distillation Implementations of SOKD. Requirements This repo was tested with Python 3.8, PyTorch 1.5.1, torchvision 0.6.1, CUDA

4 Dec 19, 2021
The pyrelational package offers a flexible workflow to enable active learning with as little change to the models and datasets as possible

pyrelational is a python active learning library developed by Relation Therapeutics for rapidly implementing active learning pipelines from data management, model development (and Bayesian approximat

Relation Therapeutics 95 Dec 27, 2022
[ICCV 2021] Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain

Amplitude-Phase Recombination (ICCV'21) Official PyTorch implementation of "Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neur

Guangyao Chen 53 Oct 05, 2022
naked is a Python tool which allows you to strip a model and only keep what matters for making predictions.

naked is a Python tool which allows you to strip a model and only keep what matters for making predictions. The result is a pure Python function with no third-party dependencies that you can simply c

Max Halford 24 Dec 20, 2022
Deep Implicit Moving Least-Squares Functions for 3D Reconstruction

DeepMLS: Deep Implicit Moving Least-Squares Functions for 3D Reconstruction This repository contains the implementation of the paper: Deep Implicit Mo

103 Dec 22, 2022
Training DALL-E with volunteers from all over the Internet using hivemind and dalle-pytorch (NeurIPS 2021 demo)

Training DALL-E with volunteers from all over the Internet This repository is a part of the NeurIPS 2021 demonstration "Training Transformers Together

<a href=[email protected]"> 19 Dec 13, 2022
Make your own game in a font!

Project structure. Included is a suite of tools to create font games. Tutorial: For a quick tutorial about how to make your own game go here For devel

Michael Mulet 125 Dec 04, 2022
Steer OpenAI's Jukebox with Music Taggers

TagBox Steer OpenAI's Jukebox with Music Taggers! The closest thing we have to VQGAN+CLIP for music! Unsupervised Source Separation By Steering Pretra

Ethan Manilow 34 Nov 02, 2022
ViSD4SA, a Vietnamese Span Detection for Aspect-based sentiment analysis dataset

UIT-ViSD4SA PACLIC 35 General Introduction This repository contains the data of the paper: Span Detection for Vietnamese Aspect-Based Sentiment Analys

Nguyễn Thị Thanh Kim 5 Nov 13, 2022