Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Overview

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*.

Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a bias-controlled image dataset that features similar image classes to those present in ImageNet.

*Note: please see the ArXiv version for additional results on the test set.

Setup

  1. Clone this module and any submodules: git clone --recurse-submodules [email protected]:iapalm/Spoken-ObjectNet.git
  2. Follow the directions in data.md to set up ObjectNet images and the Spoken ObjectNet-50k corpus
  3. This code was tested with PyTorch 1.9 with CUDA 10.2 and Python 3.8.8.
  4. To train the models with the code as-is, we use 2 GPUs with 11 Gb of memory. A single GPU can be used, but the batch size or other parameters should be reduced.
  5. Note about the speed of this code: This code will work as-is on the Spoken ObjectNet audio captions, but the speed could be greatly improved. A main bottleneck is the resampling of the audio wav files from 48 kHz to 16 kHz, which is done with librosa here. We suggest to pre-process the audio files into the desired format first, and then remove this line or the on-the-fly spectrogram conversion entirely. We estimate the speed will improve 5x.
  6. On our servers, the zero-shot evaluation takes around 20-30 minutes and training takes around 4-5 days. As mentioned in the previous point, this could be improved with audio pre-processing.

Running Experiments

We support 3 experiments that can be used as baselines for future work:

  • (1) Zero-shot evaluation of the ResDAVEnet-VQ model trained on Places-400k [2].
  • (2) Fine-tuning the ResDAVEnet-VQ model trained on Places-400k on Spoken ObjectNet with a frozen image branch .
  • (3) Training the ResDAVEnet-VQ model from scratch on Spoken ObjectNet with a frozen image branch.
  • Note: fine-tuning the image branch on Spoken ObjectNet is not permitted, but fine-tuning the audio branch is allowed.

Zero-shot transfer from Places-400k

  • Download and extract the directory containing the model weights from this link. Keep the folder named RDVQ_00000 and move it to the exps directory.
  • In scripts/train.sh, change data_dt to data/Spoken-ObjectNet-50k/metadata/SON-test.json to evaluate on the test set instead of the validation set.
  • Run the following command for zero-shot evaluation: source scripts/train.sh 00000 RDVQ_00000 "--resume True --mode eval"
  • The results are printed in exps/RDVQ_00000_transfer/train.out

Fine-tune the model from Places-400k

  • Download and extract the directory containing the args.pkl file which specifies the fine-tuning arguments. The directory at this link contains the args.pkl file as well as the model weights.
  • The model weights of the fine-tuned model are provided for easier evaluation. Run the following command to evaluate the model using those weights: source scripts/train.sh 00000 RDVQ_00000_finetune "--resume True --mode eval"
  • Otherwise, to fine-tune the model yourself, first move the model weights to a new folder model_dl, then make a new folder model to save the new weights, and then run the following command: source scripts/train.sh 00000 RDVQ_00000_finetune "--resume True". This still require the args.pkl file mentioned previously.
  • Plese note the value of data_dt in scripts/train.sh. The code saves the best performing model during training, which is why it should be set to the validation set during training. During evaluation, it loads the best performing model, which is why it should be set to the test set during evaluation.

Train the model from scratch on Spoken ObjectNet

  • Run the following command to train the model from scratch: source scripts/train.sh 00000 RDVQ_scratch_frozen "--lr 0.001 --freeze-image-model True"
  • The model weights can be evaulated with source scripts/train.sh 00000 RDVQ_scratch_frozen "--resume True --mode eval"
  • We also provide the trained model weights at this link.
  • Plese note the value of data_dt in scripts/train.sh. The code saves the best performing model during training, which is why it should be set to the validation set during training. During evaluation, it loads the best performing model, which is why it should be set to the test set during evaluation.

Contact

If You find any problems or have any questions, please open an issue and we will try to respond as soon as possible. You can also try emailing the first corresponding author.

References

[1] Palmer, I., Rouditchenko, A., Barbu, A., Katz, B., Glass, J. (2021) Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset. Proc. Interspeech 2021, 3650-3654, doi: 10.21437/Interspeech.2021-245

[2] David Harwath*, Wei-Ning Hsu*, and James Glass. Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech. Proc. International Conference on Learning Representations (ICLR), 2020

Spoken ObjectNet - Bibtex:

@inproceedings{palmer21_interspeech,
  author={Ian Palmer and Andrew Rouditchenko and Andrei Barbu and Boris Katz and James Glass},
  title={{Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3650--3654},
  doi={10.21437/Interspeech.2021-245}
}
Owner
Ian Palmer
Ian Palmer
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
Speech Recognition Database Management with python

Speech Recognition Database Management The main aim of this project is to recogn

Abhishek Kumar Jha 2 Feb 02, 2022
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 09, 2023
🐍 A hyper-fast Python module for reading/writing JSON data using Rust's serde-json.

A hyper-fast, safe Python module to read and write JSON data. Works as a drop-in replacement for Python's built-in json module. This is alpha software

Matthias 479 Jan 01, 2023
2021语言与智能技术竞赛:机器阅读理解任务

LICS2021 MRC 1. 项目&任务介绍 本项目基于官方给定的baseline(DuReader-Checklist-BASELINE)进行二次改造,对整个代码框架做了简单的重构,对核心网络结构添加了注释,解耦了数据读取的模块,并添加了阈值确认的功能,一些小的细节也做了改进。 本次任务为202

roar 29 Dec 05, 2022
A relatively simple python program to generate one of those reddit text to speech videos dominating youtube.

Reddit text to speech generator A basic reddit tts video generator Current functionality Generate videos for subs based on comments,(askreddit) so rea

Aadvik 17 Dec 19, 2022
Text Analysis & Topic Extraction on Android App user reviews

AndroidApp_TextAnalysis Hi, there! This is code archive for Text Analysis and Topic Extraction from user_reviews of Android App. Dataset Source : http

Fitrie Ratnasari 1 Feb 14, 2022
Awesome Treasure of Transformers Models Collection

💁 Awesome Treasure of Transformers Models for Natural Language processing contains papers, videos, blogs, official repo along with colab Notebooks. 🛫☑️

Ashish Patel 577 Jan 07, 2023
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. Flair is: A powerful NLP library. Flair allo

flair 12.3k Jan 02, 2023
This is a general repo that helps you develop fast/effective NLP classifiers using Huggingface

NLP Classifier Introduction This project trains a bert model on any NLP classifcation model. And uses the model in make predictions on new data using

Abdullah Tarek 3 Mar 11, 2022
pyupbit 라이브러리를 활용하여 upbit에서 비트코인을 자동매매하는 코드입니다. 조코딩 유튜브 채널에서 자세한 강의 영상을 보실 수 있습니다.

파이썬 비트코인 투자 자동화 강의 코드 by 유튜브 조코딩 채널 pyupbit 라이브러리를 활용하여 upbit 거래소에서 비트코인 자동매매를 하는 코드입니다. 파일 구성 test.py : 잔고 조회 (1강) backtest.py : 백테스팅 코드 (2강) bestK.p

조코딩 JoCoding 186 Dec 29, 2022
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 1.5k Jan 02, 2023
👑 spaCy building blocks and visualizers for Streamlit apps

spacy-streamlit: spaCy building blocks for Streamlit apps This package contains utilities for visualizing spaCy models and building interactive spaCy-

Explosion 620 Dec 29, 2022
Korea Spell Checker

한국어 문서 koSpellPy Korean Spell checker How to use Install pip install kospellpy Use from kospellpy import spell_init spell_checker = spell_init() # d

kangsukmin 2 Oct 20, 2021
無料で使える中品質なテキスト読み上げソフトウェア、VOICEVOXの音声合成エンジン

VOICEVOX ENGINE VOICEVOXの音声合成エンジン。 実態は HTTP サーバーなので、リクエストを送信すればテキスト音声合成できます。 API ドキュメント VOICEVOX ソフトウェアを起動した状態で、ブラウザから

Hiroshiba 3 Jul 05, 2022
Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Pretrained Language Model This repository provides the latest pretrained language models and its related optimization techniques developed by Huawei N

HUAWEI Noah's Ark Lab 2.6k Jan 08, 2023
Train BPE with fastBPE, and load to Huggingface Tokenizer.

BPEer Train BPE with fastBPE, and load to Huggingface Tokenizer. Description The BPETrainer of Huggingface consumes a lot of memory when I am training

Lizhuo 1 Dec 23, 2021
Codes for coreference-aware machine reading comprehension

Data and code for the paper "Tracing Origins: Coreference-aware Machine Reading Comprehension" at ACL2022. Dataset There are three folders for our thr

11 Sep 29, 2022
This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

private-transformers This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers. What is this? Why

Xuechen Li 73 Dec 28, 2022
Mastering Transformers, published by Packt

Mastering Transformers This is the code repository for Mastering Transformers, published by Packt. Build state-of-the-art models from scratch with adv

Packt 195 Jan 01, 2023