Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Last update: Jan 26, 2022

Overview

Official code for our Interspeech 2021 paper, Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*.

Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a bias-controlled image dataset that features similar image classes to those present in ImageNet.

*Note: please see the arXiv version for additional results on the test set.

Setup

Clone this module and any submodules: git clone --recurse-submodules [email protected]:iapalm/Spoken-ObjectNet.git
Follow the directions in data.md to set up ObjectNet images and the Spoken ObjectNet-50k corpus
This code was tested with PyTorch 1.9 with CUDA 10.2 and Python 3.8.8.
To train the models with the code as-is, we use 2 GPUs with 11 GB of memory. A single GPU can be used, but the batch size or other parameters should be reduced.
Note about the speed of this code: This code will work as-is on the Spoken ObjectNet audio captions, but the speed could be greatly improved. A main bottleneck is the resampling of the audio wav files from 48 kHz to 16 kHz, which is done with librosa here. We suggest pre-processing the audio files into the desired format first, and then removing this line or the on-the-fly spectrogram conversion entirely. We estimate that this would make the code about 5x faster.
On our servers, the zero-shot evaluation takes around 20-30 minutes and training takes around 4-5 days. As mentioned in the previous point, this could be improved with audio pre-processing.
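The pre-processing suggested in the speed note can be sketched roughly as follows. This is a minimal sketch, not part of the released code: it assumes the captions are stored as .wav files under a single directory tree and that librosa and soundfile are installed. The function names (mirrored_path, resample_tree) and the directory names in the usage note are hypothetical.

```python
from pathlib import Path

TARGET_SR = 16000  # sample rate the speed note says the audio is resampled to


def mirrored_path(src: Path, src_root: Path, dst_root: Path) -> Path:
    """Map a file under src_root to the same relative location under dst_root."""
    return dst_root / src.relative_to(src_root)


def resample_tree(src_root: str, dst_root: str, target_sr: int = TARGET_SR) -> int:
    """Resample every .wav under src_root once and write the result under dst_root.

    Returns the number of files written. Directory layout is an assumption.
    """
    # Imported lazily so the path helper above works without audio libraries.
    import librosa
    import soundfile as sf

    src_root_p, dst_root_p = Path(src_root), Path(dst_root)
    count = 0
    for wav in sorted(src_root_p.rglob("*.wav")):
        out = mirrored_path(wav, src_root_p, dst_root_p)
        out.parent.mkdir(parents=True, exist_ok=True)
        # librosa.load resamples to `sr` and returns mono float audio by default.
        audio, sr = librosa.load(str(wav), sr=target_sr)
        sf.write(str(out), audio, sr)
        count += 1
    return count
```

For example, resample_tree("wavs_48k", "wavs_16k") would write a 16 kHz copy of each caption to a parallel directory. The data loading code would then need to read from that directory and skip its own resampling step.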

Running Experiments

We support 3 experiments that can be used as baselines for future work:

(1) Zero-shot evaluation of the ResDAVEnet-VQ model trained on Places-400k [2].
(2) Fine-tuning the ResDAVEnet-VQ model trained on Places-400k on Spoken ObjectNet with a frozen image branch.
(3) Training the ResDAVEnet-VQ model from scratch on Spoken ObjectNet with a frozen image branch.
Note: fine-tuning the image branch on Spoken ObjectNet is not permitted, but fine-tuning the audio branch is allowed.

Zero-shot transfer from Places-400k

Download and extract the directory containing the model weights from this link. Keep the folder named RDVQ_00000 and move it to the exps directory.
In scripts/train.sh, change data_dt to data/Spoken-ObjectNet-50k/metadata/SON-test.json to evaluate on the test set instead of the validation set.
Run the following command for zero-shot evaluation: source scripts/train.sh 00000 RDVQ_00000 "--resume True --mode eval"
The results are printed in exps/RDVQ_00000_transfer/train.out

Fine-tune the model from Places-400k

Download and extract the directory containing the args.pkl file which specifies the fine-tuning arguments. The directory at this link contains the args.pkl file as well as the model weights.
The model weights of the fine-tuned model are provided for easier evaluation. Run the following command to evaluate the model using those weights: source scripts/train.sh 00000 RDVQ_00000_finetune "--resume True --mode eval"
Otherwise, to fine-tune the model yourself, first move the model weights to a new folder model_dl, then make a new folder model to save the new weights, and then run the following command: source scripts/train.sh 00000 RDVQ_00000_finetune "--resume True". This still requires the args.pkl file mentioned previously.
Please note the value of data_dt in scripts/train.sh. The code saves the best performing model during training, which is why it should be set to the validation set during training. During evaluation, it loads the best performing model, which is why it should be set to the test set during evaluation.

Train the model from scratch on Spoken ObjectNet

Run the following command to train the model from scratch: source scripts/train.sh 00000 RDVQ_scratch_frozen "--lr 0.001 --freeze-image-model True"
The model weights can be evaluated with source scripts/train.sh 00000 RDVQ_scratch_frozen "--resume True --mode eval"
We also provide the trained model weights at this link.
Please note the value of data_dt in scripts/train.sh. The code saves the best performing model during training, which is why it should be set to the validation set during training. During evaluation, it loads the best performing model, which is why it should be set to the test set during evaluation.

Contact

If you find any problems or have any questions, please open an issue and we will try to respond as soon as possible. You can also try emailing the first corresponding author.

References

[1] Palmer, I., Rouditchenko, A., Barbu, A., Katz, B., Glass, J. (2021) Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset. Proc. Interspeech 2021, 3650-3654, doi: 10.21437/Interspeech.2021-245

[2] Harwath, D.*, Hsu, W.-N.*, Glass, J. (2020) Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech. Proc. International Conference on Learning Representations (ICLR) 2020

Spoken ObjectNet - BibTeX:

@inproceedings{palmer21_interspeech,
  author={Ian Palmer and Andrew Rouditchenko and Andrei Barbu and Boris Katz and James Glass},
  title={{Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3650--3654},
  doi={10.21437/Interspeech.2021-245}
}

Owner

Ian Palmer
