Official codes for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

Overview

ResDAVEnet-VQ

Official PyTorch implementation of Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

What is in this repo?

  • Multi-GPU training of ResDAVEnet-VQ
  • Quantitative evaluation
    • Image-to-speech and speech-to-image retrieval
    • ZeroSpeech 2019 ABX phone-discriminability test
    • Word detection
  • Qualitative evaluation
    • Visualize time-aligned word/phone/code transcripts
    • F1/recall/precision scatter plots for model/layer comparison

alt text

If you find the code useful, please cite

@inproceedings{Harwath2020Learning,
  title={Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech},
  author={David Harwath and Wei-Ning Hsu and James Glass},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=B1elCp4KwH}
}

Pre-trained models

Model [email protected] Link MD5 sum
{} 0.735 gDrive e3f94990c72ce9742c252b2e04f134e4
{}->{2} 0.760 gDrive d8ebaabaf882632f49f6aea0a69516eb
{}->{3} 0.794 gDrive 2c3a269c70005cbbaaa15fc545da93fa
{}->{2,3} 0.787 gDrive d0764d8e97187c8201f205e32b5f7fee
{2} 0.753 gDrive d68c942069fcdfc3944e556f6af79c60
{2}->{2,3} 0.764 gDrive 09e704f8fcd9f85be8c4d5bdf779bd3b
{2}->{2,3}->{2,3,4} 0.793 gDrive 6e403e7f771aad0c95f087318bf8447e
{3} 0.734 gDrive a0a3d5adbbd069a2739219346c8a8f70
{3}->{2,3} 0.760 gDrive 6c92bcc4445895876a7840bc6e88892b
{2,3} 0.667 gDrive 7a98a661302939817a1450d033bc2fcc

Data preparation

Download the MIT Places Image/Audio Data

We use MIT Places scene recognition database (Places Image) and a paired MIT Places Audio Caption Corpus (Places Audio) as visually-grounded speech, which contains roughly 400K image/spoken caption pairs, to train ResDAVEnet-VQ.

  • Places Image can be downloaded here
  • Places Audio can be downloaded here

Optional data preprocessing

Data specifcation files can be found at metadata/{train,val}.json inside the Places Audio directory; however, they do not include the time-aligned word transcripts for analysis. Those with alignments can be downloaded here:

Open the *.json files and update the values of image_base_path and audio_base_path to reflect the path where the image and the audio datasets are stored.

To speed up data loading, we save images and audio data into the HDF5 binary files, and use the h5py Python interface to access the data. The corresponding PyTorch Dataset class is ImageCaptionDatasetHDF5 in ./dataloaders/image_caption_dataset_hdf5.py. To prepare HDF5 datasets, run

./scripts/preprocess.sh

(We do support on-the-fly feature processing with the ImageCaptionDataset class in ./dataloaders/image_caption_dataset.py, which takes a data specification file as input (e.g., metadata/train.json). However, this can be very slow)

ImageCaptionDataset and ImageCaptionDatasetHDF5 are interchangeable, but most scripts in this repo assume the preprocessed HDF5 dataset is available. Users would have to modify the code correspondingly to use ImageCaptionDataset.

Interactive Qualtitative Evaluation

See run_evaluations.ipynb

Quantitative Evaluation

ZeroSpeech 2019 ABX Phone Discriminability Test

Users need to download the dataset and the Docker image by following the instructions here.

To extract ResDAVEnet-VQ features, see ./scripts/dump_zs19_abx.sh.

Word detection

See ./run_unit_analysis.py. It needs both HDF5 dataset and the original JSON dataset to get the time-aligned word transcripts.

Example:

python run_unit_analysis.py --hdf5_path=$hdf5_path --json_path=$json_path \
  --exp_dir=$exp_dir --layer=$layer --output_dir=$out_dir

Cross-modal retrieval

See ./run_ResDavenetVQ.py. Set --mode=eval for retrieval evaluation.

Example:

python run_ResDavenetVQ.py --resume=True --mode=eval \
  --data-train=$data_tr --data-val=$data_dt \
  --exp-dir="./exps/pretrained/RDVQ_01000_01100_01110"

Training

See ./scripts/train.sh.

To train a model from scratch with the 2nd and 3rd layers quantized, run

./scripts/train.sh 01100 RDVQ_01100 ""

To train a model with the 2nd and 3rd layers quantized, and initialize weights from a pre-trained model (e.g., ./exps/RDVQ_00000), run

./scripts/train.sh 01100 RDVQ_01100 "--seed-dir ./exps/RDVQ_00000"
Owner
Wei-Ning Hsu
Research Scientist @ Facebook AI Research (FAIR). Former PhD Student @ MIT Spoken Language Systems Group
Wei-Ning Hsu
[CVPR 2019 Oral] Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation

SelectionGAN for Guided Image-to-Image Translation CVPR Paper | Extended Paper | Guided-I2I-Translation-Papers Citation If you use this code for your

Hao Tang 424 Dec 02, 2022
Worktory is a python library created with the single purpose of simplifying the inventory management of network automation scripts.

Worktory is a python library created with the single purpose of simplifying the inventory management of network automation scripts.

Renato Almeida de Oliveira 18 Aug 31, 2022
Real-Time High-Resolution Background Matting

Real-Time High-Resolution Background Matting Official repository for the paper Real-Time High-Resolution Background Matting. Our model requires captur

Peter Lin 6.1k Jan 03, 2023
Pytorch implementation of "Get To The Point: Summarization with Pointer-Generator Networks"

About this repository This repo contains an Pytorch implementation for the ACL 2017 paper Get To The Point: Summarization with Pointer-Generator Netwo

wxDai 7 Oct 14, 2022
Train robotic agents to learn pick and place with deep learning for vision-based manipulation in PyBullet.

Ravens is a collection of simulated tasks in PyBullet for learning vision-based robotic manipulation, with emphasis on pick and place. It features a Gym-like API with 10 tabletop rearrangement tasks,

Google Research 367 Jan 09, 2023
Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

ChongjianGE 89 Dec 02, 2022
U-Net Brain Tumor Segmentation

U-Net Brain Tumor Segmentation 🚀 :Feb 2019 the data processing implementation in this repo is not the fastest way (code need update, contribution is

Hao 448 Jan 02, 2023
Official Python implementation of the 'Sparse deconvolution'-v0.3.0

Sparse deconvolution Python v0.3.0 Official Python implementation of the 'Sparse deconvolution', and the CPU (NumPy) and GPU (CuPy) calculation backen

Weisong Zhao 23 Dec 28, 2022
Robustness via Cross-Domain Ensembles

Robustness via Cross-Domain Ensembles [ICCV 2021, Oral] This repository contains tools for training and evaluating: Pretrained models Demo code Traini

Visual Intelligence & Learning Lab, Swiss Federal Institute of Technology (EPFL) 27 Dec 23, 2022
RoadMap and preparation material for Machine Learning and Data Science - From beginner to expert.

ML-and-DataScience-preparation This repository has the goal to create a learning and preparation roadMap for Machine Learning Engineers and Data Scien

33 Dec 29, 2022
Implementation of "JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting"

JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting Pytorch implementation for the paper "JOKR: Joint Keypoint Repres

45 Dec 25, 2022
The code for our paper Semi-Supervised Learning with Multi-Head Co-Training

Semi-Supervised Learning with Multi-Head Co-Training (PyTorch) Abstract Co-training, extended from self-training, is one of the frameworks for semi-su

cmc 6 Dec 04, 2022
A novel pipeline framework for multi-hop complex KGQA task. About the paper title: Improving Multi-hop Embedded Knowledge Graph Question Answering by Introducing Relational Chain Reasoning

Rce-KGQA A novel pipeline framework for multi-hop complex KGQA task. This framework mainly contains two modules, answering_filtering_module and relati

金伟强 -上海大学人工智能小渣渣~ 16 Nov 18, 2022
RodoSol-ALPR Dataset

RodoSol-ALPR Dataset This dataset, called RodoSol-ALPR dataset, contains 20,000 images captured by static cameras located at pay tolls owned by the Ro

Rayson Laroca 45 Dec 15, 2022
Benchmark spaces - Benchmarks of how well different two dimensional spaces work for clustering algorithms

benchmark_spaces Benchmarks of how well different two dimensional spaces work fo

Bram Cohen 6 May 07, 2022
Dataset used in "PlantDoc: A Dataset for Visual Plant Disease Detection" accepted in CODS-COMAD 2020

PlantDoc: A Dataset for Visual Plant Disease Detection This repository contains the Cropped-PlantDoc dataset used for benchmarking classification mode

Pratik Kayal 109 Dec 29, 2022
A full pipeline AutoML tool for tabular data

HyperGBM Doc | 中文 We Are Hiring! Dear folks,we are offering challenging opportunities located in Beijing for both professionals and students who are k

DataCanvas 240 Jan 03, 2023
Source code for deep symbolic optimization.

Update July 10, 2021: This repository now supports an additional symbolic optimization task: learning symbolic policies for reinforcement learning. Th

Brenden Petersen 290 Dec 25, 2022
Automatic Video Captioning Evaluation Metric --- EMScore

Automatic Video Captioning Evaluation Metric --- EMScore Overview For an illustration, EMScore can be computed as: Installation modify the encode_text

Yaya Shi 17 Nov 28, 2022
Deep generative modeling for time-stamped heterogeneous data, enabling high-fidelity models for a large variety of spatio-temporal domains.

Neural Spatio-Temporal Point Processes [arxiv] Ricky T. Q. Chen, Brandon Amos, Maximilian Nickel Abstract. We propose a new class of parameterizations

Facebook Research 75 Dec 19, 2022