TriBERT

This repository contains the code for the NeurIPS 2021 paper titled "TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation".

Data pre-processing:

Please download MUSIC21. We found that 314 videos are missing, and the train/val/test split was unavailable. We therefore used a random 80/20 train/test split, which is provided in the data directory.
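A random 80/20 split of this kind can be sketched as follows. This is only an illustration, not the script that produced the split shipped in data: the function name `split_train_test`, the seed, and the example IDs are hypothetical. To reproduce our setup, use the provided split rather than regenerating one.

```python
import random


def split_train_test(video_ids, train_fraction=0.8, seed=0):
    """Deterministically shuffle video IDs and split them into train/test lists."""
    ids = sorted(video_ids)        # sort so the result does not depend on input order
    rng = random.Random(seed)      # local RNG, leaves the global random state untouched
    rng.shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    # Hypothetical IDs; real ones would come from the downloaded MUSIC21 videos.
    example_ids = [f"video_{i:03d}" for i in range(10)]
    train_ids, test_ids = split_train_test(example_ids)
    print(len(train_ids), "train /", len(test_ids), "test")
```

Fixing the seed keeps the split reproducible across runs, which matters when some videos are missing and the list of available IDs differs between downloads.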

After downloading the dataset, pre-process it with the following steps:

1. Following Sound-of-Pixels, extract video frames at 8 fps and waveforms at 11025 Hz from the videos. These frames and waveforms serve as the visual and audio inputs to the TriBERT model.
2. Set up the AlphaPose toolbox to detect 26 keypoints for the body joints and 21 keypoints for each hand.
3. Re-train the ST-GCN network with the keypoints detected by AlphaPose and extract body joint features of size 256 × 68. These features serve as the pose embedding for the pose stream of the TriBERT model.
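The frame and audio extraction settings above can be sketched as command builders. This assumes ffmpeg is used for extraction, which this README does not state. The output file pattern, the mono downmix (`-ac 1`), and the helper names `frame_command` and `audio_command` are likewise assumptions for illustration. The sketch also checks that 26 body keypoints plus 21 keypoints for each of two hands gives 68, which is consistent with the second dimension of the 256 × 68 pose features.

```python
from pathlib import Path

FRAME_RATE = 8          # frames per second, as stated above
AUDIO_RATE = 11025      # waveform sampling rate in Hz, as stated above
BODY_KEYPOINTS = 26
HAND_KEYPOINTS = 21     # per hand
NUM_KEYPOINTS = BODY_KEYPOINTS + 2 * HAND_KEYPOINTS  # body plus both hands


def frame_command(video, out_dir):
    """Build an ffmpeg command that samples frames at FRAME_RATE (assumed tool)."""
    pattern = str(Path(out_dir) / "%06d.jpg")  # hypothetical naming scheme
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={FRAME_RATE}", pattern]


def audio_command(video, out_wav):
    """Build an ffmpeg command that writes audio resampled to AUDIO_RATE."""
    # -vn drops the video stream; -ac 1 (mono) is an assumption, not from this README.
    return ["ffmpeg", "-i", str(video), "-vn", "-ac", "1",
            "-ar", str(AUDIO_RATE), str(out_wav)]


if __name__ == "__main__":
    print(" ".join(frame_command("clip.mp4", "frames/clip")))
    print(" ".join(audio_command("clip.mp4", "audio/clip.wav")))
    print("keypoints per frame:", NUM_KEYPOINTS)
```

Each command list can be passed to `subprocess.run(cmd, check=True)` once the output directories exist.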

Pre-trained model

Please download our pre-trained model from Google Drive. To train from scratch, pre-process the data first and then run:

python train_trimodal.py

Multi-modal Retrieval

The code used for our multi-modal retrieval experiments is in the retrieval directory. We conduct retrieval on TriBERT embeddings as well as on baseline embeddings (taken before passing through TriBERT). The networks used for these tasks are located in tribert_retrieval_networks.py and orig_retrieval_networks.py, respectively.

To train a retrieval network, use train_retrieval_networks.py. To evaluate the performance of a specific type of retrieval between TriBERT embeddings and baseline embeddings, use train_retrieval_networks.py.
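As a rough illustration of how two sets of embeddings can be compared on a retrieval task, the sketch below computes recall@K with cosine similarity. It is not the method implemented in the retrieval directory, which uses learned retrieval networks. The metric choice, the function names, and the toy vectors are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose paired gallery item (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        ranked = sorted(range(len(gallery)),
                        key=lambda j: cosine(query, gallery[j]),
                        reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


if __name__ == "__main__":
    # Toy 2-D embeddings; query i is paired with gallery item i.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    vision = [[0.9, 0.1], [0.2, 0.8]]
    print("recall@1:", recall_at_k(audio, vision, k=1))
```

Running the same function on TriBERT embeddings and on baseline embeddings of the same samples would give two scores that can be compared directly.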

Acknowledgment

This repository is developed on top of ViLBERT and Sound-of-Pixels. Please also refer to the original licenses of these projects.

BibTeX

If you find this code useful for your research, please cite our paper:

@inproceedings{rahman2021tribert,
  title={TriBERT: Human-centric Audio-visual Representation Learning},
  author={Rahman, Tanzila and Yang, Mengyu and Sigal, Leonid},
  booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
  year={2021}
}
