ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Overview

ROSITA

News & Updates

(24/08/2021)

  • Release the demo to perform fine-grained semantic alignments using the pretrained ROSITA model.

(15/08/2021)

  • Release the basic framework for ROSITA, including the pretrained base ROSITA model, as well as the scripts to run the fine-tuning and evaluation on three downstream tasks (i.e., VQA, REC, ITR) over six datasets.

Introduction

This repository contains source code necessary to reproduce the results presented in our ACM MM paper ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration, which encodes the cROSs- and InTrA-model prior knowledge in a in a unified scene graph to perform knowledge-guided vision-and-language pretraining. Compared with existing counterparts, ROSITA learns better fine-grained semantic alignments across different modalities, thus improving the capability of the pretrained model.

Performance

We compare ROSITA against existing state-of-the-art VLP methods on three downstream tasks. All methods use the base model of Transformer for a fair comparison. The trained checkpoints to reproduce these results are provided in finetune.md.

Tasks VQA REC ITR
Datasets VQAv2
dev | std
RefCOCO
val | testA | testB
RefCOCO+
val | testA | testB
RefCOCOg
val | test
IR-COCO
[email protected] | [email protected] | [email protected]
TR-COCO
[email protected] | [email protected] | [email protected]
IR-Flickr
[email protected] | [email protected] | [email protected]
TR-Flickr
[email protected] | [email protected] | [email protected]
ROSITA 73.91 | 73.97 84.79 | 87.99 | 78.28 76.06 | 82.01 | 67.40 78.23 | 78.25 54.40 | 80.92 | 88.60 71.26 | 91.62 | 95.58 74.08 | 92.44 | 96.08 88.90 | 98.10 | 99.30
SoTA-base 73.59 | 73.67 81.56 | 87.40 | 74.48 76.05 | 81.65 | 65.70 75.90 | 75.93 54.00 | 80.80 | 88.50 70.00 | 91.10 | 95.50 74.74 | 92.86 | 95.82 86.60 | 97.90 | 99.20

Installation

Software and Hardware Requirements

We recommand a workstation with 4 GPU (>= 24GB, e.g., RTX 3090 or V100), 120GB memory and 50GB free disk space. We strongly recommend to use a SSD drive to guarantee high-speed I/O. Also, you should first install some necessary package as follows:

  • Python >= 3.6
  • PyTorch >= 1.4 with Cuda >=10.2
  • torchvision >= 0.5.0
  • Cython
# git clone
$ git clone https://github.com/MILVLG/rosita.git 

# build essential utils
$ cd rosita/rosita/utils/rec
$ python setup.py build
$ cp build/lib*/bbox.cpython*.so .

Dataset Setup

To download the required datasets to run this project, please check datasets.md for details.

Pretraining

Please check pretrain.md for the details for ROSITA pretraining. We currently only provide the pretrained model to run finetuning on downstream tasks. The codes to run pretraining will be released later.

Finetuning

Please check finetune.md for the details for finetuning on downstream tasks. Scripts to run finetuning on downstream tasks are provided. Also, we provide trained models that can be directly evaluated to reproduce the results.

Demo

We provide the Jupyter notebook scripts for reproducing the visualization results shown in our paper.

Acknowledgment

We appreciate the well-known open-source projects such as LXMERT, UNITER, OSCAR, and Huggingface, which help us a lot when writing our codes.

Yuhao Cui (@cuiyuhao1996) and Tong-An Luo (@Zoroaster97) are the main contributors to this repository. Please kindly contact them if you find any issue.

Citations

Please consider citing this paper if you use the code:

@inProceedings{cui2021rosita,
  title={ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration},
  author={Cui, Yuhao and Yu, Zhou and Wang, Chunqi and Zhao, Zhongzhou and Zhang, Ji and Wang, Meng and Yu, Jun},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  year={2021}
}
Owner
Vision and Language Group@ MIL
Hangzhou Dianzi University
Vision and Language Group@ MIL
BabelCalib: A Universal Approach to Calibrating Central Cameras. In ICCV (2021)

BabelCalib: A Universal Approach to Calibrating Central Cameras This repository contains the MATLAB implementation of the BabelCalib calibration frame

Yaroslava Lochman 55 Dec 30, 2022
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intenti

NVIDIA Corporation 6.9k Jan 03, 2023
Official Implementation (PyTorch) of "Point Cloud Augmentation with Weighted Local Transformations", ICCV 2021

PointWOLF: Point Cloud Augmentation with Weighted Local Transformations This repository is the implementation of PointWOLF(To appear). Sihyeon Kim1*,

MLV Lab (Machine Learning and Vision Lab at Korea University) 16 Nov 03, 2022
The mini-MusicNet dataset

mini-MusicNet A music-domain dataset for multi-label classification Music transcription is sequence-to-sequence prediction problem: given an audio per

John Thickstun 4 Nov 09, 2022
Official repository of the paper Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision

Official repository of the paper Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision

Soubhik Sanyal 689 Dec 25, 2022
Official implementation of our CVPR2021 paper "OTA: Optimal Transport Assignment for Object Detection" in Pytorch.

OTA: Optimal Transport Assignment for Object Detection This project provides an implementation for our CVPR2021 paper "OTA: Optimal Transport Assignme

217 Jan 03, 2023
A Light CNN for Deep Face Representation with Noisy Labels

A Light CNN for Deep Face Representation with Noisy Labels Citation If you use our models, please cite the following paper: @article{wulight, title=

Alfred Xiang Wu 715 Nov 05, 2022
CLOOB training (JAX) and inference (JAX and PyTorch)

cloob-training Pretrained models There are two pretrained CLOOB models in this repo at the moment, a 16 epoch and a 32 epoch ViT-B/16 checkpoint train

Katherine Crowson 64 Nov 27, 2022
Unofficial TensorFlow implementation of the Keyword Spotting Transformer model

Keyword Spotting Transformer This is the unofficial TensorFlow implementation of the Keyword Spotting Transformer model. This model is used to train o

Intelligent Machines Limited 8 May 11, 2022
A forwarding MPI implementation that can use any other MPI implementation via an MPI ABI

MPItrampoline MPI wrapper library: MPI trampoline library: MPI integration tests: MPI is the de-facto standard for inter-node communication on HPC sys

Erik Schnetter 31 Dec 22, 2022
Public Implementation of ChIRo from "Learning 3D Representations of Molecular Chirality with Invariance to Bond Rotations"

Learning 3D Representations of Molecular Chirality with Invariance to Bond Rotations This directory contains the model architectures and experimental

35 Dec 05, 2022
Implementation of "Efficient Regional Memory Network for Video Object Segmentation" (Xie et al., CVPR 2021).

RMNet This repository contains the source code for the paper Efficient Regional Memory Network for Video Object Segmentation. Cite this work @inprocee

Haozhe Xie 76 Dec 14, 2022
Unofficial implementation of Alias-Free Generative Adversarial Networks. (https://arxiv.org/abs/2106.12423) in PyTorch

alias-free-gan-pytorch Unofficial implementation of Alias-Free Generative Adversarial Networks. (https://arxiv.org/abs/2106.12423) This implementation

Kim Seonghyeon 502 Jan 03, 2023
A study project using the AA-RMVSNet to reconstruct buildings from multiple images

3d-building-reconstruction This is part of a study project using the AA-RMVSNet to reconstruct buildings from multiple images. Introduction It is exci

17 Oct 17, 2022
A basic reminder tool written in Python.

A simple Python Reminder Here's a basic reminder tool written in Python that speaks to the user and sends a notification. Run pip3 install pyttsx3 w

Sachit Yadav 4 Feb 05, 2022
This is the research repository for Vid2Doppler: Synthesizing Doppler Radar Data from Videos for Training Privacy-Preserving Activity Recognition.

Vid2Doppler: Synthesizing Doppler Radar Data from Videos for Training Privacy-Preserving Activity Recognition This is the research repository for Vid2

Future Interfaces Group (CMU) 26 Dec 24, 2022
Ensemble Visual-Inertial Odometry (EnVIO)

Ensemble Visual-Inertial Odometry (EnVIO) Authors : Jae Hyung Jung, Yeongkwon Choe, and Chan Gook Park 1. Overview This is a ROS package of Ensemble V

Jae Hyung Jung 95 Jan 03, 2023
ProMP: Proximal Meta-Policy Search

ProMP: Proximal Meta-Policy Search Implementations corresponding to ProMP (Rothfuss et al., 2018). Overall this repository consists of two branches: m

Jonas Rothfuss 212 Dec 20, 2022
Simple converter for deploying Stable-Baselines3 model to TFLite and/or Coral

Running SB3 developed agents on TFLite or Coral Introduction I've been using Stable-Baselines3 to train agents against some custom Gyms, some of which

Gary Briggs 16 Oct 11, 2022
Official PyTorch code for the paper: "Point-Based Modeling of Human Clothing" (ICCV 2021)

Point-Based Modeling of Human Clothing Paper | Project page | Video This is an official PyTorch code repository of the paper "Point-Based Modeling of

Visual Understanding Lab @ Samsung AI Center Moscow 64 Nov 22, 2022