[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Last update: Dec 13, 2022

Overview

By Zhicheng Huang*, Zhaoyang Zeng*, Yupan Huang*, Bei Liu, Dongmei Fu and Jianlong Fu

Introduction

This is the official implementation of the paper. In this paper, we propose SOHO ("See Out of tHe bOx"), which takes a whole image as input and learns vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.

Architecture

Release Progress

VQA Codebase
Pre-training Codebase
Other Downstream Tasks

Installation

conda create -n soho python=3.7
conda activate soho
git clone https://github.com/researchmm/soho.git
cd soho
bash tools/install.sh

Getting Started

Download the training, validation and test data

mkdir -p $SOHO_ROOT/data/coco
cd $SOHO_ROOT/data/coco
# need to update
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train2014.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val2014.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test2015.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train_data_qa_caption_new_box.json
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val_data_qa_caption_new_box.json
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test_data_qa.json
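
Once the downloads finish, it can help to confirm that all six files are in place before training. The following is a minimal sketch and not part of the SOHO repository: `check_soho_data` is a hypothetical helper, and it assumes `SOHO_ROOT` points at the cloned `soho` directory (the commands above use the variable without defining it).

```shell
# Hypothetical helper (not part of the SOHO repo): report which of the
# files downloaded above are missing or empty in a data directory.
check_soho_data() {
  local data_dir="$1"
  local missing=0
  local f
  for f in train2014.zip val2014.zip test2015.zip \
           train_data_qa_caption_new_box.json \
           val_data_qa_caption_new_box.json \
           test_data_qa.json; do
    if [ ! -s "$data_dir/$f" ]; then
      echo "missing or empty: $f"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every file is present and non-empty.
  [ "$missing" -eq 0 ]
}

# Usage (assumption: SOHO_ROOT is the directory created by `git clone`):
# export SOHO_ROOT=/path/to/soho
# check_soho_data "$SOHO_ROOT/data/coco"
```

The function only checks that each file exists and is non-empty; it does not verify archive integrity or unpack anything.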

Download the pre-trained models

cd $SOHO_ROOT
mkdir -p $SOHO_ROOT/pretrained
cd $SOHO_ROOT/pretrained
# the following need to update
wget

Training a VQA model

cd $SOHO_ROOT
# Use 8 GPUs to train the model
bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py 8
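
The trailing `8` in the command above appears to be the GPU count, as the comment suggests. On a machine with a different number of GPUs, that value could be derived instead of hard-coded. This is a hedged sketch: `count_gpus` is a hypothetical helper, not part of the repository, and it assumes `nvidia-smi` is installed and working whenever `CUDA_VISIBLE_DEVICES` is unset. The config may have been tuned for 8 GPUs, so results with other counts may differ.

```shell
# Hypothetical helper (not part of the SOHO repo): print the number of GPUs
# visible to this shell.
count_gpus() {
  if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    # Count the comma-separated device IDs, e.g. "0,1,2" -> 3.
    printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
  elif command -v nvidia-smi >/dev/null 2>&1; then
    # `nvidia-smi -L` prints one line per GPU.
    nvidia-smi -L | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Usage (left commented out so that sourcing this snippet launches nothing):
# bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py "$(count_gpus)"
```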

Evaluating a VQA model

bash tools/dist_test_vqa.sh configs/VQA/soho_res18_vqa.py 18 8

Citation

If you find this repo useful in your research, please consider citing the following papers:

@inproceedings{huang2021seeing,
  title={Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Huang, Yupan and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

@article{huang2020pixel,
  title={Pixel-bert: Aligning image pixels with text by deep multi-modal transformers},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  journal={arXiv preprint arXiv:2004.00849},
  year={2020}
}

Acknowledgements

We would like to thank the mmcv and mmdetection projects. Our commons library is based on mmcv.
