[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Last update: Dec 13, 2022

Overview

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [CVPR'21, Oral]

By Zhicheng Huang*, Zhaoyang Zeng*, Yupan Huang*, Bei Liu, Dongmei Fu and Jianlong Fu

Introduction

This is the official implementation of the paper. We propose SOHO ("See Out of tHe bOx"), which takes a whole image as input and learns vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.

Architecture

Release Progress

VQA Codebase
Pre-training Codebase
Other Downstream Tasks

Installation

conda create -n soho python=3.7
conda activate soho
git clone https://github.com/researchmm/soho.git
cd soho
bash tools/install.sh

Getting Started

Download the training, validation and test data

mkdir -p $SOHO_ROOT/data/coco
cd $SOHO_ROOT/data/coco
# NOTE: the download URLs below need to be updated
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train2014.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val2014.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test2015.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train_data_qa_caption_new_box.json
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val_data_qa_caption_new_box.json
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test_data_qa.json
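
The download commands above rely on SOHO_ROOT, which this README never sets; it presumably points at the cloned soho directory. As a minimal sketch under that assumption, the helper below reports which of the six files listed above are still missing from data/coco. The function name check_soho_data is hypothetical and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the SOHO repo): print every expected
# download that is missing from <root>/data/coco and return non-zero
# if at least one file is absent.
check_soho_data() {
  local root="$1"
  local missing=0
  local f
  for f in train2014.zip val2014.zip test2015.zip \
           train_data_qa_caption_new_box.json \
           val_data_qa_caption_new_box.json \
           test_data_qa.json; do
    if [ ! -f "$root/data/coco/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumption: run from inside the directory created by git clone):
# export SOHO_ROOT="$PWD"
# check_soho_data "$SOHO_ROOT" && echo "all data files present"
```

Running the check before training can surface an incomplete download early, which matters here because the URLs are marked as needing an update.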

Download the pre-trained models

cd $SOHO_ROOT
mkdir -p $SOHO_ROOT/pretrained
cd $SOHO_ROOT/pretrained
# the following need to update
wget

Training a VQA model

cd $SOHO_ROOT
# use 8 GPUs to train the model
bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py 8

Evaluating a VQA model

bash tools/dist_test_vqa.sh configs/VQA/soho_res18_vqa.py 18 8

Citation

If you find this repo useful in your research, please consider citing the following papers:

@inproceedings{huang2021seeing,
  title={Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Huang, Yupan and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

@article{huang2020pixel,
  title={Pixel-bert: Aligning image pixels with text by deep multi-modal transformers},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  journal={arXiv preprint arXiv:2004.00849},
  year={2020}
}

Acknowledgements

We would like to thank mmcv and mmdetection. Our commons lib is based on mmcv.

Owner

Multimedia Research
