Vision-Language Pre-training for Image Captioning and Question Answering

Related tags

Deep LearningVLP
Overview

VLP

This repo hosts the source code for our AAAI2020 work Vision-Language Pre-training (VLP). We have released the pre-trained model on Conceptual Captions dataset and fine-tuned models on COCO Captions and Flickr30k for image captioning and VQA 2.0 for VQA.

Installation

Conda Environment (Option I, Recommended)

  1. Recursively ssh clone the repo to include coco and pythia submodules.
git clone --recursive [email protected]:LuoweiZhou/VLP.git

or clone with https:

git clone --recursive https://github.com/LuoweiZhou/VLP.git
  1. Install CUDA (e.g., 10.0), CUDNN (e.g., v7.5), and Miniconda (either Miniconda2 or 3, version 4.6+).

  2. Run the following commands to set up conda env and install Python packages:

MINICONDA_ROOT=[to your Miniconda root directory] # e.g., /home/[usrname]/miniconda3
cd VLP
conda env create -f misc/vlp.yml --prefix $MINICONDA_ROOT/envs/vlp
conda activate vlp
  1. Finally, cd to the repo root directory and install other dependencies by running:
./setup.sh

To support language evaluation (SPICE), run

cd coco-caption
./get_stanford_models.sh

Docker Image (Option II)

First, install or upgrade to the latest docker (e.g., set <VERSION_STRING> to 5:19.03.2~3-0~ubuntu-xenial). Then pull our docker image:

docker pull luzhou/vlp

Before running the container, you need to declare the environment variable to your data root ($DATA_ROOT, see data prep) and it will be attached as a volume to our container. Finally, install nvidia-container-toolkit and run the docker image in a fresh container:

docker run --gpus all --name vlp_container -it \
     -v $DATA_ROOT:/mnt/dat \
     --shm-size 8G -p 8888:8888 vlp /bin/bash

You can know more about docker commands and usages here.

(Optional) To build the image on your own,

docker build -t vlp .

Data Preparation

Download links for dataset annotations and features: COCO Captions+VQA 2.0 (Part I(95GB), Part II(79GB), download both and run cat COCO0* > COCO.tar.gz), Flickr30k Captions(27GB). If you prefer to download with wget, we attach the commands here. Then, uncompress the downloaded files and place under your data root (denoted as DATA_ROOT).

To prepare for the pre-training, first download and uncompress our pre-processed Conceptual Captions (CC) data(6GB) and place under your data root. Then, download and uncompress the region features from Google Drive (feat(509GB), cls(468GB)) under the CC/region_feat_gvd_wo_bgd/feat_cls_1000_float16 dir. To evaluate CC on caption generation, download the reference file and place it under coco-caption/annotations.

Besides, download and uncompress the detectron fc7 weight files under the code root directory (denoted as CODE_ROOT): GVD Detectron fc7.

(Optional, only for VQA) Download the VQA 2.0 annotation (based on Pythia):

cd $CODE_ROOT/pythia
mkdir -p data && cd data
wget http://dl.fbaipublicfiles.com/pythia/data/vocab.tar.gz
tar xf vocab.tar.gz && rm vocab.tar.gz

wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip
unzip v2_Annotations_Val_mscoco.zip && rm v2_Annotations_Val_mscoco.zip

mkdir -p imdb && cd imdb
wget https://dl.fbaipublicfiles.com/pythia/data/imdb/vqa.tar.gz
tar xf vqa.tar.gz && rm vqa.tar.gz

(Optional, only for pre-training) Download the UniLM checkpoints and uncompress under your checkpoint root (denoted as CHECKPOINT_ROOT).

Experiment Overview

Most of the experiments in this work are performed on 8x V100 GPUs with distributed data parallel (i.e., set --world_size to 8, --local_rank and --global_rank from 0 to 7 with 8 separate scripts), unless specified otherwise. See below for detailed configurations (also in the Appendix of the paper).

Dataset Batch Size Learning Rate # of Epochs GPUs Time per Epoch
CC 64(x8) 1e-4(x8) 30 8x V100 5hr
COCO 64(x8) 3e-5(x8) 30 8x V100 12min
VQA 2.0 64(x2) 2e-5(x2) 20 2x V100 32min
Flickr30k 64(x8) 3e-5(x8) 30 8x V100 3min
COCO (w/o pre-training) 64(x8) 3e-4(x8) 30 8x V100 12min
COCO (SCST training) 16(x4) 1e-6(x4) 30 4x Titan Xp 3hr

The (x2), (x4), (x8) in the batch size and learning rate results from distributed data parallel. Gradients are accumulated/added across GPUs.

Note that some modules need to be imported manually:

export PYTHONPATH=$CODE_ROOT/pythia:$CODE_ROOT/pythia/pythia/legacy:$CODE_ROOT:$PYTHONPATH

Pre-training

An example code on single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_cc} \
    --model_recover_path $CHECKPOINT_ROOT/bert_save/base_model_pretrained/model_153999_cpu.bin \
    --do_train --learning_rate ${lr} --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/CC/annotations/dataset_cc.json \
    --dataset cc --split train --file_valid_jpgs $DATA_ROOT/CC/annotations/cc_valid_jpgs.json \
    --local_rank -1 --global_rank -1 --world_size 1 --enable_butd \
    --s2s_prob ${w_s} --bi_prob ${w_b} --image_root $DATA_ROOT/CC/region_feat_gvd_wo_bgd \
    --region_bbox_file bbox/cc_detection_vg_thresh0.2_feat_gvd_checkpoint_trainval.h5 \
    --region_det_file_prefix feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval

where lr=1e-4, w_s=0.75, w_b=0.25, and checkpoint_cc is the id of the checkpoint. The pre-trained models are available here.

Fine-tuning

The fine-tuning checkpoints are available at: COCO (CE optim), COCO (CIDEr optim), VQA 2.0 (train on train set only), Flickr30k.

COCO Captions

An example code on single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0

(Optional) To enable Self-Critical Sequence Training (SCST), set --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.28.bin, --max_pred 0, --mask_prob 0, --scst, --learning_rate 1e-6 (note that SCST requires a much smaller lr than the default 3e-5), and --output_dir accordingly. The training takes 30 epochs to converge with each epoch takes roughly 3hr.

An example code on 2-GPU training with distributed data parallel:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --local_rank 0 --global_rank 0 --world_size 2 &
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --local_rank 1 --global_rank 1 --world_size 2

VQA 2.0

An example code on single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_vqa2} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --learning_rate 2e-5 --new_segment_ids --always_truncate_tail --amp \
    --num_train_epochs 20 --enable_butd --s2s_prob 0 --bi_prob 1 \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd
    --tasks vqa2 --src_file $CODE_ROOT/pythia/data/imdb/vqa/imdb_train2014.npy \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --mask_prob 0 --max_pred 1

To get the models for leaderboard, we perform the training on both train set and val set (set src_file to imdb_train2014 and imdb_val2014).

Flickr30k Captions

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_flickr30k} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --dataset flickr30k --region_bbox_file $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
    --src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
    --file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json

Inference and Testing

Here, we list the expected result outcomes from our Unified VLP checkpoints. For image captioning, on Karpathy's test split:

Dataset Method [email protected] METEOR CIDEr SPICE
COCO Unified VLP 36.5 28.4 116.9 21.2
Unified VLP + SCST 39.5 29.3 129.3 23.2
Flickr30k Unified VLP 30.1 23.0 67.4 17.0

For VQA:

Dataset Trained on Eval Split Overall Yes/No Number Other
VQA 2.0 train only Dev 67.4 85.4 50.1 58.3
train+val Test-Dev 70.5 87.2 52.1 60.3
train+val Test-Standard 70.7 87.4 52.1 60.5

Note that results on Test-Dev and Test-Standard are from VQA 2.0 evaluation server. train+val indicates models are trained on both training set and validation set following the practice from early works.

Note: All the evaluation scripts support data parallel. But since we do not use standard PyTorch DataLoader, the data loading speed might be the bottleneck (imagine num_workers is always 0). We recommend to perform single-GPU inference (e.g., CUDA_VISIBLE_DEVICES=0).

COCO Captions

python vlp/decode_img2txt.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.${epoch}.bin \
    --new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ --split ${split} \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json

where checkpoint_coco_ce indicates checkpoint name, beam=1 for split=val set and 5 for split=test set, and epoch indicates the checkpoint at which epoch.

VQA 2.0

python vlp/eval_vqa2.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_vqa2}/model.${epoch}.bin \
    --new_segment_ids --enable_butd --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ \
    --src_file $CODE_ROOT/pythia/data/imdb/vqa/imdb_${split}.npy --batch_size 50 \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json --split ${split}

where split could be val2014 or test2015.

Flickr30k Captions

python vlp/decode_img2txt.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_flickr30k}/model.${epoch}.bin \
    --new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
    --image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/ --split ${split} \
    --dataset flickr30k --region_bbox_file $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
    --src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
    --file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json

where beam=1 for split=val set and 5 for split=test set, and epoch indicates the checkpoint at which epoch.

Testing

For all the datasets, checkpoints (by epochs) with the best validation accuracy (CIDEr in captioning and overall accuracy in VQA) are evaluated on the test set (Test-Dev and Test-Standard for VQA 2.0).

Misc

The Detectron-based feature extraction code is available under this repo. You need to download this config file and checkpoint file.

List of download commands (only for OneDrive):

wget -O caption_cc_val.json "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212017&authkey=AHy5eiJM75RwPxg"

# data
wget -O COCO00 "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212019&authkey=ACn4bwZ0nmZ0nik"
wget -O COCO01 "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212018&authkey=AHoTGG-7-6kwoAY"
wget -O flickr30k.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212015&authkey=AFZ2iehPM8HREeA"
wget -O CC.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%213781&authkey=ANA--esfJnWIKIE"

# UniLM checkpoint
wget -O bert_save.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212016&authkey=AB5-lxzCkgpfLhg"

# pre-training checkpoints
wget -O cc_g8_lr1e-4_batch512_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212026&authkey=AH98pIVaNS4apSI"

# fine-tuning checkpoints
wget -O coco_g8_lr3e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212028&authkey=AEjQxFF1FcBK-Aw"
wget -O coco_g4_lr1e-6_batch64_scst.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212027&authkey=ACM1UXlFxgfWyt0"
wget -O vqa2_g2_lr2e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212029&authkey=APjfGJd1-nzDO7s"
wget -O flickr30k_g8_lr3e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212030&authkey=AGmfQ0fXcYCQun0"

# Detectron config/model
wget -O e2e_faster_rcnn_X-101-64x4d-FPN_2x-vlp.yaml "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212013&authkey=AHIvnE1FcggwiLU"
wget -O e2e_faster_rcnn_X-101-64x4d-FPN_2x-vlp.pkl "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212014&authkey=AAHgqN3Y-LXcBvU"

Reference

Please acknowledge the following paper if you use the code:

@article{zhou2019vlp,
  title={Unified Vision-Language Pre-Training for Image Captioning and VQA},
  author={Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao},
  journal={arXiv preprint arXiv:1909.11059},
  year={2019}
}

Related Projects/Codebase

Acknowledgement

Our code is mainly based on Li Dong et al.'s UniLM repo. Also, a part of the code is based on pytorch-transformers v0.4.0 and ImageCaptioning.pytorch. We thank the authors for their wonderful open-source efforts.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the UniLM project and pytorch-transformers v0.4.0 project.

Owner
Luowei Zhou
Senior Researcher @ Microsoft. UMich Ph.D.
Luowei Zhou
The reference baseline of final exam for XMU machine learning course

Mini-NICO Baseline The baseline is a reference method for the final exam of machine learning course. Requirements Installation we use /python3.7 /torc

JoaquinChou 3 Dec 29, 2021
Sound and Cost-effective Fuzzing of Stripped Binaries by Incremental and Stochastic Rewriting

StochFuzz: A New Solution for Binary-only Fuzzing StochFuzz is a (probabilistically) sound and cost-effective fuzzing technique for stripped binaries.

Zhuo Zhang 164 Dec 05, 2022
SIR model parameter estimation using a novel algorithm for differentiated uniformization.

TenSIR Parameter estimation on epidemic data under the SIR model using a novel algorithm for differentiated uniformization of Markov transition rate m

The Spang Lab 4 Nov 30, 2022
Yolov5 + Deep Sort with PyTorch

딥소트 수정중 Yolov5 + Deep Sort with PyTorch Introduction This repository contains a two-stage-tracker. The detections generated by YOLOv5, a family of obj

1 Nov 26, 2021
UFPR-ADMR-v2 Dataset

UFPR-ADMR-v2 Dataset The UFPR-ADMRv2 dataset contains 5,000 dial meter images obtained on-site by employees of the Energy Company of Paraná (Copel), w

Gabriel Salomon 8 Sep 29, 2022
The official codes for the ICCV2021 presentation "Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting"

UEPNet (ICCV2021 Poster Presentation) This repository contains codes for the official implementation in PyTorch of UEPNet as described in Uniformity i

Tencent YouTu Research 15 Dec 14, 2022
Sequence lineage information extracted from RKI sequence data repo

Pango lineage information for German SARS-CoV-2 sequences This repository contains a join of the metadata and pango lineage tables of all German SARS-

Cornelius Roemer 24 Oct 26, 2022
Python package for dynamic system estimation of time series

PyDSE Toolset for Dynamic System Estimation for time series inspired by DSE. It is in a beta state and only includes ARMA models right now. Documentat

Blue Yonder GmbH 40 Oct 07, 2022
Tensorflow implementation of DeepLabv2

TF-deeplab This is a Tensorflow implementation of DeepLab, compatible with Tensorflow 1.2.1. Currently it supports both training and testing the ResNe

Chenxi Liu 21 Sep 27, 2022
Pytorch implementation of FlowNet by Dosovitskiy et al.

FlowNetPytorch Pytorch implementation of FlowNet by Dosovitskiy et al. This repository is a torch implementation of FlowNet, by Alexey Dosovitskiy et

Clément Pinard 762 Jan 02, 2023
SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements (CVPR 2021)

SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements (CVPR 2021) This repository contains the official PyTorch implementa

Qianli Ma 133 Jan 05, 2023
novel deep learning research works with PaddlePaddle

Research 发布基于飞桨的前沿研究工作,包括CV、NLP、KG、STDM等领域的顶会论文和比赛冠军模型。 目录 计算机视觉(Computer Vision) 自然语言处理(Natrual Language Processing) 知识图谱(Knowledge Graph) 时空数据挖掘(Spa

1.5k Dec 29, 2022
Implementation of "Bidirectional Projection Network for Cross Dimension Scene Understanding" CVPR 2021 (Oral)

Bidirectional Projection Network for Cross Dimension Scene Understanding CVPR 2021 (Oral) [ Project Webpage ] [ arXiv ] [ Video ] Existing segmentatio

Hu Wenbo 135 Dec 26, 2022
Python wrapper to access the amazon selling partner API

PYTHON-AMAZON-SP-API Amazon Selling-Partner API If you have questions, please join on slack Contributions very welcome! Installation pip install pytho

Michael Primke 330 Jan 06, 2023
Contains modeling practice materials and homework for the Computational Neuroscience course at Okinawa Institute of Science and Technology

A310 Computational Neuroscience - Okinawa Institute of Science and Technology, 2022 This repository contains modeling practice materials and homework

Sungho Hong 1 Jan 24, 2022
My implementation of DeepMind's Perceiver

DeepMind Perceiver (in PyTorch) Disclaimer: This is not official and I'm not affiliated with DeepMind. My implementation of the Perceiver: General Per

Louis Arge 55 Dec 12, 2022
Disease Informed Neural Networks (DINNs) — neural networks capable of learning how diseases spread, forecasting their progression, and finding their unique parameters (e.g. death rate).

DINN We introduce Disease Informed Neural Networks (DINNs) — neural networks capable of learning how diseases spread, forecasting their progression, a

19 Dec 10, 2022
Inference pipeline for our participation in the FeTA challenge 2021.

feta-inference Inference pipeline for our participation in the FeTA challenge 2021. Team name: TRABIT Installation Download the two folders in https:/

Lucas Fidon 2 Apr 13, 2022
Code for 'Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning', ICCV 2021

CMIC-Retrieval Code for Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning. ICCV 2021. Introduction In this wo

42 Nov 17, 2022