VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Last update: Dec 28, 2022

Overview

VisualGPT

Our paper: VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Main Architecture of Our VisualGPT

Download the GPT-2 pretrained weights

curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
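The same download can be scripted so that an existing copy of the weights is not fetched again. The following is a minimal sketch using only the Python standard library; the helper name download_if_missing is our own and is not part of the repository.

```python
import os
import urllib.request

# URL and file name are taken from the curl command above.
GPT2_URL = "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin"
GPT2_FILE = "gpt2-pytorch_model.bin"


def download_if_missing(url: str, dest: str) -> bool:
    """Fetch url into dest unless a non-empty file is already there.

    Returns True if a download happened, False if the existing file was kept.
    Hypothetical helper for illustration; not part of the VisualGPT code.
    """
    if os.path.isfile(dest) and os.path.getsize(dest) > 0:
        return False
    # Download under a temporary name first so that an interrupted
    # transfer does not leave a truncated file under the final name.
    partial = dest + ".part"
    urllib.request.urlretrieve(url, partial)
    os.replace(partial, dest)
    return True


# Usage (performs a network request if the file is missing):
# download_if_missing(GPT2_URL, GPT2_FILE)
```

The check only looks at whether the file exists and is non-empty; it does not verify the contents, so a corrupted file would still need to be deleted by hand.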

Environment setup

Clone the repository and create the visualgpt conda environment:

conda env create -f environment.yml
conda activate visualgpt

Then download the spaCy data:

python -m spacy download en

Data preparation

We provide the COCO dataset for download. Please download the annotations file annotations.zip and extract it, then download coco_detections.hdf5, in which the data is stored as key-value pairs where the key is the image id and the value is a tensor of shape (N, 2048). N is the number of detections.
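To make this layout concrete, the sketch below checks a store that maps an image id to an (N, 2048) array of detection features. It is written against plain Python mappings so that it runs without extra dependencies. An open h5py.File is also a mapping whose datasets expose .shape, so the same function should in principle work on coco_detections.hdf5, but the exact key naming inside that file is not stated here and is treated as unknown. The function names and the toy data are illustrative only.

```python
from typing import Dict, Mapping, Tuple

FEATURE_DIM = 2048  # per-detection feature size stated in the text above


def detection_shape(features) -> Tuple[int, int]:
    """Return (N, D) for one image's detections.

    Accepts array-like objects with a .shape attribute (NumPy arrays,
    h5py datasets) as well as plain nested sequences.
    """
    shape = getattr(features, "shape", None)
    if shape is not None:
        n, d = shape
        return int(n), int(d)
    n = len(features)
    d = len(features[0]) if n else 0
    return n, d


def count_detections(store: Mapping) -> Dict[str, int]:
    """Map each image id to N, checking that every row has FEATURE_DIM values."""
    counts = {}
    for image_id, features in store.items():
        n, d = detection_shape(features)
        # An image with zero detections has no rows to check.
        if n > 0 and d != FEATURE_DIM:
            raise ValueError(
                f"{image_id}: expected {FEATURE_DIM} features per detection, got {d}"
            )
        counts[str(image_id)] = n
    return counts


# Toy stand-in for the real file: two images with 3 and 1 detections.
toy_store = {
    "1001": [[0.0] * FEATURE_DIM for _ in range(3)],
    "1002": [[0.0] * FEATURE_DIM for _ in range(1)],
}
print(count_detections(toy_store))  # {'1001': 3, '1002': 1}
```

Note that N varies per image, which is why the count is reported per image id rather than as a single number.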

Code structure

Create the log folder with mkdir logs, then start the training.

Train the model

python train_visualGPT.py --batch_size 50 --head 12 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data
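Two flags in this command affect how much data goes into each update and each run. The sketch below works through the arithmetic under two assumptions that we have not checked against train_visualGPT.py: that gradient accumulation multiplies the effective batch size, as it usually does, and that --train_percentage is a fraction of the training set (so 0.001 would mean 0.1%). The helper names and the example dataset size are illustrative and are not taken from the repository.

```python
import random
from typing import List, Sequence


def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Samples contributing to one optimizer step under gradient accumulation."""
    return batch_size * accumulation_steps


def subsample(items: Sequence, fraction: float, seed: int) -> List:
    """Pick a reproducible random subset holding roughly `fraction` of items.

    Keeps at least one item when the input is non-empty.
    Hypothetical helper; the repository may select its subset differently.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = min(len(items), max(1, round(len(items) * fraction)))
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(list(items), k)


# Values from the command above: --batch_size 50, --gradient_accumulation_steps 2
print(effective_batch_size(50, 2))  # 100

# Assumed dataset size of 100,000 items, chosen only to keep the numbers round.
subset = subsample(range(100_000), fraction=0.001, seed=42)
print(len(subset))  # 100
```

Fixing the seed makes the selected subset repeatable across runs, which matters when comparing models trained on such a small slice of the data.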

Acknowledgement

This code uses resources from Meshed Memory Transformer and Transformers.

Please cite our paper using the following BibTeX entries:

@article{chen2021visualgpt,
  title={VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining},
  author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2102.10407},
  year={2021}
}

@article{chen2021visualgpt,
  title={VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning},
  author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2102.10407},
  year={2021}
}

Owner

Vision CAIR Research Group, KAUST
