More Grounded Image Captioning by Distilling Image-Text Matching Model

Overview

More Grounded Image Captioning by Distilling Image-Text Matching Model

Requirements

  • Python 3.7
  • PyTorch 1.2

Prepare data

  1. Please use git clone --recurse-submodules to clone this repository and remember to follow the initialization steps in coco-caption/README.md. Then download the Flickr30k reference file and place it under coco-caption/annotations. Also download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools/ directory.
  2. Download the preprocessed dataset from this link and extract it to data/.
  3. For Flickr30k-Entities, please download the bottom-up visual features extracted by Anderson's extractor (or Zhou's extractor) from this link (or this link, respectively) and place the uncompressed folders under data/flickrbu/. For MSCOCO, please follow these instructions to prepare the bottom-up features and place them under data/mscoco/.
  4. Download the pretrained models from here and extract them to log/.
  5. Download the pretrained SCAN models from this link and extract them to misc/SCAN/runs. (The expected directory layout after these steps is sketched below.)
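
After completing these steps, the working tree should look roughly as follows. The exact names of the uncompressed folders depend on the downloaded archives, so treat this as a sketch rather than a definitive listing:

  coco-caption/            # submodule; Flickr30k reference file under coco-caption/annotations/
  tools/                   # uncompressed Stanford CoreNLP 3.9.1 folder
  data/                    # preprocessed dataset (step 2)
  data/flickrbu/           # bottom-up feature folders for Flickr30k-Entities (step 3)
  data/mscoco/             # bottom-up features for MSCOCO (step 3)
  log/                     # pretrained captioning models (step 4)
  misc/SCAN/runs/          # pretrained SCAN models (step 5)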

Evaluation

To reproduce the results reported in the paper, simply run

bash eval_flickr.sh

for Flickr30k-Entities and

bash eval_coco.sh

for MSCOCO.

Training

  1. In the first training stage, run:
python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att  --input_box_dir data/flickrbu/flickrbu_box  --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30  --att_supervise  True   --att_supervise_weight 0.1
  2. In the second training stage, run:
python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 --language_eval 1 --val_images_use -1 --self_critical_after 30 --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
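
If it helps, the two stages can be chained in a small shell script like the sketch below. The script is not part of this repository and its name is made up; every flag is copied from the commands above, and the only point it highlights is that stage 2 resumes from the stage-1 checkpoint via --start_from.

#!/usr/bin/env bash
# train_two_stage.sh -- hypothetical wrapper that chains the two commands above.
set -e

# Flags shared by both stages, taken verbatim from the commands above.
DATA_ARGS=(--caption_model topdown --input_json data/flickrtalk.json
  --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att
  --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5
  --batch_size 29 --save_checkpoint_every 1000 --val_images_use -1)

# Stage 1: cross-entropy training with SCAN-distilled attention supervision.
python train.py --id CE-scan-sup-0.1kl "${DATA_ARGS[@]}" \
  --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 \
  --checkpoint_path log/CE-scan-sup-0.1kl --max_epochs 30 \
  --att_supervise True --att_supervise_weight 0.1

# Stage 2: self-critical training with CIDEr and grounding rewards,
# resuming from the stage-1 checkpoint.
python train.py --id sc-ground-CE-scan-sup-0.1kl "${DATA_ARGS[@]}" \
  --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl \
  --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --language_eval 1 \
  --self_critical_after 30 --max_epochs 110 \
  --cider_reward_weight 1 --ground_reward_weight 1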

Citation

@inproceedings{zhou2020grounded,
  title={More Grounded Image Captioning by Distilling Image-Text Matching Model},
  author={Zhou, Yuanen and Wang, Meng and Liu, Daqing and  Hu, Zhenzhen and Zhang, Hanwang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Acknowledgements

This repository is built upon self-critical.pytorch, SCAN and grounded-video-description. Thanks to the authors for releasing their code.
