Code and data for ImageCoDe, a contextual vison-and-language benchmark

Overview

ImageCoDe

arxiv

This repository contains code and data for ImageCoDe: Image Retrieval from Contextual Descriptions.

Example

Data

All collected descriptions for the training and validation set are under data/train_data.json and data/valid_data.json.

Image sets can be downloaded on Zenodo or GoogleDrive and should be unzipped in data/.

You can download from the commandline via:

wget https://zenodo.org/record/6518944/files/image-sets.zip

For ViLBERT experiments, you need to download a pretrained ViLBERT checkpoint from volta here, simply by clicking on ViLBERT in the table. Save the downloaded file as baselines/vilbert/vilbert-pretrained.bin. Since ViLBERT uses image features from Faster R-CNN, you also have to downloaded these for all ImageCoDe images here: Google Drive link. Save the file as data/rcnn-features36-36.lmdb. The same procedure applies for UNITER.

The format for data/train_data.json looks like this:

{
  "MSR-VTT-videoTrainValVideo_video2044-shot1_0": {
    "6": "a mom holding her babies in the middle of the picture, no other image intervenes with the image.",
    "7": "The image is fading between a woman holding a baby and a woman sitting with a red background. The hands of the woman sitting aren't visible."
  },
  "video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2": {
  "..."
  }
}

And the images under data/ have the following structure. Each folder contains 10 images. If the images are video frames, the number X in imgX.jpg indicates the frame number:

  .
  ├── MSR-VTT-videoTrainValVideo_video2044-shot1_0
      │   ├── img0.jpg
      │   ├── img7.jpg
      │   ├── ...
  ├── video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2
      │   ├── ...

Leaderboard

Based on this you can train your model and test on the unlabeled test set:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    "The team name on shirt is visible without a number, but all letters can be seen for team name.",
    "the player can be seen with him on the left close to the logo on the pitch on the right and can be clearly seen"
  ],
  "...":
  ["..."]
}

In order to appear on the leaderboard, please format your results in the following format:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    1,
    2
  ],
  "...":
  ["..."]
}

Where the example here with "1" and "2" represent image indices ranging from 0 to 9. You can submit to the leaderboard by sending your test set file (or a download link) to [email protected] and we will update the leaderboard quickly (max. 1-2 days). The leaderboard is maintained on the project website and might change its submission procedure at some point.

Installations

Run install.sh for running CLIP experiments. For VilBERT follow the instructions for volta.

Code

Code for CLIP is under baselines/clip and and code for ViLBERT/UNITER is under baselines/crossencoders.

For details commands to run each model variant shown in the paper, have a look at the README in baselines.

For example to train the best performing model CLIP+TemporalEmbeddings, run:

python3 contextual.py --lr 2e-6 --lr_head 1e-4 -b 36 -m ViT-B/16 --fusion mult -a gelu --logit_scale 1000 --finetuned_checkpoint_path checkpoints/CONTRA_clip_best__36_4e-06_30_1395526.pt --add_input --frozen_clip --positional

Data Analysis

Our manual annotation of various phenomena (negation, nuances, ...) in our validation set can be found under data/manual_annotation_valid.yaml

License

This work is licensed under the MIT license. See LICENSE for details. Third-party software and data sets are subject to their respective licenses.
If you want to cite our paper, please use:

@inproceedings{krojer_contextual_2022,
  address = {Online},
  title = {Image Retrieval from Contextual Descriptions},
  booktitle = {Proceedings of the 60th {Annual} {Meeting} of the {Association} for {Computational} {Linguistics},
  publisher = {Association for Computational Linguistics},
  author = {Krojer, Benno and Adlakha, Vaibhav and Vineet, Vibhav and Goyal, Yash and Ponti, Edoardo and Reddy, Siva},
  month = may,
  year = {2022},
}

Acknowledgement

Our data (specifically the image sets) are built upon 3 video dataset and Open Images:

We also the volta repository for ViLBERT and UNITER baseline variants

For questions or feedback, don't hesitate to contact the author: [email protected]

Owner
McGill NLP
Research group within McGill University and Mila focusing on various topics in natural language processing.
McGill NLP
Neural Magic Eye: Learning to See and Understand the Scene Behind an Autostereogram, arXiv:2012.15692.

Neural Magic Eye Preprint | Project Page | Colab Runtime Official PyTorch implementation of the preprint paper "NeuralMagicEye: Learning to See and Un

Zhengxia Zou 56 Jul 15, 2022
Azion the best solution of Edge Computing in the world.

Azion Edge Function docker action Create or update an Edge Functions on Azion Edge Nodes. The domain name is the key for decision to a create or updat

8 Jul 16, 2022
Code for our CVPR 2021 paper "MetaCam+DSCE"

Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification (CVPR'21) Introduction Code for our CVPR 2021

FlyingRoastDuck 59 Oct 31, 2022
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 13.8k Jan 03, 2023
Resources complimenting the Machine Learning Course led in the Faculty of mathematics and informatics part of Sofia University.

Machine Learning and Data Mining, Summer 2021-2022 How to learn data science and machine learning? Programming. Learn Python. Basic Statistics. Take a

Simeon Hristov 8 Oct 04, 2022
Animation of solving the traveling salesman problem to optimality using mixed-integer programming and iteratively eliminating sub tours

tsp-streamlit Animation of solving the traveling salesman problem to optimality using mixed-integer programming and iteratively eliminating sub tours.

4 Nov 05, 2022
Alphabetical Letter Recognition

DecisionTrees-Image-Classification Alphabetical Letter Recognition In these demo we are using "Decision Trees" Our database is composed by Learning Im

Mohammed Firass 4 Nov 30, 2021
code from "Tensor decomposition of higher-order correlations by nonlinear Hebbian plasticity"

Code associated with the paper "Tensor decomposition of higher-order correlations by nonlinear Hebbian learning," Ocker & Buice, Neurips 2021. "plot_f

Gabriel Koch Ocker 4 Oct 16, 2022
KwaiRec: A Fully-observed Dataset for Recommender Systems (Density: Almost 100%)

KuaiRec: A Fully-observed Dataset for Recommender Systems (Density: Almost 100%) KuaiRec is a real-world dataset collected from the recommendation log

Chongming GAO (高崇铭) 70 Dec 28, 2022
Accelerated SMPL operation, commonly used in generate 3D human mesh, STAR included.

SMPL2 An enchanced and accelerated SMPL operation which commonly used in 3D human mesh generation. It takes a poses, shapes, cam_trans as inputs, outp

JinTian 20 Oct 17, 2022
A list of multi-task learning papers and projects.

This page contains a list of papers on multi-task learning for computer vision. Please create a pull request if you wish to add anything. If you are interested, consider reading our recent survey pap

svandenh 297 Dec 17, 2022
[cvpr22] Perturbed and Strict Mean Teachers for Semi-supervised Semantic Segmentation

PS-MT [cvpr22] Perturbed and Strict Mean Teachers for Semi-supervised Semantic Segmentation by Yuyuan Liu, Yu Tian, Yuanhong Chen, Fengbei Liu, Vasile

Yuyuan Liu 132 Jan 03, 2023
Official PyTorch implementation of PS-KD

Self-Knowledge Distillation with Progressive Refinement of Targets (PS-KD) Accepted at ICCV 2021, oral presentation Official PyTorch implementation of

61 Dec 28, 2022
Weakly-Supervised Semantic Segmentation Network with Deep Seeded Region Growing (CVPR 2018).

Weakly-Supervised Semantic Segmentation Network with Deep Seeded Region Growing (CVPR2018) By Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu and J

Zilong Huang 245 Dec 13, 2022
A Python wrapper for Google Tesseract

Python Tesseract Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and "read" the text embedded i

Matthias A Lee 4.6k Jan 05, 2023
RLBot Python bindings for the Rust crate rl_ball_sym

RLBot Python bindings for rl_ball_sym 0.6 Prerequisites: Rust & Cargo Build Tools for Visual Studio RLBot - Verify that the file %localappdata%\RLBotG

Eric Veilleux 2 Nov 25, 2022
House3D: A Rich and Realistic 3D Environment

House3D: A Rich and Realistic 3D Environment Yi Wu, Yuxin Wu, Georgia Gkioxari and Yuandong Tian House3D is a virtual 3D environment which consists of

Meta Research 1.1k Dec 14, 2022
Image Fusion Transformer

Image-Fusion-Transformer Platform Python 3.7 Pytorch =1.0 Training Dataset MS-COCO 2014 (T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ram

Vibashan VS 68 Dec 23, 2022
Code for the paper Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations (AKBC 2021).

Relation Prediction as an Auxiliary Training Objective for Knowledge Base Completion This repo provides the code for the paper Relation Prediction as

Facebook Research 85 Jan 02, 2023
给yolov5加个gui界面,使用pyqt5,yolov5是5.0版本

博文地址 https://xugaoxiang.com/2021/06/30/yolov5-pyqt5 代码执行 项目中使用YOLOv5的v5.0版本,界面文件是project.ui pip install -r requirements.txt python main.py 图片检测 视频检测

Xu GaoXiang 215 Dec 30, 2022