Last update: Dec 28, 2022

Overview

CLIP-ViL

In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show that CLIP features improve over traditional ResNet features on visual question answering, image captioning, navigation, and visual entailment tasks.

We release the extracted features and reproducible code here.

Specifically, we develop our methods in two scenarios: (1) direct task-specific fine-tuning; and (2) Vision and Language pre-training.

CLIP-ViL-Direct/VLN

We directly plug CLIP into task-specific models and fine-tune on three representative tasks: Visual Question Answering, Image Captioning, and Vision-Language Navigation.

Please see the corresponding code directory for full details.

Note that in direct fine-tuning, for Visual Question Answering on VQA 2.0 test-dev, we achieve up to 68.37% accuracy with Pythia and 74.01% accuracy with MCAN, generally an improvement of more than 4.0% in accuracy. For Image Captioning on Karpathy's test split of MS COCO, we obtain a 2.1% improvement in CIDEr over ResNet alternatives. For Navigation, we obtain a 5% improvement in nDTW (the main metric for RxR) on RxR, and about a 6% improvement in accuracy over our strong baselines on R2R.

CLIP-ViL-Pretrain

To test the potential of combining CLIP pre-training with Vision and Language pre-training, we introduce CLIP-ViL-Pretrain, a vision-and-language model pre-trained on image-text data with the CLIP visual encoder as its visual backbone. CLIP-ViL-Pretrain is pre-trained on aligned image-text data with a reconstructive objective and an image-text matching objective. It is then fine-tuned on the VQA, SNLI-VE and GQA tasks.
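As a rough illustration of how a reconstructive objective and an image-text matching objective can be combined into one training loss, here is a minimal sketch in plain Python. It is not the code from this repository: the function names, the use of masked-token cross-entropy as the reconstructive term, and the equal weighting of the two terms are all assumptions made for illustration.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one softmax prediction against a target class index."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def binary_cross_entropy(logit, label):
    """Binary cross-entropy on one logit; label is 1 (matched) or 0 (mismatched)."""
    # softplus(z) = log(1 + exp(z)), computed in a numerically stable form
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit


def pretraining_loss(masked_token_logits, masked_token_targets, match_logit, match_label):
    """Hypothetical combined loss: mean masked-token reconstruction + image-text matching."""
    reconstruction = sum(
        softmax_cross_entropy(logits, target)
        for logits, target in zip(masked_token_logits, masked_token_targets)
    ) / len(masked_token_targets)
    matching = binary_cross_entropy(match_logit, match_label)
    # Equal weighting of the two terms is an assumption of this sketch.
    return reconstruction + matching
```

In a real model the logits would come from the cross-modal encoder on top of the CLIP visual backbone; here they are plain lists of floats so that the arithmetic of the two terms is easy to follow.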

Please see the corresponding code directory for full details.

Note that CLIP-ViL-Pretrain achieves 76.48% accuracy on VQA 2.0 test-dev and 76.70% accuracy on test-std; 80.61% accuracy on SNLI-VE Dev and 80.20% on Test-P; and 61.42% accuracy on GQA test-dev and 62.93% accuracy on test-std.

Reference

If you use CLIP-ViL in your research or wish to refer to the baseline results published here, please use the following BibTeX entry.

@misc{shen2021clip,
    title={How Much Can CLIP Benefit Vision-and-Language Tasks?}, 
    author={Sheng Shen and Liunian Harold Li and Hao Tan and Mohit Bansal and Anna Rohrbach and Kai-Wei Chang and Zhewei Yao and Kurt Keutzer},
    year={2021},
    eprint={2107.06383},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
