Code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

Last update: Jan 02, 2023

Overview

Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

News

(03/16/2022) Uploaded retrieval checkpoints fine-tuned on COCO and Flickr.

This is the official PyTorch implementation of TCL.

Requirements:

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
pip install transformers==4.8.1
pip install timm==0.4.9
conda install ruamel_yaml
pip install opencv-python
pip install --upgrade Pillow
pip install einops

Pre-training Datasets:

MSCOCO (2014)
Visual Genome (VG)
- Download the part1 and part2 images and merge them into a single directory
Conceptual Captions (CC3M)
- Download Train_GCC-training.tsv and Validation_GCC-1.1.0-Validation.tsv from Kaggle
- Then use img2dataset to download the images listed in the downloaded TSV files
- More details
SBU Captions
- Download the URL list from subcaptions
- Then use img2dataset to download the images from those URLs
CC12M
- Download cc12m.tsv
- Then use img2dataset to download the images listed in the downloaded TSV file
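The img2dataset step for the TSV-based datasets can be sketched as a small Python helper that builds the command line. This is only an illustration under assumptions: the column names `url` and `caption`, the output folder, and the helper name `build_img2dataset_cmd` are hypothetical, and the TSV is assumed to have a header row carrying those column names. Adjust them to match your downloaded files.

```python
import shlex
from typing import List


def build_img2dataset_cmd(
    tsv_path: str,
    output_folder: str,
    url_col: str = "url",          # assumed column name, check your TSV header
    caption_col: str = "caption",  # assumed column name, check your TSV header
    image_size: int = 256,
) -> List[str]:
    """Return the argument list for an img2dataset run over a TSV of captions and URLs."""
    return [
        "img2dataset",
        "--url_list", tsv_path,
        "--input_format", "tsv",
        "--url_col", url_col,
        "--caption_col", caption_col,
        "--output_format", "files",
        "--output_folder", output_folder,
        "--image_size", str(image_size),
    ]


# Hypothetical paths, for illustration only.
cmd = build_img2dataset_cmd("cc12m.tsv", "data/cc12m")
print(shlex.join(cmd))
# To actually start the download (requires img2dataset to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to inspect the command before launching a long download.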

Downstream-task Datasets:

Flickr30k
VQA v2
NLVR2
- We recommend using direct-image-download

Json Files from Pre-training and Downstream Tasks:

Refer to the Download section in ALBEF.
You need to change the image paths in the json files to match the location of your downloaded images.
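One way to update the image paths is a short script that swaps the path prefix in each entry. The sketch below assumes each json file is a list of objects with the image path stored under an `image` key; that layout, the helper name `rebase_image_paths`, and the example paths are assumptions, so inspect your files and adjust `key` if they differ.

```python
import json
import os
import tempfile


def rebase_image_paths(json_path, old_root, new_root, key="image"):
    """Replace the old_root prefix of every entry[key] with new_root, in place.

    Returns the number of entries that were changed.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    changed = 0
    for entry in entries:
        path = entry.get(key)
        if isinstance(path, str) and path.startswith(old_root):
            entry[key] = new_root + path[len(old_root):]
            changed += 1
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entries, f)
    return changed


# Self-check on a throwaway file with made-up paths.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.json")
    with open(demo, "w", encoding="utf-8") as f:
        json.dump([{"image": "/old/coco/a.jpg", "caption": "a cat"}], f)
    n_changed = rebase_image_paths(demo, "/old/coco", "/data/coco")
    with open(demo, encoding="utf-8") as f:
        demo_entries = json.load(f)
    print(n_changed, demo_entries[0]["image"])
```

The script overwrites the file in place, so keep a copy of the original json files until the rewritten paths have been verified.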

Pre-trained checkpoint:

Pre-training:

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain.py \
--config ./configs/Pretrain.yaml \
--output_dir output/pretrain
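The downstream commands pass a numbered file such as output/pretrain/checkpoint_29.pth to --checkpoint. If you pre-train for a different number of epochs, a small helper can pick the highest-numbered checkpoint in the output directory. This is a sketch that assumes the `checkpoint_NN.pth` naming seen in the commands in this README; the helper name `latest_checkpoint` is hypothetical and not part of the repository.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

_CKPT_RE = re.compile(r"checkpoint_(\d+)\.pth$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint_NN.pth with the largest NN in output_dir, or None."""
    best = None
    best_epoch = -1
    for p in Path(output_dir).glob("checkpoint_*.pth"):
        m = _CKPT_RE.search(p.name)
        # Non-numeric names such as checkpoint_best.pth do not match and are skipped.
        if m and int(m.group(1)) > best_epoch:
            best, best_epoch = p, int(m.group(1))
    return best


# Self-check with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint_00.pth", "checkpoint_29.pth", "checkpoint_best.pth"):
        (Path(tmp) / name).touch()
    picked = latest_checkpoint(tmp).name
    print(picked)
```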

Downstream Tasks:

Image-Text Retrieval

# zero-shot coco 
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_coco.yaml \
--output_dir output/pretrain_e30_Retrieval_coco_zeroshot \
--checkpoint output/pretrain/checkpoint_29.pth \
--evaluate

# fine-tune flickr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr \
--checkpoint output/pretrain/checkpoint_29.pth

# fine-tune coco
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_coco.yaml \
--output_dir output/pretrain_e30_Retrieval_coco \
--checkpoint output/pretrain/checkpoint_29.pth

# zero-shot flickr (evaluates the checkpoint fine-tuned on coco)
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr_zeroshot \
--checkpoint output/pretrain_e30_Retrieval_coco/checkpoint_best.pth \
--evaluate

VQA

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/pretrain_e30_vqa \
--checkpoint output/pretrain/checkpoint_29.pth

For how to evaluate and interpret the results, see salesforce/ALBEF#19.

Visual Entailment

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env VE.py \
--config ./configs/VE.yaml \
--output_dir output/pretrain_e30_VE \
--checkpoint output/pretrain/checkpoint_29.pth

For how to evaluate and interpret the results, see salesforce/ALBEF#19.

NLVR2

# pre-train nlvr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain_nlvr.py \
--config ./configs/NLVR_pretrain.yaml \
--output_dir output/pretrain_e30_NLVR_pretrain \
--checkpoint output/pretrain/checkpoint_29.pth

# fine-tune nlvr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env NLVR.py \
--config ./configs/NLVR.yaml \
--output_dir output/pretrain_e30_NLVR \
--checkpoint output/pretrain_e30_NLVR_pretrain/checkpoint_00.pth

For how to evaluate and interpret the results, see salesforce/ALBEF#19.

Citation:

@inproceedings{yang2022vision,
  title={Vision-Language Pre-Training with Triple Contrastive Learning},
  author={Yang, Jinyu and Duan, Jiali and Tran, Son and Xu, Yi and Chanda, Sampath and Chen, Liqun and Zeng, Belinda and Chilimbi, Trishul and Huang, Junzhou},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  year={2022}
}

Our code is largely borrowed from ALBEF.
