X-VLM: Multi-Grained Vision Language Pre-Training

Last update: Dec 23, 2022

Overview

X-VLM: learning multi-grained vision language alignments

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.

Jan 2022: release official PyTorch implementation and X-VLM-base checkpoints
Dec 2021: X-VLM-base (4M) achieves new SoTA
Nov 2021: release preprint on arXiv

Hiring

We are looking for interns at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to [email protected].

Features

Support several backbones
- vision encoder: deit / clip-vit / swin-transformer
- text encoder: bert / roberta
Support apex O1 / O2 for pre-training
Read from and write to HDFS
Distributed training across nodes for both pre-training and fine-tuning

Please read the code for more details.

Requirements

Install python3 environment

pip3 install -r requirements.txt

Download raw images from corresponding websites
Download the json files we provide, which contain image read paths and captions and/or bbox annotations
If running pre-training scripts:
- install Apex
- download pre-trained models for parameter initialization
  - image encoder: swin-transformer-base
  - text encoder: bert-base
Organize these files like this (% is for pre-training only):

X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
        
        %sbu/*.jpg
        %cc-3m/*.jpg
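
Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch and not part of the repository: the helper name missing_paths and the two path lists are assumptions, transcribed from a subset of the tree above (the "%" entries are treated as pre-training only).

```python
from pathlib import Path

# Paths transcribed from the layout above (hypothetical lists, not from the repo).
FINETUNE_PATHS = [
    "data/finetune",
    "images/coco/train2014",
    "images/coco/val2014",
    "images/coco/test2015",
    "images/visualgenome/image",
    "images/nlvr2",
]
# Entries marked with "%" in the layout: needed for pre-training only.
PRETRAIN_PATHS = [
    "data/pretrain_4m",
    "data/swin_base_patch4_window7_224_22k.pth",
    "data/bert-base-uncased/config.json",
    "data/bert-base-uncased/pytorch_model.bin",
    "data/bert-base-uncased/vocab.txt",
    "images/sbu",
    "images/cc-3m",
]


def missing_paths(root, pretrain=False):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    expected = FINETUNE_PATHS + (PRETRAIN_PATHS if pretrain else [])
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    # "X-VLM" is the top-level directory name used in the layout above.
    for p in missing_paths("X-VLM", pretrain=True):
        print("missing:", p)
```

The check only tests for existence; it does not verify that the json files or images are complete.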

Pretrain

python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"

For distributed training across nodes, see run.py for more details.

Data

We are organizing the data and the scripts. All of these will be released in Vision-Language-Data in March. Please feel free to prepare your own datasets by referring to the code in dataset/pretrain_dataset.py.

Checkpoints

X-VLM-base (4M)
X-VLM-base 14M, WIP
X-VLM-large 14M, WIP

Finetune

# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"  # if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results.

# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th"
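
When scripting several runs, the commands above can be assembled programmatically. This is a minimal sketch under stated assumptions: build_run_command is a hypothetical helper, not part of the repository, and it uses only the flags that appear in the commands above (--task, --dist, --output_dir, --output_hdfs, --checkpoint, --evaluate).

```python
import shlex


def build_run_command(task, output_dir, checkpoint=None, dist="1",
                      evaluate=False, output_hdfs=None):
    """Assemble an argument list for run.py from the flags shown above."""
    cmd = ["python3", "run.py", "--task", task, "--dist", dist,
           "--output_dir", output_dir]
    if evaluate:
        cmd.append("--evaluate")
    if output_hdfs is not None:
        cmd += ["--output_hdfs", output_hdfs]
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    return cmd


# Same arguments as the single-node VQA training command above.
train_cmd = build_run_command(
    "vqa", "output/vqa", checkpoint="4m_base_model_state_step_199999.th"
)
print(shlex.join(train_cmd))
```

The returned list can be passed to subprocess.run(train_cmd, check=True) from the repository root; see run.py for the flags each task actually accepts.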

See run.py for fine-tuning on other tasks (Retrieval, NLVR2, RefCOCO). We added some Python assertions to help you run the code correctly. The fine-tuning scripts are based on ALBEF. We thank the authors for open-sourcing their code.

Data

download json files

Checkpoints and Logs

retrieval-mscoco
retrieval-flickr
vqa
nlvr2
refcoco
refcoco-bbox
Note that fine-tuning configs are given in "X-VLM/configs/*.yaml"

Citation

If you use this code, please consider citing:

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues or help using this code, please submit a GitHub issue.
