X-VLM: Multi-Grained Vision Language Pre-Training

Last update: Dec 23, 2022

Overview

X-VLM: learning multi-grained vision language alignments

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.

Jan 2022: release official PyTorch implementation and X-VLM-base checkpoints
Dec 2021: X-VLM-base (4M) achieves new SoTA
Nov 2021: release preprint in arXiv

Hiring

We are looking for interns at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to [email protected].

Features

Support several backbones
- vision encoder: deit / clip-vit / swin-transformer
- text encoder: bert / roberta
Support apex O1 / O2 for pre-training
Read from and write to HDFS
Distributed training across nodes for both pre-training and fine-tuning

Please read the code for more details.

Requirements

Install python3 environment

pip3 install -r requirements.txt

Download raw images from corresponding websites
Download the json files we provided, which contains image read paths and captions and/or bbox annotations
If running pre-training scripts:
- install Apex
- download pre-trained models for parameter initialization
  - image encoder: swin-transformer-base
  - text encoder: bert-base
Organize these files like this (% is for pre-training only):

X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
        
        %sbu/*.jpg
        %cc-3m/*.jpg

Pretrain

python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"

For distributed training across nodes, see run.py for more details.

Data

We are organizing the data and the scripts. All these will be released in Vision-Language-Data in March. Please feel free to prepare your own datasets by referring the code in dataset/pretrain_dataset.py.

Checkpoints

X-VLM-base (4M)
X-VLM-base 14M, WIP
X-VLM-large 14M, WIP

Finetune

2 nodes for fine-tuning, specify --output_hdfs to save some tmp results. # evaluate python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th" ">

# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"  # if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results.

# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th"

See run.py for fine-tuning on other tasks (Retrieval, NLVR2, RefCOCO). We set some python assertions to help you run the code correctly. The fine-tuning scripts are based on ALBEF. We thank the author for opening source their code.

Data

download json files

Checkpoints and Logs

retrieval-mscoco
retrieval-flickr
vqa
nlvr2
refcoco
refcoco-bbox
Note that fine-tuning configs are given in "X-VLM/configs/*.yaml"

Citation

If you use this code, please considering citing:

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues or help using this code, please submit a GitHub issue.

X-VLM: Multi-Grained Vision Language Pre-Training

Related tags

Overview

X-VLM: learning multi-grained vision language alignments

Hiring

Features

Requirements

Pretrain

Data

Checkpoints

Finetune

Data

Checkpoints and Logs

Citation

Contact

Owner

Yan Zeng

Pytorch implementation for A-NeRF: Articulated Neural Radiance Fields for Learning Human Shape, Appearance, and Pose

Perturbed Self-Distillation: Weakly Supervised Large-Scale Point Cloud Semantic Segmentation (ICCV2021)

The official PyTorch implementation of paper BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition

Code for the paper "Improved Techniques for Training GANs"

Official repository for the paper "Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks"

(CVPR 2022) A minimalistic mapless end-to-end stack for joint perception, prediction, planning and control for self driving.

OpenVINO黑客松比赛项目

Codes accompanying the paper "Learning Nearly Decomposable Value Functions with Communication Minimization" (ICLR 2020)

Fedlearn支持前沿算法研发的Python工具库 | Fedlearn algorithm toolkit for researchers

Caffe models in TensorFlow

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

An official PyTorch Implementation of Boundary-aware Self-supervised Learning for Video Scene Segmentation (BaSSL)

Code of TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

Evaluation framework for testing segmentation networks in PyTorch

Adaptive Denoising Training (ADT) for Recommendation.

Code and results accompanying our paper titled Mixture Proportion Estimation and PU Learning: A Modern Approach at Neurips 2021 (Spotlight)

So-ViT: Mind Visual Tokens for Vision Transformer

Scenarios, tutorials and demos for Autonomous Driving

Compositional Sketch Search

Implementation of "With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, BMVC, 2021" in PyTorch