Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Overview

ClipBERT

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Jie Lei*, Linjie Li*, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipBERT is designed based on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning. In this repository, we support end-to-end pretraining and finetuning for the following tasks:

  • Image-text pretraining on COCO and VG captions.
  • Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
  • Video-QA finetuning on TGIF-QA and MSRVTT-QA.
  • Image-QA finetuning on VQA 2.0.

It is also feasible and easy to add other image-text or video-text tasks for pretraining and finetuning.

Requirements

We provide a Docker image for easier reproduction. Please install the following:

Our scripts require the user to have the docker group membership so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training hence GPUs with Tensor Cores are recommended.

Getting Started

General

  1. Create a folder that stores pretrained models, all the data, and results.

    PATH_TO_STORAGE=/path/to/your/data/
    mkdir -p $PATH_TO_STORAGE/txt_db  # annotations
    mkdir -p $PATH_TO_STORAGE/vis_db  # image and video 
    mkdir -p $PATH_TO_STORAGE/finetune  # finetuning results
    mkdir -p $PATH_TO_STORAGE/pretrained  # pretrained models
  2. Download pretrained models.

    Our e2e pretrained ClipBERT model (849MB), can be downloaded with the following command.

    bash scripts/download_pretrained.sh $PATH_TO_STORAGE

    This pretrained model can be used for finetuning on video-text tasks and image-text tasks. For your convenience, this script will also download bert-base-uncased and grid-feat-vqa model weights, which are used as initialization for pretraining.

  3. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /clipbert instead of built into the image so that user modification will be reflected without re-building the image. (Data folders are mounted into the container separately for flexibility on folder structures.)

Downstream Task Finetuning

Text-to-Video Retrieval

Tasks: MSRVTT retrieval, DiDeMo and ActivityNet Captions paragprah-to-video retrieval, MSRVTT MC Test.

  1. Download data.

    # outside the container  
    # download videos + annotations for $DSET
    bash scripts/download_$DSET.sh $PATH_TO_STORAGE

    $DSET can be one of msrvtt, didemo, anet.

  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR
    
    # for single GPU
    python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR

    $CONFIG_PATH should be set to one of the .json config files available at src/configs prefixed with _ret. For example, you can use src/configs/msrvtt_ret_base_resnet50.json for MSRVTT retrieval.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

    $STEP is an integer, it tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are path to annotation file and video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_retrieval_val.jsonl and IMG_DB=/img/msrvtt for inference on MSRVTT retrieval val split. The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS for inference, such as 1 or 16. Using more clips will have a large impact on inference speed and memory usage. You may want to use smaller batch sizes if larger values are set.

    After MSRVTT retrieval model is trained, you can use it for inference for the MSRVTT MC Test task as well, which is essentially a retrieval task in a multiple-choice task setup.

    # inside the container
    horovodrun -np 4 python src/tasks/run_msrvtt_mc.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db /txt/downstream/msrvtt_retrieval_mc/msrvtt_retrieval_mc_test.jsonl \
      --inference_img_db /img/msrvtt --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

Video Question Answering

Tasks: TGIF-QA action, transition, and frameQA tasks; MSRVTT-QA.

  1. Download data.

    # outside the container  
    # download MSRVTT videos, and QA + retrieval annotations
    bash scripts/download_msrvtt.sh $PATH_TO_STORAGE  
    # download TGIF-QA videos and annotations
    bash scripts/download_tgif_qa.sh $PATH_TO_STORAGE  
  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR

    $CONFIG_PATH should be set to one of the .json config files available at src/configs contains the substring _qa. For example, you can use src/configs/msrvtt_qa_base_resnet50.json for MSRVTT-QA.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

    $STEP is an integer, which tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are path to annotation file and video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_qa_val.jsonl and IMG_DB=/img/msrvtt for inference on MSRVTT QA val split.

    The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS for inference, such as 1 or 16. Using more clips will have a large impact on inference speed and memory usage. You may want to use smaller batch sizes if larger values are set.

Image Question Answering (VQA)

  1. Download data

    # outside the container
    # download COCO and VG data
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
    # download VQA annotations
    bash scripts/download_vqa.sh $PATH_TO_STORAGE
  2. Finetuning

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
        --config src/configs/vqa_base_resnet50.json \
        --output_dir $OUTPUT_DIR
  3. Inference

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB \
      --inference_batch_size 64

Pretraining

  1. Download data

    # outside the container
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
  2. Pretraining

    #inside the container
    horovodrun -np 8 python src/pretrain/run_pretrain.py \
        --config src/configs/pretrain_indomain_base_resnet50_mlm_itm.json \
        --output_dir $OUTPUT_DIR 

Data Preprocessing

ClipBERT takes raw video and text as inputs, there is no need to do feature extraction. However, to improve data loading speed, we use LMDB to store the raw image and video files. You can use the following script to convert a list of videos with file extensions mp4 and avi into LMDB:

# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/videos \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext avi mp4 \
    --file_type video 

For images, use appropriate file extensions for --ext and --file_type image. Text annotation files are reorganized into jsonl files, see example preprocessed files downloaded by the scripts in scripts/.

Citation

If you find this code useful for your research, please consider citing:

@article{lei2021less,
  title={Less is More: ClipBERT for Video-and-Language Learningvia Sparse Sampling},
  author={Lei, Jie and Li, Linjie and Zhou, Luowei and Gan, Zhe and Berg, Tamara L. and Bansal, Mohit and Liu, Jingjing},
  journal={arXiv},
  year={2021}
}

Acknowledgement

We thank Yen-Chun Chen and Ruotian Luo for suggestions on the implementation. We also thank other members and interns at Microsoft Multimodal AI for their helpful discussions.

This code used resources from transformers, UNITER, HERO, grid-feats-vqa, SlowFast, Detectron2. The code is implemented using PyTorch, with multi-GPU support from Horovod and mixed precision support from apex. We thank the authors for open-sourcing their awesome projects.

License

MIT

Owner
Jie Lei 雷杰
UNC CS PhD student, vision+language.
Jie Lei 雷杰
Code for text augmentation method leveraging large-scale language models

HyperMix Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation. Getting Started Installing P

NAVER AI 47 Dec 20, 2022
File-based TF-IDF: Calculates keywords in a document, using a word corpus.

File-based TF-IDF Calculates keywords in a document, using a word corpus. Why? Because I found myself with hundreds of plain text files, with no way t

Jakob Lindskog 1 Feb 11, 2022
Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

🌳 Fingerprinting Fine-tuned Language Models in the wild This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned La

LCS2-IIITDelhi 5 Sep 13, 2022
Plugin repository for Macast

Macast-plugins Plugin repository for Macast. How to use third-party player plugin Download Macast from GitHub Release. Download the plugin you want fr

109 Jan 04, 2023
Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek 8.8k Jan 01, 2023
Two-stage text summarization with BERT and BART

Two-Stage Text Summarization Description We experiment with a 2-stage summarization model on CNN/DailyMail dataset that combines the ability to filter

Yukai Yang (Alexis) 6 Oct 22, 2022
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS)

This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real-time. Feel free to check my the

Corentin Jemine 38.5k Jan 03, 2023
Curso práctico: NLP de cero a cien 🤗

Curso Práctico: NLP de cero a cien Comprende todos los conceptos y arquitecturas clave del estado del arte del NLP y aplícalos a casos prácticos utili

Somos NLP 147 Jan 06, 2023
Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition (CoNLL-2003

Kaiyinzhou 1.2k Dec 26, 2022
Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization This repo is for our paper "Enhanced Seq2Seq Autoencode

Rachel Zheng 14 Nov 01, 2022
Example code for "Real-World Natural Language Processing"

Real-World Natural Language Processing This repository contains example code for the book "Real-World Natural Language Processing." AllenNLP (2.5.0 or

Masato Hagiwara 303 Dec 17, 2022
LUKE -- Language Understanding with Knowledge-based Embeddings

LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on transf

Studio Ousia 587 Dec 30, 2022
Pipeline for training LSA models using Scikit-Learn.

Latent Semantic Analysis Pipeline for training LSA models using Scikit-Learn. Usage Instead of writing custom code for latent semantic analysis, you j

Dani El-Ayyass 23 Sep 05, 2022
GVT is a generic translation tool for parts of text on the PC screen with Text to Speak functionality.

GVT is a generic translation tool for parts of text on the PC screen with Text to Speech functionality. I wanted to create it because the existing tools that I experimented with did not satisfy me in

Nuked 1 Aug 21, 2022
Shellcode antivirus evasion framework

Schrodinger's Cat Schrodinger'sCat is a Shellcode antivirus evasion framework Technical principle Please visit my blog https://idiotc4t.com/ How to us

idiotc4t 27 Jul 09, 2022
Training open neural machine translation models

Train Opus-MT models This package includes scripts for training NMT models using MarianNMT and OPUS data for OPUS-MT. More details are given in the Ma

Language Technology at the University of Helsinki 167 Jan 03, 2023
A library for Multilingual Unsupervised or Supervised word Embeddings

MUSE: Multilingual Unsupervised and Supervised Embeddings MUSE is a Python library for multilingual word embeddings, whose goal is to provide the comm

Facebook Research 3k Jan 06, 2023
Lumped-element impedance calculator and frequency-domain plotter.

fastZ: Lumped-Element Impedance Calculator fastZ is a small tool for calculating and visualizing electrical impedance in Python. Features include: Sup

Wesley Hileman 47 Nov 18, 2022
Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles

Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles (TASLP 2022)

Zhuosheng Zhang 3 Apr 14, 2022
This is the source code of RPG (Reward-Randomized Policy Gradient)

RPG (Reward-Randomized Policy Gradient) Zhenggang Tang*, Chao Yu*, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, Yi Wu (

40 Nov 25, 2022