Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Overview

ALPRO

Align and Prompt: Video-and-Language Pre-training with Entity Prompts [Paper]

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C.H. Hoi

Official PyTorch code for ALPRO. This repository supports pre-training as well as finetuning on

  • Text-Video Retrieval on MSRVTT and DiDeMo.
  • Video Question Anwsering on MSRVTT and MSVD.

Requirements

Our implementation is tested on Ubuntu 20.04.1 with NVIDIA A100 GPUs. Supports for other platforms and hardwares are possible with no warrant. To install the required packages:

cd env && bash install_pkg.sh

Data Preparation

  1. Download Annotations and Pre-trained Checkpoints

  2. Download raw videos of downstream datasets.

    • MSRVTT:
      • download train_val_videos.zip and test_videos.zip from e.g. here.

      • check md5sum:

        51f2394d279cf84f1642defd9a651e6f  train_val_videos.zip
        0af68454cec9d586e92805739f3911d0  test_videos.zip
      • unzip all the videos into data/msrvtt_ret/videos (10k in total).

      • create the following soft link:

        ln -s data/msrvtt_ret/videos data/msrvtt_qa/videos```
    • MSVD:
      • download from official release:

        wget -nc https://www.cs.utexas.edu/users/ml/clamp/videoDescription/YouTubeClips.tar
      • check md5sum:

        9bdb20fcf14d59524a6febca9f6a8d89  YouTubeClips.tar
      • unzip all the videos to data/msvd_qa/videos (1,970 videos in total).

        mkdir data/msvd_qa/videos/ 
        tar xvf YouTubeClips.tar -C data/msvd_qa/videos --strip-components=1
    • DiDeMo:
      • Following instructions and download from the official release here;
      • unzip all the videos into data/didemo_ret/videos.
      • Note there might be a couple videos missing. See here to download. However, as they account for a small portion of training set, you may feel safe to ignore.
      • Convert all the DiDeMo videos into *.mp4 format using e.g. ffmpeg.
      • We obtained 10,463 videos following these steps (with one video [email protected]_5753455690_1e04ccb364 missing).
  3. The directory is expected to be in the structure below:

    .
    |-config_release  # configuration files
    |-data  # text annotations and raw videos
    |---didemo_ret
    |-----txt
    |-----videos
    |---msrvtt_qa/...
    |---msrvtt_ret/...
    |---msvd_qa/...
    |-env  # scripts to install packages
    |-ext  # external resources, e.g. bert tokenizer
    |-output  # checkpoints for pre-trained/finetuned models
    |---downstreams
    |-----didemo_ret
    |-------public
    |---------ckpt # official finetuned checkpoints
    |---------log # inference log
    |---------results_test
    |-----------step_best_1_mean
    |-----msrvtt_qa/...
    |-----msrvtt_ret/...
    |-----msvd_qa/...
    |-run_scripts  # bash scripts to launch experiments
    |-src  # source code

Inference with Official Checkpoints

cd run_scripts
bash inf_msrvtt_ret.sh
# {'text2video': {'r1': 33.9, 'r5': 60.7, 'r10': 73.2, 'medianR': 3.0, 'meanR': 27.404}}
bash inf_didemo_ret.sh
# {'text2video': {'r1': 35.9, 'r5': 67.5, 'r10': 78.8, 'medianR': 3.0, 'meanR': 19.125}}
bash inf_msrvtt_qa.sh
# {'ratios': {'what_ratio': [68.48, 49872], 'who_ratio': [27.99, 20385], 'how_ratio': [2.25, 1640], 'where_ratio': [0.34, 250], 'when_ratio': [0.93, 677]}, 'overall_acc': 42.12, 'what_acc': 36.05, 'who_acc': 52.24, 'how_acc': 85.67, 'where_acc': 42.8, 'when_acc': 78.88}
bash inf_msvd_qa.sh
# {'ratios': {'what_ratio': [61.93, 8150], 'who_ratio': [34.6, 4554], 'how_ratio': [2.81, 370], 'where_ratio': [0.21, 28], 'when_ratio': [0.44, 58]}, 'overall_acc': 45.91, 'what_acc': 37.02, 'who_acc': 58.59, 'how_acc': 81.62, 'where_acc': 46.43, 'when_acc': 72.41}

Downstream Task Finetuning

  • To finetune on downstream tasks with the pre-trained checkpoint output/pretrain/alpro_pretrained_ckpt.pt

    cd run_scripts
    bash ft_msrvtt_ret.sh
    bash ft_didemo_ret.sh
    bash ft_msrvtt_qa.sh
    bash ft_msvd_qa.sh

    For example, with MSRVTT retrieval:

    cd ALPRO/
    
    export PYTHONPATH="$PYTHONPATH:$PWD"
    echo $PYTHONPATH
    
    CONFIG_PATH='config_release/msrvtt_ret.json'
    
    horovodrun -np 8 python src/tasks/run_video_retrieval.py \ # change -np to GPUs numbers.
        --config $CONFIG_PATH \
        --output_dir /export/home/workspace/experiments/alpro/finetune/msrvtt_ret/$(date '+%Y%m%d%H%M%S')  # change to your local path to store finetuning ckpts and logs 
  • Run inference with locally-finetuned checkpoints.

     cd ALPRO/
    
     export PYTHONPATH="$PYTHONPATH:$PWD"
     echo $PYTHONPATH
    
     STEP='best'
    
     CONFIG_PATH='config_release/msrvtt_ret.json'
     OUTPUT_DIR='[INPUT_YOUR_OUTPUT_PATH_HERE]'
    
     TXT_DB='data/msrvtt_ret/txt/test.jsonl'
     IMG_DB='data/msrvtt_ret/videos'
    
     horovodrun -np 8 python src/tasks/run_video_retrieval.py \
         --do_inference 1 \
         --inference_split test \
         --inference_model_step $STEP \
         --inference_txt_db $TXT_DB \
         --inference_img_db $IMG_DB \
         --inference_batch_size 64 \
         --output_dir $OUTPUT_DIR \
         --config $CONFIG_PATH
    • OUTPUT_DIR is the path after the --output_dir option in the finetuning script.
    • $STEP is a string, which tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference.

Pretraining

  1. Download WebVid2M and CC-3M.

    • Put WebVid2M videos under data/webvid2m;
    • 💡 we downsample webvid2m videos to 10% of the original FPS to speed-up video loading;
    • change data/cc3m/txt/cc3m.json with local image paths.
  2. Training Prompter:

    cd run_scripts && bash pt_prompter.sh
  3. Training video-language model:

    cd run_scripts && bash pt_alpro.sh

    If you would like to use custom prompter weight, please change teacher_weights_path in config_release/pretrain_alpro.json

  4. To finetune with pre-trained checkpoints, please change e2e_weights_path in the finetuning config files, e.g. config_release/msrvtt_ret.json.

Citation

If you find ALPRO useful for your research, please consider citing:

  @inproceedings{li2021align,
    title={Align and Prompt: Video-and-Language Pre-training with Entity Prompts},
    author={Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C.H. Hoi},
    booktitle={arxiv},
    year={2021}
  }

Acknowledgement

We thank members at Salesforce Research for their helpful discussions.

The implementation of ALPRO relies on resources from ClipBERT, transformers, TimeSformer, The code is implemented using PyTorch, with multi-GPU support from Horovod and gradient-checkpoint. We thank the original authors for their open-sourcing and encourage ALPRO users to cite their works when applicable.

Owner
Salesforce
A variety of vendor agnostic projects which power Salesforce
Salesforce
Create and implement a deep learning library from scratch.

In this project, we create and implement a deep learning library from scratch. Table of Contents Deep Leaning Library Table of Contents About The Proj

Rishabh Bali 22 Aug 23, 2022
Run containerized, rootless applications with podman

Why? restrict scope of file system access run any application without root privileges creates usable "Desktop applications" to integrate into your nor

119 Dec 27, 2022
TyXe: Pyro-based BNNs for Pytorch users

TyXe: Pyro-based BNNs for Pytorch users TyXe aims to simplify the process of turning Pytorch neural networks into Bayesian neural networks by leveragi

87 Jan 03, 2023
SimplEx - Explaining Latent Representations with a Corpus of Examples

SimplEx - Explaining Latent Representations with a Corpus of Examples Code Author: Jonathan Crabbé ( Jonathan Crabbé 14 Dec 15, 2022

MMdnn is a set of tools to help users inter-operate among different deep learning frameworks. E.g. model conversion and visualization. Convert models between Caffe, Keras, MXNet, Tensorflow, CNTK, PyTorch Onnx and CoreML.

MMdnn MMdnn is a comprehensive and cross-framework tool to convert, visualize and diagnose deep learning (DL) models. The "MM" stands for model manage

Microsoft 5.7k Jan 09, 2023
Face and Body Tracking for VRM 3D models on the web.

Kalidoface 3D - Face and Full-Body tracking for Vtubing on the web! A sequal to Kalidoface which supports Live2D avatars, Kalidoface 3D is a web app t

Rich 257 Jan 02, 2023
Pretraining Representations For Data-Efficient Reinforcement Learning

Pretraining Representations For Data-Efficient Reinforcement Learning Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, Ankesh Anand, Laurent Ch

Mila 40 Dec 11, 2022
2021:"Bridging Global Context Interactions for High-Fidelity Image Completion"

TFill arXiv | Project This repository implements the training, testing and editing tools for "Bridging Global Context Interactions for High-Fidelity I

Chuanxia Zheng 111 Jan 08, 2023
YOLOv5 🚀 is a family of object detection architectures and models pretrained on the COCO dataset

YOLOv5 🚀 is a family of object detection architectures and models pretrained on the COCO dataset, and represents Ultralytics open-source research int

阿才 73 Dec 16, 2022
MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021)

MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021) Overview We release the code of the MVFNet (Multi-View Fusion Network).

2 Jan 29, 2022
Learning High-Speed Flight in the Wild

Learning High-Speed Flight in the Wild This repo contains the code associated to the paper Learning Agile Flight in the Wild. For more information, pl

Robotics and Perception Group 391 Dec 29, 2022
ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs

(Comet-) ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs Paper Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sa

AI2 152 Dec 27, 2022
[NeurIPS 2021] Official implementation of paper "Learning to Simulate Self-driven Particles System with Coordinated Policy Optimization".

Code for Coordinated Policy Optimization Webpage | Code | Paper | Talk (English) | Talk (Chinese) Hi there! This is the source code of the paper “Lear

DeciForce: Crossroads of Machine Perception and Autonomy 81 Dec 19, 2022
JFB: Jacobian-Free Backpropagation for Implicit Models

JFB: Jacobian-Free Backpropagation for Implicit Models

Typal Research 28 Dec 11, 2022
Gradient Inversion with Generative Image Prior

Gradient Inversion with Generative Image Prior This repository is an implementation of "Gradient Inversion with Generative Image Prior", accepted to N

MLLab @ Postech 25 Jan 09, 2023
Creative Applications of Deep Learning w/ Tensorflow

Creative Applications of Deep Learning w/ Tensorflow This repository contains lecture transcripts and homework assignments as Jupyter Notebooks for th

Parag K Mital 1.5k Dec 30, 2022
A high-performance distributed deep learning system targeting large-scale and automated distributed training.

HETU Documentation | Examples Hetu is a high-performance distributed deep learning system targeting trillions of parameters DL model training, develop

DAIR Lab 150 Dec 21, 2022
Waymo motion prediction challenge 2021: 3rd place solution

Waymo motion prediction challenge 2021: 3rd place solution 📜 Technical report 🗨️ Presentation 🎉 Announcement 🛆Motion Prediction Channel Website 🛆

158 Jan 08, 2023
Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Transformers for variable misuse, function naming and code completion tasks The official PyTorch implementation of: Empirical Study of Transformers fo

Bayesian Methods Research Group 56 Nov 15, 2022
Curved Projection Reformation

Description Assuming that we already know the image of the centerline, we want the lumen to be displayed on a plane, which requires curved projection

夜听残荷 5 Sep 11, 2022