VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Related tags

Deep Learningvimpac
Overview

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

This is a release of our VIMPAC paper to illustrate the implementations. The pretrained checkpoints and scripts will be soon open-sourced in HuggingFace transformers.

Authors: Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

Data Preprocessing

Please refer to video2token folder for the detailed README file.

For pre-training, the dataset is usually large, and we suggest to use FPS=2 during extraction. For downstream tasks, we suggest using FPS=16 that enables a higher frame rate for short videos.

We recommend to store the data locally at data/video_tokens. If different paths are used, please specify the path of VIDEO_CODE_PATHS and VIDEO_ANNO_PATHS in vimpac/data.py.

Pre-Trained Weights

We provide the pre-trained weights with their links. Please download the pre-trained weight and extract them under snap/.

Pre-Training

The default pre-training uses the HowTo100M dataset. The pre-training data could be switched to Kinetics-700 and other datasets by specifying the --dataset-name argument. We have validated that the mask-then-predict task works reasonablely well on Kinetics-700 datasets. However, the average length of video clips inside K-700 is 10 seconds thus not sure supporting the long-range contrastive learning.

Small Model

We first provide the script to pre-train a small model (6 layers, 512 dimensions, 256 frame-size, and 5 clip length):

bash scripts/pretrain/small.sh 0,1,2,3

We here annotate some essential arguments inside the pre-training scripts. For a full descriptions for all the arguments, please check param.py

We also provide two debugging options:

# bash scripts/pretrain/small.sh 0,1,2,3 --tqdm        # Show progress bar.
# bash scripts/pretrain/small.sh 0,1,2,3 --debug       # Only run a few steps per epoch.

Large Model

We follow BERT to pre-train our large model in two stages. The first stage pretrains for 90 epochs using frame-size 128 and clip-length 5. The second stage pretrains for 10 epochs using frame-size 256 and clip-length 5.

Scripts for the first stage:

bash scripts/pretrain/large.sh 0,1,2,3

Then we could directly run the script for the second stage without any further changes. It will load the last snapshot from the first stage, do interpolation for larger spatial size, and continue pre-training.

bash scripts/pretrain/large_frame256cont.sh 0,1,2,3

Fine-Tuning

After run the pre-training in pre-training or download the pre-trained weights from pre-trained-weights, we fine-tune the models on several downstream tasks. The arguments in these scripts are consistent with the hyperparameters in the paper. Please refer to Table 11 and Table 12 of our paper for a detailed list of all these hyperparameters.

SSV2

bash scripts/finetune/small_ssv2.sh 0,1,2,3

Diving48

bash scripts/finetune/small_diving48.sh 0,1,2,3

UCF101

bash scripts/finetune/small_ucf101.sh 0,1,2,3

HMDB51

bash scripts/finetune/small_hmdb51.sh 0,1,2,3

Change the Input Shape

Following ViT, we support the use of different input sizes from pre-training by interpolating the positional embedding. This is done by passing the --different-shape option. Otherwise, an error will pop up if the fine-tuning input shape is different from the pre-training. A larger input shape generally improves the results. We here take SSV2 as an example.

Longer clip length (10; default 5):

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --bs-per-gpu 4

Long clip length (10; default 5) + higher frame rate (4; default 2)

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --bs-per-gpu 4

Long clip length (10; default 5) + higher frame rate (4; default 2) + larger input size (256; default 128). Please also make sure that VQ-VAE code with input-size 256 has been extracted as in Pre-processing.

bash scripts/finetune/small_ssv2.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --frame-size 256 --bs-per-gpu 2

Large Models

We provide scripts to run large models. Frame 128:

bash scripts/finetune/large_frame128_ucf101.sh 0,1,2,3

Frame 256:

bash scripts/finetune/large_frame256_ucf101.sh 0,1,2,3

The input shape could be changed as in change input shape. Our final model use the scripts of:

bash scripts/finetune/large_frame256_ucf101.sh 0,1,2,3 --different-shape --clip-len 10 --frame-rate 4 --frame-size 256 --bs-per-gpu 2

Acknowledgement

This work was granted access to the HPC resources of IDRIS under the allocation 20XX-AD011011621R1 made by GENCI. We thank Teven Le Scao and Victor Sanh for their help on the way.

Owner
Hao Tan
NLP @ UNC Chapel Hill
Hao Tan
Vector.ai assignment

fabio-tests-nisargatman Low Level Approach: ###Tables: continents: id*, name, population, area, createdAt, updatedAt countries: id*, name, population,

Ravi Pullagurla 1 Nov 09, 2021
Generative Modelling of BRDF Textures from Flash Images [SIGGRAPH Asia, 2021]

Neural Material Official code repository for the paper: Generative Modelling of BRDF Textures from Flash Images [SIGGRAPH Asia, 2021] Henzler, Deschai

Philipp Henzler 80 Dec 20, 2022
A modified version of DeepMind's Alphafold2 to divide CPU part (MSA and template searching) and GPU part (prediction model)

ParallelFold Author: Bozitao Zhong This is a modified version of DeepMind's Alphafold2 to divide CPU part (MSA and template searching) and GPU part (p

Bozitao Zhong 77 Dec 22, 2022
Compare outputs between layers written in Tensorflow and layers written in Pytorch

Compare outputs of Wasserstein GANs between TensorFlow vs Pytorch This is our testing module for the implementation of improved WGAN in Pytorch Prereq

Hung Nguyen 72 Dec 20, 2022
Improving Object Detection by Estimating Bounding Box Quality Accurately

Improving Object Detection by Estimating Bounding Box Quality Accurately Abstrac

2 Apr 14, 2022
TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers.

TransMVSNet This repository contains the official implementation of the paper: "TransMVSNet: Global Context-aware Multi-view Stereo Network with Trans

旷视研究院 3D 组 155 Dec 29, 2022
Pynomial - a lightweight python library for implementing the many confidence intervals for the risk parameter of a binomial model

Pynomial - a lightweight python library for implementing the many confidence intervals for the risk parameter of a binomial model

Demetri Pananos 9 Oct 04, 2022
On Evaluation Metrics for Graph Generative Models

On Evaluation Metrics for Graph Generative Models Authors: Rylee Thompson, Boris Knyazev, Elahe Ghalebi, Jungtaek Kim, Graham Taylor This is the offic

13 Jan 07, 2023
This program automatically runs Python code copied in clipboard

CopyRun This program runs Python code which is copied in clipboard WARNING!! USE AT YOUR OWN RISK! NO GUARANTIES IF ANYTHING GETS BROKEN. DO NOT COPY

vertinski 4 Sep 10, 2021
Video Swin Transformer - PyTorch

Video-Swin-Transformer-Pytorch This repo is a simple usage of the official implementation "Video Swin Transformer". Introduction Video Swin Transforme

Haofan Wang 116 Dec 20, 2022
MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts (ICLR 2022)

MetaShift: A Dataset of Datasets for Evaluating Distribution Shifts and Training Conflicts This repo provides the PyTorch source code of our paper: Me

88 Jan 04, 2023
Offcial repository for the IEEE ICRA 2021 paper Auto-Tuned Sim-to-Real Transfer.

Offcial repository for the IEEE ICRA 2021 paper Auto-Tuned Sim-to-Real Transfer.

47 Jun 30, 2022
Official git repo for the CHIRP project

CHIRP Project This is the official git repository for the CHIRP project. Pull requests are accepted here, but for the moment, the main repository is s

Dan Smith 77 Jan 08, 2023
In the case of your data having only 1 channel while want to use timm models

timm_custom Description In the case of your data having only 1 channel while want to use timm models (with or without pretrained weights), run the fol

2 Nov 26, 2021
Pytorch implementation of "Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet"

Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet (arxiv) This is a Pytorch implementation of our te

蒋子航 383 Dec 27, 2022
The codes of paper 'Active-LATHE: An Active Learning Algorithm for Boosting the Error exponent for Learning Homogeneous Ising Trees'

Active-LATHE: An Active Learning Algorithm for Boosting the Error exponent for Learning Homogeneous Ising Trees This project contains the codes of pap

0 Apr 20, 2022
Speedy Implementation of Instance-based Learning (IBL) agents in Python

A Python library to create single or multi Instance-based Learning (IBL) agents that are built based on Instance Based Learning Theory (IBLT) 1 Instal

0 Nov 18, 2021
A robust camera and Lidar fusion based velocity estimator to undistort the pointcloud.

Lidar with Velocity A robust camera and Lidar fusion based velocity estimator to undistort the pointcloud. related paper: Lidar with Velocity : Motion

ISEE Research Group 164 Dec 30, 2022
Deep-learning-roadmap - All You Need to Know About Deep Learning - A kick-starter

Deep Learning - All You Need to Know Sponsorship To support maintaining and upgrading this project, please kindly consider Sponsoring the project deve

Instill AI 4.4k Dec 26, 2022
Code and data to accompany the camera-ready version of "Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation" in EMNLP 2021

Code and data to accompany the camera-ready version of "Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation" in EMNLP 2021

Mozhdeh Gheini 16 Jul 16, 2022