XViT - Space-time Mixing Attention for Video Transformer

Overview

XViT - Space-time Mixing Attention for Video Transformer

This is the official implementation of the XViT paper:

@inproceedings{bulat2021space,
  title={Space-time Mixing Attention for Video Transformer},
  author={Bulat, Adrian and Perez-Rua, Juan-Manuel and Sudhakaran, Swathikiran and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle={NeurIPS},
  year={2021}
}

In XViT, we introduce a novel Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) It restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence. (b) It uses efficient space-time mixing to attend jointly spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate 2 very lightweight mechanisms for global temporal-only attention which provide additional accuracy improvements at minimal computational cost. Our model produces very high recognition accuracy on the most popular video recognition datasets while at the same time is significantly more efficient than other Video Transformer models.

Attention pattern

Model Zoo

We provide a series of models pre-trained on Kinetics-600 and Something-Something-v2.

Kinetics-600

Architecture frames views Top-1 Top-5 url
XViT-B16 16 3x1 84.51% 96.26% model
XViT-B16 16 3x2 84.71% 96.39% model

Something-Something-V2

Architecture frames views Top-1 Top-5 url
XViT-B16 16 32x2 67.19% 91.00% model

Installation

Please make sure your setup satisfies the following requirements:

Requirements

Largely follows the original SlowFast repo requirements:

  • Python >= 3.8
  • Numpy
  • PyTorch >= 1.3
  • hdf5
  • fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
  • torchvision that matches the PyTorch installation. You can install them together at pytorch.org to make sure of this.
  • simplejson: pip install simplejson
  • GCC >= 4.9
  • PyAV: conda install av -c conda-forge
  • ffmpeg (4.0 is prefereed, will be installed along with PyAV)
  • PyYaml: (will be installed along with fvcore)
  • tqdm: (will be installed along with fvcore)
  • iopath: pip install -U iopath or conda install -c iopath iopath
  • psutil: pip install psutil
  • OpenCV: pip install opencv-python
  • torchvision: pip install torchvision or conda install torchvision -c pytorch
  • tensorboard: pip install tensorboard
  • PyTorchVideo: pip install pytorchvideo
  • Detectron2:
    pip install -U torch torchvision cython
    pip install -U 'git+https://github.com/facebookresearch/fvcore.git' 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
    git clone https://github.com/facebookresearch/detectron2 detectron2_repo
    pip install -e detectron2_repo
    # You can find more details at https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md

Datasets

1. Kenetics

You can download Kinetics 400/600 datasets following the instructions provided by the cvdfundation repo: https://github.com/cvdfoundation/kinetics-dataset

Afterwars, resize the videos to the shorte edge size of 256 and prepare the csv files for training, validation in testting: train.csv, val.csv, test.csv. The formatof the csv file is:

path_to_video_1 label_1
path_to_video_2 label_2
...
path_to_video_N label_N

Depending on your system, we recommend decoding the videos to frames and then packing each set of frames into a h5 file with the same name as the original video.

2. Something-Something v2

You can download the datasets from the authors webpage: https://20bn.com/datasets/something-something

Perform the same packing procedure as for Kinetics.

Usage

Training

python tools/run_net.py \
  --cfg configs/Kinetics/xvit_B16_16x16_k600.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset

Evaluation

python tools/run_net.py \
  --cfg configs/Kinetics/xvit_B16_16x16_k600.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \
  TRAIN.ENABLE False \

Acknowledgements

This repo is built using components from SlowFast and timm

License

XViT code is released under the Apache 2.0 license.

Owner
Adrian Bulat
AI Researcher at Samsung AI, member of the deeplearning cult.
Adrian Bulat
Chainer Implementation of Fully Convolutional Networks. (Training code to reproduce the original result is available.)

fcn - Fully Convolutional Networks Chainer implementation of Fully Convolutional Networks. Installation pip install fcn Inference Inference is done as

Kentaro Wada 218 Oct 27, 2022
PyTorch Implementation for AAAI'21 "Do Response Selection Models Really Know What's Next? Utterance Manipulation Strategies for Multi-turn Response Selection"

UMS for Multi-turn Response Selection Implements the model described in the following paper Do Response Selection Models Really Know What's Next? Utte

Taesun Whang 47 Nov 22, 2022
Face Library is an open source package for accurate and real-time face detection and recognition

Face Library Face Library is an open source package for accurate and real-time face detection and recognition. The package is built over OpenCV and us

52 Nov 09, 2022
Check out the StyleGAN repo and place it in the same directory hierarchy as the present repo

Variational Model Inversion Attacks Kuan-Chieh Wang, Yan Fu, Ke Li, Ashish Khisti, Richard Zemel, Alireza Makhzani Most commands are in run_scripts. W

Jackson Wang 15 Dec 26, 2022
Guided Internet-delivered Cognitive Behavioral Therapy Adherence Forecasting

Guided Internet-delivered Cognitive Behavioral Therapy Adherence Forecasting #Dataset The folder "Dataset" contains the dataset use in this work and m

0 Jan 08, 2022
给yolov5加个gui界面,使用pyqt5,yolov5是5.0版本

博文地址 https://xugaoxiang.com/2021/06/30/yolov5-pyqt5 代码执行 项目中使用YOLOv5的v5.0版本,界面文件是project.ui pip install -r requirements.txt python main.py 图片检测 视频检测

Xu GaoXiang 215 Dec 30, 2022
Learning Skeletal Articulations with Neural Blend Shapes

This repository provides an end-to-end library for automatic character rigging and blend shapes generation as well as a visualization tool. It is based on our work Learning Skeletal Articulations wit

Peizhuo 504 Dec 30, 2022
Vector Neurons: A General Framework for SO(3)-Equivariant Networks

Vector Neurons: A General Framework for SO(3)-Equivariant Networks Created by Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacc

Congyue Deng 332 Dec 29, 2022
Official code base for the poster "On the use of Cortical Magnification and Saccades as Biological Proxies for Data Augmentation" published in NeurIPS 2021 Workshop (SVRHM)

Self-Supervised Learning (SimCLR) with Biological Plausible Image Augmentations Official code base for the poster "On the use of Cortical Magnificatio

Binxu 8 Aug 17, 2022
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

MSG-Transformer Official implementation of the paper MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, by Jiemin

Hust Visual Learning Team 68 Nov 16, 2022
Books, Presentations, Workshops, Notebook Labs, and Model Zoo for Software Engineers and Data Scientists wanting to learn the TF.Keras Machine Learning framework

Books, Presentations, Workshops, Notebook Labs, and Model Zoo for Software Engineers and Data Scientists wanting to learn the TF.Keras Machine Learning framework

Google Cloud Platform 792 Dec 28, 2022
The implementation of "Optimizing Shoulder to Shoulder: A Coordinated Sub-Band Fusion Model for Real-Time Full-Band Speech Enhancement"

SF-Net for fullband SE This is the repo of the manuscript "Optimizing Shoulder to Shoulder: A Coordinated Sub-Band Fusion Model for Real-Time Full-Ban

Guochen Yu 36 Dec 02, 2022
Adabelief-Optimizer - Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"

AdaBelief Optimizer NeurIPS 2020 Spotlight, trains fast as Adam, generalizes well as SGD, and is stable to train GANs. Release of package We have rele

Juntang Zhuang 998 Dec 29, 2022
[AAAI 2021] EMLight: Lighting Estimation via Spherical Distribution Approximation and [ICCV 2021] Sparse Needlets for Lighting Estimation with Spherical Transport Loss

EMLight: Lighting Estimation via Spherical Distribution Approximation (AAAI 2021) Update 12/2021: We release our Virtual Object Relighting (VOR) Datas

Fangneng Zhan 144 Jan 06, 2023
Code for Efficient Visual Pretraining with Contrastive Detection

Code for DetCon This repository contains code for the ICCV 2021 paper "Efficient Visual Pretraining with Contrastive Detection" by Olivier J. Hénaff,

DeepMind 56 Nov 13, 2022
Weighted QMIX: Expanding Monotonic Value Function Factorisation

This repo contains the cleaned-up code that was used in "Weighted QMIX: Expanding Monotonic Value Function Factorisation"

whirl 82 Dec 29, 2022
Autotype on websites that have copy-paste disabled like Moodle, HackerEarth contest etc.

Autotype A quick and small python script that helps you autotype on websites that have copy paste disabled like Moodle, HackerEarth contests etc as it

Tushar 32 Nov 03, 2022
Integrated physics-based and ligand-based modeling.

ComBind ComBind integrates data-driven modeling and physics-based docking for improved binding pose prediction and binding affinity prediction. Given

Dror Lab 44 Oct 26, 2022
code for "Self-supervised edge features for improved Graph Neural Network training",

Self-supervised edge features for improved Graph Neural Network training Data availability: Here is a link to the raw data for the organoids dataset.

Neal Ravindra 23 Dec 02, 2022
Cweqgen - The CW Equation Generator

The CW Equation Generator The cweqgen (pronouced like "Queck-Jen") package provi

2 Jan 15, 2022