This is an official implementation of "Video Swin Transformer".

Last update: Jan 03, 2023

Overview

Video Swin Transformer

By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.

This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.

Updates

06/25/2021 Initial commits

Introduction

Video Swin Transformer was first described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers. This leads to a better speed-accuracy trade-off than previous approaches, which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
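To make the locality argument concrete, the following sketch counts query-key pairs for global attention versus attention restricted to non-overlapping 3D windows. It is only back-of-the-envelope arithmetic under stated assumptions: the 32-frame 224x224 input is assumed, and the patch size (2, 4, 4) and window size (8, 7, 7) are an assumed reading of "patch244" and "window877" in the config file names. It ignores attention heads, shifted windows, and the downsampling in later stages, so it says nothing about the FLOPs reported in the tables.

```python
from math import prod

def token_grid(clip, patch):
    """Number of tokens along (time, height, width) after patch embedding."""
    return tuple(c // p for c, p in zip(clip, patch))

def global_attention_pairs(grid):
    """Query-key pairs when every token attends to every other token."""
    n = prod(grid)
    return n * n

def windowed_attention_pairs(grid, window):
    """Query-key pairs when attention is restricted to non-overlapping 3D windows."""
    num_windows = prod(g // w for g, w in zip(grid, window))
    tokens_per_window = prod(window)
    return num_windows * tokens_per_window ** 2

clip = (32, 224, 224)  # assumed input: 32 frames at 224x224 (not stated in this README)
patch = (2, 4, 4)      # assumed reading of "patch244" in the config file names
window = (8, 7, 7)     # assumed reading of "window877" in the config file names

grid = token_grid(clip, patch)
global_pairs = global_attention_pairs(grid)
local_pairs = windowed_attention_pairs(grid, window)

print("token grid:", grid)
print("global pairs:", global_pairs)
print("windowed pairs:", local_pairs)
print("reduction factor:", global_pairs // local_pairs)
```

Under these assumptions the reduction factor equals the number of windows, since each token attends only to the tokens in its own window rather than to the whole clip.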

Results and Models

Kinetics 400

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-T	ImageNet-1K	30ep	224	78.8	93.6	28M	87.9G	config	github/baidu
Swin-S	ImageNet-1K	30ep	224	80.6	94.5	50M	165.9G	config	github/baidu
Swin-B	ImageNet-1K	30ep	224	80.6	94.6	88M	281.6G	config	github/baidu
Swin-B	ImageNet-22K	30ep	224	82.7	95.5	88M	281.6G	config	github/baidu

Kinetics 600

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-B	ImageNet-22K	30ep	224	84.0	96.5	88M	281.6G	config	github/baidu

Something-Something V2

Backbone	Pretrain	Lr Schd	spatial crop	acc@1	acc@5	#params	FLOPs	config	model
Swin-B	Kinetics 400	60ep	224	69.6	92.7	89M	320.6G	config	github/baidu

Notes:

Pre-trained image models can be downloaded from Swin Transformer for ImageNet Classification.
The pre-trained model for SSv2 can be downloaded at github/baidu.
Access code for baidu is swin.

Usage

Installation

Please refer to install.md for installation.

We also provide Dockerfiles for CUDA 10.1 (image url) and CUDA 11.0 (image url) for convenience.

Data Preparation

Please refer to data_preparation.md for general guidance on data preparation. The supported datasets are listed in supported_datasets.md.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy

Training

To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-T model on the Kinetics-400 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>
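The `--cfg-options` arguments above are dotted `key=value` pairs that override entries in the nested configuration. The sketch below illustrates that idea on a plain dictionary. It is not the parser the training scripts use; the helper names `parse_value` and `apply_overrides`, the sample config, and the checkpoint filename `swin_tiny.pth` are all hypothetical.

```python
def parse_value(text):
    """Turn a command-line string into a bool, None, int, or float, else keep it as str."""
    literals = {"True": True, "False": False, "None": None}
    if text in literals:
        return literals[text]
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' overrides to a nested dict in place and return it."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        keys = dotted_key.split(".")
        node = config
        for key in keys[:-1]:
            # Walk down, creating intermediate dicts if a level is missing.
            node = node.setdefault(key, {})
        node[keys[-1]] = parse_value(raw_value)
    return config

# Hypothetical stand-in for the relevant part of a config file.
config = {"model": {"backbone": {"pretrained": None, "use_checkpoint": False}}}

apply_overrides(config, [
    "model.backbone.pretrained=swin_tiny.pth",  # hypothetical checkpoint path
    "model.backbone.use_checkpoint=True",
])
print(config)
```

Read this way, `model.backbone.pretrained=<PRETRAIN_MODEL>` only replaces one leaf of the config and leaves the rest of the file's settings untouched.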

To train a video recognizer with pre-trained video models (for the Something-Something v2 dataset), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-B model on the SSv2 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.
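As a rough illustration of why such an option saves memory, the toy below assumes `use_checkpoint` refers to activation (gradient) checkpointing: keep only a few intermediate activations during the forward pass and recompute the others when they are needed. It is a conceptual sketch in plain Python with hypothetical helper names, not how PyTorch or this repo implements it, and it does not involve real gradients.

```python
def run_all(layers, x):
    """Forward pass that keeps every activation, as a non-checkpointed model would."""
    activations = [x]
    for layer in layers:
        activations.append(layer(activations[-1]))
    return activations

def run_checkpointed(layers, x, segment):
    """Forward pass that keeps only the activation at every `segment`-th layer boundary."""
    saved = {0: x}
    for i, layer in enumerate(layers, start=1):
        x = layer(x)
        if i % segment == 0:
            saved[i] = x
    return x, saved

def recompute(layers, saved, index, segment):
    """Rebuild activation `index` by replaying layers from the nearest saved boundary."""
    start = (index // segment) * segment
    x = saved[start]
    for layer in layers[start:index]:
        x = layer(x)
    return x

# Eight toy "layers"; each one is just a small arithmetic function.
layers = [lambda v, k=k: 2 * v + k for k in range(8)]

full = run_all(layers, 1)
output, saved = run_checkpointed(layers, 1, segment=4)

print("stored without checkpointing:", len(full))
print("stored with checkpointing:", len(saved))
print("activation 6 recomputed correctly:", recompute(layers, saved, 6, 4) == full[6])
```

The trade-off is extra computation for lower memory: fewer values are stored, and any activation that was dropped costs a partial replay of the forward pass.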

Apex (optional):

We use apex for mixed precision training by default. To install apex, use the provided Docker image or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}
