Local-Global Stratified Transformer for Efficient Video Recognition

Overview

DualFormer

This repo is the implementation of our manuscript entitled "Local-Global Stratified Transformer for Efficient Video Recognition". Our model is built on mmaction2, a popular video understanding package. This repo also draws on the code templates provided by PVT, Twins and Swin. It is released under the Apache 2.0 license.

Introduction

DualFormer is a Transformer architecture that performs space-time attention for video recognition both effectively and efficiently. Specifically, DualFormer stratifies full space-time attention into two cascaded levels: it first learns fine-grained local space-time interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and global pyramid contexts. Experimental results on five video benchmarks show that DualFormer outperforms existing methods. In particular, DualFormer sets new state-of-the-art top-1 accuracy of 82.9%/85.2% on Kinetics-400/600 with ∼1000G inference FLOPs, which is at least 3.2× fewer than existing methods with similar performance.
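The two-level idea can be illustrated with a toy sketch. This is not the implementation in this repo: it assumes a 1D token sequence instead of 3D space-time tokens, a single head with no learned query/key/value projections, and average pooling at two hypothetical scales as a stand-in for the global pyramid contexts. All function names and parameters below are illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def mean_pool(tokens, size):
    """Average consecutive groups of `size` tokens into one coarse context token."""
    pooled = []
    for start in range(0, len(tokens), size):
        group = tokens[start:start + size]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled

def local_global_attention(tokens, window=4, pool_sizes=(2, 4)):
    # Level 1: fine-grained attention restricted to non-overlapping local windows.
    local_out = []
    for start in range(0, len(tokens), window):
        win = tokens[start:start + window]
        local_out.extend(attend(q, win, win) for q in win)
    # Level 2: every token queries a small pyramid of pooled (coarse) contexts.
    contexts = [c for size in pool_sizes for c in mean_pool(local_out, size)]
    return [attend(q, contexts, contexts) for q in local_out], len(contexts)

tokens = [[float(i), float(i % 3)] for i in range(8)]
out, num_contexts = local_global_attention(tokens)
print(len(out), num_contexts)
```

In this toy setting each of the 8 tokens attends to 4 neighbours in the first level and to 6 pooled contexts in the second, rather than to all 8 tokens at once. The saving is negligible here, but the gap grows quickly with the number of tokens.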

Installation & Requirement

Please refer to install.md for installation. Docker files are also provided for convenience: cuda10.1 and cuda11.0.

All models are trained on 8 Nvidia A100 GPUs. For example, training DualFormer-T on Kinetics-400 takes ∼31 hours, while training the larger DualFormer-B on Kinetics-400 takes ∼3 days.

Data Preparation

Please first see data_preparation.md for a general overview of data preparation.

  • For Kinetics-400/600, as these are dynamic datasets (videos may be removed from YouTube), we use this repo to download the original files and the annotations. Only a small number of corrupted videos (around 50) are removed.
  • For the other datasets, i.e., HMDB-51, UCF-101 and Diving-48, we use the data downloader provided by mmaction2.

The full list of supported datasets is given below (more details in supported_datasets.md):

HMDB51 (Homepage) (ICCV'2011) UCF101 (Homepage) (CRCV-IR-12-01) ActivityNet (Homepage) (CVPR'2015) Kinetics-[400/600/700] (Homepage) (CVPR'2017)
SthV1 (Homepage) (ICCV'2017) SthV2 (Homepage) (ICCV'2017) Diving48 (Homepage) (ECCV'2018) Jester (Homepage) (ICCV'2019)
Moments in Time (Homepage) (TPAMI'2019) Multi-Moments in Time (Homepage) (ArXiv'2019) HVU (Homepage) (ECCV'2020) OmniSource (Homepage) (ECCV'2020)

Models

The following table lists the main model results together with their configuration files and download links. FLOPs are computed with fvcore; we omit the classification head since it has a negligible impact on the FLOPs.

Dataset    Version  Pretrain  GFLOPs  Param (M)  Top-1 (%)  Config  Download
K400       Tiny     IN-1K     240     21.8       79.5       link    link
K400       Small    IN-1K     636     48.9       80.6       link    link
K400       Base     IN-1K     1072    86.8       81.1       link    link
K600       Base     IN-22K    1072    86.8       85.2       link    link
Diving-48  Small    K400      1908    48.9       81.8       link    link
HMDB-51    Small    K400      1908    48.9       76.4       link    link
UCF-101    Small    K400      1908    48.9       97.5       link    link

Visualization

We visualize the attention maps at the last layer of our model generated by Grad-CAM on Kinetics-400. As shown in the following three gifs, our model successfully learns to focus on the relevant parts in the video clip. Left: flying kites. Middle: counting money. Right: walking dogs.

You can use the following command to visualize the attention weights:

python demo/demo_gradcam.py <CONFIG_FILE> <CHECKPOINT_FILE> <VIDEO_FILE> \
    --target-layer-name <TARGET_LAYER_NAME> --out-filename <OUT_FILE_NAME>

For example, to visualize the last layer of DualFormer-S on a K400 video (-cii-Z0dW2E_000020_000030.mp4), please run:

python demo/demo_gradcam.py \
    configs/recognition/dualformer/dualformer_small_patch244_window877_kinetics400_1k.py \
    checkpoints/k400/dualformer_small_patch244_window877.pth \
    /dataset/kinetics-400/train_files/-cii-Z0dW2E_000020_000030.mp4 \
    --target-layer-name backbone/blocks/3/3 --fps 10 \
    --out-filename output/-cii-Z0dW2E_000020_000030.gif
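When visualizing many clips, it can be convenient to build this command programmatically. The helper below is a minimal sketch: `gradcam_command` and its `out_dir` parameter are hypothetical names rather than part of this repo, and only the flags shown in the example above are used.

```python
import shlex
from pathlib import Path

def gradcam_command(config, checkpoint, video, target_layer, out_dir="output", fps=10):
    """Build the argv list for demo/demo_gradcam.py; the GIF is named after the video."""
    out_file = Path(out_dir) / (Path(video).stem + ".gif")
    return [
        "python", "demo/demo_gradcam.py", config, checkpoint, video,
        "--target-layer-name", target_layer,
        "--fps", str(fps),
        "--out-filename", out_file.as_posix(),
    ]

cmd = gradcam_command(
    "configs/recognition/dualformer/dualformer_small_patch244_window877_kinetics400_1k.py",
    "checkpoints/k400/dualformer_small_patch244_window877.pth",
    "/dataset/kinetics-400/train_files/-cii-Z0dW2E_000020_000030.mp4",
    "backbone/blocks/3/3",
)
print(shlex.join(cmd))
# To actually run it from the repository root: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with unusual video file names.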

User Guide

Folder Structure

As our implementation is based on mmaction2, we specify our contributions as follows:

Testing

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy

Example 1: to validate a DualFormer-T model on Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_test.sh configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py checkpoints/k400/dualformer_tiny_patch244_window877.pth 8 --eval top_k_accuracy

You will obtain the result as follows:

Example 2: to validate a DualFormer-S model on Diving-48 dataset with 4 GPUs, please run:

bash tools/dist_test.sh configs/recognition/dualformer/dualformer_small_patch244_window877_diving48.py checkpoints/diving48/dualformer_small_patch244_window877.pth 4 --eval top_k_accuracy 

The output will be as follows:

Training from scratch

To train a video recognition model from scratch for Kinetics-400, please run:

# single-gpu training
python tools/train.py <CONFIG_FILE> [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> [other optional arguments]

For example, to train a DualFormer-T model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py 8 

To train a DualFormer-S model on the Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_small_patch244_window877_kinetics400_1k.py 8 
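The Kinetics-400 configs referenced in this README follow a common naming pattern, so the training command for each variant can be generated. The snippet below is only a sketch: `k400_train_command` is a hypothetical helper, and the pattern is inferred from the tiny, small and base config paths that appear in this README.

```python
CONFIG_DIR = "configs/recognition/dualformer"
VARIANTS = ("tiny", "small", "base")

def k400_train_command(variant, num_gpus=8, extra_args=()):
    """Return the dist_train.sh argv list for one DualFormer variant on Kinetics-400."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    config = f"./{CONFIG_DIR}/dualformer_{variant}_patch244_window877_kinetics400_1k.py"
    return ["bash", "tools/dist_train.sh", config, str(num_gpus), *extra_args]

for variant in VARIANTS:
    print(" ".join(k400_train_command(variant)))
```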

Training with pre-trained 2D models

To train a video recognition model with pre-trained image models, please run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a DualFormer-T model on the Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>

To train a DualFormer-B model on the Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_base_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL>

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.
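To make the `--cfg-options` syntax concrete, the sketch below shows how dotted keys such as `model.backbone.pretrained` map onto a nested configuration. It is a simplified illustration of the idea, not the code mmaction2 or mmcv uses; `apply_cfg_options` is a hypothetical helper, and the default values in `cfg` are assumptions.

```python
def apply_cfg_options(cfg, options):
    """Merge dotted-key overrides such as 'model.backbone.pretrained' into a nested dict."""
    for dotted_key, value in options.items():
        node = cfg
        *parents, leaf = dotted_key.split(".")
        for name in parents:
            # Walk down the nesting, creating intermediate levels if they are missing.
            node = node.setdefault(name, {})
        node[leaf] = value
    return cfg

# Assumed defaults, for illustration only.
cfg = {"model": {"backbone": {"pretrained": None, "use_checkpoint": False}}}
apply_cfg_options(cfg, {
    "model.backbone.pretrained": "checkpoints/pretrained_2d/dualformer_tiny.pth",
    "model.backbone.use_checkpoint": True,
})
print(cfg["model"]["backbone"])
```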

Training with Token Labelling

We also present the first attempt to improve video recognition models by generalizing Token Labelling to videos as an additional augmentation. MixToken is turned off, as it does not work on our video datasets. For instance, to train the tiny version of DualFormer using DualFormer-B as the annotation model on the fly, please run:

bash tools/dist_train.sh configs/recognition/dualformer/dualformer_tiny_tokenlabel_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained='checkpoints/pretrained_2d/dualformer_tiny.pth' --validate 

Notice that we place the checkpoint of the annotation model at 'checkpoints/k400/dualformer_base_patch244_window877.pth'. You can change it to anywhere you want, or modify the path variable in this file.

We present two visualization examples of token labelling on video data. For simplicity, we omit several frames, so each example shows only 5 uniformly sampled frames. For each frame, each value p(i,j) on the left-hand side is the pseudo label (class index) assigned by the annotation model to the corresponding patch of the last stage.

  • Visualization example 1 (Correct label: pushing cart, index: 262).
  • Visualization example 2 (Correct label: dribbling basketball, index: 99).
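A map of p(i,j) values like those described above can be sketched as follows, assuming the pseudo label of a patch is simply the highest-scoring class in the annotation model's per-patch output. That choice, the function name `pseudo_label_map`, and the toy numbers are all assumptions for illustration, not taken from this repo.

```python
def pseudo_label_map(patch_logits):
    """Return p(i, j): the index of the highest-scoring class at each patch."""
    return [
        [max(range(len(logits)), key=logits.__getitem__) for logits in row]
        for row in patch_logits
    ]

# Hypothetical 2x2 patch grid with 3 classes (real grids and class counts are larger).
patch_logits = [
    [[0.1, 2.0, -1.0], [0.3, 0.2, 0.9]],
    [[1.5, 0.0, 0.4], [0.0, 0.1, 0.05]],
]
print(pseudo_label_map(patch_logits))  # [[1, 2], [0, 1]]
```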

              

Apex (optional):

We use apex for mixed precision training by default. To install apex, use our provided docker or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liang2021dualformer,
         title={DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition}, 
         author={Yuxuan Liang and Pan Zhou and Roger Zimmermann and Shuicheng Yan},
         year={2021},
         journal={arXiv preprint arXiv:2112.04674},
}

Acknowledgement

We would like to thank the authors of the following helpful codebases:

Please consider starring these related packages as well. Thank you very much for your attention.

Owner
Sea AI Lab