AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Last update: Dec 26, 2022

AdaFocusV2

This repo contains the official code and pre-trained models for AdaFocusV2.

Introduction

Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) achieves a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), which converges slowly and is unfriendly to practitioners. This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. We further present an improved training scheme to address the issues introduced by the one-stage formulation, namely the lack of supervision, of input diversity, and of training stability. Moreover, a conditional-exit technique is proposed to perform temporally adaptive computation on top of AdaFocus without additional training. Extensive experiments on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, and Jester) demonstrate that our model significantly outperforms the original AdaFocus and other competitive baselines, while being considerably simpler and more efficient to train.
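To illustrate why an interpolation-based patch selection can be trained end to end, here is a minimal pure-Python sketch. It assumes bilinear interpolation and a single-channel frame stored as nested lists. The names `bilinear_sample` and `crop_patch` are hypothetical and chosen for this example only; this is not the code from this repo.

```python
import math


def bilinear_sample(frame, y, x):
    """Bilinearly interpolate a 2D grid `frame` at continuous coordinates (y, x)."""
    h, w = len(frame), len(frame[0])
    # Clamp the query point so all four neighbours lie inside the frame.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0  # fractional offsets act as blending weights
    top = (1 - wx) * frame[y0][x0] + wx * frame[y0][x1]
    bottom = (1 - wx) * frame[y1][x0] + wx * frame[y1][x1]
    return (1 - wy) * top + wy * bottom


def crop_patch(frame, center_y, center_x, size):
    """Sample a size x size patch centred at a continuous (center_y, center_x)."""
    offset = (size - 1) / 2.0
    return [
        [bilinear_sample(frame, center_y - offset + i, center_x - offset + j)
         for j in range(size)]
        for i in range(size)
    ]


# Toy 5x5 frame whose pixel value is 10 * row + column.
frame = [[10 * r + c for c in range(5)] for r in range(5)]

# The patch centre does not have to fall on the pixel grid.
patch = crop_patch(frame, 2.25, 1.5, 2)
print(patch)  # → [[18.5, 19.5], [28.5, 29.5]]
```

Each output pixel is a weighted sum of its four neighbours, and the weights vary continuously with the patch centre. Gradients can therefore flow from the patch back to whatever module predicts the centre, which a crop at integer indices does not allow.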

Results

Compared with AdaFocusV1

ActivityNet, FCVID and Mini-Kinetics

Something-Something V1&V2 and Jester

Visualization

Get Started

For setup and usage instructions, see the docs in the two experiment folders: "Experiments on ActivityNet, FCVID and Mini-Kinetics" and "Experiments on Sth-Sth and Jester".

Contact

If you have any questions, feel free to contact the authors or raise an issue. Yulin Wang: [email protected].
