Implementation of TimeSformer, a pure attention-based solution for video classification

Last update: Jan 03, 2023

Overview

TimeSformer - Pytorch

Implementation of TimeSformer, a pure attention-based approach that reaches state-of-the-art results on video classification. This repository houses only the best-performing variant, 'Divided Space-Time Attention', which simply applies attention along the time axis before attention along the spatial axes.
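To make "time before space" concrete, here is a minimal pure-Python sketch of which tokens would attend to each other in each step. It is an illustration only, not the library's code; the helper names (`token_index`, `time_groups`, `space_groups`), the frame-major token layout, and the toy sizes are assumptions, and the classification token is left out.

```python
# Illustrative sketch only: which patch tokens share an attention group in
# "divided" space-time attention. Names and layout are hypothetical.

def token_index(frame, patch, patches_per_frame):
    """Flat index of a patch token, with frames laid out one after another."""
    return frame * patches_per_frame + patch

def time_groups(num_frames, patches_per_frame):
    """Time attention: one group per spatial location, spanning all frames."""
    return [
        [token_index(f, p, patches_per_frame) for f in range(num_frames)]
        for p in range(patches_per_frame)
    ]

def space_groups(num_frames, patches_per_frame):
    """Space attention: one group per frame, spanning all patches in it."""
    return [
        [token_index(f, p, patches_per_frame) for p in range(patches_per_frame)]
        for f in range(num_frames)
    ]

# Toy sizes: 3 frames, 4 patches per frame, so 12 patch tokens in total.
t_groups = time_groups(3, 4)
s_groups = space_groups(3, 4)
print(t_groups[0])  # patch 0 in every frame
print(s_groups[0])  # every patch in frame 0
```

Each token only attends within its group, so the two steps together compare far fewer token pairs than joint attention over all frames and patches at once.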

Install

$ pip install timesformer-pytorch

Usage

import torch
from timesformer_pytorch import TimeSformer

model = TimeSformer(
    dim = 512,
    image_size = 224,
    patch_size = 16,
    num_frames = 8,
    num_classes = 10,
    depth = 12,
    heads = 8,
    dim_head = 64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)

video = torch.randn(2, 8, 3, 224, 224) # (batch x frames x channels x height x width)
pred = model(video) # (2, 10)
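
As a rough check on the sizes involved, the arithmetic behind this configuration can be worked out by hand. The sketch below uses plain Python, no PyTorch; the divisibility check is a sanity check of this sketch, and the single prepended classification token is an assumption about the sequence layout rather than something stated in this README.

```python
# Shape bookkeeping for the configuration above; plain arithmetic only.
image_size, patch_size, num_frames, channels = 224, 16, 8, 3

# Sanity check of this sketch: patches must tile the frame exactly.
assert image_size % patch_size == 0, "image_size must be divisible by patch_size"

patches_per_side = image_size // patch_size      # 14
patches_per_frame = patches_per_side ** 2        # 196
patch_tokens = num_frames * patches_per_frame    # 1568
patch_dim = channels * patch_size ** 2           # 768 values per flattened patch

# Assumption: one classification token is prepended to the patch tokens.
sequence_length = patch_tokens + 1               # 1569

print(patches_per_frame, patch_tokens, patch_dim, sequence_length)
```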

Citations

@misc{bertasius2021spacetime,
    title   = {Is Space-Time Attention All You Need for Video Understanding?}, 
    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
    year    = {2021},
    eprint  = {2102.05095},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

How to deal with varying length video? Thanks

Dear all, I am wondering whether TimeSformer can handle videos of different lengths. Is it possible to use a mask, as in the original Transformer? Any ideas are welcome, thanks a lot.
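
One common workaround, sketched here under assumptions, is to bring every clip to a fixed `num_frames` before it reaches the model: sample longer clips uniformly and pad shorter ones. The function `fit_to_num_frames` is hypothetical, and whether the model itself accepts a frame mask is not stated in this thread, so the returned mask is only bookkeeping you would have to wire in yourself.

```python
def fit_to_num_frames(total_frames, num_frames):
    """Return (indices, mask), both of length num_frames.

    Longer clips are sampled uniformly; shorter clips are padded by repeating
    the last frame, and mask marks which positions hold real frames.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames >= num_frames:
        # Take the centre of each of num_frames equal-width segments.
        indices = [int((i + 0.5) * total_frames / num_frames) for i in range(num_frames)]
        mask = [True] * num_frames
    else:
        pad = num_frames - total_frames
        indices = list(range(total_frames)) + [total_frames - 1] * pad
        mask = [True] * total_frames + [False] * pad
    return indices, mask

long_indices, long_mask = fit_to_num_frames(16, 8)
short_indices, short_mask = fit_to_num_frames(5, 8)
print(long_indices)
print(short_indices, short_mask)
```

The indices select frames from the decoded clip, so every batch ends up with the same `(frames, channels, height, width)` layout regardless of the original video length.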

opened by junyongyou 2

fix runtime error in SpaceTime Attention

There is a shape mismatch error in Attention. When we splice out the classification token from the first token of each sequence in q, k and v, the shape becomes (batch_size * num_heads, num_frames * num_patches - 1, head_dim). Then we try to reshape the tensor by taking out a factor of num_frames or num_patches (depending on whether it is space or time attention) from dimension 1. That doesn't work because we subtracted out the classification token.

I found that performing the rearrange operation before splicing the token fixes the issue.

I recreate the problem and illustrate the solution in this notebook: https://colab.research.google.com/drive/1lHFcn_vgSDJNSqxHy7rtqhMVxe0nUCMS?usp=sharing.

By the way, thank you to @lucidrains; all of your implementations on attention-based models are helping me more than you know.

opened by adam-mehdi 1

Update timesformer_pytorch.py

fixing issue for scaling

File "/home/aarti9/.local/lib/python3.6/site-packages/timesformer_pytorch/timesformer_pytorch.py", line 82, in forward
    q *= self.scale
RuntimeError: Output 0 of ViewBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views is forbidden. You should replace the inplace operation by an out-of-place one.
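
The kind of change the error message asks for can be sketched as follows. This is a minimal illustration, not the actual patch: the layer name `to_qkv`, the tensor sizes, and the scale value are hypothetical stand-ins, and the assumption that `q` comes from chunking a single projection is inferred from the "multiple views" wording of the error.

```python
import torch
from torch import nn

# Hypothetical stand-ins: the layer name, sizes, and scale are illustrative.
to_qkv = nn.Linear(4, 12, bias=False)
scale = 4 ** -0.5

x = torch.randn(2, 5, 4)
qkv = to_qkv(x)

# chunk returns several views of the same underlying tensor.
q, k, v = qkv.chunk(3, dim=-1)

# q *= scale      # in-place edit of one of those views: the pattern the error rejects
q = q * scale     # out-of-place: builds a new tensor and leaves the views untouched

# Gradients still flow back to the projection weights.
(q.sum() + k.sum() + v.sum()).backward()
print(q.shape, to_qkv.weight.grad is not None)
```

The out-of-place multiply allocates one extra tensor, which is usually a negligible cost next to the attention computation itself.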

opened by aarti9 0

Fine-tune with new datasets

Thank you so much for your great effort. I can run predictions using the given .py files, but I couldn't find a train.py file. How can I fine-tune the network on a new dataset, and where should I define the image samples of the new dataset?
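
No training script appears in this README, so the following is only a generic PyTorch sketch of how fine-tuning could be set up. `ClipDataset` and `train_one_epoch` are hypothetical names, the random tensors stand in for real video decoding, and the loss and optimizer choices are assumptions rather than the authors' recipe.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class ClipDataset(Dataset):
    """Hypothetical dataset: each item is (clip, label).

    clip has shape (frames, channels, height, width), the per-sample layout
    of the usage example. Replace the random tensors with real decoding.
    """
    def __init__(self, num_clips, num_frames, image_size, num_classes):
        self.clips = torch.randn(num_clips, num_frames, 3, image_size, image_size)
        self.labels = torch.randint(0, num_classes, (num_clips,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.clips[idx], self.labels[idx]

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """One pass over the data with cross-entropy loss; returns the mean loss."""
    model.train()
    criterion = nn.CrossEntropyLoss()
    total, count = 0.0, 0
    for clips, labels in loader:
        clips, labels = clips.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
        total += loss.item() * labels.size(0)
        count += labels.size(0)
    return total / count

# Intended use (not executed here), with `model` built as in the Usage section:
#   loader = DataLoader(ClipDataset(100, 8, 224, 10), batch_size=2, shuffle=True)
#   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
#   for epoch in range(5):
#       print(epoch, train_one_epoch(model, loader, optimizer))
```

The new dataset lives entirely in the `Dataset` class: as long as `__getitem__` returns clips with the frame count and image size the model was built with, the loop itself does not need to change.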

opened by Jeba-create 0

problem in timesformer_pytorch.py

Starting from line 182:

video = rearrange(video, 'b f c (h p1) (w p2) -> b (f h w) (p1 p2 c)', p1 = p, p2 = p)

I think this should be:

video = rearrange(video, 'b f c (hp p1) (wp p2) -> b (f hp wp) (p1 p2 c)', p1 = p, p2 = p)

opened by Weizhongjin 2

Imagenet Pretrained Weights

Thanks for the work! In their paper they say: "For all our experiments, we adopt the “Base” ViT model architecture (Dosovitskiy et al., 2020) pretrained on ImageNet."

I know that you said the official weights trained on Kinetics and such are not officially released yet. However, I am not interested in those; what I need are the initial weights of the network based only on ViT ImageNet pretraining, so that I can train this implementation of yours starting from them. From what it looks like, you don't have weights for this implementation that come from ImageNet pretraining, do you?

opened by RaivoKoot 5

Releases (0.4.1)

0.4.1 (Aug 25, 2021)
0.4.0 (Aug 16, 2021)
0.3.3 (Jul 4, 2021)
0.3.2 (Apr 26, 2021)
0.3.1 (Apr 25, 2021)
0.2.1 (Apr 21, 2021)
0.1.1 (Mar 23, 2021)
0.1.0 (Mar 21, 2021)
0.0.5 (Mar 18, 2021)
0.0.4 (Feb 11, 2021)
0.0.3 (Feb 11, 2021)
0.0.2 (Feb 11, 2021)
0.0.1a (Feb 11, 2021)

Owner

Phil Wang

Working with Attention. It's all we need.
