[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Overview

This repository provides the code for our paper. This includes:

  • Software setup, data downloading and preprocessing instructions for the VidSTG, HC-STVG1 and HC-STVG2.0 datasets
  • Training scripts and pretrained checkpoints
  • Evaluation scripts and demo

Setup

Download FFMPEG and add it to the PATH environment variable. The code was tested with version ffmpeg-4.2.2-amd64-static. Then create a conda environment and install the requirements with the following commands:

conda create -n tubedetr_env python=3.8
conda activate tubedetr_env
pip install -r requirements.txt
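Before running anything else, it can help to confirm that ffmpeg is actually reachable from the PATH. The following is a minimal sketch using only the Python standard library; the helper names `find_ffmpeg` and `ffmpeg_version` are hypothetical and not part of this repository.

```python
import shutil
import subprocess


def find_ffmpeg(binary="ffmpeg"):
    """Return the absolute path of the binary found on PATH, or None."""
    return shutil.which(binary)


def ffmpeg_version(path):
    """Return the first line printed by `ffmpeg -version`."""
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found: add its folder to the PATH environment variable")
    else:
        print(ffmpeg_version(path))
```

If the script reports that ffmpeg is missing, fix the PATH before moving on to the preprocessing steps.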

Data Downloading

In the config JSON files, set the paths where you are going to download the videos and annotations.

VidSTG: Download VidOR videos and annotations from the VidOR dataset providers. Then download the VidSTG annotations from the VidSTG dataset providers. The vidstg_vid_path folder should contain a folder video containing the unzipped video folders. The vidstg_ann_path folder should contain both VidOR and VidSTG annotations.

HC-STVG: Download HC-STVG1 and HC-STVG2.0 videos and annotations from the HC-STVG dataset providers. The hcstvg_vid_path folder should contain a folder video containing the unzipped video folders. The hcstvg_ann_path folder should contain both HC-STVG1 and HC-STVG2.0 annotations.
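The folder layout described above can be checked before preprocessing. The sketch below is not part of the repository: the key names are taken from the text above, but which JSON file holds which key, and the helper names `check_data_layout` and `check_config_file`, are assumptions made for illustration.

```python
import json
import os

# Key names come from the instructions above; which JSON file holds which key
# is an assumption of this sketch.
VIDEO_KEYS = ("vidstg_vid_path", "hcstvg_vid_path")
ANNOTATION_KEYS = ("vidstg_ann_path", "hcstvg_ann_path")


def check_data_layout(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    for key in VIDEO_KEYS:
        # Each video path is expected to contain a subfolder named "video".
        if key in config and not os.path.isdir(os.path.join(config[key], "video")):
            problems.append(f"{key}: missing 'video' subfolder in {config[key]}")
    for key in ANNOTATION_KEYS:
        if key in config and not os.path.isdir(config[key]):
            problems.append(f"{key}: folder {config[key]} does not exist")
    return problems


def check_config_file(json_path):
    """Load a config JSON file and check the data folders it points to."""
    with open(json_path) as f:
        return check_data_layout(json.load(f))
```

For example, `check_config_file("config/vidstg.json")` returns an empty list when every path it knows about is in place. It does not verify the contents of the annotation folders.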

Data Preprocessing

To preprocess annotation files, run:

python preproc/preproc_vidstg.py
python preproc/preproc_hcstvg.py
python preproc/preproc_hcstvgv2.py

Training

Before training, download the pretrained RoBERTa tokenizer and model weights into the TRANSFORMERS_CACHE folder, the pretrained ResNet-101 model weights into the TORCH_HOME folder, and the MDETR pretrained model weights with ResNet-101 backbone into the current folder.

VidSTG: To train on VidSTG, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR

HC-STVG2.0: To train on HC-STVG2.0, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=OUTPUT_DIR

HC-STVG1: To train on HC-STVG1, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--dataset_config config/hcstvg.json --epochs=40 --eval_skip=40 --output-dir=OUTPUT_DIR
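The three commands above differ only in a handful of dataset-specific flags. As a rough sketch, they can be assembled programmatically, for instance to launch them from a job script. All flags are copied from the commands above; the function `build_train_command` and the dataset labels `vidstg`, `hcstvg2` and `hcstvg1` are hypothetical names introduced here.

```python
import shlex

# Dataset-specific flags copied from the three commands above.
# The dictionary keys are labels chosen for this sketch only.
DATASET_FLAGS = {
    "vidstg": [
        "--combine_datasets=vidstg", "--combine_datasets_val=vidstg",
        "--dataset_config", "config/vidstg.json",
    ],
    "hcstvg2": [
        "--combine_datasets=hcstvg", "--combine_datasets_val=hcstvg",
        "--v2", "--dataset_config", "config/hcstvg.json", "--epochs=20",
    ],
    "hcstvg1": [
        "--combine_datasets=hcstvg", "--combine_datasets_val=hcstvg",
        "--dataset_config", "config/hcstvg.json", "--epochs=40", "--eval_skip=40",
    ],
}


def build_train_command(dataset, num_gpus, output_dir,
                        checkpoint="pretrained_resnet101_checkpoint.pth"):
    """Return the training command for `dataset` as an argument list."""
    if dataset not in DATASET_FLAGS:
        raise ValueError(f"unknown dataset {dataset!r}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}", "--use_env", "main.py", "--ema",
        f"--load={checkpoint}",
        *DATASET_FLAGS[dataset],
        f"--output-dir={output_dir}",
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line; pass the list to subprocess.run to launch it.
    print(shlex.join(build_train_command("hcstvg2", 4, "runs/hcstvg2")))
```

Keeping the command as a list rather than a single string avoids quoting problems when an output directory contains spaces.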

Baselines

  • To remove time encoding, add --no_time_embed.
  • To remove the temporal self-attention in the space-time decoder, add --no_tsa.
  • To train from ImageNet initialization, pass an empty string to the argument --load and add --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 --epochs=20 --lr_drop=20 for VidSTG or --epochs=60 --lr_drop=60 for HC-STVG1.
  • To train with a randomly initialized temporal self-attention, add --rd_init_tsa.
  • To train with a different spatial resolution (res) or temporal stride (k), add for example --resolution=224 or --stride=5.
  • To train with the slow-only variant, add --no_fast.
  • To train with alternative designs for the fast branch, add --fast=VARIANT.
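The baseline options above can be summarized as a small lookup table. This is only a sketch: the flags are those listed above, but the grouping of the ImageNet-initialization flags into a shared part and a per-dataset schedule is one reading of that bullet, and the names `baseline_flags`, `ABLATIONS` and the dictionary keys are hypothetical. Writing the empty string as `--load=` assumes the script parses its arguments with argparse.

```python
# Flags shared by both datasets when training from ImageNet initialization.
# "--load=" stands for "pass an empty string to --load" (assumes argparse).
IMAGENET_COMMON = [
    "--load=", "--sted_loss_coef=5", "--lr=2e-5", "--text_encoder_lr=2e-5",
]

# Training schedule per dataset, as read from the bullet above (an interpretation).
IMAGENET_SCHEDULE = {
    "vidstg": ["--epochs=20", "--lr_drop=20"],
    "hcstvg1": ["--epochs=60", "--lr_drop=60"],
}

# Ablations that are switched on by a single flag without a value.
ABLATIONS = {
    "no_time_encoding": ["--no_time_embed"],
    "no_temporal_self_attention": ["--no_tsa"],
    "random_init_tsa": ["--rd_init_tsa"],
    "slow_only": ["--no_fast"],
}


def baseline_flags(name, dataset=None):
    """Return the extra command-line flags for the baseline called `name`."""
    if name == "imagenet_init":
        if dataset not in IMAGENET_SCHEDULE:
            raise ValueError("imagenet_init needs dataset 'vidstg' or 'hcstvg1'")
        return IMAGENET_COMMON + IMAGENET_SCHEDULE[dataset]
    if name not in ABLATIONS:
        raise ValueError(f"unknown baseline {name!r}")
    return list(ABLATIONS[name])
```

When using the ImageNet-initialization flags, replace the `--load` and `--epochs` values of the base training command rather than passing them twice.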

Available Checkpoints

Training data               | Parameters   | URL   | Size
MDETR init + VidSTG         | k=4 res=352  | Drive | 3.0GB
MDETR init + VidSTG         | k=2 res=224  | Drive | 3.0GB
ImageNet init + VidSTG      | k=4 res=352  | Drive | 3.0GB
MDETR init + HC-STVG2.0     | k=4 res=352  | Drive | 3.0GB
MDETR init + HC-STVG2.0     | k=2 res=224  | Drive | 3.0GB
MDETR init + HC-STVG1       | k=4 res=352  | Drive | 3.0GB
ImageNet init + HC-STVG1    | k=4 res=352  | Drive | 3.0GB

Evaluation

For evaluation only, simply run the same commands as for training with --resume=CHECKPOINT --eval. For this to be done on the test set, add --test (in this case predictions and attention weights are also saved).
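As a small illustration of the rule above, an evaluation command is the training command plus two or three flags. The helper `to_eval_args` is a hypothetical name, and the argument list in the example is abbreviated for brevity.

```python
def to_eval_args(train_args, checkpoint, test_split=False):
    """Turn a training argument list into an evaluation one."""
    args = list(train_args) + [f"--resume={checkpoint}", "--eval"]
    if test_split:
        # Evaluate on the test set; predictions and attention weights are also saved.
        args.append("--test")
    return args


if __name__ == "__main__":
    # Abbreviated training arguments, for illustration only.
    train_args = ["main.py", "--ema", "--combine_datasets=vidstg"]
    print(to_eval_args(train_args, "CHECKPOINT", test_split=True))
```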

Spatio-Temporal Video Grounding Demo

You can also use a pretrained model to infer a spatio-temporal tube on a video of your choice (VIDEO_PATH with potential START and END timestamps) given the natural language query of your choice (CAPTION) with the following command:

python demo_stvg.py --load=CHECKPOINT --caption_example CAPTION --video_example VIDEO_PATH --start_example=START --end_example=END --output-dir OUTPUT_PATH
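Because CAPTION is free-form text, it usually contains spaces and has to be quoted on the command line. The sketch below builds the demo command as an argument list and prints a safely quoted version. The function name, the example caption and the file names are made up for illustration, and omitting the timestamp flags when no START or END is given is an assumption about how demo_stvg.py behaves.

```python
import shlex


def build_demo_command(checkpoint, caption, video_path, output_dir,
                       start=None, end=None):
    """Return the demo command as an argument list."""
    args = [
        "python", "demo_stvg.py", f"--load={checkpoint}",
        "--caption_example", caption,
        "--video_example", video_path,
    ]
    # Assumption: the timestamp flags may be left out to use the whole video.
    if start is not None:
        args.append(f"--start_example={start}")
    if end is not None:
        args.append(f"--end_example={end}")
    args += ["--output-dir", output_dir]
    return args


if __name__ == "__main__":
    cmd = build_demo_command(
        "CHECKPOINT", "the man in red picks up a cup", "clip.mp4", "demo_out",
        start=2, end=10,
    )
    # shlex.join wraps the caption in quotes so the shell sees one argument.
    print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run` sidesteps shell quoting altogether.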

We also host an online demo, the code of which is available in server_stvg.py and server_stvg.html.

Acknowledgements

This codebase is built on the MDETR codebase. The code for video spatial data augmentation is inspired by torch_videovision.

Citation

If you find this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2022tubedetr,
  title={TubeDETR: Spatio-Temporal Video Grounding with Transformers},
  author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}