[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Last update: Dec 27, 2022

Overview

This repository provides the code for our paper. This includes:

Software setup, data downloading and preprocessing instructions for the VidSTG, HC-STVG1 and HC-STVG2.0 datasets
Training scripts and pretrained checkpoints
Evaluation scripts and demo

Setup

Download FFMPEG and add it to the PATH environment variable. The code was tested with version ffmpeg-4.2.2-amd64-static. Then create a conda environment and install the requirements with the following commands:

conda create -n tubedetr_env python=3.8
conda activate tubedetr_env
pip install -r requirements.txt
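
Before installing the requirements, it can help to confirm that FFMPEG is reachable through PATH. The following is a minimal sketch using only the Python standard library; the helper name find_ffmpeg is hypothetical and not part of this repository.

```python
import shutil
import subprocess
from typing import Optional


def find_ffmpeg(binary: str = "ffmpeg") -> Optional[str]:
    """Return the absolute path of the binary found on PATH, or None if absent."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH")
    else:
        # `ffmpeg -version` prints a version banner; show its first line if any.
        banner = subprocess.run(
            [path, "-version"], capture_output=True, text=True
        ).stdout
        print(banner.splitlines()[0] if banner else path)
```

If the script reports that ffmpeg is missing, the PATH change has probably not been picked up by the current shell.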

Data Downloading

In the config JSON files, set the paths where you are going to download the videos and annotations.

VidSTG: Download VidOR videos and annotations from the VidOR dataset providers. Then download the VidSTG annotations from the VidSTG dataset providers. The vidstg_vid_path folder should contain a folder video containing the unzipped video folders. The vidstg_ann_path folder should contain both VidOR and VidSTG annotations.

HC-STVG: Download HC-STVG1 and HC-STVG2.0 videos and annotations from the HC-STVG dataset providers. The hcstvg_vid_path folder should contain a folder video containing the unzipped video folders. The hcstvg_ann_path folder should contain both HC-STVG1 and HC-STVG2.0 annotations.
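
The folder layout described above can be checked before preprocessing. The sketch below assumes that the config JSON files use keys named exactly like the folders mentioned above (vidstg_vid_path, vidstg_ann_path, hcstvg_vid_path, hcstvg_ann_path); the helper missing_dataset_paths is hypothetical and not part of this repository.

```python
import json
from pathlib import Path
from typing import Dict, List


def missing_dataset_paths(config_file: str, vid_key: str, ann_key: str) -> List[str]:
    """List the expected folders that do not exist for one dataset."""
    with open(config_file) as f:
        cfg: Dict[str, str] = json.load(f)
    expected = [
        Path(cfg[vid_key]) / "video",  # unzipped video folders live under `video`
        Path(cfg[ann_key]),            # annotation folder
    ]
    return [str(p) for p in expected if not p.is_dir()]


if __name__ == "__main__":
    # Key names are an assumption based on the folder names used in this README.
    for cfg_file, vid_key, ann_key in [
        ("config/vidstg.json", "vidstg_vid_path", "vidstg_ann_path"),
        ("config/hcstvg.json", "hcstvg_vid_path", "hcstvg_ann_path"),
    ]:
        if Path(cfg_file).is_file():
            print(cfg_file, "missing:", missing_dataset_paths(cfg_file, vid_key, ann_key))
```

An empty list means both the video folder and the annotation folder were found for that dataset.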

Data Preprocessing

To preprocess annotation files, run:

python preproc/preproc_vidstg.py
python preproc/preproc_hcstvg.py
python preproc/preproc_hcstvgv2.py

Training

Download the pretrained RoBERTa tokenizer and model weights into the TRANSFORMERS_CACHE folder. Download the pretrained ResNet-101 model weights into the TORCH_HOME folder. Download the MDETR pretrained model weights with ResNet-101 backbone into the current folder.
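
TRANSFORMERS_CACHE and TORCH_HOME are environment variables. When one of them is unset, the corresponding library falls back to its own default cache location, so it is worth checking what the current shell exports. A minimal sketch (the helper weight_cache_dirs is hypothetical and not part of this repository):

```python
import os
from typing import Dict, Optional


def weight_cache_dirs(env: Optional[Dict[str, str]] = None) -> Dict[str, Optional[str]]:
    """Report the two cache variables named above; None means the variable is unset."""
    env = dict(os.environ) if env is None else env
    return {name: env.get(name) for name in ("TRANSFORMERS_CACHE", "TORCH_HOME")}


if __name__ == "__main__":
    for name, value in weight_cache_dirs().items():
        print(f"{name} = {value if value is not None else '(unset)'}")
```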

VidSTG: To train on VidSTG, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR

HC-STVG2.0: To train on HC-STVG2.0, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=OUTPUT_DIR

HC-STVG1: To train on HC-STVG1, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--dataset_config config/hcstvg.json --epochs=40 --eval_skip=40 --output-dir=OUTPUT_DIR
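
The three training commands differ only in a few flags, so they can be assembled programmatically when launching several runs. The sketch below only reuses flags that appear in the commands above; the function train_command and the dataset labels "vidstg", "hcstvg2" and "hcstvg1" are hypothetical conveniences and not part of this repository.

```python
import shlex
from typing import List, Optional


def train_command(dataset: str, num_gpus: int, output_dir: str,
                  load: str = "pretrained_resnet101_checkpoint.pth",
                  extra: Optional[List[str]] = None) -> List[str]:
    """Assemble the launch command for 'vidstg', 'hcstvg2' or 'hcstvg1'."""
    # (dataset name, config file, dataset-specific flags), copied from the README.
    per_dataset = {
        "vidstg": ("vidstg", "config/vidstg.json", []),
        "hcstvg2": ("hcstvg", "config/hcstvg.json", ["--v2", "--epochs=20"]),
        "hcstvg1": ("hcstvg", "config/hcstvg.json", ["--epochs=40", "--eval_skip=40"]),
    }
    name, config, flags = per_dataset[dataset]
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}", "--use_env", "main.py", "--ema",
        f"--load={load}",
        f"--combine_datasets={name}", f"--combine_datasets_val={name}",
        *flags,
        "--dataset_config", config,
        f"--output-dir={output_dir}",
    ]
    # Extra flags (for example ablation switches) are appended unchanged.
    return cmd + list(extra or [])


if __name__ == "__main__":
    print(shlex.join(train_command("vidstg", num_gpus=4, output_dir="OUTPUT_DIR")))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues.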

Baselines

To remove time encoding, add --no_time_embed.
To remove the temporal self-attention in the space-time decoder, add --no_tsa.
To train from ImageNet initialization, pass an empty string to the argument --load and add --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 --epochs=20 --lr_drop=20 for VidSTG or --epochs=60 --lr_drop=60 for HC-STVG1.
To train with a randomly initialized temporal self-attention, add --rd_init_tsa.
To train with a different spatial resolution (e.g. res=224) or temporal stride (e.g. k=5), add --resolution=224 or --stride=5.
To train with the slow-only variant, add --no_fast.
To train with alternative designs for the fast branch, add --fast=VARIANT.
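
As a generic illustration of what a temporal stride k means, the sketch below keeps every k-th frame of a sequence. It is not taken from this codebase and makes no claim about how the --stride option is implemented internally; the function subsample is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(frames: Sequence[T], stride: int) -> List[T]:
    """Keep every `stride`-th frame, starting from the first one."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(frames[::stride])


if __name__ == "__main__":
    frame_ids = list(range(10))
    print(subsample(frame_ids, 4))  # [0, 4, 8]
    print(subsample(frame_ids, 5))  # [0, 5]
```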

Available Checkpoints

Training data	parameters	url	size
MDETR init + VidSTG	k=4 res=352	Drive	3.0GB
MDETR init + VidSTG	k=2 res=224	Drive	3.0GB
ImageNet init + VidSTG	k=4 res=352	Drive	3.0GB
MDETR init + HC-STVG2.0	k=4 res=352	Drive	3.0GB
MDETR init + HC-STVG2.0	k=2 res=224	Drive	3.0GB
MDETR init + HC-STVG1	k=4 res=352	Drive	3.0GB
ImageNet init + HC-STVG1	k=4 res=352	Drive	3.0GB

Evaluation

To evaluate only, run the same command as for training and add --resume=CHECKPOINT --eval. To evaluate on the test set, also add --test; in this case, predictions and attention weights are also saved.
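
Since evaluation reuses the training command, the extra flags can be appended to an existing argument list. A minimal sketch, in which the function eval_command is hypothetical and the example training command is abbreviated for illustration:

```python
from typing import List


def eval_command(train_cmd: List[str], checkpoint: str, test_set: bool = False) -> List[str]:
    """Turn a training command into an evaluation command."""
    cmd = list(train_cmd) + [f"--resume={checkpoint}", "--eval"]
    if test_set:
        # Evaluate on the test set instead of the validation set.
        cmd.append("--test")
    return cmd


if __name__ == "__main__":
    # Abbreviated training command, for illustration only.
    train = ["python", "main.py", "--dataset_config", "config/vidstg.json"]
    print(" ".join(eval_command(train, "CHECKPOINT", test_set=True)))
```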

Spatio-Temporal Video Grounding Demo

You can also use a pretrained model to infer a spatio-temporal tube on a video of your choice (VIDEO_PATH with potential START and END timestamps) given the natural language query of your choice (CAPTION) with the following command:

python demo_stvg.py --load=CHECKPOINT --caption_example CAPTION --video_example VIDEO_PATH --start_example=START --end_example=END --output-dir OUTPUT_PATH
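
When the demo is called from a script, building the command as an argument list avoids quoting problems with captions that contain spaces. The sketch below reuses the flags of the command above. The function demo_command is hypothetical, and leaving out --start_example and --end_example when no timestamps are given is an assumption based on the word "potential" above, not something this README confirms.

```python
from typing import List, Optional


def demo_command(checkpoint: str, caption: str, video_path: str, output_path: str,
                 start: Optional[float] = None, end: Optional[float] = None) -> List[str]:
    """Assemble the demo_stvg.py command as an argument list."""
    cmd = [
        "python", "demo_stvg.py",
        f"--load={checkpoint}",
        "--caption_example", caption,   # kept as one argument, spaces included
        "--video_example", video_path,
    ]
    # Assumption: the timestamp flags may be omitted when not needed.
    if start is not None:
        cmd.append(f"--start_example={start}")
    if end is not None:
        cmd.append(f"--end_example={end}")
    cmd += ["--output-dir", output_path]
    return cmd


if __name__ == "__main__":
    print(demo_command("CHECKPOINT", "a person opens a door", "VIDEO_PATH", "OUTPUT_PATH", start=2, end=8))
```

The resulting list can be passed to subprocess.run without going through a shell.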

Note that we also host an online demo at this link, the code of which is available at server_stvg.py and server_stvg.html.

Acknowledgements

This codebase is built on the MDETR codebase. The code for video spatial data augmentation is inspired by torch_videovision.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2022tubedetr,
title={TubeDETR: Spatio-Temporal Video Grounding with Transformers},
author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}}

Owner

Antoine Yang