Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Overview

Temporally Efficient Vision Transformer for Video Instance Segmentation

Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR 2022, Oral)

by Shusheng Yang1,3, Xinggang Wang1 📧 , Yu Li4, Yuxin Fang1, Jiemin Fang1,2, Wenyu Liu1, Xun Zhao3, Ying Shan3.

1 School of EIC, HUST, 2 AIA, HUST, 3 ARC Lab, Tencent PCG, 4 IDEA.

( 📧 ) corresponding author.


  • This repo provides code, models and training/inference recipes for TeViT(Temporally Efficient Vision Transformer for Video Instance Segmentation).
  • TeViT is a transformer-based end-to-end video instance segmentation framework. We build our framework upon the query-based instance segmentation methods, i.e., QueryInst.
  • We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction head in the instance heads. These two designs fully utlizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost.

Overall Arch

Models and Main Results

  • We provide both checkpoints and codalab server submissions on YouTube-VIS-2019 dataset.
Name AP [email protected] [email protected] [email protected] [email protected] model submission
TeViT_MsgShifT 46.3 70.6 50.9 45.2 54.3 link link
TeViT_MsgShifT_MST 46.9 70.1 52.9 45.0 53.4 link link
  • We have conducted multiple runs due to the training instability and checkpoints above are all the best one among multiple runs. The average performances are reported in our paper.
  • Besides basic models, we also provide TeViT with ResNet-50 and Swin-L backbone, models are also trained on YouTube-VIS-2019 dataset.
  • MST denotes multi-scale traning.
Name AP [email protected] [email protected] [email protected] [email protected] model submission
TeViT_R50 42.1 67.8 44.8 41.3 49.9 link link
TeViT_Swin-L_MST 56.8 80.6 63.1 52.0 63.3 link link
  • Due to backbone limitations, TeViT models with ResNet-50 and Swin-L backbone are conducted with STQI Head only (i.e., without our proposed messenger shift mechanism).
  • With Swin-L as backbone network, we apply more instance queries (i.e., from 100 to 300) and stronger data augmentation strategies. Both of them can further boost the final performance.

Installation

Prerequisites

  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+

Prepare

  • Clone the repository locally:
git clone https://github.com/hustvl/TeViT.git
  • Create a conda virtual environment and activate it:
conda create --name tevit python=3.7.7
conda activate tevit
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI
  • Install Python requirements
torch==1.9.0
torchvision==0.10.0
mmcv==1.4.8
pip install -r requirements.txt
  • Please follow Docs to install MMDetection
python setup.py develop
  • Download YouTube-VIS 2019 dataset from here, and organize dataset as follows:
TeViT
├── data
│   ├── youtubevis
│   │   ├── train
│   │   │   ├── 003234408d
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── ...
│   │   ├── annotations
│   │   │   ├── train.json
│   │   │   ├── valid.json

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference process, the predicted results is stored in results.json, submit it to the evaluation server to get the final performance.

Training

  • Download the COCO pretrained QueryInst with PVT-B1 backbone from here.
  • Train TeViT with 8 GPUs:
./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • Train TeViT with multi-scale data augmentation:
./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • The whole training process will cost about three hours with 8 TESLA V100 GPUs.
  • To train TeViT with ResNet-50 or Swin-L backbone, please download the COCO pretrained weights from QueryInst.

Acknowledgement ❤️

This code is mainly based on mmdetection and QueryInst, thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving a star and citation 📝 :

@inproceedings{yang2022tevit,
  title={Temporally Efficient Vision Transformer for Video Instance Segmentation,
  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu and Zhao, Xun and Shan, Ying},
  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      =   {2022}
}
Issues
  • I can't find pycocotools.ytvos

    I can't find pycocotools.ytvos

    Traceback (most recent call last): File "./tools/train.py", line 17, in from mmdet.apis import init_random_seed, set_random_seed, train_detector File "/home/hss/TeViT-main/mmdet/apis/init.py", line 2, in from .inference import (async_inference_detector, inference_detector, File "/home/hss/TeViT-main/mmdet/apis/inference.py", line 12, in from mmdet.datasets import replace_ImageToTensor File "/home/hss/TeViT-main/mmdet/datasets/init.py", line 18, in Traceback (most recent call last): File "./tools/train.py", line 17, in from .youtubevis import YoutubeVISDataset File "/home/hss/TeViT-main/mmdet/datasets/youtubevis.py", line 9, in from pycocotools.ytvos import YTVOS ModuleNotFoundError: No module named 'pycocotools.ytvos' from mmdet.apis import init_random_seed, set_random_seed, train_detector File "/home/hss/TeViT-main/mmdet/apis/init.py", line 2, in from .inference import (async_inference_detector, inference_detector, File "/home/hss/TeViT-main/mmdet/apis/inference.py", line 12, in from mmdet.datasets import replace_ImageToTensor File "/home/hss/TeViT-main/mmdet/datasets/init.py", line 18, in from .youtubevis import YoutubeVISDataset File "/home/hss/TeViT-main/mmdet/datasets/youtubevis.py", line 9, in from pycocotools.ytvos import YTVOS

    documentation 
    opened by czj942650673 5
  • About demo test

    About demo test

    Thanks for your excellent work. For the test of image_demo.py(or video_demo.py) in /demo, use the demo.jpg(or demo.mp4) as input, there is a problem. Could you please provide some advice? Looking forward to your reply.

    /TeViT/mmdet/apis/inference.py:50: UserWarning: Class names are not saved in the checkpoint's meta data, use COCO classes by default. warnings.warn('Class names are not saved in the checkpoint's '

    Traceback (most recent call last): File "image_demo.py", line 65, in main(args) File "image_demo.py", line 35, in main result = inference_detector(model, args.img) File "/TeViT/mmdet/apis/inference.py", line 137, in inference_detector data['img_metas'] = [img_metas.data[0] for img_metas in data['img_metas']] TypeError: 'DataContainer' object is not iterable

    File "video_demo.py", line 61, in main() File "video_demo.py", line 47, in main result = inference_detector(model, frame) File "/TeViT/mmdet/apis/inference.py", line 137, in inference_detector data['img_metas'] = [img_metas.data[0] for img_metas in data['img_metas']] TypeError: 'DataContainer' object is not iterable

    opened by LuoYingzhao 1
  • Are you interested in creating a PR about this work for mmtracking?

    Are you interested in creating a PR about this work for mmtracking?

    Wonderful work! We're very interested in your work. VIS is a future key development direction in mmtracking. We'll appreciate it if you can create a PR about this work in mmtracking.

    opened by JingweiZhang12 1
  •  'TeViT is not in the models registry'

    'TeViT is not in the models registry'

    Hi, I am trying to run this repo and i have followed all the steps mentioned in the description.

    While running the inference code, I am getting the following error. python tools/test_vis.py configs/tevit/tevit_msgshift.py checkpoints/tevit_r50.pth Traceback (most recent call last): File "tools/test_vis.py", line 130, in <module> main(args) File "tools/test_vis.py", line 53, in main cfg_options=args.cfg_options) File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmdet/apis/inference.py", line 43, in init_detector model = build_detector(config.model, test_cfg=config.get('test_cfg')) File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmdet/models/builder.py", line 59, in build_detector cfg, default_args=dict(train_cfg=train_cfg, test_cfg=test_cfg)) File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/utils/registry.py", line 212, in build return self.build_func(*args, **kwargs, registry=self) File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg return build_from_cfg(cfg, registry, default_args) File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/utils/registry.py", line 45, in build_from_cfg f'{obj_type} is not in the {registry.name} registry') KeyError: 'TeViT is not in the models registry'

    Kindly, could you guide me if I want to create a file that directly runs this algorithm on a video or set of images and provide an out in the form of images and json format which gets saved in some folder?

    opened by PoojanPanchal 2
  • About YTVOS

    About YTVOS

    Hi, vealocia, Maybe we need to do it first to run this code. pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI So I think you can add it to requirements.🥰

    documentation 
    opened by pixeli99 1
Owner
Hust Visual Learning Team
Hust Visual Learning Team belongs to the Artificial Intelligence Research Institute in the School of EIC in HUST, Lead by @xinggangw
Hust Visual Learning Team
Official Pytorch implementation of "Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video", CVPR 2021

TCMR: Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video Qualtitative result Paper teaser video Introduction This r

Hongsuk Choi 184 Jun 21, 2022
This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Object Detection and Instance Segmentation.

Swin Transformer for Object Detection This repo contains the supported code and configuration files to reproduce object detection results of Swin Tran

Swin Transformer 1.2k Jun 23, 2022
Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds (CVPR 2022, Oral)

Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds (CVPR 2022, Oral) This is the official implementat

Yifan Zhang 145 Jun 21, 2022
[CVPR2021 Oral] End-to-End Video Instance Segmentation with Transformers

VisTR: End-to-End Video Instance Segmentation with Transformers This is the official implementation of the VisTR paper: Installation We provide instru

Yuqing Wang 643 Jun 24, 2022
[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

TubeDETR: Spatio-Temporal Video Grounding with Transformers Website • STVG Demo • Paper This repository provides the code for our paper. This includes

Antoine Yang 61 Jun 25, 2022
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [Paper] [Project Website] This repository holds the source code, pretra

Humam Alwassel 70 Jun 23, 2022
A PyTorch Reimplementation of TecoGAN: Temporally Coherent GAN for Video Super-Resolution

TecoGAN-PyTorch Introduction This is a PyTorch reimplementation of TecoGAN: Temporally Coherent GAN for Video Super-Resolution (VSR). Please refer to

null 151 Jun 19, 2022
Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORAL)

Scribble-Supervised LiDAR Semantic Segmentation Dataset and code release for the paper Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORA

null 52 Jun 1, 2022
FreeSOLO for unsupervised instance segmentation, CVPR 2022

FreeSOLO: Learning to Segment Objects without Annotations This project hosts the code for implementing the FreeSOLO algorithm for unsupervised instanc

NVIDIA Research Projects 189 Jun 26, 2022
[ArXiv 2021] Data-Efficient Instance Generation from Instance Discrimination

InsGen - Data-Efficient Instance Generation from Instance Discrimination Data-Efficient Instance Generation from Instance Discrimination Ceyuan Yang,

GenForce: May Generative Force Be with You 78 Jun 2, 2022
"MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction" (CVPRW 2022) & (Winner of NTIRE 2022 Challenge on Spectral Reconstruction from RGB)

MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction (CVPRW 2022) Yuanhao Cai, Jing Lin, Zudi Lin, Haoqian Wang, Yulun Z

Yuanhao Cai 171 Jun 15, 2022
[CVPR 2022] CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation

CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation Prerequisite Please create and activate the following conda envrionment. To r

Qin Wang 42 Jun 24, 2022
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Phil Wang 10.5k Jun 20, 2022
This repository builds a basic vision transformer from scratch so that one beginner can understand the theory of vision transformer.

vision-transformer-from-scratch This repository includes several kinds of vision transformers from scratch so that one beginner can understand the the

null 1 Dec 24, 2021
Leveraging Instance-, Image- and Dataset-Level Information for Weakly Supervised Instance Segmentation

Leveraging Instance-, Image- and Dataset-Level Information for Weakly Supervised Instance Segmentation This paper has been accepted and early accessed

Yun Liu 36 Mar 12, 2022
Official pytorch implementation of paper "Inception Convolution with Efficient Dilation Search" (CVPR 2021 Oral).

IC-Conv This repository is an official implementation of the paper Inception Convolution with Efficient Dilation Search. Getting Started Download Imag

Jie Liu 62 Mar 21, 2022
Code for CVPR 2021 oral paper "Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts"

Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts The rapid progress in 3D scene understanding has come with growing dem

Facebook Research 151 Jun 19, 2022
Implementation of "Efficient Regional Memory Network for Video Object Segmentation" (Xie et al., CVPR 2021).

RMNet This repository contains the source code for the paper Efficient Regional Memory Network for Video Object Segmentation. Cite this work @inprocee

Haozhe Xie 73 Jun 17, 2022
Stratified Transformer for 3D Point Cloud Segmentation (CVPR 2022)

Stratified Transformer for 3D Point Cloud Segmentation Xin Lai*, Jianhui Liu*, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, Jiaya Jia

DV Lab 102 Jun 21, 2022