ReferFormer - Official Implementation of ReferFormer

Last update: Dec 29, 2022

Overview

The official implementation of the paper:

Language as Queries for Referring Video Object Segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo

Abstract

In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer. It views the language as queries and directly attends to the most relevant regions in the video frames. Concretely, we introduce a small set of object queries conditioned on the language as the input to the Transformer. In this manner, all the queries are obligated to find the referred objects only. They are eventually transformed into dynamic kernels which capture the crucial object-level information, and play the role of convolution filters to generate the segmentation masks from feature maps. The object tracking is achieved naturally by linking the corresponding queries across frames. This mechanism greatly simplifies the pipeline and the end-to-end framework is significantly different from the previous methods. Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.
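The dynamic-kernel idea in the abstract can be illustrated with a toy example. This is a minimal sketch in plain Python, not the code in this repository: it assumes a single 1x1 filter per query and nested-list feature maps, and the names `dynamic_mask` and `track_query` are hypothetical.

```python
def dynamic_mask(features, kernel, bias=0.0):
    """Apply one query's 1x1 dynamic filter to a C x H x W feature map.

    features: nested lists indexed as features[c][y][x]
    kernel:   list of C weights produced for one query (assumed given here)
    Returns a binary H x W mask.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            logit = bias + sum(kernel[c] * features[c][y][x] for c in range(channels))
            # sigmoid(logit) > 0.5 is equivalent to logit > 0
            row.append(1 if logit > 0 else 0)
        mask.append(row)
    return mask


def track_query(frames, kernels_per_frame, bias=0.0):
    """Link one query across frames: the same query index yields one mask per frame."""
    return [dynamic_mask(f, k, bias) for f, k in zip(frames, kernels_per_frame)]


# Toy 2-channel, 2x2 feature map and a kernel that favours channel 0.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
kernel = [4.0, -4.0]
print(dynamic_mask(features, kernel))
```

In the sketch, the kernel weights stand in for what the Transformer would predict from a language-conditioned query; only the filtering step is shown.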

Requirements

We tested the code in the following environment; other versions may also be compatible:

CUDA 11.1
Python 3.7
PyTorch 1.8.1
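To compare a local setup against the versions listed above, a small check like the following may help. It is a sketch that is not part of this repository; `environment_report` is a hypothetical helper, and it simply reports `None` for PyTorch and CUDA when PyTorch is not installed.

```python
import sys

# Versions listed in the Requirements section.
TESTED = {"python": "3.7", "torch": "1.8.1", "cuda": "11.1"}


def environment_report():
    """Return the local Python, PyTorch and CUDA versions (None if unavailable)."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    try:
        import torch
    except ImportError:
        report["torch"] = None
        report["cuda"] = None
    else:
        report["torch"] = torch.__version__
        report["cuda"] = torch.version.cuda  # None on CPU-only builds
    return report


for name, found in environment_report().items():
    print("{}: found {}, tested with {}".format(name, found, TESTED[name]))
```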

Installation

Please refer to install.md for installation.

Data Preparation

Please refer to data.md for data preparation.

We provide pretrained models for different visual backbones. You may download them here and put them in the directory pretrained_weights.

After organizing the data, we expect the directory structure to be the following:

ReferFormer/
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis/
│   ├── a2d_sentences/
│   ├── jhmdb_sentences/
├── davis2017/
├── datasets/
├── models/
├── scripts/
├── tools/
├── util/
├── pretrained_weights/
├── eval_davis.py
├── main.py
├── engine.py
├── inference_ytvos.py
├── inference_davis.py
├── opts.py
...
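The layout above can be checked before training or inference. The following is a minimal sketch, not a script shipped with this repository; `missing_dirs` is a hypothetical helper, and it only checks the dataset folders and pretrained_weights from the tree above.

```python
from pathlib import Path

# Data and weight directories taken from the tree above.
EXPECTED_DIRS = [
    "data/ref-youtube-vos",
    "data/ref-davis",
    "data/a2d_sentences",
    "data/jhmdb_sentences",
    "pretrained_weights",
]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Run from the repository root; an empty list means the layout matches.
print(missing_dirs("."))
```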

Model Zoo

All the models are trained using 8 NVIDIA Tesla V100 GPUs. You may change the --backbone parameter to use different backbones (see here).

Note: If you encounter an out-of-memory (OOM) error, please add the flag --use_checkpoint (we use this flag for the Swin-L, Video-Swin-S and Video-Swin-B models).

Ref-Youtube-VOS

To evaluate the results, please upload the zip file to the competition server.
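Packing the predictions into a zip file can be done with the Python standard library. This is a sketch under assumptions: `zip_results` is a hypothetical helper, the directory and archive names are placeholders, and the folder layout required inside the archive should be checked against the competition server's instructions.

```python
import shutil
from pathlib import Path


def zip_results(results_dir, archive_stem="submission"):
    """Zip the contents of results_dir into <archive_stem>.zip and return its path.

    The contents of results_dir end up at the top level of the archive.
    Keep archive_stem outside results_dir so the archive does not include itself.
    """
    results_dir = Path(results_dir)
    if not results_dir.is_dir():
        raise FileNotFoundError("results directory not found: {}".format(results_dir))
    return shutil.make_archive(str(archive_stem), "zip", root_dir=str(results_dir))
```

For example, `zip_results("path/to/predictions", "submission")` would write `submission.zip` in the current directory, where `path/to/predictions` is a placeholder for wherever the inference output was saved.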

Backbone	J&F	CFBI J&F	Pretrain	Model	Submission	CFBI Submission
ResNet-50	55.6	59.4	weight	model	link	link
ResNet-101	57.3	60.3	weight	model	link	link
Swin-T	58.7	61.2	weight	model	link	link
Swin-L	62.4	63.3	weight	model	link	link
Video-Swin-T*	55.8	-	-	model	link	-
Video-Swin-T	59.4	-	weight	model	link	-
Video-Swin-S	60.1	-	weight	model	link	-
Video-Swin-B	62.9	-	weight	model	link	-

* indicates the model is trained from scratch.

Ref-DAVIS17

As described in the paper, we report the results using the model trained on Ref-Youtube-VOS without fine-tuning.

Backbone	J&F	J	F	Model
ResNet-50	58.5	55.8	61.3	model
Swin-L	60.5	57.6	63.4	model
Video-Swin-B	61.1	58.1	64.1	model
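For readers unfamiliar with the columns: J is the region similarity (mask IoU), F is the contour accuracy, and J&F is their average. The sketch below is illustrative only and is not the evaluation code in eval_davis.py; the function names are hypothetical, and treating two empty masks as a perfect match is an assumption of this sketch. It also checks that the J&F column above agrees with the mean of J and F up to rounding.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks given as nested lists."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: the mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# (J&F, J, F) rows copied from the table above.
rows = {
    "ResNet-50": (58.5, 55.8, 61.3),
    "Swin-L": (60.5, 57.6, 63.4),
    "Video-Swin-B": (61.1, 58.1, 64.1),
}
for name, (jf, j, f) in rows.items():
    # Reported values are rounded, so allow a small tolerance.
    print(name, abs(j_and_f(j, f) - jf) <= 0.1)
```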

A2D-Sentences

The pretrained models are the same as those provided for Ref-Youtube-VOS.

Backbone	Overall IoU	Mean IoU	mAP	Pretrain	Model
Video-Swin-T	77.6	69.6	52.8	weight	model | log
Video-Swin-S	77.7	69.8	53.9	weight	model | log
Video-Swin-B	78.6	70.3	55.0	weight	model | log
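The two IoU columns differ in how they aggregate: Overall IoU divides the total intersection by the total union over all test samples, while Mean IoU averages the per-sample IoU. The following is a minimal sketch of that difference, not the evaluation code of this repository; the helper names are hypothetical, masks are flattened to lists of 0/1, and a sample with an empty union is scored 0 here by assumption.

```python
def intersection_union(pred, gt):
    """Count intersection and union pixels of two flattened binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def overall_and_mean_iou(pairs):
    """Return (overall IoU, mean IoU) for a list of (pred, gt) mask pairs."""
    total_inter = total_union = 0
    per_sample = []
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
        per_sample.append(inter / union if union else 0.0)
    overall = total_inter / total_union if total_union else 0.0
    mean = sum(per_sample) / len(per_sample) if per_sample else 0.0
    return overall, mean


# A large object segmented perfectly and a small one missed entirely.
pairs = [
    ([1, 1, 1, 1], [1, 1, 1, 1]),
    ([1, 0, 0, 0], [0, 1, 0, 0]),
]
print(overall_and_mean_iou(pairs))
```

Because Overall IoU pools pixels, large objects dominate it, whereas Mean IoU weights every sample equally; in the toy example the missed small object lowers Mean IoU more than Overall IoU.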

JHMDB-Sentences

As described in the paper, we report the results using the model trained on A2D-Sentences without fine-tuning.

Backbone	Overall IoU	Mean IoU	mAP	Model
Video-Swin-T	71.9	71.0	42.2	model
Video-Swin-S	72.8	71.5	42.4	model
Video-Swin-B	73.0	71.8	43.7	model

Get Started

Please see Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences for details.

Acknowledgement

This repo is based on Deformable DETR and VisTR. We also refer to the repositories MDETR and MTTR. We thank the authors for their wonderful work.

Citation

@article{wu2022referformer,
      title={Language as Queries for Referring Video Object Segmentation}, 
      author={Jiannan Wu and Yi Jiang and Peize Sun and Zehuan Yuan and Ping Luo},
      journal={arXiv preprint arXiv:2201.00487},
      year={2022},
}

Owner

Jonas Wu
