Unified tracking framework with a single appearance model

Overview

UniTrack Logo


Paper: Do different tracking tasks require different appearance models?

[ArXiv] (coming soon) [Project Page] (coming soon)

UniTrack is a simple, unified framework for a wide range of visual tracking tasks.

As an important problem in computer vision, tracking has been fragmented into a multitude of different experimental setups. As a consequence, the literature has fragmented too, and novel approaches proposed by the community are usually specialized to fit only one specific setup. To understand to what extent this specialization is actually necessary, we present UniTrack, a solution that addresses multiple different tracking tasks within the same framework. All tasks share the same universal appearance model.

Tasks & Framework

(Figure: tracking task setups and the UniTrack framework)

Tasks

We classify existing tracking tasks along four axes: (1) single or multiple targets; (2) targets specified by the user or by an automatic detector; (3) observation format (bounding box/mask/pose); (4) class-agnostic or class-specific (e.g. humans/vehicles). We mainly experiment on five tasks: SOT, VOS, MOT, MOTS, and PoseTrack. Task setups are summarized in the figure above.

Appearance model

The appearance model is the only learnable component in UniTrack. It should provide a universal visual representation and is usually pre-trained on a large-scale dataset in a supervised or unsupervised manner. Typical examples include ImageNet pre-trained ResNets (supervised) and recent self-supervised models such as MoCo and SimCLR (unsupervised).
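For illustration, below is a minimal sketch of such a frozen appearance model: an ImageNet pre-trained ResNet-18 from torchvision truncated so that it returns a spatial feature map. The class name and the choice of truncation layer (layer3) are our assumptions for this sketch, not the official implementation.

```python
# Minimal sketch (assumption, not the official UniTrack code): a frozen
# ImageNet pre-trained ResNet-18 used as a generic appearance model.
import torch
import torchvision

class AppearanceModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(pretrained=True)
        # Keep layers up to layer3 so the output is a spatial feature map
        # (stride 16, 256 channels) rather than a single pooled vector.
        self.features = torch.nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3,
        )
        for p in self.parameters():
            p.requires_grad = False  # frozen: no task-specific training

    @torch.no_grad()
    def forward(self, frame):
        # frame: (B, 3, H, W), ImageNet-normalized; returns (B, 256, H/16, W/16)
        return self.features(frame)
```

Any backbone exposing dense features (e.g. a MoCo- or SimCLR-trained ResNet) could be dropped in the same way.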

Propagation and Association

Propagation and association are the two fundamental algorithmic building blocks in UniTrack. Both take features extracted by the appearance model as input. For propagation we adopt existing methods such as cross-correlation, DCF, and mask propagation. For association we employ a simple algorithm and develop a novel similarity metric to make full use of the appearance model.
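As a rough illustration of the association step, the sketch below matches track embeddings to detection embeddings with plain cosine similarity and Hungarian matching. The helper name, the pooled-embedding inputs, and the similarity threshold are assumptions for this sketch; it does not reproduce the paper's own similarity metric.

```python
# Minimal association sketch (assumption, not the paper's metric):
# optimal matching of tracks to detections by appearance similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, sim_threshold=0.5):
    """track_feats: (T, D), det_feats: (N, D); L2-normalized embeddings
    pooled from the appearance model. Returns matches and unmatched indices."""
    sim = track_feats @ det_feats.T               # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)      # Hungarian on cost = -similarity
    matches = [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_threshold]
    unmatched_tracks = sorted(set(range(len(track_feats))) - {r for r, _ in matches})
    unmatched_dets = sorted(set(range(len(det_feats))) - {c for _, c in matches})
    return matches, unmatched_tracks, unmatched_dets

if __name__ == "__main__":
    t = np.eye(3, 8)          # 3 dummy track embeddings (unit-norm rows)
    d = np.eye(4, 8)          # 4 dummy detection embeddings
    print(associate(t, d))    # tracks 0-2 match detections 0-2; detection 3 unmatched
```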

Results

Below we show results of UniTrack with a simple ImageNet pre-trained ResNet-18 as the appearance model. More results (other tasks/datasets, more visualizations) can be found in results.md.

Qualitative results

Single Object Tracking (SOT) on OTB-2015

Video Object Segmentation (VOS) on DAVIS-2017 val split

Multiple Object Tracking (MOT) on MOT-16 test set private detector track (Detections from FairMOT)

Multiple Object Tracking and Segmentation (MOTS) on MOTS challenge test set (Detections from COSTA_st)

Pose Tracking on PoseTrack-2018 val split (Detections from LightTrack)

Quantitative results

Single Object Tracking (SOT) on OTB-2015

| Method | SiamFC | SiamRPN | SiamRPN++ | UDT* | UDT+* | LUDT* | LUDT+* | UniTrack_XCorr* | UniTrack_DCF* |
| ------ | ------ | ------- | --------- | ---- | ----- | ----- | ------ | --------------- | ------------- |
| AUC    | 58.2   | 63.7    | 69.6      | 59.4 | 63.2  | 60.2  | 63.9   | 55.5            | 61.8          |

* indicates unsupervised methods

Video Object Segmentation (VOS) on DAVIS-2017 val split

| Method | SiamMask | FeelVOS | STM  | Colorization* | TimeCycle* | UVC* | CRW* | VFS* | UniTrack* |
| ------ | -------- | ------- | ---- | ------------- | ---------- | ---- | ---- | ---- | --------- |
| J-mean | 54.3     | 63.7    | 79.2 | 34.6          | 40.1       | 56.7 | 64.8 | 66.5 | 58.4      |

* indicates unsupervised methods

Multiple Object Tracking (MOT) on MOT-16 test set private detector track

| Method | POI  | DeepSORT-2 | JDE  | CTrack | TubeTK | TraDes | CSTrack | FairMOT* | UniTrack* |
| ------ | ---- | ---------- | ---- | ------ | ------ | ------ | ------- | -------- | --------- |
| IDF-1  | 65.1 | 62.2       | 55.8 | 57.2   | 62.2   | 64.7   | 71.8    | 72.8     | 71.8      |
| IDs    | 805  | 781        | 1544 | 1897   | 1236   | 1144   | 1071    | 1074     | 683       |
| MOTA   | 66.1 | 61.4       | 64.4 | 67.6   | 66.9   | 70.1   | 70.7    | 74.9     | 74.7      |

* indicates methods using the same detections

Multiple Object Tracking and Segmentation (MOTS) on MOTS challenge test set

| Method | TrackRCNN | SORTS | PointTrack | GMPHD | COSTA_st* | UniTrack* |
| ------ | --------- | ----- | ---------- | ----- | --------- | --------- |
| IDF-1  | 42.7      | 57.3  | 42.9       | 65.6  | 70.3      | 67.2      |
| IDs    | 567       | 577   | 868        | 566   | 421       | 622       |
| sMOTA  | 40.6      | 55.0  | 62.3       | 69.0  | 70.2      | 68.9      |

* indicates methods using the same detections

Pose Tracking on PoseTrack-2018 val split

| Method | MDPN | OpenSVAI | Miracle | KeyTrack | LightTrack* | UniTrack* |
| ------ | ---- | -------- | ------- | -------- | ----------- | --------- |
| IDF-1  | -    | -        | -       | -        | 52.2        | 73.2      |
| IDs    | -    | -        | -       | -        | 3024        | 6760      |
| sMOTA  | 50.6 | 62.4     | 64.0    | 66.6     | 64.8        | 63.5      |

* indicates methods using the same detections

Getting started

Demo

Update log

[2021.6.24]: Started writing docs, please stay tuned!

Acknowledgement

VideoWalk by Allan A. Jabri

SOT code by Zhipeng Zhang
