Official implementation of "CAT: Cross Attention in Vision Transformer".

Last update: Dec 15, 2022

CAT: Cross Attention in Vision Transformer

This is the official implementation of "CAT: Cross Attention in Vision Transformer".

Abstract

Since Transformer has found widespread use in NLP, the potential of Transformer in CV has been realized and has inspired many new approaches. However, replacing word tokens with image patches after tokenizing the image requires vast computation (e.g., ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism in Transformer termed Cross Attention, which alternates attention inside each image patch, instead of over the whole image, to capture local information, and applies attention between image patches divided from single-channel feature maps to capture global information. Both operations require less computation than standard self-attention in Transformer. By alternately applying attention within patches and between patches, we implement cross attention to maintain performance at lower computational cost and build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone.
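To make the cost claim in the abstract concrete, here is a rough back-of-the-envelope sketch, not the paper's own complexity analysis. It counts only the multiply-adds of the two matrix products inside attention (QK^T and attention times V) and ignores the linear projections and softmax. The function names and the feature-map size (56x56, 96 channels, 7x7 patches) are hypothetical values chosen for illustration.

```python
def global_attention_cost(h, w, c):
    """Attention over all h*w tokens, each of width c."""
    tokens = h * w
    return 2 * tokens * tokens * c


def inner_patch_attention_cost(h, w, c, n):
    """Attention restricted to each n x n patch (n*n tokens of width c)."""
    patches = (h // n) * (w // n)
    tokens_per_patch = n * n
    return patches * 2 * tokens_per_patch * tokens_per_patch * c


def cross_patch_attention_cost(h, w, c, n):
    """Attention between the patches of each single-channel feature map.

    Assumption: every channel is processed on its own, its (h/n)*(w/n)
    patches act as tokens, and each token is the flattened n*n patch.
    """
    patches = (h // n) * (w // n)
    token_dim = n * n
    return c * 2 * patches * patches * token_dim


if __name__ == "__main__":
    h, w, c, n = 56, 56, 96, 7  # hypothetical sizes, not taken from the paper
    full = global_attention_cost(h, w, c)
    inner = inner_patch_attention_cost(h, w, c, n)
    cross = cross_patch_attention_cost(h, w, c, n)
    print(f"global attention : {full:,}")
    print(f"inner-patch      : {inner:,} ({full // inner}x fewer)")
    print(f"cross-patch      : {cross:,} ({full // cross}x fewer)")
```

Under these assumptions, inner-patch attention is cheaper than global attention by a factor of HW/N^2 (the number of patches), and cross-patch attention by a factor of N^2 (the number of pixels per patch).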

CAT achieves strong performance on COCO object detection (implemented with mmdetection) and ADE20K semantic segmentation (implemented with mmsegmentation).

Pretrained Models and Results on ImageNet-1K

name	resolution	acc@1	acc@5	#params	FLOPs	model	log
CAT-T	224x224	80.3	95.0	17M	2.8G	github	github
CAT-S^*	224x224	81.8	95.6	37M	5.9G	github	github
CAT-B	224x224	82.8	96.1	52M	8.9G	github	github
CAT-T-v2	224x224	81.7	95.5	36M	3.9G	Coming	Coming

Note: ^* indicates a new version of the model and log.

Models and Results on Object Detection (COCO 2017 val)

Backbone	Method	pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs	model	log
CAT-S	Mask R-CNN⁺	ImageNet-1K	1x	41.6	38.6	57M	295G	github	github
CAT-B	Mask R-CNN⁺	ImageNet-1K	1x	41.8	38.7	71M	356G	github	github
CAT-S	FCOS	ImageNet-1K	1x	40.0	-	45M	245G	github	github
CAT-B	FCOS	ImageNet-1K	1x	41.0	-	59M	303G	github	github
CAT-S	ATSS	ImageNet-1K	1x	42.0	-	45M	243G	github	github
CAT-B	ATSS	ImageNet-1K	1x	42.5	-	59M	303G	github	github
CAT-S	RetinaNet	ImageNet-1K	1x	40.1	-	47M	276G	github	github
CAT-B	RetinaNet	ImageNet-1K	1x	41.4	-	62M	337G	github	github
CAT-S	Cascade R-CNN	ImageNet-1K	1x	44.1	-	82M	270G	github	github
CAT-B	Cascade R-CNN	ImageNet-1K	1x	44.8	-	96M	330G	github	github
CAT-S	Cascade R-CNN⁺	ImageNet-1K	1x	45.2	-	82M	270G	github	github
CAT-B	Cascade R-CNN⁺	ImageNet-1K	1x	46.3	-	96M	330G	github	github

Note: ⁺ indicates multi-scale training.

Models and Results on Semantic Segmentation (ADE20K val)

Backbone	Method	pretrain	Crop Size	Lr Schd	mIoU	mIoU (ms+flip)	#params	FLOPs	model	log
CAT-S	Semantic FPN	ImageNet-1K	512x512	80K	40.6	42.1	41M	214G	github	github
CAT-B	Semantic FPN	ImageNet-1K	512x512	80K	42.2	43.6	55M	276G	github	github
CAT-S	Semantic FPN	ImageNet-1K	512x512	160K	42.2	42.8	41M	214G	github	github
CAT-B	Semantic FPN	ImageNet-1K	512x512	160K	43.2	44.9	55M	276G	github	github

Citing CAT

You can cite the paper as:

@article{lin2021cat,
  title={CAT: Cross Attention in Vision Transformer},
  author={Hezheng Lin and Xing Cheng and Xiangyu Wu and Fan Yang and Dong Shen and Zhongyuan Wang and Qing Song and Wei Yuan},
  journal={arXiv preprint arXiv:2106.05786},
  year={2021}
}

Getting Started

Please refer to get_started.

Acknowledgement

Our implementation is mainly based on Swin.



Releases (v1.0)

v1.0 (Jun 5, 2022)
