Official implementation of "CAT: Cross Attention in Vision Transformer".

Last update: Dec 15, 2022

CAT: Cross Attention in Vision Transformer

This is the official implementation of "CAT: Cross Attention in Vision Transformer".

Abstract

Since the Transformer found widespread use in NLP, its potential in CV has been recognized and has inspired many new approaches. However, replacing word tokens with image patches after tokenizing the image (e.g., ViT) requires a vast amount of computation, which bottlenecks model training and inference. In this paper, we propose a new attention mechanism for the Transformer termed Cross Attention, which applies attention inside each image patch, instead of over the whole image, to capture local information, and applies attention between image patches, which are divided from single-channel feature maps, to capture global information. Both operations require less computation than standard self-attention in the Transformer. By alternating attention inside patches and attention between patches, we implement cross attention to maintain performance at a lower computational cost and build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone.
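The claim that both operations are cheaper than standard self-attention can be checked with a rough operation count. The sketch below is not taken from this repository: the function names are hypothetical, and it assumes a simple counting convention (four linear projections plus two matrix products per attention layer, with softmax, heads, and biases ignored). It considers an H x W grid of tokens with C channels and a patch size of N x N, where H and W are assumed to be divisible by N.

```python
# Rough operation counts for three attention layouts (illustrative sketch only).
# Assumption: cost = 4 linear projections (Q, K, V, output) + 2 matrix products
# (Q @ K^T and attn @ V). These numbers are NOT the FLOPs reported in the tables.

def msa_flops(h, w, c):
    """Standard self-attention over all h*w tokens of dimension c."""
    tokens = h * w
    return 4 * tokens * c * c + 2 * tokens * tokens * c

def ipsa_flops(h, w, c, n):
    """Attention inside each n x n patch: every token attends to n*n tokens."""
    tokens = h * w
    return 4 * tokens * c * c + 2 * (n * n) * tokens * c

def cpsa_flops(h, w, c, n):
    """Attention between patches of each single-channel feature map.

    Per channel there are (h/n)*(w/n) patch tokens, each of dimension n*n.
    """
    patches = (h // n) * (w // n)
    dim = n * n
    per_channel = 4 * patches * dim * dim + 2 * patches * patches * dim
    return c * per_channel

if __name__ == "__main__":
    # Hypothetical stage shape chosen for illustration: 56x56 tokens, 96 channels, 7x7 patches.
    h, w, c, n = 56, 56, 96, 7
    print("standard self-attention:", msa_flops(h, w, c))
    print("inner-patch attention:  ", ipsa_flops(h, w, c, n))
    print("cross-patch attention:  ", cpsa_flops(h, w, c, n))
```

Under these assumptions, the quadratic term of standard self-attention grows with (HW)^2, while the inner-patch variant grows linearly in HW for a fixed N, and the cross-patch variant divides the quadratic term by N^2.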

CAT achieves strong performance on COCO object detection (implemented with mmdetection) and ADE20K semantic segmentation (implemented with mmsegmentation).

Pretrained Models and Results on ImageNet-1K

name	resolution	acc@1	acc@5	#params	FLOPs	model	log
CAT-T	224x224	80.3	95.0	17M	2.8G	github	github
CAT-S^*	224x224	81.8	95.6	37M	5.9G	github	github
CAT-B	224x224	82.8	96.1	52M	8.9G	github	github
CAT-T-v2	224x224	81.7	95.5	36M	3.9G	Coming	Coming

Note: ^* indicates a new version of the model and log.

Models and Results on Object Detection (COCO 2017 val)

Backbone	Method	pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs	model	log
CAT-S	Mask R-CNN⁺	ImageNet-1K	1x	41.6	38.6	57M	295G	github	github
CAT-B	Mask R-CNN⁺	ImageNet-1K	1x	41.8	38.7	71M	356G	github	github
CAT-S	FCOS	ImageNet-1K	1x	40.0	-	45M	245G	github	github
CAT-B	FCOS	ImageNet-1K	1x	41.0	-	59M	303G	github	github
CAT-S	ATSS	ImageNet-1K	1x	42.0	-	45M	243G	github	github
CAT-B	ATSS	ImageNet-1K	1x	42.5	-	59M	303G	github	github
CAT-S	RetinaNet	ImageNet-1K	1x	40.1	-	47M	276G	github	github
CAT-B	RetinaNet	ImageNet-1K	1x	41.4	-	62M	337G	github	github
CAT-S	Cascade R-CNN	ImageNet-1K	1x	44.1	-	82M	270G	github	github
CAT-B	Cascade R-CNN	ImageNet-1K	1x	44.8	-	96M	330G	github	github
CAT-S	Cascade R-CNN⁺	ImageNet-1K	1x	45.2	-	82M	270G	github	github
CAT-B	Cascade R-CNN⁺	ImageNet-1K	1x	46.3	-	96M	330G	github	github

Note: ⁺ indicates multi-scale training.

Models and Results on Semantic Segmentation (ADE20K val)

Backbone	Method	pretrain	Crop Size	Lr Schd	mIoU	mIoU (ms+flip)	#params	FLOPs	model	log
CAT-S	Semantic FPN	ImageNet-1K	512x512	80K	40.6	42.1	41M	214G	github	github
CAT-B	Semantic FPN	ImageNet-1K	512x512	80K	42.2	43.6	55M	276G	github	github
CAT-S	Semantic FPN	ImageNet-1K	512x512	160K	42.2	42.8	41M	214G	github	github
CAT-B	Semantic FPN	ImageNet-1K	512x512	160K	43.2	44.9	55M	276G	github	github

Citing CAT

You can cite the paper as:

@article{lin2021cat,
  title={CAT: Cross Attention in Vision Transformer},
  author={Hezheng Lin and Xing Cheng and Xiangyu Wu and Fan Yang and Dong Shen and Zhongyuan Wang and Qing Song and Wei Yuan},
  journal={arXiv preprint arXiv:2106.05786},
  year={2021}
}

Getting Started

Please refer to get_started.

Acknowledgement

Our implementation is mainly based on Swin.



Releases(v1.0)

v1.0(Jun 5, 2022)
