Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed, and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of spatial attention and demonstrate that a carefully devised yet simple spatial attention mechanism performs favourably against state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely Twins-PCPVT and Twins-SVT. Our proposed architectures are highly efficient and easy to implement, involving only matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image-level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks.

Figure 1. Twins-SVT-S Architecture (Right side shows the inside of two consecutive Transformer Encoders).
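The abstract's point that spatial attention design matters for dense prediction can be made concrete with some rough arithmetic. The sketch below is a generic illustration, not the authors' implementation: it counts the multiply-accumulates in the two attention matrix multiplications (`Q @ K^T` and `A @ V`) when every token attends to every other token, and again when attention is restricted to non-overlapping local windows. The function names and the example sizes (a 56x56 feature map, 64 channels, 7x7 windows) are assumptions chosen for illustration and are not taken from this README.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus A @ V over one set of tokens."""
    return 2 * num_tokens * num_tokens * dim


def global_attention_macs(height: int, width: int, dim: int) -> int:
    """Cost when every token on the feature map attends to every other token."""
    return attention_macs(height * width, dim)


def windowed_attention_macs(height: int, width: int, dim: int, window: int) -> int:
    """Cost when attention is computed independently inside each window."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    return num_windows * attention_macs(window * window, dim)


if __name__ == "__main__":
    # Illustrative sizes only (assumed, not read from the README).
    h, w, d, k = 56, 56, 64, 7
    full = global_attention_macs(h, w, d)
    local = windowed_attention_macs(h, w, d, k)
    print(f"global: {full:,} MACs")
    print(f"windowed: {local:,} MACs")
    print(f"reduction factor: {full // local}")
```

Under these assumptions the reduction factor equals the number of windows, which is why windowing pays off most on the high-resolution feature maps that dense prediction needs. Purely local windows cannot exchange information with each other, so a practical design also needs some form of global communication.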

Model Zoo

Image Classification

We provide baseline Twins models pretrained on ImageNet 2012.

Name	Alias in paper	Top-1 acc. (%)	FLOPs (G)	#params (M)	url
PVT+CPVT-Small	Twins-PCPVT-S	81.2	3.7	24.1	pcpvt_small.pth
PVT+CPVT-Base	Twins-PCPVT-B	82.7	6.4	43.8	pcpvt_base.pth
ALT-GVT-Small	Twins-SVT-S	81.3	2.8	24	alt_gvt_small.pth
ALT-GVT-Base	Twins-SVT-B	83.1	8.3	56	alt_gvt_base.pth
ALT-GVT-Large	Twins-SVT-L	83.3	14.8	99.2	alt_gvt_large.pth

Note: Our code will be released soon.
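When choosing a backbone from the table above, it can help to look at the accuracy/FLOPs trade-off rather than at accuracy alone. The following is a minimal sketch that copies the table rows into a list and offers two hypothetical helpers, `best_under_budget` and `pareto_front` (names introduced here for illustration, not part of any released code).

```python
# Rows copied from the Model Zoo table:
# (alias in paper, top-1 accuracy in %, FLOPs in G, parameters in M).
MODELS = [
    ("Twins-PCPVT-S", 81.2, 3.7, 24.1),
    ("Twins-PCPVT-B", 82.7, 6.4, 43.8),
    ("Twins-SVT-S", 81.3, 2.8, 24.0),
    ("Twins-SVT-B", 83.1, 8.3, 56.0),
    ("Twins-SVT-L", 83.3, 14.8, 99.2),
]


def best_under_budget(models, max_gflops):
    """Return the most accurate model within the FLOPs budget, or None."""
    candidates = [m for m in models if m[2] <= max_gflops]
    return max(candidates, key=lambda m: m[1]) if candidates else None


def pareto_front(models):
    """Keep models that no other model matches or beats on both accuracy and FLOPs."""
    front = []
    for name, acc, flops, params in models:
        dominated = any(
            o_acc >= acc and o_flops <= flops and (o_acc > acc or o_flops < flops)
            for _, o_acc, o_flops, _ in models
        )
        if not dominated:
            front.append((name, acc, flops, params))
    return front


if __name__ == "__main__":
    print(best_under_budget(MODELS, max_gflops=5.0))
    for row in pareto_front(MODELS):
        print(row)
```

The comparison uses only the numbers listed in the table; it says nothing about latency or memory on a particular device, which would need to be measured separately.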

Citation

@article{chu2021Twins,
	title={Twins: Revisiting the Design of Spatial Attention in Vision Transformers},
	author={Xiangxiang Chu and Zhi Tian and Yuqing Wang and Bo Zhang and Haibing Ren and Xiaolin Wei and Huaxia Xia and Chunhua Shen},
	journal={arXiv preprint arXiv:2104.13840},
	url={https://arxiv.org/pdf/2104.13840.pdf},
	year={2021}
}
