A curated list and survey of awesome Vision Transformers.

Overview
awesome-vit

English | 简体中文

A curated list and survey of awesome Vision Transformers.

You can use mind mapping software to open the mind mapping source file. You can also download the mind mapping HD pictures if you just want to browse them.

Contents

Survey

Only typical algorithms are listed in each category.

Image Classification

Chinese Blogs

Attention-based

image

Training Strategy

image

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
Model Improvements
Tokenization Module

image

Image to Token:

  • Non-overlapping Patch Embedding

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Overlapping Patch Embedding

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

Token to Token:

  • Fixed sampling window tokenization
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Dynamic sampling tokenization
    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
    • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]
Position Encoding Module

image

Explicit position encoding:

  • Absolute position encoding
    • [Transformer] Attention is All You Need] (NIPS 2017-2017.06) [Paper]
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
  • Relative position encoding
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

Implicit position encoding:

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Attention Module

image

Include only global attention:

  • Multi-Head attention module

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
  • Reduce global attention computation

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

    • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Generalized linear attention

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

Introduce extra local attention:

  • Local window mode

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
    • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
    • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • Introduce convolutional local inductive bias

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
    • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
  • Sparse attention

    • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]
FFN Module

image

Improve performance with Conv's local information extraction capability:

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
Normalization Module Location

image

  • Pre Normalization

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Post Normalization

    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
Classification Prediction Head Module

image

  • Class Tokens

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
  • Avgerage Pooling

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Others

image

(1) How to output multi-scale feature map

  • Patch merging

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
    • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • Pooling attention

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper][Imporved MViT]

    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Dilation convolution

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

(2) How to train a deeper Transformer

  • [Cait] Going deeper with Image Transformers (2021.3) [Paper]
  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

MLP-based

image

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

ConvMixer-based

  • [ConvMixer] Patches Are All You Need [Paper]

General Architecture Analysis

image

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Others

Object Detection

Semantic Segmentation

back to top

Papers

Transformer Original Paper

  • [Transformer] Attention is All You Need] (NIPS 2017-2017.06) [Paper]

ViT Original Paper

  • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]

Image Classification

2020

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]

2021

  • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

  • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]

  • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]

  • [Cait] Going deeper with Image Transformers (2021.3) [Paper]

  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

  • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]

  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]

  • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

  • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]

  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]

  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

  • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]

  • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

  • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]

  • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]

  • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]

  • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]

  • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]

  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]

  • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]

  • [ConvMixer] Patches Are All You Need [Paper]

2022

  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Object Detection

Semantic Segmentation

back to top

Stay tuned and PRs are welcomed!

Owner
OpenMMLab
OpenMMLab
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

WECHSEL Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. arXiv: https://arx

Institute of Computational Perception 45 Dec 29, 2022
This YoloV5 based model is fit to detect people and different types of land vehicles, and displaying their density on a fitted map, according to their coordinates and detected labels.

This YoloV5 based model is fit to detect people and different types of land vehicles, and displaying their density on a fitted map, according to their

Liron Bdolah 8 May 22, 2022
Extreme Lightwegith Portrait Segmentation

Extreme Lightwegith Portrait Segmentation Please go to this link to download code Requirements python 3 pytorch = 0.4.1 torchvision==0.2.1 opencv-pyt

HYOJINPARK 59 Dec 16, 2022
Code for "Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks", CVPR 2021

Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks This repository contains the code that accompanies our CVPR 20

Despoina Paschalidou 161 Dec 20, 2022
Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition

Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition | paper | dataset | pretrained detection model | Authors: Yi-Chang Che

Yi-Chang Chen 1 Aug 23, 2022
Tool for installing and updating MiSTer cores and other files

MiSTer Downloader This tool installs and updates all the cores and other extra files for your MiSTer. It also updates the menu core, the MiSTer firmwa

72 Dec 24, 2022
Camera ready code repo for the NeuRIPS 2021 paper: "Impression learning: Online representation learning with synaptic plasticity".

Impression-Learning-Camera-Ready Camera ready code repo for the NeuRIPS 2021 paper: "Impression learning: Online representation learning with synaptic

2 Feb 09, 2022
frida工具的缝合怪

fridaUiTools fridaUiTools是一个界面化整理脚本的工具。新人的练手作品。参考项目ZenTracer,觉得既然可以界面化,那么应该可以把功能做的更加完善一些。跨平台支持:win、mac、linux 功能缝合怪。把一些常用的frida的hook脚本简单统一输出方式后,整合进来。并且

diveking 997 Jan 09, 2023
Example-custom-ml-block-keras - Custom Keras ML block example for Edge Impulse

Custom Keras ML block example for Edge Impulse This repository is an example on

Edge Impulse 8 Nov 02, 2022
Source code for NAACL 2021 paper "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference"

TR-BERT Source code and dataset for "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference". The code is based on huggaface's transformers.

THUNLP 37 Oct 30, 2022
Existing Literature about Machine Unlearning

Machine Unlearning Papers 2021 Brophy and Lowd. Machine Unlearning for Random Forests. In ICML 2021. Bourtoule et al. Machine Unlearning. In IEEE Symp

Jonathan Brophy 213 Jan 08, 2023
Repository for the Bias Benchmark for QA dataset.

BBQ Repository for the Bias Benchmark for QA dataset. Authors: Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Tho

ML² AT CILVR 18 Nov 18, 2022
Tensorflow 2.x implementation of Panoramic BlitzNet for object detection and semantic segmentation on indoor panoramic images.

Deep neural network for object detection and semantic segmentation on indoor panoramic images. The implementation is based on the papers:

Alejandro de Nova Guerrero 9 Nov 24, 2022
Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning

Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning This is the code for implementing the MADDPG algorithm presented in

97 Dec 21, 2022
Predicting Event Memorability from Contextual Visual Semantics

Predicting Event Memorability from Contextual Visual Semantics

0 Oct 06, 2021
GAN example for Keras. Cuz MNIST is too small and there should be something more realistic.

Keras-GAN-Animeface-Character GAN example for Keras. Cuz MNIST is too small and there should an example on something more realistic. Some results Trai

160 Sep 20, 2022
Implementation of Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

advantage-weighted-regression Implementation of Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning, by Peng et al. (

Omar D. Domingues 1 Dec 02, 2021
SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data (AAAI 2021)

SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data (AAAI 2021) PyTorch implementation of SnapMix | paper Method Overview Cite

DavidHuang 126 Dec 30, 2022
Vikrant Deshpande 1 Nov 17, 2022
Emulation and Feedback Fuzzing of Firmware with Memory Sanitization

BaseSAFE This repository contains the BaseSAFE Rust APIs, introduced by "BaseSAFE: Baseband SAnitized Fuzzing through Emulation". The example/ directo

Security in Telecommunications 138 Dec 16, 2022