awesome-vit

A curated list and survey of awesome Vision Transformers.

You can open the mind map source file with mind-mapping software, or download the high-resolution mind map images if you only want to browse them.

Survey

Only representative algorithms are listed in each category.

Image Classification

Chinese Blogs

Attention-based

Training Strategy

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
Model Improvements
Tokenization Module

Image to Token:

  • Non-overlapping Patch Embedding

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Overlapping Patch Embedding

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
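
The difference between the two patch embedding styles above comes down to the stride of the sliding window. The following is a minimal sketch in plain Python, not taken from any of the listed papers: `extract_patches` is a hypothetical helper, the 4x4 input is a toy example, and the learned linear projection that real models apply to each patch is omitted.

```python
def extract_patches(image, patch, stride):
    """Slide a patch x patch window over a 2D list and flatten each window."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            window = [row[left:left + patch] for row in image[top:top + patch]]
            # Flatten the window into a single token vector.
            patches.append([value for row in window for value in row])
    return patches


# Toy 4x4 single-channel "image" with pixel values 0..15 (illustrative only).
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Non-overlapping: stride equals the patch size, so each pixel is used once.
non_overlapping = extract_patches(image, patch=2, stride=2)

# Overlapping: stride is smaller than the patch size, so neighbours share pixels.
overlapping = extract_patches(image, patch=2, stride=1)
```

With the same patch size, the overlapping variant produces more tokens, and adjacent tokens share pixels, which is one way to preserve local continuity across patch borders.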

Token to Token:

  • Fixed sampling window tokenization
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Dynamic sampling tokenization
    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
    • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]
Position Encoding Module

Explicit position encoding:

  • Absolute position encoding
    • [Transformer] Attention is All You Need (NIPS 2017-2017.06) [Paper]
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
  • Relative position encoding
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
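
As one concrete instance of absolute position encoding, the fixed sinusoidal table from "Attention is All You Need" can be sketched as follows. This is an illustrative plain-Python version; the function name is hypothetical, and ViT itself uses learned absolute position embeddings rather than this fixed table.

```python
import math


def sinusoidal_position_encoding(num_positions, dim):
    """Build a num_positions x dim table of fixed sinusoidal encodings."""
    assert dim % 2 == 0, "dim is assumed to be even in this sketch"
    table = []
    for pos in range(num_positions):
        row = []
        for even_index in range(0, dim, 2):
            # Each sin/cos pair shares one frequency: pos / 10000^(2i / dim).
            angle = pos / (10000 ** (even_index / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


# One encoding vector per token position; it is added to the token embedding.
encoding = sinusoidal_position_encoding(num_positions=8, dim=4)
```

Relative position encodings, by contrast, depend on the offset between two tokens rather than on each token's own index.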

Implicit position encoding:

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Attention Module

Include only global attention:

  • Multi-Head attention module

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
  • Reduce global attention computation

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

    • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

    • [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Generalized linear attention

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

Introduce extra local attention:

  • Local window mode

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
    • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
    • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • Introduce convolutional local inductive bias

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
    • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
  • Sparse attention

    • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]
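
To see why local window attention is attractive, it helps to compare its cost with global attention. The sketch below uses the complexity estimates given in the Swin Transformer paper for an h x w token map with C channels and window size M; the function names and the example sizes are illustrative assumptions.

```python
def global_attention_cost(h, w, channels):
    """Estimated cost of global multi-head self-attention over h*w tokens."""
    tokens = h * w
    # 4*hw*C^2 for the Q/K/V/output projections, 2*(hw)^2*C for attention itself.
    return 4 * tokens * channels ** 2 + 2 * tokens ** 2 * channels


def window_attention_cost(h, w, channels, window):
    """Estimated cost when attention is restricted to window x window regions."""
    tokens = h * w
    # The projection term is unchanged; the attention term is now linear in hw.
    return 4 * tokens * channels ** 2 + 2 * window ** 2 * tokens * channels


# Example sizes chosen for illustration: a 56x56 token map with 96 channels.
global_cost = global_attention_cost(56, 56, 96)
window_cost = window_attention_cost(56, 56, 96, window=7)
```

The global term grows quadratically with the number of tokens, while the windowed term grows linearly for a fixed window size, which is what makes high-resolution feature maps affordable.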
FFN Module

Improve performance with convolution's ability to extract local information:

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
Normalization Module Location

  • Pre Normalization

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Post Normalization

    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
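
The placement of the normalization layer relative to the residual connection can be sketched as below. This is a simplified plain-Python illustration: `layer_norm` omits the learned scale and shift, `toy_sublayer` is a stand-in for attention or an FFN, and all function names are hypothetical. Note that Swin Transformer V2 describes a residual post-norm variant, in which the sublayer output is normalized before it is added back to the main branch.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [p + q for p, q in zip(a, b)]


def pre_norm_block(x, sublayer):
    # Pre-normalization: normalize the input, then add the residual.
    return add(x, sublayer(layer_norm(x)))


def post_norm_block(x, sublayer):
    # Post-normalization (original Transformer): normalize after the residual add.
    return layer_norm(add(x, sublayer(x)))


def res_post_norm_block(x, sublayer):
    # Residual post-norm: normalize the sublayer output, then add the residual.
    return add(x, layer_norm(sublayer(x)))


def toy_sublayer(x):
    """Stand-in for attention or an FFN; simply doubles each value."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0]
pre = pre_norm_block(x, toy_sublayer)
post = post_norm_block(x, toy_sublayer)
res_post = res_post_norm_block(x, toy_sublayer)
```

In the pre-norm and residual post-norm forms the identity path from input to output is left unnormalized, whereas classic post-norm normalizes the main branch itself.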
Classification Prediction Head Module

  • Class Tokens

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
  • Average Pooling

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
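
The two head designs differ only in how the token sequence is reduced to a single feature vector before the final classifier. A minimal sketch, assuming tokens are plain lists of floats and using hypothetical helper names:

```python
def class_token_feature(tokens):
    """Use the prepended class token (assumed to sit at index 0) as the feature."""
    return tokens[0]


def average_pool_feature(tokens):
    """Average all tokens channel by channel to form the feature."""
    count = len(tokens)
    return [sum(channel) / count for channel in zip(*tokens)]


# Three toy tokens with two channels each (illustrative values).
tokens = [[1.0, 0.0], [2.0, 4.0], [3.0, 8.0]]

cls_feature = class_token_feature(tokens)      # relies on a dedicated token
pooled_feature = average_pool_feature(tokens)  # needs no extra token
```

In both cases the resulting vector would then be passed to a linear classification layer, which is omitted here.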
Others

(1) How to output multi-scale feature maps

  • Patch merging

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
    • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • Pooling attention

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

    • [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Dilated convolution

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
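
As an illustration of how patch merging yields a lower-resolution stage, the sketch below concatenates each 2x2 group of neighbouring tokens, in the spirit of the Swin Transformer design. The helper name and the concatenation order are assumptions, and the linear layer that usually reduces the concatenated channels afterwards is omitted. Other listed models downsample differently, for example with a strided patch embedding.

```python
def merge_patches(grid):
    """Halve the spatial size of a token grid by concatenating 2x2 neighbours.

    grid[r][c] is a list of channel values for the token at row r, column c.
    """
    height, width = len(grid), len(grid[0])
    assert height % 2 == 0 and width % 2 == 0, "even grid size assumed"
    merged = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            # List concatenation stacks the four tokens along the channel axis.
            row.append(grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1])
        merged.append(row)
    return merged


# Toy 4x4 grid of single-channel tokens with values 0..15 (illustrative only).
grid = [[[r * 4 + c] for c in range(4)] for r in range(4)]
merged = merge_patches(grid)  # 2x2 grid, four channels per token
```

Applying this step between stages gives a feature pyramid similar to that of a convolutional backbone, which is what dense prediction tasks typically expect.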

(2) How to train a deeper Transformer

  • [CaiT] Going deeper with Image Transformers (2021.3) [Paper]
  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

MLP-based

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

ConvMixer-based

  • [ConvMixer] Patches Are All You Need [Paper]

General Architecture Analysis

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Others

Object Detection

Semantic Segmentation

Papers

Transformer Original Paper

  • [Transformer] Attention is All You Need (NIPS 2017-2017.06) [Paper]

ViT Original Paper

  • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]

Image Classification

2020

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]

2021

  • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

  • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]

  • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]

  • [CaiT] Going deeper with Image Transformers (2021.3) [Paper]

  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

  • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]

  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]

  • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

  • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]

  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]

  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

  • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]

  • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

  • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]

  • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]

  • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]

  • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]

  • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]

  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]

  • [Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]

  • [ConvMixer] Patches Are All You Need [Paper]

2022

  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Object Detection

Semantic Segmentation

Stay tuned. PRs are welcome!
