Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

Last update: Jan 01, 2023

Overview

Less is More: Pay Less Attention in Vision Transformers

By Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu and Jianfei Cai.

In our paper, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. LIT uses pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. We further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner.
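As a rough illustration of why these three operations are closely related, their standard textbook forms can be written side by side for a token sequence $x_1, \dots, x_N$. The symbols below ($W$, $b$, $\mathcal{N}(i)$, $\alpha_{ij}$, $d$) are illustrative shorthand and are not taken from the paper, whose exact formulation may differ.

```latex
\begin{align}
\text{FC layer:} \quad & y_i = W x_i + b \\
\text{Convolution:} \quad & y_i = \sum_{j \in \mathcal{N}(i)} W_{j-i}\, x_j + b \\
\text{Self-attention:} \quad & y_i = \sum_{j=1}^{N} \alpha_{ij}\, W_V x_j,
\qquad \alpha_{ij} = \operatorname{softmax}_j\!\left(\frac{(W_Q x_i)^\top (W_K x_j)}{\sqrt{d}}\right)
\end{align}
```

In each case the output is a weighted sum of linearly projected input tokens. The operations differ in which tokens contribute (only $x_i$, a local neighbourhood $\mathcal{N}(i)$, or all $N$ tokens) and in whether the weights are fixed parameters or computed from the input.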

If you use this code for a paper, please cite:

@article{pan2021less,
  title={Less is More: Pay Less Attention in Vision Transformers},
  author={Pan, Zizheng and Zhuang, Bohan and He, Haoyu and Liu, Jing and Cai, Jianfei},
  journal={arXiv preprint arXiv:2105.14217},
  year={2021}
}

Usage

First, clone this repository.

git clone https://github.com/MonashAI/LIT

Next, create a conda virtual environment.

# Make sure you have an NVIDIA GPU.
cd LIT/
bash setup_env.sh [conda_install_path] [env_name]

# For example
bash setup_env.sh /home/anaconda3 lit

Note: We use PyTorch 1.7.1 with CUDA 10.1 for all experiments. setup_env.sh lists all the dependencies we used in our experiments. You may want to edit this file to install a different version of PyTorch or any other packages.
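To confirm that an environment matches the PyTorch version noted above, a check along the following lines may help. This is a minimal sketch and not part of the repository: the helper names `parse_version` and `installed_version` are hypothetical, and the script only reads package metadata through the standard library rather than importing PyTorch.

```python
import re
from importlib import metadata

# Version this README states was used for all experiments.
REFERENCE_TORCH = (1, 7, 1)


def parse_version(text):
    """Turn a version string such as '1.7.1+cu101' into a tuple like (1, 7, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def installed_version(package):
    """Return the installed version of `package` as a tuple, or None if it is absent."""
    try:
        return parse_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("torch")
    if found is None:
        print("PyTorch is not installed in this environment.")
    elif found != REFERENCE_TORCH:
        print(f"PyTorch {found} found; the experiments used {REFERENCE_TORCH}.")
    else:
        print("PyTorch version matches the reference setup.")
```

A mismatch is not necessarily a problem; it only signals that results may differ from those reported with the reference setup.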

Data Preparation

Download the ImageNet 2012 dataset from here, and prepare the dataset based on this script. The file structure should look like:

imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
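Before training, it can be worth checking that the prepared dataset follows the layout shown above. The sketch below is one way to do so using only the standard library. The function names `summarize_split` and `check_layout` and the set of accepted file extensions are assumptions made for illustration and do not come from this repository.

```python
from pathlib import Path

# Assumed set of accepted image extensions (not specified by the repository).
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def summarize_split(split_dir):
    """Map each class folder in `split_dir` to the number of image files it holds."""
    split_dir = Path(split_dir)
    counts = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def check_layout(root):
    """Return a list of problems found under `root`; an empty list means none were found."""
    root = Path(root)
    problems = []
    summaries = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        summaries[split] = summarize_split(split_dir)
        if not summaries[split]:
            problems.append(f"no class folders in {split_dir}")
        for name, count in summaries[split].items():
            if count == 0:
                problems.append(f"no images in {split_dir / name}")
    # Both splits should contain the same set of class folders.
    if len(summaries) == 2 and set(summaries["train"]) != set(summaries["val"]):
        problems.append("train and val contain different class folders")
    return problems
```

For example, `check_layout("imagenet")` returns an empty list when both `train` and `val` exist, every class folder contains at least one image, and the two splits share the same class names.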

Model Zoo

We provide baseline LIT models pretrained on ImageNet 2012.

Name	Params (M)	FLOPs (G)	Top-1 Acc. (%)	Model	Log
LIT-Ti	19	3.6	81.1	google drive/github	log
LIT-S	27	4.1	81.5	google drive/github	log
LIT-M	48	8.6	83.0	google drive/github	log
LIT-B	86	15.0	83.4	google drive/github	log

Training and Evaluation

In our implementation, we have different training strategies for LIT-Ti and other LIT models. Therefore, we provide two codebases.

For LIT-Ti, please refer to code_for_lit_ti.

For LIT-S, LIT-M, LIT-B, please refer to code_for_lit_s_m_b.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgement

This repository adopts code from DeiT, PVT and Swin. We thank the authors for open-sourcing their code.

Comments

Problem about DCN

I am having trouble compiling DCN, so I cannot use the DCN.deform_conv2d_forward and DCN.deform_conv2d_backward functions. Can 'deform_conv2d_naive' be used instead? Or is there another way to accomplish the DCN.deform_conv2d part?

opened by jarygrace 3

How to use LITNet as a backbone for object detection

Could you please release the code for using LITNet as a backbone of RetinaNet, as mentioned in your paper? I would like to know how to use it as a backbone for object detection.

opened by sunhuisunhui 2

Releases(v2.1)

v2.1(Mar 10, 2022)

v2.0(Jun 26, 2021)

v1.0(Jun 8, 2021)
