DeepViT

This repo is the official implementation of "DeepViT: Towards Deeper Vision Transformer". The repo is based on the timm library (https://github.com/rwightman/pytorch-image-models) by Ross Wightman

Introduction

Deep Vision Transformer is initially described in arxiv, which observes the attention collapese phenomenon when training deep vision transformers: In this paper, we show that, unlike convolution neural networks (CNNs)that can be improved by stacking more convolutional layers, the performance of ViTs saturate fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting expected performance gain. Based on above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The pro-posed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.

2. DeepViT Models

Model	Re-attention	Top1 Acc (%)	#params	#Similar Blocks	Checkpoint
ViT-16	NA	78.88	24.5M	5	[here](comming soon)
DeepViT-16	FC	79.10	24.5M	0	[here](comming soon)
ViT-24	NA	79.35	36.3M	11	[here](comming soon)
DeepViT-24	FC	79.99	36.3M	0	[here](comming soon)
ViT-32	NA	79.27	48.1M	15	[here](comming soon)
DeepViT_t-32	FC	80.90	48.1M	0	[here](comming soon)

Citing DeepVit

@article{zhou2021deepvit,
  title={DeepViT: Towards Deeper Vision Transformer},
  author={Zhou, Daquan and Kang, Bingyi and Jin, Xiaojie and Yang, Linjie and Lian, Xiaochen and Hou, Qibin and Feng, Jiashi},
  journal={arXiv preprint arXiv:2103.11886},
  year={2021}
}

《DeepViT: Towards Deeper Vision Transformer》(2021)

Related tags

Overview

DeepViT

Introduction

2. DeepViT Models

Citing DeepVit

Owner

Multivariate Time Series Transformer, public version

A Python-based development platform for automated trading systems - from backtesting to optimisation to livetrading.

A Deep learning based streamlit web app which can tell with which bollywood celebrity your face resembles.

Code for Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

Price-Prediction-For-a-Dream-Home - A machine learning based linear regression trained model for house price prediction.

Code for "Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations"

Predicting future trajectories of people in cameras of novel scenarios and views.

Automatic Calibration for Non-repetitive Scanning Solid-State LiDAR and Camera Systems

Official PyTorch implementation of U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation

Official Implement of CVPR 2021 paper “Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting”

Streaming Anomaly Detection Framework in Python (Outlier Detection for Streaming Data)

A package related to building quasi-fibration symmetries

Spatial Contrastive Learning for Few-Shot Classification (SCL)

Reproduction of Vision Transformer in Tensorflow2. Train from scratch and Finetune.

Neural HMMs are all you need (for high-quality attention-free TTS)

A Pytorch Implementation of Domain adaptation of object detector using scissor-like networks

Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting (ICCV, 2021)

A Physics-based Noise Formation Model for Extreme Low-light Raw Denoising (CVPR 2020 Oral & TPAMI 2021)

RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering

DeiT: Data-efficient Image Transformers