SPT_LSA_ViT - Implementation of Vision Transformer for Small-Size Datasets

Last update: Jan 01, 2023

Overview

Vision Transformer for Small-Size Datasets

Seung Hoon Lee, Seunghyun Lee, and Byung Cheol Song | Paper (arXiv:2112.13492)

Inha University

Abstract

Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training using a large-size dataset such as JFT-300M, and its dependence on a large dataset is interpreted as due to low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively solve the lack of locality inductive bias and enable it to learn from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied to the ViTs, the performance improved by an average of 2.96% in Tiny-ImageNet, which is a representative small-size dataset. Especially, Swin Transformer achieved an overwhelming performance improvement of 4.08% thanks to the proposed SPT and LSA.

Method

Shifted Patch Tokenization
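As a rough illustration of the idea, the sketch below shifts a toy single-channel image by half a patch in the four diagonal directions, stacks the shifted copies with the original as extra channels, and flattens each patch into one token. It is a dependency-free sketch rather than the repository's code: the function names `shift` and `shifted_patch_tokens`, the zero-filled borders, and the single-channel input are assumptions made for brevity, and the layer normalization and linear projection that would normally follow are omitted.

```python
def shift(image, dy, dx):
    """Translate a 2D grid by (dy, dx); pixels entering from outside are zero."""
    h, w = len(image), len(image[0])
    return [
        [image[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else 0
         for x in range(w)]
        for y in range(h)
    ]


def shifted_patch_tokens(image, patch_size):
    """Return one flat feature vector per patch (hypothetical helper).

    Each vector holds the patch pixels of the original image followed by
    the same patch taken from four diagonally shifted copies.
    """
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    s = patch_size // 2  # shift by half a patch (assumption stated above)

    # The original plus four diagonal shifts act as 5 stacked "channels".
    channels = [image] + [shift(image, dy, dx) for dy in (-s, s) for dx in (-s, s)]

    tokens = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            tokens.append([
                channel[y][x]
                for channel in channels
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ])
    return tokens


image = [[y * 4 + x for x in range(4)] for y in range(4)]  # 4x4 toy image
tokens = shifted_patch_tokens(image, patch_size=2)
print(len(tokens), len(tokens[0]))  # number of tokens, features per token
```

Because every token now also sees pixels from neighbouring patches, its feature vector is five times longer than a plain patch embedding input (here 20 values instead of 4 per token).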

Locality Self-Attention
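The sketch below illustrates the two ingredients usually associated with LSA: a diagonal mask that removes each token's attention to itself, and a temperature that divides the similarity scores before the softmax. It is a minimal single-head sketch in plain Python, not the repository's implementation. The function name `locality_self_attention` is hypothetical, and the temperature is passed as a fixed number here, whereas in training it would be a learnable parameter.

```python
import math


def locality_self_attention(queries, keys, values, temperature):
    """Single-head attention with a diagonal mask and a temperature.

    Returns (outputs, attention), where attention[i][j] is the weight
    token i assigns to token j.
    """
    n = len(queries)
    assert n >= 2, "with one token the diagonal mask leaves nothing to attend to"

    outputs, attention = [], []
    for i in range(n):
        # Similarity of token i to every other token, scaled by the
        # temperature; the self-similarity (i == j) is masked out.
        scores = [
            -math.inf if i == j
            else sum(q * k for q, k in zip(queries[i], keys[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable softmax over the unmasked scores.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        weights = [weight / total for weight in weights]

        attention.append(weights)
        outputs.append([
            sum(weights[j] * values[j][d] for j in range(n))
            for d in range(len(values[0]))
        ])
    return outputs, attention


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention = locality_self_attention(
    tokens, tokens, tokens, temperature=math.sqrt(2)
)
for row in attention:
    print([round(weight, 3) for weight in row])
```

A smaller temperature sharpens the attention distribution, which is why letting the model learn it can help when the default scaling leaves the scores too flat.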

Model Performance

Small-Size Dataset Classification

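Top-1 accuracy (%) on each dataset; FLOPs are in millions (M). The "SL-" prefix marks a model trained with both SPT and LSA.

Model	FLOPs (M)	CIFAR10	CIFAR100	SVHN	Tiny-ImageNet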
ViT	189.8	93.58	73.81	97.82	57.07
SL-ViT	199.2	94.53	76.92	97.79	61.07
T2T	643.0	95.30	77.00	97.90	60.57
SL-T2T	671.4	95.57	77.36	97.91	61.83
CaiT	613.8	94.91	76.89	98.13	64.37
SL-CaiT	623.3	95.81	80.32	98.28	67.18
PiT	279.2	94.24	74.99	97.83	60.25
SL-PiT	322.9	95.88	79.00	97.93	62.91
Swin	242.3	94.46	76.87	97.72	60.87
SL-Swin	284.9	95.93	79.99	97.92	64.95
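The Tiny-ImageNet column can be used to check the figures quoted in the abstract. The short script below copies the baseline and SL- accuracies from the table and computes the per-model gain and the average gain; the variable names are illustrative only.

```python
# (baseline accuracy, accuracy with SPT and LSA) on Tiny-ImageNet,
# copied from the table above.
tiny_imagenet = {
    "ViT": (57.07, 61.07),
    "T2T": (60.57, 61.83),
    "CaiT": (64.37, 67.18),
    "PiT": (60.25, 62.91),
    "Swin": (60.87, 64.95),
}

# Gain in percentage points for each architecture.
gains = {name: round(sl - base, 2) for name, (base, sl) in tiny_imagenet.items()}
average_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"average gain: {average_gain:.2f}")
```

The gains are 4.00, 1.26, 2.81, 2.66, and 4.08 points, which average to 2.96 and match the abstract; Swin shows the largest gain.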

Accuracy-Throughput Graph

How to train models

Pure ViT

python main.py --model vit

SL-Swin

python main.py --model swin --is_LSA --is_SPT
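For running several configurations in a row, a small helper can assemble the command lines shown above. This is only a convenience sketch: `build_train_command` is a hypothetical name, it uses only the `--model`, `--is_LSA`, and `--is_SPT` options that appear in the commands above, and it assumes the two flags can also be passed independently of each other, which the examples above do not confirm.

```python
import shlex


def build_train_command(model, use_spt=False, use_lsa=False):
    """Build the argument list for main.py (hypothetical helper)."""
    command = ["python", "main.py", "--model", model]
    if use_lsa:
        command.append("--is_LSA")
    if use_spt:
        command.append("--is_SPT")
    return command


# Print the four SPT/LSA combinations for one model
# (assumes the flags are independent).
for use_spt, use_lsa in [(False, False), (True, False), (False, True), (True, True)]:
    print(shlex.join(build_train_command("vit", use_spt, use_lsa)))
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the repository root to start the corresponding training run.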

Citation

@misc{lee2021vision,
      title={Vision Transformer for Small-Size Datasets}, 
      author={Seung Hoon Lee and Seunghyun Lee and Byung Cheol Song},
      year={2021},
      eprint={2112.13492},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Owner

Lee SeungHoon
