DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Last update: Jan 01, 2023

Created by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh

This repository contains the PyTorch implementation of DynamicViT.

We introduce a dynamic token sparsification framework that prunes redundant tokens in vision transformers progressively and dynamically, based on the input.
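As a rough illustration of the idea (not the code in this repository), the sketch below keeps only the highest-scoring fraction of tokens for a single input. The `keep_top_tokens` helper and the score values are hypothetical. According to the paper, the real scores come from a learned prediction module, which is not reproduced here, and plain Python lists are used instead of tensors to keep the example dependency-free.

```python
from typing import List, Sequence


def keep_top_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the tokens to keep, in their original order.

    Hypothetical helper for illustration: `scores` holds one importance
    score per token and `keep_ratio` is the fraction of tokens to retain.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(scores) * keep_ratio))
    # Rank token indices by score, highest first, and take the top num_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:num_keep])


# Two inputs with different score patterns keep different token positions.
print(keep_top_tokens([0.9, 0.1, 0.8, 0.2, 0.7, 0.3], keep_ratio=0.5))  # [0, 2, 4]
print(keep_top_tokens([0.1, 0.9, 0.2, 0.8, 0.3, 0.7], keep_ratio=0.5))  # [1, 3, 5]
```

Because the kept positions depend on the scores, each image ends up with its own set of surviving tokens even though the keeping ratio is fixed.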

Our code is based on pytorch-image-models, DeiT, and LV-ViT.

[Project Page] [arXiv]

Model Zoo

We provide our DynamicViT models pretrained on ImageNet:

name	arch	rho	acc@1	acc@5	FLOPs	url
DynamicViT-256/0.7	`deit_256`	0.7	76.532	93.118	1.3G	Google Drive / Tsinghua Cloud
DynamicViT-384/0.7	`deit_small`	0.7	79.316	94.676	2.9G	Google Drive / Tsinghua Cloud
DynamicViT-LV-S/0.5	`lvvit_s`	0.5	81.970	95.756	3.7G	Google Drive / Tsinghua Cloud
DynamicViT-LV-S/0.7	`lvvit_s`	0.7	83.076	96.252	4.6G	Google Drive / Tsinghua Cloud
DynamicViT-LV-M/0.7	`lvvit_m`	0.7	83.816	96.584	8.5G	Google Drive / Tsinghua Cloud

Usage

Requirements

torch>=1.7.0
torchvision>=0.8.1
timm==0.4.5

Data preparation: download and extract ImageNet images from http://image-net.org/. The directory structure should be:

│ILSVRC2012/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
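
Before launching a long run, it can help to confirm that the extracted dataset matches this layout. The following is a minimal sketch of such a check; `summarize_imagenet_layout` is a hypothetical helper, not part of this repository, and it only counts folders and `.JPEG` files without opening any image.

```python
from pathlib import Path
from typing import Dict


def summarize_imagenet_layout(root: str) -> Dict[str, Dict[str, int]]:
    """Count class folders and .JPEG files under train/ and val/.

    Hypothetical helper: it checks only the folder layout shown above
    and does not validate the images themselves.
    """
    summary: Dict[str, Dict[str, int]] = {}
    for split in ("train", "val"):
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        # Each class is expected to be a sub-directory such as n01440764.
        class_dirs = [d for d in split_dir.iterdir() if d.is_dir()]
        num_images = sum(
            1 for d in class_dirs for f in d.iterdir() if f.suffix == ".JPEG"
        )
        summary[split] = {"classes": len(class_dirs), "images": num_images}
    return summary


# Example call (the path is a placeholder):
# print(summarize_imagenet_layout("/path/to/ILSVRC2012/"))
```

For the full ILSVRC2012 classification data, both splits should report 1000 class folders; a different count usually means the validation images were not sorted into per-class directories.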

Model preparation: download pre-trained DeiT and LV-ViT models for training DynamicViT:

sh download_pretrain.sh

Demo

We provide a Jupyter notebook where you can run the visualization of DynamicViT.

To run the demo, you need to install matplotlib.

Evaluation

To evaluate a pre-trained DynamicViT model on the ImageNet validation set with a single GPU, run:

python infer.py --data-path /path/to/ILSVRC2012/ --arch arch_name --model-path /path/to/model --base_rate 0.7
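
The `--arch` and `--base_rate` values presumably need to match the checkpoint being evaluated. As a convenience sketch, the snippet below builds the command for a given Model Zoo entry. The `MODEL_ZOO` dictionary and `build_eval_command` helper are hypothetical. The arch names and ratios are copied from the table above, the flags are the ones shown in the command above, and treating the table's rho column as the value for `--base_rate` is an assumption.

```python
import shlex
from typing import Dict, List, Tuple

# (arch, rho) pairs copied from the Model Zoo table.
MODEL_ZOO: Dict[str, Tuple[str, float]] = {
    "DynamicViT-256/0.7": ("deit_256", 0.7),
    "DynamicViT-384/0.7": ("deit_small", 0.7),
    "DynamicViT-LV-S/0.5": ("lvvit_s", 0.5),
    "DynamicViT-LV-S/0.7": ("lvvit_s", 0.7),
    "DynamicViT-LV-M/0.7": ("lvvit_m", 0.7),
}


def build_eval_command(name: str, data_path: str, model_path: str) -> List[str]:
    """Assemble the infer.py argument list for one Model Zoo entry."""
    arch, rho = MODEL_ZOO[name]
    return [
        "python", "infer.py",
        "--data-path", data_path,
        "--arch", arch,
        "--model-path", model_path,
        "--base_rate", str(rho),  # assumption: rho is passed as --base_rate
    ]


cmd = build_eval_command(
    "DynamicViT-LV-S/0.5", "/path/to/ILSVRC2012/", "/path/to/model"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, or the printed string can be pasted into a shell.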

Training

To train DynamicViT models on ImageNet, run:

DeiT-small

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_deit-small --arch deit_small --input-size 224 --batch-size 96 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7

LV-ViT-S

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_lvvit-s --arch lvvit_s --input-size 224 --batch-size 64 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7

LV-ViT-M

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_lvvit-m --arch lvvit_m --input-size 224 --batch-size 48 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7
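
When adapting these commands to a different number of GPUs, it is useful to know the global batch size they imply. The sketch below assumes that `--batch-size` is the per-process batch size, as in DeiT, on which this code is based; that interpretation is an assumption and is not stated in this README. The `global_batch_size` helper is hypothetical.

```python
def global_batch_size(per_process_batch: int, num_processes: int) -> int:
    """Total samples per optimizer step across all processes.

    Assumes one process per GPU and that --batch-size is per process.
    """
    return per_process_batch * num_processes


# Batch sizes taken from the three commands above, each with --nproc_per_node=8.
for name, batch in (("DeiT-small", 96), ("LV-ViT-S", 64), ("LV-ViT-M", 48)):
    print(name, global_batch_size(batch, 8))
# DeiT-small 768
# LV-ViT-S 512
# LV-ViT-M 384
```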

You can train models with different keeping ratios by adjusting `base_rate`. DynamicViT can also achieve comparable performance with only 15 epochs of training (around 0.1% lower accuracy).
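
To get a feel for what a given `base_rate` means in token counts, the sketch below assumes the geometric schedule described in the paper, where pruning stage s keeps a fraction `base_rate ** s` of the original tokens. The three-stage default, the 196-token input (a 224x224 image split into 16x16 patches), the rounding down, and the `tokens_per_stage` helper are all assumptions for illustration, not values stated in this README.

```python
from typing import List


def tokens_per_stage(
    num_tokens: int, base_rate: float, num_stages: int = 3
) -> List[int]:
    """Tokens kept after each pruning stage under a geometric schedule.

    Assumption: stage s (s = 1..num_stages) keeps a fraction
    base_rate ** s of the original tokens, rounded down.
    """
    return [int(num_tokens * base_rate ** s) for s in range(1, num_stages + 1)]


# 14 * 14 = 196 patch tokens for a 224x224 input with 16x16 patches (assumed).
print(tokens_per_stage(196, 0.7))  # [137, 96, 67]
print(tokens_per_stage(196, 0.5))  # [98, 49, 24]
```

Under these assumptions, lowering `base_rate` from 0.7 to 0.5 cuts the tokens reaching the last stage from 67 to 24, which is consistent with the lower FLOPs and lower accuracy reported for DynamicViT-LV-S/0.5 in the Model Zoo.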

License

MIT License

Citation

If you find our work useful in your research, please consider citing:

@article{rao2021dynamicvit,
  title={DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification},
  author={Rao, Yongming and Zhao, Wenliang and Liu, Benlin and Lu, Jiwen and Zhou, Jie and Hsieh, Cho-Jui},
  journal={arXiv preprint arXiv:2106.02034},
  year={2021}
}
