DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Last update: Jan 01, 2023

Overview

Created by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh

This repository contains the PyTorch implementation of DynamicViT.

We introduce a dynamic token sparsification framework that prunes redundant tokens in vision transformers progressively and dynamically, based on the input.
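To make the idea concrete, here is a minimal sketch of a single pruning step in plain Python. It assumes each token already has an importance score; how DynamicViT produces those scores is not reproduced here. The helper name `prune_tokens`, the toy tokens, and the scores are illustrative and not part of this repository.

```python
from typing import List, Sequence, Tuple


def prune_tokens(tokens: Sequence[str], scores: Sequence[float],
                 keep_ratio: float) -> Tuple[List[str], List[int]]:
    """Keep the highest-scoring fraction of tokens (hypothetical helper).

    Returns the kept tokens and their original indices, in original order.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Number of tokens that survive this step (at least one).
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by score, highest first, and take the top num_keep.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_keep])  # restore the original token order
    return [tokens[i] for i in kept], kept


# Toy example: 6 patch tokens with made-up importance scores.
tokens = ["sky", "sky", "bird", "wing", "sky", "branch"]
scores = [0.05, 0.10, 0.95, 0.80, 0.02, 0.60]
kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept_tokens, kept_idx)
```

Because which tokens survive depends on the scores of the current input, a different image yields a different set of kept tokens, which is what "dynamic" refers to here.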

Our code is based on pytorch-image-models, DeiT, and LV-ViT.

[Project Page] [arXiv]

Model Zoo

We provide our DynamicViT models pretrained on ImageNet:

name	arch	rho	Acc@1	Acc@5	FLOPs	url
DynamicViT-256/0.7	`deit_256`	0.7	76.532	93.118	1.3G	Google Drive / Tsinghua Cloud
DynamicViT-384/0.7	`deit_small`	0.7	79.316	94.676	2.9G	Google Drive / Tsinghua Cloud
DynamicViT-LV-S/0.5	`lvvit_s`	0.5	81.970	95.756	3.7G	Google Drive / Tsinghua Cloud
DynamicViT-LV-S/0.7	`lvvit_s`	0.7	83.076	96.252	4.6G	Google Drive / Tsinghua Cloud
DynamicViT-LV-M/0.7	`lvvit_m`	0.7	83.816	96.584	8.5G	Google Drive / Tsinghua Cloud

Usage

Requirements

torch>=1.7.0
torchvision>=0.8.1
timm==0.4.5

Data preparation: download and extract the ImageNet images from http://image-net.org/. The directory structure should be:

│ILSVRC2012/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
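Before launching a long job, it can help to confirm that the extracted dataset matches the layout above. The following is a small sketch using only the standard library; the function name `check_imagenet_layout` is hypothetical and not part of this repository, and it only checks for `train/` and `val/` folders that contain class subfolders with `.JPEG` files.

```python
from pathlib import Path
from typing import List


def check_imagenet_layout(root: str) -> List[str]:
    """Return a list of problems found under root (an empty list means OK)."""
    problems = []
    root_path = Path(root)
    for split in ("train", "val"):
        split_dir = root_path / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split}/")
            continue
        # Class folders are named by WordNet ID, e.g. n01440764.
        class_dirs = sorted(d for d in split_dir.iterdir() if d.is_dir())
        if not class_dirs:
            problems.append(f"no class folders in {split}/")
        for class_dir in class_dirs:
            # Each class folder should hold at least one .JPEG image.
            if not any(p.suffix.upper() == ".JPEG" for p in class_dir.iterdir()):
                problems.append(f"no .JPEG files in {split}/{class_dir.name}")
    return problems
```

Calling `check_imagenet_layout("/path/to/ILSVRC2012/")` returns an empty list when both splits look as expected, and a list of human-readable messages otherwise.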

Model preparation: download pre-trained DeiT and LV-ViT models for training DynamicViT:

sh download_pretrain.sh

Demo

We provide a Jupyter notebook where you can run the visualization of DynamicViT.

To run the demo, you need to install matplotlib.

Evaluation

To evaluate a pre-trained DynamicViT model on ImageNet val with a single GPU, run:

python infer.py --data-path /path/to/ILSVRC2012/ --arch arch_name --model-path /path/to/model --base_rate 0.7
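The `--arch` and `--base_rate` values need to match the checkpoint being evaluated. As a convenience sketch, the snippet below pairs each model name from the Model Zoo table with its `arch` and `rho` values and assembles the command shown above. The dictionary `MODEL_ZOO` and the helper `build_eval_command` are hypothetical names introduced for illustration, and the paths are placeholders.

```python
import shlex
from typing import List

# (arch, base_rate) pairs copied from the Model Zoo table in this README.
MODEL_ZOO = {
    "DynamicViT-256/0.7": ("deit_256", 0.7),
    "DynamicViT-384/0.7": ("deit_small", 0.7),
    "DynamicViT-LV-S/0.5": ("lvvit_s", 0.5),
    "DynamicViT-LV-S/0.7": ("lvvit_s", 0.7),
    "DynamicViT-LV-M/0.7": ("lvvit_m", 0.7),
}


def build_eval_command(name: str, data_path: str, model_path: str) -> List[str]:
    """Assemble the infer.py argument list for one Model Zoo entry."""
    arch, base_rate = MODEL_ZOO[name]
    return [
        "python", "infer.py",
        "--data-path", data_path,
        "--arch", arch,
        "--model-path", model_path,
        "--base_rate", str(base_rate),
    ]


cmd = build_eval_command("DynamicViT-LV-S/0.5", "/path/to/ILSVRC2012/", "/path/to/model")
print(shlex.join(cmd))  # shlex.join requires Python 3.8+
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root if you prefer to launch the evaluation from Python.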

Training

To train DynamicViT models on ImageNet, run:

DeiT-small

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_deit-small --arch deit_small --input-size 224 --batch-size 96 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7

LV-ViT-S

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_lvvit-s --arch lvvit_s --input-size 224 --batch-size 64 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7

LV-ViT-M

python -m torch.distributed.launch --nproc_per_node=8 --use_env main_dynamic_vit.py  --output_dir logs/dynamic-vit_lvvit-m --arch lvvit_m --input-size 224 --batch-size 48 --data-path /path/to/ILSVRC2012/ --epochs 30 --dist-eval --distill --base_rate 0.7

You can train models with different keeping ratios by adjusting base_rate. DynamicViT can also achieve comparable performance with only 15 epochs of training (around 0.1% lower accuracy).
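As a rough illustration of what base_rate controls, the sketch below estimates how many tokens remain after each pruning stage. It rests on assumptions that this README does not state: the geometric schedule described in the DynamicViT paper (three stages keeping ρ, ρ², and ρ³ of the original tokens), a 224×224 input split into 16×16 patches (196 patch tokens, as in DeiT), and truncation to an integer. The helper name `tokens_per_stage` is hypothetical.

```python
from typing import List


def tokens_per_stage(base_rate: float, num_tokens: int = 196,
                     num_stages: int = 3) -> List[int]:
    """Tokens kept after each stage, assuming ratios rho, rho^2, rho^3."""
    if not 0.0 < base_rate <= 1.0:
        raise ValueError("base_rate must be in (0, 1]")
    # Stage s keeps base_rate**s of the ORIGINAL token count (assumption).
    return [int(num_tokens * base_rate ** s) for s in range(1, num_stages + 1)]


for rho in (0.5, 0.7, 0.9):
    print(rho, tokens_per_stage(rho))
```

Under these assumptions, base_rate 0.7 leaves roughly a third of the patch tokens by the last stage, which gives a feel for where the FLOPs savings in the Model Zoo table come from.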

License

MIT License

Citation

If you find our work useful in your research, please consider citing:

@article{rao2021dynamicvit,
  title={DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification},
  author={Rao, Yongming and Zhao, Wenliang and Liu, Benlin and Lu, Jiwen and Zhou, Jie and Hsieh, Cho-Jui},
  journal={arXiv preprint arXiv:2106.02034},
  year={2021}
}
