PyTorch implementation of "Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet"

Last update: Dec 27, 2022

Overview

Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet (arxiv)

This is a PyTorch implementation of our technical report.

Comparison between the proposed LV-ViT and other recent transformer-based models. Only models with fewer than 100M parameters are shown.

Training Pipeline

Our code is based on pytorch-image-models by Ross Wightman.

LV-ViT Models

Model	Layers	Dim	Image resolution	Params	Top-1 (%)	Download
LV-ViT-S	16	384	224	26.15M	83.3	link
LV-ViT-S	16	384	384	26.30M	84.4	link
LV-ViT-M	20	512	224	55.83M	84.0	link
LV-ViT-M	20	512	384	56.03M	85.4	link
LV-ViT-L	24	768	448	150.47M	86.2	link

Requirements

torch>=1.4.0 torchvision>=0.5.0 pyyaml timm==0.4.5

Data preparation: ImageNet with the following folder structure. You can extract ImageNet with this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
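
Before a long run it can help to confirm that the dataset directory matches the layout above. The following is a minimal sketch using only the standard library; the function name `summarize_imagenet_layout` is hypothetical and not part of this repository.

```python
from pathlib import Path


def summarize_imagenet_layout(root):
    """Count class folders and JPEG files under root/train and root/val.

    Raises FileNotFoundError if either split directory is missing.
    """
    root = Path(root)
    summary = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        # Each class is a sub-folder named by its WordNet ID, e.g. n01440764.
        class_dirs = sorted(p for p in split_dir.iterdir() if p.is_dir())
        num_images = sum(
            1
            for class_dir in class_dirs
            for f in class_dir.iterdir()
            if f.suffix.lower() in {".jpeg", ".jpg"}
        )
        summary[split] = {"classes": len(class_dirs), "images": num_images}
    return summary
```

For a complete ImageNet-1k copy, both splits would be expected to report 1000 classes.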

Validation

Pass the path to your ImageNet validation set (DATA_DIR) and the path to the checkpoint (MODEL_DIR) as the two arguments:

CUDA_VISIBLE_DEVICES=0 bash eval.sh /path/to/imagenet/val /path/to/checkpoint
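
The Top 1 column in the model table is top-1 accuracy on the validation set. As a dependency-free illustration of that metric (not the repository's evaluation code; `topk_accuracy` is a hypothetical helper), top-k accuracy can be sketched as:

```python
def topk_accuracy(scores, targets, k=1):
    """Return the percentage of samples whose target is among the k highest scores.

    scores: list of per-class score lists, one list per sample.
    targets: list of ground-truth class indices, one per sample.
    """
    if not targets or len(scores) != len(targets):
        raise ValueError("scores and targets must be non-empty and equally long")
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += target in topk
    return 100.0 * hits / len(targets)
```

With `k=1` this is the fraction of samples whose highest-scoring class equals the label.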

Label data

We provide the dense label map generated by NFNet-F6 here. Because NFNet-F6 is trained on ImageNet data only, no extra training data is involved.

Training

Coming soon

Reference

If you use this repo or find it useful, please consider citing:

@misc{jiang2021token,
      title={Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet}, 
      author={Zihang Jiang and Qibin Hou and Li Yuan and Daquan Zhou and Xiaojie Jin and Anran Wang and Jiashi Feng},
      year={2021},
      eprint={2104.10858},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Related projects

T2T-ViT, Re-labeling ImageNet.

Comments

Error: the downloaded pretrained model cannot be extracted

tar -xvf lvvit_s-26M-384-84-4.pth.tar
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
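
The message suggests the file is not a tar archive despite its extension. A common convention, assumed here rather than confirmed in this thread, is to give checkpoints written by `torch.save` a `.pth.tar` name; recent PyTorch versions store such files as zip containers. A standard-library sketch for checking what a file actually is (the helper name `checkpoint_container` is hypothetical):

```python
import tarfile
import zipfile


def checkpoint_container(path):
    """Classify a file as 'tar', 'zip' or 'other' by inspecting its contents."""
    if tarfile.is_tarfile(path):
        return "tar"
    if zipfile.is_zipfile(path):
        # Typical for checkpoints saved by recent versions of torch.save.
        return "zip"
    return "other"
```

If the result is not `tar`, the file is probably meant to be passed directly to `torch.load(path, map_location="cpu")` rather than extracted first.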

opened by Williamlizl 10
The accuracy of the validation set is 0, and the loss is always around 13

Hello! I use ILSVRC2012_img_train and ILSVRC2012_img_val, and use the provided label_top5_train_nfnet from Google Drive. I train lv-vit-s with batch_size 64 without apex for one epoch. Thanks for your advice.

opened by yifanQi98 7
Pretrained weights for LV-ViT-T

Hi,

Thanks for sharing your work. Could you also provide the pre-trained weights for the LV-ViT-T model variant, the one that achieves 79.1% top1-acc. as mentioned in Table 1 of your paper?

All the best, Marc

opened by marc345 5
train error: AttributeError: 'tuple' object has no attribute 'log_softmax'
Hi, thanks for your great work. When I run the training script, an error occurs: AttributeError: 'tuple' object has no attribute 'log_softmax'

with amp_autocast():
    output = model(input)
    loss = loss_fn(output, target)  # error occurs

The loss function is train_loss_fn = LabelSmoothingCrossEntropy(smoothing=0.0).cuda()

By the way, could you please tell me why we need to specify smoothing=0.0?
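
The error indicates that the model returned a tuple where the loss expected a single tensor of logits. One plausible reading, an assumption not confirmed in this thread, is that the first element holds the class logits and the remaining elements carry token-level outputs. The sketch below is plain Python for a single sample and is not the repository's loss; it unwraps such a tuple and shows that, in the usual label-smoothing formulation, `smoothing=0.0` reduces to ordinary cross-entropy.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def label_smoothing_ce(output, target, smoothing=0.0):
    """Cross-entropy with uniform label smoothing for a single sample."""
    if isinstance(output, tuple):
        # Assumption: the first element of the tuple holds the class logits.
        output = output[0]
    logprobs = log_softmax(output)
    nll = -logprobs[target]                    # standard cross-entropy term
    smooth = -sum(logprobs) / len(logprobs)    # mean negative log-probability
    return (1.0 - smoothing) * nll + smoothing * smooth
```

With `smoothing=0.0` the second term vanishes and the result equals `-log_softmax(output)[target]`.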
opened by lxy5513 5
RuntimeError: CUDA error: device-side assert triggered

I am new to deep learning. When I run the VOLO code with tlt on a single GPU or on multiple GPUs, I get the following error:

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [25,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 949, in <module>
    main()
  File "main.py", line 664, in main
    optimizers=optimizers)
  File "main.py", line 773, in train_one_epoch
    label_size=args.token_label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 90, in mixup_target
    y1 = get_labelmaps_with_coords(target, num_classes, on_value=on_value, off_value=off_value, device=device, label_size=label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 64, in get_labelmaps_with_coords
    num_classes=num_classes,device=device)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 16, in get_featuremaps
    _label_topk[1][:, :, :].long(),
RuntimeError: CUDA error: device-side assert triggered.

I have not been able to fix this problem so far.
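
The assertion reports an out-of-bounds index while top-k label indices are scattered into a per-class map. One plausible cause, an assumption not confirmed in this thread, is a label map whose class indices are not smaller than the value passed as `--num-classes`. A CPU-side sketch of the same scatter for a single token, with an explicit bounds check (`topk_to_dense` is a hypothetical helper, not the `tlt` implementation):

```python
def topk_to_dense(scores, indices, num_classes):
    """Scatter top-k (score, class index) pairs for one token into a dense vector.

    Raises IndexError for a class index outside [0, num_classes), which is the
    condition the CUDA scatter kernel asserts on.
    """
    if len(scores) != len(indices):
        raise ValueError("scores and indices must have the same length")
    dense = [0.0] * num_classes
    for score, idx in zip(scores, indices):
        if not 0 <= idx < num_classes:
            raise IndexError(
                f"class index {idx} is out of range for num_classes={num_classes}"
            )
        dense[idx] = score
    return dense
```

Running such a check over the label data on the CPU usually gives a clearer message than the device-side assert.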

opened by JIAOJIAYUASD 4
Generating label for custom dataset

Hello,

Thank you for sharing your work. I am currently trying to generate token labels for a custom dataset with the lvvit_s model, but the loss stays close to 7 and the accuracy at 0 (not pre-trained, using 1 GPU in Google Colab). I also tried the pre-trained model with --transfer but got 0 for both loss and accuracy. What options should I use for a custom dataset?

opened by AleMaiaF 2
generate_label.py unable to find model lvvit_s

Hi,

When I tried to run the label generation script for the model lvvit_s it returned an error "RuntimeError: Unknown model".

Solution: It worked when I added the line "import tlt.models" in the file generate_label.py.
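
This behaviour is consistent with a decorator-based model registry, where a model name only becomes known once the module defining it has been imported. A minimal sketch of that pattern follows; the names mirror the thread, but the code is illustrative and not taken from `tlt` or pytorch-image-models.

```python
_MODEL_REGISTRY = {}


def register_model(fn):
    """Decorator that records a model factory under its function name."""
    _MODEL_REGISTRY[fn.__name__] = fn
    return fn


def create_model(name, **kwargs):
    """Look up a registered factory and call it; fail for unknown names."""
    if name not in _MODEL_REGISTRY:
        raise RuntimeError(f"Unknown model ({name})")
    return _MODEL_REGISTRY[name](**kwargs)


# The decorator runs only when the module containing this definition is
# imported, which is why a missing import leaves the name unregistered.
@register_model
def lvvit_s(num_classes=1000):
    # Stand-in for building a real model.
    return {"name": "lvvit_s", "num_classes": num_classes}
```

In this sketch, calling `create_model("lvvit_s")` before the defining module is imported would raise the same kind of "Unknown model" error.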

opened by AleMaiaF 2
Can Token labeling reach higher than annotator model?

Greetings,

Thank you for this incredible research.

I would like to know whether Token Labeling can achieve scores higher than those of the annotator model. I believe this was the case with the VOLO D5 model, which achieved a higher score than NFNet, the model used for annotation.

opened by ErenBalatkan 1
label_map does not do the same augmentation (random crop) as the input image
Hi, thanks so much for the nice work! I am curious whether you could share some insight into how the label_map is processed. If I understand correctly, after loading an image and the corresponding label map, we should apply the same crop/flip/resize to both. But in https://github.com/zihangJiang/TokenLabeling/blob/aa438eff9b9fc2daa8c8b4cc6bfaa6e3721f995e/tlt/data/label_transforms_factory.py#L58-L73 it seems that only the image is cropped and the label map is not, which would make the label map no longer match the image.

Should we do the following instead?

return (
    torchvision_F.resized_crop(img, i, j, h, w, self.size, interpolation),
    torchvision_F.resized_crop(label_map, i / ratio, j / ratio, h / ratio, w / ratio, self.size, interpolation),
)
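
To make the coordinate scaling in this proposal concrete, here is a plain-Python sketch that maps an image-space crop onto a coarser label-map grid. It assumes `ratio` is the number of image pixels per label-map cell and snaps the box outward to whole cells, which is a choice made for this example. Both function names are hypothetical, and whether the repository needs such a change is not settled in this thread.

```python
import math


def label_map_crop_box(i, j, h, w, ratio):
    """Return integer (top, left, height, width) on the label-map grid
    covering an image crop given as (top=i, left=j, height=h, width=w)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    top = math.floor(i / ratio)
    left = math.floor(j / ratio)
    bottom = math.ceil((i + h) / ratio)
    right = math.ceil((j + w) / ratio)
    return top, left, bottom - top, right - left


def crop_label_map(label_map, i, j, h, w, ratio):
    """Crop a 2D label map (list of rows) to the cells covered by the image crop."""
    top, left, height, width = label_map_crop_box(i, j, h, w, ratio)
    return [row[left:left + width] for row in label_map[top:top + height]]
```

For a 224x224 image with a 14x14 label map, `ratio` would be 16 under this assumption.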

Thanks
opened by haooooooqi 1
Python3.6, ok; Python3.8, error

Test: [ 0/1] Time: 11.293 (11.293) Loss: 0.7043 (0.7043) Acc@1: 42.1875 (42.1875) Acc@5: 100.0000 (100.0000)
Test: [ 1/1] Time: 0.108 (5.701) Loss: 0.5847 (0.6689) Acc@1: 89.8148 (56.3187) Acc@5: 100.0000 (100.0000)
free(): invalid pointer
free(): invalid pointer
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 303, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 294, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3.8', '-u', 'main.py', '--local_rank=1', './dataset/c/c', '--model', 'lvvit_s', '-b', '128', '--apex-amp', '--img-size', '224', '--drop-path', '0.1', '--token-label', '--token-label-size', '14', '--dense-weight', '0.0', '--num-classes', '2', '--finetune', './pretrained/lvvit_s-26M-384-84-4.pth.tar']' died with <Signals.SIGABRT: 6>.
[email protected]:/puxin_libochao/TokenLabeling# CUDA_VISIBLE_DEVICES=0,1 bash ./distributed_train.sh 2 ./dataset/c/c --model lvvit_s -b 128 --apex-amp --img-size 224 --drop-path 0.1 --token-label --token-label-size 14 --dense-weight 0.0 --num-classes 2 --finetune ./pretrained/lvvit_s-26M-384-84-4.pth.tar

opened by Williamlizl 1
A Bag of Training Techniques for ViT

Hi, thanks for your wonderful work. I would like to ask whether the training techniques described for LV-ViT can be used in other downstream tasks such as object detection. In your paper, I see that many of these techniques are applied on ImageNet. Thanks!

opened by qdd1234 1
How to apply token labeling to a CNN?

Hello, I'm interested in your token labeling technique and would like to apply it to a CNN-based model, because ViT is very heavy to train.

Could I get your code for token labeling with a CNN? If not, could you give me some details on how to implement it?

thank you.

opened by HoJ00n2 0
Model settings for Cifar10

I would like to know whether you have tested any LV-ViT model setup on CIFAR-10, and what the proper configuration of all blocks would be when training without pretrained weights.

opened by Aminullah6264 0

Releases (v0.2.0)

v0.2.0 (Jul 21, 2021)
Add docs and description
Add script to generate token label data
Assets: Source code (tar.gz, zip), lvvit_t.pth (32.61 MB)

v0.1.1 (Jul 17, 2021)
More detail can be found in README
Fix some bugs
Assets: Source code (tar.gz, zip)

0.1.0 (Jun 16, 2021)
Assets: Source code (tar.gz, zip)

v1.1-seg (Jun 3, 2021)
Assets: Source code (tar.gz, zip), upernet_lvvit_l.pth (795.80 MB), upernet_lvvit_m.pth (294.96 MB), upernet_lvvit_s.pth (169.68 MB)

1.1 (Apr 24, 2021)
Release our pre-trained model
Assets: Source code (tar.gz, zip), lvvit_m-56M-448-85.5.tar (214.17 MB)

1.0 (Apr 23, 2021)
Release our pre-trained model
Assets: Source code (tar.gz, zip)
lvvit_l-150M-448-86.2.pth.tar (574.07 MB)
lvvit_l-150M-512-86.4.pth.tar (574.77 MB)
lvvit_m-56M-224-84.0.pth.tar (213.02 MB)
lvvit_m-56M-384-85.4.pth.tar (213.76 MB)
lvvit_m-56M-448-85.5.pth.tar (214.17 MB)
lvvit_s-26M-224-83.3.pth.tar (99.80 MB)
lvvit_s-26M-384-84.4.pth.tar (100.36 MB)

Owner

蒋子航

Now a Ph.D. student supervised by Prof. Feng Jiashi in ECE, NUS.


