VOLO: Vision Outlooker for Visual Recognition


VOLO: Vision Outlooker for Visual Recognition, arxiv

This is a PyTorch implementation of our paper, which presents Vision Outlooker (VOLO). VOLO achieves state-of-the-art (SOTA) performance on ImageNet and Cityscapes without using any extra training data.

ImageNet top-1 accuracy comparison with state-of-the-art (SOTA) CNN-based and Transformer-based models. All results use each model's best test resolution. Our VOLO-D5 achieved SOTA performance on ImageNet without extra data as of June 2021.

(Updating: code and models for downstream tasks such as semantic segmentation are coming soon.)

Reference

@misc{yuan2021volo,
      title={VOLO: Vision Outlooker for Visual Recognition}, 
      author={Li Yuan and Qibin Hou and Zihang Jiang and Jiashi Feng and Shuicheng Yan},
      year={2021},
      eprint={2106.13112},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

1. Requirements

torch>=1.7.0; torchvision>=0.8.0; timm==0.4.5; tlt==0.1.0; pyyaml; apex-amp

Data preparation: ImageNet with the following folder structure. You can extract ImageNet with this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
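
Before training, it can help to confirm that the extracted dataset matches the layout above. The following is a minimal sketch and not part of this repo: `summarize_imagenet_dir` is a hypothetical helper that counts class folders and `.JPEG` files under `train/` and `val/`. It checks only the folder layout, not the image contents.

```python
import os


def summarize_imagenet_dir(root):
    """Count class folders and .JPEG images under root/train and root/val.

    Hypothetical helper (not part of the VOLO repo); it only inspects the
    directory layout and never opens the image files.
    """
    summary = {}
    for split in ("train", "val"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError(f"missing split folder: {split_dir}")
        # Every sub-directory of a split is treated as one class (e.g. n01440764).
        classes = sorted(
            d for d in os.listdir(split_dir)
            if os.path.isdir(os.path.join(split_dir, d))
        )
        # Count image files by extension, ignoring case.
        n_images = sum(
            1
            for c in classes
            for f in os.listdir(os.path.join(split_dir, c))
            if f.upper().endswith(".JPEG")
        )
        summary[split] = {"classes": len(classes), "images": n_images}
    return summary
```

Calling `summarize_imagenet_dir("/path/to/imagenet")` returns a dictionary with one entry per split. On the full ImageNet-1k dataset, both splits should report 1000 classes.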

Directory structure in this repo:

│volo/
├──figures/
├──loss/
│  ├── __init__.py
│  ├── cross_entropy.py
├──models/
│  ├── __init__.py
│  ├── volo.py
├──utils/
│  ├── __init__.py
│  ├── utils.py
├──LICENSE
├──README.md
├──distributed_train.sh
├──main.py
├──validate.py

2. VOLO Models

Model         | #params | Image resolution | Top-1 Acc (%) | Download
volo_d1       | 27M     | 224              | 84.2          | here
volo_d1 ↑384  | 27M     | 384              | 85.2          | here
volo_d2       | 59M     | 224              | 85.2          | here
volo_d2 ↑384  | 59M     | 384              | 86.0          | here
volo_d3       | 86M     | 224              | 85.4          | here
volo_d3 ↑448  | 86M     | 448              | 86.3          | here
volo_d4       | 193M    | 224              | 85.7          | here
volo_d4 ↑448  | 193M    | 448              | 86.8          | here
volo_d5       | 296M    | 224              | 86.1          | here
volo_d5 ↑448  | 296M    | 448              | 87.0          | here
volo_d5 ↑512  | 296M    | 512              | 87.1          | here

Usage

Instructions on how to use our pre-trained VOLO models:

from models.volo import *
from utils import load_pretrained_weights 

# create model
model = volo_d1()

# load the pretrained weights
# change num_classes based on the dataset; this also works for different image sizes,
# as we interpolate the position embedding for different image sizes.
load_pretrained_weights(model, "/path/to/pretrained/weights", use_ema=False, 
                        strict=False, num_classes=1000)  
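
The comment in the usage snippet says that position embeddings are interpolated when the image size changes. As a rough illustration of that idea, and not the repo's actual code, the sketch below bilinearly resizes an H×W grid of embedding vectors using plain Python lists. The function name `resize_pos_embed_grid`, the corner-aligned sampling, and the grid sizes are assumptions made for this example; in PyTorch, `torch.nn.functional.interpolate` is the usual tool for this kind of resize.

```python
def resize_pos_embed_grid(grid, new_h, new_w):
    """Bilinearly resize a position-embedding grid.

    grid is a nested list of shape [H][W][C]; the result has shape
    [new_h][new_w][C]. Corners of the old grid map onto corners of the
    new grid (an assumption of this sketch, not taken from the repo).
    """
    old_h, old_w = len(grid), len(grid[0])
    channels = len(grid[0][0])

    def source_coord(i, new_size, old_size):
        # Map output index i to a fractional coordinate in the input grid.
        if new_size == 1:
            return 0.0
        return i * (old_size - 1) / (new_size - 1)

    out = []
    for i in range(new_h):
        y = source_coord(i, new_h, old_h)
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = source_coord(j, new_w, old_w)
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            vec = []
            for c in range(channels):
                # Blend along x on the two neighbouring rows, then along y.
                top = grid[y0][x0][c] * (1 - wx) + grid[y0][x1][c] * wx
                bottom = grid[y1][x0][c] * (1 - wx) + grid[y1][x1][c] * wx
                vec.append(top * (1 - wy) + bottom * wy)
            row.append(vec)
        out.append(row)
    return out
```

For instance, a pretrained 14×14 grid could be resized to 24×24 with `resize_pos_embed_grid(grid, 24, 24)` before loading weights at a larger resolution; these sizes are illustrative only.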

3. Validation

To evaluate our VOLO models, run:

python3 validate.py /path/to/imagenet  --model volo_d1 \
  --checkpoint /path/to/checkpoint --no-test-pool --apex-amp --img-size 224 -b 128

Change --img-size from 224 to 384, 448, or 512 to match the resolution of the checkpoint. For example, to evaluate volo_d5 at 512 (87.1), run:

python3 validate.py /path/to/imagenet  --model volo_d5 \
  --checkpoint /path/to/volo_d5_512 --no-test-pool --apex-amp --img-size 512 -b 32
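
When evaluating several checkpoints, the command can be assembled programmatically. This is a small sketch under stated assumptions: `build_validate_command` is a hypothetical helper, it uses only the flags that appear in the two commands above, and the batch size is left to the caller because only the 224 and 512 values are given in this README.

```python
import shlex


def build_validate_command(data_dir, model, checkpoint, img_size, batch_size):
    """Assemble the validate.py command line shown above as one string.

    Hypothetical helper (not part of the VOLO repo). Arguments are quoted
    with shlex.join so that paths containing spaces remain valid.
    """
    args = [
        "python3", "validate.py", data_dir,
        "--model", model,
        "--checkpoint", checkpoint,
        "--no-test-pool", "--apex-amp",
        "--img-size", str(img_size),
        "-b", str(batch_size),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_validate_command(
        "/path/to/imagenet", "volo_d5", "/path/to/volo_d5_512", 512, 32))
```

The resulting string can be passed to a shell or logged alongside the reported accuracy.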

4. Train

Download the token labeling data first, since training uses token labeling; details about token labeling are here.

For each VOLO model, we first train at image size 224 and then finetune at image size 384, 448, or 512:

Train volo_d1 on 224 and finetune on 384
8 GPUs, batch_size=1024, 19G GPU memory per GPU with apex-amp (mixed precision training)

Train volo_d1 on 224 for 310 epochs, acc=84.2

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model volo_d1 --img-size 224 \
  -b 128 --lr 1.6e-3 --drop-path 0.1 --apex-amp \
  --token-label --token-label-size 14 --token-label-data /path/to/token_label_data

Finetune on 384 for 30 epochs based on the checkpoint pretrained on 224, final acc=85.2 on 384

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model volo_d1 --img-size 384 \
  -b 64 --lr 8.0e-6 --min-lr 4.0e-6 --drop-path 0.1 --epochs 30 --apex-amp \
  --weight-decay 1.0e-8 --warmup-epochs 5  --ground-truth \
  --token-label --token-label-size 24 --token-label-data /path/to/token_label_data \
  --finetune /path/to/pretrained_224_volo_d1/

Train volo_d2 on 224 and finetune on 384
8 GPUs, batch_size=1024, 27G GPU memory per GPU with apex-amp (mixed precision training)

Train volo_d2 on 224 for 300 epochs, acc=85.2

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model volo_d2 --img-size 224 \
  -b 128 --lr 1.0e-3 --drop-path 0.2 --apex-amp \
  --token-label --token-label-size 14 --token-label-data /path/to/token_label_data

Finetune on 384 for 30 epochs based on the checkpoint pretrained on 224, final acc=86.0 on 384

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model volo_d2 --img-size 384 \
  -b 48 --lr 8.0e-6 --min-lr 4.0e-6 --drop-path 0.2 --epochs 30 --apex-amp \
  --weight-decay 1.0e-8 --warmup-epochs 5  --ground-truth \
  --token-label --token-label-size 24 --token-label-data /path/to/token_label_data \
  --finetune /path/to/pretrained_224_volo_d2/
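
Two bits of arithmetic sit behind the commands above. The stated batch_size=1024 is the per-GPU batch (`-b 128`) multiplied by the 8 GPUs, and the `--token-label-size` values (14 at 224, 24 at 384) are both consistent with dividing the image size by 16. The sketch below spells this out. The stride of 16 is inferred from those two data points and is an assumption rather than something this README states, and both function names are hypothetical.

```python
def global_batch_size(num_gpus, per_gpu_batch):
    """Total number of images per optimizer step across all GPUs."""
    return num_gpus * per_gpu_batch


def token_label_size(img_size, stride=16):
    """Side length of the token-label grid for a given image size.

    ASSUMPTION: one token label per stride x stride patch, with stride=16
    inferred from the commands above (224 -> 14, 384 -> 24).
    """
    if img_size % stride != 0:
        raise ValueError(f"img_size {img_size} is not a multiple of {stride}")
    return img_size // stride


if __name__ == "__main__":
    print(global_batch_size(8, 128))   # training commands: 8 GPUs, -b 128
    print(token_label_size(224), token_label_size(384))
```

If the same relationship holds at higher resolutions, 448 would correspond to 28 and 512 to 32, but those values should be checked against the repo before use.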

5. Acknowledgement

We gratefully acknowledge the support of NVIDIA AI Tech Center (NVAITC) for this research project, especially the GPU technology support from Terry Jianxiong Yin (NVAITC) and Qingyi Tao (NVAITC).

Related projects: T2T-ViT, Token_labeling, pytorch-image-models, official imagenet example

LICENSE

This repo is under the Apache-2.0 license. For commercial use, please contact the authors.

Comments
  • Main difference between "outlook attention" and "Involution"

    Thanks for your excellent work! I noticed that your "outlook attention" is very similar to "Involution" (https://github.com/d-li14/involution), and I would like to know the main difference. As far as I can see, the main difference is that you apply an extra linear projection to the input itself and an extra softmax to generate the attention weights. However, I did not find a detailed comparison between these two methods in your paper.

    question 
    opened by wuyongfa-genius 7
  • Colab Notebook doesn't work & gives wrong results.

    Hi creators, thanks for making a new SOTA model and for open-sourcing it.

    I was trying the Colab notebook and it throws an error.

    After going through the Usage section and adding those parameters, it gave me different and wrong results compared to the demo Colab notebook.

    opened by debparth 4
  • Could you please release an ablation study that compares LV-VIT with / without outlook

    This is great work that proposes a new form of attention. However, I would like to understand how it performs when everything else is compared under the same conditions.

    Could you please release an ablation study that compares the outlook and attention under the same training policy, hyperparameters (network width, depth), and architectures (for example ViT or LV-ViT)?

    So that we can better know the effectiveness of the outlook.

    opened by theFoxofSky 4
  • semantic segmentation

    Hello, thanks very much for sharing the code for your tremendous research!

    For semantic segmentation, did you just run evaluation with multiple square tiles to handle the non-square resolution of Cityscapes? Can you share any details, like decoder head architecture?

    opened by ajtao 3
  • Finetune with 512 image size

    Hello,

    I am finetuning a model with an image size of 512 and --token-label-size 24. Is the label size enough for 512 image size? Should I use a higher label size? How do I really know the correct label size?

    Thank you in advance!

    opened by javierrodenas 2
  • AttributeError: 'tuple' object has no attribute 'log_softmax'

    I was trying to train VOLO on some custom data using the following command

    !python3 main.py -data_dir "/content/dataset" --dataset "ImageFolder" --train-split "train" --val-split "valid" --num-classes 3 --epochs 100 --batch-size 64

    Unfortunately, I keep on getting the following error:

    Traceback (most recent call last):
      File "main.py", line 948, in <module>
        main()
      File "main.py", line 664, in main
        optimizers=optimizers)
      File "main.py", line 783, in train_one_epoch
        loss = loss_fn(output, target)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/timm/loss/cross_entropy.py", line 35, in forward
        loss = torch.sum(-target * F.log_softmax(x, dim=-1), dim=-1)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1768, in log_softmax
        ret = input.log_softmax(dim)
    AttributeError: 'tuple' object has no attribute 'log_softmax'
    

    Any advice on how I can go about fixing this or what is causing this error to occur?

    opened by SamratSahoo 2
  • TypeError: forward() takes 1 positional argument but 2 were given

    Hi,

    I used the volo_d1 pretrained model for my training. The model loaded successfully, but I received this error during the training step. Could you please give me a hint? Attached is the input I traced before the forward() function. The original model was resnet18, which worked fine. d1_224.txt resnet18.txt

    Thank you in advance.

    Best regards, Hui Yu

    opened by YHDASHEN 2
  • AttributeError: 'tuple' object has no attribute 'max'

    When I use the code to train from the pretrained model, I find that the data becomes a tuple after passing through the model, so training cannot continue.

    model = volo_d2()
    load_pretrained_weights(model, './path/to/pretrained/weights/d2_224_85.2.pth.tar',
                            use_ema=False, strict=False, num_classes=2)
    print(model)
    ......
    def train(train_loader, model, criterion, optimizer, epoch, args, scheduler=None):
        print("train--------------")
        avg_loss = 0
        avg_acc = 0
        model.train()
        for batch_idx, (image, target) in enumerate(train_loader):
            # measure data loading time
            image, target = Variable(image.cuda()), Variable(target.cuda())
            print(type(image))
            image = image.cuda()
            target = target.cuda()
            optimizer.zero_grad()
            logits = model(image)

            #m = [t.cpu().numpy() for t in logits]
            #m = [o.cpu().detach() for o in m]
            #logits = torch.tensor(m)
            #logits = torch.tensor([item.cpu().detach().numpy() for item in logits]).cuda()

            print(type(logits))

            preds = logits.max(1, keepdim=True)[1]  # get the index of the max log-probability
    

    AttributeError: 'tuple' object has no attribute 'max'

    opened by zhang-pan 2
  • Confusion about the OutlookAttention module

    In class OutlookAttention, there is self.v = nn.Linear(dim, dim, bias=qkv_bias), and the input of this class is x with shape B, H, W, C = x.shape. My question is how the line v = self.v(x).permute(0, 3, 1, 2) # B, C, H, W can run without an exception, given that it performs a matrix multiplication of [B, H, W, C] by [dim, dim]. Also, Algorithm 1 in the original paper implements v_pj = nn.Linear(C, C), but in your code C is replaced with dim. Thanks!

    opened by axhiao 1
  • UnboundLocalError: local variable 'input' referenced before assignment

    When I run the main program with "python main.py ./data", the following error occurs:

    D:\Python36\lib\site-packages\torchvision\transforms\transforms.py:258: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
      "Argument interpolation should be of type InterpolationMode instead of int. "
    Traceback (most recent call last):
      File "main.py", line 948, in <module>
        main()
      File "main.py", line 664, in main
        optimizers=optimizers)
      File "main.py", line 746, in train_one_epoch
        for batch_idx, (input, target) in enumerate(loader):
      File "D:\Python36\lib\site-packages\tlt\data\loader.py", line 105, in __iter__
        yield input, target
    UnboundLocalError: local variable 'input' referenced before assignment

    Please give me some advice. I look forward to your reply. Thank you very much.

    opened by zhang-pan 1
  • There is a problem when loading the pretrained weights

    A problem occurs when I load the pretrained weights you provided.

    UnpicklingError Traceback (most recent call last) in
          9 # as we interpolate the position embeding for different image size.
         10 load_pretrained_weights(model, "/home/featurize/work/checkpoints/archive/data.pkl", use_ema=False,
    ---> 11 strict=False, num_classes=1000)

    /cloud/volo/utils/utils.py in load_pretrained_weights(model, checkpoint_path, use_ema, strict, num_classes)
        140 num_classes=1000):
        141 '''load pretrained weight for VOLO models'''
    --> 142 state_dict = load_state_dict(checkpoint_path, model, use_ema, num_classes)
        143 model.load_state_dict(state_dict, strict=strict)
        144

    /cloud/volo/utils/utils.py in load_state_dict(checkpoint_path, model, use_ema, num_classes)
         92 if checkpoint_path and os.path.isfile(checkpoint_path):
         93 # checkpoint = torch.load(checkpoint_path, map_location='cpu')
    ---> 94 checkpoint = torch.load(checkpoint_path)
         95 state_dict_key = 'state_dict'
         96 if isinstance(checkpoint, dict):

    /environment/python/versions/miniconda3-4.7.12/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
        591 return torch.jit.load(opened_file)
        592 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
    --> 593 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
        594
        595

    /environment/python/versions/miniconda3-4.7.12/lib/python3.7/site-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
        760 "functionality.")
        761
    --> 762 magic_number = pickle_module.load(f, **pickle_load_args)
        763 if magic_number != MAGIC_NUMBER:
        764 raise RuntimeError("Invalid magic number; corrupt file?")

    UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified.

    opened by JonnesLin 1
  • Ablation study and official code problem

    First, thanks for your contribution; this paper inspired me a lot. However, I still have some questions, which I hope you can answer:

    1. About the code: I think the output shape of the Unfold operation in your code is not right, even though it raises no error.
    2. About the ablation study: it would be better to compare dynamic convolution with your outlook attention. They are very similar to each other; the only difference is the weight generation method. I am very interested in this.
    3. Based on your paper, I modified your code according to my own understanding: https://github.com/xingshulicc/Vision-In-Transformer-Model/blob/main/outlook_attention.py.
      I hope you can give me some advice on my code. Thanks again.
    opened by xingshulicc 0
  • Question about computational complexity formulation of Outlooker Attention

    Greetings! Thanks for your inspiring and excellent VOLO work! While reading the paper, I had trouble understanding formula (8), which describes the complexity of Outlooker Attention. I tried to derive the cost from the PyTorch-like code provided earlier in the paper, but could not arrive at formula (8). Would you mind providing some insight into the calculation? Thanks a lot.

    opened by ligeng0197 0
  • pre trained model file is broken.

    I downloaded the pretrained model using the link in the document, but when I try to use it, it reports an error:

    ~/workspace/DeepLearning/VOLO/pretrained_models$ tar -xf d1_384_85.2.pth.tar
    tar: This does not look like a tar archive
    tar: Skipping to next header
    tar: Exiting with failure status due to previous errors

    opened by scotthuang1989 1
  • volo-d1 training without token label data

    Hi,

    Congratulations on your excellent work and many thanks for making the code public. I have trained a model using the base settings and no token labels:

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet --model volo_d1 --img-size 224 -b 128 --lr 1.6e-3 --drop-path 0.1 --apex-amp

    which reached a best accuracy of 81.72% after 310 epochs. I believe the expected best accuracy should be about 83.8%, which is considerably higher than what I get at the moment.

    Can you see any issue with the command used to train the model? Any help would be really appreciated.

    Best, Michael

    opened by michaeltrs 1
  • When training own dataset, an error occurs when changing numberclasses to the corresponding category. If it is the default, it will report an error

    AMP not enabled. Training in float32.
    Using native Torch DistributedDataParallel.
    Scheduled epochs: 310
    /pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
    Traceback (most recent call last):
      File "main.py", line 948, in <module>
        main()
      File "main.py", line 664, in main
        optimizers=optimizers)
      File "main.py", line 782, in train_one_epoch
        output = model(input)
      File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 610, in forward
        self._sync_params()
      File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1048, in _sync_params
        authoritative_rank,
      File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 979, in _distributed_broadcast_coalesced
        self.process_group, tensors, buffer_size, authoritative_rank
    RuntimeError: CUDA error: device-side assert triggered
    terminate called after throwing an instance of 'std::runtime_error'
      what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8

    opened by hx358031364 0
  • The code for Semantic Segmentation?

    Dear authors: Thanks for your wonderful work! The results on the semantic segmentation task look good. Could the code and config for semantic segmentation be published? Thanks!

    opened by HITerStudy 0
Owner
Sea AI Lab