Official repository for "Intriguing Properties of Vision Transformers" (2021)

Last update: Dec 27, 2022

Overview

Intriguing Properties of Vision Transformers

Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, & Ming-Hsuan Yang

Abstract: Vision transformers (ViT) have demonstrated impressive performance across various machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robust performance to occlusions is not due to a bias towards local textures, and ViTs are significantly less biased towards textures compared to CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show effective features of ViTs are due to flexible and dynamic receptive fields possible via self-attention mechanisms. Our code will be publicly released.

Citation

@misc{naseer2021intriguing,
      title={Intriguing Properties of Vision Transformers}, 
      author={Muzammal Naseer and Kanchana Ranasinghe and Salman Khan and Munawar Hayat and Fahad Shahbaz Khan and Ming-Hsuan Yang},
      year={2021},
      eprint={2105.10497},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

We are in the process of cleaning our code. We will update this repo shortly. Here are the highlights of what to expect :)

~~Pretrained ViT models trained on Stylized ImageNet (along with distilled ones). We will provide code to use these models for auto-segmentation~~.
~~Training and Evaluations for our proposed off-the-shelf ensemble features.~~
~~Code to evaluate any model on our proposed occlusion strategies (random, foreground and background).~~
~~Code for evaluation of permutation invariance.~~
~~Pretrained models to study the effect of varying patch sizes and positional encoding.~~
Pretrained adversarial patches and code to evaluate them.
Training on Stylized ImageNet.

Requirements

pip install -r requirements.txt

Shape Biased Models

Our shape-biased pretrained models can be downloaded from here. Code for evaluating their shape bias using auto-segmentation on the PASCAL VOC dataset can be found under scripts. Please fix any paths as necessary. You may place the VOC devkit folder under data/voc or fix the paths appropriately.

Running segmentation evaluation on models:

./scripts/eval_segmentation.sh

Visualizing segmentation for images in a given folder:

./scripts/visualize_segmentation.sh

Off the Shelf Classification

Training code for the off-the-shelf experiment is in classify_metadataset.py. Seven datasets (aircraft, CUB, DTD, fungi, GTSRB, Places365, INAT) are available by default. Set the appropriate directory path in classify_md.sh by fixing DATA_PATH.

Run training and evaluation for a selected dataset (aircraft by default) using a selected model (DeiT-T by default):

./scripts/classify_md.sh

Occlusion Evaluation

Evaluation on ImageNet val set (change path in script) for our proposed occlusion techniques:

./scripts/evaluate_occlusion.sh
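
As an illustration of the random occlusion setting, the following is a minimal sketch and not the code used by evaluate_occlusion.sh. It assumes the image is a plain 2D list of pixel values split into non-overlapping square patches; the function name random_patch_drop and its parameters are hypothetical. The foreground and background variants are not covered here.

```python
import random


def random_patch_drop(image, patch_size, drop_ratio, seed=0):
    """Zero out a random subset of non-overlapping patches.

    Hypothetical helper for illustration only: `image` is a 2D list
    (rows of pixel values) whose sides are multiples of `patch_size`.
    Returns the occluded copy and the set of dropped patch indices.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    cols = width // patch_size
    num_patches = (height // patch_size) * cols
    num_drop = int(round(drop_ratio * num_patches))

    # Sample patch indices without replacement (row-major patch order).
    rng = random.Random(seed)
    dropped = set(rng.sample(range(num_patches), num_drop))

    # Copy the image so the input is left untouched.
    occluded = [row[:] for row in image]
    for idx in dropped:
        top = (idx // cols) * patch_size
        left = (idx % cols) * patch_size
        for r in range(top, top + patch_size):
            for c in range(left, left + patch_size):
                occluded[r][c] = 0
    return occluded, dropped
```

With a 224x224 input and 16x16 patches there are 196 patches, so drop_ratio=0.8 removes 157 of them under this rounding rule.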

Permutation Invariance Evaluation

Evaluation on ImageNet val set (change path in script) for the shuffle operation:

./scripts/evaluate_shuffle.sh
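
The shuffle operation can be pictured with the sketch below. It is an assumption-laden illustration rather than the logic in evaluate_shuffle.sh: the image is a 2D list, it is cut into a grid x grid layout of equal blocks, and the blocks are permuted at random. The name shuffle_patches and its parameters are hypothetical.

```python
import random


def shuffle_patches(image, grid, seed=0):
    """Randomly permute the blocks of a grid x grid partition of the image.

    Hypothetical helper for illustration only: `image` is a 2D list whose
    sides are multiples of `grid`. Returns the shuffled copy and the
    permutation, where order[dst] is the source block placed at dst.
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be multiples of grid")

    block_h, block_w = height // grid, width // grid
    order = list(range(grid * grid))
    random.Random(seed).shuffle(order)

    shuffled = [[None] * width for _ in range(height)]
    for dst, src in enumerate(order):
        dst_r, dst_c = (dst // grid) * block_h, (dst % grid) * block_w
        src_r, src_c = (src // grid) * block_h, (src % grid) * block_w
        # Copy the source block row by row into the destination slot.
        for r in range(block_h):
            shuffled[dst_r + r][dst_c:dst_c + block_w] = (
                image[src_r + r][src_c:src_c + block_w]
            )
    return shuffled, order
```

Assuming the shuffle size counts grid cells, grid=14 on a 224x224 image gives 196 blocks of 16x16 pixels, while grid=2 gives 4 large blocks.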

Varying Patch Sizes and Positional Encoding

Pretrained models to study the effect of varying patch sizes and positional encoding:

DeiT-T Model	Top-1	Top-5	Pretrained
No Pos. Enc.	68.3	89.0	Link
Patch 22	68.7	89.0	Link
Patch 28	65.2	86.7	Link
Patch 32	63.1	85.3	Link
Patch 38	55.2	78.8	Link

References

Code is borrowed from the DeiT and DINO repositories.

Comments

Question about links of pretrained models

Hi! First of all, thank the authors for the exciting work! I noticed that the checkpoint link of the pretrained 'deit_tiny_distilled_patch16_224' in vit_models/deit.py is different from the one of the shape-biased model DeiT-T-SIN (distilled), as given in README.md. I thought deit_tiny_distilled_patch16_224 has the same definition with DeiT-T-SIN (distilled). Do they have differences in model architecture or training procedure?

opened by ZhouqyCH 3
Two questions on your paper
Hi. This is heonjin.

Firstly, big thanks to you for your paper. It is well written and precise! I have two questions about it.

Please take a look at Figure 9. In the 'no positional encoding' experiment, there is a peak at shuffle size 196 for "DeiT-T-no-pos". Why is there a peak? I also wonder why there is a decrease from shuffle size 0 to 64 for "DeiT-T-no-pos".

In Figure 14, on the Aircraft (few-shot) and Flower (few-shot) datasets, the CNN performs better than DeiT. Could you explain why?

Thanks in advance.
opened by hihunjin 2
Attention maps DINO Patchdrop

Hi, thanks for the amazing paper.

My question is about how the patches to drop from the image are chosen with the DINO model. In evaluate.py on line 132, the code sets head_number = 1. I want to understand why this number was chosen (the other params used to index the attention maps seem to make sense). Wouldn't averaging the attention maps across heads give you better segmentation?

Thanks,

Ravi

opened by rraju1 1
Support CPU when visualizing segmentations

Most of the code to visualize segmentation is ready for GPU and CPU, but I bumped into this one place where there is a hard-coded .cuda() call. I changed it to .to(device) to support CPU.

opened by cgarbin 0
Expand the instructions to install the PASCAL VOC dataset

I inspected the code to understand the expected directory structure. This note in the README may help other users put the dataset in the right place from the start.

opened by cgarbin 0
Add note to use Python 3.8 because of PyTorch 1.7

PyTorch 1.7 requires Python 3.8. Refer to the discussion in https://github.com/pytorch/pytorch/issues/47354.

Suggest adding this note to the README to help reproduce the environment because running pip install -r requirements.txt with the wrong version of Python gives an obscure error message.

opened by cgarbin 0
Amazing work, but can it work on DETR?

The ViT family shows strong robustness to random patch drop and domain shift. I am working on object detection these days. DETR is an end-to-end object detection method that adopts the Transformer encoder-decoder, but the backbone I use is ResNet50, and I can still observe the properties that your paper mentions. I would like to ask two questions: (1) Do these intriguing properties come from the encoder or the decoder part? (2) What is the difference between distribution shift and domain shift (I saw the term distribution shift for the first time in your paper)?

opened by 1184125805 0

Releases(v0)

v0(Jun 7, 2021)

Pretrained models for Stylized ImageNet (SIN), no pos-encoding, and different patch size experiments.
Source code(tar.gz)
Source code(zip)
deit_s_sin.pth(336.67 MB)
deit_s_sin_dist.pth(342.56 MB)
deit_s_sin_dist_aug.pth(342.56 MB)
deit_t_sin.pth(87.44 MB)
deit_t_sin_dist.pth(90.39 MB)
no_pos_deit_t.pth(86.86 MB)
patch_22_deit_t.pth(89.16 MB)
patch_28_deit_t.pth(91.69 MB)
patch_32_deit_t.pth(93.76 MB)
patch_38_deit_t.pth(97.38 MB)

Owner

Muzammal Naseer

PhD student at Australian National University.
