Exploring whether attention is necessary for vision transformers

Last update: Jan 07, 2023

Overview

Do You Even Need Attention?
A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Paper/Report

TL;DR

We replace the attention layer in a vision transformer with a feed-forward layer and find that it still works quite well on ImageNet.

Abstract

The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought. We hope these results prompt the community to spend more time trying to understand why our current models are as effective as they are.
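
To make the architecture concrete, here is a minimal PyTorch sketch of one such block. It is an illustration under assumptions, not the code from vision_transformer_linear.py: the class names, the pre-norm residual layout, the GELU activation, the expansion factor of 4, and the sizes (196 patches, 192 features) are all choices made for this example.

```python
import torch
from torch import nn


class FeedForward(nn.Module):
    """Two-layer MLP applied over the last dimension of its input."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class FeedForwardOnlyBlock(nn.Module):
    """Hypothetical block: an MLP over patches, then an MLP over features."""

    def __init__(self, num_patches, dim, expansion=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Takes the place of attention: mixes information across patches.
        self.patch_ff = FeedForward(num_patches, num_patches * expansion)
        self.norm2 = nn.LayerNorm(dim)
        # Standard transformer feed-forward: mixes information across features.
        self.feature_ff = FeedForward(dim, dim * expansion)

    def forward(self, x):
        # x has shape (batch, num_patches, dim).
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, num_patches)
        x = x + self.patch_ff(y).transpose(1, 2)   # back to (batch, num_patches, dim)
        x = x + self.feature_ff(self.norm2(x))
        return x


torch.manual_seed(0)
block = FeedForwardOnlyBlock(num_patches=196, dim=192)
tokens = torch.randn(2, 196, 192)
out = block(tokens)
print(out.shape)
```

The only difference from a standard transformer block is the first sub-layer: the transpose lets an ordinary linear layer act along the patch dimension, which is the role attention would otherwise play.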

Note

This is concurrent research with MLP-Mixer from Google Research. The ideas are exactly the same, with the one difference being that they use (a lot) more compute.

Pretrained models and logs

Here is a Weights and Biases report with the expected training trajectory: W&B

name	acc@1	#params	url
FF-tiny	61.4	7.7M	model
FF-base	74.9	62M	model
FF-large	71.4	206M	-

Note: I haven't uploaded the FF-Large model because (1) it's over GitHub's file storage limit, and (2) I don't see why anyone would want it, given that it performs worse than the base model. That being said, if you want it, reach out to me and I'll send it to you.

How to train

The model definition in vision_transformer_linear.py is designed to be run with the repo from DeiT, which is itself based on the wonderful timm package.

Steps:

Clone the DeiT repo and move the file into it

git clone https://github.com/facebookresearch/deit
mv vision_transformer_linear.py deit
cd deit

Add a line to import vision_transformer_linear in main.py. For example, add the following after the import statements (around line 27):

+ import vision_transformer_linear

Train:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
--nproc_per_node=8 \
--master_port 10490 \
--use_env main.py \
--model linear_tiny \
--batch-size 128 \
--drop 0.1 \
--output_dir outputs/linear-tiny \
--data-path your/path/to/imagenet

Citation

If you build upon this idea, feel free to drop a citation (and also cite MLP-Mixer).

@article{melaskyriazi2021doyoueven,
  title={Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet},
  author={Luke Melas-Kyriazi},
  journal={arXiv},
  year={2021}
}

Comments

On details about the experiments

In the experiment section of the report, you mention:

Such a comparison is not exactly fair because the feed-forward model uses stronger training augmentations.

I wonder what the augmentations are, since details about them seem to be missing from the paper.

opened by boredtylin 2

Positional Encoding Ablation

Hi Luke, thank you for sharing this amazing work.

In your arxiv document, I cannot find any mention of positional encoding, but I see that you use them in your code. Did you conduct any ablation study on the PE? i.e., how much does it affect the performance, with and without?

Thank you in advance.

opened by jjparkcv 0

Interaction between patches through a transpose may have a stronger role to play?

Hi, I was going through your report. You make the point that, since you get good performance without an attention layer, the strong performance of ViT may have more to do with its embedding layer than with attention.

But I believe it may also have to do with how you establish an interaction between patches through a transpose, very similar to what was done in MLP-Mixer.

Would love to know your thoughts on this.

opened by rakshith291 0



Releases(v0.0.1)

v0.0.1(May 6, 2021)

Owner

Luke Melas-Kyriazi
