awesome-fast-attention

A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)

Efficient Attention
Articles/Surveys/Benchmarks

Efficient Attention

Paper (citations)	Implementation	Computational Complexity	AutoRegressive	Main Idea
Generating Wikipedia by Summarizing Long Sequences (282)	memory-compressed-attention	$\mathcal{O}({b}\cdot\frac{N}{b}\cdot\frac{N}{{b}\cdot{k}}\cdot{D})$	✔️	compresses key and value + blocked attention
CBAM: Convolutional Block Attention Module (999+)	attention-module	$\mathcal{O}(({N}\cdot{D}+\frac{{D}^2}{r})+({N}\cdot{D}\cdot{k}^2))$	❌	combines SE attention with a per-pixel (local) weight
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks (16)	set_transformer	$\mathcal{O}({N}\cdot{K}\cdot{D})$	❌	uses K relay nodes
CCNet: Criss-Cross Attention for Semantic Segmentation (296)	CCNet	$\mathcal{O}({N}\cdot({H}+{W})\cdot{D})$	❌	each pixel attends to its row and column simultaneously
Efficient Attention: Attention with Linear Complexities (16)	efficient-attention	$\mathcal{O}({N}\cdot{D}^2)$	❌	Softmax(Q)(Softmax(K^T)V)
Star-Transformer (40)	fastNLP	$\mathcal{O}({N}\cdot{D})$	❌	uses a relay (global) node and attends to/from that node
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (199)	GCNet	$\mathcal{O}({N}\cdot{D}^2)$	❌	squeeze and excitation with an attention pooling (instead of a GAP)
Generating Long Sequences with Sparse Transformers (257)	DeepSpeed	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	sparse block-based attention
SCRAM: Spatially Coherent Randomized Attention Maps (1)	-	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	✔️	uses PatchMatch to find close keys
Interlaced Sparse Self-Attention for Semantic Segmentation (24)	IN_PAPER	$\mathcal{O}({N}\cdot{D}^2+{N}\cdot\sqrt{N}\cdot{D})$	✔️	combines short-range attention followed by long-range (dilated) attention
Permutohedral Attention Module for Efficient Non-Local Neural Networks (3)	Permutohedral_attention_module	$\mathcal{O}({N}\cdot{D}^2)$	❌	uses the permutohedral lattice approximation algorithm to approximate the attention output
Large Memory Layers with Product Keys (43)	XLM	$\mathcal{O}({Q}\cdot({K}+{k}^2)\cdot{D})$	✔️	searches for nearest-neighbor keys
Expectation-Maximization Attention Networks for Semantic Segmentation (79)	EMANet	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	applies expectation maximization to cluster keys into k clusters
BP-Transformer: Modelling Long-Range Context via Binary Partitioning (15)	BPT	$\mathcal{O}({N}\cdot{k}\cdot\log(\frac{N}{k})\cdot{D})$	✔️	attends to distant tokens coarsely and attends to close tokens in a more fine-grained manner
Compressive Transformers for Long-Range Sequence Modelling (48)	compressive-transformer-pytorch	$\mathcal{O}({N}^2\cdot{D})$	✔️	compresses distant tokens instead of just applying stop_grad() to them; a more efficient version of Transformer-XL
Axial Attention in Multidimensional Transformers (36)	axial-attention	$\mathcal{O}({N}\cdot({H}+{W})\cdot{D})$	✔️	applies attention on each axis separately
Reformer: The Efficient Transformer (216)	trax	$\mathcal{O}({N}\cdot\log({N})\cdot{D}^2)$	✔️	uses LSH to find close keys
Sparse Sinkhorn Attention (16)	sinkhorn-transformer	$\mathcal{O}(\frac{{N}^2}{n_b}+{n_b}^2)$	✔️	uses a cost matrix to limit attention between buckets
Transformer on a Diet (2)	transformer-on-diet	$\mathcal{O}({N}\cdot{k}\cdot{D})$	✔️	dilated transformer, like WaveNet
Time-aware Large Kernel Convolutions (9)	TaLKConvolutions	$\mathcal{O}({N}\cdot{D})$	✔️	calculates the mean over a dynamic subsequence around each token with the help of a summed-area table
SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (2)	-	$\mathcal{O}({N}\cdot{k}\cdot{D})$	✔️	learns the q, k connections, i.e. dynamically creates a sparse attention matrix
Efficient Content-Based Sparse Attention with Routing Transformers (38)	routing-transformer	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	computes attention with same-cluster tokens (computed by online k-means)
Neural Architecture Search for Lightweight Non-Local Networks (11)	AutoNL	$\mathcal{O}((\frac{H}{h}\cdot\frac{W}{w})\cdot(\frac{D}{k})^2)$	❌	computes Q(KV) and also downsamples q, k, v in both spatial and channel dimensions
Longformer: The Long-Document Transformer (159)	longformer	$\mathcal{O}({N}\cdot({k}+{g})\cdot{D})$	✔️	global + blocked attention
ETC: Encoding Long and Structured Inputs in Transformers (16)	-	$\mathcal{O}(({N}\cdot{g}+{g}^2+{N}\cdot{k})\cdot{D})$	❌	combines global attention (star transformer with multiple global tokens) with local attention
Multi-scale Transformer Language Models (2)	IN_PAPER	$\mathcal{O}({N}^2\cdot{D})$	✔️	UNet-like + retina attention, which is close to BP-Transformer
Synthesizer: Rethinking Self-Attention in Transformer Models (26)	Synthesizer-Rethinking-Self-Attention-Transformer-Models	$\mathcal{O}({N}^2\cdot{D})$	✔️	does not compute pairwise interactions
Jukebox: A Generative Model for Music (45)	jukebox	$\mathcal{O}({N}\cdot\sqrt{N}\cdot{D})$	✔️	better attention patterns from Sparse Transformer
Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers (0)	-	$\mathcal{O}({N}^2\cdot{D})$	✔️	does not compute pairwise interactions and uses fixed mask patterns
GMAT: Global Memory Augmentation for Transformers (2)	gmat	$\mathcal{O}({m}\cdot({N}+{m})\cdot{D})$	❌	adds global tokens
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (45)	fast-transformers	$\mathcal{O}({N}\cdot{D}^2)$	✔️	uses phi(q)(phi(k)v) and also improves the sequential sampling step
Linformer: Self-Attention with Linear Complexity (47)	linformer-pytorch	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	projects key and value from N×D to k×D
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (8)	google-research	$\mathcal{O}({N}\cdot{D}^2\cdot\log({D}))$	✔️	calculates an unbiased stochastic approximation of the attention matrix
Kronecker Attention Networks (1)	kronecker-attention-pytorch	$\mathcal{O}(({H}+{W})^2\cdot{D})$	❌	uses horizontal and lateral average matrices
Real-time Semantic Segmentation with Fast Attention (5)	-	$\mathcal{O}({N}\cdot{D}^2)$	❌	l2_norm(q)(l2_norm(k)v)
Fast Transformers with Clustered Attention (6)	fast-transformers	$\mathcal{O}({N}\cdot{k}\cdot{D})$	❌	groups queries together with LSH
Big Bird: Transformers for Longer Sequences (60)	DeepSpeed	$\mathcal{O}(({g}^2+{N}\cdot({k}+{g}+{r}))\cdot{D})$	❌	ETC with random connections
Tensor Low-Rank Reconstruction for Semantic Segmentation (3)	-	$\mathcal{O}(({D}\cdot{H}\cdot{W}+{D}^2+{H}^2+{W}^2)\cdot{r})$	❌	decomposes the full attention tensor into rank-one tensors (CP decomposition)
Looking for change? Roll the Dice and demand Attention (0)	IN_PAPER	$\mathcal{O}({H}\cdot{W}\cdot{D})$	❌	uses the fractal Tanimoto similarity to compare queries with keys inside the attention module
Rethinking Attention with Performers (30)	google-research	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	unbiased approximation of the attention matrix with softmax kernel
Memformer: The Memory-Augmented Transformer (0)	memformer	$\mathcal{O}({N}\cdot{D})$	✔️	attends to memory slots + Memory-Replay BackPropagation
SMYRF: Efficient Attention using Asymmetric Clustering (1)	smyrf	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	❌	LSH with balanced clusters
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (0)	Informer2020	$\mathcal{O}({N}\cdot\log({N})\cdot{D})$	✔️	sparse attention + funnel-like encoder
Sub-Linear Memory: How to Make Performers SLiM (0)	google-research	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	Performer but with sublinear memory usage
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (0)	Nystromformer	$\mathcal{O}({N}\cdot{D})$	❌	uses the Nyström method to approximate the attention matrix
Linear Transformers Are Secretly Fast Weight Memory Systems (0)	fast-weight-transformers	$\mathcal{O}({N}\cdot{m}\cdot{D})$	✔️	shows that linear transformers are basically fast weight networks + proposes a new kernel function to linearise attention, balancing simplicity and effectiveness
LambdaNetworks: Modeling Long-Range Interactions Without Attention (6)	lambda-networks	$\mathcal{O}({N}^2\cdot{k}\cdot\frac{v}{h})$	✔️	generates a linear layer based on context + decouples pos/context
Random Feature Attention (2)	-	$\mathcal{O}({N}\cdot{D})$	✔️	kernel approximation; also uses the transformers-are-RNNs view
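
Several rows in the table (Efficient Attention, Transformers are RNNs, Real-time Semantic Segmentation with Fast Attention) list a main idea of the form phi(Q)(phi(K)^T V) with complexity $\mathcal{O}({N}\cdot{D}^2)$. The sketch below illustrates why reordering the matrix product avoids the N x N attention matrix. It is a minimal, non-causal NumPy illustration and not the code of any listed implementation; the function names and the choice of `elu(x) + 1` as the feature map are assumptions made for this example.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a strictly positive feature map (assumed here for illustration)
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(q, k, v):
    # Reference version: builds the full (N, N) similarity matrix, O(N^2 * D)
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T                          # (N, N)
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v                                # (N, D_v)

def linear_attention(q, k, v):
    # Reassociated version: phi(Q) (phi(K)^T V), O(N * D * D_v)
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                                  # (D, D_v), independent of N x N
    z = phi_k.sum(axis=0)                             # (D,), used for normalisation
    return (phi_q @ kv) / (phi_q @ z)[:, None]        # (N, D_v)

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 4))
k = rng.normal(size=(16, 4))
v = rng.normal(size=(16, 8))
out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions compute the same output up to floating-point error; only the order of multiplication differs, which is what changes the cost from quadratic to linear in the sequence length N.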

Owner

Sepehr Sameni
