FLASH - Pytorch

Implementation of the Transformer variant proposed in the paper Transformer Quality in Linear Time

Install

$ pip install FLASH-pytorch

Usage

The main novel circuit in this paper is the "Gated Attention Unit", which they claim can replace multi-headed attention while reducing it to just one head.

It uses a relu squared activation in place of the softmax. This activation was first seen in the Primer paper, and the use of plain ReLU for attention in the ReLA Transformer. The gating style seems mostly inspired by gMLPs.
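
For intuition, here is roughly what the relu squared attention looks like; a minimal sketch (names and scaling are illustrative, not the repo's exact code):

import torch
import torch.nn.functional as F

q, k = torch.randn(1, 1024, 128), torch.randn(1, 1024, 128)
sim = torch.einsum('b i d, b j d -> b i j', q, k)
attn = F.relu(sim) ** 2 / sim.shape[-1]  # relu squared in place of softmax, normalized by sequence length

The GAU module itself is used like so: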

import torch
from flash_pytorch import GAU

gau = GAU(
    dim = 512,
    query_key_dim = 128,     # query / key dimension
    causal = True,           # autoregressive or not
    expansion_factor = 2,    # hidden dimension = dim * expansion_factor
)

x = torch.randn(1, 1024, 512)
out = gau(x) # (1, 1024, 512)

The authors then combine the GAU with the linear attention of Katharopoulos et al., partitioning the sequence into groups to overcome a well-known issue with autoregressive linear attention.

This combination of the quadratic gated attention unit with grouped linear attention is what they named FLASH.

You can use it just as easily:

import torch
from flash_pytorch import FLASH

flash = FLASH(
    dim = 512,
    group_size = 256,             # group size
    causal = True,                # autoregressive or not
    query_key_dim = 128,          # query / key dimension
    expansion_factor = 2.         # hidden dimension = dim * expansion_factor
)

x = torch.randn(1, 1111, 512)     # sequence will be auto-padded to nearest group size
out = flash(x) # (1, 1111, 512)
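
Under the hood, the causal variant of the grouped linear attention summarizes each group's keys and values into a small matrix, then lets each group attend to the running sum of all previous groups' summaries; intra-group interactions are left to the quadratic attention. A minimal sketch, with illustrative normalization:

import torch
import torch.nn.functional as F
from einops import rearrange

def grouped_causal_linear_attn(q, k, v, group_size):
    # q, k: (batch, seq, key_dim), v: (batch, seq, value_dim); seq assumed divisible by group_size
    q, k, v = map(lambda t: rearrange(t, 'b (g n) d -> b g n d', n = group_size), (q, k, v))
    kv = torch.einsum('b g n k, b g n e -> b g k e', k, v) / group_size  # per-group key-value summary
    kv = kv.cumsum(dim = 1)              # running sum over groups
    kv = F.pad(kv, (0, 0, 0, 0, 1, -1))  # shift so each group only sees prior groups
    out = torch.einsum('b g n k, b g k e -> b g n e', q, kv)
    return rearrange(out, 'b g n d -> b (g n) d')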

Finally, you can use the full FLASH transformer as described in the paper. It contains all the positional embeddings mentioned there: the absolute positional embedding uses scaled sinusoidal, the GAU quadratic attention gets a one-headed T5 relative positional bias, and on top of all this both the GAU attention and the linear attention are rotary embedded (RoPE).
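
As a refresher, rotary embeddings rotate pairs of query/key channels by position-dependent angles, so that attention scores depend only on relative positions. A minimal sketch of the standard rotate-half formulation (illustrative; the repo uses its own implementation):

import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim = -1)
    return torch.cat((-x2, x1), dim = -1)

def apply_rope(x, base = 10000):
    # x: (batch, seq, dim) queries or keys, with dim assumed even
    seq, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1. / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = torch.outer(torch.arange(seq).float(), inv_freq)  # (seq, dim / 2)
    emb = torch.cat((freqs, freqs), dim = -1)                 # (seq, dim)
    return x * emb.cos() + rotate_half(x) * emb.sin()

The full transformer is used like so: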

import torch
from flash_pytorch import FLASHTransformer

model = FLASHTransformer(
    num_tokens = 20000,          # number of tokens
    dim = 512,                   # model dimension
    depth = 12,                  # depth
    causal = True,               # autoregressive or not
    group_size = 256,            # size of the groups
    query_key_dim = 128,         # dimension of queries / keys
    expansion_factor = 2.,       # hidden dimension = dim * expansion_factor
    norm_type = 'scalenorm',     # in the paper, they claimed scalenorm led to faster training at no performance hit. the other option is 'layernorm', which is also the default
    shift_tokens = True          # discovered by an independent researcher in Shenzhen @BlinkDL, this simply shifts half of the feature space forward one step along the sequence dimension - further improved convergence in my local experiments
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x) # (1, 1024, 20000)
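
The shift_tokens flag corresponds to a very cheap operation: half the feature dimension is shifted forward one position along the sequence before the projections. A minimal sketch mirroring the relevant lines in the repo:

import torch
import torch.nn.functional as F

x = torch.randn(1, 1024, 512)
x_shift, x_pass = x.chunk(2, dim = -1)               # split the feature dimension in half
x_shift = F.pad(x_shift, (0, 0, 1, -1), value = 0.)  # shift forward one step along the sequence
x = torch.cat((x_shift, x_pass), dim = -1)           # each position now carries half of the previous position's features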

Test on Autoregressive Enwik8

$ python train.py

Citations

@article{Hua2022TransformerQI,
    title   = {Transformer Quality in Linear Time},
    author  = {Weizhe Hua and Zihang Dai and Hanxiao Liu and Quoc V. Le},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2202.10447}
}
@software{peng_bo_2021_5196578,
    author    = {PENG Bo},
    title     = {BlinkDL/RWKV-LM: 0.01},
    month     = {aug},
    year      = {2021},
    publisher = {Zenodo},
    version   = {0.01},
    doi       = {10.5281/zenodo.5196578},
    url       = {https://doi.org/10.5281/zenodo.5196578}
}
Comments
  • einsum operation in Linear Attention Part

    Hi, thanks a lot for FLASH-pytorch, which helps a lot. I found that there are some differences from the paper in the linear attention part: https://github.com/lucidrains/FLASH-pytorch/blob/main/flash_pytorch/flash_pytorch.py#L342-L343

    lin_kv = einsum('b g n d, b g n e -> b d e', lin_k, v) / n
    lin_out = einsum('b g n d, b d e -> b g n e', lin_q, lin_kv)
    

    Here lin_kv is three-dimensional (bde), while the code in the paper is:

    lin_kv = tf.einsum('bhke,bgh->bgke', lin_kv, mask)
    linear = tf.einsum('bgnk,bgke->bgne', lin_q, lin_kv)
    

    There lin_kv is four-dimensional (bgke). It seems that the two ways are not equivalent.

    Looking forward to your reply. Best,

    opened by ShomyLiu 5
  • mask error

    x = torch.randint(0, 20000, (1, 1024))
    mask = x.ne(0)
    logits = model(x, mask=mask)
    

    RuntimeError: The size of tensor a (1024) must match the size of tensor b (128) at non-singleton dimension 2

    opened by keyunluo 1
  • Speed on TPU

    Hi, thanks for the code! I tested it on a Google TPU v3, and the training speed seems slower than I expected. Maybe there is some operation that is not lowered efficiently on TPU.

    opened by magicknight 0
  • About the "shift_tokens"

    Thank you for your amazing code.

    In the FLASH class, I found a flag shift_tokens, and the corresponding code is as follows:

        if self.shift_tokens:
            x_shift, x_pass = normed_x.chunk(2, dim = -1)
            x_shift = F.pad(x_shift, (0, 0, 1, -1), value = 0.)
            normed_x = torch.cat((x_shift, x_pass), dim = -1)

    Assume normed_x has shape [1024, 512]; then x_shift and x_pass each have shape [1024, 256]. The pad prepends a row of zeros to x_shift and removes its last row, and then x_shift and x_pass are concatenated to form the new normed_x.

    In my opinion, the F.pad operation makes the rows of x_shift and x_pass no longer match up.

    May I know why it works?

    Kang

    opened by kangzhao2 1
  • Cross-Attention?

    Hi, @lucidrains. Thank you for sharing this excellent implementation with us all! Do you have any thoughts as to what changes would need to be made to make cross-attention possible with your FLASH model?

    opened by amorehead 2