Implementation of a Transformer using ReLA (Rectified Linear Attention)

Last update: Oct 14, 2022

Overview

ReLA (Rectified Linear Attention) Transformer

Implementation of a Transformer using ReLA (Rectified Linear Attention). It will also contain an attempt to fold the feedforward into the ReLA layer as memory key / values, as proposed in All Attention, a suggestion made by Charles Foster.
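
As a rough illustration of the idea, the sketch below replaces softmax with ReLU in single-head attention. It is not the code from this package: the function name `rela_attention`, its arguments, and the plain RMS rescaling of the output (with no learned gain or gate) are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def rela_attention(q, k, v, causal=True, eps=1e-8):
    """Rectified linear attention for one head (illustrative sketch).

    q, k, v: tensors of shape (batch, seq_len, dim_head).
    Returns the normalized output and the attention weights.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bid,bjd->bij", q, k) * scale

    # ReLU replaces softmax: negative scores become exactly zero,
    # so the weights are sparse and rows need not sum to 1.
    attn = F.relu(scores)

    if causal:
        n = scores.shape[-1]
        future = torch.ones(n, n, dtype=torch.bool, device=q.device).triu(1)
        # Without softmax, masking means zeroing rather than filling with -inf.
        attn = attn.masked_fill(future, 0.0)

    out = torch.einsum("bij,bjd->bid", attn, v)

    # Assumption: rescale each position to unit RMS, since unnormalized
    # weights let the output magnitude vary from position to position.
    rms = out.pow(2).mean(dim=-1, keepdim=True).sqrt()
    out = out / rms.clamp(min=eps)
    return out, attn

torch.manual_seed(0)
q, k, v = (torch.randn(1, 6, 64) for _ in range(3))
out, attn = rela_attention(q, k, v)
print(out.shape)                    # torch.Size([1, 6, 64])
print((attn == 0).float().mean())   # fraction of exactly-zero weights
```

Because the weights are not forced to sum to 1, a query can attend to nothing at all, which is where the sparsity comes from.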

Install

$ pip install rela-transformer

Usage

import torch
from rela_transformer.rela_transformer import ReLATransformer

model = ReLATransformer(
    num_tokens = 20000,
    dim = 512,
    depth = 8,
    max_seq_len = 1024,
    dim_head = 64,
    heads = 8,
    causal = True
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x) # (1, 1024, 20000)
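
For training a causal model, logits of this shape are typically scored against the input shifted by one position. The sketch below shows only that step and uses a random tensor as a stand-in for `model(x)`, so it runs without the package installed; the smaller sizes are arbitrary and are not taken from the repository's training script.

```python
import torch
import torch.nn.functional as F

num_tokens, seq_len = 256, 16   # small stand-in sizes, chosen arbitrarily

x = torch.randint(0, num_tokens, (1, seq_len))
logits = torch.randn(1, seq_len, num_tokens)   # stand-in for model(x)

# Position t predicts token t + 1, so drop the last logit and the first target.
pred = logits[:, :-1]    # (1, seq_len - 1, num_tokens)
target = x[:, 1:]        # (1, seq_len - 1)

# cross_entropy expects the class dimension second: (batch, classes, positions).
loss = F.cross_entropy(pred.transpose(1, 2), target)
print(loss.item())
```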

Enwik8

$ python train.py

Citations

@misc{zhang2021sparse,
    title   = {Sparse Attention with Linear Units},
    author  = {Biao Zhang and Ivan Titov and Rico Sennrich},
    year    = {2021},
    eprint  = {2104.07012},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}


Comments

LayerNorm/GatedRMS inconsistency
Hi! Looking through the pipeline, there seem to be some inconsistencies with normalisation:

# ReLA: input to GRMSNorm
# attention output: Linear(inner_dim, dim) + GRMSNorm
# next, in the FF module: input to LayerNorm

This gives a double norm, since the last layer in attention is GRMSNorm and the first layer in FF is LayerNorm.

Looking at the paper, it seems that in ReLA GRMSNorm is applied to the result of mult(attn, v) before the output projection, not after the projection as in this code. I am also confused about the use of LayerNorm in FF: should it be GRMSNorm instead? That is not clear from the paper either.
opened by inspirit 6
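
The two orderings discussed in this comment can be compared side by side. The sketch below is not this package's code: `GatedRMSNorm` uses one assumed formulation (RMS normalization with a learned gain, multiplied by a sigmoid gate on the input), and `AttentionOutput`, its `norm_before_proj` flag, and the sizes are hypothetical names and values introduced for illustration.

```python
import torch
from torch import nn

class GatedRMSNorm(nn.Module):
    """RMS normalization with a learned sigmoid gate (assumed formulation)."""
    def __init__(self, dim, eps=1e-8):
        super().__init__()
        self.eps = eps
        self.g = nn.Parameter(torch.ones(dim))   # gain
        self.w = nn.Parameter(torch.ones(dim))   # gate weight

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()
        normed = x / rms.clamp(min=self.eps) * self.g
        return normed * torch.sigmoid(x * self.w)

class AttentionOutput(nn.Module):
    """Maps aggregated values (..., inner_dim) back to (..., dim)."""
    def __init__(self, inner_dim, dim, norm_before_proj=True):
        super().__init__()
        self.norm_before_proj = norm_before_proj
        self.proj = nn.Linear(inner_dim, dim)
        # The norm acts on inner_dim features before the projection,
        # or on dim features after it.
        self.norm = GatedRMSNorm(inner_dim if norm_before_proj else dim)

    def forward(self, attn_times_v):
        if self.norm_before_proj:
            return self.proj(self.norm(attn_times_v))
        return self.norm(self.proj(attn_times_v))

inner_dim, dim = 256, 512                      # hypothetical sizes
attn_times_v = torch.randn(1, 10, inner_dim)   # stand-in for mult(attn, v)

norm_first = AttentionOutput(inner_dim, dim, norm_before_proj=True)
proj_first = AttentionOutput(inner_dim, dim, norm_before_proj=False)
print(norm_first(attn_times_v).shape, proj_first(attn_times_v).shape)
```

With `norm_before_proj=True` the block ends in a linear layer, so a LayerNorm at the start of the following feedforward is no longer stacked directly on top of another norm.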

Releases (0.0.7)

0.0.7 (Apr 6, 2022)
0.0.6 (Feb 22, 2022)
0.0.5 (Jan 13, 2022)
0.0.4 (Jan 11, 2022)
0.0.3 (Jan 10, 2022)
0.0.2a (Jan 10, 2022)
0.0.2 (Jan 10, 2022)
0.0.1 (Jan 10, 2022)

Owner

Phil Wang

Working with Attention. It's all we need
