Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Last update: Jan 05, 2023

Overview

Memory Efficient Attention Pytorch

Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(n²) Memory. In addition, the module will take care of masking, causal masking, as well as cross attention.

Install

$ pip install memory-efficient-attention-pytorch

Usage

For autoregressive language model

import torch
from memory_efficient_attention_pytorch import Attention

attn = Attention(
    dim = 512,
    dim_head = 64,                # dimension per head
    heads = 8,                    # number of attention heads
    causal = True,                # autoregressive or not
    memory_efficient = True,      # whether to use memory efficient attention (can be turned off to test against normal attention)
    q_bucket_size = 1024,         # bucket size along queries dimension
    k_bucket_size = 2048          # bucket size along key / values dimension
).cuda()

x = torch.randn(1, 65536, 512).cuda()
out = attn(x) # (1, 65536, 512)

Cross attention

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
mask = torch.ones(1, 65536).bool().cuda()

out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512)

benchmark and see how much torch jit helps
look at Triton and Keops and see if either can be a fit

Citations

@misc{rabe2021selfattention,
    title   = {Self-attention Does Not Need $O(n^2)$ Memory}, 
    author  = {Markus N. Rabe and Charles Staats},
    year    = {2021},
    eprint  = {2112.05682},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{liu2021swin,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    year    = {2021},
    eprint  = {2111.09883},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

[feature request] Combining with flash attention?

There is a new algorithm to optimize the qkv attention, https://github.com/HazyResearch/flash-attention https://arxiv.org/abs/2205.14135 It optimises the qkv attention part. Maybe you can look into integrating it with this.

opened by Vbansal21 15
i did this, we could build on top

Hi there!

It seems I did already some of the code... https://github.com/CHARM-Tx/linear_mem_attention_pytorch could we build on top of this? I talked to https://github.com/Chillee about an experimental functionality from functorch: https://github.com/pytorch/functorch that would allow for increased speed (mainly i want to match jax perofmance but its just difficult w/ pytorch imperative style).

I would love to collaborate on this if you want!

opened by hypnopump 5
Added dropout support to memory efficient variant

Hey Phil,

I have been using this repository for a project and I wanted to add dropout for completeness. I checked consistency with perceiver-ar impl.. I hope this is helpful.

-Matt

opened by usryokousha 2
Making this work with relative position bias from XTransformers

Is there a way to make this work with RelativePositionBias. Currently this produces an attention bias of size $BHN^2$ where B is batch size, H is number of heads and N is input size. Can this be chunked and computed per chunk?

opened by pfeatherstone 5
save_for_backward can only save variables, but argument 5 is of type bool

Hi,

Thank you for your indescribable work. I was trying to test your method specifically for cross-attention but It seems I get the error " save_for_backward can only save variables, but argument 5 is of type bool". I am not sure what I am doing wrong. I tried your own examples too but get the same error.

Can you please help me out?

Code:

import torch from memory_efficient_attention_pytorch import Attention

cross_attn = Attention( dim = 512, dim_head = 64, heads = 8, memory_efficient = True, q_bucket_size = 1024, k_bucket_size = 2048 ).cuda() (# out = sm_mod(inp1)) did this to avoid being a header x = torch.randn(1, 65536, 512).cuda() context = torch.randn(1, 65536, 512).cuda() (# mask = torch.ones(1, 65536).bool().cuda()) did this to avoid being a heading out = cross_attn(x

ERROR:

File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/main.py", line 45, in cli.main() File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main run() File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file runpy.run_path(target_as_str, run_name=compat.force_str("main")) File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 265, in run_path return _run_module_code(code, init_globals, run_name, File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 97, in _run_module_code _run_code(code, mod_globals, init_globals, File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/data/stars/user/abali/Phd_work/ISBI2023/X3D-Multigrid/CrossAttn_X3d_v2.py", line 872, in out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512) print(out) File "/home/abali/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 215, in forward out = attn_fn(q, k, v, mask = mask, attn_bias = attn_bias, causal = self.causal, q_bucket_size = q_bucket_size, k_bucket_size = k_bucket_size) File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 127, in memory_efficient_attention exp_weight_chunk, weighted_value_chunk, weight_max_chunk = summarize_qkv_fn( File "/home/abali/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint return CheckpointFunction.apply(function, preserve, *args) TypeError: save_for_backward can only save variables, but argument 5 is of type bool

opened by aliabid2243 1
Checkpointing is not compatible with .grad() or when an `inputs` parameter is passed to .backward()

https://github.com/lucidrains/memory-efficient-attention-pytorch/blob/35559a05572f9d4eb982a8e2e399b40a2d61b85c/memory_efficient_attention_pytorch/memory_efficient_attention.py#L95

Should this be: summarize_qkv_fn = summarize_qkv_chunk if needs_backwards else checkpointed_summarize_qkv_chunk instead of: summarize_qkv_fn = checkpointed_summarize_qkv_chunk if needs_backwards else summarize_qkv_chunk

opened by vrobot 0

Releases(0.1.1)

0.1.1(Dec 30, 2022)

null
Source code(tar.gz)
Source code(zip)
0.1.0(Dec 30, 2022)

Source code(tar.gz)
Source code(zip)
0.0.27(Nov 1, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.26(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.25(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.24(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.23(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.22(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.21(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.20(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.19(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.18(Jul 23, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.17(Mar 22, 2022)

Source code(tar.gz)
Source code(zip)
0.0.16(Mar 21, 2022)

Source code(tar.gz)
Source code(zip)
0.0.15(Mar 13, 2022)

Source code(tar.gz)
Source code(zip)
0.0.14(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.12(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.11(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.10(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.9(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.8(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.7(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.6(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.5(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.4(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.2(Mar 4, 2022)

Source code(tar.gz)
Source code(zip)
0.0.1(Mar 3, 2022)

Source code(tar.gz)
Source code(zip)

Owner

Phil Wang

Working with Attention. It's all we need

GitHub Repository

PRIN/SPRIN: On Extracting Point-wise Rotation Invariant Features

PRIN/SPRIN: On Extracting Point-wise Rotation Invariant Features Overview This repository is the Pytorch implementation of PRIN/SPRIN: On Extracting P

17 Mar 02, 2022

DeepFill v1/v2 with Contextual Attention and Gated Convolution, CVPR 2018, and ICCV 2019 Oral

Generative Image Inpainting An open source framework for generative image inpainting task, with the support of Contextual Attention (CVPR 2018) and Ga

2.9k Dec 16, 2022

TensorFlow implementation of AlexNet and its training and testing on ImageNet ILSVRC 2012 dataset

AlexNet training on ImageNet LSVRC 2012 This repository contains an implementation of AlexNet convolutional neural network and its training and testin

161 Nov 25, 2022

Background Matting: The World is Your Green Screen

Background Matting: The World is Your Green Screen By Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, and Ira Kemelmacher-Shlizerman Th

4.6k Jan 04, 2023

Fuse radar and camera for detection

SAF-FCOS: Spatial Attention Fusion for Obstacle Detection using MmWave Radar and Vision Sensor This project hosts the code for implementing the SAF-FC

18 Jan 01, 2023

IJCAI2020 & IJCV 2020 :city_sunrise: Unsupervised Scene Adaptation with Memory Regularization in vivo

Seg_Uncertainty In this repo, we provide the code for the two papers, i.e., MRNet：Unsupervised Scene Adaptation with Memory Regularization in vivo, IJ

348 Jan 05, 2023

Official PyTorch implementation of "Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets" (ICLR 2021)

Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets This is the official PyTorch implementation for the paper Rapid Neural A

48 Dec 26, 2022

Easy to use and customizable SOTA Semantic Segmentation models with abundant datasets in PyTorch

Semantic Segmentation Easy to use and customizable SOTA Semantic Segmentation models with abundant datasets in PyTorch Features Applicable to followin

530 Jan 05, 2023

GitHub repository for "Improving Video Generation for Multi-functional Applications"

Improving Video Generation for Multi-functional Applications GitHub repository for "Improving Video Generation for Multi-functional Applications" Pape

328 Dec 07, 2022

Python suite to construct benchmark machine learning datasets from the MIMIC-III clinical database.

MIMIC-III Benchmarks Python suite to construct benchmark machine learning datasets from the MIMIC-III clinical database. Currently, the benchmark data

6 Jan 02, 2023

EigenGAN Tensorflow, EigenGAN: Layer-Wise Eigen-Learning for GANs

Gender Bangs Body Side Pose (Yaw) Lighting Smile Face Shape Lipstick Color Painting Style Pose (Yaw) Pose (Pitch) Zoom & Rotate Flush & Eye Color Mout

321 Dec 01, 2022

"Graph Neural Controlled Differential Equations for Traffic Forecasting", AAAI 2022

Graph Neural Controlled Differential Equations for Traffic Forecasting Setup Python environment for STG-NCDE Install python environment $ conda env cr

55 Dec 28, 2022

a Pytorch easy re-implement of "YOLOX: Exceeding YOLO Series in 2021"

A pytorch easy re-implement of "YOLOX: Exceeding YOLO Series in 2021" 1. Notes This is a pytorch easy re-implement of "YOLOX: Exceeding YOLO Series in

91 Dec 26, 2022

Pytorch Lightning code guideline for conferences

Deep learning project seed Use this seed to start new deep learning / ML projects. Built in setup.py Built in requirements Examples with MNIST Badges

1k Jan 02, 2023

Automatic detection and classification of Covid severity degree in LUS (lung ultrasound) scans

Final-Project Final project in the Technion, Biomedical faculty, by Mor Ventura, Dekel Brav & Omri Magen. Subproject 1: Automatic Detection of LUS Cha

1 Dec 18, 2021

GitHub repository for the ICLR Computational Geometry & Topology Challenge 2021

ICLR Computational Geometry & Topology Challenge 2022 Welcome to the ICLR 2022 Computational Geometry & Topology challenge 2022 --- by the ICLR 2022 W

42 Dec 13, 2022

A generalist algorithm for cell and nucleus segmentation.

Cellpose | A generalist algorithm for cell and nucleus segmentation. Cellpose was written by Carsen Stringer and Marius Pachitariu. To learn about Cel

733 Dec 29, 2022

Source code for "OmniPhotos: Casual 360° VR Photography"

OmniPhotos: Casual 360° VR Photography Project Page | Video | Paper | Demo | Data This repository contains the source code for creating and viewing Om

144 Dec 30, 2022

This project is based on our SIGGRAPH 2021 paper, ROSEFusion: Random Optimization for Online DenSE Reconstruction under Fast Camera Motion .

ROSEFusion 🌹 This project is based on our SIGGRAPH 2021 paper, ROSEFusion: Random Optimization for Online DenSE Reconstruction under Fast Camera Moti

219 Dec 27, 2022

Meta Learning for Semi-Supervised Few-Shot Classification

few-shot-ssl-public Code for paper Meta-Learning for Semi-Supervised Few-Shot Classification. [arxiv] Dependencies cv2 numpy pandas python 2.7 / 3.5+

501 Jan 08, 2023

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Related tags

Overview

Memory Efficient Attention Pytorch

Install

Usage

Citations

Comments

[feature request] Combining with flash attention?

i did this, we could build on top

Added dropout support to memory efficient variant

Making this work with relative position bias from XTransformers

save_for_backward can only save variables, but argument 5 is of type bool

Code:

ERROR:

Checkpointing is not compatible with .grad() or when an `inputs` parameter is passed to .backward()

Releases(0.1.1)

0.1.1(Dec 30, 2022)

0.1.0(Dec 30, 2022)

0.0.27(Nov 1, 2022)

0.0.26(Jul 23, 2022)

0.0.25(Jul 23, 2022)

0.0.24(Jul 23, 2022)

0.0.23(Jul 23, 2022)

0.0.22(Jul 23, 2022)

0.0.21(Jul 23, 2022)

0.0.20(Jul 23, 2022)

0.0.19(Jul 23, 2022)

0.0.18(Jul 23, 2022)

0.0.17(Mar 22, 2022)

0.0.16(Mar 21, 2022)

0.0.15(Mar 13, 2022)

0.0.14(Mar 4, 2022)

0.0.12(Mar 4, 2022)

0.0.11(Mar 4, 2022)

0.0.10(Mar 4, 2022)

0.0.9(Mar 4, 2022)

0.0.8(Mar 4, 2022)

0.0.7(Mar 4, 2022)

0.0.6(Mar 4, 2022)

0.0.5(Mar 4, 2022)

0.0.4(Mar 4, 2022)

0.0.2(Mar 4, 2022)

0.0.1(Mar 3, 2022)