PyTorch implementation of Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Last update: Jul 27, 2022

Overview

ALiBi

Quickstart

Clone this repository.

git clone https://github.com/jaketae/alibi.git

Navigate to the cloned directory. You can then use the bare-bones ALiBi decoder as follows:

>>> import torch
>>> from alibi import ALiBiConfig, ALiBiTransformer
>>> config = ALiBiConfig()
>>> model = ALiBiTransformer(config)
>>> x = torch.randn(8, 100, 256)
>>> model(x).shape
torch.Size([8, 100, 256])

By default, the model comes with the following parameters:

ALiBiConfig(
    num_layers=6, 
    d_model=256, 
    num_heads=8, 
    max_len=256, 
    dropout=0.1, 
    causal=True, 
    expansion_factor=1
)

To use an encoder instead of a decoder, set causal=False in the config.

Abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
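The bias described in the abstract can be illustrated with a small dependency-free sketch. This is not the code used in this repository: the function names `alibi_slopes` and `alibi_bias` are hypothetical, the slope schedule follows the geometric sequence given in the paper for a power-of-two number of heads, and the causal mask is folded into the bias as negative infinity purely for illustration.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two, as in the paper's main setup.
    For 8 heads this yields 1/2, 1/4, ..., 1/256.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j], the value added to the score of query i and key j.

    Past and current positions (j <= i) get a penalty proportional to
    the distance i - j; future positions get -inf (causal masking).
    """
    slopes = alibi_slopes(num_heads)
    return [
        [
            [-m * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in slopes
    ]


bias = alibi_bias(num_heads=8, seq_len=4)
# Head 0 has slope 1/2, so a key two positions back is penalized by 1.0.
print(bias[0][2][0])
```

Because the penalty depends only on the distance between query and key, the same formula applies to any sequence length, which is what lets a model trained on short inputs be evaluated on longer ones.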

Citation

@misc{press2021train,
	title        = {Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
	author       = {Ofir Press and Noah A. Smith and Mike Lewis},
	year         = 2021,
	eprint       = {2108.12409},
	archiveprefix = {arXiv},
	primaryclass = {cs.CL}
}

Owner

Jake Tae
