Transformer training code for sequential tasks

Overview

Sequential Transformer

This is code for training Transformers on sequential tasks such as language modeling. Unlike the original Transformer architecture, it uses caching of previous representations and relative position embeddings to better adapt to sequential tasks. In addition, the code implements the projects described below and in the accompanying blog post.

Requirements

You need PyTorch 0.4.1 or above and a CUDA-enabled GPU to run the code. If multiple GPUs are available, the code uses nn.DataParallel to utilize them. For better efficiency, enable distributed training with the --distributed argument, which can run on multiple nodes.
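
For readers unfamiliar with single-node multi-GPU training in PyTorch, here is a minimal, generic sketch of nn.DataParallel. The model and tensors below are placeholders for illustration, not this repo's classes or training loop:

```python
import torch
import torch.nn as nn

# Placeholder model: token embedding followed by an output projection.
model = nn.Sequential(nn.Embedding(256, 512), nn.Linear(512, 256))

if torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU and scatters each batch
    # along dimension 0, gathering the outputs back on the default device.
    model = nn.DataParallel(model)
model = model.cuda()

tokens = torch.randint(0, 256, (32, 128), dtype=torch.long).cuda()  # (batch, seq)
out = model(tokens)  # (batch, seq, 256); the split across GPUs is automatic
```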

Adaptive Attention Span

This code can be used to run the experiments in the Adaptive Attention Span in Transformers paper. The adaptive span allows a model to learn an optimal context size for each self-attention head from the training data. As the span plots in the paper show, only a few heads require a long attention span, which makes it possible to increase the context size to 8k tokens without significantly increasing computation time or memory footprint.

The --adapt-span argument enables the adaptive span; otherwise the model uses a fixed attention span. The adaptive span is implemented as an nn.Module to make it easy to plug into other models.
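
For intuition, here is a minimal sketch of the soft-masking idea from the paper: each head learns a span parameter z, and attention weights beyond that span are faded out by a linear ramp of width R and then renormalized. The class, names, and shapes below are illustrative assumptions, not the repo's actual AdaptiveSpan module:

```python
import torch
import torch.nn as nn

class SoftSpanMask(nn.Module):
    """Sketch of the adaptive-span soft mask m_z(x) = clamp((R + z - x) / R, 0, 1),
    where x is the key's distance from the query, z is a learnable span per head,
    and R is the width of the soft ramp."""

    def __init__(self, n_heads, max_span, ramp_size=32):
        super().__init__()
        self.max_span = max_span
        self.ramp_size = ramp_size
        # One learnable span per head, initialized to zero (smallest span).
        self.span = nn.Parameter(torch.zeros(n_heads, 1))

    def forward(self, attn_weights):
        # attn_weights: (batch, n_heads, query_len, span_len), oldest key first.
        span_len = attn_weights.size(-1)
        # Distance of each key position from the current query: span_len-1 ... 0.
        distance = torch.arange(span_len - 1, -1, -1,
                                dtype=attn_weights.dtype,
                                device=attn_weights.device)
        z = self.span.clamp(0, 1) * self.max_span                    # (n_heads, 1)
        mask = ((self.ramp_size + z - distance) / self.ramp_size).clamp(0, 1)
        masked = attn_weights * mask.unsqueeze(0).unsqueeze(2)
        # Renormalize so the masked weights still sum to one.
        return masked / (masked.sum(dim=-1, keepdim=True) + 1e-8)

    def l1_penalty(self):
        # Added to the training loss to encourage small spans.
        return self.span.clamp(0, 1).mean()
```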

Running experiments in the paper

Scripts for running the experiments in the paper are located in the ./experiments/ directory. For example, a smaller 8-layer version of our model can be trained on a single GPU by running:

bash experiments/enwik8_small.sh

It should reach about 1.3 bpc on dev after 150k steps.

For training larger models, multiple GPUs are recommended. In the script files, you can configure the number of available GPUs. Increase the --batch-split argument if you run out of GPU memory (it splits batches into smaller pieces without changing the final result).
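
The --batch-split trick is essentially gradient accumulation: each batch is split into smaller chunks that are processed sequentially, and their gradients are summed before a single optimizer step, so the update matches the full-batch one while peak memory drops. A minimal sketch of the idea, with illustrative function and variable names rather than the repo's:

```python
import torch

def train_step(model, optimizer, loss_fn, data, target, batch_split=2):
    """One optimizer step computed over `batch_split` sequential sub-batches.

    For a mean-reduction loss and equal-sized chunks this matches the
    full-batch update (up to floating-point accumulation order), but uses
    roughly `batch_split` times less activation memory.
    """
    optimizer.zero_grad()
    total_loss = 0.0
    for sub_data, sub_target in zip(data.chunk(batch_split),
                                    target.chunk(batch_split)):
        loss = loss_fn(model(sub_data), sub_target) / batch_split
        loss.backward()  # gradients accumulate in .grad across chunks
        total_loss += loss.item()
    optimizer.step()
    return total_loss
```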

We obtained the following results in our experiments:

Experiment      #params   dev        test
enwik8          38M       1.04 bpb   1.02 bpb
enwik8_large    209M      1.00 bpb   0.98 bpb
text8           39M       1.05 bpc   1.11 bpc
text8_large     209M      1.01 bpc   1.07 bpc

Training a large model takes about 1.2 sec/batch near the end (it is faster initially because the attention spans are smaller) on 8 V100 GPUs. So, for example, the full enwik8_large training of 170k steps should take less than 2.4 days.

Pre-trained models

You can download pre-trained models by running the get_pretrained.sh script. The same scripts in ./experiments/ can then be used to evaluate those models. Since the download script puts the models in ./checkpoints/, make sure there are no files with the same names there. Note that these pre-trained models were obtained by rerunning the training scripts after the code cleanup, so their results differ slightly from those above due to the randomness of training.

All-attention Network

The code can also be used for training the All-attention Networks introduced in Augmenting Self-attention with Persistent Memory. If the --pers-mem-size argument is set to N, all feed-forward sublayers are removed from the model and N persistent memory vectors are added to every self-attention sublayer. The following experiments can be found in the ./experiments/ directory; a sketch of the persistent-memory idea follows the results table below.

Experiment            #params   dev          test
enwik8_pers_small.sh  39M       1.03 bpb     1.01 bpb
enwik8_pers.sh        114M      1.00 bpb     0.98 bpb
wiki103_pers.sh       133M      18.8 ppl *   19.7 ppl *

(*This number is slightly better than in the paper because it counts end-of-line as a token.)
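
For intuition, here is a minimal single-head sketch of the persistent-memory mechanism: N learned, input-independent key/value vectors are concatenated to the context keys and values of each self-attention sublayer, taking over the role of the removed feed-forward sublayer. The module below is an illustrative assumption (no causal masking, single head, simplified projections), not the repo's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersistentMemoryAttention(nn.Module):
    """Single-head sketch: N persistent key/value slots are prepended to the
    context keys/values before attention."""

    def __init__(self, hidden_size, pers_mem_size):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        # Learned persistent memory slots, shared across all positions.
        self.pers_key = nn.Parameter(torch.randn(pers_mem_size, hidden_size) * 0.02)
        self.pers_value = nn.Parameter(torch.randn(pers_mem_size, hidden_size) * 0.02)
        self.scale = hidden_size ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, hidden_size)
        batch = x.size(0)
        q = self.query(x)
        k = torch.cat([self.pers_key.unsqueeze(0).expand(batch, -1, -1),
                       self.key(x)], dim=1)          # (batch, N + seq_len, hidden)
        v = torch.cat([self.pers_value.unsqueeze(0).expand(batch, -1, -1),
                       self.value(x)], dim=1)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                              # (batch, seq_len, hidden)
```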

License

The code is licensed under the CC-BY-NC license. See the LICENSE file for more details.

Acknowledgement

We thank Xavier Martinet for helping clean up the code. The data preprocessing scripts were downloaded from the awd-lstm and transformer-XL repos. adagrad_with_grad_clip.py is mostly adapted from PyTorch.

Owner
Meta Research